Engineering Lattice-based Cryptography, Sujoy Sinha Roy

Observation
During NTT loop m = 2 (each memory word holds a coefficient pair: {A[0], A[1]}, {A[2], A[3]}, {A[4], A[5]}, …; the butterfly unit uses the twiddle factor ω):
1. Read {A[0], A[1]} [1 cycle]
2. Butterfly(A[0], A[1]) [1 cycle]
3. Write {A[0], A[1]} [1 cycle]

Observation
The problem happens next, when m = 4: the butterflies are Butterfly(A[0], A[2]), Butterfly(A[4], A[6]), …, so the two operands of one butterfly (A[0] and A[2]) sit in different memory words ({A[0], A[1]} and {A[2], A[3]}).

Solution
Process 4 coefficients together:
1. Read the two words {A[0], A[1]} and {A[2], A[3]} (2R)
2. Compute the two butterflies on (A[0], A[2]) and (A[1], A[3]) (2C)
3. Write back the two words {A[0], A[2]} and {A[1], A[3]} (2W)
Total: 2R + 2C + 2W
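The effect of processing four coefficients per visit can be illustrated with a small simulation. The following is a minimal sketch, not the hardware design from the slides: it assumes a toy modulus Q = 17, length N = 8 and root OMEGA = 2, keeps a fixed layout in which word i stores (A[2i], A[2i+1]) instead of reordering the words on write-back, and uses a hypothetical `WordMemory` class to count word reads and writes.

```python
# Toy parameters (assumptions, not from the slides): 2 has order 8 modulo 17.
Q, N, OMEGA = 17, 8, 2

class WordMemory:
    """Memory whose word i stores the coefficient pair (A[2i], A[2i+1])."""
    def __init__(self, coeffs):
        self.words = [list(coeffs[i:i + 2]) for i in range(0, len(coeffs), 2)]
        self.accesses = 0          # counts word reads plus word writes

    def read(self, index):
        self.accesses += 1
        return list(self.words[index])

    def write(self, index, word):
        self.accesses += 1
        self.words[index] = list(word)

    def coeffs(self):
        return [c for word in self.words for c in word]

def butterfly(u, v, w):
    """Cooley-Tukey butterfly: (u + w*v, u - w*v) mod Q."""
    t = (w * v) % Q
    return (u + t) % Q, (u - t) % Q

def ntt_in_place(mem, paired):
    """Run the butterfly loops m = 2, 4, ..., N on bit-reversed input."""
    m = 2
    while m <= N:
        d = m // 2                               # distance between operands
        wm = pow(OMEGA, N // m, Q)
        step = 2 if (paired and d > 1) else 1    # butterflies per visit
        for k in range(0, N, m):
            w = 1
            for j in range(0, d, step):
                lo, hi = (k + j) // 2, (k + j + d) // 2
                if d == 1:                       # both operands in one word
                    word = mem.read(lo)
                    word[0], word[1] = butterfly(word[0], word[1], w)
                    mem.write(lo, word)
                else:
                    a, b = mem.read(lo), mem.read(hi)        # 2R
                    slot = (k + j) % 2
                    a[slot], b[slot] = butterfly(a[slot], b[slot], w)
                    if paired:                   # second butterfly, same words
                        a[1], b[1] = butterfly(a[1], b[1], w * wm % Q)
                    mem.write(lo, a)                         # 2W
                    mem.write(hi, b)
                w = w * pow(wm, step, Q) % Q
        m *= 2
```

With `paired=False` every butterfly in the loops m ≥ 4 costs two word reads and two word writes while using only half of each word; with `paired=True` the same four accesses serve two butterflies, which halves the memory traffic of those loops.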

Results
Before: O(n) + O(n) + O(n log n) + O(n)
Optimization 1, zero-cost prescaling. Now: O(n) + O(n) + O(n log n) + O(n)
Optimization 2, memory access reduction. Now: O(n) + O(n) + ½ O(n log n) + O(n)

Architecture of NTT-based polynomial multiplier [figure]

Lattice-based public-key encryption: instruction-set processor
Instruction set:
1. LOAD
2. ENCODE-LOAD
3. GAUSSIAN-LOAD
4. FFT
5. INV-FFT
6. ADD
7. CMULT
8. REARRANGE
9. READ
Throughput: 50,000 encryptions/sec, 100,000 decryptions/sec
Area: 1349 LUT, 860 FF, 1 DSPMULT, 2 BRAM18 (polynomial degree 256)
Publication: CHES 2014

Instruction-set ring-LWE cryptoprocessor
CHES 2014
• Throughput: 50,000 encryptions/sec, 100,000 decryptions/sec
• Area: 1.3K LUT, 860 FF, 1 DSPMULT, 2 BRAM18
CHES 2012
• Throughput: 40,000 encryptions/sec, 80,000 decryptions/sec (each marked "<" on the slide)
• Area: 18349 LUT, 5644 FF
[Figure label: ECC]

Ring-LWE encryption: followup works
Software implementations
• On 32-bit ARM: R. de Clercq, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "Efficient software implementation of ring-LWE encryption", DATE 2015
  Encryption: 121,166 cycles
  Decryption: 43,324 cycles
  Orders of magnitude faster than ECC
• On 8-bit AVR: Z. Liu, H. Seo, S. Sinha Roy, J. Großschädl, H. Kim, I. Verbauwhede, "Efficient Ring-LWE Encryption on 8-Bit AVR Processors", CHES 2015
  Encryption: 671,628 cycles
  Decryption: 275,646 cycles

Ring-LWE encryption: followup works
Side channel security: masking scheme
• O. Reparaz, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "A masked ring-LWE implementation", CHES 2015

Hardware accelerators for Homomorphic Computation

Homomorphic computation
Interesting applications:
• Machine learning on encrypted data
• Prediction from consumption data in smart electricity meters
• Health-care applications
• Encrypted web-search engine

Lattice-based Homomorphic encryption
1. Encrypt: data is encrypted under the public keys p_0, p_1 into the ciphertext (ct_0, ct_1)
2. Decrypt: the private key s recovers data from (ct_0, ct_1)

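Lattice-based Homomorphic encryption
1. Encrypt locally: data is encrypted under the public keys p_0, p_1 into the ciphertext (ct_0, ct_1)
2. Process on cloud: (ct_0, ct_1) is transformed into (ct_0*, ct_1*); this step is evaluated many times
3. Decrypt locally: the private key s recovers data* from (ct_0*, ct_1*)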

Homomorphic Multiplication
Inputs: (ct_{A,0}, ct_{A,1}) and (ct_{B,0}, ct_{B,1}); output: (ct_{C,0}, ct_{C,1})
Uses multiple computation blocks:
• Lift
• Polynomial multiplication
• Scale

Homomorphic Multiplication. How complex?
Inputs: (ct_{A,0}, ct_{A,1}) and (ct_{B,0}, ct_{B,1}); output: (ct_{C,0}, ct_{C,1})
Two challenges:
• Coefficient size
• Polynomial length
Low complexity applications:
• Polynomials have 4,000 coeffs.
• Coeffs are ~180 bit wide
Medium complexity applications:
• Polynomials have 32,000 coeffs.
• Coeffs are ~1200 bit wide
Lattice-based key exchange schemes:
• Polynomials with 256 or 512 coeffs
• Coeff size ~10 bits

Hardware accelerators for Homomorphic Computation
→ Arithmetic of large coefficients

Application of Residue Number System
• We need to compute arithmetic modulo q
• Let q = ∏ q_i where the q_i are coprime
• Then we can work with the Residue Number System (RNS)
[Figure: Arithmetic mod q is split into RNS arithmetic (Arithmetic mod q_0, Arithmetic mod q_1, …, Arithmetic mod q_L); the Chinese Remainder Theorem (CRT) recombines the Result mod q]
• Small coefficients
• Parallel computation

Example: polynomial multiplication
Let q = q_0 ∙ q_1 where q_0 and q_1 are of equal bit-length
Input: a(x), b(x) mod q
Overhead 1: Splitting into residues
Compute a(x) * b(x) mod q_0 and a(x) * b(x) mod q_1
Advantages:
• Parallel multiplications
• Smaller ALU width due to smaller coefficient size
Overhead 2: Chinese Remainder Theorem, reconstruction from residues
Output: a(x) * b(x) mod q
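The flow of this example can be sketched in a few lines. This is a minimal illustration under assumptions: the toy moduli q_0 = 97 and q_1 = 101 and the helper names `poly_mul_mod`, `crt_pair` and `rns_poly_mul` are chosen here and do not come from the slides, and the two residue products run one after the other rather than in parallel.

```python
# Toy coprime moduli of equal bit-length (assumption; real parameters are far larger).
Q0, Q1 = 97, 101
Q = Q0 * Q1

def poly_mul_mod(a, b, modulus):
    """Schoolbook product of two coefficient lists, reduced mod `modulus`."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % modulus
    return out

def crt_pair(r0, r1):
    """Recombine a residue mod Q0 and a residue mod Q1 into the value mod Q."""
    inv = pow(Q0, -1, Q1)                      # Q0^{-1} mod Q1 (Python 3.8+)
    return r0 + Q0 * (((r1 - r0) * inv) % Q1)

def rns_poly_mul(a, b):
    # Overhead 1: split every coefficient into its two residues.
    a0, b0 = [c % Q0 for c in a], [c % Q0 for c in b]
    a1, b1 = [c % Q1 for c in a], [c % Q1 for c in b]
    # Two independent small-modulus products (parallel units in hardware).
    c0 = poly_mul_mod(a0, b0, Q0)
    c1 = poly_mul_mod(a1, b1, Q1)
    # Overhead 2: CRT reconstruction, coefficient by coefficient.
    return [crt_pair(x, y) for x, y in zip(c0, c1)]
```

Each residue product only handles values below about 2^7, while the direct product modulo q = 9797 needs roughly twice that width; this is the "smaller ALU width" advantage at toy scale.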

On Hardware
Parallel processing using multiple Residue Polynomial Arithmetic Units (RPAUs)
[Figure: RPAU 0 … RPAU L, each with a memory file and cores]
The number of RPAUs is a design parameter

Hardware accelerators for Homomorphic Computation
→ Arithmetic of large coefficients
→ Arithmetic of large polynomials

Polynomial multiplication: multiple butterfly cores
Single core NTT too slow!
Design challenges:
• Long routing
• Memory access conflicts
[Figure: several butterfly cores connected to BRAM blocks]

Memory access parallelism
[Figure: for the NTT loops m = 2, 4, 8, …, 2048, 4096, the coefficients are divided into an upper and a lower part; the upper BRAM (#1024 to #2047) is attached to NTT Core 2 and the lower BRAM (#0 to #1023) to NTT Core 1]
COSIC, KU Leuven

Block Level Pipelining
• Separate building blocks for block-level pipeline
• Realize a resource-shared architecture
• Reduces the area requirement
• Increases the computation time
[Figure label: Lift]

Execution Units
• Two parallel cores for Lift and Scale
• Seven Residue Polynomial Arithmetic Units (RPAUs)
[Figure: a Lift & Scale unit and RPAU 0 … RPAU 6, each with Core 0, Core 1 and a memory file]
Parameters:
• Ciphertext polynomial degree 4096
• Ciphertext coefficient size 180 bits

Arm + FPGA Implementation
Zynq UltraScale+ MPSoC ZCU102
[Figure: four Arm cores (Arm 0 to Arm 3) with cache and memory controller; DMA; on the FPGA, Coprocessor 0 and Coprocessor 1, each behind an AXI interface]
Source code public on Github

Performance of High-Level Operations
Operation                            Speed (cycles)   Speed (msec)
Add in HW                                    31,339          0.026
Multiply in HW                            5,349,567          4.458
Send two ciphertexts to HW                  434,013          0.362
Receive result ciphertext from HW           216,697          0.180
Measurements are in cycles of the CPU clocked at 1200 MHz; the coprocessor is clocked at 200 MHz
Publication: HPCA 2019
• 400 homomorphic multiplications per sec (2 cores)
• Faster than Tesla K80 GPU

Resource Utilization
                                     LUTs      REGs     BRAMs   DSPs
Two Coprocessors & Interface
  # of used instances                133,692   60,312   815     416
  % utilization                      49        11       89      16
A Single Coprocessor & Interface
  # of used instances                63,522    25,622   388     208
  % utilization                      23        5        43      8

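Conclusions so far
• Ring-LWE is efficient in hardware and software
• But, there are security concerns due to special structure
\[
\begin{pmatrix}
a_0 & -a_3 & -a_2 & -a_1 \\
a_1 & a_0 & -a_3 & -a_2 \\
a_2 & a_1 & a_0 & -a_3 \\
a_3 & a_2 & a_1 & a_0
\end{pmatrix}
*
\begin{pmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \end{pmatrix}
+
\begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \end{pmatrix}
\approx
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\pmod{q}
\]
Special structure in matrix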
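The special structure is exactly what polynomial multiplication modulo x^4 + 1 produces: every column of the matrix is the previous one shifted down, with the wrapped entry negated. A minimal sketch that checks this, assuming a toy modulus Q = 17 and hypothetical helper names:

```python
Q = 17   # toy modulus (assumption, not a parameter from the slides)

def negacyclic_matrix(a):
    """Matrix whose column j holds the coefficients of a(x) * x^j mod (x^n + 1)."""
    n = len(a)
    return [[a[i - j] if i >= j else -a[n + i - j] for j in range(n)]
            for i in range(n)]

def mat_vec(matrix, vec, q):
    """Matrix-vector product with every entry reduced mod q."""
    return [sum(x * y for x, y in zip(row, vec)) % q for row in matrix]

def ring_mul(a, s, q):
    """Schoolbook product a(x) * s(x) mod (x^n + 1, q): wrapped terms change sign."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                out[i + j] = (out[i + j] + a[i] * s[j]) % q
            else:
                out[i + j - n] = (out[i + j - n] - a[i] * s[j]) % q
    return out

a = [3, 1, 4, 1]      # arbitrary example values
s = [1, 0, 16, 1]
assert mat_vec(negacyclic_matrix(a), s, Q) == ring_mul(a, s, Q)
```

The whole 4-by-4 matrix is determined by the four coefficients a_0, …, a_3, which is the source of both the efficiency and the structural concern named on the slide.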

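Interpolating LWE and ring-LWE: Module LWE
\[
\begin{pmatrix}
a_0 & -a_3 & -a_2 & -a_1 & a_8 & -a_{11} & -a_{10} & -a_9 \\
a_1 & a_0 & -a_3 & -a_2 & a_9 & a_8 & -a_{11} & -a_{10} \\
a_2 & a_1 & a_0 & -a_3 & a_{10} & a_9 & a_8 & -a_{11} \\
a_3 & a_2 & a_1 & a_0 & a_{11} & a_{10} & a_9 & a_8 \\
a_4 & -a_7 & -a_6 & -a_5 & a_{12} & -a_{15} & -a_{14} & -a_{13} \\
a_5 & a_4 & -a_7 & -a_6 & a_{13} & a_{12} & -a_{15} & -a_{14} \\
a_6 & a_5 & a_4 & -a_7 & a_{14} & a_{13} & a_{12} & -a_{15} \\
a_7 & a_6 & a_5 & a_4 & a_{15} & a_{14} & a_{13} & a_{12}
\end{pmatrix}
*
\begin{pmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \\ s_6 \\ s_7 \end{pmatrix}
+
\begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \end{pmatrix}
\approx
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
\]
Equivalently, as a 2-by-2 matrix of polynomials:
\[
\begin{pmatrix} a_{0,0}(x) & a_{0,1}(x) \\ a_{1,0}(x) & a_{1,1}(x) \end{pmatrix}
*
\begin{pmatrix} s_0(x) \\ s_1(x) \end{pmatrix}
+
\begin{pmatrix} e_0(x) \\ e_1(x) \end{pmatrix}
\approx
\begin{pmatrix} b_0(x) \\ b_1(x) \end{pmatrix}
\pmod{q} \pmod{x^4 + 1}
\]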
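The compact form can be evaluated directly as a small matrix of polynomials. The sketch below is illustrative only: the modulus Q = 17, ring degree 4 and the function names are assumptions, and the secret and error are passed in as fixed lists instead of being sampled.

```python
Q, N = 17, 4    # toy modulus and ring degree (assumptions)

def ring_mul(a, b):
    """a(x) * b(x) mod (x^N + 1, Q), schoolbook with sign flip on wrap-around."""
    out = [0] * N
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1
            out[(i + j) % N] = (out[(i + j) % N] + sign * a[i] * b[j]) % Q
    return out

def ring_add(a, b):
    """Coefficient-wise addition mod Q."""
    return [(x + y) % Q for x, y in zip(a, b)]

def module_lwe_sample(A, s, e):
    """Return b with b_i(x) = sum_j A[i][j](x) * s_j(x) + e_i(x)."""
    b = []
    for i, row in enumerate(A):
        acc = list(e[i])
        for a_ij, s_j in zip(row, s):
            acc = ring_add(acc, ring_mul(a_ij, s_j))
        b.append(acc)
    return b

one, zero, x = [1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0]
A = [[one, zero], [zero, x]]              # hand-picked so the result is easy to check
s = [[1, 2, 3, 4], [5, 6, 7, 8]]
e = [[1, 0, 0, 0], [0, 0, 0, 1]]
b = module_lwe_sample(A, s, e)
```

With a 1-by-1 matrix this reduces to a ring-LWE sample; growing the matrix dimension at fixed ring degree is the interpolation between ring-LWE and plain LWE that the slide title refers to.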

Saber: Module-LWR based key exchange, CPA-secure encryption and CCA-secure KEM
• A lattice-based candidate for NIST standardization; moved to second round!
• Jointly designed by EE and Math team!

SABER: flexibility and efficiency
• Saber uses the module-LWR problem
• Polynomials always have 256 coefficients [efficient polynomial arithmetic]
• Flexibility: the matrix dimension is parameterizable
  ➢ 2-by-2 for 115-bit post-quantum security: Light SABER
  ➢ 3-by-3 for 180-bit post-quantum security: SABER
  ➢ 4-by-4 for 245-bit post-quantum security: Fire SABER

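SABER: Parameter set
\[
\frac{p}{q}
\begin{pmatrix}
a_{0,0}(x) & \cdots & a_{0,k-1}(x) \\
\vdots & & \vdots \\
a_{k-1,0}(x) & \cdots & a_{k-1,k-1}(x)
\end{pmatrix}
*
\begin{pmatrix} s_0(x) \\ \vdots \\ s_{k-1}(x) \end{pmatrix}
\approx
\begin{pmatrix} b_0(x) \\ \vdots \\ b_{k-1}(x) \end{pmatrix}
\pmod{x^{256} + 1}
\]
• Polynomials of fixed size 256 coefficients
• Flexible dimension k = 2, 3 or 4
• How to choose p and q?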

Learning with rounding (LWR)
A problem with rounding: [equation], where p < q and the value being rounded is uniform in [0, q-1]
Prime q introduces rounding bias
- Cannot use prime q
- Hence, no NTT-based fast polynomial multiplication
+ No modular reduction
+ Easy rounding
→ We need to use a generic polynomial multiplication algorithm
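The rounding bias can be observed by rounding every value in [0, q-1] and counting how many inputs land on each output. The sketch below rests on assumptions: the rounding map is taken to be "scale by p/q and round to nearest", q = 2^13 with p = 2^10 and the prime q = 7681 are example moduli chosen here, and the function names are hypothetical.

```python
from collections import Counter

def round_to_p(x, q, p):
    """Scale x from Z_q to Z_p: round(p/q * x) mod p, using integers only."""
    return ((p * x + q // 2) // q) % p

def bucket_sizes(q, p):
    """Set of distinct preimage counts over all x in [0, q-1]."""
    counts = Counter(round_to_p(x, q, p) for x in range(q))
    return set(counts.values())

def round_shift(x, eq, ep):
    """The same map for q = 2^eq, p = 2^ep: add a constant, shift, mask."""
    return ((x + (1 << (eq - ep - 1))) >> (eq - ep)) & ((1 << ep) - 1)

# Power-of-two q: every output value has the same number of preimages.
uniform = bucket_sizes(2**13, 2**10)
# Prime q: q is not a multiple of p, so some outputs occur more often than others.
biased = bucket_sizes(7681, 2**10)
```

For the power-of-two pair the rounding also collapses into the add-shift-mask of `round_shift`, with no modular reduction step, which is one way to read the two "+" items on the slide.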

Next best polynomial multiplication algorithms
• Karatsuba multiplication: O(n^{log_2 3})
[Figure: the 256-coefficient operands A(x) and B(x) are each split into two halves of 128 coefficients]
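A minimal recursive sketch of Karatsuba on coefficient lists follows. It works over the integers with no modular reduction, assumes both operands have the same power-of-two length, and falls back to schoolbook multiplication below a cutoff; the names `schoolbook` and `karatsuba` and the `threshold` parameter are illustrative choices.

```python
def schoolbook(a, b):
    """Plain O(n^2) polynomial product of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def karatsuba(a, b, threshold=4):
    """Product of two equal-length lists (power-of-two length) with 3 half-size products."""
    n = len(a)
    if n <= threshold:
        return schoolbook(a, b)
    h = n // 2
    a_lo, a_hi = a[:h], a[h:]
    b_lo, b_hi = b[:h], b[h:]
    lo = karatsuba(a_lo, b_lo, threshold)                       # A_lo * B_lo
    hi = karatsuba(a_hi, b_hi, threshold)                       # A_hi * B_hi
    mid = karatsuba([x + y for x, y in zip(a_lo, a_hi)],
                    [x + y for x, y in zip(b_lo, b_hi)], threshold)
    out = [0] * (2 * n - 1)
    for i in range(2 * h - 1):
        out[i] += lo[i]
        out[i + 2 * h] += hi[i]
        out[i + h] += mid[i] - lo[i] - hi[i]    # cross term from one extra product
    return out
```

Each level replaces four half-size products by three, which is where the exponent log_2 3 comes from.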

Next best polynomial multiplication algorithms
• Toom-Cook multiplication
[Figure: the 256-coefficient operands A(x) and B(x) are each split into four parts of 64 coefficients]
Toom-Cook 4-way needs 7 multiplications
Karatsuba would need 9 multiplications

Toom-Cook 4 Way: step-by-step: splitting
[Figure: the 256-coefficient operands A(x) and B(x) are each split into four parts of 64 coefficients]
Splitting each operand into 4 polynomials. Take y = x^64:
A(y) = A_3 y^3 + A_2 y^2 + A_1 y + A_0
B(y) = B_3 y^3 + B_2 y^2 + B_1 y + B_0

Toom-Cook 4 Way: step-by-step: evaluation
Linear operations, then seven multiplications are computed

Toom-Cook 4 Way: step-by-step: interpolation
Linear operations
[Figure: interpolation matrix, with a callout on one entry: "This number has a role to play"]
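The three steps (splitting, evaluation, interpolation) can be sketched end to end. This is an illustrative sketch under assumptions, not Saber's implementation: it works over the integers, uses the evaluation points 0, 1, -1, 2, -2, 3 and infinity chosen here for simplicity, and interpolates with an exact rational inverse of the evaluation matrix. All names are hypothetical.

```python
from fractions import Fraction

POINTS = [0, 1, -1, 2, -2, 3]    # six finite points; infinity is the seventh (assumed choice)

def schoolbook(a, b):
    """Plain polynomial product of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def evaluate(parts, t):
    """Evaluate A(y) = sum_i parts[i] * y^i at y = t, coefficient-wise."""
    return [sum(p[j] * t**i for i, p in enumerate(parts))
            for j in range(len(parts[0]))]

def invert(matrix):
    """Exact Gauss-Jordan inverse over the rationals."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = 1 / aug[col][col]
        aug[col] = [v * scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Evaluation matrix for a degree-6 product; the last row is the point at infinity.
INTERPOLATION = invert([[t**i for i in range(7)] for t in POINTS] + [[0] * 6 + [1]])

def toom4(a, b):
    k = len(a) // 4                                   # length must be a multiple of 4
    A = [a[i * k:(i + 1) * k] for i in range(4)]      # splitting, y = x^k
    B = [b[i * k:(i + 1) * k] for i in range(4)]
    W = [schoolbook(evaluate(A, t), evaluate(B, t)) for t in POINTS]
    W.append(schoolbook(A[3], B[3]))                  # seventh product: leading terms
    out = [0] * (2 * len(a) - 1)
    for i in range(7):                                # interpolation: C_i = sum_r M[i][r] * W_r
        for j in range(2 * k - 1):
            c = sum(INTERPOLATION[i][r] * W[r][j] for r in range(7))
            out[i * k + j] += int(c)                  # exact over the integers
    return out
```

The inverse matrix contains fractions with even denominators. Over the integers the divisions are exact, but modulo a power of two an even divisor has no inverse, so an implementation has to carry extra bits of precision through the interpolation.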

Advanced Vector Extensions (AVX) Vectorized instructions for 16-bit operands

DSP instructions
ARM Cortex-M4
• Popular 32-bit microcontroller
• Has DSP instructions for half-word operations

AVX + microcontroller with DSP
Keep coefficients smaller than or equal to 16 bits to use
➢ _epi16( ) AVX intrinsics on high-end platforms
➢ DSP instructions on low-end microcontrollers
Options for q: 2^16, 2^15, 2^14, 2^13, … etc
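One reason a power-of-two q fits 16-bit lanes well is that wrap-around at 2^16 never disturbs the result modulo q when q divides 2^16. The sketch below emulates a 16-bit lane with a mask in plain Python; it does not use real AVX or DSP instructions, and the function names and the prime 7681 used for contrast are assumptions made here.

```python
LANE_MASK = 0xFFFF    # keep the low 16 bits, emulating one 16-bit lane

def dot_wrap16(a, b, q):
    """Inner product accumulated in a wrapping 16-bit lane, reduced mod q at the end."""
    acc = 0
    for x, y in zip(a, b):
        acc = (acc + x * y) & LANE_MASK      # overflow bits are silently dropped
    return acc % q

def dot_exact(a, b, q):
    """Reference inner product with unbounded integers."""
    return sum(x * y for x, y in zip(a, b)) % q

a = [8191, 4096, 1234, 7]
b = [8191, 3, 8000, 5555]
ok_pow2 = dot_wrap16(a, b, 2**13) == dot_exact(a, b, 2**13)          # q = 2^13 divides 2^16
ok_prime = dot_wrap16([7680], [7680], 7681) == dot_exact([7680], [7680], 7681)
```

Here `ok_pow2` is true because dropping bits 16 and above cannot change the low 13 bits, while `ok_prime` is false: the wrapped product of 7680 with itself is 0, but the exact value modulo 7681 is 1.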
