Observation
During NTT loop m = 2 (each memory word holds two coefficients: {A[0], A[1]}, {A[2], A[3]}, {A[4], A[5]}, …; the butterfly unit uses ω, +, −):
1. Read {A[0], A[1]} [1 cycle]
2. Butterfly(A[0], A[1]) [1 cycle]
3. Write {A[0], A[1]} [1 cycle]
Observation
The problem happens next, when m = 4: Butterfly(A[0], A[2]), Butterfly(A[4], A[6]), …
The two operands of a butterfly now sit in different memory words: A[0] is in {A[0], A[1]} and A[2] is in {A[2], A[3]}.
Solution: process 4 coefficients together
1. Read the two words {A[0], A[1]} and {A[2], A[3]} [2R]
2. Compute two butterflies [2C]
3. Write back the regrouped words {A[0], A[2]} and {A[1], A[3]} [2W]
Total: 2R + 2C + 2W
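The 2R + 2C + 2W schedule can be modelled in a few lines of Python. This is only a sketch: the toy modulus Q, the helper names butterfly and process_four, the counter dictionary, and the choice of twiddle factor 1 are assumptions for illustration, not details of the presented hardware.

```python
Q = 7681  # toy prime modulus, chosen only for illustration (assumption)

def butterfly(u, v, w, q=Q):
    """One NTT butterfly: returns (u + w*v, u - w*v) mod q."""
    t = (w * v) % q
    return (u + t) % q, (u - t) % q

def process_four(mem, i, j, w0, w1, stats):
    """Read words i and j, run two butterflies, write back regrouped words."""
    x0, x1 = mem[i]
    stats["R"] += 1
    x2, x3 = mem[j]
    stats["R"] += 1
    y0, y1 = butterfly(x0, x1, w0)
    stats["C"] += 1
    y2, y3 = butterfly(x2, x3, w1)
    stats["C"] += 1
    # Regroup so that the operands of the next stage share one memory word.
    mem[i] = (y0, y2)
    stats["W"] += 1
    mem[j] = (y1, y3)
    stats["W"] += 1

mem = [(1, 2), (3, 4)]  # words {A[0], A[1]} and {A[2], A[3]}
stats = {"R": 0, "C": 0, "W": 0}
process_four(mem, 0, 1, 1, 1, stats)  # twiddle factors set to 1 for simplicity
print(mem, stats)
```

After the call, word 0 holds positions 0 and 2 and word 1 holds positions 1 and 3, so a following m = 4 stage would find each butterfly's operands in a single word.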
Results
Before: O(n) + O(n) + O(n log n) + O(n)
Optimization 1: zero-cost prescaling. Now: O(n) + O(n) + O(n log n) + O(n)
Optimization 2: memory access reduction. Now: O(n) + O(n) + ½ O(n log n) + O(n)
[Figure: Architecture of NTT-based polynomial multiplier]
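As a software reference for what an NTT-based multiplier computes, the following sketch multiplies two polynomials modulo x^n + 1 by prescaling, forward NTT, pointwise multiplication, inverse NTT, and postscaling. The toy parameters (n = 8, q = 17, psi = 3) and all function names are assumptions for illustration; the hardware in the slides uses its own parameters and an iterative datapath. It needs Python 3.8+ for pow with a negative exponent.

```python
def ntt(a, omega, q):
    """Recursive radix-2 NTT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    w2 = omega * omega % q
    even = ntt(a[0::2], w2, q)
    odd = ntt(a[1::2], w2, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q          # butterfly: multiply by the twiddle factor
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out

def negacyclic_mul(a, b, psi, q):
    """a(x) * b(x) mod (x^n + 1, q); psi is a primitive 2n-th root of unity."""
    n = len(a)
    omega = psi * psi % q
    # Prescaling by powers of psi turns the cyclic product into a negacyclic one.
    a_s = [x * pow(psi, i, q) % q for i, x in enumerate(a)]
    b_s = [x * pow(psi, i, q) % q for i, x in enumerate(b)]
    fa, fb = ntt(a_s, omega, q), ntt(b_s, omega, q)
    fc = [x * y % q for x, y in zip(fa, fb)]
    c_s = ntt(fc, pow(omega, -1, q), q)  # inverse NTT up to the factor n
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [x * n_inv * pow(psi_inv, i, q) % q for i, x in enumerate(c_s)]

def schoolbook_negacyclic(a, b, q):
    """Quadratic-time reference used only to check the result."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
print(negacyclic_mul(a, b, 3, 17) == schoolbook_negacyclic(a, b, 17))
```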
Lattice-based public-key instruction-set encryption processor
Instruction set: 1. LOAD 2. ENCODE-LOAD 3. GAUSSIAN-LOAD 4. FFT 5. INV-FFT 6. ADD 7. CMULT 8. REARRANGE 9. READ
Throughput: 50,000 encryptions/sec, 100,000 decryptions/sec
Area: 1349 LUT, 860 FF, 1 DSPMULT, 2 BRAM18 (polynomial degree 256)
Publication: CHES 2014
Instruction-set ring-LWE cryptoprocessor
CHES 2014. Throughput: 50,000 encryptions/sec, 100,000 decryptions/sec. Area: 1.3K LUT, 860 FF, 1 DSPMULT, 2 BRAM18
CHES 2012. Throughput: 40,000 encryptions/sec, 80,000 decryptions/sec (each marked "<" on the slide). Area: 18349 LUT, 5644 FF
ECC [comparison label; no values recovered]
Ring-LWE encryption: follow-up works
Software implementations
• On 32-bit ARM: R. de Clercq, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "Efficient software implementation of ring-LWE encryption", DATE 2015
  Encryption: 121,166 cycles. Decryption: 43,324 cycles. Orders of magnitude faster than ECC
• On 8-bit AVR: Z. Liu, H. Seo, S. Sinha Roy, J. Großschädl, H. Kim, I. Verbauwhede, "Efficient Ring-LWE Encryption on 8-Bit AVR Processors", CHES 2015
  Encryption: 671,628 cycles. Decryption: 275,646 cycles
Ring-LWE encryption: follow-up works
Side-channel security: masking scheme
O. Reparaz, S. Sinha Roy, F. Vercauteren, I. Verbauwhede, "A masked ring-LWE implementation", CHES 2015
Hardware accelerators for Homomorphic Computation
Homomorphic computation
Interesting applications:
• Machine learning on encrypted data
• Prediction from consumption data in smart electricity meters
• Health-care applications
• Encrypted web-search engine
Lattice-based homomorphic encryption
1. Encrypt
2. Decrypt
Public keys p_0, p_1; private key s
[Figure: data is encrypted under p_0 and p_1 into the ciphertext (ct_0, ct_1); decryption with s recovers data from (ct_0, ct_1)]
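Lattice-based homomorphic encryption
1. Encrypt locally
2. Process on cloud
3. Decrypt locally
Public keys p_0, p_1; private key s
[Figure: data is encrypted locally under p_0 and p_1 into (ct_0, ct_1); the cloud processes the ciphertext; local decryption with s yields the processed data, marked data*. The processing step on the cloud is evaluated many times.]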
Homomorphic multiplication
[Figure: input ciphertexts (ct_{A,0}, ct_{A,1}) and (ct_{B,0}, ct_{B,1}) are multiplied into the result ciphertext (ct_{C,0}, ct_{C,1})]
Uses multiple computation blocks:
• Lift
• Polynomial multiplication
• Scale
Homomorphic multiplication: how complex?
[Figure: (ct_{A,0}, ct_{A,1}) and (ct_{B,0}, ct_{B,1}) multiplied into (ct_{C,0}, ct_{C,1})]
Two challenges:
• Coefficient size
• Polynomial length
Low-complexity applications:
• Polynomials have 4,000 coeffs.
• Coeffs are ~180 bits wide
Medium-complexity applications:
• Polynomials have 32,000 coeffs.
• Coeffs are ~1200 bits wide
Lattice-based key exchange schemes:
• Polynomials with 256 or 512 coeffs
• Coeff size ~10 bits
Hardware accelerators for Homomorphic Computation
→ Arithmetic of large coefficients
Application of Residue Number System
• We need to compute arithmetic modulo q
• Let q = ∏ q_i where the q_i are pairwise coprime
• Then we can work with the Residue Number System (RNS)
[Figure: arithmetic mod q is split into parallel RNS arithmetic mod q_0, mod q_1, …, mod q_L; the Chinese Remainder Theorem (CRT) recombines the residues into the result mod q]
RNS arithmetic:
• Small coefficients
• Parallel computation
Example: polynomial multiplication
Let q = q_0 ∙ q_1 where q_0 and q_1 are of equal bit-length
Input: a(x), b(x) mod q
Overhead 1: splitting into residues
Compute a(x) * b(x) mod q_0 and a(x) * b(x) mod q_1
Advantages:
• Parallel multiplications
• Smaller ALU width due to smaller coefficient size
Overhead 2: reconstruction from residues with the Chinese Remainder Theorem
Output: a(x) * b(x) mod q
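The split, multiply, reconstruct flow of this example can be checked with a short Python sketch. The moduli 97 and 101, the reduction modulo x^n + 1, and the function names are assumptions chosen for illustration; real homomorphic-encryption parameters are far larger. It needs Python 3.8+ for the modular inverse via pow.

```python
def negacyclic_mul(a, b, q):
    """Schoolbook a(x) * b(x) mod (x^n + 1, q); the ring is an assumption."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

def crt_pair(r0, r1, q0, q1):
    """Combine r0 mod q0 and r1 mod q1 into the unique value mod q0*q1."""
    inv = pow(q0, -1, q1)
    return r0 + q0 * (((r1 - r0) * inv) % q1)

q0, q1 = 97, 101          # coprime toy moduli of equal bit-length (7 bits)
q = q0 * q1
a = [1234, 5678, 9000, 42]
b = [777, 8888, 1, 9796]

# Overhead 1: split every coefficient into its residues.
a0, a1 = [x % q0 for x in a], [x % q1 for x in a]
b0, b1 = [x % q0 for x in b], [x % q1 for x in b]

# The two small multiplications are independent and could run in parallel.
c0 = negacyclic_mul(a0, b0, q0)
c1 = negacyclic_mul(a1, b1, q1)

# Overhead 2: CRT reconstruction, coefficient by coefficient.
c = [crt_pair(x, y, q0, q1) for x, y in zip(c0, c1)]
print(c == negacyclic_mul(a, b, q))
```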
On hardware
Parallel processing using multiple Residue Polynomial Arithmetic Units (RPAUs)
[Figure: RPAU 0 … RPAU L, each with a core and a memory file]
The number of RPAUs is a design parameter
Hardware accelerators for Homomorphic Computation
→ Arithmetic of large coefficients
→ Arithmetic of large polynomials
Polynomial multiplication: multiple butterfly cores
Single-core NTT is too slow!
Design challenges:
• Long routing
• Memory access conflicts
[Figure: several butterfly cores connected to multiple BRAMs]
Memory access parallelism
[Figure: the coefficient memory is split into a lower BRAM (#0 to #1023, served by NTT Core 1) and an upper BRAM (#1024 to #2047, served by NTT Core 2); the columns show the NTT stages m = 2, m = 4, m = 8, …, m = 2048, m = 4096]
COSIC - KU Leuven
Block-level pipelining
[Figure: pipeline of building blocks, starting with Lift]
• Separate building blocks for block-level pipeline
• Realize a resource-shared architecture
• Reduces the area requirement
• Increases the computation time
Execution units
• Two parallel cores for Lift and Scale
• Seven Residue Polynomial Arithmetic Units (RPAUs)
[Figure: Lift & Scale unit and RPAU 0 … RPAU 6, each with Core 0, Core 1 and a memory file]
Parameters:
• Ciphertext polynomial degree: 4096
• Ciphertext coefficient size: 180 bits
Arm + FPGA implementation
Zynq UltraScale+ MPSoC ZCU102
[Figure: four Arm cores (Arm 0 to Arm 3) with cache and memory controller; on the FPGA, a DMA and two coprocessors (Coprocessor 0, Coprocessor 1), each behind an AXI interface]
Source code public on GitHub
Performance of high-level operations

Operation                            Speed (cycles)   Speed (msec)
Add in HW                            31,339           0.026
Multiply in HW                       5,349,567        4.458
Send two ciphertexts to HW           434,013          0.362
Receive result ciphertext from HW    216,697          0.180

Measurements are in cycles of the CPU, clocked at 1200 MHz. The coprocessor is clocked at 200 MHz.
Publication: HPCA 2019
400 homomorphic multiplications per sec (2 cores). Faster than Tesla K80 GPU
Resource utilization
Number of used instances, with % utilization in parentheses

                                   LUTs           REGs          BRAMs      DSPs
Two coprocessors & interface       133,692 (49)   60,312 (11)   815 (89)   416 (16)
A single coprocessor & interface   63,522 (23)    25,622 (5)    388 (43)   208 (8)
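Conclusions so far
• Ring-LWE is efficient in hardware and software
• But there are security concerns due to the special structure

\[
\begin{pmatrix}
a_0 & -a_3 & -a_2 & -a_1 \\
a_1 & a_0 & -a_3 & -a_2 \\
a_2 & a_1 & a_0 & -a_3 \\
a_3 & a_2 & a_1 & a_0
\end{pmatrix}
*
\begin{pmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \end{pmatrix}
+
\begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \end{pmatrix}
\approx
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\pmod{q}
\]

Special structure in the matrix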
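The special structure is that of multiplication modulo x^4 + 1: column j of the matrix holds the coefficients of a(x) · x^j. A minimal Python sketch, with made-up coefficient values and the toy modulus 17 as assumptions, confirms that the matrix-vector product equals the polynomial product.

```python
def negacyclic_matrix(a):
    """Matrix whose entry (i, j) is a[i-j] if i >= j, else -a[n+i-j]."""
    n = len(a)
    return [[a[i - j] if i >= j else -a[n + i - j] for j in range(n)]
            for i in range(n)]

def mat_vec(m, v, q):
    """Matrix-vector product mod q."""
    return [sum(x * y for x, y in zip(row, v)) % q for row in m]

def poly_mul_negacyclic(a, s, q):
    """Schoolbook a(x) * s(x) mod (x^n + 1, q)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * s[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * s[j]) % q
    return c

q = 17                      # toy modulus (assumption)
a = [1, 2, 3, 4]            # a_0 .. a_3, illustrative values
s = [1, 0, 16, 1]           # s_0 .. s_3, illustrative values
print(negacyclic_matrix(a))
print(mat_vec(negacyclic_matrix(a), s, q) == poly_mul_negacyclic(a, s, q))
```

Only the n coefficients of a(x) are free, although the matrix has n × n entries; this is the structure the slide refers to.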
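Interpolating LWE and ring-LWE: Module-LWE

\[
\left(\begin{array}{rrrr|rrrr}
a_0 & -a_3 & -a_2 & -a_1 & a_8 & -a_{11} & -a_{10} & -a_9 \\
a_1 & a_0 & -a_3 & -a_2 & a_9 & a_8 & -a_{11} & -a_{10} \\
a_2 & a_1 & a_0 & -a_3 & a_{10} & a_9 & a_8 & -a_{11} \\
a_3 & a_2 & a_1 & a_0 & a_{11} & a_{10} & a_9 & a_8 \\ \hline
a_4 & -a_7 & -a_6 & -a_5 & a_{12} & -a_{15} & -a_{14} & -a_{13} \\
a_5 & a_4 & -a_7 & -a_6 & a_{13} & a_{12} & -a_{15} & -a_{14} \\
a_6 & a_5 & a_4 & -a_7 & a_{14} & a_{13} & a_{12} & -a_{15} \\
a_7 & a_6 & a_5 & a_4 & a_{15} & a_{14} & a_{13} & a_{12}
\end{array}\right)
*
\begin{pmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \\ s_6 \\ s_7 \end{pmatrix}
+
\begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \end{pmatrix}
\approx
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
\]

Equivalently, as a 2-by-2 matrix of polynomials:

\[
\begin{pmatrix} a_{0,0}(x) & a_{0,1}(x) \\ a_{1,0}(x) & a_{1,1}(x) \end{pmatrix}
*
\begin{pmatrix} s_0(x) \\ s_1(x) \end{pmatrix}
+
\begin{pmatrix} e_0(x) \\ e_1(x) \end{pmatrix}
\approx
\begin{pmatrix} b_0(x) \\ b_1(x) \end{pmatrix}
\pmod{q} \pmod{x^4 + 1}
\]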
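A toy Module-LWE sample in the 2-by-2, degree-4 setting of this slide can be generated as follows. This is a sketch under stated assumptions: the modulus 97, the seeded pseudo-random generator, the secret and error coefficients drawn from {-1, 0, 1}, and the function names are all illustrative and carry no security meaning.

```python
import random

def poly_mul(a, b, q):
    """Schoolbook a(x) * b(x) mod (x^n + 1, q)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

def module_lwe(A, s, e, q):
    """Return b with b_i(x) = sum_j a_{i,j}(x) * s_j(x) + e_i(x)."""
    k = len(A)
    b = []
    for i in range(k):
        acc = list(e[i])
        for j in range(k):
            prod = poly_mul(A[i][j], s[j], q)
            acc = [(x + y) % q for x, y in zip(acc, prod)]
        b.append(acc)
    return b

rng = random.Random(1)      # fixed seed so the example is repeatable
Q, N, K = 97, 4, 2          # toy modulus, polynomial length, module rank
A = [[[rng.randrange(Q) for _ in range(N)] for _ in range(K)] for _ in range(K)]
s = [[rng.choice([-1, 0, 1]) for _ in range(N)] for _ in range(K)]
e = [[rng.choice([-1, 0, 1]) for _ in range(N)] for _ in range(K)]
b = module_lwe(A, s, e, Q)
print(b)
```

Setting K = 1 gives a ring-LWE sample, and N = 1 with larger K gives plain LWE, which is the sense in which Module-LWE interpolates between the two.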
Saber: Module-LWR based key exchange, CPA-secure encryption and CCA-secure KEM
A lattice-based candidate for NIST standardization, moved to the second round!
Jointly designed by the EE and Math teams!
SABER: flexibility and efficiency
• Saber uses the module-LWR problem
• Polynomials always have 256 coefficients [efficient polynomial arithmetic]
• Flexibility: the matrix dimension is parameterizable
  ➢ 2-by-2 for 115-bit post-quantum security: Light SABER
  ➢ 3-by-3 for 180-bit post-quantum security: SABER
  ➢ 4-by-4 for 245-bit post-quantum security: Fire SABER
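SABER: parameter set

\[
\frac{p}{q}
\left[
\begin{pmatrix}
a_{0,0}(x) & \cdots & a_{0,k-1}(x) \\
\vdots & & \vdots \\
a_{k-1,0}(x) & \cdots & a_{k-1,k-1}(x)
\end{pmatrix}
*
\begin{pmatrix} s_0(x) \\ \vdots \\ s_{k-1}(x) \end{pmatrix}
\right]
\approx
\begin{pmatrix} b_0(x) \\ \vdots \\ b_{k-1}(x) \end{pmatrix}
\pmod{x^{256} + 1}
\]

• Polynomials of fixed size: 256 coefficients
• Flexible dimension k = 2, 3 or 4
• How to choose p and q?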
Learning with rounding (LWR)
A problem with rounding from modulus q to modulus p, where p < q: the value being rounded is uniform in [0, q-1]
Prime q introduces rounding bias
- Cannot use prime q
- Hence, no NTT-based fast polynomial multiplication
Using a power-of-two q instead:
+ No modular reduction
+ Easy rounding
→ We need to use a generic polynomial multiplication algorithm
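The rounding bias can be seen on a tiny example. The sketch below rounds every value in [0, q-1] to Z_p and counts how often each output occurs; the toy parameters (q = 7 versus q = 8, p = 4), the tie-breaking rule (ties round up), and the function names are assumptions for illustration.

```python
from collections import Counter

def round_q_to_p(x, q, p):
    """Nearest integer to (p/q) * x, ties rounded up, reduced mod p."""
    return ((2 * p * x + q) // (2 * q)) % p

def round_shift(x, log_q, log_p):
    """Same rounding for q = 2**log_q, p = 2**log_p: add a constant, then shift."""
    d = log_q - log_p
    return ((x + (1 << (d - 1))) >> d) % (1 << log_p)

def histogram(q, p):
    """How many inputs in [0, q-1] land on each output value."""
    return Counter(round_q_to_p(x, q, p) for x in range(q))

print(sorted(histogram(7, 4).items()))  # prime q: output 0 occurs less often
print(sorted(histogram(8, 4).items()))  # power-of-two q: all outputs equally likely
```

With power-of-two moduli the rounding reduces to one addition and one shift, as round_shift shows.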
Next best polynomial multiplication algorithms
• Karatsuba multiplication: O(n^{log_2 3})
[Figure: the 256-coefficient operands A(x) and B(x) are each split into two halves of 128 coefficients]
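A minimal recursive Karatsuba over integer coefficient lists is sketched below, assuming operand lengths that are powers of two (256 in the slide, 8 in the check). The function names are illustrative, and no modular reduction is applied.

```python
def schoolbook(a, b):
    """Quadratic-time polynomial product, used as a reference."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def karatsuba(a, b):
    """Product of two polynomials of equal power-of-two length."""
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]
    low = karatsuba(a0, b0)
    high = karatsuba(a1, b1)
    # One extra half-size product replaces the two cross products.
    cross = karatsuba([x + y for x, y in zip(a0, a1)],
                      [x + y for x, y in zip(b0, b1)])
    mid = [c - l - u for c, l, u in zip(cross, low, high)]
    out = [0] * (2 * n - 1)
    for i, v in enumerate(low):
        out[i] += v
    for i, v in enumerate(mid):
        out[i + h] += v
    for i, v in enumerate(high):
        out[i + 2 * h] += v
    return out

a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [2, 7, 1, 8, 2, 8, 1, 8]
print(karatsuba(a, b) == schoolbook(a, b))
```

Each level replaces four half-size products by three, which gives the O(n^{log_2 3}) cost.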
Next best polynomial multiplication algorithms
• Toom-Cook multiplication
[Figure: the 256-coefficient operands A(x) and B(x) are each split into four parts of 64 coefficients]
Toom-Cook 4-way needs 7 multiplications
Karatsuba would need 9 multiplications
Toom-Cook 4-way, step by step: splitting
[Figure: the 256-coefficient operands A(x) and B(x) are each split into four parts of 64 coefficients]
Splitting each operand into 4 polynomials. Take y = x^{64}:

\[
A(y) = A_3 y^3 + A_2 y^2 + A_1 y + A_0, \qquad
B(y) = B_3 y^3 + B_2 y^2 + B_1 y + B_0
\]
Toom-Cook 4-way, step by step: evaluation
Linear operations, then seven multiplications are computed
Toom-Cook 4-way, step by step: interpolation
Linear operations
[Annotation pointing at a constant in the interpolation formulas, which were not recovered: "This number has a role to play"]
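The three steps (splitting, evaluation, interpolation) can be sketched in Python. Several things here are assumptions made for illustration and are not taken from the slides: the evaluation points 0, ±1, ±2, 3 and infinity, the use of exact rational arithmetic (fractions.Fraction) for the interpolation, and all function names. An optimized implementation would use hand-derived interpolation formulas and modular arithmetic instead.

```python
from fractions import Fraction

POINTS = [0, 1, -1, 2, -2, 3]   # six finite points; infinity is handled separately

def schoolbook(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def invert(m):
    """Gauss-Jordan inverse of a square matrix over the rationals."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def toom4(a, b):
    """Product of two polynomials whose common length is a multiple of 4."""
    s = len(a) // 4
    pa = [a[k * s:(k + 1) * s] for k in range(4)]    # splitting: A_0 .. A_3
    pb = [b[k * s:(k + 1) * s] for k in range(4)]
    prods = []
    for t in POINTS:                                  # evaluation: linear operations
        ea = [sum(p[i] * t ** k for k, p in enumerate(pa)) for i in range(s)]
        eb = [sum(p[i] * t ** k for k, p in enumerate(pb)) for i in range(s)]
        prods.append(schoolbook(ea, eb))              # one of the seven products
    prods.append(schoolbook(pa[3], pb[3]))            # value at infinity
    # Interpolation: invert the map from the 7 result parts to the 7 products.
    rows = [[t ** k for k in range(7)] for t in POINTS] + [[0] * 6 + [1]]
    minv = invert(rows)
    out = [0] * (2 * len(a) - 1)
    for k in range(7):
        for i in range(2 * s - 1):
            w = sum(minv[k][j] * prods[j][i] for j in range(7))
            out[k * s + i] += int(w)                  # w is an exact integer here
    return out

a = list(range(1, 17))
b = list(range(16, 0, -1))
print(toom4(a, b) == schoolbook(a, b))
```

Only seven quarter-size products are computed; everything else is additions and multiplications by small constants.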
Advanced Vector Extensions (AVX) Vectorized instructions for 16-bit operands
DSP instructions ARM Cortex-M4 • Popular 32-bit microcontroller • Has DSP instructions for half-word operations
AVX + microcontroller with DSP
Keep coefficients smaller than or equal to 16 bits to use
➢ _epi16( ) AVX intrinsics on high-end platforms
➢ DSP instructions on low-end microcontrollers
Options for q: 2^{16}, 2^{15}, 2^{14}, 2^{13}, … etc.
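Why a power-of-two q fits 16-bit lanes can be illustrated with plain Python integers. The sketch emulates a multiply that keeps only the low 16 bits of the product (the behaviour of low-half 16-bit vector multiplies such as _mm256_mullo_epi16) and then reduces with a mask. The helper names and the choice q = 2^13 are assumptions for illustration.

```python
MASK16 = 0xFFFF

def mul16(a, b):
    """Emulate a 16-bit multiply that keeps only the low half-word."""
    return (a * b) & MASK16

def reduce_pow2(x, log_q):
    """Reduce modulo q = 2**log_q with a mask; no division is needed."""
    return x & ((1 << log_q) - 1)

log_q = 13
q = 1 << log_q
a, b = 8191, 4097

# Because q divides 2**16, the discarded high bits never affect the result mod q.
print(reduce_pow2(mul16(a, b), log_q) == (a * b) % q)
```

The same argument holds for additions, so intermediate values can wrap around in 16 bits and be masked once at the end.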