High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA
Ahmad Al Badawi, ahmad@u.nus.edu
National University of Singapore (NUS)
Sept 10th 2018, CHES 2018
FHE – The holy grail of Cryptography
• FHE enables computing on encrypted data without decryption [GB2009]
• Challenge: requires enormous computation
[Figure: the client handles encryption/decryption and sends $\mathrm{Enc}(x)$ to the cloud server; the server performs the homomorphic evaluation of $f$ and returns $\mathrm{Enc}(f(x))$.]
How is the problem being tackled?
• Algorithmic methods:
– New FHE schemes
– Plaintext packing (1D, 2D, …)
– Encoding schemes
– Approximated computing
– Squashing the target function
– DAG optimizations for the target circuit
• Acceleration methods:
– Speed up FHE basic primitives (KeyGen, Enc, Dec, Add, Mul)
– Modular algorithms
– Parallel implementations
– Hardware implementations: GPUs, FPGAs and probably ASICs
Our Contributions
1. Implementation of FV RNS on GPUs
2. Introducing a set of CUDA optimizations
3. Benchmarking with state-of-the-art implementations
Why GPUs for FHE?
• GPU (+)
– Naturally available
– Many computing cores
– Developer friendly (compared with FPGA, ASIC)
• FHE (+)
– Huge level of parallelism
“If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” Seymour Cray (1925-1996)
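Textbook FV
• Basic mathematical structure is $R = \mathbb{Z}[x]/(x^n + 1)$
– Plaintext space: $R_t = \mathbb{Z}_t[x]/(x^n + 1)$
– Ciphertext space: $R_q = \mathbb{Z}_q[x]/(x^n + 1)$
• Public key: $pk_0, pk_1 \in R_q$
• Secret key: $sk \in R_q$
• $c = \mathrm{Enc}(m)$: $\left( \left[ \left\lfloor \frac{q}{t} \right\rfloor m + pk_0 u + e_0 \right]_q, \left[ pk_1 u + e_1 \right]_q \right)$
• $m = \mathrm{Dec}(c)$: $\left[ \left\lfloor \frac{t}{q} \left[ c_0 + c_1 sk \right]_q \right\rceil \right]_t$, where $c = (c_0, c_1)$
• $c_+ = \mathrm{Add}(c_0, c_1)$: $\left( \left[ c_{00} + c_{10} \right]_q, \left[ c_{01} + c_{11} \right]_q \right)$, where $c_i = (c_{i0}, c_{i1})$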
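The Enc, Dec and Add formulas can be tried out on a toy instance. The following is a minimal Python sketch, not the talk's GPU code: the parameters `n`, `q`, `t`, the ternary sampling used for the secret key and all noise terms, and the key generation rule `pk = (-(a*sk + e), a)` are assumptions taken from the usual textbook description, and the sizes are far too small to be secure.

```python
import random

n, q, t = 16, 1 << 20, 16      # toy parameters (assumed, insecure)
delta = q // t                 # floor(q / t)

def poly_add(a, b, mod):
    """Coefficient-wise addition in Z_mod[x]/(x^n + 1)."""
    return [(x + y) % mod for x, y in zip(a, b)]

def poly_mul(a, b, mod):
    """Schoolbook product in Z_mod[x]/(x^n + 1): x^n wraps around as -1."""
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < n:
                res[i + j] = (res[i + j] + ai * bj) % mod
            else:
                res[i + j - n] = (res[i + j - n] - ai * bj) % mod
    return res

def small_poly(rng):
    """Ternary polynomial, a stand-in for the key and error distributions."""
    return [rng.choice((-1, 0, 1)) for _ in range(n)]

def keygen(rng):
    sk = small_poly(rng)
    a = [rng.randrange(q) for _ in range(n)]
    e = small_poly(rng)
    pk0 = [(-x) % q for x in poly_add(poly_mul(a, sk, q), e, q)]
    return sk, (pk0, a)

def encrypt(pk, m, rng):
    u, e0, e1 = small_poly(rng), small_poly(rng), small_poly(rng)
    scaled = [delta * x % q for x in m]
    c0 = poly_add(poly_add(scaled, poly_mul(pk[0], u, q), q), e0, q)
    c1 = poly_add(poly_mul(pk[1], u, q), e1, q)
    return c0, c1

def decrypt(sk, c):
    v = poly_add(c[0], poly_mul(c[1], sk, q), q)           # [c0 + c1*sk]_q
    centered = [x - q if x > q // 2 else x for x in v]     # lift to (-q/2, q/2]
    # integer form of round(t*x/q), then reduce modulo t
    return [((2 * t * x + q) // (2 * q)) % t for x in centered]

def add(c, d):
    """Homomorphic addition: add the two ciphertexts component by component."""
    return poly_add(c[0], d[0], q), poly_add(c[1], d[1], q)
```

With these sizes the decryption noise stays far below $q/(2t)$, so `decrypt(sk, encrypt(pk, m, rng))` returns `m`, and decrypting `add(c, d)` gives the coefficient-wise sum of the two plaintexts modulo `t`.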
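Textbook FV (cont.)
• $c_\times = \mathrm{Mul}(c_0, c_1, evk)$:
1. Tensor product:
$v_0 = \left[ \left\lfloor \frac{t}{q} c_{00} c_{10} \right\rceil \right]_q$, $v_1 = \left[ \left\lfloor \frac{t}{q} (c_{00} c_{11} + c_{01} c_{10}) \right\rceil \right]_q$, $v_2 = \left[ \left\lfloor \frac{t}{q} c_{01} c_{11} \right\rceil \right]_q$
2. Base decomposition:
$v_2 = \sum_{i=0}^{\ell} v_2^{(i)} w^i$, $\ell = \lfloor \log_w q \rfloor$
3. Relinearization:
$c_\times = \left[ v_j + \sum_{i=0}^{\ell} evk_{ij} \cdot v_2^{(i)} \right]_q$, $j \in \{0, 1\}$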
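The base decomposition step splits every coefficient into small base-$w$ digits before the evaluation key is applied. The sketch below is illustrative only: the function names are hypothetical, and it assumes the digit count $\ell + 1$ with $\ell = \lfloor \log_w q \rfloor$ used in the original FV description.

```python
def num_digits(q, w):
    """Return l + 1 with l = floor(log_w(q)), using integer arithmetic only."""
    l = 0
    while w ** (l + 1) <= q:
        l += 1
    return l + 1

def base_decompose(poly, q, w):
    """Split each coefficient of poly (taken mod q) into base-w digits.

    Returns a list of digit polynomials: digits[i][j] is the i-th base-w digit
    of coefficient j, so that poly = sum_i digits[i] * w**i.
    """
    count = num_digits(q, w)
    digits = [[] for _ in range(count)]
    for coeff in poly:
        c = coeff % q
        for i in range(count):
            digits[i].append(c % w)
            c //= w
    return digits

def recompose(digits, q, w):
    """Inverse of base_decompose: sum_i digits[i] * w**i, reduced mod q."""
    n = len(digits[0])
    return [sum(d[j] * w ** i for i, d in enumerate(digits)) % q for j in range(n)]
```

Every digit is smaller than `w`, which is what keeps the noise added by the products with the evaluation key small; a larger `w` means fewer digit polynomials but larger digits.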
Implementation Requirements
• Polynomial arithmetic in cyclotomic rings
• Large polynomial degree (a few thousand)
– Power-of-2 cyclotomic
– Addition/Subtraction: $O(n)$
– Multiplication: $O(n \log n)$
• Large coefficients $\in \mathbb{Z}_q$ (a few hundred bits)
– Modular algorithms (RNS)
• Extra non-trivial operations:
– Scale-and-round $\left\lfloor \frac{t}{q} x \right\rceil$
– Base decomposition
Polynomial Arithmetic
• CRT: $q = \prod_{i=0}^{k-1} p_i$, where $p_i$ is a prime
[Figure: RNS/CRT maps a polynomial to a residue matrix with $n$ columns and one row per prime $p_0, p_1, \ldots, p_{k-1}$; each entry is a $\log_2 p$-bit number and $k = \log_2 q / \log_2 p$.]
• Addition/Subtraction: component-wise add/sub modulo $p_i$
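A small Python sketch of the residue matrix and its component-wise arithmetic follows. The three moduli are toy values chosen for illustration, not the primes used in the talk, and the function names are hypothetical.

```python
from math import prod

PRIMES = [97, 101, 103]        # toy pairwise-coprime moduli p_i (assumed)
Q = prod(PRIMES)               # q = p_0 * p_1 * p_2

def to_rns(poly):
    """Build the k x n residue matrix: row i holds the coefficients mod p_i."""
    return [[c % p for c in poly] for p in PRIMES]

def rns_add(A, B):
    """Component-wise addition: every row is reduced by its own prime."""
    return [[(x + y) % p for x, y in zip(ra, rb)]
            for ra, rb, p in zip(A, B, PRIMES)]

def rns_sub(A, B):
    """Component-wise subtraction, again row by row."""
    return [[(x - y) % p for x, y in zip(ra, rb)]
            for ra, rb, p in zip(A, B, PRIMES)]
```

Every entry can be computed independently of all the others, which is the property that makes this layout a natural fit for one GPU thread per matrix entry.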
Polynomial Arithmetic (cont.)
[Figure: RNS/CRT maps a polynomial to a residue matrix with $n$ columns and rows $p_0, p_1, \ldots, p_{k-1}$ (32-bit numbers, $k = \log_2 q / \log_2 p$); the NTT (DGT) maps this matrix to a matrix of the same shape in the transform domain, and $\mathrm{NTT}^{-1}$ maps it back.]
• Addition/Subtraction/Multiplication: component-wise add/sub/mul modulo $p_i$
DFT, NTT, DWT, DGT…?
• DFT
– Pros: well-established; several efficient libraries to use
– Cons: floating point errors increase as $n$ and the $p_i$'s increase; reducing precision (smaller $p_i$'s) => longer RNS matrix => more DFTs
• NTT
– Pros: exact
– Cons: transform length ($2n$)
• DWT
– Pros: exact; transform length ($n$)
– Cons: only power-of-2 cyclotomics
• DGT
– Pros: exact; transform length ($n/2$); 50% less interaction with memory
– Cons: only power-of-2 cyclotomics; Gaussian arithmetic (larger number of multiplications, ~30% - 40%)
• We use DGT in our implementation
Efficient DGT/NTT/DWT on GPU?
• $A = \mathrm{NTT}(a)$ s.t. $A_j = \sum_{i=0}^{n-1} a_i w^{ij} \bmod q$
• $a = \mathrm{NTT}^{-1}(A)$ s.t. $a_i = n^{-1} \sum_{j=0}^{n-1} A_j w^{-ij} \bmod q$
• Better to store $w^{ij}$ in a lookup table.
– LUT can be stored in GPU texture memory (which is limited on GPU)
– DWT LUTs are $O(n)$
– DGT LUTs are $O(n/2)$
• Compute in $GF(p)$ or in $GF(p_i)$?
– We found it is better to do it in $GF(p_i)$.
– Why? (see next)
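The two transform formulas, together with a precomputed table of powers of $w$, can be checked with a direct $O(n^2)$ Python sketch. The modulus 17, the length 8 and the root 2 are toy values picked for illustration; a real implementation would use a fast butterfly network rather than this double loop.

```python
P = 17        # toy prime modulus with N dividing P - 1 (assumed)
N = 8         # transform length
OMEGA = 2     # primitive N-th root of unity mod P: 2^4 = -1 and 2^8 = 1 (mod 17)

# Lookup table of the N distinct powers of OMEGA; w^(i*j) is LUT[(i*j) % N].
LUT = [pow(OMEGA, k, P) for k in range(N)]

def ntt(a):
    """A_j = sum_i a_i * w^(i*j) mod P, read straight from the definition."""
    return [sum(a[i] * LUT[(i * j) % N] for i in range(N)) % P for j in range(N)]

def intt(A):
    """a_i = N^(-1) * sum_j A_j * w^(-i*j) mod P."""
    n_inv = pow(N, -1, P)
    return [n_inv * sum(A[j] * LUT[(-i * j) % N] for j in range(N)) % P
            for i in range(N)]

def cyclic_convolution(a, b):
    """Reference product modulo x^N - 1, used only to check the transform."""
    return [sum(a[i] * b[(k - i) % N] for i in range(N)) % P for k in range(N)]

def ntt_multiply(a, b):
    """Pointwise product in the transform domain, then transform back."""
    return intt([x * y % P for x, y in zip(ntt(a), ntt(b))])
```

The table holds only `N` entries because the exponent can be reduced modulo `N`. Note that the plain transform gives products modulo $x^N - 1$; working modulo $x^n + 1$ additionally needs either zero padding to length $2n$ or a weighting by powers of a $2n$-th root of unity, which corresponds to the NTT and DWT rows of the comparison.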
Compute in $GF(p)$ or in $GF(p_i)$?
[Figure: RNS matrix with $n$ columns and rows $p_0, p_1, \ldots, p_{k-1}$, where $k = \log_2 q / \log_2 p$.]
• $GF(p)$
– $p$: 64-bit prime (should fit in one word)
– Prime size is bounded in terms of $p$ and $2n$ (one multiplication):
  $n$: $2^{12}$, $2^{13}$, $2^{14}$, $2^{15}$, $2^{16}$
  $\log_2 p$: 26, 25, 25, 24, 24
– Longer RNS matrix => more NTTs
– Size doubles (32-bit => 64-bit)
– Supports a limited number of operations in the NTT domain
• $GF(p_i)$
– $p_i$: word-size prime (can be 64-bit)
– Shorter RNS matrix => fewer NTTs
– No size doubling
– Supports an unlimited number of operations in the NTT domain
But, is NTT/DWT/DGT performance-critical?
[Figure: breakdown of homomorphic multiplication (AND) in the BFV FHE scheme, one chart each for toy, medium and large settings. Categories: NTT, RNS base extension, RNS scaling, others.]
Source: Halevi, Shai, Yuriy Polyakov, and Victor Shoup. "An Improved RNS Variant of the BFV Homomorphic Encryption Scheme." (2018).
Computing CRT on GPU?
• $\mathrm{CRT}(a, \{p_i\})$: $(a_0, \ldots, a_{k-1})$ with $a_i = a \bmod p_i$
• $\mathrm{CRT}^{-1}(a_0, \ldots, a_{k-1}) = a$ s.t. $a = \sum_{i=0}^{k-1} \frac{q}{p_i} \cdot \left[ \left( \frac{q}{p_i} \right)^{-1} a_i \pmod{p_i} \right] \pmod{q}$, where $q = \prod_{i=0}^{k-1} p_i$
• At least two methods:
– Classic algorithm
– Garner's algorithm
• Comparison (Classic vs. Garner's):
– LUT: $k^2$ vs. $\frac{k(k-1)}{2}$
– Thread divergence: non tractable vs. nil
• Is CRT critical to performance?
– No!
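Both reconstruction methods can be compared on small numbers. The Python sketch below uses four toy primes and hypothetical function names; the classic routine follows the summation formula on this slide, while the Garner routine is the standard mixed-radix form of the algorithm rather than anything taken from the talk.

```python
from math import prod

PRIMES = [97, 101, 103, 107]   # toy pairwise-coprime moduli p_i (assumed)
Q = prod(PRIMES)

def crt_forward(a):
    """CRT(a, {p_i}) = (a mod p_0, ..., a mod p_{k-1})."""
    return [a % p for p in PRIMES]

def crt_classic(residues):
    """a = sum_i (q/p_i) * [ (q/p_i)^(-1) * a_i mod p_i ] mod q."""
    total = 0
    for a_i, p in zip(residues, PRIMES):
        q_i = Q // p                               # multi-precision value q / p_i
        total += q_i * ((a_i * pow(q_i, -1, p)) % p)
    return total % Q

def crt_garner(residues):
    """Garner's algorithm: mixed-radix digits v_j, then a = sum_j v_j * p_0...p_{j-1}."""
    k = len(PRIMES)
    # Table of p_i^(-1) mod p_j for i < j: k(k-1)/2 single-word entries.
    inv = {(i, j): pow(PRIMES[i], -1, PRIMES[j]) for j in range(k) for i in range(j)}
    v = list(residues)
    for j in range(k):
        for i in range(j):                         # inner loop length grows with j
            v[j] = ((v[j] - v[i]) * inv[(i, j)]) % PRIMES[j]
    a, radix = 0, 1
    for v_j, p in zip(v, PRIMES):
        a += v_j * radix
        radix *= p
    return a
```

The classic form needs the large constants $q/p_i$ and a final reduction modulo $q$, whereas Garner's form keeps every intermediate digit below a single prime until the last recombination step.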
RNS tools
• Useful to:
– Remain in RNS representation
– Avoid costly multi-precision arithmetic
• Two basic operations:
– Scale-and-round
– Base decomposition
• Adopted from the BEHZ2016 scheme *
• Are RNS tools critical to performance?
– Extremely critical
* Bajard, Jean-Claude, et al. "A full RNS variant of FV like somewhat homomorphic encryption schemes." International Conference on Selected Areas in Cryptography. Springer, Cham, 2016.
FV_RNS Homomorphic Multiplication
Benchmarking Results
[Figure: four bar charts (Key Generation, Enc, Dec, HomoMul + Relinearization) comparing GPU-FV, SEAL and NFLlib-FV. Vertical axis: Time (ms). Horizontal axis: parameter sets (11,62), (12,186), (13,372), (14,744).]
Which FV RNS variant to implement?
• Two RNS variants of FV
– BEHZ
– HPS
• Answer can be found in:
– Al Badawi, Ahmad, et al. "Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme." IACR Cryptology ePrint Archive 2018 (2018): 589.
Thank You
Questions?
Ahmad Al Badawi
ahmad@u.nus.edu