Using Fully Homomorphic Encryption for Statistical Analysis of Categorical, Ordinal and Numerical Data
Wen-jie Lu 1, Shohei Kawasaki 1, Jun Sakuma 1,2,3
1. University of Tsukuba, Japan 2. JST CREST 3. RIKEN Center for AIP

Statistical Analysis on the Cloud
[Figure: multiple data providers, a third-party cloud server, and an analyst; labels: Data collection, Query & Result]
Cloud computing is useful for statistical analysis:
• Gather distributed data and reduce hardware cost.
• Minimal interaction between data providers and the cloud.
• The cloud does most of the work for the analyst.

Cloud Computing with Sensitive Data
[Figure: sensitive data sent to a third-party cloud server]
• Using outside cloud servers raises privacy concerns.
  o E.g., medical records, federal data.
• We want to calculate statistics on the cloud while keeping the data secret.

Secure Multiparty Computation (SMC)
[Figure: Z = F(x, y), where x, y are private inputs and F is a public function. Only Z is revealed.]
• Off-the-shelf tools for SMC protocols:
  o Yao’s garbled circuit (GC).
  o Fully homomorphic encryption (FHE).
• But development cost and efficiency hinder applications of GC and FHE in the cloud.
Yao. Protocols for secure computations. 1982.
Gentry. Fully homomorphic encryption using ideal lattices. 2009.

GC on the Cloud Environment
[Figure: GC protocol with secret sharing]
GC requires a large development cost:
• Multiple servers are needed.
  o Assume no collusion between servers.
• A fast network is necessary for computation.
  o E.g., 10 Gbps bandwidth.

FHE on the Cloud Environment
[Figure: FHE protocol exchanging ciphertexts]
• Lower development cost
  o A single server is enough.
  o A fast network is not necessary.
• But it might be inefficient in practice
  o Bits are encrypted one by one.
  o 1~10 ms per evaluation.
  o 1~10 megabytes per ciphertext.
Gentry et al. Homomorphic Evaluation of the AES Circuit. 2012.

Observation
• Purpose of encrypting bits separately
  o To evaluate any Boolean function.
• But for statistical analysis, we can use
  o matrix arithmetic operations.
  o comparison operations.

Our Result
• Two new FHE-based primitives:
  o Matrix operations
  o Batch greater-than
• Secure statistical protocols:
  o histogram (count),
  o order of counts,
  o contingency table (with cell-suppression),
  o percentile,
  o principal component analysis (PCA),
  o linear regression.
• Source code: https://github.com/fionser/CODA

Preliminaries: Fully Homomorphic Encryption
• Public-private key scheme.
  o Data providers & cloud share the public key.
  o The analyst holds the private key.
• Allows addition (subtraction) and multiplication on encrypted integers.
  o Analogy: black box with gloves
Brakerski et al. Fully Homomorphic Encryption without Bootstrapping. 2012.

Preliminaries: Packing (Batching)
• Encrypt and process whole vectors at no extra cost: a single homomorphic operation gives multiple results.
  [1, 2, 3, 4] + [8, 7, 6, 5] = [9, 9, 9, 9]
  [1, 2, 3, 4] × [8, 7, 6, 5] = [8, 14, 18, 20]
  o Fewer ciphertexts
  o Faster computation
N.P. Smart et al. Fully homomorphic SIMD operations. 2011.
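
The slot-wise behaviour of packing can be mimicked on plaintext lists. The sketch below is only a plaintext stand-in, not an FHE implementation: the modulus `P` and the helper names `slotwise_add` and `slotwise_mul` are assumptions made for illustration.

```python
# Plaintext stand-in for a packed ciphertext: a list of slot values.
# P is an assumed toy plaintext modulus, not a parameter from the slides.
P = 65537


def slotwise_add(u, v):
    """One packed addition acts on every slot at once."""
    return [(a + b) % P for a, b in zip(u, v)]


def slotwise_mul(u, v):
    """One packed multiplication acts on every slot at once."""
    return [(a * b) % P for a, b in zip(u, v)]


x = [1, 2, 3, 4]
y = [8, 7, 6, 5]
packed_sum = slotwise_add(x, y)
packed_product = slotwise_mul(x, y)
print(packed_sum, packed_product)
```

Each call stands for a single homomorphic operation, which is why packing reduces both the number of ciphertexts and the number of operations.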

Preliminaries: Slot Manipulation
• Rotate the slots of an encrypted vector.
  [1, 2, 3, 4] >> 2 = [3, 4, 1, 2]
• Replicate a specific slot.
  [8, 5, 1, 5] @3 = [1, 1, 1, 1]
Halevi et al. Algorithms in HElib. 2014.
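
As a rough plaintext illustration of the two slot operations, the following sketch applies them to ordinary lists. The function names `rotate_right` and `replicate` are hypothetical and are not HElib calls; slot positions for `@` are taken to be 1-indexed, as in the slide example.

```python
def rotate_right(slots, k):
    """Cyclically shift the slots k positions to the right (the '>> k' on the slide)."""
    n = len(slots)
    k %= n
    return slots[n - k:] + slots[:n - k]


def replicate(slots, position):
    """Copy the value at a 1-indexed slot position into every slot (the '@position')."""
    return [slots[position - 1]] * len(slots)


rotated = rotate_right([1, 2, 3, 4], 2)
replicated = replicate([8, 5, 1, 5], 3)
print(rotated, replicated)
```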

Part II Technical Details
• Data preprocessing.
• Efficient matrix multiplication on ciphertexts.
• Comparing two encrypted integers.
• Example of two protocols:
  o Contingency table with cell-suppression
  o Linear regression
  (for other protocols, refer to our paper).

Data Preprocessing
• Numerical data: fixed-point representation
  o 3.14159 → ⌈3.14159 × 1000⌋ = 3142
  o Precision (e.g., 1000) determined in advance
• Categorical data: 1-of-k representation
  o Gender (i.e., k = 2): Female → [1, 0] and Male → [0, 1]
• Ordinal data: stair-case encoding
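
The three encodings can be sketched in a few lines. All function names are illustrative. The slide does not define stair-case encoding, so `staircase` below assumes one common reading: an ordinal level v out of k becomes v ones followed by k - v zeros.

```python
def to_fixed_point(value, precision=1000):
    """Scale a real number by the agreed precision and round to the nearest integer."""
    return round(value * precision)


def one_of_k(category, categories):
    """1-of-k (indicator) vector: a single 1 at the position of the category."""
    return [1 if c == category else 0 for c in categories]


def staircase(level, k):
    """ASSUMED definition: `level` ones followed by `k - level` zeros."""
    return [1] * level + [0] * (k - level)


print(to_fixed_point(3.14159))               # numerical
print(one_of_k("Female", ["Female", "Male"]))  # categorical, k = 2
print(staircase(2, 4))                       # ordinal, level 2 of 4
```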

Proposed Matrix Primitive
• Used for adding & multiplying encrypted matrices
• Encrypt each row separately by packing.
  o Row-wise encryption.
  o Horizontally partitioned data
• Efficient and layout consistent.
  o O(N^2) homomorphic operations.

Matrix Multiplication [1/2]
• Encrypt the matrix row by row with packing.
  [1 2]   [a b]   [1a+2c  1b+2d]
  [3 4] × [c d] = [3a+4c  3b+4d]
• First row: replicate [1, 2] @1 → [1, 1] and multiply by [a, b]; replicate @2 → [2, 2] and multiply by [c, d]; add the two products to get [1a+2c, 1b+2d].
• Second row: the same steps on [3, 4] give [3a+4c, 3b+4d].
• N^2 replications, multiplications and additions
  o O(N^2) complexity compared to O(N^3) (no packing).
• The resulting matrix is also encrypted row-wise.
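
The replicate, multiply and add pattern can be traced on plaintext rows. This is a minimal sketch assuming square N × N matrices and treating each Python list as one packed row; `matmul_rowwise` and the operation counter are illustrative names, not part of the authors' code.

```python
def replicate(slots, position):
    """Copy the value at a 1-indexed slot position into every slot."""
    return [slots[position - 1]] * len(slots)


def matmul_rowwise(a_rows, b_rows):
    """Multiply two square matrices stored as lists of packed rows.

    Returns the product (also as packed rows) and a count of packed operations.
    """
    n = len(b_rows)
    ops = {"replicate": 0, "multiply": 0, "add": 0}
    result = []
    for a_row in a_rows:
        acc = [0] * n
        for j in range(1, n + 1):
            rep = replicate(a_row, j)                        # [a_ij, ..., a_ij]
            ops["replicate"] += 1
            prod = [r * b for r, b in zip(rep, b_rows[j - 1])]  # times row j of B
            ops["multiply"] += 1
            acc = [s + p for s, p in zip(acc, prod)]
            ops["add"] += 1
        result.append(acc)
    return result, ops


product, ops = matmul_rowwise([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, ops)
```

Because the output is again a list of packed rows, it can be fed straight into the next multiplication without any re-layout.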

Matrix Multiplication [2/2]
• Layout consistency is important for developing efficient statistical protocols.
  o Statistical algorithms need iterative matrix multiplications.
[Figure: decision diagram. Given a primitive that is efficient for a single multiplication, is it layout consistent? No: heavy layout adjustment, inefficient for iterative multiplications. Yes: still efficient for iterative multiplications.]

Experimental Settings of Matrix Primitive
• Implementations:
  o FHE: HElib (C++ based)
  o GC: ObliVM (Java based)
• Evaluated on 32-bit integers
• Networks:
  o LAN (about 88 Mbps)
  o WAN (about 48 Mbps)
HElib. https://github.com/shaih/HElib.
Liu et al. ObliVM: A programming framework for secure computation. 2015.

Evaluation of Matrix Primitive
[Figure: two panels. Execution Time: elapsed time (s) vs. matrix dimension (2 to 64) for FHE-LAN, FHE-WAN, GC-LAN and GC-WAN. Communication Cost: data transferred (MB) vs. matrix dimension (2 to 64) for FHE and GC.]
• For iterative multiplications, the FHE-based primitive can offer better performance.
  o It saves the communication cost between iterations.

Greater-than (GT) Primitive
GT(e(x), e(y)) → e(x >? y) s.t. 0 ≤ x, y ≤ D
• [Golle06], based on the Paillier cryptosystem:
  if x > y then ∃k ∈ [1, D] such that x − y − k = 0
• Combining it with packing gives a great improvement:
  e(x, …, x) − e(y, …, y) − [1, 2, …, D] → e(θ), with x and y each replicated D times
  o 0 ∈ θ ⟺ x > y (i.e., decryption is needed)
  o Complexity drops from D to ⌈D/ℓ⌉.
Golle. A private stable matching algorithm. 2006.
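
The packed comparison can be checked on plaintext values. The sketch below is a plaintext simulation only: the prime modulus `P` (assumed to be larger than 2D so that no slot wraps around to zero by accident), the slot count and all function names are assumptions for illustration.

```python
import math

P = 65537  # assumed prime plaintext modulus, chosen so that P > 2 * D


def gt_slots(x, y, D):
    """Slot k holds x - y - k for k = 1..D; exactly one slot is 0 when x > y."""
    return [(x - y - k) % P for k in range(1, D + 1)]


def split_into_ciphertexts(theta, num_slots):
    """Pack the D values into ceil(D / num_slots) vectors of at most num_slots slots."""
    return [theta[i:i + num_slots] for i in range(0, len(theta), num_slots)]


def analyst_reads_result(chunks):
    """After decryption, x > y holds iff some slot equals 0."""
    return any(0 in chunk for chunk in chunks)


D, num_slots = 16, 4
chunks = split_into_ciphertexts(gt_slots(7, 3, D), num_slots)
print(len(chunks) == math.ceil(D / num_slots), analyst_reads_result(chunks))
```

With the slides' setting of about 1700 slots, the same split would turn D separate comparisons into ⌈D/1700⌉ packed ones.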

Experimental Settings for GT Primitive
• Implementations:
  o FHE: HElib (C++ based)
  o GC: ObliVM (Java based)
• Domain D = 2^4 ~ 2^24
• Number of slots ℓ ≈ 1700.
• Networks:
  o LAN (about 88 Mbps)
  o WAN (about 48 Mbps)
HElib. https://github.com/shaih/HElib.
Liu et al. ObliVM: A programming framework for secure computation. 2015.

Evaluation of Greater-than Primitive
[Figure: two panels. Execution Time: elapsed time (s) vs. #bits of the domain (4 to 24) for FHE-LAN, FHE-WAN, GC-LAN and GC-WAN. Communication Cost: data transferred (MB) vs. #bits (4 to 24) for FHE and GC.]
• Works for small domains, which is enough for ordinal statistics.

Secure Statistical Protocols
• Contingency table with cell-suppression protocol:
  o Uses the greater-than primitive.
  o One-round protocol between cloud and analyst.
• Linear regression protocol:
  o Uses the matrix primitive.
  o Two-round protocol.
  o Uses a Plaintext Precision Expansion technique (discussed later).

Contingency Table
Categorical data:
  Gender   Smoke
  Male     Smoker
  Female   Non-smoker
  Male     Non-smoker
Contingency table (K1 = 2 rows, K2 = 2 columns):
           Smoker   Non-smoker
  Male     1        1
  Female   0        1
• Indicator encoding: Male → [1, 0], Female → [0, 1]; Smoker → [1, 0], Non-smoker → [0, 1]
• Basic idea: multiply & rotate
  o [a1, a2] × [b1, b2] counts Male-Smoker and Female-Non-smoker.
  o [a1, a2] × ([b1, b2] >> 1) = [a1, a2] × [b2, b1] gives the other two counts.
• Improvement with no extra preprocessing
  o O(max(k1, k2)) => O(log(k1 k2)).
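
The multiply-and-rotate idea can be replayed on the three example records. This is a plaintext sketch of the basic 2 × 2 case only (not the logarithmic improvement); the helper names and the way per-record products are summed are assumptions for illustration.

```python
def rotate_right(slots, k):
    """Cyclically shift the slots k positions to the right."""
    n = len(slots)
    k %= n
    return slots[n - k:] + slots[:n - k]


def slot_mul(u, v):
    return [a * b for a, b in zip(u, v)]


def slot_add(u, v):
    return [a + b for a, b in zip(u, v)]


GENDER = {"Male": [1, 0], "Female": [0, 1]}
SMOKE = {"Smoker": [1, 0], "Non-smoker": [0, 1]}
records = [("Male", "Smoker"), ("Female", "Non-smoker"), ("Male", "Non-smoker")]

diagonal = [0, 0]      # [Male-Smoker, Female-Non-smoker]
off_diagonal = [0, 0]  # [Male-Non-smoker, Female-Smoker]
for gender, smoke in records:
    a, b = GENDER[gender], SMOKE[smoke]
    diagonal = slot_add(diagonal, slot_mul(a, b))
    off_diagonal = slot_add(off_diagonal, slot_mul(a, rotate_right(b, 1)))

table = {
    "Male": {"Smoker": diagonal[0], "Non-smoker": off_diagonal[0]},
    "Female": {"Smoker": off_diagonal[1], "Non-smoker": diagonal[1]},
}
print(table)
```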

Contingency Table: Cell Suppression
Origin table:
           Smoker   Non-smoker
  Male     20       11
  Female   3        12
Suppressed table (cells < 10 are zeroed out):
           Smoker   Non-smoker
  Male     20       11
  Female   0        12
• Protect the privacy of rare individuals.
• Given a ciphertext e(x), compute e(y) where y = x if x > threshold, else y = some random value.
• GT(e(x), threshold) = e(θ); x > threshold iff 0 ∈ θ.
• Compute { e(x + r), e(θ + r), e(θ × r′) }
  o r, r′ are non-zero random vectors.
  o If 0 ∈ θ, then 0 ∈ θ × r′ at the same slot; at that slot θ + r reveals r, and subtracting it from x + r gives x.
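
The masking step can be simulated on plaintext values to see why the analyst learns x only when it exceeds the threshold. This is a sketch under stated assumptions, not the authors' protocol code: the prime modulus `P` (larger than 2D), the domain size D = 32, the threshold 9 (so that cells below 10 are suppressed) and the function names are all chosen for illustration.

```python
import random

P = 65537  # assumed prime plaintext modulus, with P > 2 * D


def suppress_on_cloud(x, threshold, D, rng):
    """Values the cloud would compute under encryption, shown here in the clear."""
    theta = [(x - threshold - k) % P for k in range(1, D + 1)]
    r = [rng.randrange(1, P) for _ in range(D)]        # non-zero masks
    r_prime = [rng.randrange(1, P) for _ in range(D)]  # non-zero blinding factors
    masked_x = [(x + ri) % P for ri in r]
    masked_theta = [(t + ri) % P for t, ri in zip(theta, r)]
    blinded_theta = [(t * si) % P for t, si in zip(theta, r_prime)]
    return masked_x, masked_theta, blinded_theta


def recover_on_analyst(masked_x, masked_theta, blinded_theta):
    """Analyst side after decryption: returns x if x > threshold, else None."""
    if 0 not in blinded_theta:
        return None                 # every mask stays unknown, so x stays hidden
    i = blinded_theta.index(0)      # theta[i] == 0 at this slot
    r_i = masked_theta[i]           # hence (theta + r)[i] is exactly r[i]
    return (masked_x[i] - r_i) % P


rng = random.Random(0)
cells = [20, 11, 3, 12]
released = [recover_on_analyst(*suppress_on_cloud(c, 9, 32, rng)) for c in cells]
print(released)
```

Because P is prime and r′ has no zero entries, a slot of θ × r′ is zero only where θ itself is zero, so the blinded vector leaks nothing beyond the comparison result.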

Contingency Table Performance Evaluation
[Figure: elapsed time (s) vs. table size (k1 k2), #records = 4000]
• Complexity increases logarithmically with the table size.
• Most of the work (>90%) is done by the cloud.