PrivPy: Scalable and General Privacy-Preserving Data Mining
Yi Li ∗, Yitao Duan †, Yu Yu §, Shouyao Zhao §, Wei Xu ∗
∗ Institute for Interdisciplinary Information Sciences, Tsinghua University
† NetEase Youdao
§ Shanghai Jiaotong University
Making use of data vs. data privacy
[Figure labels: Privacy Compliance; Data asset]
Scenario 1: Multi-source data mining
[Figure labels:]
• Private inputs of data owners
• Compute servers see nothing
• Get nothing other than the final results
Scenario 2: Inference with secret models and data
[Figure labels: Private model; Private data; Inference result]
• Similar setting to federated learning, but here we also want to protect the model itself.
A nice theory provides a solution
• Secure multi-party computation (MPC): $y = F(x_1, x_2, \ldots, x_n)$
• We can compute any function $F()$ without revealing the inputs $x_i$.
• No noise is introduced in the computation, and nothing other than the result is revealed.
Tons of cryptography-based solutions tell us …
• Many novel theoretical solutions
  - Secret Sharing (Shamir 1979)
  - Garbled Circuit (Yao 1986)
  - Fully Homomorphic Encryption (Gentry 2009)
• Even many “practical” solutions exist
  - Sharemind (2008)
  - TASTY (2010)
  - SPDZ (2012)
  - PICCO (2013)
  - SecureML (2017)
  - ABY3 (2018)
• But why are people still not using them to mine real-world data?
The gap between cryptography and data science
The Cryptography World
• Efficient bit-wise and integer operations
• Fast single number arithmetic
• Theoretically innovative
• A custom and beautiful programming language
The Data Science World
• Efficient operations on real numbers
• Fast vector and array operations
• Scalable system implementation
• Familiar language with rich algorithm libraries
The gap is like that between a set of data structures and a relational database.
PrivPy attempts to bridge the gap
• A fast (4,2)-secret-sharing protocol and engine
• Python language with automatic code optimizer
• NumPy types and libraries
• Runs non-trivial algorithms on real data
[Figure: PrivPy architecture. Front-end: convenient APIs, language interpreter, optimizer. Back-end: computation engines.]
Crypto preliminary: basic secret sharing
- Two semi-honest servers: $S_1$ and $S_2$
- A large (e.g. 256-bit) number $p$
- Computation in the field $\mathbb{Z}_p = \{0, 1, \ldots, p-1\}$
- A secret $u$ is split as $u = u_1 + u_2$, written $\varphi(u) = (u_1, u_2)$, with $S_1$ holding $u_1$ and $S_2$ holding $u_2$
  - $u_1$: uniformly distributed in $\mathbb{Z}_p$
  - $u_2 := u - u_1 \pmod{p}$
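The two-server sharing described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not PrivPy code: the modulus `P` (the prime 2^255 − 19) and the helper names `share`, `reconstruct` and `add_shares` are assumptions chosen for the example.

```python
import secrets

# Assumed modulus for illustration only: the prime 2**255 - 19.
P = 2**255 - 19

def share(secret, p=P):
    """Split a secret into two additive shares that sum to it mod p."""
    s1 = secrets.randbelow(p)      # uniformly distributed in {0, ..., p-1}
    s2 = (secret - s1) % p         # second share is fixed by the first
    return s1, s2

def reconstruct(s1, s2, p=P):
    """Recombine the two shares into the original secret."""
    return (s1 + s2) % p

def add_shares(a, b, p=P):
    """Addition is local: each server adds the shares it already holds."""
    return (a[0] + b[0]) % p, (a[1] + b[1]) % p

first = share(42)
second = share(100)
total = reconstruct(*add_shares(first, second))
```

Each share on its own is a uniformly random field element, so a single server learns nothing about the secret; only the sum of both shares recovers it.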
Multiplication: our (4,2)-secret sharing scheme
• Two auxiliary servers $S_a$ and $S_b$ compute the cross terms
• Benefit: only one round of communication for $\times$
• $w = u \times v = (u_1 + u_2)(v_1 + v_2) = u_1 v_1 + u_1 v_2 + u_2 v_1 + u_2 v_2$
[Figure: each of the four servers $S_1$, $S_2$, $S_a$, $S_b$ holds a pair of shares of $u$ and of $v$ and contributes one product term; the terms are combined into the output shares of $w$.]
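The reason multiplication needs extra help can be seen by expanding the product of two shared values. The sketch below only demonstrates that algebra; it is not the PrivPy protocol, it omits the re-randomization a secure protocol needs, and the modulus `P` and the variable names are assumptions made for the example.

```python
import secrets

P = 2**255 - 19  # assumed modulus, for illustration only

def share(secret):
    """Additive two-party sharing, as in the basic scheme."""
    s1 = secrets.randbelow(P)
    return s1, (secret - s1) % P

u1, u2 = share(6)
v1, v2 = share(7)

# Terms that one server can compute from the shares it already holds.
local_1 = (u1 * v1) % P    # first server holds u1 and v1
local_2 = (u2 * v2) % P    # second server holds u2 and v2

# Cross terms mix shares held by different servers. In this sketch they
# are simply computed in the clear, standing in for the auxiliary servers.
cross_a = (u1 * v2) % P
cross_b = (u2 * v1) % P

# Summing all four terms yields the product of the two secrets.
w = (local_1 + local_2 + cross_a + cross_b) % P
```

With only two servers, the cross terms would require an interactive sub-protocol; giving them to auxiliary servers is what the slide credits for the single communication round.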
Using fixed-point to represent real numbers
• Example: 010010011100100.11011001001…
  - Integer part: fixed length $l - k$ (010010011100100)
  - Decimal part: fixed length $k$ (11011001001)
• Use expensive bit-level operations
  - PICCO, Sharemind, SPDZ, etc.
• Support built-in fixed-point operations
  - SecureML, ABY3, PrivPy
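A fixed-point encoding of this kind can be illustrated on plain (unshared) values. The sketch below is an assumption-laden example rather than PrivPy's implementation: the modulus `P`, the 16 fractional bits, and the function names are all hypothetical, and the truncation is done in the clear rather than on secret shares.

```python
P = 2**255 - 19        # assumed modulus
FRAC_BITS = 16         # assumed length of the fractional part
SCALE = 1 << FRAC_BITS

def encode(x):
    """Scale a real number to an integer; negatives wrap around mod P."""
    return round(x * SCALE) % P

def to_signed(e):
    """Interpret the upper half of the field as negative values."""
    return e - P if e > P // 2 else e

def decode(e):
    """Undo the scaling to get back a real number."""
    return to_signed(e) / SCALE

def fx_mul(a, b):
    """Multiply two encodings, then drop FRAC_BITS bits to rescale."""
    prod = to_signed((a * b) % P)
    return (prod >> FRAC_BITS) % P

product = decode(fx_mul(encode(1.5), encode(-2.25)))
```

The product of two scaled values carries twice the fractional bits, which is why every multiplication is followed by a truncation step.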
The PrivPy computation engine
[Figure: clients $C_1, \ldots, C_k, \ldots, C_n$ with private inputs $x$, $y$, $z$ submit a task (Python code, data source address, result address, config) to four servers $S_1$, $S_2$, $S_a$, $S_b$. Each server runs an engine with a secret-share (SS) store and private operations (PO).]
The PrivPy computation engine
[Figure: each client splits its input into two shares ($x_1, x_2$; $y_1, y_2$; $z_1, z_2$) and sends them to the servers. The engines on $S_1$, $S_2$, $S_a$, $S_b$ run the private-operation protocols on the stored shares and return $res_1$ and $res_2$, from which the result is reconstructed as $res = res_1 + res_2$.]
Python-compatible programming front-end
• Overload basic operations for private variables: +, -, ×, >, etc.
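Operator overloading of this kind can be sketched with Python's special methods. The class below is a toy, not the PrivPy front-end: `PrivateInt`, `from_plain`, `reveal` and the modulus `P` are hypothetical names, both shares live in one object for simplicity, and only the operations that need no communication (addition, subtraction, multiplication by a public integer) are implemented.

```python
import secrets

P = 2**255 - 19  # assumed modulus

class PrivateInt:
    """Toy private variable holding both additive shares in one object.
    In a real deployment each share would live on a different server."""

    def __init__(self, s1, s2):
        self.s1, self.s2 = s1 % P, s2 % P

    @classmethod
    def from_plain(cls, value):
        r = secrets.randbelow(P)
        return cls(r, value - r)

    def __add__(self, other):
        if isinstance(other, PrivateInt):
            return PrivateInt(self.s1 + other.s1, self.s2 + other.s2)
        return PrivateInt(self.s1 + other, self.s2)  # public constant
    __radd__ = __add__

    def __sub__(self, other):
        if isinstance(other, PrivateInt):
            return PrivateInt(self.s1 - other.s1, self.s2 - other.s2)
        return PrivateInt(self.s1 - other, self.s2)  # public constant

    def __mul__(self, k):
        if isinstance(k, PrivateInt):
            return NotImplemented   # private * private needs a protocol
        return PrivateInt(self.s1 * k, self.s2 * k)
    __rmul__ = __mul__

    def reveal(self):
        v = (self.s1 + self.s2) % P
        return v - P if v > P // 2 else v   # map back to a signed value

a = PrivateInt.from_plain(10)
b = PrivateInt.from_plain(3)
result = (a + b - 1) * 2
```

Because the arithmetic is hidden behind `+`, `-` and `*`, the last line reads like ordinary Python even though it operates on shares.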
Most existing solutions define their own language
• Examples: PICCO, Obliv-C, SPDZ
• Why? Code written directly in Python has many pitfalls that result in inefficiency.
AST-level code optimization to avoid pitfalls
• Common factor extraction
• Auto vectorization
• Still adding more optimizations to the language front-end.
[Figure: illustrations of the two rewrites]
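A common-factor rewrite at the AST level can be sketched with Python's standard `ast` module. This is an illustrative assumption about what such a pass might look like, not PrivPy's optimizer: the class `FactorOut` and the function `optimize` are hypothetical, and the pass only handles the pattern `a*b + a*c` with a shared left factor.

```python
import ast

class FactorOut(ast.NodeTransformer):
    """Rewrite a*b + a*c into a*(b + c) when the left factors match."""

    def visit_BinOp(self, node):
        self.generic_visit(node)   # rewrite sub-expressions first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.BinOp)
                and isinstance(node.right, ast.BinOp)
                and isinstance(node.left.op, ast.Mult)
                and isinstance(node.right.op, ast.Mult)
                and ast.dump(node.left.left) == ast.dump(node.right.left)):
            summed = ast.BinOp(left=node.left.right, op=ast.Add(),
                               right=node.right.right)
            return ast.BinOp(left=node.left.left, op=ast.Mult(),
                             right=summed)
        return node

def optimize(source):
    """Parse, rewrite, and turn the tree back into source code."""
    tree = FactorOut().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)       # ast.unparse needs Python 3.9+

optimized = optimize("y = a * b + a * c")
```

The rewrite replaces two multiplications with one, which matters when each private multiplication costs a round of communication while additions are local.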
APIs: from basic OPs to algorithms
• Division: Newton-Raphson method
• Sigmoid: Euler method
• ReLU: comparison
• Other functions: $e^x$, $\log(x)$, …
Sigmoid function:
$y(x) = \frac{1}{1 + e^{-x}}$
$y'(x) = y(x)\,(1 - y(x))$
$y(x_{i+1}) = y(x_i) + y'(x_i)\,\Delta x = y(x_i) + y(x_i)\,(1 - y(x_i))\,\Delta x$
[Figure: basic OPs (Add, Mul, Cmp via garbled circuit, on shares $SS(d) = (d_1, d_2)$) feed the derived OPs (Division, Sigmoid, ReLU).]
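The derived operations listed on this slide can be illustrated on plain floating-point values, using only additions, multiplications and one comparison. This is a sketch under stated assumptions, not PrivPy's library code: the function names, the starting guess `x0`, and the iteration and step counts are all chosen for the example.

```python
import math

def reciprocal(d, iterations=30, x0=1e-3):
    """Newton-Raphson for 1/d using only add and multiply:
    x <- x * (2 - d * x). Converges when 0 < x0 < 2/d."""
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

def sigmoid_euler(x, steps=1000):
    """Integrate y' = y(1 - y) from y(0) = 0.5 with explicit Euler steps."""
    y = 0.5
    dx = x / steps
    for _ in range(steps):
        y = y + y * (1.0 - y) * dx
    return y

def relu(x):
    """One comparison and one multiplication."""
    return x * (x > 0)

approx_inverse = reciprocal(8.0)
approx_sigmoid = sigmoid_euler(2.0)
exact_sigmoid = 1.0 / (1.0 + math.exp(-2.0))
```

Both iterative methods avoid division and exponentiation, so each step maps directly onto the basic private operations.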
APIs: arrays are first-class citizens
• Array is a built-in type
  - A = pp.sarr(…); B = pp.sarr(…)
  - Both A * B and A + B work
• Array type is essential for data mining: it reduces the number of ops, and thus the number of rounds
• Support large arrays (e.g. 1 million × 5000, ~200 GB) using automatic disk buffer management
Beyond arrays: NumPy’s broadcasting and ndarray
• Allow operations between arrays of different shapes
  - E.g. a scalar $x$, a 3 × 4 array $A$ and a 2 × 3 × 4 array $B$
  - $x + A$, $A * B$ and $x > B$ all work
  - Can even mix plaintext and ciphertext
• Ndarray methods, e.g. $y = f(w^{\top} \cdot X + b)$
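The broadcasting rules referred to here are those of ordinary NumPy, so they can be demonstrated on plaintext arrays. The example below uses only standard NumPy calls; the concrete values, the weight vector `w`, the bias `b`, and the choice of a sigmoid for $f$ are assumptions for illustration, and no private arrays are involved.

```python
import numpy as np

x = 2.0                                          # scalar
A = np.arange(12, dtype=float).reshape(3, 4)     # 3 x 4 array
B = np.ones((2, 3, 4))                           # 2 x 3 x 4 array

scalar_plus_array = x + A     # x is stretched to shape (3, 4)
array_times_array = A * B     # A is stretched to shape (2, 3, 4)
scalar_compare = x > B        # element-wise comparison, shape (2, 3, 4)

# y = f(w^T . X + b) with f chosen here as the sigmoid.
w = np.array([0.5, -1.0, 2.0])    # one weight per feature (3 features)
X = A                             # 3 features x 4 samples
b = 0.1
y = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))   # one output per sample
```

A single broadcast expression replaces an explicit loop over elements, which is what keeps the number of operations, and hence communication rounds, low.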
API example: neural network inference
[Figure: an image and a model are fed to the PrivPy engine, which produces the inference result.]
Basic operation performance
Throughput of basic operations (ops per second), LAN (10 Gbps)

Engine         Approach                   Decimal multiplication   Comparison
PrivPy         SS                         10,473,532               1,282,027
HElib          FHE                        258                      -
Obliv-C        GC                         3,930                    78,431
P4P+HE         SS+HE                      4,344                    -
SPDZ           SS with active security    83,073                   20,472
SPDZ+PrivPy    SS with active security    83,229                   20,320

SPDZ+PrivPy: our thin wrapper.
Real-world algorithm performance
Dataset: MNIST with 70,000 labeled handwritten digits
Algorithms:
• Logistic Regression (LR): trained using SGD
• Matrix Factorization (MF): decomposes an $m \times n$ matrix into an $m \times 5$ matrix and a $5 \times n$ matrix
• CNN: LeNet-5

Time of training/inference for 1 iteration (seconds)

                    LAN (10 Gbps)                            WAN (50 Mbps)
Batch size          LR training  MF training  CNN inference  LR training  MF training  CNN inference
Single op           5.3e-3       7.1e-3       9.6e-2         2.61         0.37         7.64
Batch (1000 ops)    3.92         5.67         12.02          7.3          13.2         56.3
Conclusion and future work
• MPC can be useful in data mining, but there is a big gap to bridge
• PrivPy is an early attempt to make MPC practical for large datasets
  - Language, data types, function libraries
  - Scalable and efficient system implementation
  - Relies heavily on language-level optimizations
• PrivPy is an ongoing effort
  - Integrating with other privacy-preserving techniques: differential privacy, federated learning, trusted execution, etc.
  - More libraries, algorithms and compiler optimizations
Wei Xu: http://iiis.tsinghua.edu.cn/~weixu
Yi Li: xiaolixiaoyi@gmail.com