Data Analytics WITHOUT Seeing the Data
Max Ott … with input from the entire N1 Team
max.ott@data61.csiro.au
www.csiro.au

Challenge
[Diagram labels: Result: "Learn this!"; Computation: "Learn NOTHING"; Confidential]

The Problem
How can we learn valuable insights from sensitive data from multiple organisations?
[Diagram: two confidential sets of sensitive data feed a joint analysis that yields insights]

Three Basic Building Blocks
• Private computation
  • Arithmetic on encrypted numbers
• Distributed, confidential analytics
  • Distributed algorithms, computation & protocols
• Private Record Linkage
  • Privacy preserving record level matching

Solution (1): Private computation
[Figure: E(3) "+" E(2) = E(5). Each ciphertext is shown as a number several hundred digits long (for example 71175935987496430338…); decrypting (D) the combined ciphertext gives 5.]

Solution (2): Distributed analytics
Data always remains confidential to the source institution
[Diagram: an N1 Coordinator exchanges messages containing encrypted data with N1 secure compute nodes at Dept 1 and Dept 2; each department's data stays inside its own confidentiality boundary]

Solution (3): Private Record Linkage
[Diagram: records in Dataset A and Dataset B joined by uncertain ("?") links]
Victoria Mckon   7/06/1921   F
Tori Mckone      7/06/1921   F
Tori Mackon      6/07/1921   F

Solution (3): Private Record Linkage
[Diagram: one way hash functions map the names in each dataset to hashed values, which are compared by fuzzy matching]
Name         Hash       Hash       Name
Jane Doe     a8bf342    672bef4    Kat Clark
Paul Doe     f72630b    14ce54     Jim Clark
Jim Clark    14ce54     a8bf242    Janet Doe
Kate Clark   a72bef4    7830530    Shan Bo
Shan Bo      7830530    b3894f3    Bob Doe
Reg Pal      4bf6021    80ac364    Joe Smith

Use Cases
Scoring
[Diagram labels: Model; "?? Quality"; Own Data; Other Data]

Suspicious Activities
[Diagram labels: "Need to report?"; Model Builder]

Industry using Gov Data
[Diagram labels: Model Builder; Own Data; Gov Data]

Benchmarking
[Diagram labels: Own Data; Model Builder]

Device Analytics
[Diagram labels: Model of normal behaviour; Private Modeling; learn; deploy; OK, NG, OK, OK, OK, NG, OK]

Private Computation
Homomorphic encryption
• Partial Homomorphic Encryption: allows either addition or multiplication of encrypted numbers
• Somewhat Homomorphic Encryption: allows evaluation of low order polynomials
• Fully Homomorphic Encryption: allows evaluation of arbitrary functions
[Arrows alongside the list: "Faster" and "More general"]

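As an illustration of the "multiplication" case of partial homomorphic encryption, the sketch below uses unpadded textbook RSA, which is multiplicatively homomorphic. This scheme is not mentioned in the slides; it is chosen here only because it is short. The primes, the exponent and the function names are assumptions for the example, and unpadded RSA with keys this small is not secure.

```python
# Textbook (unpadded) RSA: E(a) * E(b) mod n decrypts to a * b mod n.
# Toy parameters, assumed for illustration only: two known Mersenne primes.
p, q = 2147483647, 2305843009213693951   # 2^31 - 1 and 2^61 - 1
n = p * q
e = 65537                                # common public exponent
d = pow(e, -1, (p - 1) * (q - 1))        # private exponent (Python 3.8+)


def rsa_encrypt(m):
    """Encrypt an integer 0 <= m < n with the public key (n, e)."""
    return pow(m, e, n)


def rsa_decrypt(c):
    """Decrypt a ciphertext with the private exponent d."""
    return pow(c, d, n)


# Multiply two ciphertexts without ever decrypting the inputs.
encrypted_product = (rsa_encrypt(6) * rsa_encrypt(7)) % n
product = rsa_decrypt(encrypted_product)
print(product)
```

Addition of ciphertexts has no meaning in this scheme, which is what "either addition or multiplication" refers to; Paillier, covered next, supports the additive case instead.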
Paillier Encryption
Encryption of m: $c = g^{m} r^{n} \bmod n^{2}$
Addition of encrypted numbers: $D\big(E(m_1) \cdot E(m_2) \bmod n^{2}\big) = (m_1 + m_2) \bmod n$
Multiplication of encrypted number by a scalar: $D\big(E(m_1)^{m_2} \bmod n^{2}\big) = m_1 m_2 \bmod n$

Paillier Encryption
Encryption of m: $c = g^{m} r^{n} \bmod n^{2}$
Addition of encrypted numbers: $g^{m_1} \times g^{m_2} = g^{m_1 + m_2}$
Multiplication of encrypted number by a scalar: $\left(g^{m_1}\right)^{m_2} = g^{m_1 m_2}$

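The two Paillier properties can be checked with a minimal pure-Python sketch. It is not the python-paillier or javallier code; the function names, the choice g = n + 1 and the small fixed primes are assumptions made to keep the example short, and keys of this size are far too small for real use.

```python
import math
import secrets


def generate_keypair(p=2147483647, q=2305843009213693951):
    """Toy key generation from two known Mersenne primes (assumed values)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                      # simple standard choice of g
    mu = pow(lam, -1, n)           # inverse of L(g^lam mod n^2) = lam mod n
    return (n, g), (lam, mu)


def encrypt(public_key, m):
    """c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = public_key
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(public_key, private_key, c):
    """m = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) / n."""
    n, _ = public_key
    lam, mu = private_key
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n


def add_encrypted(public_key, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


def multiply_by_scalar(public_key, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    n, _ = public_key
    return pow(c, k, n * n)


public_key, private_key = generate_keypair()
e3 = encrypt(public_key, 3)
e2 = encrypt(public_key, 2)
total = decrypt(public_key, private_key, add_encrypted(public_key, e3, e2))
scaled = decrypt(public_key, private_key, multiply_by_scalar(public_key, e3, 7))
print(total, scaled)
```

The random factor r makes two encryptions of the same number look different, yet it cancels on decryption, so sums and scalar products still come out exactly.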
Paillier Implementations
• Python – open source
  • www.github.com/nicta/python-paillier
• Java – open source
  • www.github.com/nicta/javallier
• Javascript – still under closed development

Distributed, Confidential Analytics
Distributed Computing with a Twist
Data always remains confidential to the source organisation
[Diagram: an N1 Coordinator exchanges messages containing ONLY encrypted data with N1 secure compute nodes at Org 1 and Org 2; each organisation's data stays inside its own confidentiality boundary]

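One way to picture "messages containing ONLY encrypted data" is a joint total computed from encrypted local sums. The sketch below is a simplified assumption, not the N1 protocol: the classes Coordinator and Organisation and the function secure_total are hypothetical, it reuses a toy Paillier scheme with small fixed primes, and it assumes the aggregation step runs somewhere that does not hold the private key.

```python
import math
import secrets


def paillier_encrypt(n, m):
    """Toy Paillier encryption with g = n + 1 (assumed parameter choice)."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


class Coordinator:
    """Hypothetical private key holder; it only decrypts aggregates."""

    def __init__(self, p=2147483647, q=2305843009213693951):
        self.n = p * q                                   # public modulus
        self._lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
        self._mu = pow(self._lam, -1, self.n)

    def decrypt(self, c):
        l_value = (pow(c, self._lam, self.n * self.n) - 1) // self.n
        return (l_value * self._mu) % self.n


class Organisation:
    """Hypothetical data holder; raw values never leave this object."""

    def __init__(self, values):
        self._values = list(values)

    def encrypted_local_sum(self, n):
        return paillier_encrypt(n, sum(self._values))


def secure_total(coordinator, organisations):
    """Combine encrypted local sums, then decrypt only the combined value."""
    n = coordinator.n
    aggregate = paillier_encrypt(n, 0)
    for org in organisations:
        aggregate = (aggregate * org.encrypted_local_sum(n)) % (n * n)
    return coordinator.decrypt(aggregate)


org_1 = Organisation([1, 2, 3])
org_2 = Organisation([10, 20])
joint_total = secure_total(Coordinator(), [org_1, org_2])
print(joint_total)
```

In this sketch the coordinator learns the joint total but never sees either organisation's individual contribution in the clear.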
Graph Computation Engine
[Diagram: a Coordinator and Workers made up of CE nodes (AKKA actors) that exchange messages (M, JSON messages) and hold data frames (DF); further labels: Domains, Properties]

N1 Analytics Platform
• Analytics: Machine Learning, Statistics, Regression, Clustering
• Learn, Evaluate, Deploy
• Privacy Technologies: Irreversible aggregation, Partial homomorphic encryption, Private Record Linkage
• Distributed Graph Computation Engine
• Data, Network, Auth

Logistic Regression
Maximise over θ the log likelihood (equivalently, minimise its negative):
$L(\theta) = \sum_{i=0}^{n} \left[ y_i \log p(x_i; \theta) + (1 - y_i) \log\big(1 - p(x_i; \theta)\big) \right]$
Evaluate the logistic function:
$p(x; \theta) = \frac{1}{1 + e^{-\theta \cdot x}}$
Requires “secure log” and “secure inverse” protocol using Paillier encryption

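For reference, the same objective can be fitted on unencrypted data with a few lines of plain Python. This is only a plaintext sketch of the log likelihood and logistic function above, using simple gradient ascent; it does not implement the "secure log" or "secure inverse" protocols, and the data set, learning rate and step count are made-up assumptions.

```python
import math


def logistic(theta, x):
    """p(x; theta) = 1 / (1 + exp(-theta . x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, xs, ys):
    """Sum of y*log(p) + (1 - y)*log(1 - p) over all records."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(theta, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(xs, ys, learning_rate=0.1, steps=2000):
    """Gradient ascent on the log likelihood; gradient is sum((y - p) * x)."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        gradient = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            error = y - logistic(theta, x)
            for j, xj in enumerate(x):
                gradient[j] += error * xj
        theta = [t + learning_rate * g for t, g in zip(theta, gradient)]
    return theta


# Made-up example: first feature is a constant 1.0 acting as the intercept.
xs = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
ys = [0, 0, 1, 0, 1, 1]
theta = fit(xs, ys)
print(theta, log_likelihood(theta, xs, ys))
```

The calls to `math.log` and the division inside `logistic` are the two operations that additive homomorphic encryption cannot evaluate directly, which is why the private version needs dedicated protocols for them.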
Example Paillier Logistic Regression
[Diagram: a Coordinator (private key holder) with Logistic Learner, Secure Inverse and Secure Log; N1Analytics workers running Gradient Descent at Org A (features & labels) and Org B (features); legend: M = JSON message, CE = AKKA actors, DF = data frames]

Performance
• Learning
  • Learnt models have the same accuracy as unencrypted calculations
  • “Private learning” is (1000x) slower due to encrypted computations. Learning times are several hours.
• Deployment
  • A score can be generated in real time (<50ms)
  • Customer data that contributes to the score remains private.

Scaling
[Figure: Learning time scaling. Learning time in minutes (axis ticks from 5 to 500) against number of cores (0 to 400), with one series each for 10,000x10, 100,000x10 and 1,000,000x10 features.]
[Diagram: a Coordinator with Data Provider 1 and Data Provider 2, each running a pool of Workers]

Confidential Record Linkage
Record Linkage Challenge
[Diagram: records in Dataset A and Dataset B joined by uncertain ("?") links]
Victoria Mckon   7/06/1921   F
Tori Mckone      7/06/1921   F
Tori Mackon      6/07/1921   F

Solution (3): Private Record Linkage
[Diagram: one way hash functions map the names in each dataset to hashed values, which are compared by fuzzy matching]
Name         Hash       Hash       Name
Jane Doe     a8bf342    672bef4    Kat Clark
Paul Doe     f72630b    14ce54     Jim Clark
Jim Clark    14ce54     a8bf242    Janet Doe
Kate Clark   a72bef4    7830530    Shan Bo
Shan Bo      7830530    b3894f3    Bob Doe
Reg Pal      4bf6021    80ac364    Joe Smith

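The slides do not say which hashing scheme makes fuzzy matching possible, since an ordinary hash of a whole name changes completely when one letter changes. One commonly used approach, assumed here purely for illustration, is to hash the character bigrams of each name with a shared secret key into a set of bit positions (a Bloom filter style encoding) and to compare the encodings with the Dice coefficient. The function names, the key and the parameters below are hypothetical.

```python
import hashlib
import hmac


def bigrams(name):
    """Character bigrams of a padded, lower-cased name."""
    padded = f" {name.lower().strip()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(name, key, size=1024, num_hashes=4):
    """Map each bigram to num_hashes bit positions using a keyed hash."""
    positions = set()
    for gram in bigrams(name):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(key, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of bit positions (1.0 means identical)."""
    return 2 * len(a & b) / (len(a) + len(b))


# Both parties must agree on the secret key in advance (assumed value here).
shared_key = b"example-shared-secret"

kate = encode("Kate Clark", shared_key)
kat = encode("Kat Clark", shared_key)
shan = encode("Shan Bo", shared_key)

similar_score = dice(kate, kat)
different_score = dice(kate, shan)
print(similar_score, different_score)
```

Near-identical names such as "Kate Clark" and "Kat Clark" share most bigrams and therefore most bit positions, so they score high, while unrelated names score close to zero; a threshold on the score then decides which pairs count as matches.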