Enhancing Privacy in Machine Learning
Mathias Humbert, INSA Toulouse/CNRS
Toulouse, January 22, 2019
Enhancing Privacy in Machine Learning
What ML? What data? What threat?
Different Attacks: Linkability
Ability to link at least two records concerning the same individual. If one dataset is not anonymized → re-identification.
Different Attacks: Membership Inference
Ability to infer that a certain target is in a specific dataset, e.g., a study focusing on HIV patients.
Trading Off Privacy
Privacy vs. utility vs. ML efficiency. What ML? What data? What threat? What defense?
Different Defense Mechanisms
Anonymization, randomization, differential privacy, cryptography: each trades off privacy, utility, and ML efficiency.
Outline of the Talk: attack - defense - data
• Temporal linkability - randomization - microRNA expression ($\mathbb{R}^r$, $r \approx 10^3$) [USENIX Security'16]
• Re-identification - cryptography - DNA methylation ($[0,1]^m$, $m \approx 10^7$) [IEEE S&P'17]
• Membership inference - other defense - any data [NDSS'19]
Outline of the Talk: attack - defense - data
• Temporal linkability - randomization - microRNA expression [USENIX Security'16]
• Re-identification - cryptography - DNA methylation [IEEE S&P'17]
• Membership inference - other defense - any data [NDSS'19]
DNA versus MicroRNA

DNA:
• contains the blueprint of what a cell potentially can do
• is (mostly) fixed over time
• can hint at risks of getting a disease

miRNA:
• regulates what a cell really does
• expression changes over time
• can tell whether you carry a disease

Common belief: no privacy threats from miRNAs, because of temporal variability.
Temporal Linkability Attack
• Matching two datasets, e.g., a leaked database (incl. names) and a public DB (excl. names)
• Which sample from t_1 corresponds to which sample from t_2?
Data Pre-processing
• High dimensionality: 1,189 miRNAs per sample $r^{t_j}_k$, with possibly correlated and uninteresting components
• PCA condenses the data into a smaller set of dimensions with minimal information loss
• PCA + whitening provides smaller dimensionality, uncorrelated components, and unit variance, yielding the whitened profiles $\bar{r}^{t_j}_k$ (see the sketch below)
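A minimal sketch of this pre-processing step, assuming scikit-learn; the sample count, number of retained components, and random data are placeholders, not values from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
r = rng.random((55, 1189))  # hypothetical: 55 samples x 1,189 miRNA expressions

# PCA with whitening: project onto a few uncorrelated principal components,
# each rescaled to unit variance, with minimal information loss.
pca = PCA(n_components=10, whiten=True)
r_bar = pca.fit_transform(r)  # whitened, reduced profiles
print(r_bar.shape)  # (55, 10)
```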
Linkability Attack
Which sample from t_1 corresponds to which sample from t_2? Given the whitened profiles $\{\bar{r}^{t_1}_i\}_{i=1}^{n}$ and $\{\bar{r}^{t_2}_i\}_{i=1}^{n}$, the attacker finds the permutation

$\sigma^* = \arg\min_{\sigma} \sum_{i=1}^{n} \left\| \bar{r}^{t_2}_{\sigma(i)} - \bar{r}^{t_1}_{i} \right\|_2$

Time complexity: O(n^3) (solved below as a bipartite matching)
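One standard way to solve this assignment problem in the O(n^3) time stated on the slide is a Hungarian-style minimum-weight bipartite matching. A minimal sketch using SciPy; the function name and input shapes are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_profiles(r_bar_t1, r_bar_t2):
    """Link whitened profiles across time points t1 and t2.

    r_bar_t1, r_bar_t2: (n, k) arrays of PCA-whitened miRNA profiles.
    Returns sigma: sample i at t1 is matched to sample sigma[i] at t2.
    """
    # Pairwise Euclidean distances: cost[i, j] = ||r_bar_t2[j] - r_bar_t1[i]||_2
    cost = np.linalg.norm(r_bar_t1[:, None, :] - r_bar_t2[None, :, :], axis=2)
    # Minimum-weight bipartite matching (Hungarian-style), O(n^3)
    _, sigma = linear_sum_assignment(cost)
    return sigma
```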
Athletes Dataset
• Participants: 29
• Points in time: 2 (before and after exercising)
• Time period: 1 week
• Disease: none
• 1,189 miRNAs per sample, taken from blood and plasma
Lung Cancer Dataset
• Participants: 26 (huge for a longitudinal study!)
• Points in time: 8 (one before surgery, then at 0, 3, 6, 9, 12, 15, and 18 months after surgery)
• Time period: 18 months
• Disease: lung cancer
• 1,189 miRNAs per sample, taken from plasma
Linkability Attack – Results
[Figures: attack success rate vs. number of PCA dimensions, for blood- and plasma-based samples; annotated values: 29%, 48%, 55%, 90%]
Success up to 90% for blood-based samples.
Linkability Attack – Results
How does the success change with larger datasets? Success decreases sharply for plasma-based samples, but decreases linearly for blood-based samples.
Outline of the Talk: attack - defense - data
• Temporal linkability - randomization - microRNA expression [USENIX Security'16]
• Re-identification - cryptography - DNA methylation [IEEE S&P'17]
• Membership inference - other defense - any data [NDSS'19]
Defense Mechanisms
• Hiding non-relevant miRNA expressions
  • Sometimes, randomization is not an option, e.g., for making a diagnosis in a hospital
  • Caution: correlations between miRNAs
• Randomizing the miRNA expression profiles
  • E.g., for publishing a dataset used in a study
  • Adding noise in a fully distributed, differentially-private manner → providing epigeno-indistinguishability (inspired by [1])
  • Noise drawn according to a multivariate Laplacian mechanism (see the sketch below)
[1] Chatzikokolakis et al. Broadening the scope of differential privacy using metrics, PETS, 2013
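A minimal sketch of one way to sample noise with density proportional to exp(-ε‖z‖₂), the multivariate Laplacian used in metric differential privacy [1]; the profile data, the ε value, and the function name are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def multivariate_laplace_noise(d, eps):
    # Density proportional to exp(-eps * ||z||_2): direction drawn uniformly
    # on the unit d-sphere, radius drawn from Gamma(shape=d, scale=1/eps).
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=d, scale=1.0 / eps)
    return radius * direction

profile = rng.random(1189)  # hypothetical miRNA expression profile
noisy = profile + multivariate_laplace_noise(profile.size, eps=0.025)
```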
Privacy-Utility Trade-Off
• Privacy: prevent linkability of samples
• Utility: preserve accuracy of classification as diseased/healthy, usually using a radial SVM classifier (see the sketch below)
• Another dataset for exploring utility: 1,000+ participants, 19 diseases, 1 time point
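A minimal sketch of how this utility evaluation could look, assuming scikit-learn; the synthetic data stands in for the (possibly sanitized) miRNA profiles and diseased/healthy labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical stand-in for miRNA profiles (X) and diseased/healthy labels (y)
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

clf = SVC(kernel="rbf")  # radial (RBF) kernel SVM, as on the slide
# Utility = cross-validated classification accuracy
print(cross_val_score(clf, X, y, cv=5).mean())
```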
Hiding miRNAs – Results
[Figures: attacker's success rate and SVM classification accuracy vs. number of disclosed miRNAs; annotated values: <80%, <100 miRNAs, maximum accuracy 99.2%]
Trade-off at 7 miRNAs: attack success decreased (relative to disclosing all miRNAs) by 54%, while SVM accuracy decreased (relative to the 99.2% maximum) by only 1%.
Hiding miRNAs – Results
[Figures: attacker's success rate and SVM accuracy for a second classification setting; maximum accuracy 92.7%]
Trade-off at 4 miRNAs: attack success decreases (relative to disclosing all miRNAs) by 80%, while accuracy decreases (relative to the 92.7% maximum) by only 1%.
Probabilistic Sanitization – Results
[Figures: attacker's success rate and SVM accuracy vs. privacy parameter ε; maximum accuracy 99.2%]
Suitable balance at ε = 0.025: attack success decreased (relative to the unsanitized data) by 63%, while SVM accuracy decreased (relative to the 99.2% maximum) by only 0.65%.
Probabilistic Sanitization – Results
[Figures: attacker's success rate and SVM accuracy for a second classification setting; maximum accuracy 96.9%]
Trade-off at ε = 0.01: attack success decreases (relative to the unsanitized data) by 70%, while accuracy decreases (relative to the 96.9% maximum) by only 0.2%.
Outline of the Talk: attack - defense - data
• Temporal linkability - randomization - microRNA expression [USENIX Security'16]
• Re-identification - cryptography - DNA methylation [IEEE S&P'17]
• Membership inference - other defense - any data [NDSS'19]