CS573 Data Privacy and Security – Midterm Review
Li Xiong
Department of Mathematics and Computer Science, Emory University
Principles of Data Security – CIA Triad
• Confidentiality – Prevent the disclosure of information to unauthorized users
• Integrity – Prevent improper modification
• Availability – Make data available to legitimate users
Privacy vs. Confidentiality
• Confidentiality – Prevent disclosure of information to unauthorized users
• Privacy
  – Prevent disclosure of personal information to unauthorized users
  – Control how personal information is collected and used
  – Prevent identification of individuals
Data Privacy and Security Measures
• Access control – Restrict access to the data (or a subset or view of it) to authorized users
• Cryptography – Use encryption to encode information so it can only be read by authorized users (protected in transit and in storage)
• Inference control – Restrict inference from accessible data to sensitive (non-accessible) data
Inference Control
• Inference control: Prevent inference from accessible information to individual information (not accessible)
• Technologies
  – De-identification and anonymization (input perturbation)
  – Differential privacy (output perturbation)
Traditional De-identification and Anonymization
• Attribute suppression, encoding, perturbation, generalization
• Subject to re-identification and disclosure attacks
[Diagram: Original Data → De-identification / Anonymization → Sanitized Records]
Statistical Data Sharing with Differential Privacy
• Macro data (versus micro data)
• Output perturbation (versus input perturbation)
• More rigorous guarantee
[Diagram: Original Data → Differentially Private Data Sharing → Statistics / Models / Synthetic Records]
Cryptography
• Encoding data in a way that only authorized users can read it
[Diagram: Original Data → Encryption → Encrypted Data]
Applications of Cryptography
• Secure data outsourcing
  – Support computation and queries on encrypted data
[Diagram: Computation / Queries over Encrypted Data]
Applications of Cryptography
• Multi-party secure computation (secure function evaluation)
  – Securely compute a function without revealing private inputs
[Diagram: Parties with private inputs x1, x2, ..., xn jointly compute f(x1, x2, ..., xn)]
Applications of Cryptography
• Private information retrieval (access privacy)
  – Retrieve data without revealing the query (access pattern)
Course Topics
• Inference control
  – De-identification and anonymization
  – Differential privacy foundations
  – Differential privacy applications
    • Histograms
    • Data mining
    • Local differential privacy
    • Location privacy
• Cryptography
• Access control
• Applications
k-Anonymity
• Quasi-identifiers (QID) = race, zipcode
• Sensitive attribute = diagnosis
• k-anonymity: the size of each QID group is at least k

Original:                     Anonymized:
Caucas  78712  Flu            Caucas       787XX  Flu
Asian   78705  Shingles      Asian/AfrAm  78705  Shingles
Caucas  78754  Flu            Caucas       787XX  Flu
Asian   78705  Acne           Asian/AfrAm  78705  Acne
AfrAm   78705  Acne           Asian/AfrAm  78705  Acne
Caucas  78705  Flu            Caucas       787XX  Flu
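To make the definition concrete, the following is a minimal sketch (not from the course materials; the table and column indices are illustrative) that checks k-anonymity by counting the size of each QID group:

```python
from collections import Counter

def is_k_anonymous(records, qid_indices, k):
    """Check k-anonymity: every combination of quasi-identifier
    values must appear in at least k records."""
    groups = Counter(tuple(r[i] for i in qid_indices) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical anonymized table: (race, zipcode, diagnosis)
table = [
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78705", "Shingles"),
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78705", "Acne"),
    ("Asian/AfrAm", "78705", "Acne"),
    ("Caucas", "787XX", "Flu"),
]
print(is_k_anonymous(table, qid_indices=(0, 1), k=3))  # True: both QID groups have 3 records
```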
Problem of k-Anonymity
• Example: the attacker knows Rusty Shackleford is Caucasian and lives in zipcode 78705, so his record falls in the Caucas/787XX group – and every record in that group has diagnosis Flu
• Problem: sensitive attributes are not "diverse" within each quasi-identifier group (homogeneity attack)
l-Diversity [Machanavajjhala et al. ICDE '06]
• Entropy of the sensitive attribute within each quasi-identifier group must be at least log(l)

Caucas       787XX  Flu
Caucas       787XX  Shingles
Caucas       787XX  Acne
Caucas       787XX  Flu
Caucas       787XX  Acne
Caucas       787XX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Shingles
Asian/AfrAm  78XXX  Acne
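A similar sketch for entropy l-diversity (again illustrative, not from the course materials), testing each group's sensitive-attribute entropy against log(l):

```python
import math
from collections import Counter, defaultdict

def is_entropy_l_diverse(records, qid_indices, sens_index, l):
    """Entropy l-diversity: in every QID group, the entropy of the
    sensitive-attribute distribution must be at least log(l)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[i] for i in qid_indices)].append(r[sens_index])
    for values in groups.values():
        total = len(values)
        entropy = -sum((c / total) * math.log(c / total)
                       for c in Counter(values).values())
        if entropy < math.log(l):
            return False
    return True
```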
Problem with l-Diversity
• Original dataset: 99% of the records are HIV-, 1% are HIV+
• Anonymization A: each quasi-identifier group is 50% HIV- / 50% HIV+
  – The groups are "diverse", yet this leaks a ton of information: the attacker's belief about the target jumps from 1% to 50% HIV+
• Anonymization B: each quasi-identifier group is 99% HIV-
  – The groups are not "diverse", yet the anonymized database does not leak anything beyond the overall distribution
t-Closeness [Li et al. ICDE '07]
• Distribution of sensitive attributes within each quasi-identifier group should be "close" to their distribution in the entire original database

Caucas       787XX  Flu
Caucas       787XX  Shingles
Caucas       787XX  Acne
Caucas       787XX  Flu
Caucas       787XX  Acne
Caucas       787XX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Shingles
Asian/AfrAm  78XXX  Acne
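The t-closeness paper measures "closeness" with Earth Mover's Distance; as a simplified illustration only, the sketch below substitutes total variation distance between each group's distribution and the global one:

```python
from collections import Counter, defaultdict

def is_t_close_tv(records, qid_indices, sens_index, t):
    """Simplified t-closeness check using total variation distance
    (the original paper uses Earth Mover's Distance)."""
    n = len(records)
    global_dist = {v: c / n
                   for v, c in Counter(r[sens_index] for r in records).items()}

    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[i] for i in qid_indices)].append(r[sens_index])

    for values in groups.values():
        counts = Counter(values)
        m = len(values)
        # Distance between this group's distribution and the global one
        tv = 0.5 * sum(abs(counts.get(v, 0) / m - p) for v, p in global_dist.items())
        if tv > t:
            return False
    return True
```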
Problems with Syntactic Privacy Notions
• Syntactic
  – Focuses on data transformation, not on what can be learned from the anonymized dataset
• "Quasi-identifier" fallacy
  – Assumes a priori that the attacker will not know certain information about his target
  – The attacker may know the records in the database or external information
Course Topics
• Inference control
  – De-identification and anonymization
  – Differential privacy foundations
  – Differential privacy applications
    • Histograms
    • Data mining
    • Location privacy
• Cryptography
• Access control
• Applications
Differential Privacy
• The statistical outcome is indistinguishable regardless of whether a particular user (record) is included in the data
Statistical Data Release: Disclosure Risk
[Figure: original records and the original histogram computed from them]
Statistical Data Release: Differential Privacy
[Figure: original records, the original histogram, and the perturbed histogram released with differential privacy]
Differential Privacy
• D and D' are neighboring databases if they differ in one record
• A privacy mechanism A gives ε-differential privacy if for all neighboring databases D, D', and for any possible output S ∈ Range(A):
  Pr[A(D) = S] ≤ exp(ε) × Pr[A(D') = S]
Laplace Mechanism
• Add Laplace noise to the true output: A(D) = f(D) + Laplace(Δf/ε)
• Global sensitivity: Δf = max over neighboring D, D' of |f(D) - f(D')|
Example: Laplace Mechanism
• For a single counting query Q over a dataset D, returning Q(D) + Laplace(1/ε) gives ε-differential privacy (a counting query has sensitivity 1)
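A minimal sketch of the Laplace mechanism in Python (the helper name and example query are assumptions, not from the course materials):

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return the true answer plus Laplace noise of scale sensitivity/epsilon."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Counting query: how many records have age > 40? Adding or removing one
# record changes the count by at most 1, so the sensitivity is 1.
ages = np.array([42, 31, 28, 55, 47])
true_count = int(np.sum(ages > 40))
print(true_count, laplace_mechanism(true_count, sensitivity=1, epsilon=0.1))
```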
Exponential Mechanism
• Sample an output r with probability weighted by a utility score function u(D, r)
[Diagram: inputs mapped to sampled outputs according to u(D, r)]
Exponential Mechanism
• For a database D, output space R, and a utility score function u: D × R → ℝ, the algorithm A with
  Pr[A(D) = r] ∝ exp(ε × u(D, r) / (2Δu))
  satisfies ε-differential privacy, where Δu is the sensitivity of the utility score function:
  Δu = max over r and neighboring D, D' of |u(D, r) - u(D', r)|
Example: Exponential Mechanism
• Scoring/utility function u: Inputs × Outputs → ℝ
• D: nationalities of a set of people
• f(D): the most frequent nationality in D
• u(D, O) = #(D, O), the number of people in D with nationality O
(Source: Module 2 Tutorial, "Differential Privacy in the Wild")
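A minimal sketch of the exponential mechanism for this example (the candidate set and helper names are assumptions):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng()

def exponential_mechanism(data, candidates, utility, sensitivity, epsilon):
    """Sample candidate r with probability proportional to
    exp(epsilon * u(D, r) / (2 * sensitivity))."""
    scores = np.array([utility(data, r) for r in candidates], dtype=float)
    # Shift by the max score for numerical stability (does not change probabilities)
    weights = np.exp(epsilon * (scores - scores.max()) / (2 * sensitivity))
    return rng.choice(candidates, p=weights / weights.sum())

# D: nationalities; u(D, O) = number of people in D with nationality O.
# The count changes by at most 1 on neighboring databases, so Δu = 1.
nationalities = ["US", "US", "UK", "India", "US", "India"]
candidates = sorted(set(nationalities))
u = lambda data, r: Counter(data)[r]
print(exponential_mechanism(nationalities, candidates, u, sensitivity=1, epsilon=1.0))
```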
Composition Theorems
• Sequential composition: running mechanisms with budgets ε1, ..., εk on the same data gives (∑ εi)-differential privacy
• Parallel composition: running them on disjoint subsets of the data gives max(εi)-differential privacy
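An illustrative sketch of both rules (the budget split, age bounds, and bin choices are assumptions):

```python
import numpy as np

rng = np.random.default_rng()

def lap(true_answer, sensitivity, epsilon):
    """Laplace mechanism, as in the sketch above."""
    return true_answer + rng.laplace(scale=sensitivity / epsilon)

ages = np.array([42, 31, 28, 55, 47])
eps_total = 1.0

# Sequential composition: two queries over the SAME records must split
# the budget; the total privacy cost is eps_total/2 + eps_total/2 = eps_total.
noisy_count = lap(int(np.sum(ages > 40)), sensitivity=1, epsilon=eps_total / 2)
# Assuming ages are clipped to [0, 100], the sum query has sensitivity 100.
noisy_sum = lap(float(ages.sum()), sensitivity=100, epsilon=eps_total / 2)

# Parallel composition: one counting query per DISJOINT age range costs
# max(eps_i) = eps_total overall, since each record falls in exactly one range.
noisy_bins = [lap(int(np.sum((ages >= lo) & (ages < hi))), sensitivity=1, epsilon=eps_total)
              for lo, hi in [(0, 30), (30, 50), (50, 120)]]
```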
Differential Privacy
• Differential privacy ensures an attacker cannot infer the presence or absence of a single record in the input based on any output
• Building blocks: the Laplace and exponential mechanisms
• Composition rules help build complex algorithms from the building blocks
Course Topics
• Inference control
  – De-identification and anonymization
  – Differential privacy foundations
  – Differential privacy applications
    • Histograms
    • Data mining
    • Location privacy
• Cryptography
• Access control
• Applications
Baseline: Laplace Mechanism
• For the counting query Q on each histogram bin, returning Q(D) + Laplace(1/ε) gives ε-differential privacy (by parallel composition, since the bins partition the data)
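A minimal sketch of this baseline for a one-dimensional histogram (bin edges and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def dp_histogram(values, bin_edges, epsilon):
    """Per-bin counting queries with Laplace(1/epsilon) noise. The bins
    partition the data, so by parallel composition the full histogram
    satisfies epsilon-differential privacy."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

ages = np.array([42, 31, 28, 55, 47, 63, 22, 39])  # hypothetical data
print(dp_histogram(ages, bin_edges=[0, 30, 50, 120], epsilon=0.5))
```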
DPCube [SecureDM 2010, ICDE 2012 demo]
• Multi-dimensional histogram partitioning with differential privacy, exposed through a DP interface
  – Compute unit histogram counts with ε/2-differential privacy
  – Use the DP unit histogram for partitioning
  – Compute V-optimal histogram counts for the partitions with ε/2-differential privacy
[Diagram: Original Records (Name, Age, Income, HIV+) → ε/2-DP Unit Histogram → partitioning → ε/2-DP V-optimal Histogram]
Private Spatial Decompositions [CPSSY 12]
• Examples: quadtree, kd-tree
• Need to ensure both the partitioning boundaries and the counts of each partition are differentially private
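As an illustration, the sketch below builds a quadtree with midpoint splits (so the boundaries are data-independent) and releases Laplace-noised counts per node; the even budget split across levels is a simplifying assumption, and data-dependent structures such as kd-trees would additionally need private split points (e.g., via the exponential mechanism):

```python
import numpy as np

rng = np.random.default_rng()

def private_quadtree(points, box, eps_per_level, levels):
    """Noisy count for this node, then recurse into the four quadrants.
    Within one level the quadrants are disjoint, so by parallel composition
    a level costs eps_per_level; sequential composition over the levels
    gives (levels * eps_per_level)-differential privacy in total."""
    x0, y0, x1, y1 = box
    node = {"box": box, "count": len(points) + rng.laplace(scale=1.0 / eps_per_level)}
    if levels > 1:
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        node["children"] = []
        for cx0, cy0, cx1, cy1 in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                                   (x0, ym, xm, y1), (xm, ym, x1, y1)]:
            inside = points[(points[:, 0] >= cx0) & (points[:, 0] < cx1) &
                            (points[:, 1] >= cy0) & (points[:, 1] < cy1)]
            node["children"].append(
                private_quadtree(inside, (cx0, cy0, cx1, cy1), eps_per_level, levels - 1))
    return node

pts = rng.uniform(0, 1, size=(100, 2))  # hypothetical 2D location data in the unit square
tree = private_quadtree(pts, box=(0.0, 0.0, 1.0, 1.0), eps_per_level=0.25, levels=3)
print(tree["count"])
```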
Histogram Methods vs. Parametric Methods
• Non-parametric methods: learn the empirical distribution through histograms, then perturb it to produce synthetic data (e.g., PSD, Privelet, FP, P-HP)
  – Only work well for low-dimensional data
• Parametric methods: fit the data to a distribution, make inferences about the parameters (e.g., PrivacyOnTheMap)
  – The joint distribution is difficult to model
[Diagram: Original data → Histogram → Perturbation → Synthetic data]