CS378 Introduction to Data Mining Privacy Preserving Data Mining Li Xiong Department of Mathematics and Computer Science Department of Biomedical Informatics Emory University
Netflix Sequel • 2006, Netflix announced the challenge • 2007, researchers from University of Texas identified individuals by matching Netflix datasets with IMDB • July 2009, $1M grand prize awarded • August 2009, Netflix announced the second challenge • December 2009, four Netflix users filed a class action lawsuit against Netflix • March 2010, Netflix canceled the second challenge
Facebook-Cambridge Analytica • April 2010, Facebook launches Open Graph • 2013, 300,000 users took the psychographic personality test app “thisisyourdigitallife” • 2016, Trump’s campaign invests heavily in Facebook ads • March 2018, reports revealed that 50 million (later revised to 87 million) Facebook profiles were harvested for Cambridge Analytica and used for Trump’s campaign • April 11, 2018, Zuckerberg testified before Congress
• How many people know we are here? (a) no one (b) 1-10, e.g., family and friends (c) 10-100, e.g., colleagues and more (social network) friends
• 73% / 33% of Android apps shared personal info (e.g., email) / GPS coordinates with third parties • 45% / 47% of iOS apps shared email / GPS coordinates with third parties • Figure: location data sharing by iOS apps (left) to domains (right) • Source: Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps, 2015-10-30, https://techscience.org/a/2015103001/
The EHR Data Map
Shopping records
Big Data Goes Personal • Movie ratings • Social network/media data • Mobile GPS data • Electronic medical records • Shopping history • Online browsing history
Data Mining
Data Mining … the dark side
Privacy Preserving Data Mining • Pipeline: Private Data → Privacy Preserving Data Mining → Sanitized Data/Models • Privacy goal: personal data is not revealed and cannot be inferred • Utility goal: data/models as close to the private data as possible
Privacy preserving data mining • Differential privacy • Definition • Building blocks (primitive mechanisms) • Composition rules • Data mining algorithms with differential privacy • k-means clustering w/ differential privacy • Frequent pattern mining w/ differential privacy
Differential Privacy
Traditional De-identification and Anonymization • Attribute suppression, perturbation, generalization • Inference possible with external data • Pipeline: Original Data → De-identification / anonymization → Sanitized View
Massachusetts GIC Incident (1990s) • Massachusetts Group Insurance Commission (GIC) encounter data (“de-identified”), mid 1990s • External information: voter roll from the city of Cambridge • Governor’s health records identified • 87% of Americans can be uniquely identified using zip, birthdate, and sex (2000)
Name     SSN        Birth date  Zip    Diagnosis
Alice    123456789  44          48202  AIDS
Bob      323232323  44          48202  AIDS
Charley  232345656  44          48201  Asthma
Dave     333333333  55          48310  Asthma
Eva      666666666  55          48310  Diabetes
AOL Query Log Release (2006) • 20 million Web search queries released by AOL
AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
• User 4417749 searched for: • “numb fingers” • “60 single men” • “dog that urinates on everything” • “landscapers in Lilburn, Ga” • Several people’s names with the last name Arnold • “homes sold in shadow lake subdivision gwinnett county georgia”
The Genome Hacker (2013)
Differential Privacy • The statistical outcome (view) is indistinguishable regardless of whether a particular user is included in the data
Differential Privacy • View is indistinguishable regardless of the input • Private Data $D$ or Private Data $D'$ → Privacy preserving data mining/sharing mechanism → Models/Data
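For reference, the indistinguishability pictured above is usually stated formally as follows. This is the standard definition from Dwork et al. rather than text recovered from the slide; the symbols $A$ (the mechanism), $S$ (any set of possible outputs), and the reading of "neighboring" as "differing in one record" are assumptions of this sketch.

```latex
% A randomized mechanism A satisfies \varepsilon-differential privacy if,
% for all neighboring datasets D, D' and every set S of outputs:
\Pr[A(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[A(D') \in S]
```

A smaller $\varepsilon$ forces the two output distributions closer together, which means stronger privacy and, typically, more noise.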
Differential privacy: an example • Figure: original records → original histogram → perturbed histogram with differential privacy
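Laplace Mechanism • Query $q$ on Private Data $D$; true answer $q(D)$ • Released answer: $q(D) + \eta$ • $\eta$ is drawn from the Laplace distribution $\mathrm{Lap}(S/\varepsilon)$ • Figure: density of the Laplace distribution, plotted over $[-10, 10]$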
Laplace Distribution • PDF: $f(x \mid u, b) = \frac{1}{2b} \exp\left(-\frac{|x - u|}{b}\right)$ • Denoted as $\mathrm{Lap}(b)$ when $u = 0$ • Mean $u$ • Variance $2b^2$
How much noise for privacy? [Dwork et al., TCC 2006] • Sensitivity: consider a query $q: I \to \mathbb{R}$. $S(q)$ is the smallest number such that for any neighboring tables $D$, $D'$: $|q(D) - q(D')| \le S(q)$ • Theorem: if the sensitivity of the query is $S(q)$, then the algorithm $A(D) = q(D) + \mathrm{Lap}(S(q)/\varepsilon)$ guarantees $\varepsilon$-differential privacy
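The theorem above can be sketched in a few lines of Python. This is a minimal illustration, not a hardened implementation: the function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the sampler relies on the fact that the difference of two independent exponentials with mean b follows Lap(b).

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale): mean 0, variance 2 * scale**2."""
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return q(D) + eta, where eta ~ Lap(S(q) / epsilon)."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
noisy_answer = laplace_mechanism(true_answer=3, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Floating-point Laplace sampling has known subtleties in production settings, so a vetted differential privacy library is the safer choice outside a classroom example.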
Example: COUNT query • Number of people having HIV+ • Sensitivity = ?
Example: COUNT query • Number of people having HIV+ • Sensitivity = 1 • $\varepsilon$-differentially private count: $3 + \eta$, where $\eta$ is drawn from $\mathrm{Lap}(1/\varepsilon)$ • Mean = 0 • Variance = $2/\varepsilon^2$
Example: Sum (Average) query • Sum of Age (suppose Age is in [a,b]) • Sensitivity = ?
Example: Sum (Average) query • Sum of Age (suppose Age is in [a,b]) • Sensitivity = b (adding or removing one person changes the sum by at most b, assuming 0 ≤ a ≤ b)
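The COUNT and SUM examples can be sketched as follows. The helper names, the toy records, and the choice to clip values into [0, upper] are assumptions made for illustration; the sensitivities (1 for COUNT, the upper bound b for SUM) are the ones given on the slides.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """COUNT has sensitivity 1: one person changes the count by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def private_sum(values, upper, epsilon, rng):
    """SUM over values in [0, upper] has sensitivity `upper`."""
    # Clip so the assumed range actually holds for every record.
    clipped = [min(max(v, 0.0), upper) for v in values]
    return sum(clipped) + laplace_noise(upper / epsilon, rng)

rng = random.Random(0)
# Hypothetical records: (age, HIV-positive flag).
people = [(44, True), (44, True), (55, False), (61, True), (30, False)]
noisy_count = private_count(people, lambda p: p[1], epsilon=1.0, rng=rng)
noisy_age_sum = private_sum([age for age, _ in people], upper=100.0, epsilon=1.0, rng=rng)
```

Note how the SUM query needs 100 times more noise than the COUNT query at the same epsilon, because its sensitivity is 100 times larger.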
Composition theorems • Sequential composition: $\left(\sum_i \varepsilon_i\right)$-differential privacy • Parallel composition: $\max_i(\varepsilon_i)$-differential privacy
Sequential Composition • If $M_1, M_2, \ldots, M_k$ are algorithms that access a private database $D$ such that each $M_i$ satisfies $\varepsilon_i$-differential privacy, then the combination of their outputs satisfies $\varepsilon$-differential privacy with $\varepsilon = \varepsilon_1 + \cdots + \varepsilon_k$
Parallel Composition • If $M_1, M_2, \ldots, M_k$ are algorithms that access disjoint databases $D_1, D_2, \ldots, D_k$ such that each $M_i$ satisfies $\varepsilon_i$-differential privacy, then the combination of their outputs satisfies $\varepsilon$-differential privacy with $\varepsilon = \max\{\varepsilon_1, \ldots, \varepsilon_k\}$
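A small sketch of how the two composition rules are used for budget accounting. All function names are hypothetical. The histogram is a typical parallel-composition case: each record falls into exactly one bin, so the bins act as disjoint databases.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sequential_epsilon(epsilons):
    """Mechanisms run on the same database: the epsilons add up."""
    return sum(epsilons)

def parallel_epsilon(epsilons):
    """Mechanisms run on disjoint databases: only the largest epsilon counts."""
    return max(epsilons)

def noisy_histogram(values, bins, epsilon, rng):
    """Release one noisy count per bin, each with budget `epsilon`.

    Assumes every value is one of `bins`. Each bin count has sensitivity 1
    and the bins are disjoint, so the whole histogram costs `epsilon`.
    """
    counts = {b: 0 for b in bins}
    for v in values:
        counts[v] += 1
    return {b: c + laplace_noise(1.0 / epsilon, rng) for b, c in counts.items()}

rng = random.Random(0)
diagnoses = ["AIDS", "AIDS", "Asthma", "Asthma", "Diabetes"]  # toy data
hist = noisy_histogram(diagnoses, ["AIDS", "Asthma", "Diabetes"], epsilon=1.0, rng=rng)
total_parallel = parallel_epsilon([1.0, 1.0, 1.0])       # cost of the histogram
total_sequential = sequential_epsilon([0.5, 0.3, 0.2])   # three queries on the same D
```

Had the three bin counts been treated as three queries over the full database, sequential composition would have charged 3.0 instead of 1.0 for the same output.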
Postprocessing • If $M_1$ is an $\varepsilon$-differentially private algorithm that accesses a private database $D$, then outputting $M_2(M_1(D))$ also satisfies $\varepsilon$-differential privacy, for any $M_2$ that does not itself access $D$. (Source: Tutorial: Differential Privacy in the Wild)
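A common use of the postprocessing property is cleaning up noisy counts. In the sketch below, `postprocess_counts` plays the role of M_2; the function name and the sample values are hypothetical.

```python
def postprocess_counts(noisy_counts):
    """Round noisy counts to integers and clamp negatives to zero.

    The function only sees already-released values, never the private
    data, so it consumes no additional privacy budget.
    """
    return [max(0, round(c)) for c in noisy_counts]

released = [-1.2, 2.6, 0.4]  # hypothetical output of an epsilon-DP mechanism
cleaned = postprocess_counts(released)
```

The cleaned counts are no less private than the released ones, although clamping does introduce a small positive bias.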
Privacy preserving data mining • Differential privacy • Definition • Building blocks (primitive mechanisms) • Composition rules • Data mining algorithms with differential privacy • k-means clustering w/ differential privacy • Frequent itemsets mining w/ differential privacy
Privacy Preserving Data Mining as Constrained Optimization • Two goals • Privacy • Error (utility) • Given a task and privacy budget ε, how to design a set of queries (functions) and allocate the budget such that the error is minimized?
Data mining algorithms with differential privacy • General algorithmic framework • Decompose a data mining algorithm into a set of functions • Allocate privacy budget to each function • Implement each function with $\varepsilon_i$-differential privacy • Compute noisy output using Laplace mechanism based on sensitivity of the function and $\varepsilon_i$ • Compose them using composition theorem • Optimization techniques • Decomposition design • Budget allocation • Sensitivity reduction for each function
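The framework can be illustrated on a deliberately tiny task: a differentially private mean. The sketch decomposes the mean into a SUM and a COUNT, gives each half of the budget, adds Laplace noise to each, and composes them. The even split, the clipping range [0, upper], and the function names are assumptions of this example, not a recommendation from the slides.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, upper, epsilon, rng):
    """Estimate the mean of values in [0, upper] with total budget `epsilon`."""
    # Decompose: mean = sum / count.
    # Allocate: an even split (an assumption; other splits are possible).
    eps_sum = eps_count = epsilon / 2.0
    clipped = [min(max(v, 0.0), upper) for v in values]
    # Implement each function with the Laplace mechanism.
    noisy_sum = sum(clipped) + laplace_noise(upper / eps_sum, rng)    # sensitivity = upper
    noisy_count = len(clipped) + laplace_noise(1.0 / eps_count, rng)  # sensitivity = 1
    # Compose: both queries read the same data, so the cost is eps_sum + eps_count.
    # The division is post-processing; the floor of 1 avoids dividing by <= 0.
    return noisy_sum / max(noisy_count, 1.0)

rng = random.Random(0)
ages = [44, 44, 44, 55, 55]  # toy data
noisy_mean_age = private_mean(ages, upper=100.0, epsilon=1.0, rng=rng)
```

The three optimization levers from the slide are all visible here: a different decomposition, a different budget split, or a tighter clipping bound (lower sensitivity) would each change the error.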
Review: K-means Clustering
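K-means Problem • Partition a set of points $x_1, x_2, \ldots, x_n$ into $k$ clusters $S_1, S_2, \ldots, S_k$ such that the SSE is minimized: $\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of the cluster $S_i$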
K-means Algorithm • Initialize a set of k centers • Repeat until convergence 1. Assign each point to its nearest center 2. Update the set of centers • Output final set of k centers and the points in each cluster
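The steps above can be sketched in plain Python as follows. This is the non-private baseline only. Initializing with k randomly chosen data points, keeping the old center when a cluster becomes empty, and the helper names are all assumptions of this sketch.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def assign(points, centers):
    """Step 1: put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k distinct data points
    for _ in range(max_iters):
        clusters = assign(points, centers)
        # Step 2: move each center to the mean of its cluster.
        new_centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:   # converged: centers stopped moving
            break
        centers = new_centers
    return centers, assign(points, centers)

def sse(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum(squared_distance(p, c) for c, cl in zip(centers, clusters) for p in cl)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
total_sse = sse(centers, clusters)
```

In the decomposition framework, the two quantities a private variant would need to protect in each iteration are the per-cluster coordinate sums and the per-cluster counts, since the new centers are computed from them.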