Would that it were so simple: Yet another theory of privacy
John Mitchell (Stanford)
Avradip Mandal, Hart Montgomery, Arnab Roy (Fujitsu)
It seems Allegra’s a no‐show, which is simply a bore, but I’ll partner you in bridge.
Would that it were so simple
Yet another …
• YAAF – … Application Framework
• Yabasic – … BASIC
• Yacc – … compiler compiler
• Yacas – … computer algebra system
• YaDICs – … Digital Image Correlation Software
• YADIFA – … DNS Implementation For All
• YafaRay – … free Ray tracer
• Yafc – … FTP client
• YAFFS – … Flash File System
• YAP – Yet Another Previewer
• YAPC – … Perl Conference
• YARN – … Resource Negotiator
• YARP – … Robot Platform
• YARV – … Ruby VM
• Yasca – … Source Code Analyzer
• Y.A.S.U. – … SecuROM Utility
• Yate – … Telephony Engine
• YAWC – … Wersion of Citadel
• YAWL – … Workflow Language
• Yaws – … web server
Hard to “anonymize” data
• De‐identifying data does not necessarily achieve anonymity. It can often be re‐identified:
– Medical data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total bill
– Voter lists only: Name, Address, Date registered, Party, Date last voted
– In both: ZIP, Birth date, Sex
Source: Latanya Sweeney
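
The re-identification described on this slide amounts to joining the two tables on the attributes they share. The following is a minimal sketch of that join, not Sweeney's actual procedure; every record, name, and field name in it is invented for illustration.

```python
# Hypothetical toy records; all names and values are invented for illustration.
medical = [
    {"zip": "00001", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "00002", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "A. Example", "zip": "00001", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "B. Sample", "zip": "00002", "birth_date": "1970-06-30", "sex": "F"},
]

# Attributes that appear in both data sets on the slide.
SHARED = ("zip", "birth_date", "sex")


def link(medical_rows, voter_rows, keys=SHARED):
    """Return (name, diagnosis) pairs where the shared attributes match exactly one voter."""
    index = {}
    for voter in voter_rows:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter["name"])
    matches = []
    for record in medical_rows:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


print(link(medical, voters))
```

Only the first medical record is linked here, because the second one has no voter with the same ZIP, birth date, and sex.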
Date of birth, gender + 5‐digit ZIP uniquely identifies 87.1% of U.S. population
• ZIP 60623 has 112,167 people, 11% uniquely identified. Insufficient # over 55 living there.
[Figure legend: each mark = one ZIP code]
Source: Latanya Sweeney
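
A uniqueness figure like the ones quoted above can be computed by counting how many records have a (date of birth, gender, ZIP) combination that nobody else shares. The sketch below assumes hypothetical field names (`birth_date`, `sex`, `zip`) and invented toy records; it illustrates the counting step only and does not reproduce Sweeney's data or results.

```python
from collections import Counter


def unique_fraction(records, keys=("birth_date", "sex", "zip")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(records)


# Invented toy population: two people share a combination, two are unique.
people = [
    {"birth_date": "1980-05-01", "sex": "F", "zip": "11111"},
    {"birth_date": "1980-05-01", "sex": "F", "zip": "11111"},
    {"birth_date": "1975-09-12", "sex": "M", "zip": "11111"},
    {"birth_date": "1990-02-28", "sex": "F", "zip": "22222"},
]

print(unique_fraction(people))
```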
Privacy example 1: US Census
• Raw data: information about every US household
– Who, where; age, gender, racial, income and educational data
• Why released: determine representation, planning
• How anonymized: aggregated to geographic areas (ZIP code)
– Broken down by various combinations of dimensions
– Released in full after 72 years
• Attacks: no reports of successful deanonymization
• Consequences: greater understanding of US population
– Affects funding of civil projects
– Rich source of data for future historians, etc.
Source: Anonymized Data: Generation, Models, Usage (Cormode & Srivastava)
Privacy example 2: AOL Search Data
• Raw data: 20M search queries for 650K users from 2006
• Why released: allow researchers to understand search patterns
• How anonymized: user identifiers removed
– All searches from same user linked by an arbitrary identifier
• Attacks: many successful attacks identified individual users
– Ego‐surfers: people typed in their own names
– ZIP codes and town names identify an area
– NY Times identified user 4417749 as a 62‐year‐old GA widow
• Consequences: CTO resigned, two researchers fired
– Well‐intentioned effort failed due to inadequate anonymization
Source: Anonymized Data: Generation, Models, Usage (Cormode & Srivastava)
Privacy example 3: Netflix Prize
• Raw data: 100M dated ratings from 480K users to 18K movies
• Why released: improve predicting ratings of unlabeled examples
• How anonymized: exact details not described by Netflix
– All direct customer information removed
– Only subset of full data; dates modified; some ratings deleted
– Movie title and year published in full
• Attacks: [Narayanan Shmatikov 08]
– Attack links data to IMDB where same users also rated movies
– Find matches based on similar ratings or dates in both
Source: Anonymized Data: Generation, Models, Usage (Cormode & Srivastava)
k‐Anonymity
• Make individuals “blend into the crowd”
– Suppress or generalize attributes in a database so that the identifying characteristics in each row match at least k‐1 other rows
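
The definition above can be checked mechanically: group rows by their identifying attributes and require every group to have at least k members. The sketch below is a minimal illustration under assumed field names (`zip`, `age_band`) and invented rows; `generalize_zip` is a hypothetical helper showing one way to generalize an attribute.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def generalize_zip(row, digits=3):
    """Return a copy of `row` with all but the first `digits` ZIP characters masked."""
    out = dict(row)
    out["zip"] = row["zip"][:digits] + "*" * (len(row["zip"]) - digits)
    return out


# Invented rows for illustration only.
rows = [
    {"zip": "60623", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "60624", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "60601", "age_band": "40-49", "diagnosis": "flu"},
    {"zip": "60602", "age_band": "40-49", "diagnosis": "diabetes"},
]
QI = ("zip", "age_band")

generalized = [generalize_zip(r) for r in rows]
print(is_k_anonymous(rows, QI, 2))         # raw rows: every ZIP is distinct
print(is_k_anonymous(generalized, QI, 2))  # after masking the last two ZIP digits
```

The raw table fails the check for k = 2, while the generalized table passes because each (masked ZIP, age band) pair now covers two rows.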
k‐Anonymity
• Advantage of this concept
– Does not involve probability
• Disadvantages
– Does not involve probability
– Depends on absence of additional info
Two rigorous theories of privacy
• Contextual integrity
– Normative framework for evaluating the flow of information between agents
– Agents act in roles within social contexts
– Principles of transmission: confidentiality, reciprocity, desert, etc.
• Differential privacy
[Figure: sanitizer San run on DB and on DB′ yields S and S′, with bounded distance between their distributions. Credit: Adam Smith]
Yet Another Theory of Privacy
• Formulate privacy and utility around
– Private database accessed via privacy mechanism
– Possibly linkable to public information
– In presence of prior distribution about population
• Measure privacy for user and utility using entropy
– Privacy loss: information gain about user, identified by name and address or other public identifier
– Utility gain: information about “anonymized” identifier that is used to access private data
The Targeted Advertising Ecosystem
• Consumers: people who browse the web
• Publishers: websites that publish content
• Ad Exchanges & Ad Auctions: services that match people to targeted ads
• Ad Networks: services that manage ad campaigns and target users
• Companies: people trying to sell you stuff
[Figure labels: Ads; Rubicon, AdNexus, Rocketfuel]
Slide credit: Guevara Noubir
The Targeted Advertising Ecosystem
• Consumers: people who browse the web
• Publishers: websites that publish content
• Ad Exchanges & Ad Auctions: services that match people to targeted ads
• Ad Networks: services that manage ad campaigns and target users
• Companies: people trying to sell you stuff
• Tracking data is stored and exchanged amongst these companies
[Figure labels: $$$; Rubicon, AdNexus, Rocketfuel]
Yet Another Theory of Privacy
• Assume advertising data provider
– Maintains database of user interest and behavior, indexed by a supercookie
– Supplies rows of this database to ad networks
• Assume ad networks
– Use this information to bid $$ on ad impressions
• Threat model
– Attacker has access to
  • Rows of database indexed by supercookie
  • Prior distribution of traits within population
  • Public information in external databases (e.g., FB, Yelp)
– What can attacker learn about real individuals?
Another theory of privacy
• Database
– Mapping from users U to rows (from some set X)
– Mapping from users to their supercookies
– Prior distribution P on databases, known to adversary
• Privacy mechanism
– Transformation M of database
– Advertiser gets supercookie‐based access to transformed database
• Privacy for user
– Decrease in uncertainty about the user and her data, when provided supercookie‐based access to transformed database
• Utility per user
– Decrease in uncertainty about the supercookie and user data, when provided supercookie‐based access to transformed database
More Details
• Database
– Mapping from users U to rows (from some set X)
– Mapping from users to their supercookies
– Prior distribution P on databases, known to adversary
• Privacy mechanism
– Transformation M of database
– Function maps user to supercookie
• Privacy for user
– Decrease in uncertainty about the user and her data, when provided supercookie‐based access to transformed database
• Utility per user
– Decrease in uncertainty about the supercookie and user data, when provided supercookie‐based access to transformed database
More Details
• Privacy mechanism
– Transformation M of database
– Function maps user to supercookie
• Privacy for user
– (Uncertainty about user and her data) − (uncertainty when provided “private” data access)
• Utility per user
– (Uncertainty about supercookie and associated data) − (uncertainty when provided “private” data access)
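
One way to make "decrease in uncertainty" concrete is to read uncertainty as Shannon entropy, so that the decrease is H(X) − H(X | Y), where X is what the adversary wants to learn and Y is what the access reveals. That reading is an assumption made for this sketch; the slides do not spell out the exact formula, and the function names and toy distributions below are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint):
    """H(X | Y) in bits for a joint distribution given as {(x, y): probability}."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    h = 0.0
    for (_, y), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / p_y[y])
    return h


def uncertainty_decrease(joint):
    """H(X) - H(X | Y): how much observing Y reduces uncertainty about X."""
    p_x = {}
    for (x, _), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return entropy(p_x) - conditional_entropy(joint)


# Toy case 1: the observation reveals the private trait exactly.
revealing = {("likes_sushi", "row_a"): 0.5, ("likes_pizza", "row_b"): 0.5}
# Toy case 2: the observation is independent of the private trait.
independent = {
    ("likes_sushi", "row_a"): 0.25, ("likes_sushi", "row_b"): 0.25,
    ("likes_pizza", "row_a"): 0.25, ("likes_pizza", "row_b"): 0.25,
}

print(uncertainty_decrease(revealing))
print(uncertainty_decrease(independent))
```

In the first toy case the adversary learns the full bit of uncertainty; in the second the observation tells the adversary nothing, so the decrease is zero.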
Some insight from the model
• Privacy is still mathematically complicated
– Not easy to prove interesting entropy relationships
• Lower bound
– Generalize the Netflix example: sparse database where # of users << exp(# of columns), e.g., columns = movies
– Even adding randomly sampled Bernoulli noise, there is a level of noise where privacy loss is still catastrophic and utility insufficient for most practical applications
• Upper bound
– Coarse‐grained database with # users >> exp(# of columns)
– Intuition: restaurant recommendations based on user category preferences; privacy even with Yelp as side information
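
The lower-bound intuition can be explored with a toy simulation: flip each entry of a sparse 0/1 database with some probability, then see how often the noisy row is still closest to its true original. This is an illustrative sketch only, not the construction or bound from the talk. It assumes the adversary holds the exact original rows as side information and matches by Hamming distance, and all function names and parameters are hypothetical.

```python
import random


def add_bernoulli_noise(row, flip_prob, rng):
    """Flip each 0/1 entry independently with probability `flip_prob`."""
    return [bit ^ 1 if rng.random() < flip_prob else bit for bit in row]


def hamming(a, b):
    """Number of positions where two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def reidentification_rate(db, flip_prob, seed=0):
    """Fraction of noisy rows whose nearest original row is the true one."""
    rng = random.Random(seed)
    hits = 0
    for i, row in enumerate(db):
        noisy = add_bernoulli_noise(row, flip_prob, rng)
        guess = min(range(len(db)), key=lambda j: hamming(noisy, db[j]))
        hits += guess == i
    return hits / len(db)


def random_sparse_db(n_users, n_columns, density, seed=0):
    """Random 0/1 database in which each entry is 1 with probability `density`."""
    rng = random.Random(seed)
    return [[1 if rng.random() < density else 0 for _ in range(n_columns)]
            for _ in range(n_users)]


# Few users, many columns: the sparse regime described in the lower bound.
db = random_sparse_db(n_users=20, n_columns=200, density=0.05)
for p in (0.0, 0.05, 0.2):
    print(p, reidentification_rate(db, p))
```

Raising `flip_prob` lowers the match rate but also corrupts the rows an advertiser would use, which is the privacy versus utility tension the slide describes.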
Example
Can an advertiser connect preference data with public reviews to estimate user u_i's private preferences?
Example
• Under conservative assumption
• Can calculate privacy as function of n and [symbol missing in source]
Complicated Expression
Summary
• Consumers: people who browse the web
• Publishers: websites that publish content
• Ad Exchanges & Ad Auctions: services that match people to targeted ads
• Ad Networks: services that manage ad campaigns and target users
• Companies: people trying to sell you stuff
[Figure labels: Ads; Rubicon, AdNexus, Rocketfuel]
Slide credit: Guevara Noubir