Privacy Cognizant Information Systems - PowerPoint PPT Presentation

Privacy Cognizant Information Systems. Rakesh Agrawal, IBM Almaden Research Center. Joint work with Srikant, Kiernan, Xu & Evfimievski.


  1. Privacy Cognizant Information Systems. Rakesh Agrawal, IBM Almaden Research Center. Joint work with Srikant, Kiernan, Xu & Evfimievski.

  2. Thesis
     - There is increasing need to build information systems that
       - protect the privacy and ownership of information
       - do not impede the flow of information
     - Cross-fertilization of ideas from the security and database research communities can lead to the development of innovative solutions.

  3. Outline
     - Motivation
     - Privacy Preserving Data Mining
     - Privacy Aware Data Management
     - Information Sharing Across Private Databases
     - Conclusions

  4. Drivers
     - Policies and Legislations
       - U.S. and international regulations
       - Legal proceedings against businesses
     - Consumer Concerns
       - “Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.” Forrester Research, 2001
       - Most consumers are “privacy pragmatists.” Westin Surveys
     - Moral Imperative
       - The right to privacy: the most cherished of human freedoms -- Warren & Brandeis, 1890

  5. Outline
     - Motivation
     - Privacy Preserving Data Mining
     - Privacy Aware Data Management
     - Information Sharing Across Private Databases
     - Conclusions

  6. Data Mining and Privacy
     - The primary task in data mining:
       - development of models about aggregated data.
     - Can we develop accurate models, while protecting the privacy of individual records?

  7. Setting
     - Application scenario: A central server interested in building a data mining model using data obtained from a large number of clients, while preserving their privacy
       - Web-commerce, e.g. recommendation service
     - Desiderata:
       - Must not slow down the speed of client interaction
       - Must scale to very large number of clients
     - During the application phase
       - Ship model to the clients
       - Use oblivious computations

  8. World Today
     [Figure: Alice (35; 95,000; J.S. Bach, painting, nasa), Bob (45; 60,000; B. Spears, baseball, cnn), and Chris (42; 85,000; B. Marley, camping, microsoft) each send their true record to the Recommendation Service.]

  9. World Today
     [Figure: the same three true records (Alice, Bob, Chris) arrive at the Recommendation Service, where a Mining Algorithm produces the Data Mining Model.]

  10. New Order: Randomization to Protect Privacy
      [Figure: each user randomizes a record before it goes to the Recommendation Service. Alice's age 35 becomes 50 (35+15); her record (35; 95,000; J.S. Bach, painting, nasa) is sent as (50; 65,000; Metallica, painting, nasa). Bob's record (45; 60,000; B. Spears, baseball, cnn) is sent as (38; 90,000; B. Spears, soccer, fox). Chris's record (42; 85,000; B. Marley, camping, microsoft) is sent as (32; 55,000; B. Marley, camping, linuxware).]
      - Randomization techniques differ for numeric and categorical data
      - Each attribute randomized independently
      - Per-record randomization without considering other records
      - Randomization parameters common across users
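
The per-record scheme on this slide can be sketched as follows. This is only an illustration under stated assumptions: the slide shows 35 becoming 35+15 but does not name the noise distribution, so additive uniform noise is assumed here, and the half-widths in `HALF_WIDTHS` and the function name `randomize_record` are hypothetical. Categorical attributes are left out because the slide does not say how they are randomized.

```python
import random

# Hypothetical randomization parameters, shared by all users (assumed values).
HALF_WIDTHS = {"age": 20, "salary": 40_000}

def randomize_record(record, half_widths, rng):
    """Perturb each listed numeric attribute independently with uniform noise.

    Only this one record is consulted, and only perturbed values are
    returned, so the true values stay on the client.
    """
    return {
        name: record[name] + rng.uniform(-width, width)
        for name, width in half_widths.items()
    }

rng = random.Random(0)
alice = {"age": 35, "salary": 95_000}
sent = randomize_record(alice, HALF_WIDTHS, rng)
print(sent)  # perturbed values only; the exact numbers depend on the seed
```

Because the same `HALF_WIDTHS` are used for every user, the server knows the noise distribution without seeing any true value, which is what the reconstruction step later relies on.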

  11. New Order: Randomization to Protect Privacy
      [Figure: the same diagram of true and randomized records for Alice, Bob, and Chris, with the callout "True values never leave the user!"]

  12. New Order: Randomization Protects Privacy
      [Figure: the randomized records reach the Recommendation Service, where a Recovery step feeds the Mining Algorithm, which produces the Data Mining Model. Callout: recovery of distributions, not individual records.]

  13. Reconstruction Problem (Numeric Data)
      - Original values $x_1, x_2, \ldots, x_n$
        - from probability distribution X (unknown)
      - To hide these values, we use $y_1, y_2, \ldots, y_n$
        - from probability distribution Y
      - Given
        - $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$
        - the probability distribution of Y
      - Estimate the probability distribution of X.

  14. Reconstruction Algorithm (R. Agrawal & R. Srikant, SIGMOD 2000)
      $f_X^0$ := Uniform distribution
      $j$ := 0
      repeat
        $$f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^j(z)\, dz}$$ (Bayes' Rule)
        $j$ := $j$ + 1
      until (stopping criterion met)
      - Converges to maximum likelihood estimate.
        - D. Agrawal & C.C. Aggarwal, PODS 2001.
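
The iterative update on this slide can be sketched in a few lines of Python. This is a discretized illustration, not the authors' implementation: the integral is replaced by a sum over a grid of cell midpoints, the noise is assumed to be uniform, the stopping criterion (largest change below `tol`, or `max_iter` reached) is an assumption, and the names `uniform_density` and `reconstruct` are hypothetical.

```python
import random

def uniform_density(half_width):
    """Density f_Y of noise drawn uniformly from [-half_width, +half_width]."""
    def f_y(d):
        return 1.0 / (2.0 * half_width) if abs(d) <= half_width else 0.0
    return f_y

def reconstruct(w, f_y, grid, max_iter=50, tol=1e-4):
    """Iterate the Bayes update on a discrete grid of candidate values a.

    w    : perturbed values w_i = x_i + y_i
    f_y  : known noise density
    grid : midpoints of equal-width cells; the integral becomes a sum
    Returns the estimated probability mass of X in each cell.
    """
    m = len(grid)
    # f_Y(w_i - a) never changes between iterations, so compute it once.
    lik = [[f_y(w_i - a) for a in grid] for w_i in w]
    f_x = [1.0 / m] * m                       # f_X^0 := uniform
    for _ in range(max_iter):
        new = [0.0] * m
        for row in lik:
            post = [lk * p for lk, p in zip(row, f_x)]  # numerator for each a
            total = sum(post)                           # discretized denominator
            if total > 0.0:
                for k in range(m):
                    new[k] += post[k] / total
        norm = sum(new)   # equals n when every observation has support on the grid
        new = [v / norm for v in new]
        change = max(abs(u - v) for u, v in zip(new, f_x))
        f_x = new
        if change < tol:                      # assumed stopping criterion
            break
    return f_x

rng = random.Random(1)
# Made-up ages in two clusters, perturbed with uniform noise in [-15, 15].
ages = [rng.uniform(20, 30) for _ in range(300)] + \
       [rng.uniform(50, 60) for _ in range(300)]
noisy = [x + rng.uniform(-15, 15) for x in ages]
grid = [2.5 + 5 * k for k in range(16)]       # cells covering 0 to 80
estimate = reconstruct(noisy, uniform_density(15), grid)
print([round(p, 3) for p in estimate])
```

Only `noisy` and the noise density are passed to `reconstruct`; the true `ages` are never used, which mirrors the point that the server recovers a distribution rather than individual records.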

  15. Works Well
      [Figure: Number of People (0 to 1200) versus Age (20 to 60), comparing the Original, Randomized, and Reconstructed distributions.]

  16. Decision Tree Example
      Age   Salary   Repeat Visitor?
      23    50K      Repeat
      17    30K      Repeat
      43    40K      Repeat
      68    50K      Single
      32    70K      Single
      20    20K      Repeat
      Tree: Age < 25? Yes: Repeat. No: Salary < 50K? Yes: Repeat. No: Single.
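
The tree on this slide can be written as two nested tests and checked against the six rows of the table. The branch labels are read off a flattened figure, so the arrangement below (Age < 25 leads to Repeat; otherwise Salary < 50K leads to Repeat, else Single) is an assumption, chosen because it is the one consistent with every row. The function name `classify` is hypothetical.

```python
def classify(age, salary):
    """Follow the slide's tree: test Age < 25 first, then Salary < 50K."""
    if age < 25:
        return "Repeat"
    if salary < 50_000:
        return "Repeat"
    return "Single"

# (age, salary, label) rows from the slide's table
rows = [
    (23, 50_000, "Repeat"),
    (17, 30_000, "Repeat"),
    (43, 40_000, "Repeat"),
    (68, 50_000, "Single"),
    (32, 70_000, "Single"),
    (20, 20_000, "Repeat"),
]
predictions = [classify(age, salary) for age, salary, _ in rows]
print(predictions == [label for _, _, label in rows])  # prints True
```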

  17. Algorithms
      - Global
        - Reconstruct for each attribute once at the beginning
      - By Class
        - For each attribute, first split by class, then reconstruct separately for each class.
      - Local
        - Reconstruct at each node
      See SIGMOD 2000 paper for details.
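
The three variants differ only in when reconstruction runs and over which records. The sketch below shows that difference and nothing more: all function names are hypothetical, the `reconstruct` argument is a stand-in for a real distribution reconstruction, the sample records and labels are made up, and treating Local as a per-class reconstruction repeated at each node is an assumption (the slide only says "reconstruct at each node").

```python
from collections import defaultdict

def reconstruct_global(records, attributes, reconstruct):
    """Global: reconstruct each attribute once, over all training records."""
    return {a: reconstruct([r[a] for r in records]) for a in attributes}

def reconstruct_by_class(records, attributes, reconstruct):
    """By Class: split records by class label, then reconstruct per class."""
    groups = defaultdict(list)
    for r in records:
        groups[r["label"]].append(r)
    return {label: reconstruct_global(rs, attributes, reconstruct)
            for label, rs in groups.items()}

def reconstruct_local(node_records, attributes, reconstruct):
    """Local: called again at every tree node on the records reaching it.

    Assumption: the reconstruction at a node is done per class.
    """
    return reconstruct_by_class(node_records, attributes, reconstruct)

def count_values(values):
    """Stand-in for a real reconstruction: it only counts the values."""
    return len(values)

# Made-up randomized training records with class labels.
records = [
    {"age": 50, "salary": 65_000, "label": "Repeat"},
    {"age": 38, "salary": 90_000, "label": "Single"},
    {"age": 32, "salary": 55_000, "label": "Repeat"},
]
overall = reconstruct_global(records, ["age", "salary"], count_values)
per_class = reconstruct_by_class(records, ["age", "salary"], count_values)
print(overall)
print(per_class)
```

Global calls `reconstruct` once per attribute, By Class once per attribute and class, and Local repeats the per-class work at every node, so the cost grows in that order.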

  18. Experimental Methodology
      - Compare accuracy against
        - Original: unperturbed data without randomization.
        - Randomized: perturbed data but without making any corrections for randomization.
      - Test data not randomized.
      - Synthetic benchmark from [AGI+92].
      - Training set of 100,000 records, split equally between the two classes.
