Privacy-preserving data mining: multiplicative perturbation techniques
Li Xiong
CS573 Data Privacy and Anonymity

Outline
• Review and critique of randomization approaches (additive noise)
• Multiplicative data perturbations
  • Rotation perturbation
  • Geometric data perturbation
  • Random projection
• Comparison

Additive noise (randomization)
• Reveal the entire database, but randomize its entries
• Add random noise $\epsilon_i$ to each database entry $x_i$: the user sees $x_1 + \epsilon_1, \dots, x_n + \epsilon_n$ instead of $x_1, \dots, x_n$
• For example, if the noise distribution has mean 0, the user can still estimate the average of the $x_i$
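A minimal sketch of the zero-mean argument, assuming NumPy is available. The age column, the uniform noise range, and all variable names are made up for illustration and are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical private column: ages of 10,000 users.
ages = rng.integers(18, 80, size=10_000).astype(float)

# Zero-mean noise from a published distribution (uniform on [-35, 35] is an assumption).
noise = rng.uniform(-35.0, 35.0, size=ages.shape)
randomized = ages + noise

# Individual entries are heavily distorted, but the column mean is nearly unchanged.
mean_error = abs(randomized.mean() - ages.mean())
max_entry_error = np.abs(randomized - ages).max()
print(f"true mean {ages.mean():.2f}, randomized mean {randomized.mean():.2f}")
print(f"largest distortion of a single entry: {max_entry_error:.2f}")
```

The estimate of the mean improves as the number of records grows, while each single entry stays as distorted as before.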

Learning decision tree on randomized data
[Figure: each record (e.g. Alice's record 30 | 70K | ...) passes through a randomizer that adds a random number to Age, so 30 becomes 65 (30+35). The distributions of Age and Salary are reconstructed from the randomized records and passed to the classification algorithm, which produces the model.]

Summary on additive perturbations
• Benefits
  • Easy to apply: applied separately to each data point (record)
  • Low cost
  • Can be used for both the web model and the corporate model
[Figure: users 1 to n each hold private info $x_i$ and send $x_i + \epsilon_i$ to the web apps that collect the data.]

Additive perturbations - privacy
• Need to publish the noise distribution
• The column distribution is disclosed
• Subject to data value attacks!
Reference: On the Privacy Preserving Properties of Random Data Perturbation Techniques, Kargupta, 2003a

The spectral filtering technique can be used to estimate the original data

The spectral filtering technique can perform poorly when there is an inherent random component in the original data

Randomization: data utility
• Only preserves the column distribution
• Need to redesign/modify existing data mining algorithms
• Limited data mining applications
  • Decision tree and naïve Bayes classifier

Randomization approaches
[Figure: privacy guarantee weighed against data utility/model accuracy.]
• Difficult to balance the two factors
• Low data utility
• Subject to attacks

More thoughts about perturbation
1. Preserve privacy
  • Hide the original data: not easy to estimate the original values from the perturbed data
  • Protect from data reconstruction techniques: the attacker has prior knowledge on the published data
2. Preserve data utility for tasks
  • Single-dimensional properties (column distribution, etc.): decision tree, Bayesian classifier
  • Multi-dimensional properties (covariance matrix, distance, etc.): SVM classifier, kNN classification, clustering

Multiplicative perturbations
• Preserving multidimensional data properties
• Geometric data perturbation (GDP) [Chen '07]
  • Rotation data perturbation
  • Translation data perturbation
  • Noise addition
• Random projection perturbation (RPP) [Liu '06]
References:
Chen, K. and Liu, L. Towards attack-resilient geometric data perturbation. SDM, 2007
Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. TKDE, 2006

Rotation perturbation
• $G(X) = R X$
  • $R_{m \times m}$: an orthonormal matrix ($R^T R = R R^T = I$)
  • $X_{m \times n}$: original data set with n m-dimensional data points
  • $G(X)_{m \times n}$: rotated data set
• Key features
  • Preserves Euclidean distance and inner product of data points
  • Preserves geometric shapes such as hyperplanes and hyper curved surfaces in the multidimensional space
Example (rows: age, rent, tax; columns: record IDs 001 and 002):
$$
\begin{pmatrix} 1176 & 948 \\ 3112 & 2392 \\ -2920 & -2309 \end{pmatrix}
=
\begin{pmatrix} .83 & -.40 & .40 \\ .2 & .86 & .46 \\ .53 & .30 & -.79 \end{pmatrix}
\begin{pmatrix} 30 & 25 \\ 1350 & 1000 \\ 4230 & 3320 \end{pmatrix}
$$
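The two key features can be checked numerically. The sketch below assumes NumPy and draws a random orthonormal matrix through a QR decomposition, which is one common construction and not necessarily the one used in the GDP paper. The helper names are illustrative.

```python
import numpy as np

def random_orthonormal(m, rng):
    """Random m x m orthonormal matrix obtained from a QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((m, m)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

def pairwise_distances(A):
    """Euclidean distances between the columns of A."""
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))   # 5 data points with 3 dimensions, one point per column
R = random_orthonormal(3, rng)
G = R @ X                         # rotated data set G(X) = R X

print(np.allclose(R.T @ R, np.eye(3)))                            # R is orthonormal
print(np.allclose(pairwise_distances(G), pairwise_distances(X)))  # distances preserved
print(np.allclose(G.T @ G, X.T @ X))                              # inner products preserved
```

An orthonormal matrix drawn this way may include a reflection (determinant -1); distances and inner products are preserved either way.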

Illustration of multiplicative data perturbation
Preserving distances while perturbing each individual dimension

Data properties
• A model is invariant to geometric perturbation if distance plays an important role
• Class/cluster members and decision boundaries are correlated in terms of distance, not the concrete locations
[Figure: 2D example with Class 1, Class 2, and their classification boundary. Rotation and translation keep the boundary intact; distance perturbation (noise addition) slightly changes the classification boundary.]

Applicable DM algorithms
• Models "invariant" to GDP
  • All Euclidean distance based clustering algorithms
  • Classification algorithms
    • K nearest neighbors
    • Kernel methods
    • Linear classifiers
    • Support vector machines
  • Most regression models
  • And potentially more…

When to use multiplicative data perturbation
[Figure: the data owner computes $G(X) = RX + T + D$ and sends G(X) to the service provider/data user, who applies a mining function F to G(X) and returns the mined models/patterns.]
• Good for the corporate model or dataset publishing
• Major issue: curious service providers/data users try to break G(X)
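A minimal sketch of what the data owner might compute for G(X) = RX + T + D, assuming NumPy. The QR-based rotation, the uniform translation range, and the function names are assumptions for illustration; the slides do not specify them.

```python
import numpy as np

def geometric_perturbation(X, sigma, rng):
    """Return (G, R, t) with G = R @ X + t + D; the columns of X are records."""
    m, n = X.shape
    q, r = np.linalg.qr(rng.standard_normal((m, m)))
    R = q * np.sign(np.diag(r))               # random orthonormal matrix
    t = rng.uniform(-1.0, 1.0, size=(m, 1))   # the same translation for every record
    D = sigma * rng.standard_normal((m, n))   # i.i.d. Gaussian noise N(0, sigma^2)
    return R @ X + t + D, R, t

def pairwise_distances(A):
    """Euclidean distances between the columns of A."""
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 50))
G_exact, _, _ = geometric_perturbation(X, sigma=0.0, rng=rng)
G_noisy, _, _ = geometric_perturbation(X, sigma=0.1, rng=rng)

# Without noise the distances are untouched; with noise they shift slightly.
exact_gap = np.abs(pairwise_distances(G_exact) - pairwise_distances(X)).max()
noisy_gap = np.abs(pairwise_distances(G_noisy) - pairwise_distances(X)).max()
print(exact_gap, noisy_gap)
```

Only G(X) leaves the data owner; R, t, and the noise are kept private.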

Attacks!
• Three levels of knowledge
  • Know nothing → naïve estimation
  • Know column distributions → Independent Component Analysis
  • Know specific points (original points and their images in perturbed data) → distance inference

Attack 1: naïve estimation
• Estimate original points purely based on the perturbed data
• If using "random rotation" only
  • Intensity of perturbation matters
  • Points around the origin
[Figure: three panels showing Class 1, Class 2, and their classification boundary in the X-Y plane.]

Countering naïve estimation
• Maximize intensity
  • Based on formal analysis of "rotation intensity"
  • Method to maximize intensity: Fast_Opt algorithm in GDP
• "Random translation" T
  • Hide the origin
  • Increase the difficulty of attacking: the attacker needs to estimate R first in order to find out T

Attack 2: ICA-based attacks
• Independent Component Analysis (ICA)
• Try to separate R and X from Y = R*X

Characteristics of ICA
1. Ordering of dimensions is not preserved
2. Intensity (value range) is not preserved
Conditions of an effective ICA attack
1. Knowing the column distribution
2. Knowing the value range

Countering ICA attack
• Weakness of ICA attack
  • Needs a certain amount of knowledge
  • Cannot effectively handle dependent columns
• In reality…
  • Most datasets have correlated columns
  • We can find an optimal rotation perturbation, maximizing the difficulty of ICA attacks

Attack 3: distance-inference attack
With only rotation/translation perturbation, the attack applies when the attacker knows a set of original points and their mapping…
[Figure: known points in the original data and their images in the perturbed data.]

How is the attack done…
• Knowing points and their images…
  • Find exact images of the known points
    • Enumerate pairs by matched distances
    • Less effective for large data
  • We assume pairs are successfully identified
• Estimation
  1. Cancel random translation T from pairs (x, x')
  2. Calculate R with pairs: $Y = RX$, so $R = Y X^{-1}$
  3. Calculate T with R and known pairs
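The three estimation steps can be sketched as follows, assuming NumPy, a noise-free perturbation Y = RX + T, and that the attacker has already matched m + 1 known points to their images. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 3

# Ground truth known only to the data owner.
q, r = np.linalg.qr(rng.standard_normal((m, m)))
R_true = q * np.sign(np.diag(r))
t_true = rng.uniform(-5.0, 5.0, size=(m, 1))

# The attacker knows m + 1 original points (columns) and their images.
X_known = rng.standard_normal((m, m + 1))
Y_known = R_true @ X_known + t_true

# Step 1: subtracting one pair from the others cancels the translation.
X_diff = X_known[:, 1:] - X_known[:, :1]
Y_diff = Y_known[:, 1:] - Y_known[:, :1]

# Step 2: Y_diff = R @ X_diff, so R = Y_diff @ inv(X_diff).
# Solving the transposed system avoids forming the explicit inverse.
R_est = np.linalg.solve(X_diff.T, Y_diff.T).T

# Step 3: recover the translation from any known pair.
t_est = Y_known[:, :1] - R_est @ X_known[:, :1]

print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

With exact pairs and no noise, m + 1 points in general position are enough to recover R and T.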

Countering distance-inference: noise addition
• Noise brings enough variance into the estimation of R and T
• Can the noise be easily filtered?
  • The attacker needs to know the noise distribution
  • The attacker needs to know the distribution of RX + T
  • However, neither distribution is published
Note: this is very different from the attacks on noise-addition data perturbation [Kargupta03]
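A rough sketch of the effect of the noise component, assuming NumPy and an attacker who fits R by least squares over 20 known pairs. The attacker model, the sample sizes, and the function names are assumptions, not taken from the slides.

```python
import numpy as np

def estimate_rotation(X_known, Y_known):
    """Least-squares estimate of R from known (original, image) pairs stored as columns."""
    # Centering both sides cancels the translation T.
    Xc = X_known - X_known.mean(axis=1, keepdims=True)
    Yc = Y_known - Y_known.mean(axis=1, keepdims=True)
    # Solve Xc.T @ R.T ~= Yc.T in the least-squares sense.
    R_T, *_ = np.linalg.lstsq(Xc.T, Yc.T, rcond=None)
    return R_T.T

def rotation_error(sigma, seed=0, m=3, n_known=20):
    """Frobenius-norm error of the attacker's estimate of R at noise level sigma."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, n_known))
    q, r = np.linalg.qr(rng.standard_normal((m, m)))
    R = q * np.sign(np.diag(r))
    t = rng.uniform(-5.0, 5.0, size=(m, 1))
    Y = R @ X + t + sigma * rng.standard_normal((m, n_known))
    return np.linalg.norm(estimate_rotation(X, Y) - R)

error_without_noise = rotation_error(0.0)
error_with_noise = rotation_error(0.1)
print(error_without_noise, error_with_noise)
```

Without noise the estimate is exact up to rounding; with noise an error remains even though every pair is matched correctly.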

Attackers with more knowledge?
• What if attackers know a large number of original records?
  • They can accurately estimate the covariance matrix, column distribution, column range, etc., of the original data
  • Methods such as PCA can be used
• What do we do? Stop releasing any kind of data.

Benefits of geometric data perturbation
• Data utility/model accuracy is decoupled from the privacy guarantee
• Applicable to many DM algorithms
  • Distance-based clustering
  • Classification: linear, KNN, kernel, SVM, …
• Makes optimization and balancing easier
  • Almost fully preserves model accuracy, so we optimize privacy only

A randomized perturbation optimization algorithm
• Start with a random rotation
• Goal: passing tests on simulated attacks
• Not simply random: a hill-climbing method
  1. Iteratively determine R
     • Test on naïve estimation (Fast_opt)
     • Test on ICA (2nd level)
     • → find a better rotation R
  2. Append a random translation component
  3. Append an appropriate noise component
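A toy hill-climbing sketch of step 1, assuming NumPy. It is not the Fast_opt algorithm: the privacy score below (smallest per-attribute standard deviation of the difference between original and rotated values) is a hypothetical stand-in for the naïve-estimation test, and the ICA test is omitted.

```python
import numpy as np

def orthonormalize(A):
    """Orthonormal factor of A from a QR decomposition with a fixed sign convention."""
    q, r = np.linalg.qr(A)
    return q * np.sign(np.diag(r))

def naive_privacy(R, X):
    """Hypothetical score: the worst-protected attribute under naïve estimation."""
    return np.std(X - R @ X, axis=1).min()

def hill_climb(X, steps=200, step_size=0.3, seed=0):
    """Keep a perturbed rotation only when it raises the privacy score."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    R = orthonormalize(rng.standard_normal((m, m)))   # start with a random rotation
    start = best = naive_privacy(R, X)
    for _ in range(steps):
        candidate = orthonormalize(R + step_size * rng.standard_normal((m, m)))
        score = naive_privacy(candidate, X)
        if score > best:
            R, best = candidate, score
    return R, start, best

data_rng = np.random.default_rng(4)
X = data_rng.standard_normal((4, 200))   # 200 standardized records with 4 attributes
R_opt, start_score, best_score = hill_climb(X)
print(f"privacy score: {start_score:.3f} -> {best_score:.3f}")
```

Steps 2 and 3 would then append a random translation and a noise component to the selected rotation.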

Privacy guarantee: GDP
• In terms of naïve estimation and ICA-based attacks
• Use only the random rotation and translation components (R*X + T)
[Figure: privacy guarantee compared for three settings: perturbation optimized for both attacks, perturbation optimized for naïve estimation only, and worst perturbation (no optimization).]

Privacy guarantee: GDP
• In terms of distance-inference attacks
• Use all three components (R*X + T + D)
• Noise D: Gaussian $N(0, \sigma^2)$
• Assume pairs of (original, image) are identified by attackers
  • With no noise addition, privacy guarantee = 0
  • Considerably high PG at small perturbation, $\sigma = 0.1$

Data utility: GDP with noise addition
• Noise addition vs. model accuracy, noise: $N(0, 0.1^2)$
• Boolean data is more sensitive to distance perturbation

Random projection perturbation
• Random projection projects a set of data points from a high-dimensional space to a lower-dimensional subspace
• F(X) = P*X
  • X is an m*n matrix holding n records with m attributes each (m columns and n rows in the original table)
  • P is a k*m random matrix, k <= m
• Johnson-Lindenstrauss lemma: for a small number $\epsilon < 1$ there is a random projection F such that $(1-\epsilon)\|x - y\| \le \|F(x) - F(y)\| \le (1+\epsilon)\|x - y\|$, i.e. distance is approximately preserved.
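A small numerical illustration of approximate distance preservation, assuming NumPy and a Gaussian projection matrix with entries of variance 1/k. That is one standard choice; the slides do not say which distribution P is drawn from, and the sizes below are arbitrary.

```python
import numpy as np

def pairwise_distances(A):
    """Euclidean distances between the columns of A."""
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

rng = np.random.default_rng(5)
m, k, n = 1000, 300, 20                  # original dimension, projected dimension, records

X = rng.standard_normal((m, n))          # one record per column
# Entries with variance 1/k keep squared lengths unchanged in expectation.
P = rng.standard_normal((k, m)) / np.sqrt(k)
F = P @ X                                # projected data, shape (k, n)

# Ratio of projected to original distance for every distinct pair of records.
upper = np.triu_indices(n, k=1)
ratios = pairwise_distances(F)[upper] / pairwise_distances(X)[upper]
max_distortion = np.abs(ratios - 1.0).max()
print(f"distance ratios lie in [{ratios.min():.3f}, {ratios.max():.3f}]")
```

Unlike a rotation, P is not invertible when k < m, so distances are preserved only approximately, and the distortion shrinks as k grows.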
