Data Anonymization that Leads to the Most Accurate Estimates of Statistical Characteristics

Gang Xiang (1) and Vladik Kreinovich (2)

(1) Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA, gxiang@sigmaxi.net
(2) Department of Computer Science, University of Texas at El Paso, El Paso, TX 79968, USA, vladik@utep.edu
1. Need to Preserve Privacy

• One of the main objectives of engineering is to help people:
  – civil engineering designs houses in which we live and roads along which we travel,
  – electrical engineering designs appliances, and electric networks that help use these appliances.
• To better serve customers, it is important to know as much as possible about the potential customers.
• Customers are reluctant to share information, since this information can be potentially used against them.
• For example, age can be used by companies to (unlawfully) discriminate against older job applicants.
• It is thus important to preserve privacy when storing customer data.
2. How to Preserve Privacy: k-Anonymity and ℓ-Diversity

• To maintain privacy, we divide the space of all possible combinations of values (x_1, ..., x_n) into boxes.
• For each record, instead of storing the actual values x_i, we only store the label of the box containing x.
• To avoid further loss of privacy, it is important to make sure that location in a box does not identify a person.
• This is usually achieved by requiring that, for some fixed k, each box contains at least k records.
• It is also not good if all records within a box have the same value of an i-th quantity x_i.
• It is thus required that, for some ℓ, in each box there are at least ℓ different values of each x_i.
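As an illustration (not from the slides), a minimal Python sketch of this boxing step, assuming a uniform grid and illustrative function names: each record is replaced by the label of its grid box, and the partition is accepted only if every non-empty box satisfies k-anonymity and ℓ-diversity.

    # Sketch: grid-based anonymization with k-anonymity / ell-diversity checks.
    import numpy as np
    from collections import defaultdict

    def anonymize(records, widths, k=5, ell=2):
        records = np.asarray(records, dtype=float)        # shape (N, n)
        widths = np.asarray(widths, dtype=float)          # full box widths, i.e. 2*Delta_i
        labels = np.floor(records / widths).astype(int)   # box label for each record
        boxes = defaultdict(list)
        for rec, lab in zip(records, labels):
            boxes[tuple(lab)].append(rec)
        for members in boxes.values():
            members = np.array(members)
            if len(members) < k:                          # k-anonymity violated
                return None
            if any(len(np.unique(members[:, i])) < ell for i in range(members.shape[1])):
                return None                               # ell-diversity violated
        return [tuple(lab) for lab in labels]             # only box labels are stored

A caller would try different grids and keep only partitions for which anonymize(...) does not return None.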
3. Statistical Data Processing

• Our main objective is to predict the desired characteristic x_{i_0}.
• In most cases, the dependence is linear, so we must find coefficients c_q such that
  x_{i_0} \approx c_0 + \sum_{q=1}^{m} c_q \cdot x_{i_q}.
• The Least Squares approach leads to:
  \sum_{r=1}^{m} c_r \cdot C_{i_q i_r} = C_{i_0 i_q}, \qquad c_0 = E_{i_0} - \sum_{q=1}^{m} c_q \cdot E_{i_q}.
• We also want to know which quantities are correlated, i.e., we want to estimate
  \rho_{ij} = \frac{C_{ij}}{\sigma_i \cdot \sigma_j}.
• In all these tasks, we need to estimate averages E_i, variances V_i = \sigma_i^2, covariances C_{ij}, and correlations \rho_{ij}.
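A short sketch (with hypothetical helper names) of this least-squares step: solve the normal equations on the sample covariance matrix for c_1, ..., c_m, then recover the intercept c_0 from the means.

    import numpy as np

    def linear_predictor(data, i0, predictors):
        data = np.asarray(data, dtype=float)          # shape (N, n)
        E = data.mean(axis=0)                         # sample means E_i
        C = np.cov(data, rowvar=False, bias=True)     # sample covariances C_ij (1/N version)
        A = C[np.ix_(predictors, predictors)]         # matrix of C_{i_q i_r}
        b = C[i0, predictors]                         # right-hand side C_{i_0 i_q}
        c = np.linalg.solve(A, b)                     # coefficients c_1, ..., c_m
        c0 = E[i0] - c @ E[predictors]                # intercept c_0
        return c0, c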
4. Statistical Characteristics: Reminder

• The means are usually estimated as follows:
  E_i = \frac{1}{N} \cdot \sum_{p=1}^{N} x_i^{(p)}, \qquad E_j = \frac{1}{N} \cdot \sum_{p=1}^{N} x_j^{(p)}.
• The covariance is usually estimated as:
  C_{ij} = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right) \cdot \left(x_j^{(p)} - E_j\right).
• The variance is usually estimated as:
  V_i = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right)^2.
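The estimators above translate directly into code; the following sketch (function name is illustrative) uses the same 1/N normalization as the slide.

    import numpy as np

    def sample_statistics(data):
        x = np.asarray(data, dtype=float)        # shape (N, n)
        N = x.shape[0]
        E = x.sum(axis=0) / N                    # means E_i
        centered = x - E
        C = centered.T @ centered / N            # covariances C_ij (1/N version)
        V = np.diag(C)                           # variances V_i = sigma_i^2
        sigma = np.sqrt(V)
        rho = C / np.outer(sigma, sigma)         # correlations rho_ij
        return E, V, C, rho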
5. In Statistical Data Processing, Privacy Leads to Uncertainty

• To maintain privacy, we replace each numerical value x_i^{(p)} with the corresponding interval.
• Different values from these intervals lead to different values of the resulting statistical characteristics.
• Hence, for each characteristic, we get a whole interval of possible values.
• If this interval is too wide, the resulting range is useless: e.g., [-1, 1] for correlation.
• It is therefore desirable to select:
  – among all possible subdivisions into boxes which preserve k-anonymity (and ℓ-diversity),
  – the one which leads to the narrowest intervals for the desired statistical characteristic.
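A small illustrative example (mine, not from the slides): when each value is only known to lie in an interval [lo, hi], even the sample mean becomes an interval; since the mean is monotone in each value, its range is attained at the endpoint configurations.

    import numpy as np

    lo = np.array([20.0, 30.0, 40.0])    # lower endpoints of the anonymized intervals
    hi = np.array([29.0, 39.0, 49.0])    # upper endpoints
    mean_range = (lo.mean(), hi.mean())  # range of the mean: (30.0, 39.0)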
6. Estimating Accuracy Caused by Privacy-Based Subdivision into Boxes: Case of k-Anonymity

• To minimize uncertainty, we select the smallest boxes.
• Hence, each box B should have exactly k records.
• For intervals [\tilde x_i - \Delta_i, \tilde x_i + \Delta_i], instead of C(x_1^{(1)}, \ldots, x_n^{(N)}), we get
  C\!\left(\tilde x_1^{(1)} + \Delta x_1^{(1)}, \ldots, \tilde x_n^{(N)} + \Delta x_n^{(N)}\right), where |\Delta x_i^{(p)}| \le \Delta_i.
• When we have many records, boxes are small, so we can use a linear approximation:
  C = \tilde C + \sum_{p=1}^{N} \sum_{i=1}^{n} \frac{\partial C}{\partial x_i} \cdot \Delta x_i^{(p)}.
• The range of this linear expression is [\tilde C - \Delta, \tilde C + \Delta], where
  \Delta \stackrel{\text{def}}{=} \sum_{p=1}^{N} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i^{(p)} = k \cdot \sum_{B} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i.
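A sketch (assumed helper name) of this linearized accuracy bound: Δ is the sum, over records and coordinates, of |∂C/∂x_i| times the half-width of the box that the record fell into.

    import numpy as np

    def accuracy_bound(grad, half_widths):
        """grad, half_widths: (N, n) arrays; grad[p, i] = dC/dx_i at record p,
        half_widths[p, i] = Delta_i of the box containing record p."""
        return np.sum(np.abs(grad) * half_widths)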
7. Expressions for the Corresponding Partial Derivatives

• The estimate for the accuracy Δ is described in terms of the partial derivatives ∂C/∂x_i of the statistical characteristic C.
• For the mean E_i, the derivative is equal to \frac{\partial E_i}{\partial x_i} = \frac{1}{N}.
• For the variance V_i, we have \frac{\partial V_i}{\partial x_i} = \frac{2 \cdot (x_i - E_i)}{N}.
• Therefore, for \sigma_i = \sqrt{V_i}, we get \frac{\partial \sigma_i}{\partial x_i} = \frac{x_i - E_i}{N \cdot \sigma_i}.
• For the covariance C_{ij}, we have \frac{\partial C_{ij}}{\partial x_i} = \frac{x_j - E_j}{N}.
• For the correlation \rho_{ij}, we have:
  \frac{\partial \rho_{ij}}{\partial x_i} = \frac{1}{N} \cdot \frac{(x_j - E_j) - \dfrac{C_{ij}}{\sigma_i^2} \cdot (x_i - E_i)}{\sigma_i \cdot \sigma_j}.
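These per-record derivatives are easy to evaluate; a sketch (my helper, not the authors') that returns all of them at one data point:

    import numpy as np

    def partials(xi, xj, Ei, Ej, Cij, sigma_i, sigma_j, N):
        dE = 1.0 / N                                  # d E_i / d x_i
        dV = 2.0 * (xi - Ei) / N                      # d V_i / d x_i
        dsigma = (xi - Ei) / (N * sigma_i)            # d sigma_i / d x_i
        dC = (xj - Ej) / N                            # d C_ij / d x_i
        drho = ((xj - Ej) - (Cij / sigma_i**2) * (xi - Ei)) / (N * sigma_i * sigma_j)
        return dE, dV, dsigma, dC, drho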
8. Towards Optimal Subdivision into Boxes

• The overall expression for Δ is a sum of terms corresponding to different points.
• So, to minimize Δ, we must, for each point, minimize the corresponding term
  \sum_{i=1}^{n} a_i \cdot \Delta_i, where a_i \stackrel{\text{def}}{=} \left|\frac{\partial C}{\partial x_i}\right|.
• The only constraint on the values Δ_i is that the corresponding box should contain exactly k different points.
• The number of points can be obtained by multiplying the data density ρ(x) by the box volume \prod_{i=1}^{n} (2\Delta_i).
• The data density can be estimated based on the data.
• So, we minimize \sum_{i=1}^{n} a_i \cdot \Delta_i under the constraint
  \rho(x) \cdot 2^n \cdot \prod_{i=1}^{n} \Delta_i = k.
9. First Result: (Asymptotically) Optimal Subdivision into Boxes (Case of k-Anonymity)

• Method: the Lagrange multiplier technique leads to
  \Delta_i = \frac{c(x)}{a_i}, where a_i = \left|\frac{\partial C}{\partial x_i}\right|.
• From the constraint, we get
  c(x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{\rho(x)} \cdot \prod_{j=1}^{n} a_j}.
• Conclusion: around each point x, we need to select the box with half-widths
  \Delta_i = \frac{1}{2 \cdot a_i} \cdot \sqrt[n]{\frac{k}{\rho(x)} \cdot \prod_{j=1}^{n} a_j}.
• The resulting accuracy: \Delta = n \cdot \sum_{x} c(x), where the sum is taken over all N data points x.
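A sketch of this formula in code (my implementation, assuming the gradient and a density estimate at x are available): compute c(x) from the constraint and return the half-widths Δ_i = c(x)/a_i.

    import numpy as np

    def optimal_half_widths(grad, rho, k):
        """grad: length-n array of dC/dx_i at the point x; rho: estimated density at x."""
        a = np.abs(np.asarray(grad, dtype=float))
        n = a.size
        c = 0.5 * (k * np.prod(a) / rho) ** (1.0 / n)   # c(x) from the constraint
        return c / a                                     # half-widths Delta_i

    # Example: optimal_half_widths(grad=[0.2, 0.05], rho=0.01, k=5)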