CS573 Data Privacy and Security
Statistical Databases
Li Xiong

Today
• Statistical databases
• Definitions
• Early query restriction methods
• Output perturbation and differential privacy

Statistical Data Release
[Figure: counts of a population broken down by city, Age, and Diagnosis]
• Release a statistical summary of the data (vs. individual records)
• Useful for analysis and learning
  • Medical statistics
  • Query log statistics: frequent search terms
• Still need rigorous inference control

Statistical Database
• A statistical database is a database that provides statistics on subsets of records
• Supported statistics include SUM, MEAN, MEDIAN, COUNT, MAX, and MIN of records
• Inference control prevents inference from the statistics to individual records

Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation

Data Perturbation
[Figure omitted]

Query Restriction

Output Perturbation
[Figure: a query is issued against the database and perturbed results are returned]

Methods
• Data perturbation/anonymization
• Query restriction
  • Query set size control
  • Query set overlap control
  • Query auditing
• Output perturbation

Query Set Size Control
• Query-set size control restricts the number of records that the query set may contain
• A query result is released only if the size of the query set |C| satisfies K <= |C| <= L - K, where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2
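The size condition above can be sketched in a few lines. This is a minimal illustration, not a production inference-control layer; the function name `size_controlled_count` and the in-memory list of records are assumptions made for the example.

```python
def size_controlled_count(records, predicate, k):
    """Answer a COUNT query only if K <= |C| <= L - K (hypothetical helper)."""
    total = len(records)                      # L: size of the database
    if not 0 <= k <= total / 2:
        raise ValueError("K must satisfy 0 <= K <= L/2")
    query_set_size = sum(1 for rec in records if predicate(rec))  # |C|
    if k <= query_set_size <= total - k:
        return query_set_size
    return None                               # query refused


ages = [23, 35, 42, 42, 51, 58, 61, 64, 70, 77]
allowed = size_controlled_count(ages, lambda age: age < 55, k=2)
refused_small = size_controlled_count(ages, lambda age: age < 30, k=2)
refused_large = size_controlled_count(ages, lambda age: age < 75, k=2)
```

The upper bound L - K matters because a query set that covers almost everything leaks its small complement just as a tiny query set leaks itself.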

Query Set Size Control

Tracker
• Q1: Count(Sex = Female) = A
• Q2: Count(Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC)) = B
What if B = A+1?

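Tracker
• Q1: Count(Sex = Female) = A
• Q2: Count(Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC)) = B
If B = A+1, exactly one record matches (Age = 42 & Sex = Male & Employer = ABC)
• Q3: Count(Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC & Diagnosis = Schizophrenia))
Positively or negatively compromised! Q3 = A+1 means the individual has the diagnosis; Q3 = A means he does not.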
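The tracker queries can be replayed on a toy table to see the compromise happen. The six records below are invented for illustration, and the helper names (`count`, `is_female`, `is_target`) are assumptions, not part of the lecture.

```python
# Toy database; every record here is made up for the example.
records = [
    {"sex": "F", "age": 30, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "F", "age": 42, "employer": "ABC", "diagnosis": "Asthma"},
    {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 35, "employer": "ABC", "diagnosis": "Flu"},
    {"sex": "F", "age": 51, "employer": "DEF", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 42, "employer": "XYZ", "diagnosis": "Asthma"},
]


def count(predicate):
    """COUNT query over the toy table."""
    return sum(1 for rec in records if predicate(rec))


def is_female(rec):
    return rec["sex"] == "F"


def is_target(rec):
    return rec["age"] == 42 and rec["sex"] == "M" and rec["employer"] == "ABC"


a = count(is_female)                                     # Q1
b = count(lambda r: is_female(r) or is_target(r))        # Q2
q3 = count(lambda r: is_female(r)
           or (is_target(r) and r["diagnosis"] == "Schizophrenia"))  # Q3

uniquely_identified = (b == a + 1)   # the formula isolates one person
has_diagnosis = (q3 == a + 1)        # positive compromise if True
```

None of the three query sets is small, so a lower bound on query-set size alone would not have refused them.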

Query Set Size Control
• If the threshold K is large, too many queries are restricted
• It still does not guarantee protection from compromise
• The database can be easily compromised within 4-5 queries

Query Set Overlap Control
• Basic idea: each new query is checked against earlier queries for the number of records they have in common
• If the number of common records with any earlier query exceeds a given threshold, the requested statistic is not released
• A query q(C) is allowed only if |q(C) ∩ q(D)| ≤ r, r > 0, for every earlier query q(D), where r is set by the administrator
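A minimal sketch of the overlap check, assuming a single user and an in-memory table. The class name `OverlapControlledDB` and its methods are hypothetical; the history of answered query sets is kept as sets of record indices.

```python
class OverlapControlledDB:
    """Refuse a COUNT query whose query set overlaps an earlier one by more than r."""

    def __init__(self, records, r):
        self.records = records
        self.r = r
        self.answered = []          # query sets (record indices) already released

    def count(self, predicate):
        query_set = frozenset(
            i for i, rec in enumerate(self.records) if predicate(rec)
        )
        # Compare against every earlier query set: this is the overhead
        # that grows with the length of the query history.
        if any(len(query_set & earlier) > self.r for earlier in self.answered):
            return None
        self.answered.append(query_set)
        return len(query_set)


db = OverlapControlledDB(list(range(10)), r=1)
first = db.count(lambda x: x < 4)      # query set {0, 1, 2, 3}
subset = db.count(lambda x: x < 2)     # overlaps the first set in 2 records
other = db.count(lambda x: x >= 3)     # overlaps the first set in 1 record
```

Refused queries are not added to the history here; whether they should be is a design choice this sketch does not settle.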

Query Set Overlap Control
• Statistics for a set and its subset cannot both be released, which limits usefulness
• High processing overhead: every new query is compared with all previous ones
• Multiple users: need to keep a profile per user and to consider collusion between users
• Still no formal privacy guarantee

Auditing
• Keep up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued
• Excessive computation and storage requirements
• "Efficient" methods exist only for special types of queries

Audit Expert (Chin 1982)
• Query auditing method for SUM queries
• A SUM query can be considered as a linear equation a_1 x_1 + a_2 x_2 + ... + a_L x_L = q, where a_i (0 or 1) indicates whether record i belongs to the query set, x_i is the sensitive value, and q is the query result
• A set of SUM queries can be thought of as a system of linear equations
• Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued
• A row with all 0s except for the i-th column indicates disclosure
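The linear-algebra view can be sketched as follows. This is a simplified illustration of the idea, not Chin's original algorithm or its complexity: it re-reduces the stored matrix on each query with exact fractions, and refuses a query if the reduced matrix would contain a row with a single non-zero entry. The names `rref` and `SumAuditor` are assumptions.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form with exact arithmetic; zero rows are dropped."""
    m = [[Fraction(v) for v in row] for row in rows]
    pivot_row = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        src = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if src is None:
            continue
        m[pivot_row], m[src] = m[src], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return m[:pivot_row]


class SumAuditor:
    """Keeps only linearly independent SUM queries, as 0/1 membership vectors."""

    def __init__(self, n_records):
        self.n = n_records
        self.basis = []

    def allow(self, query_vector):
        if len(query_vector) != self.n:
            raise ValueError("query vector must have one entry per record")
        candidate = rref(self.basis + [list(query_vector)])
        # A row with exactly one non-zero entry pins down a single x_i.
        if any(sum(1 for v in row if v != 0) == 1 for row in candidate):
            return False
        self.basis = candidate
        return True


auditor = SumAuditor(4)
ok1 = auditor.allow([1, 1, 0, 0])      # x1 + x2
ok2 = auditor.allow([0, 0, 1, 1])      # x3 + x4
ok3 = auditor.allow([1, 1, 1, 1])      # dependent on the first two, adds nothing
denied = auditor.allow([1, 1, 1, 0])   # would reveal x3 and x4
```

A unit vector lies in the row space exactly when it appears as a row of the reduced form, which is why checking the reduced rows is sufficient in this sketch.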

Audit Expert
• Only stores linearly independent queries
• Not all queries are linearly independent: below, Q1 = Q2 + Q3
  Q1: Sum(Sex=M)
  Q2: Sum(Sex=M AND Age>20)
  Q3: Sum(Sex=M AND Age<=20)

Audit Expert
• O(L^2) time complexity
• Further work reduced this to O(L) time and space when the number of queries < L
• Only for SUM queries

Auditing: recent developments
• Online auditing
  • "Detect and deny" queries that violate the privacy requirement
  • Denials themselves may implicitly disclose sensitive information
• Offline auditing
  • Checks whether a privacy requirement has been violated after the queries have been executed
  • Detects violations but does not prevent them

Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation
  • Differential privacy

Differential Privacy
• Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set
• E.g.: Q = select count() where Age = [20,30] and Diagnosis = B
[Figure: a user sends Q through an output perturbation interface; the answer is A(D1) on D1 (Bob in) and A(D2) on D2 (Bob out)]
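"Formally indistinguishable" is usually made precise with the standard definition below. The symbol \alpha for the privacy parameter is an assumption chosen to match the "alpha" budget in the later histogram slides; much of the literature writes \varepsilon instead.

```latex
% A randomized mechanism A is \alpha-differentially private if, for all
% data sets D_1 and D_2 that differ in a single record and for every set
% S of possible outputs,
\[
  \Pr[A(D_1) \in S] \;\le\; e^{\alpha} \, \Pr[A(D_2) \in S].
\]
```

A smaller \alpha forces the two output distributions closer together, so an observer learns less about whether Bob's record is present.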

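Differential Privacy
• Differential privacy
• Laplace mechanism: return Q(D) + Y, where Y is drawn from a Laplace distribution whose scale is the query sensitivity divided by the privacy parameter
• Query sensitivity: the largest change in Q(D) caused by adding or removing one record
[Figure: a user sends Q through a differentially private interface; A(D1) = Q(D1) + Y1 on D1 (Bob in) and A(D2) = Q(D2) + Y2 on D2 (Bob out)]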
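A minimal sketch of the Laplace mechanism using only the standard library. It assumes the usual calibration, noise scale = sensitivity / alpha, and samples Laplace noise as the difference of two exponential draws; the function names and the `alpha` parameter name are choices made for this example.

```python
import random


def laplace_noise(scale, rng=random):
    """Laplace(0, scale) sample: the difference of two Exp(1/scale) draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, alpha, rng=random):
    """Return Q(D) + Y with Y ~ Laplace(sensitivity / alpha)."""
    if alpha <= 0 or sensitivity <= 0:
        raise ValueError("alpha and sensitivity must be positive")
    return true_answer + laplace_noise(sensitivity / alpha, rng)


# A COUNT query changes by at most 1 when one record is added or removed,
# so its sensitivity is 1.
rng = random.Random(0)
samples = [laplace_mechanism(100, sensitivity=1, alpha=0.5, rng=rng)
           for _ in range(20000)]
mean_answer = sum(samples) / len(samples)
mean_abs_error = sum(abs(s - 100) for s in samples) / len(samples)
```

The expected absolute error equals the noise scale, so halving alpha doubles the typical error of each answer.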

Composition of Differential Privacy
• Sequential composition [McSherry SIGMOD 09]
  • Let each M_i provide alpha_i-differential privacy. The sequence of M_i provides (sum of alpha_i)-differential privacy
• Parallel composition
  • If D_i are disjoint subsets of the original database and M_i provides alpha-differential privacy for each D_i, then the sequence of M_i provides alpha-differential privacy
[Figure: a user sends Q1, Q2, ... through a differentially private interface; the answers are A1(D1), A2(D1), ... on D1 (Bob in) and A1(D2), A2(D2), ... on D2 (Bob out)]
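The two composition rules translate into simple budget arithmetic. The sketch below assumes the standard statements (sequential costs add; parallel cost over disjoint subsets is the largest single cost); `PrivacyBudget` is a hypothetical accountant, not an API from any library.

```python
import math


def sequential_cost(alphas):
    """Total privacy cost of mechanisms run on the same data."""
    return sum(alphas)


def parallel_cost(alphas):
    """Total privacy cost of mechanisms run on disjoint subsets of the data."""
    return max(alphas)


class PrivacyBudget:
    """Tracks sequential spending against a fixed total budget."""

    def __init__(self, total):
        self.total = total
        self.spent = 0.0

    def spend(self, alpha):
        if alpha <= 0:
            raise ValueError("alpha must be positive")
        if self.spent + alpha > self.total + 1e-12:
            return False            # query refused: budget exhausted
        self.spent += alpha
        return True


budget = PrivacyBudget(total=1.0)
results = [budget.spend(0.4), budget.spend(0.4), budget.spend(0.4)]
```

The third request is refused because 1.2 would exceed the total of 1.0, which is the sense in which the cost accumulates with the number of queries.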

Differential Privacy
• Is unfettered access to raw data truly essential?
• Is released data sufficient (does it provide a sufficient utility guarantee)?
[Figure: raw data passes through a privacy mechanism to produce released data for the user, shown as counts over city, Age, and Diagnosis]

Challenges
• Differential privacy cost accumulates quickly with the number of queries
• Typical tasks require multiple queries or multiple steps
• Need to support multiple users
• Impossible to guarantee utility for all (any) data or all (any) applications

Possible Middle Ground
• Guaranteed utility for certain applications
  • Counting queries, classification, logistic regression
• Guaranteed utility for certain kinds of data
  • Use prior or domain knowledge about data
  • Use intermediate results (differentially private)
[Figure: raw data passes through a privacy mechanism to released data for the user; target applications, prior or domain knowledge, and intermediate results feed the mechanism]

Our Research: Adaptive Differentially Private Data Release
• Data knowledge
  • Dense and "smooth" data
  • High dimensional and sparse data
  • Dynamic data
• Application knowledge
  • Query workload
  • Specific tasks

Histogram Example
[Figure omitted]

Strategy I: Baseline Cell Partitioning
[Figure: a histogram over Age (20, 30) and Diagnosis (A, B); each cell count is queried through the DP interface with privacy budget alpha to produce a noisy histogram]
Q1: count() where Age = 20, Diagnosis = A
Q2: count() where Age = 20, Diagnosis = B
...
• Goal: to release a differentially private histogram to support random predicate queries
• Q: select count() where Age = [20,30] and Income = 40K
• If a query predicate consists of multiple cells or partitions, it will have aggregated perturbation error
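A minimal sketch of the baseline strategy, assuming each cell count has sensitivity 1 and that the disjoint cells can each use the full budget alpha under parallel composition. The cell counts and the helper names `noisy_histogram` and `range_count` are invented for the example.

```python
import random


def noisy_histogram(cell_counts, alpha, rng=random):
    """Add independent Laplace(1/alpha) noise to every cell count."""
    rate = alpha  # expovariate takes the rate 1/scale, and scale = 1/alpha
    return {
        cell: count + rng.expovariate(rate) - rng.expovariate(rate)
        for cell, count in cell_counts.items()
    }


def range_count(histogram, ages, diagnoses):
    """Answer a predicate query by summing the (noisy) cells it covers."""
    return sum(
        value for (age, diagnosis), value in histogram.items()
        if age in ages and diagnosis in diagnoses
    )


# Hypothetical true counts, keyed by (Age, Diagnosis).
true_counts = {(20, "A"): 10, (20, "B"): 20, (30, "A"): 40, (30, "B"): 50}
released = noisy_histogram(true_counts, alpha=0.5, rng=random.Random(1))

one_cell = range_count(released, ages={20}, diagnoses={"A"})
four_cells = range_count(released, ages={20, 30}, diagnoses={"A", "B"})
```

Each covered cell contributes its own noise term, so the variance of `four_cells` is four times that of `one_cell`; this is the aggregated perturbation error for multi-cell predicates.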

Strategy II: Hierarchical Partitioning
[Figure: a three-level hierarchy of partitions over Age (20, 30) and Diagnosis (A, B); each level receives privacy budget alpha/3]
• Large perturbation error due to small divided privacy budget at each level
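The effect of splitting the budget can be quantified with the variance of Laplace noise, which is 2 * scale^2. The sketch assumes sensitivity 1 at every level and an even split of alpha across levels, as in the alpha/3 labels; the function names are hypothetical.

```python
import math


def laplace_variance(sensitivity, alpha):
    """Variance of Laplace noise with scale sensitivity / alpha."""
    scale = sensitivity / alpha
    return 2.0 * scale ** 2


def per_level_alpha(total_alpha, levels):
    """Even split of the budget across hierarchy levels (sequential composition)."""
    return total_alpha / levels


flat_variance = laplace_variance(1.0, alpha=1.0)
level_variance = laplace_variance(1.0, alpha=per_level_alpha(1.0, levels=3))
ratio = level_variance / flat_variance
```

With three levels each count is about nine times noisier in variance than a single-level release with the same total budget, because the variance grows with the square of the number of levels.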
