CS573 Data Privacy and Security
Statistical Databases
Li Xiong
Today
• Statistical databases
• Definitions
• Early query restriction methods
• Output perturbation and differential privacy
Statistical Data Release
[Figure: population counts by city, Age, and Diagnosis]
• Release statistical summary of the data (vs. individual records)
• Useful for analysis and learning
• Medical statistics
• Query log statistics: frequent search terms
• Still need rigorous inference control
Statistical Database
• A statistical database is a database that provides statistics on subsets of records
• Statistics such as SUM, MEAN, MEDIAN, COUNT, MAX, and MIN may be computed over records
• Inference control prevents inference from statistics to individual records
Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation
Data Perturbation
Query Restriction
Output Perturbation
[Figure: users send queries and receive results]
Methods
• Data perturbation/anonymization
• Query restriction
  • Query set size control
  • Query set overlap control
  • Query auditing
• Output perturbation
Query Set Size Control
• A query-set size control limits the number of records that must be in the result set
• The query result is released only if the size of the query set |C| satisfies K <= |C| <= L - K, where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2
Query Set Size Control
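The size condition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the list-of-dicts record layout, and the choice to return `None` for a refused query are all assumptions made for the example.

```python
def query_set_size_allowed(query_set_size: int, db_size: int, k: int) -> bool:
    """Return True when K <= |C| <= L - K (the condition on the slide)."""
    if not 0 <= k <= db_size / 2:
        raise ValueError("K must satisfy 0 <= K <= L/2")
    return k <= query_set_size <= db_size - k


def restricted_count(records, predicate, k):
    """Answer a COUNT query only if its query set passes the size check.

    `records` is a list of dicts and `predicate` a function record -> bool;
    both are hypothetical stand-ins for a real query interface.
    """
    size = sum(1 for r in records if predicate(r))
    if query_set_size_allowed(size, len(records), k):
        return size
    return None  # query refused
```

Both very small and very large query sets are refused, because the complement of a large set is a small set and would leak the same information.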
Tracker
• Q1: Count ( Sex = Female ) = A
• Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B
What if B = A+1?
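Tracker
• Q1: Count ( Sex = Female ) = A
• Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B
If B = A+1, exactly one record matches (Age = 42 & Sex = Male & Employer = ABC)
• Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) & Diagnosis = Schizophrenia)
Comparing Q3 with A reveals whether that individual has the diagnosis: positively or negatively compromised!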
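The three tracker queries can be replayed on toy data. The sketch below is only an illustration: the records are invented, the helper names are hypothetical, and Q3 is read with `&` binding tighter than `OR`, which is an assumption about the slide's intended precedence.

```python
# Invented toy records; a real database would be far larger.
RECORDS = [
    {"sex": "F", "age": 30, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "F", "age": 42, "employer": "ABC", "diagnosis": "Asthma"},
    {"sex": "F", "age": 55, "employer": "ABC", "diagnosis": "Flu"},
    {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 35, "employer": "XYZ", "diagnosis": "Flu"},
]


def count(records, predicate):
    """COUNT query: number of records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r))


def is_target(r):
    """The identifying clause from Q2 and Q3."""
    return r["age"] == 42 and r["sex"] == "M" and r["employer"] == "ABC"


def tracker_attack(records, sensitive_value):
    """Return True/False for positive/negative compromise, None if B != A + 1."""
    a = count(records, lambda r: r["sex"] == "F")                  # Q1
    b = count(records, lambda r: r["sex"] == "F" or is_target(r))  # Q2
    if b != a + 1:
        return None  # the clause does not isolate exactly one individual
    q3 = count(
        records,
        lambda r: r["sex"] == "F"
        or (is_target(r) and r["diagnosis"] == sensitive_value),
    )
    return q3 == a + 1  # Q3 = A + 1: has the value; Q3 = A: does not
```

Every query here is padded with the `Sex = Female` group, so in a realistic database none of the query sets would be small, which is why a size threshold alone does not stop the attack.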
Query set size control
• If the threshold value K is large, it restricts too many queries
• It still does not guarantee protection from compromise
• The database can be easily compromised within a frame of 4-5 queries
Query Set Overlap Control
• Basic idea: each new query is checked against previous queries for the number of records they have in common.
• If the number of records in common with any previous query exceeds a given threshold, the requested statistic is not released.
• A query q(C) is only allowed if |q(C) ∩ q(D)| ≤ r, r > 0, for every previously answered query q(D), where r is set by the administrator
Query-set-overlap control
• Statistics for a set and its subset cannot both be released, limiting usefulness
• High processing overhead: every new query is compared with all previous ones
• Multiple users: need to keep a user profile and consider collusion between users
• Still no formal privacy guarantee
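The overlap check and the history it requires can be sketched as follows. This is a simplified illustration for a single user: the class name, the representation of a query set as a set of record ids, and the decision to log only answered queries are assumptions.

```python
class OverlapController:
    """Refuse a query whose query set shares more than r records
    with any previously answered query set."""

    def __init__(self, r: int):
        if r <= 0:
            raise ValueError("r must be positive")
        self.r = r
        self.history = []  # query sets (frozensets of record ids) already answered

    def request(self, query_set) -> bool:
        query_set = frozenset(query_set)
        # Every new query is compared with all previous ones: O(#queries) set
        # intersections per request, which is the overhead noted above.
        if any(len(query_set & previous) > self.r for previous in self.history):
            return False
        self.history.append(query_set)
        return True
```

For example, with `r = 1`, answering `{1, 2, 3}` and then requesting its subset `{1, 2}` is refused, which shows why a set and its subset cannot both be released.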
Auditing
• Keep up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued
• Excessive computation and storage requirements
• “Efficient” methods exist only for special types of queries
Audit Expert (Chin 1982)
• Query auditing method for SUM queries
• A SUM query can be considered as a linear equation a1 x1 + a2 x2 + ... + aL xL = q, where ai is whether record i belongs to the query set (1 if it does, 0 otherwise), xi is the sensitive value, and q is the query result
• A set of SUM queries can be thought of as a system of linear equations
• Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued
• A row with all 0s except for the i-th column indicates disclosure
Audit Expert
• Only stores linearly independent queries
• Not all queries are linearly independent, e.g. Q1 = Q2 + Q3:
Q1: Sum(Sex=M)
Q2: Sum(Sex=M AND Age>20)
Q3: Sum(Sex=M AND Age<=20)
Audit Expert
• O(L^2) time complexity
• Further work reduced this to O(L) time and space when the number of queries < L
• Only for SUM queries
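The linear-algebra view of SUM auditing can be sketched with exact rational arithmetic. This is a simplified illustration of the idea, not Chin's actual algorithm or its complexity: each query is given as a 0/1 indicator list over the records, a row-reduced basis of the answered queries is kept, and a query is refused if answering it would put a row with a single nonzero entry into that basis. All names are hypothetical.

```python
from fractions import Fraction


def _pivot(row):
    """Index of the first nonzero entry of a nonzero row."""
    return next(i for i, v in enumerate(row) if v != 0)


def add_to_basis(basis, indicator):
    """Return the reduced basis after adding `indicator`; unchanged if dependent."""
    vec = [Fraction(v) for v in indicator]
    for row in basis:  # eliminate existing pivot columns from the new vector
        p = _pivot(row)
        if vec[p] != 0:
            factor = vec[p]
            vec = [v - factor * r for v, r in zip(vec, row)]
    if all(v == 0 for v in vec):
        return basis  # linearly dependent on answered queries: nothing to store
    p = _pivot(vec)
    lead = vec[p]
    vec = [v / lead for v in vec]
    new_basis = []
    for row in basis:  # clear the new pivot column from the stored rows
        if row[p] != 0:
            factor = row[p]
            row = [r - factor * v for r, v in zip(row, vec)]
        new_basis.append(row)
    new_basis.append(vec)
    return new_basis


def discloses(basis):
    """True if some row has exactly one nonzero entry (one x_i is determined)."""
    return any(sum(1 for v in row if v != 0) == 1 for row in basis)


class SumAuditor:
    def __init__(self):
        self.basis = []  # only linearly independent queries are stored

    def request(self, indicator) -> bool:
        candidate = add_to_basis(self.basis, indicator)
        if discloses(candidate):
            return False  # answering would reveal an individual value
        self.basis = candidate
        return True
```

With four records, answering the sums over `[1, 1, 1, 1]` and `[1, 1, 0, 0]` is allowed, but `[1, 1, 1, 0]` is then refused because the three answers together would determine the third and fourth values.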
Auditing: recent developments
• Online auditing
  • “Detect and deny” queries that violate a privacy requirement
  • Denials themselves may implicitly disclose sensitive information
• Offline auditing
  • Check whether a privacy requirement has been violated after the queries have been executed
  • Detects violations; does not prevent them
Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation
  • Differential privacy
Differential Privacy
• Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set
• E.g.: Q = select count() where Age = [20,30] and Diagnosis = B
[Figure: a user sends Q through output perturbation; D1 (Bob in) yields A(D1), D2 (Bob out) yields A(D2)]
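Differential Privacy
• Differential privacy
• Laplace mechanism: A(D) = Q(D) + Y, where Y is drawn from a zero-mean Laplace distribution whose scale is the query sensitivity divided by the privacy parameter
• Query sensitivity: the maximum change in the result of Q when one record is added or removed
[Figure: a user sends Q through a differentially private interface; D1 (Bob in) yields A(D1) = Q(D1) + Y1, D2 (Bob out) yields A(D2) = Q(D2) + Y2]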
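A minimal sketch of the Laplace mechanism for a COUNT query, using only the Python standard library. The privacy parameter is called `alpha` here to match the budget labels on the later histogram slides. The function names and record layout are assumptions, and the sampler draws Laplace noise as the difference of two exponential variates.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Zero-mean Laplace sample: difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_count(records, predicate, alpha: float, rng=None) -> float:
    """Return Q(D) + Y for a COUNT query Q.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / alpha.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / alpha, rng)
```

A smaller `alpha` means stronger privacy and a larger noise scale, so individual answers become less accurate even though the noise averages to zero.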
Composition of Differential Privacy
• Sequential composition [McSherry SIGMOD 09]
  • Let each Mi provide αi-differential privacy. The sequence of Mi provides (Σ αi)-differential privacy
• Parallel composition
  • If Di are disjoint subsets of the original database and Mi provides α-differential privacy for each Di, then the sequence of Mi provides α-differential privacy.
[Figure: a user sends Q1, Q2, … through a differentially private interface; D1 (Bob in) yields A1(D1), A2(D1), …; D2 (Bob out) yields A1(D2), A2(D2), …]
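The two composition rules translate into simple budget accounting. The sketch below assumes each mechanism Mi carries its own privacy parameter (written `alphas[i]`, an assumed notation): mechanisms run on the same data cost the sum of their parameters, while mechanisms run on disjoint subsets cost only the largest one. The class and function names are hypothetical.

```python
def sequential_cost(alphas):
    """Mechanisms applied to the same data: the parameters add up."""
    return sum(alphas)


def parallel_cost(alphas):
    """Mechanisms applied to disjoint subsets: the largest parameter dominates."""
    return max(alphas)


class BudgetAccountant:
    """Tracks how much of a total privacy budget has been spent."""

    def __init__(self, total: float):
        self.total = total
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total - self.spent

    def charge(self, cost: float) -> bool:
        """Record the cost if it fits in the remaining budget, else refuse."""
        if self.spent + cost > self.total:
            return False
        self.spent += cost
        return True
```

For example, two queries at 0.25 each on the whole database cost 0.5 in total, whereas three queries at 0.5 each on disjoint partitions cost only 0.5.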
Differential Privacy
• Is unfettered access to raw data truly essential?
• Is released data sufficient (does it provide a sufficient utility guarantee)?
[Figure: raw data passes through a privacy mechanism to produce released data for the user; histogram of count over city, Age, and Diagnosis]
Challenges
• Differential privacy cost accumulates quickly with the number of queries
• Typical tasks require multiple queries or multiple steps
• Need to support multiple users
• Impossible to guarantee utility for all (any) data or all (any) applications
Possible Middle Ground
• Guaranteed utility for certain applications
  • Counting queries, classification, logistic regression
• Guaranteed utility for certain kinds of data
  • Use prior or domain knowledge about data
  • Use intermediate results (differentially private)
[Figure: raw data passes through a privacy mechanism to produce released data for the user; target applications, prior or domain knowledge, and intermediate results feed into the mechanism]
Our Research: Adaptive Differentially Private Data Release
• Data knowledge
  • Dense and “smooth” data
  • High dimensional and sparse data
  • Dynamic data
• Application knowledge
  • Query workload
  • Specific tasks
Histogram Example
Strategy I: Baseline Cell Partitioning
[Figure: original histogram over Age (20, 30) and Diagnosis (A, B) and its differentially private (DP) version, each cell perturbed with budget alpha. Q1: count() where Age = 20, Diagnosis = A; Q2: count() where Age = 20, Diagnosis = B]
• Goal: to release a differentially private histogram to support random predicate queries
• Q: select count() where Age = [20,30] and Income = 40K
• If a query predicate covers multiple cells or partitions, its answer accumulates the perturbation error of every cell it covers
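The baseline strategy and its aggregated error can be sketched as below. This is an illustration under stated assumptions: each cell count gets independent Laplace noise of scale 1/alpha (the cells are disjoint, so by parallel composition every cell can use the full budget), the histogram is a dict keyed by hypothetical (age, diagnosis) pairs, and the variance formula uses the fact that a Laplace variable with scale b has variance 2b^2.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Zero-mean Laplace sample: difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_cell_histogram(cell_counts, alpha: float, rng: random.Random):
    """Perturb every cell count independently with Laplace(1 / alpha) noise."""
    return {
        cell: count + laplace_noise(1.0 / alpha, rng)
        for cell, count in cell_counts.items()
    }


def answer_query(noisy_hist, cells):
    """Answer a predicate query by summing the noisy cells it covers."""
    return sum(noisy_hist[c] for c in cells)


def query_noise_variance(n_cells: int, alpha: float) -> float:
    """Noise variance of a query that covers n_cells cells."""
    return n_cells * 2.0 / alpha ** 2
```

The variance grows linearly with the number of cells a query touches, which is the aggregated perturbation error noted in the last bullet.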
Strategy II: Hierarchical Partitioning
[Figure: histogram over Age (20, 30) and Diagnosis (A, B) partitioned hierarchically into three levels, each level perturbed with budget alpha/3]
• Large perturbation error due to the small privacy budget allotted to each level
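The cost of dividing the budget can be made concrete with a short calculation. The sketch assumes the total budget alpha is split evenly across the levels of the hierarchy (alpha/3 for three levels, as in the figure) and that each node count has sensitivity 1; the function names are hypothetical.

```python
def per_level_budget(alpha: float, levels: int) -> float:
    """Even split of the total budget across hierarchy levels."""
    return alpha / levels


def node_noise_variance(alpha: float, levels: int) -> float:
    """Variance of the Laplace noise on one node count.

    The noise scale is levels / alpha, i.e. 1 / per_level_budget(alpha, levels),
    and a Laplace variable with scale b has variance 2 * b**2.
    """
    scale = levels / alpha
    return 2.0 * scale ** 2
```

With three levels the per-node variance is nine times that of a single-level release with the same total budget, so the hierarchy only pays off when a query can be answered from a few high-level nodes instead of many cells.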