statistical databases query auditing
play

Statistical Databases Query Auditing Li Xiong CS573 Data Privacy - PowerPoint PPT Presentation

Statistical Databases Query Auditing Li Xiong CS573 Data Privacy and Anonymity Partial slides credit: Vitaly Shmatikov, Univ Texas at Austin Query Audit Problem Maintaining privacy of data Auditor Q 1 Q 2 Q n Database


  1. Statistical Databases – Query Auditing Li Xiong CS573 Data Privacy and Anonymity Partial slides credit: Vitaly Shmatikov, Univ Texas at Austin

  2. Query Audit Problem  Maintaining “privacy” of data Auditor Q 1 Q 2 … Q n … Database Q A 1 A 2 … A n A or “Denied” Does answer to Q combined with answers to Q 1 ,…,Q n reveal something? slide 2

  3. Variations of the Problem Specifies subset of the variables Auditor Q 1 Q 2 … Q n … Database A 1 A 2 … A n List of real, integer, or Boolean values Wants to Min, max, median, sum, average, learn value of or count of specified subset some variable slide 3

  4. Offline vs. Online  Offline auditing  Given a collection of queries and answers to them, check whether anything “forbidden” was revealed  Detects privacy breaches after the fact  Online auditing  Queries are presented to auditor one at a time; auditor checks if answering the current query (in combination with past answers) reveals “forbidden” information  Prevents privacy breaches on-the-fly slide 4

  5. Disclosure measures  Full disclosure (exact-value disclosure) – the exact value of a protected attribute is disclosed  Partial disclosure (interval-based disclosure) – the disclosed range (difference of the lower and upper bounds) of the protected attribute is smaller than a predefined threshold  Probability-based disclosure – the posterior distribution of the data after answering queries is significantly different from its prior distribution

  6. Offline Auditors for Full Disclosure  Sum queries  Max and min queries  Median and average queries

  7. Audit Expert (Chin 1982)  Query auditing method for SUM queries  A SUM query can be considered as a linear equation where is whether record i belongs to the query set, xi is the sensitive value, and q is the query result  A set of SUM queries can be thought of as a system of linear equations  Maintains the binary matrix representing linearly independent queries and update it when a new query is issued  A row with all 0s except for i th column indicates disclosure

  8. Offline Auditing for Full Disclosure  Arbitrary combinations of aggregate queries  It is unlikely there will be efficient on-line application algorithms for SUM + MAX queries, SUM + MIN queries, or SUM + MAX + MIN queries.  Example hardness results: Theorem. There is no polynomial time full- disclosure auditing algorithm for sum and max queries unless P=NP.

  9. Auditing Sum Queries on Booleans  Database: collection of secret Boolean variables  Query: specifies subset S of variables  Answer: sum of variables in S  Privacy breach: after asking several queries, user learns the value of some secret variable(s)  Auditing problem: given a set of Boolean equations, is there a variable that has the same value in all solutions? Auditing Boolean attributes, Kleinberg, 2000 slide 9

  10. Why Is This Interesting?  Linear Diaphantine equations  Query can be safe on real-valued, unbounded data, but reveal information when the data are discrete, with known bounds x + y + w = 1 Real: multiple solutions, secure Boolean: unique solution, insecure (why?) y + z = 1 x + z = 1  Hardness results: the auditing problem for Boolean values is coNP-complete. slide 10

  11. Offline Auditor for Partial Disclosure  Partial disclosure – the disclosed range of the protected attribute is smaller than a predefined threshold  Sum queries  Interval-based disclosure: monitoring upper and lower bounds for all confidential attributes Auditing interval-based inference. Li et al. 2002

  12. Offline Auditor for Partial Disclosure  New query  A series of linear programming problems

  13. Incremental evaluation of the LP  Treats the auditing problem as a series of updation problems and updates the bounds with certain rules  Horizontal updation – given the same set of queries, the bounds of one variable, how can the prior result be modified to get the bounds of other variables  Vertical updatation – given the same set of variables, and bounds under the previous queries, how can the prior result modified to get the bounds when a new query arrives An efficient online auditing approach to limit private data disclosure, Lu, 2009

  14. Online Auditing Auditor Q i+1 Database A or “Denied” Previous queries “Denied” if answering Q 1 … Q i Q i+1 would cause a privacy breach slide 14

  15. Online Auditing  Given a sequence of queries and corresponding answers, and a new query, determine if the new query should be answered or denied in order to prevent a privacy breach.  Earliest approaches  Query set size control  Query set overlap control  Limited data utility

  16. Offline to Online?  Can an offline auditor directly solve the online auditing problem?  Denials leak information!

  17. Sounds Familiar? [slide stolen from Kobbi Nissim] Colonel Oliver North, on the Iran-Contra arms deal “On the advice of my counsel I respectfully and regretfully decline to answer the question based on my constitutional rights.” David Duncan, former auditor for Enron and partner in Arthur Andersen “Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States.” slide 17

  18. Example: Sum/Max  Variables d i are real, privacy breached if adversary learns some d i Auditor Gimme sum(d 1 ,d 2 ,d 3 ) Answer=15 Database Gimme max(d 1 ,d 2 ,d 3 ) “Denied” Oh well Wait… there must be a reason why second The only possible reason for query was denied denial is if d 1 =d 2 =d 3 =5 slide 18

  19. Online Audit  Denials reduce the search space Possible assignments to {d 1 ,…,d n } Assignments consistent with (q 1 ,…q i ; a 1 ,…,a i ) q i+1 denied slide 19

  20. One workaround  Deny whenever the offline algorithm does, in addition, randomly deny queries.  Issues  Leakage is not prevented  Have to remember which queries were randomly denied  Semantically determine whether two queries are equivalent

  21. Simulatable Auditing  Observation: denials have the potential to leak information if the auditor uses information that is unavailable to the attacker (the answer to the new query)  Simlatable auditing: the attacker should be able to simulate or mimic the auditors decisions to answer or deny a query  Denials provably do not leak information Simulatable Auditing, Kenthapadi, 2005

  22. Simulatable Auditing  An auditor is a function of Q, A and X  An auditor is simulatable if there exists a simulator that is a function of only Q and A-a i+1 and whose output is always equal to the auditor. q i+1 q i+1 q 1 ,…,q i q 1 ,…,q i a 1 ,…,a i Database Auditor Simulator  Deny or answer Deny or answer slide 22

  23. Simulatable Auditing Possible assignments to {d 1 ,…,d n } Assignments consistent with (q 1 ,…q i , a 1 ,…,a i ) q i+1 denied/allowed slide 23

  24. Simulatable Auditing  Query-set-size control  Query-set-overlap control  Audit expert for sum queries

  25. Constructing Simulatable Auditor  General sufficient condition: the auditor should determine if there is any possible dataset, consistent with all past responses, in which the answer to the current query would cause some element to be fully disclosed

  26. Example revisited: Sum/Max Auditor Gimme sum(d 1 ,d 2 ,d 3 ) Answer=a1 Database Gimme max(d 1 ,d 2 ,d 3 ) “Denied”  Simulatable auditor would always deny the max query following a sum query  Lose some utility due to the requirement of simulatability slide 26

  27. Challenges  Privacy definition  Privacy of groups/families  Algorithmic limitations  Simulatable algorithms computationally prohibitive  Most work on sum queries, some on max, min, median, hardness results on mixed queries  Collusion  Reduced utility for legitimate users  Large audit trail  Utility  Percentage of denials may not be the best measure

Recommend


More recommend