Towards Practical Differential Privacy for SQL Queries


  1. Towards Practical Differential Privacy for SQL Queries
     Noah Johnson, Joseph P. Near, Dawn Song
     UC Berkeley

  2. Outline
     1. Discovering real-world requirements
     2. Elastic sensitivity & calculating sensitivity of SQL queries
     3. Our experience: lessons & challenges

  3. Part 1: Discovering Real-world Requirements

  4. Our collaboration with Uber
     • Uber’s goal: deploy differential privacy
       • Internally (for some analysts)
       • Externally (for partners & regulators)
     • Our goals
       • Explore real-world requirements for differential privacy
       • Build open-source systems

  5. Previous work on differential privacy for analytics: insufficient for real-world applications
     Previous work: either…
     • Theoretical (does not explore practical applications)
     • Targets specialized analytics tasks
       • Google RAPPOR: browsing statistics
       • Apple: keyboard & emoji trends
     Result: little use in real-world analytics environments
     • No practical, scalable systems for DP in analytics

  6. Empirical study: understanding real-world data analytics
     • Conducted large-scale empirical study of real-world analytics queries
     • Dataset: 8 million SQL queries written by data analysts at Uber
     • Covers wide range of use cases: fraud detection, marketing, business metrics, etc.
     • Goal: identify DP requirements for real-world workload

  7. Empirical study results
     The most common aggregations are COUNT, SUM, AVG, MAX, and MIN:
     [Bar chart: percentage by aggregation function]
     • COUNT: 39.3%
     • SUM: 22.6%
     • AVG: 6.5%
     • MAX: 4.6%
     • MIN: 3.8%
     • MEDIAN: 0.2%
     • STDDEV: 0.1%
     → Most existing DP mechanisms support only counting queries

  8. Empirical study results
     62% of queries use JOIN, and some queries use many joins:
     [Chart: joins in query vs. # queries (log scale)]
     → Very few existing mechanisms support join

  9. Empirical study results
     Many different databases in use:
     [Bar chart: # queries per database (log scale)]
     • Vertica: 6,362,631
     • Postgres: 1,494,680
     • MySQL: 94,206
     • Hive: 81,660
     • Presto: 39,521
     • Other: 29,387
     → Existing approaches require modifying/replacing DB

  10. Part 2: Elastic Sensitivity & Analyzing SQL Queries

  11. Global sensitivity vs. local sensitivity for joins
      Global sensitivity
      • Unbounded for queries with joins
        • Single added join key in one table could match an unbounded number of keys in another
      Local sensitivity
      • Bounded for queries with joins
        • Data in true database bounds number of possible new matches
      • Computationally expensive
        • Must consider every possible change to true database
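For reference, the two notions on this slide are usually defined as follows. The notation is an assumption, since the slide gives no formulas: $f$ is the query, $x$ is the true database, and $d(x, y) = 1$ means that $y$ differs from $x$ in a single row.

```latex
\begin{align*}
GS_f    &= \max_{x,\,y \,:\, d(x,y)=1} \lvert f(x) - f(y) \rvert \\
LS_f(x) &= \max_{y \,:\, d(x,y)=1} \lvert f(x) - f(y) \rvert
\end{align*}
```

Global sensitivity maximizes over every pair of neighbouring databases, which is why an unbounded number of join matches makes it unbounded. Local sensitivity fixes $x$ to the true database, so the data already present limits the change, but every neighbour $y$ still has to be considered.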

  12. Elastic sensitivity
      Upper bound on local sensitivity
      • Efficient, compositional calculation from query
      Supports queries with equijoins
      • Insight: increase in size of joined relation tightly bounded by multiplicities of join keys
      • Key multiplicities queried from database in advance
      Supports more than just count
      • Works well for COUNT
      • Works less well for SUM
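The insight on this slide can be sketched for the simplest case, a COUNT over a single equijoin of two tables. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, each table is reduced to a plain list of its join-key values, and the bound is taken to be the larger of the two maximum key multiplicities.

```python
from collections import Counter


def max_multiplicity(keys):
    """Largest number of rows sharing one join-key value (0 for an empty table)."""
    counts = Counter(keys)
    return max(counts.values(), default=0)


def count_join(keys_a, keys_b):
    """COUNT(*) of the equijoin of A and B, computed from key frequencies."""
    freq_b = Counter(keys_b)
    return sum(freq_b[k] for k in keys_a)


def count_join_sensitivity_bound(keys_a, keys_b):
    """Bound on the change in COUNT(*) when one row is added to A or to B.

    Assumption for this sketch: a new row in A can match at most
    max_multiplicity(keys_b) rows of B, and a new row in B can match at most
    max_multiplicity(keys_a) rows of A.
    """
    return max(max_multiplicity(keys_a), max_multiplicity(keys_b))


# Hypothetical join-key columns for two tables.
keys_a = [1, 2, 2, 3]
keys_b = [2, 2, 2, 3]
bound = count_join_sensitivity_bound(keys_a, keys_b)
change = count_join(keys_a + [2], keys_b) - count_join(keys_a, keys_b)
```

Only the per-table maximum multiplicities are needed to evaluate the bound, which matches the slide's point that they can be queried from the database in advance.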

  13. Example: elastic sensitivity of join
      SELECT COUNT(*) FROM A JOIN B ON A.k = B.k
      Duplicate join key 1 causes duplicate rows in joined relation:
        A (k, v): (1, a)
        B (k): 1, 1
        A JOIN B (k, v): (1, a), (1, a)
      Maximum change in COUNT: add another 1 to A
        A (k, v): (1, a), (1, b)
        B (k): 1, 1
        A JOIN B (k, v): (1, a), (1, a), (1, b), (1, b)
      Local sensitivity = 2
      In general: local sensitivity bounded by maximum multiplicities of k in A and B
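The join example on this slide can be replayed with Python's built-in `sqlite3` module. This is a minimal sketch: the in-memory database and the helper names `scalar` and `max_key_multiplicity` are assumptions made for illustration, while the tables and the query are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (k INTEGER, v TEXT);
    CREATE TABLE B (k INTEGER);
    INSERT INTO A VALUES (1, 'a');
    INSERT INTO B VALUES (1), (1);
""")

QUERY = "SELECT COUNT(*) FROM A JOIN B ON A.k = B.k"


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


def max_key_multiplicity(table):
    """Largest number of rows in `table` sharing one value of k.

    The table name is interpolated directly, so this helper is only meant
    for the fixed names A and B used in this sketch.
    """
    return scalar(
        f"SELECT MAX(c) FROM (SELECT COUNT(*) AS c FROM {table} GROUP BY k)"
    )


count_before = scalar(QUERY)
# Bound from the slide: the larger of the key multiplicities in A and B.
bound = max(max_key_multiplicity("A"), max_key_multiplicity("B"))

# Add another row with join key 1 to A, as in the slide.
conn.execute("INSERT INTO A VALUES (1, 'b')")
count_after = scalar(QUERY)
change = count_after - count_before
```

Here the count goes from 2 to 4, so the change of 2 equals the bound computed from the multiplicities before the insert.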

  14. A static analysis framework for SQL queries
      Built a practical framework for analyzing real-world queries
      Challenge: these queries are complex
      Our framework:
      • Solve complexity once
      • Enable many different analyses

  15. Differential privacy for SQL queries using Elastic Sensitivity
      [Architecture diagram]
      • SQL query → analysis framework (elastic sensitivity analysis) → elastic sensitivity
      • SQL query → database → sensitive results
      • Sensitive results + elastic sensitivity → output perturbation → differentially private results
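The output perturbation step can be sketched as adding noise to the sensitive result after the database returns it. The slide does not say which noise distribution or scaling is used, so this sketch assumes the generic Laplace mechanism with scale `sensitivity / epsilon`. That mechanism is differentially private when `sensitivity` is a global sensitivity bound. How elastic sensitivity, a bound on local sensitivity, is prepared before calibrating noise is not covered on this slide and is not modeled here. The names `laplace_noise` and `perturb` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(true_result, sensitivity, epsilon, rng=None):
    """Return true_result plus Laplace noise of scale sensitivity / epsilon.

    Assumption: `sensitivity` is a valid sensitivity bound for the query
    that produced `true_result`.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = rng or random.Random()
    return true_result + laplace_noise(sensitivity / epsilon, rng)


# Usage: perturb a count of 100 with sensitivity 2 and epsilon 1.
noisy_count = perturb(100, sensitivity=2, epsilon=1.0)
```

Because the noise is added to the result rather than inside the query engine, a design like this does not require modifying the database.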

  16. Empirical evaluation results
      Dataset: 9862 Uber queries, run on production database

  17. Part 3: Lessons Learned & Future Challenges

  18. Value of close collaboration
      • Opportunity to examine real use cases
        • Dataset of queries: what analysts actually did
      • Insight into privacy goals in the real world
        • e.g. concern about external and internal sharing
      • Discover requirements & infrastructure restrictions
        • e.g. we really can’t modify the database engine

  19. Challenges of close collaboration
      • Analysts skeptical about need for privacy protections
        • Concerned about utility
        • Believe privacy is already protected
          • e.g. machine learning teams believe models protect privacy
      • Privacy team unsure of privacy goals
        • Belief that de-identification is enough, or
        • Differential privacy seen as a silver bullet
          • Would like to “have differential privacy” all in one go
      • Infrastructure teams want a one-size-fits-all solution
        • Multiple solutions = more work

  20. Conclusions
      • Perfect deployment will take time, experimentation
        • Early versions will be limited
        • There will be bugs
      • We can accelerate the process
        • Encouragement
        • Constructive engagement
      • We should encourage transparency
        • Secrecy encourages bugs, discourages adoption
      https://github.com/uber/sql-differential-privacy
      https://arxiv.org/abs/1706.09479
      jnear@berkeley.edu
      Thank you!
