2020 Disclosure Avoidance System (DAS)
Presenter: John L. Eltinge, Assistant Director for Research and Methodology
Presenting materials originally from:
Simson L. Garfinkel, Senior Computer Scientist for Confidentiality and Data Access
John M. Abowd, Chief Scientist and Associate Director for Research and Methodology (ADRM)
Acknowledgements and Disclaimer
Almost all of the materials covered in the slides were originally prepared by Simson Garfinkel and John Abowd of the United States Census Bureau. The views expressed in this presentation are those of the authors and speaker, and do not necessarily represent the policies of the United States Census Bureau.
General Background
Essentially All Large-Scale Statistical Programs Require a Complex Balance of Multiple Dimensions of:
- Quality
- Risk (Including Disclosure Risk)
- Cost
Disclosure Avoidance System Purpose
• The Disclosure Avoidance System (DAS) assures that the 2020 Census data products meet the legal requirements of Title 13, Section 9 of the U.S. Code.
• The DAS is designed to prevent improper disclosures of data about individuals and establishments in the 2020 Census data products.
• Stakeholders: All users of data from the 2020 Census.
Disclosure Avoidance System Agenda
• Project purpose: Why do we need a new DAS?
• Noise injection and differential privacy: A brief tutorial
• State of the project
• Looking forward and conclusion
Project purpose: Why we need a new disclosure avoidance system
We create statistics by collecting data, processing it, and publishing the results.
[Diagram: RESPONDENT DATA → SUMMARY PROCESSING → PUBLISHED DATA]
Database reconstruction is a mathematical process that reverses this process.
[Diagram: the same pipeline run in reverse, from PUBLISHED DATA back to RESPONDENT DATA]
Consider a census block:
PUBLISHED DATA
Statistic    Count
Age < 18     4
Age >= 18    6
Race 1       4
Race 2       4
Race 3       2
[Figure sequence: the PUBLISHED DATA counts are used to fill in a RECONSTRUCTED DATA table, one record at a time (R1, R2, ...), with columns Race 1, Race 2, and Race 3 and an age group for each record. Each of the block's ten reconstructed records carries a RACE value and an AGE value: TWENTY CONFIDENTIAL VALUES.]
TWENTY CONFIDENTIAL VALUES versus FIVE PUBLISHED STATISTICS
PUBLISHED DATA
Statistic    Count
Age < 18     4
Age >= 18    6
Race 1       4
Race 2       4
Race 3       2
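To make the gap between the confidential values and the published statistics concrete, here is a minimal Python sketch (not Census Bureau code) that enumerates every age-by-race cross-tabulation consistent with the five published counts for the example block. The function name `consistent_tables` and the dictionary keys are illustrative assumptions, not names from the original slides.

```python
from itertools import product

# The five published statistics for the example census block.
published = {
    "age_lt_18": 4,
    "age_ge_18": 6,
    "race_1": 4,
    "race_2": 4,
    "race_3": 2,
}


def consistent_tables(published):
    """Return every (minors, adults) split by race that matches the counts.

    Each element is a pair of 3-tuples: the number of people under 18 in
    Race 1, Race 2, Race 3, and the number aged 18 or over in each race.
    """
    race_totals = [published["race_1"], published["race_2"], published["race_3"]]
    tables = []
    # Try every possible number of minors within each race total.
    for minors in product(*(range(total + 1) for total in race_totals)):
        if sum(minors) != published["age_lt_18"]:
            continue
        adults = tuple(total - m for total, m in zip(race_totals, minors))
        if sum(adults) == published["age_ge_18"]:
            tables.append((minors, adults))
    return tables


tables = consistent_tables(published)
print(len(tables))  # prints 12
```

With only these five statistics, twelve different cross-tabulations remain possible. Each additional published table adds constraints and shrinks that set, which is what makes reconstruction feasible when many statistics are released for the same block.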
“This is the official form for all the people at this address.”
“It is quick and easy, and your answers are protected by law.”
2010 Census of Population and Housing
Total population: 308,745,538
Pieces of information per person: 6
Total pieces of information: 1,852,473,228
2010 Census Publication Schedule
Publication                   Released counts
PL94-171 Redistricting        2,771,998,263
Balance of Summary File 1     2,806,899,669
Summary File 2                2,093,683,376
2010 Census: Summary of Publications (approximate counts)
Publication                           Released counts (including zeros)
PL94-171 Redistricting                2,771,998,263
Balance of Summary File 1             2,806,899,669
Summary File 2                        2,093,683,376
Public-use microdata sample           30,874,554
Lower bound on published statistics   7,703,455,862
Statistics/person                     25
The threat of database reconstruction
2010 Census statistics/person collected: 6
2010 Census statistics/person published: 25
Lower bound on collected statistics: 1,852,473,228 (308,745,538 x 6)
Lower bound on published statistics: 7,703,455,862 (25 statistics per person)
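The arithmetic behind these totals can be checked with a short Python sketch. The variable names are illustrative assumptions; the figures are the ones quoted on the slides.

```python
# Figures quoted on the slides for the 2010 Census.
population = 308_745_538
items_collected_per_person = 6

published_counts = {
    "PL94-171 Redistricting": 2_771_998_263,
    "Balance of Summary File 1": 2_806_899_669,
    "Summary File 2": 2_093_683_376,
    "Public-use microdata sample": 30_874_554,
}

# Total confidential values collected: one per person per item.
collected_total = population * items_collected_per_person

# Lower bound on the number of statistics published.
published_total = sum(published_counts.values())

# Published statistics per person, which the slides round to 25.
published_per_person = published_total / population

print(collected_total)
print(published_total)
print(round(published_per_person))
```

The published total is roughly four times the collected total, which is the core of the reconstruction concern: far more statistics were released than there were confidential values behind them.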
Two privacy mechanisms for the 2010 Census
• Aggregation
• Swapping
Noise injection and differential privacy
Database reconstruction and noise injection
Statistic    Counts    Counts after NOISE
Age < 18     4         5
Age >= 18    6         5
Race 1       4         3
Race 2       4         5
Race 3       2         2
The more noise, the more privacy, and the less accuracy
Statistic    Counts    Little noise    BIG NOISE
Age < 18     4         5               2
Age >= 18    6         5               8
Race 1       4         3               8
Race 2       4         5               1
Race 3       2         2               1
The more noise, the more privacy, and the less accuracy
Statistic    POSSIBILITY 1    POSSIBILITY 2    POSSIBILITY 3    Counts after BIG NOISE
Age < 18     4                8                3                2
Age >= 18    6                2                7                8
Race 1       4                3                5                8
Race 2       4                2                2                1
Race 3       2                5                3                1
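The trade-off on these slides can be illustrated with a minimal Python sketch. It uses the textbook Laplace mechanism as a stand-in; the slides do not say which noise distribution the DAS uses, so the mechanism, the function names, the `epsilon` values, and the per-count sensitivity of 1 are all assumptions made for illustration. Each count is noised independently here, and the sketch does not attempt a rigorous accounting of the total privacy loss across the five counts.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity/epsilon to every count."""
    scale = sensitivity / epsilon
    return {name: value + laplace_noise(scale, rng) for name, value in counts.items()}


def mean_abs_error(counts, epsilon, rng, trials=2000):
    """Average absolute error per count over repeated noisy releases."""
    total = 0.0
    for _ in range(trials):
        noisy = noisy_counts(counts, epsilon, rng)
        total += sum(abs(noisy[name] - counts[name]) for name in counts)
    return total / (trials * len(counts))


true_counts = {"Age < 18": 4, "Age >= 18": 6, "Race 1": 4, "Race 2": 4, "Race 3": 2}
rng = random.Random(2020)  # fixed seed so the run is repeatable

# Large epsilon: little noise, high accuracy, weaker privacy.
little_noise_error = mean_abs_error(true_counts, epsilon=10.0, rng=rng)
# Small epsilon: big noise, low accuracy, stronger privacy.
big_noise_error = mean_abs_error(true_counts, epsilon=0.1, rng=rng)

print(little_noise_error, big_noise_error)
```

The noisy values in this sketch are real numbers and can be negative; turning them into non-negative integer tables that are mutually consistent is a separate post-processing step that the sketch leaves out.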
Differential privacy is a tool for controlling the noise/accuracy trade-off
In 2017, the Census Bureau announced that it would use differential privacy for the 2020 Census.
• Differential privacy provides:
– Provable bounds on the maximum privacy loss
– Algorithms that allow policy makers to manage the trade-off between accuracy and privacy loss (from less noise to MORE NOISE)
• Final privacy-loss budget determined by the Data Stewardship Executive Policy Committee (DSEP) with recommendations from the Disclosure Review Board (DRB)
State of the project
The “Disclosure Avoidance System” is part of the Census data processing pipeline
[Diagram: Decennial Response File → Census Unedited File → Census Edited File → Global Confidentiality Protection Process (Disclosure Avoidance System) → Microdata Detail File → pre-specified tabular summaries (PL94-171, SF1, SF2) and special tabulations and post-census research. The privacy-loss budget ε and the accuracy decisions (privacy/accuracy trade-offs) feed into the Disclosure Avoidance System. Red = confidential data; blue = privatized data.]
Differential privacy has many advantages over swapping
• Advantages:
– Privacy guarantees are tunable and provable
– Privacy guarantees are future-proof
– Privacy guarantees are public and explainable
– Protects against database reconstruction
• Disadvantages:
– Entire country must be processed at once for best accuracy
– Every use of private data must be tallied in the privacy-loss budget
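The last disadvantage, tallying every use of private data, can be sketched as a simple ledger. This is a minimal illustration under basic sequential composition, where the privacy losses of successive releases add up; the class name `PrivacyLossBudget`, its methods, and the epsilon values are hypothetical and do not come from the DAS.

```python
class PrivacyLossBudget:
    """Track cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.ledger = []  # list of (label, epsilon) entries

    @property
    def spent(self):
        # Under sequential composition the individual epsilons add up.
        return sum(epsilon for _, epsilon in self.ledger)

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def spend(self, label, epsilon):
        """Record a use of the confidential data, or refuse if over budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError(f"'{label}' would exceed the privacy-loss budget")
        self.ledger.append((label, epsilon))


# Hypothetical budget and allocations, chosen only for illustration.
budget = PrivacyLossBudget(total_epsilon=1.0)
budget.spend("PL94-171 tables", 0.25)
budget.spend("SF1 tables", 0.5)
print(budget.spent, budget.remaining)  # prints 0.75 0.25
```

Once the ledger reaches the total, any further release from the confidential data has to be refused or the budget has to be raised by a policy decision, which is why the budget is set centrally rather than per product.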
We will make the DAS public!
• Open source system
• Source code published on the Internet
• Testable with data from the 1940 Census
Communications Strategy
• Differential privacy is not widely known or understood outside academia
• Most data users expect the same accuracy regardless of the level of detail
• In 2000 and 2010 we used swapping with an undisclosed swap rate
– The Census Bureau did not quantify the error rate
State of the DAS Project(s): Engineering & Science
• ENGINEERING PROJECT: Building a Turnkey Batch-Oriented System
• Creating a production system that runs within the 2018 End-to-End Census Test and 2020 Census production environments
– Resource intensive, but only when actively in use
– Based on Amazon Elastic MapReduce technology
– Reads the Census Edited File (CEF) from the Census Data Lake
– Processes using DAS algorithms and a commercial optimizer
– Creates the Microdata Detail File
– Saves results in the Census Data Lake
State of the DAS Project(s): Engineering & Science
• SCIENCE PROJECT: Improving the differential privacy algorithms
• We are steadily improving the accuracy/privacy trade-off
• Progress requires interactive access to microdata from the 2010 Census, and continued access to high-performance computing on demand.
[Figure: two panels, each labeled "By block"]
Looking forward
DAS Highlights: Good news!
• The current “top-down” algorithm handles the PL94-171 queries and generates microdata that meet the requirements to publish test files.
• We’re sharing tables with Subject Matter Experts (SMEs) and discussing possible improvements
• We will soon integrate the High-Dimensional Matrix Mechanism (HDMM) into our top-down algorithm, which will improve accuracy on requested tabulations
• The Census Bureau is collecting “use cases” from our data users
Federal Register Notice (FRN)
We want users of 2020 Census Data Products to tell us how they use our data!
First FRN: 83 FR 34111, 7/19/2018 to 9/17/2018
Second FRN: 83 FR 50636, 10/09/2018 to 11/08/2018
DAS Science Highlights: Challenges!
• We have not yet addressed household queries or person-household joins, although we have in-progress research for both
– Householder queries, e.g. “how many households are headed by someone aged 20-30?”
– Person-household join, e.g. “how many children are in households headed by someone aged 20-30?”
• Lack of scientists and engineers trained in differential privacy
• Many open questions in mathematical statistics and methodology
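The two query types in the first challenge can be made concrete with a small Python sketch over toy data. The records, field names, and function names are hypothetical; treating "aged 20-30" as an inclusive range and "children" as persons under 18 are assumptions made for illustration. The sketch computes the exact, un-noised answers only.

```python
# Toy, made-up records: one entry per household and one per person.
households = {
    1: {"householder_age": 25},
    2: {"householder_age": 47},
    3: {"householder_age": 29},
}

persons = [
    {"household_id": 1, "age": 25},
    {"household_id": 1, "age": 3},
    {"household_id": 1, "age": 1},
    {"household_id": 2, "age": 47},
    {"household_id": 2, "age": 16},
    {"household_id": 3, "age": 29},
    {"household_id": 3, "age": 30},
    {"household_id": 3, "age": 5},
]


def households_headed_by_age(households, low, high):
    """Householder query: count households whose head is aged low..high."""
    return sum(
        1 for h in households.values() if low <= h["householder_age"] <= high
    )


def children_in_households_headed_by_age(persons, households, low, high, child_age_limit=18):
    """Person-household join: count children living in matching households."""
    return sum(
        1
        for p in persons
        if p["age"] < child_age_limit
        and low <= households[p["household_id"]]["householder_age"] <= high
    )


print(households_headed_by_age(households, 20, 30))  # prints 2
print(children_in_households_headed_by_age(persons, households, 20, 30))  # prints 3
```

One reason joins of this kind are commonly considered harder to protect is visible in the toy data: a single household contributes exactly one unit to the householder query, but can contribute several children to the join, so one household's influence on the answer is not bounded by 1.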