CS573 Data Privacy and Security
Differential Privacy: Tabular Data
Li Xiong
Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
- Algorithms for high dimensional data
Example: statistics/synthetic data for medical records
- Histograms
- Cohort discovery: range queries
– Select COUNT(*) from D Where A1 in I1 and A2 in I2 and … and Am in Im.
Example: statistical agencies: data publishing
- A marginal over attributes A1, …, Ak reports the count for each combination of attribute values.
– aka cube, contingency table
– E.g. 2-way marginal on EmploymentStatus and Gender
- U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of available attributes
- Hundreds of marginals released
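To make the notion of a marginal concrete, here is a minimal sketch that computes a 2-way marginal by counting. The records, the `AgeGroup` attribute, and the helper name `marginal` are hypothetical and only serve as illustration; no privacy protection is applied at this stage.

```python
from collections import Counter

# Hypothetical toy records: (EmploymentStatus, Gender, AgeGroup)
ATTRS = ("EmploymentStatus", "Gender", "AgeGroup")
records = [
    ("employed", "F", "30-39"),
    ("employed", "M", "30-39"),
    ("unemployed", "F", "20-29"),
    ("employed", "F", "40-49"),
    ("unemployed", "M", "20-29"),
    ("employed", "F", "20-29"),
]

def marginal(rows, attrs):
    """Count rows for each combination of values of the chosen attributes."""
    idx = [ATTRS.index(a) for a in attrs]
    return Counter(tuple(row[i] for i in idx) for row in rows)

# 2-way marginal on EmploymentStatus and Gender (a small contingency table)
two_way = marginal(records, ("EmploymentStatus", "Gender"))
for combo, count in sorted(two_way.items()):
    print(combo, count)
```

Each cell of the marginal is a count query, which is why the algorithms in the rest of the deck focus on releasing noisy counts.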
Tutorial: Differential Privacy in the Wild 4
https://factfinder.census.gov/
Module 3
Example: range queries over spatial data
[Figure: scatter plot of input data]
Input: sensitive data D
BeijingTaxi dataset[1]: 4,268,780 records of (lat,lon) pairs of taxi pickup locations in Beijing, China in 1 month.
[1]Raw data from: Taxi trajectory open dataset, Tsinghua university, China. http://sensor.ee.tsinghua.edu.cn, 2009.
Input: range query workload W (shown is a workload of 3 range queries)
Task: compute answers to workload W over private input D
Problem variant: offline vs. online
- Offline (batch):
– Entire W given as input, answers computed in batch
- Online (adaptive):
– W is a sequence q1, q2, … that arrives online
– Adaptive: analyst’s choice for qi can depend on answers a1, …, ai−1
Important aspects of problem: Data and query complexity
- Data complexity
– Dimensionality: number of attributes
– Domain size: number of distinct attribute combinations
– Many techniques specialized for low dimensional data
- Query complexity
– Given query workload vs. no query workload
– Classes of queries: histograms, count queries, linear queries (sum, average), median …
Solution variants: query answers vs. synthetic data
Two high-level approaches to solving the problem
- 1. Direct:
– Output of the algorithm is list of query answers
- 2. Synthetic data:
– Algorithm constructs a synthetic dataset D’, which can be queried directly by analyst
– Analyst can pose additional queries on D’ (though answers may not be accurate)
Categories of Methods
- Nonparametric methods: release empirical distributions, i.e. histograms, with differential privacy
- Parametric methods: learn parameters of a distribution with differential privacy
- Semi-parametric methods
– DP marginal histograms (non-parametric)
– Model dependence between attributes (parametric)
Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …
- Algorithms for high dimensional data
Baseline algorithm
1. Discretize attribute domain into cells
2. Add noise to cell counts (Laplace mechanism): unit histogram
3. Use noisy counts to either…
   1. Answer queries directly (assume distribution is uniform within cell)
   2. Generate synthetic data (derive distribution from counts and sample)
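The three steps above can be sketched in a few lines for 1-D data. This is a minimal illustration, not a reference implementation: it assumes neighboring databases differ by adding or removing one record (so each cell count has sensitivity 1), and the function names, bin layout, and toy data are all hypothetical.

```python
import numpy as np

def dp_unit_histogram(points, bins, lo, hi, epsilon, rng):
    """Steps 1 and 2: discretize into cells, then add Laplace noise to each count."""
    counts, edges = np.histogram(points, bins=bins, range=(lo, hi))
    # Assumption: add/remove-one-record neighbors, so sensitivity is 1.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=bins)
    return noisy, edges

def range_count(noisy, edges, a, b):
    """Step 3.1: answer a range query, assuming uniformity within each cell."""
    left, right = edges[:-1], edges[1:]
    overlap = np.clip(np.minimum(right, b) - np.maximum(left, a), 0.0, None)
    return float(np.sum(noisy * overlap / (right - left)))

def synthesize(noisy, edges, n, rng):
    """Step 3.2: sample n synthetic points from the noisy cell distribution."""
    weights = np.clip(noisy, 0.0, None)      # negative noisy counts become 0
    probs = weights / weights.sum()
    cells = rng.choice(len(noisy), size=n, p=probs)
    return rng.uniform(edges[cells], edges[cells + 1])

rng = np.random.default_rng(0)
data = rng.normal(50.0, 10.0, size=10_000)   # hypothetical sensitive values
noisy, edges = dp_unit_histogram(data, bins=20, lo=0.0, hi=100.0, epsilon=1.0, rng=rng)
estimate = range_count(noisy, edges, 40.0, 60.0)
synthetic = synthesize(noisy, edges, 1000, rng)
```

Both uses of the noisy counts are post-processing, so they consume no additional privacy budget.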
Limitations
[Figure: scatter plot of input data]
- Granularity of discretization
– Coarse: detail lost
– Fine: noise overwhelms signal
- Noise accumulates: squared error grows linearly with range
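A short derivation of the last point, as a sketch under two assumptions that the slide does not state: each cell count receives independent Laplace noise of scale 1/ε (sensitivity 1), and the range query covers exactly k whole cells.

```latex
\begin{align*}
\tilde{c}_i &= c_i + \eta_i, \qquad \eta_i \sim \mathrm{Lap}(1/\epsilon), \qquad
\mathrm{Var}(\eta_i) = \frac{2}{\epsilon^2} \\
\mathbb{E}\Big[\Big(\sum_{i=1}^{k} \tilde{c}_i - \sum_{i=1}^{k} c_i\Big)^{2}\Big]
  &= \mathrm{Var}\Big(\sum_{i=1}^{k} \eta_i\Big)
   = \sum_{i=1}^{k} \mathrm{Var}(\eta_i)
   = \frac{2k}{\epsilon^2}
\end{align*}
```

Under these assumptions, doubling the number of cells a query covers doubles its expected squared error.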
September 20, 2016 13
DPCube: An early attempt [SDM 2010, ICDE 2012 demo]
- Domain-based partitioning does not work very well
– Equi-width: equal bucket range
– Uniformity assumption
- Data-driven partitioning
– V-optimal: with the least frequency variance
– Intuition: highest uniformity within each bucket
Histograms (review)
- Divide data into buckets and store average (sum) for each bucket
- Partitioning rules:
– Equi-width: equal bucket range
– Equi-depth: equal frequency
– V-optimal: with the least frequency variance
DPCube [SecureDM 2010, ICDE 2012 demo]
[Figure: Original Records, DP Interface (ε/2-DP, ε/2-DP), DP unit Histogram, multi-dimensional partitioning, DP V-optimal Histogram]
Original Records:
Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
…      …    …       …
- Compute unit histogram counts with differential privacy
- Use DP unit histogram for partitioning
- Compute V-optimal histogram counts with differential privacy
Use kd-tree for partitioning
- Choose dimension and splitting point to split (minimize variance)
- Repeat until:
– Count of this node less than threshold
– Variance or entropy of this node less than threshold
- Limitations:
– DP unit histogram very noisy
– Affects the accuracy of partitioning
Private Spatial Decompositions [CPSSY12]
[Figure: quadtree and kd-tree]
- Build: partitioning with differential privacy
- Release: a private description of data distribution (in the form of bounding boxes and noisy counts)
Building a Private kd-tree
Process to build a private kd-tree
- Input: maximum height h, minimum leaf size L, data set
- Choose dimension to split
- Get (private) median in this dimension
- Create child nodes and add noise to the counts
- Recurse until:
– Max height is reached
– Noisy count of this node less than L
– Budget along the root-leaf path has been used up
The entire PSD satisfies DP by the composition property
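The "private median" step can be realized in several ways. One common option is the exponential mechanism with a rank-based utility, sketched below. This is an illustrative choice and not necessarily the variant used in [CPSSY12]; the function name, the clipping to a public range [lo, hi], and the toy data are assumptions.

```python
import numpy as np

def private_median(values, lo, hi, epsilon, rng):
    """Exponential mechanism over the intervals between sorted points.

    Utility of an interval is -|rank - n/2|, which changes by at most 1
    when one record is added, removed, or replaced.
    """
    x = np.sort(np.clip(values, lo, hi))
    n = len(x)
    pts = np.concatenate(([lo], x, [hi]))     # n + 1 candidate intervals
    widths = np.diff(pts)
    ranks = np.arange(n + 1)                  # points to the left of each interval
    utility = -np.abs(ranks - n / 2.0)
    with np.errstate(divide="ignore"):        # zero-width intervals get weight 0
        logw = np.log(widths) + 0.5 * epsilon * utility
    logw -= logw.max()                        # stabilise before exponentiating
    probs = np.exp(logw)
    probs /= probs.sum()
    i = rng.choice(n + 1, p=probs)
    return rng.uniform(pts[i], pts[i + 1])    # uniform point inside the interval

rng = np.random.default_rng(1)
values = rng.uniform(0.0, 100.0, size=2001)   # hypothetical coordinates
split = private_median(values, 0.0, 100.0, epsilon=5.0, rng=rng)
```

The returned split point then defines the two child nodes, whose counts are perturbed separately with Laplace noise.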
Building PSDs – privacy budget allocation
Budget is split between medians and counts at each node
– Tradeoff accuracy of division with accuracy of counts
Budget is split across levels of the tree
– Privacy budget used along any root-leaf path should total ε
– Optimal budget allocation
– Post processing with consistency check
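Sequential composition: budgets add up along a root-leaf path
Parallel composition: nodes at the same level cover disjoint data, so they share that level's budget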
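A minimal sketch of the budget accounting described above. The helper names, the `ratio` parameter, and the `median_fraction` value are hypothetical; the ratio 2 ** (1 / 3) is used only as an illustrative way to give deeper levels more budget.

```python
def level_budgets(epsilon, height, ratio=1.0):
    """Split epsilon over levels 0 (root) .. height (leaves).

    ratio = 1 gives a uniform split; ratio > 1 gives deeper levels more budget.
    """
    weights = [ratio ** level for level in range(height + 1)]
    total = sum(weights)
    return [epsilon * w / total for w in weights]

def node_budgets(level_eps, median_fraction=0.3):
    """Split one level's budget between the median and the count of a node."""
    return median_fraction * level_eps, (1.0 - median_fraction) * level_eps

uniform = level_budgets(epsilon=1.0, height=4)
geometric = level_budgets(epsilon=1.0, height=4, ratio=2 ** (1 / 3))

# Sequential composition: the budgets along one root-leaf path add up to epsilon.
path_total = sum(geometric)
# Parallel composition: all nodes on a level are disjoint, so each node on
# level i can spend the full geometric[i] without increasing the total.
eps_median, eps_count = node_budgets(geometric[0])
```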
Data Transformations
Can think of trees as a ‘data-dependent’ transform of input
Can apply other data transformations
General idea:
– Apply transform of data
– Add noise in the transformed space (based on sensitivity)
– Publish noisy coefficients, or invert transform (post-processing)
Goal: pick a transform that preserves good properties of data
– And which has low sensitivity, so noise does not corrupt
[Figure: Original Data, Transform, Coefficients, Noise, Noisy Coefficients, Invert, Private Data]
Linear transformations
- Approach
– Discretize domain to finest granularity cells
– Use Laplace mechanism to answer batch of queries, each of which is linear combination of cell counts
- Examples
– Hierarchical: Trees [HRMS10,QYL13], full height quadtree [CPSSY12]
– Haar Wavelet [XWG10]
– Discrete Fourier transform [BCDKMT07]
- Inverting transformation
– Some transformations (e.g. tree) have redundancy (over-constrained), so require pseudo-inverse
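As an illustration of the hierarchical case, the sketch below builds a binary tree of interval counts, adds Laplace noise to every node, and answers a range query from the few nodes that cover it. It is a simplified sketch: it assumes a power-of-two number of cells and add/remove-one-record neighbors, and it omits the consistency (pseudo-inverse) post-processing of [HRMS10]. Function names are hypothetical.

```python
import numpy as np

def noisy_tree(counts, epsilon, rng):
    """Noisy binary tree over the cells; level 0 holds the unit counts."""
    levels = [np.asarray(counts, dtype=float)]
    while len(levels[-1]) > 1:
        levels.append(levels[-1].reshape(-1, 2).sum(axis=1))
    # One record contributes to exactly one node per level,
    # so the sensitivity of the whole batch equals the number of levels.
    scale = len(levels) / epsilon
    return [lvl + rng.laplace(0.0, scale, size=lvl.shape) for lvl in levels]

def range_query(tree, lo, hi):
    """Noisy count of cells lo .. hi-1, using O(log n) tree nodes."""
    total, level = 0.0, 0
    while lo < hi:
        if lo % 2 == 1:          # lo is a right child: take it and move right
            total += tree[level][lo]
            lo += 1
        if hi % 2 == 1:          # hi-1 is a left child: take it and move left
            hi -= 1
            total += tree[level][hi]
        lo //= 2
        hi //= 2
        level += 1
    return total

rng = np.random.default_rng(0)
cells = np.arange(8)                      # hypothetical unit histogram
tree = noisy_tree(cells, epsilon=1.0, rng=rng)
answer = range_query(tree, 2, 7)          # noisy estimate of cells 2..6
```

Each node is noisier than in the unit histogram, but a long range touches only a logarithmic number of nodes instead of every cell.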
Lossy transformations
- Variants
– Drop “small” coefficients:
  - Quad-tree with early stopping (noisy count threshold)
  - Fourier coefficients: EFPA [ACC12], [RN10]
– Data-adaptive discretization:
  - PrivTree [ZXX16], KD-Tree [CPSSY12], DAWA [LHMY14], [DNRR15], [QYL13], [BLR08]
– Data-adaptive measurement:
  - MWEM [HLM12], DualQuery [GAHRW14]
– Randomized transforms: sketches and compressed sensing
  - JL Transform [BBDS12], Compressive mechanism [LZWY11]
- “Inverting” transformation
– Because they are lossy, they are under-constrained, so inversion requires estimation
- Error rates depend on input
– Can be much lower (trades off small bias for lower variance)
– Warrants careful empirical evaluation; algorithms are “data dependent”
Empirical benchmarks
- [HMMCZ16] propose a novel framework for standardized evaluation of privacy algorithms.
- Study of algorithms for range query answering over 1D and 2D data
- Benchmark website: www.dpcomp.org
- One finding from [HMMCZ16]: some data-dependent algorithms fail to offer benefits at larger scales (number of tuples)
Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …
- Algorithms for high dimensional data
– Copula functions [LXJ14]
– Bayesian networks [ZCPSX14]
DPCopula: Motivation
- Parametric methods: fit the data to a distribution, make inferences about parameters (e.g. PrivacyOnTheMap)
- Non-parametric methods: learn empirical distribution through histograms (e.g. PSD, Privelet, FP, P-HP)
[Figure: original data, histogram, perturbation, synthetic data]
DPCopula
A semi-parametric method
- Non-parametric estimation for each dimension: DP marginal histograms (Age, Hours/week, Income)
- Parametric estimation for dependence: dependence structure
[Figure: original data set, DP marginal histograms, dependence structure, DP synthetic data set]
Original data set:
Age  Hours/week  Income
42   64          30K
31   82          60K
28   40          20K
43   36          80K
…    …           …
Privacy guarantee
[Figure: original data set, DP marginal histograms, dependence structure, DP synthetic data set]
- DP marginal histograms: ε1-differential privacy
- Dependence structure: ε2-differential privacy
- DP synthetic data set: (ε1 + ε2)-differential privacy
Challenges
[Figure: original data set, DP marginal histograms, dependence structure, DP synthetic data set]
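- Gaussian distribution: models the joint distribution
- Gaussian copula: models the dependence with arbitrary margins
A multi-Gaussian copula density for $(X_1, \dots, X_m)$ with margins $u_i = F_i(X_i)$:
$$c(u_1, \dots, u_m) = |P|^{-1/2} \exp\left(-\tfrac{1}{2}\, \zeta^{T} (P^{-1} - I)\, \zeta\right), \qquad \zeta_i = \Phi^{-1}(u_i)$$
$I$ is the identity matrix and $P$ is a correlation matrix, e.g. for $m = 3$:
$$P = \begin{pmatrix} 1 & .053 & .108 \\ .053 & 1 & .132 \\ .108 & .132 & 1 \end{pmatrix}$$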
Overview
- Step 1: Computing DP marginal histograms
- Step 2: Computing DP correlation matrix through DP MLE (Maximum Likelihood Estimation)
- Step 3: Sampling DP synthetic data
[Figure: original data set (Age, Hours/week, Income); DP marginal histograms; DP dependence structure given by the DP correlation matrix $\tilde{P}$, estimated by MLE; DP synthetic data set]
$$\tilde{P} = \begin{pmatrix} 1 & .053 & .108 \\ .053 & 1 & .132 \\ .108 & .132 & 1 \end{pmatrix}$$
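The sampling step (Step 3) is pure post-processing once the DP marginals and the DP correlation matrix are available. The sketch below assumes both have already been computed privately; the histogram edges and noisy counts are made up for illustration, the helper names are hypothetical, and the correlation matrix reuses the example values shown for P.

```python
import numpy as np
from statistics import NormalDist

def to_probs(noisy_counts):
    """Turn noisy histogram counts into a probability vector."""
    w = np.clip(np.asarray(noisy_counts, dtype=float), 0.0, None)
    return w / w.sum()

def sample_gaussian_copula(marginals, corr, n, rng):
    """marginals: list of (edges, probs) per attribute; corr: m x m correlation matrix."""
    m = len(marginals)
    z = rng.multivariate_normal(np.zeros(m), corr, size=n)   # latent Gaussian
    u = np.vectorize(NormalDist().cdf)(z)                    # uniform margins
    out = np.empty((n, m))
    for j, (edges, probs) in enumerate(marginals):
        edges = np.asarray(edges, dtype=float)
        cdf = np.cumsum(probs)
        # Inverse CDF of the histogram: pick the cell, then spread uniformly in it.
        cell = np.minimum(np.searchsorted(cdf, u[:, j]), len(probs) - 1)
        out[:, j] = rng.uniform(edges[cell], edges[cell + 1])
    return out

# Hypothetical DP marginal histograms for Age, Hours/week, Income (in K).
marginals = [
    ([20.0, 30.0, 40.0, 50.0], to_probs([120.3, 88.1, -2.4])),
    ([0.0, 40.0, 60.0, 100.0], to_probs([75.9, 101.2, 30.6])),
    ([0.0, 25.0, 50.0, 100.0], to_probs([64.0, 97.5, 41.1])),
]
P = np.array([[1.0, 0.053, 0.108],
              [0.053, 1.0, 0.132],
              [0.108, 0.132, 1.0]])

rng = np.random.default_rng(0)
synthetic = sample_gaussian_copula(marginals, P, n=1000, rng=rng)
```

The copula only carries the dependence; each output column follows its own histogram, which is what "arbitrary margins" refers to.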
[Figure: the original data set is partitioned on Gender, and DPCopula is applied to each partition (Gender = F, Gender = M)]
Original data set:
Age  Hours/week  Income  Gender
42   64          30K     F
31   82          60K     M
28   40          20K     F
43   36          80K     M
…    …           …       …
Partition Gender = F:
Age  Hours/week  Income
42   64          30K
28   40          20K
…    …           …
Partition Gender = M:
Age  Hours/week  Income
31   82          60K
43   36          80K
…    …           …
DPCopula output for Gender = F:
Age  Hours/week  Income
36   62          28K
30   42          18K
…    …           …
DPCopula output for Gender = M:
Age  Hours/week  Income
34   76          52K
31   32          69K
…    …           …
Noisy partition sizes: $\tilde{n}_1 = n_1 + \mathrm{Lap}(1/\epsilon)$, $\tilde{n}_2 = n_2 + \mathrm{Lap}(1/\epsilon)$
Datasets
- US Census data: 4 attributes, 100,000 records
- Brazil data: 8 attributes, 188,846 records
- Synthetic data
Comparison: PSD, Privelet+, FP, P-HP
Metrics:
- Random range-count queries with random query predicates covering all attributes
– Relative error
– Absolute error
[Figure: query accuracy vs. differential privacy budget]
[Figure: query accuracy vs. query range size]
- Gaussian dependence assumption
- Pair-wise attribute correlation does not scale with high dimensions
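[Figure: from sensitive database D to synthetic database D*. Direct route: convert D to a distribution, add noise to obtain a noisy distribution, sample. Low-dimensional route: convert D to a set of low-dim distributions, add noise to obtain noisy low-dim distributions, convert to an approximate full-dim tuple distribution, sample.]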
A 5-dimensional database: age, workclass, education, title, income
$$\Pr[*] \approx \Pr[age] \cdot \Pr[work \mid age] \cdot \Pr[edu \mid age] \cdot \Pr[title \mid work] \cdot \Pr[income \mid work]$$
STEP 1: Choose a suitable Bayesian network 𝒩
- must be done in a differentially private way
STEP 2: Compute conditional distributions implied by 𝒩
- straightforward to do under differential privacy
- inject noise: Laplace mechanism
STEP 3: Generate synthetic data by sampling from 𝒩
- post-processing: no privacy issues
Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of mutual information I of its edges. This is equivalent to finding the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).
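A minimal non-private sketch of this construction: compute the empirical mutual information for every attribute pair, then grow a maximum spanning tree greedily from a root (Prim-style). The toy rows and function names are hypothetical, and nothing here is differentially private yet.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(rows)
    joint = Counter((r[i], r[j]) for r in rows)
    left = Counter(r[i] for r in rows)
    right = Counter(r[j] for r in rows)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

def chow_liu_tree(rows, num_attrs, root=0):
    """Greedy maximum spanning tree; returns directed (parent, child) edges."""
    weight = {
        frozenset(pair): mutual_information(rows, *pair)
        for pair in combinations(range(num_attrs), 2)
    }
    in_tree, edges = [root], []
    while len(in_tree) < num_attrs:
        candidates = [
            (p, c) for p in in_tree for c in range(num_attrs) if c not in in_tree
        ]
        parent, child = max(candidates, key=lambda e: weight[frozenset(e)])
        edges.append((parent, child))
        in_tree.append(child)
    return edges

# Hypothetical binary database over attributes A, B, C, D (columns 0..3).
rows = [
    (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0), (0, 0, 1, 1),
    (1, 1, 1, 0), (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1),
]
edges = chow_liu_tree(rows, num_attrs=4, root=0)
```

Every non-root attribute ends up with exactly one parent, which is what "in-degree 1" means.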
Build a 1-degree BN for a database
[Table: database over attributes A, B, C, D with one row per individual (Alan, Bob, Cykie, David, Eric, Frank, George, Helen, Ivan, Jack)]
- Start from a random attribute A
- Select next tree edge by its mutual information
– candidates: A → B (I = 1), A → C (I = 0.4), A → D (I = 0)
- Select next tree edge by its mutual information
– candidates: A → C (I = 0.4), A → D (I = 0), B → C (I = 0.2), B → D (I = 0)
- Select next tree edge by its mutual information
- DONE!
Do it under Differential Privacy!
- (Non-private) select the edge with maximum I
- (Private) I is data-sensitive
– therefore the best edge is also data-sensitive
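Databases $D$, edges $e$: define $q(D, e) \to \mathbb{R}$
- How good edge $e$ is as the result of selection, given database $D$
Return $e$ with probability
$$\Pr[e] \propto \exp\left(\frac{\epsilon}{2} \cdot \frac{q(D, e)}{\Delta(q)}\right)$$
where, over neighboring databases $D$, $D'$,
$$\Delta(q) = \max_{D, D', e} \lVert q(D, e) - q(D', e) \rVert_1$$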
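This selection rule is the exponential mechanism. A minimal sketch follows; the candidate edges, their scores, and the sensitivity value are hypothetical placeholders, and in practice the sensitivity of the chosen quality function has to be derived separately.

```python
import numpy as np

def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Pick one candidate with Pr[e] proportional to exp(eps/2 * q(D, e) / Delta(q))."""
    logits = 0.5 * epsilon * np.asarray(scores, dtype=float) / sensitivity
    logits -= logits.max()                 # stabilise before exponentiating
    probs = np.exp(logits)
    probs /= probs.sum()
    index = rng.choice(len(candidates), p=probs)
    return candidates[index], probs

# Hypothetical candidate edges and quality scores q(D, e).
candidates = ["A->C", "A->D", "B->C", "B->D"]
scores = [0.9, 0.1, 0.5, 0.1]

rng = np.random.default_rng(0)
chosen, probs = exponential_mechanism(
    candidates, scores, epsilon=1.0, sensitivity=0.1, rng=rng
)
```

Better-scoring edges are exponentially more likely to be returned, but every candidate keeps non-zero probability, which is what hides the influence of any single record.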
Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …
- Algorithms for high dimensional data
– Copula functions
– Bayesian networks
Open questions
- High dimensional data
- Robust and private algorithm selection
- Error bounds for data-dependent algorithms
References
- [ACC12] Ács et al. Differentially private histogram publishing through lossy compression. In ICDM, 2012.
- [BBDS12] Blocki et al. The Johnson-Lindenstrauss transform itself preserves differential privacy. In FOCS, 2012.
- [BCDKMT07] Barak et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, 2007.
- [BLR08] Blum et al. A learning theory approach to noninteractive database privacy. In STOC, 2008.
- [DNRR15] Dwork et al. Pure Differential Privacy for Rectangle Queries via Private Partitions. In ASIACRYPT, 2015.
- [CPSSY12] Cormode et al. Differentially Private Spatial Decompositions. In ICDE, 2012.
- [GAHRW14] Gaboardi et al. Dual Query: Practical Private Query Release for High Dimensional Data. In ICML, 2014.
- [HLM12] Hardt et al. A simple and practical algorithm for differentially private data release. In NIPS, 2012.
- [HMMCZ16] Hay et al. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD, 2016.
- [HRMS10] Hay et al. Boosting the accuracy of differentially private histograms through consistency. In PVLDB, 2010.
- [LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB, 2014.
- [LHRMM10] Li et al. Optimizing linear counting queries under differential privacy. In PODS, 2010.
- [LM12] Li et al. An adaptive mechanism for accurate query answering under differential privacy. In PVLDB, 2012.
- [LM13] Li et al. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT, 2013.
- [LZWY11] Li et al. Compressive mechanism: utilizing sparse representation in differential privacy. In WPES, 2011.
- [QYL13] Qardaji et al. Understanding hierarchical methods for differentially private histograms. In PVLDB, 2013.
- [QYL13] Qardaji et al. Differentially private grids for geospatial data. In ICDE, 2013.
- [RN10] Rastogi et al. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD, 2010.
- [WWLTRD09] Wang et al. Privacy-preserving genomic computation through program specialization. In CCS, 2009.
- [XWG10] Xiao et al. Differential privacy via wavelet transforms. In ICDE, 2010.
- [ZCPSX14] Zhang et al. PrivBayes: private data release via bayesian networks. In SIGMOD, 2014.
- [ZXX16] Zhang et al. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD, 2016.