CS573 Data Privacy and Security Differential Privacy – Tabular Data Li Xiong
Outline • Tabular data and histogram/range queries • Algorithms for low dimensional data • Algorithms for high dimensional data
Example: statistics/synthetic data for medical records • Histograms • Cohort discovery: range queries – SELECT COUNT(*) FROM D WHERE A1 IN I1 AND A2 IN I2 AND … AND Am IN Im
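The cohort-discovery query above can be read as a conjunction of per-attribute interval tests. A minimal sketch of the exact (non-private) count follows, assuming closed intervals; the function name `range_count` and the toy `patients` table are illustrative and not part of the slides.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]
Range = Tuple[float, float]  # closed interval [low, high]


def range_count(records: List[Record], ranges: Dict[str, Range]) -> int:
    """Count records whose value lies in the given interval for every listed attribute."""
    return sum(
        all(low <= rec[attr] <= high for attr, (low, high) in ranges.items())
        for rec in records
    )


# Hypothetical toy table (income in thousands), for illustration only.
patients = [
    {"age": 42, "income": 30},
    {"age": 31, "income": 60},
    {"age": 28, "income": 20},
]
cohort_size = range_count(patients, {"age": (25, 35), "income": (10, 70)})
```

A differentially private algorithm would release a noisy version of this count rather than the exact value.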
Example: statistical agencies: data publishing • A marginal over attributes A_1, …, A_k reports the count for each combination of attribute values. – aka cube, contingency table – E.g. 2-way marginal on EmploymentStatus and Gender • U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of available attributes • Hundreds of marginals released https://factfinder.census.gov/ Module 3 Tutorial: Differential Privacy in the Wild 4
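As a rough illustration of a marginal, the sketch below counts records per combination of values of the chosen attributes. The function name `marginal`, the `State` attribute, and the toy records are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def marginal(records: List[Dict[str, str]], attrs: Sequence[str]) -> Counter:
    """Count records for each combination of values of the chosen attributes."""
    return Counter(tuple(rec[a] for a in attrs) for rec in records)


# Hypothetical toy records, for illustration only.
people = [
    {"EmploymentStatus": "employed", "Gender": "F", "State": "GA"},
    {"EmploymentStatus": "employed", "Gender": "F", "State": "NC"},
    {"EmploymentStatus": "employed", "Gender": "M", "State": "GA"},
    {"EmploymentStatus": "unemployed", "Gender": "M", "State": "GA"},
]
two_way = marginal(people, ["EmploymentStatus", "Gender"])
```

Note that a `Counter` only stores combinations that occur; a published marginal would also list zero-count combinations over the full attribute domains.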
Example: range queries over spatial data
• Input: sensitive data D
– BeijingTaxi dataset [1]: 4,268,780 records of (lat, lon) pairs of taxi pickup locations in Beijing, China in 1 month
– [Figure: scatter plot of input data]
• Input: range query workload W
– [Figure: workload of 3 range queries]
• Task: compute answers to workload W over private input D
[1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China. http://sensor.ee.tsinghua.edu.cn, 2009.
Problem variant: offline vs. online • Offline (batch): – Entire W given as input, answers computed in batch • Online (adaptive): – W is a sequence q_1, q_2, … that arrives online – Adaptive: analyst’s choice for q_i can depend on answers a_1, …, a_{i−1}
Important aspects of problem: Data and query complexity • Data complexity – Dimensionality: number of attributes – Domain size: number of distinct attribute combinations – Many techniques specialized for low dimensional data • Query complexity – Given query workload vs. no query workload – Classes of queries: histograms, count queries, linear queries (sum, average), median …
Solution variants: query answers vs. synthetic data • Two high-level approaches to solving the problem 1. Direct: – Output of the algorithm is a list of query answers 2. Synthetic data: – Algorithm constructs a synthetic dataset D’, which can be queried directly by the analyst – Analyst can pose additional queries on D’ (though answers may not be accurate)
Categories of Methods • Nonparametric methods – release empirical distributions, i.e. histograms with differential privacy • Parametric methods – learn parameters of a distribution with differential privacy
Categories of Methods • Semi-parametric methods – DP marginal histograms (non-parametric) – Model dependence between attributes (parametric)
Outline • Tabular data and histogram/range queries • Algorithms for low dimensional data – Baseline – Partitioning algorithms: kd tree, quad tree, … – Transformation: Wavelet, Fourier Transform, … • Algorithms for high dimensional data
Baseline algorithm
1. Discretize attribute domain into cells
2. Add noise to cell counts (Laplace mechanism) – unit histogram
3. Use noisy counts to either…
   1. Answer queries directly (assume distribution is uniform within cell)
   2. Generate synthetic data (derive distribution from counts and sample)
• Limitations
– Granularity of discretization: coarse: detail lost; fine: noise overwhelms signal
– Noise accumulates: squared error grows linearly with range
[Figure: scatter plot of input data]
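The three baseline steps can be sketched for a single numeric attribute as follows. This is a minimal sketch under stated assumptions: neighboring datasets differ by adding or removing one record (so the histogram has L1 sensitivity 1), all values lie in the stated domain, and the function names are illustrative. Laplace noise is drawn as the difference of two exponential variates.

```python
import random
from typing import List, Sequence


def laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_unit_histogram(values: Sequence[float], low: float, high: float,
                         n_cells: int, epsilon: float,
                         rng: random.Random) -> List[float]:
    """Steps 1-2: discretize [low, high) into equal cells and add Laplace noise."""
    width = (high - low) / n_cells
    counts = [0] * n_cells
    for v in values:  # assumes low <= v <= high
        counts[min(int((v - low) / width), n_cells - 1)] += 1
    # Assumption: one added/removed record changes one cell by 1 (sensitivity 1).
    return [c + laplace(1.0 / epsilon, rng) for c in counts]


def answer_range(noisy: Sequence[float], low: float, high: float,
                 q_low: float, q_high: float) -> float:
    """Step 3.1: estimate the count in [q_low, q_high), uniform within each cell."""
    width = (high - low) / len(noisy)
    total = 0.0
    for i, count in enumerate(noisy):
        cell_low = low + i * width
        overlap = min(q_high, cell_low + width) - max(q_low, cell_low)
        if overlap > 0:
            total += count * overlap / width
    return total


def sample_synthetic(noisy: Sequence[float], low: float, high: float,
                     n_samples: int, rng: random.Random) -> List[float]:
    """Step 3.2: sample points from the distribution implied by the noisy counts."""
    width = (high - low) / len(noisy)
    # Clip negative noisy counts; assumes at least one count stays positive.
    weights = [max(c, 0.0) for c in noisy]
    cells = rng.choices(range(len(noisy)), weights=weights, k=n_samples)
    return [low + (i + rng.random()) * width for i in cells]
```

Each cell carries independent noise of variance 2/ε², so a range covering k full cells has noise variance 2k/ε², which is the linear growth noted in the limitations.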
DPCube: An early attempt [SDM 2010, ICDE 2012 demo] • Domain-based partitioning does not work very well – Equi-width: equal bucket range – Uniformity assumption • Data-driven partitioning – V-optimal: with the least frequency variance – Intuition: highest uniformity within each bucket September 20, 2016 13
Histograms (review) • Divide data into buckets and store average (sum) for each bucket • Partitioning rules: – Equi-width: equal bucket range – Equi-depth: equal frequency – V-optimal: with the least frequency variance
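To make the three partitioning rules concrete, here is a small sketch: bucket edges for equi-width and equi-depth, and the within-bucket squared-deviation cost that a V-optimal histogram minimizes. Function names and the index-based cut convention are assumptions for the example.

```python
from typing import List, Sequence


def equi_width_edges(low: float, high: float, n_buckets: int) -> List[float]:
    """Edges of n_buckets buckets that each span the same range."""
    step = (high - low) / n_buckets
    return [low + i * step for i in range(n_buckets + 1)]


def equi_depth_edges(values: Sequence[float], n_buckets: int) -> List[float]:
    """Edges chosen so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    inner = [ordered[(i * n) // n_buckets] for i in range(1, n_buckets)]
    return [ordered[0]] + inner + [ordered[-1]]


def sse_cost(freqs: Sequence[float], cuts: Sequence[int]) -> float:
    """Total within-bucket squared deviation of cell frequencies.

    `cuts` are strictly increasing indices in (0, len(freqs)) where a new
    bucket starts; a V-optimal histogram picks the cuts minimizing this cost.
    """
    bounds = [0] + list(cuts) + [len(freqs)]
    cost = 0.0
    for start, end in zip(bounds, bounds[1:]):
        bucket = freqs[start:end]
        mean = sum(bucket) / len(bucket)
        cost += sum((f - mean) ** 2 for f in bucket)
    return cost
```

For frequencies `[10, 10, 1, 1]`, cutting at index 2 yields two perfectly uniform buckets (cost 0), whereas cutting at index 1 leaves a mixed bucket with a much higher cost.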
DPCube [SecureDM 2010, ICDE 2012 demo]
Original records:
Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
…      …    …       …
[Figure: Original Records → DP unit histogram (ε/2-DP) → multi-dimensional partitioning → DP V-optimal histogram (ε/2-DP) → DP interface]
• Compute unit histogram counts with differential privacy
• Use DP unit histogram for partitioning
• Compute V-optimal histogram counts with differential privacy
Use kd-tree for partitioning • Choose dimension and splitting point to split (minimize variance) • Repeat until: – Count of this node is less than threshold – Variance or entropy of this node is less than threshold
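The splitting rule above can be sketched on a 2-D grid of cell counts. This is an illustrative sketch, not the DPCube implementation: it assumes the grid is an already-noisy unit histogram (so partitioning is post-processing), uses summed squared deviation as the variance measure, and the names `best_split`, `kd_partition`, `min_count`, and `min_sse` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (row_lo, row_hi, col_lo, col_hi), hi exclusive


def cells(grid: Grid, box: Box) -> List[float]:
    r0, r1, c0, c1 = box
    return [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (0 for a uniform region)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(grid: Grid, box: Box) -> Optional[Tuple[float, Box, Box]]:
    """Try every axis-aligned cut; return (cost, left, right) with the lowest SSE."""
    r0, r1, c0, c1 = box
    candidates = [((r0, r, c0, c1), (r, r1, c0, c1)) for r in range(r0 + 1, r1)]
    candidates += [((r0, r1, c0, c), (r0, r1, c, c1)) for c in range(c0 + 1, c1)]
    if not candidates:
        return None
    return min(
        ((sse(cells(grid, a)) + sse(cells(grid, b)), a, b) for a, b in candidates),
        key=lambda item: item[0],
    )


def kd_partition(grid: Grid, box: Box, min_count: float, min_sse: float) -> List[Box]:
    """Recursively split until a node is small, nearly uniform, or a single cell."""
    values = cells(grid, box)
    split = best_split(grid, box)
    if split is None or sum(values) < min_count or sse(values) < min_sse:
        return [box]
    _, left, right = split
    return (kd_partition(grid, left, min_count, min_sse)
            + kd_partition(grid, right, min_count, min_sse))


# Hypothetical 2 x 4 grid with a dense left half and a sparse right half.
grid = [[10, 10, 1, 1],
        [10, 10, 1, 1]]
parts = kd_partition(grid, (0, 2, 0, 4), min_count=5, min_sse=1.0)
```

On this grid the first cut separates the dense and sparse halves, after which both halves are uniform (or too small) and become leaves.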
DPCube [SecureDM 2010, ICDE 2012 demo]
[Figure: Original Records → DP unit histogram (ε/2-DP) → multi-dimensional partitioning → DP V-optimal histogram (ε/2-DP) → DP interface]
• Limitations:
– DP unit histogram is very noisy
– The noise affects the accuracy of partitioning
Private Spatial Decompositions (PSDs) [CPSSY 12] • Examples: quadtree, kd-tree • Build: partitioning with differential privacy • Release: a private description of the data distribution (in the form of bounding boxes and noisy counts)
Building a Private kd-tree • Process to build a private kd-tree – Input: maximum height h, minimum leaf size L, data set – Choose dimension to split – Get (private) median in this dimension – Create child nodes and add noise to the counts • Recurse until: – Max height is reached – Noisy count of this node is less than L – Budget along the root-leaf path has been used up • The entire PSD satisfies DP by the composition property
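The build process above might look roughly like the following sketch. Several choices here are assumptions rather than the method of [CPSSY 12]: the budget is divided evenly across levels, each level's share is halved between the noisy count and the median, the private median uses an exponential mechanism with rank-distance utility, and dimensions are cycled by depth. Neighboring datasets are assumed to differ by one added or removed point.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


def laplace(scale: float, rng: random.Random) -> float:
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_median(values: Sequence[float], low: float, high: float,
                   epsilon: float, rng: random.Random) -> float:
    """Exponential mechanism: favor split points whose rank is near n/2."""
    n = len(values)
    xs = [low] + sorted(min(max(v, low), high) for v in values) + [high]
    # Every point inside interval i = (xs[i], xs[i+1]) has i values below it.
    scores = [-abs(i - n / 2) for i in range(n + 1)]
    best = max(scores)
    weights = [(xs[i + 1] - xs[i]) * math.exp(epsilon * (scores[i] - best) / 2)
               for i in range(n + 1)]
    if sum(weights) <= 0:
        # Numerical guard only; a careful implementation would work in log space.
        return (low + high) / 2
    i = rng.choices(range(n + 1), weights=weights)[0]
    return rng.uniform(xs[i], xs[i + 1])


def build(points: List[Point], box: Box, depth: int, height: int,
          min_leaf: float, eps_level: float, rng: random.Random) -> Dict:
    # Half of this level's budget pays for the noisy count.
    noisy = len(points) + laplace(2.0 / eps_level, rng)
    node = {"box": box, "count": noisy, "children": []}
    if depth == height or noisy < min_leaf:
        return node
    dim = depth % len(box)  # cycle through the dimensions
    low, high = box[dim]
    # The other half pays for the private median.
    split = private_median([p[dim] for p in points], low, high, eps_level / 2, rng)
    halves = (
        ((low, split), [p for p in points if p[dim] < split]),
        ((split, high), [p for p in points if p[dim] >= split]),
    )
    for interval, subset in halves:
        child_box = box[:dim] + [interval] + box[dim + 1:]
        node["children"].append(
            build(subset, child_box, depth + 1, height, min_leaf, eps_level, rng))
    return node


def private_kd_tree(points: Sequence[Point], box: Box, epsilon: float,
                    height: int, min_leaf: float, rng: random.Random) -> Dict:
    """Spend at most epsilon along any root-leaf path (height + 1 levels)."""
    eps_level = epsilon / (height + 1)
    return build(list(points), list(box), 0, height, min_leaf, eps_level, rng)
```

The stopping test on the noisy count reuses a value that has already been released, so it costs no additional budget.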
Building PSDs – privacy budget allocation • Budget is split between medians and counts at each node – Trades off accuracy of division against accuracy of counts • Budget is split across levels of the tree – Privacy budget used along any root-leaf path should total ε (sequential composition) – Nodes at the same level cover disjoint data, so they share that level's budget (parallel composition) – Optimal budget allocation – Post-processing with consistency check
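A small sketch of how a total budget ε might be divided across tree levels and then between count and median at a node. Both schedules sum to ε along a root-leaf path; the geometric schedule, its `ratio` parameter, the example value `2 ** (1 / 3)`, and the function names are illustrative assumptions, not necessarily the optimal allocation the slide refers to.

```python
from typing import List, Tuple


def uniform_budget(epsilon: float, height: int) -> List[float]:
    """Equal share for each of the height + 1 levels (root is level 0)."""
    return [epsilon / (height + 1)] * (height + 1)


def geometric_budget(epsilon: float, height: int, ratio: float) -> List[float]:
    """Shares that grow by `ratio` per level from root to leaves, summing to epsilon."""
    raw = [ratio ** level for level in range(height + 1)]
    total = sum(raw)
    return [epsilon * r / total for r in raw]


def split_node_budget(eps_level: float, count_fraction: float) -> Tuple[float, float]:
    """Divide one level's share between the noisy count and the private median."""
    eps_count = eps_level * count_fraction
    return eps_count, eps_level - eps_count


# Example: height-3 tree, more budget near the leaves (ratio is an assumed value).
levels = geometric_budget(epsilon=1.0, height=3, ratio=2 ** (1 / 3))
eps_count, eps_median = split_node_budget(levels[0], count_fraction=0.7)
```

Because sibling nodes hold disjoint subsets of the data, every node at a level can use that level's full share without increasing the path total.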
Data Transformations
• Can think of trees as a ‘data-dependent’ transform of input
• Can apply other data transformations
• General idea:
– Apply transform to data
– Add noise in the transformed space (based on sensitivity)
– Publish noisy coefficients, or invert transform (post-processing)
• Goal: pick a transform that preserves good properties of data
– And which has low sensitivity, so noise does not corrupt
[Figure: Original Data → Transform → Coefficients → add Noise → Noisy Coefficients → Invert → Private Data]
Linear transformations • Approach – Discretize domain into finest-granularity cells – Use Laplace mechanism to answer a batch of queries, each of which is a linear combination of cell counts • Examples – Hierarchical: trees [HRMS10,QYL13], full-height quadtree [CPSSY12] – Haar wavelet [XWG10] – Discrete Fourier transform [BCDKMT07] • Inverting transformation – Some transformations (e.g. tree) have redundancy (over-constrained), so require pseudo-inverse
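As one instance of the hierarchical approach, the sketch below builds a binary tree of interval sums over the cell counts, adds Laplace noise to every node, and answers a range query from O(log n) nodes. Assumptions: the number of cells is a power of two, neighboring datasets differ by one record (so one record touches one node per level and the L1 sensitivity equals the number of levels), and the consistency (pseudo-inverse) step is omitted. Function names are illustrative.

```python
import random
from typing import List, Sequence


def laplace(scale: float, rng: random.Random) -> float:
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def build_noisy_tree(counts: Sequence[float], epsilon: float,
                     rng: random.Random) -> List[List[float]]:
    """Return noisy interval sums per level; level 0 holds the cells, last is the root."""
    n = len(counts)
    if n == 0 or n & (n - 1):
        raise ValueError("number of cells must be a power of two")
    levels = [list(counts)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    # One record contributes to exactly one node per level.
    scale = len(levels) / epsilon
    return [[c + laplace(scale, rng) for c in level] for level in levels]


def range_sum(tree: List[List[float]], lo: int, hi: int) -> float:
    """Estimate the total over cells [lo, hi) using as few tree nodes as possible."""
    total = 0.0
    level = 0
    while lo < hi:
        if lo % 2 == 1:  # left boundary is a right child: take it and move on
            total += tree[level][lo]
            lo += 1
        if hi % 2 == 1:  # right boundary ends on a left child: take it
            hi -= 1
            total += tree[level][hi]
        lo //= 2
        hi //= 2
        level += 1
    return total
```

Compared with summing noisy unit cells, a long range here adds noise from only a logarithmic number of nodes, at the price of a larger per-node noise scale.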
Lossy transformations • Variants – Drop “small” coefficients: • Quad-tree with early stopping (noisy count threshold) • Fourier coefficients: EFPA [ACC12], [RN10] – Data-adaptive discretization: • PrivTree [ZXX16], KD-Tree [CPSSY12], DAWA [LHMY14], [DNRR15], [QYL13], [BLR08] – Data-adaptive measurement: • MWEM [HLM12], DualQuery [GAHRW14] – Randomized transforms: sketches and compressed sensing • JL Transform [BBDS12], Compressive mechanism [LZWY11] • “Inverting” transformation – Because they are lossy, these transforms are under-constrained, so inversion requires estimation • Error rates depend on input – Can be much lower (trades off small bias for lower variance) – Warrants careful empirical evaluation; algorithms are “data dependent”
[HMMCZ16] Empirical benchmarks • [HMMCZ16] propose a novel framework for standardized evaluation of privacy algorithms • Study of algorithms for range query answering over 1D and 2D data • Benchmark website: www.dpcomp.org • One finding from [HMMCZ16]: some data-dependent algorithms fail to offer benefits at larger scales (number of tuples)
Outline • Tabular data and histogram/range queries • Algorithms for low dimensional data – Baseline – Partitioning algorithms: kd tree, quad tree, … – Transformation: Wavelet, Fourier Transform, … • Algorithms for high dimensional data – Copula functions [LXJ14] – Bayesian networks [ZCPSX14]