Differential Privacy: Tabular Data (Li Xiong)



slide-1
SLIDE 1

CS573 Data Privacy and Security Differential Privacy – Tabular Data

Li Xiong

slide-2
SLIDE 2

Outline

  • Tabular data and histogram/range queries
  • Algorithms for low dimensional data
  • Algorithms for high dimensional data
slide-3
SLIDE 3

Example: statistics/synthetic data for medical records

  • Histograms
  • Cohort discovery: range queries

– SELECT COUNT(*) FROM D WHERE A1 IN I1 AND A2 IN I2 AND … AND Am IN Im

slide-4
SLIDE 4
Example: statistical agencies: data publishing

  • A marginal over attributes A1, …, Ak reports the count for each combination of attribute values.

– Also known as a cube or contingency table
– E.g. 2-way marginal on EmploymentStatus and Gender

  • U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of available attributes
  • Hundreds of marginals released

Tutorial: Differential Privacy in the Wild 4

https://factfinder.census.gov/

Module 3
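As a small illustration of what a marginal computes, here is a non-private sketch. The `marginal` helper and the toy records are hypothetical; only the attribute names EmploymentStatus and Gender come from the slide. No noise is added at this stage.

```python
from collections import Counter

def marginal(records, attrs):
    """Count records for each combination of values of the given attributes."""
    return Counter(tuple(r[a] for a in attrs) for r in records)

# Hypothetical toy records, for illustration only.
records = [
    {"EmploymentStatus": "Employed", "Gender": "F"},
    {"EmploymentStatus": "Employed", "Gender": "M"},
    {"EmploymentStatus": "Unemployed", "Gender": "F"},
    {"EmploymentStatus": "Employed", "Gender": "F"},
]

# 2-way marginal: one count per (EmploymentStatus, Gender) combination.
two_way = marginal(records, ["EmploymentStatus", "Gender"])
```

A k-way marginal is the same call with k attribute names.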

slide-5
SLIDE 5

Example: range queries over spatial data

Input: sensitive data D

BeijingTaxi dataset [1]: 4,268,780 records of (lat, lon) pairs of taxi pickup locations in Beijing, China, in 1 month.

[Figure: scatter plot of input data]

Input: range query workload W (shown is a workload of 3 range queries)

Task: compute answers to workload W over private input D

[1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China. http://sensor.ee.tsinghua.edu.cn, 2009.

slide-6
SLIDE 6

Problem variant: offline vs. online

  • Offline (batch):

– Entire W given as input, answers computed in batch

  • Online (adaptive):

– W is a sequence q1, q2, … that arrives online
– Adaptive: the analyst’s choice for qi can depend on answers a1, …, ai−1

slide-7
SLIDE 7

Important aspects of problem: Data and query complexity

  • Data complexity

– Dimensionality: number of attributes
– Domain size: number of distinct attribute combinations
– Many techniques are specialized for low dimensional data

  • Query complexity

– Given query workload vs. no query workload
– Classes of queries: histograms, count queries, linear queries (sum, average), median …

slide-8
SLIDE 8

Solution variants: query answers vs. synthetic data

Two high-level approaches to solving the problem

  • 1. Direct:

– Output of the algorithm is a list of query answers

  • 2. Synthetic data:

– Algorithm constructs a synthetic dataset D’, which can be queried directly by the analyst
– Analyst can pose additional queries on D’ (though answers may not be accurate)

slide-9
SLIDE 9
Categories of Methods

  • Nonparametric methods – release empirical distributions, i.e. histograms, with differential privacy
  • Parametric methods – learn parameters of a distribution with differential privacy

slide-10
SLIDE 10
Categories of Methods

  • Semi-parametric methods

– DP marginal histograms (non-parametric)
– Model dependence between attributes (parametric)

slide-11
SLIDE 11

Outline

  • Tabular data and histogram/range queries
  • Algorithms for low dimensional data

– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …

  • Algorithms for high dimensional data
slide-12
SLIDE 12

Baseline algorithm

1. Discretize attribute domain into cells
2. Add noise to cell counts (Laplace mechanism) – unit histogram
3. Use noisy counts to either…

   1. Answer queries directly (assume distribution is uniform within cell)
   2. Generate synthetic data (derive distribution from counts and sample)
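The baseline can be sketched for one-dimensional data as follows. This is a minimal illustration, not the tutorial's implementation: the function names are hypothetical, the noise scale 1/ε assumes neighboring datasets differ by adding or removing one record (so each cell count has sensitivity 1), and values are assumed to lie in [lo, hi).

```python
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_unit_histogram(values, lo, hi, n_cells, epsilon, rng):
    """Steps 1-2: discretize [lo, hi) into equal-width cells and add Laplace noise."""
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for v in values:
        idx = min(int((v - lo) / width), n_cells - 1)
        counts[idx] += 1
    # One record changes exactly one cell count by 1 (assumed add/remove neighbors).
    return [c + laplace(1 / epsilon, rng) for c in counts], width

def range_count(noisy, lo, width, a, b):
    """Step 3, option 1: estimate the count in [a, b), assuming uniformity within cells."""
    total = 0.0
    for i, c in enumerate(noisy):
        cell_lo = lo + i * width
        cell_hi = cell_lo + width
        overlap = max(0.0, min(b, cell_hi) - max(a, cell_lo))
        total += c * overlap / width
    return total
```

A query that cuts through a cell only gets a fraction of that cell's noisy count, which is where the uniformity assumption enters.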

[Figure: scatter plot of input data]

Limitations

  • Granularity of discretization

– Coarse: detail lost
– Fine: noise overwhelms signal

  • Noise accumulates: squared error grows linearly with range
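The linear growth can be checked with a short calculation. It assumes each cell count receives independent Laplace noise of scale 1/ε (the symbols k and η_i are introduced here for illustration) and that the query covers exactly k whole cells:

```latex
\tilde{c}_i = c_i + \eta_i, \qquad \eta_i \sim \mathrm{Lap}(1/\varepsilon), \qquad \mathrm{Var}(\eta_i) = \frac{2}{\varepsilon^2}

\mathbb{E}\Big[\Big(\sum_{i=1}^{k} \tilde{c}_i - \sum_{i=1}^{k} c_i\Big)^{2}\Big]
  = \mathrm{Var}\Big(\sum_{i=1}^{k} \eta_i\Big)
  = \frac{2k}{\varepsilon^2}
```

Doubling the number of cells in the range doubles the expected squared error.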


slide-13
SLIDE 13

September 20, 2016 13

DPCube: An early attempt [SDM 2010, ICDE 2012 demo]

  • Domain-based partitioning does not work very well

– Equi-width: equal bucket range
– Uniformity assumption

  • Data-driven partitioning

– V-optimal: with the least frequency variance
– Intuition: highest uniformity within each bucket

slide-14
SLIDE 14


Histograms (review)

  • Divide data into buckets and store average (sum) for each bucket
  • Partitioning rules:

– Equi-width: equal bucket range
– Equi-depth: equal frequency
– V-optimal: with the least frequency variance
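To make the V-optimal rule concrete, here is a non-private sketch using the classic dynamic program over a one-dimensional frequency vector. It reads "least frequency variance" as minimizing the total sum of squared deviations from each bucket's mean; the function names are hypothetical, and this is not the DPCube code.

```python
def sse(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for freqs[i:j]."""
    n = j - i
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / n

def v_optimal(freqs, n_buckets):
    """Return ([(start, end), ...], total_sse) for the best split into n_buckets."""
    n = len(freqs)
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        prefix[i + 1] = prefix[i] + f
        prefix_sq[i + 1] = prefix_sq[i] + f * f
    inf = float("inf")
    # cost[b][j]: lowest SSE for the first j frequencies using b buckets.
    cost = [[inf] * (n + 1) for _ in range(n_buckets + 1)]
    back = [[0] * (n + 1) for _ in range(n_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, n_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(prefix, prefix_sq, i, j)
                if c < cost[b][j]:
                    cost[b][j], back[b][j] = c, i
    # Walk the back-pointers to recover the bucket boundaries.
    bounds, j = [], n
    for b in range(n_buckets, 0, -1):
        i = back[b][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1], cost[n_buckets][n]
```

For frequencies `[10, 10, 10, 1, 1, 1]` and two buckets, the split falls between the third and fourth values, where each bucket is perfectly uniform.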

slide-15
SLIDE 15

DPCube [SecureDM 2010, ICDE 2012 demo]

Original records:

Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
…      …    …       …

[Figure: original records, accessed through a DP interface, yield a DP unit histogram (ε/2-DP); multi-dimensional partitioning then yields a DP V-optimal histogram (ε/2-DP)]

  • Compute unit histogram counts with differential privacy
  • Use DP unit histogram for partitioning
  • Compute V-optimal histogram counts with differential privacy

slide-16
SLIDE 16

Use kd-tree for partitioning

  • Choose dimension and splitting point to split (minimize variance)
  • Repeat until:

– Count of this node less than threshold
– Variance or entropy of this node less than threshold

slide-17
SLIDE 17

DPCube [SecureDM 2010, ICDE 2012 demo]

  • Limitations:

– DP unit histogram is very noisy
– Affects the accuracy of partitioning
slide-18
SLIDE 18

Private Spatial decompositions [CPSSY 12]

[Figure: quadtree and kd-tree decompositions]

  • Build: partitioning with differential privacy
  • Release: a private description of data distribution (in the form of bounding boxes and noisy counts)

slide-19
SLIDE 19


Building a Private kd-tree

Process to build a private kd-tree:

  • Input: maximum height h, minimum leaf size L, data set
  • Choose dimension to split
  • Get (private) median in this dimension
  • Create child nodes and add noise to the counts
  • Recurse until:

– Max height is reached
– Noisy count of this node is less than L
– Budget along the root-leaf path has been used up

The entire PSD satisfies DP by the composition property
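One common way to obtain the private median used in the split step is the exponential mechanism over ranks. The sketch below is an assumption about how that step could look, not the [CPSSY12] code: `private_median` is a hypothetical name, the domain bounds `lo < hi` are assumed public, and the utility of a candidate point is minus its rank distance from the middle (sensitivity at most 1).

```python
import math
import random

def private_median(values, lo, hi, epsilon, rng):
    """Sample an approximate median of `values` within [lo, hi]."""
    n = len(values)
    # Clamp to the public domain and add the domain endpoints as interval borders.
    xs = [lo] + sorted(min(max(v, lo), hi) for v in values) + [hi]
    weights = []
    for i in range(len(xs) - 1):
        # Every point strictly inside (xs[i], xs[i+1]) has exactly i data values below it.
        utility = -abs(i - n / 2)
        # Weight = interval length * exp(eps/2 * utility), with sensitivity 1.
        weights.append((xs[i + 1] - xs[i]) * math.exp(epsilon / 2 * utility))
    i = rng.choices(range(len(weights)), weights=weights)[0]
    # All points inside the chosen interval are equally good, so pick uniformly.
    return rng.uniform(xs[i], xs[i + 1])
```

With a small ε the split point spreads over the whole domain; with a large ε it concentrates near the true median. That is the accuracy-of-division side of the budget tradeoff.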

slide-20
SLIDE 20

Building PSDs – privacy budget allocation

  • Budget is split between medians and counts at each node

– Tradeoff accuracy of division with accuracy of counts

  • Budget is split across levels of the tree

– Privacy budget used along any root-leaf path should total ε
– Optimal budget allocation
– Post processing with consistency check

slide-21
SLIDE 21

Building PSDs – privacy budget allocation

  • Budget spent along a root-leaf path adds up: sequential composition
  • Nodes at the same level cover disjoint data, so each can use that level's full budget: parallel composition
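The sketch below compares two ways of splitting the count budget across tree levels so that every root-leaf path totals ε. The uniform split is the obvious baseline. The geometric split gives deeper levels more budget; its ratio of 2^(1/3) per level is an assumption here, modeled on the geometric allocation discussed in [CPSSY12], and both function names are hypothetical. The budget reserved for medians is left out.

```python
def uniform_budget(epsilon, height):
    """Equal share for each of the height + 1 levels (root is depth 0)."""
    return [epsilon / (height + 1)] * (height + 1)

def geometric_budget(epsilon, height, ratio=2 ** (1 / 3)):
    """Share grows by `ratio` per level from root (depth 0) to leaves (depth height)."""
    weights = [ratio ** d for d in range(height + 1)]
    total = sum(weights)
    # Normalize so the shares along a root-leaf path sum to epsilon.
    return [epsilon * w / total for w in weights]
```

By sequential composition the per-level shares along one path must sum to ε, which both functions guarantee by construction.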

slide-22
SLIDE 22

Data Transformations

  • Can think of trees as a ‘data-dependent’ transform of input
  • Can apply other data transformations
  • General idea:

– Apply transform of data
– Add noise in the transformed space (based on sensitivity)
– Publish noisy coefficients, or invert transform (post-processing)

  • Goal: pick a transform that preserves good properties of data

– And which has low sensitivity, so noise does not corrupt

[Figure: original data, transform, coefficients, add noise, noisy coefficients, invert, private data]

slide-23
SLIDE 23

Linear transformations

  • Approach

– Discretize domain to finest granularity cells
– Use Laplace mechanism to answer batch of queries, each of which is linear combination of cell counts

  • Examples

– Hierarchical: Trees [HRMS10,QYL13], full height quadtree [CPSSY12]
– Haar Wavelet [XWG10]
– Discrete Fourier transform [BCDKMT07]

  • Inverting transformation

– Some transformations (e.g. tree) have redundancy (over-constrained), so require pseudo-inverse
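A minimal sketch of the hierarchical (tree) transform on a one-dimensional unit histogram follows. It assumes the number of cells is a power of two and add/remove-one-record neighbors, so one record touches one node per level and the L1 sensitivity equals the number of levels. The function names are hypothetical, and the consistency (pseudo-inverse) step is left out.

```python
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def build_noisy_tree(counts, epsilon, rng):
    """Return noisy levels: levels[0] are the cells, levels[-1] is [root]."""
    levels = [list(counts)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    # One record contributes to one node per level, so sensitivity = number of levels.
    scale = len(levels) / epsilon
    return [[c + laplace(scale, rng) for c in level] for level in levels]

def range_query(levels, a, b):
    """Sum over cells [a, b) using the largest available tree nodes."""
    total, level = 0.0, 0
    while a < b:
        if a % 2 == 1:          # left border is a right child: take it and move on
            total += levels[level][a]
            a += 1
        if b % 2 == 1:          # right border follows a left child: take that child
            b -= 1
            total += levels[level][b]
        a //= 2
        b //= 2
        level += 1
    return total
```

A range over n cells is answered from O(log n) noisy nodes instead of up to n noisy cells, which is why the per-node noise can be larger while long ranges still come out more accurate.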


slide-24
SLIDE 24

Lossy transformations

  • Variants

– Drop “small” coefficients:

  • Quad-tree with early stopping (noisy count threshold)
  • Fourier coefficients: EFPA [ACC12], [RN10]

– Data-adaptive discretization:

  • PrivTree [ZXX16], KD-Tree [CPSSY12], DAWA [LHMY14], [DNRR15], [QYL13], [BLR08]

– Data-adaptive measurement:

  • MWEM [HLM12], DualQuery [GAHRW14]

– Randomized transforms: sketches and compressed sensing

  • JL Transform [BBDS12], Compressive mechanism [LZWY11]
  • “Inverting” transformation

– Because they are lossy, they are under-constrained and require estimation

  • Error rates depend on input

– Can be much lower (trades off small bias for lower variance)
– Warrants careful empirical evaluation; algorithms are “data dependent”

slide-25
SLIDE 25

Empirical benchmarks

  • [HMMCZ16] propose a novel evaluation framework for standardized evaluation of privacy algorithms.
  • Study of algorithms for range query answering over 1D and 2D data
  • Benchmark website www.dpcomp.org
  • One finding from [HMMCZ16]: some data-dependent algorithms fail to offer benefits at larger scales (no. of tuples)

slide-26
SLIDE 26

Outline

  • Tabular data and histogram/range queries
  • Algorithms for low dimensional data

– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …

  • Algorithms for high dimensional data

– Copula functions [LXJ14]
– Bayesian networks [ZCPSX14]

slide-27
SLIDE 27

DPCopula: Motivation

  • Parametric methods: fit the data to a distribution, make inferences about parameters (e.g. PrivacyOnTheMap)
  • Non-parametric methods: learn empirical distribution through histograms (e.g. PSD, Privelet, FP, P-HP)

[Figure: original data, histogram, perturbation, synthetic data]

slide-28
SLIDE 28

DPCopula

A semi-parametric method

  • Non-parametric estimation for each dimension: DP marginal histograms (Age, Hours/week, Income)
  • Parametric estimation for dependence: dependence structure

Original data set:

Age  Hours/week  Income
42   64          30K
31   82          60K
28   40          20K
43   36          80K
…    …           …

[Figure: original data set, DP marginal histograms and dependence structure, DP synthetic data set]

slide-29
SLIDE 29

Privacy guarantee

[Figure: original data set, DP marginal histograms and dependence structure, DP synthetic data set]

  • DP marginal histograms: ε1-differential privacy
  • Dependence structure: ε2-differential privacy
  • DP synthetic data set: ε-differential privacy, where ε = ε1 + ε2
slide-30
SLIDE 30

Challenges

[Figure: original data set, DP marginal histograms and dependence structure, DP synthetic data set]

slide-31
SLIDE 31

A multi-Gaussian density for the copula $C(u_1, u_2, \dots, u_m)$ of the attributes $(X_1, X_2, \dots, X_m)$ with marginal distributions $F_i(X_i)$, where:

  • $I$ is the identity matrix
  • $P$ is a correlation matrix, e.g. ($m = 3$):

$$P = \begin{pmatrix} 1 & .053 & .108 \\ .053 & 1 & .132 \\ .108 & .132 & 1 \end{pmatrix}$$
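For reference, the standard Gaussian copula density can be written with the identity matrix $I$ and correlation matrix $P$ named on this slide. This is the textbook form, not a copy of the slide's formula. $\Phi^{-1}$ denotes the standard normal quantile function, and $\zeta$ is a symbol introduced here.

```latex
c_P(u_1, \dots, u_m)
  = \frac{1}{\sqrt{\det P}}
    \exp\!\Big( -\tfrac{1}{2}\, \zeta^{\top} \big( P^{-1} - I \big)\, \zeta \Big),
\qquad
\zeta = \big( \Phi^{-1}(u_1), \dots, \Phi^{-1}(u_m) \big)^{\top},
\qquad
u_i = F_i(X_i)
```

When $P = I$ the exponent vanishes and the density is 1 everywhere, which corresponds to independent attributes.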

slide-32
SLIDE 32

  • Gaussian copula: models the dependence with arbitrary margins
  • Gaussian distribution: models the joint distribution

slide-33
SLIDE 33

Original data set:

Age  Hours/week  Income
42   64          30K
31   82          60K
28   40          20K
43   36          80K
…    …           …

  • Step 1: Computing DP marginal histograms (Age, Hours/week, Income)
  • Step 2: Computing DP correlation matrix through DP MLE (Maximum Likelihood Estimation), which gives the DP dependence structure, e.g.

$$\tilde{P} = \begin{pmatrix} 1 & .053 & .108 \\ .053 & 1 & .132 \\ .108 & .132 & 1 \end{pmatrix}$$

  • Step 3: Sampling DP synthetic data set
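The sampling step can be illustrated for two attributes. This sketch assumes the noisy marginal histograms and the noisy correlation `rho` have already been produced under differential privacy, so the sampling itself is post-processing. The function names are hypothetical, and the two-attribute restriction is a simplification that avoids a matrix factorization.

```python
import bisect
import math
import random
from statistics import NormalDist

def inverse_cdf_from_histogram(hist, u):
    """Map u in [0, 1) to a bin index via the cumulative distribution of bin counts."""
    counts = [max(c, 0.0) for c in hist]  # clip negative noisy counts
    total = sum(counts)
    cum, acc = [], 0.0
    for c in counts:
        acc += c / total
        cum.append(acc)
    return min(bisect.bisect_right(cum, u), len(hist) - 1)

def sample_pair(hist_x, hist_y, rho, rng):
    """Draw one (x_bin, y_bin) from a two-dimensional Gaussian copula."""
    # Correlated standard normals with correlation rho.
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
    # Normal CDF turns them into correlated uniforms.
    std = NormalDist()
    u1, u2 = std.cdf(z1), std.cdf(z2)
    # Each uniform is pushed through its own marginal, whatever its shape.
    return (inverse_cdf_from_histogram(hist_x, u1),
            inverse_cdf_from_histogram(hist_y, u2))
```

The margins enter only in the last step, which is how the copula separates dependence from the marginal shapes.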

slide-34
SLIDE 34

Overview

Original data set:

Age  Hours/week  Income  Gender
42   64          30K     F
31   82          60K     M
28   40          20K     F
43   36          80K     M
…    …           …       …

  • Partition by Gender, with noisy partition sizes $\tilde{n}_1 = n_1 + \mathrm{Lap}(1/\varepsilon)$ and $\tilde{n}_2 = n_2 + \mathrm{Lap}(1/\varepsilon)$
  • Gender = F: records (42, 64, 30K), (28, 40, 20K), … go through DPCopula, giving $\tilde{n}_1$ synthetic records, e.g. (36, 62, 28K), (30, 42, 18K), …
  • Gender = M: records (31, 82, 60K), (43, 36, 80K), … go through DPCopula, giving $\tilde{n}_2$ synthetic records, e.g. (34, 76, 52K), (31, 32, 69K), …

slide-35
SLIDE 35

  • Datasets

– US Census data: 4 attributes, 100,000 records
– Brazil data: 8 attributes, 188,846 records
– Synthetic data

  • Comparison: PSD, Privelet+, FP, P-HP
  • Metrics:

– Random range-count queries with random query predicates covering all attributes
– Relative error and absolute error

slide-36
SLIDE 36

Query accuracy vs. differential privacy budget

slide-37
SLIDE 37

Query accuracy vs. query range size

slide-38
SLIDE 38

  • Gaussian dependence assumption
  • Pair-wise attribute correlation does not scale with high dimensions

slide-39
SLIDE 39

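[Figure: two routes from sensitive database D to synthetic database D*]

  • Direct: convert D to the full-dim tuple distribution, add noise to get a noisy distribution, then sample
  • Low-dimensional: approximate the full-dim tuple distribution by a set of low-dim distributions, add noise to get noisy low-dim distributions, then sample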

slide-40
SLIDE 40

  • A 5-dimensional database: age, workclass, education, title, income

[Figure: Bayesian network with distributions Pr[age], Pr[work | age], Pr[edu | age], Pr[title | work], Pr[income | work]]

slide-41
SLIDE 41

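  • A 5-dimensional database: age, workclass, education, title, income

$$\Pr[\ast] \approx \Pr[\text{age}] \cdot \Pr[\text{work} \mid \text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{title} \mid \text{work}] \cdot \Pr[\text{income} \mid \text{work}]$$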

slide-42
SLIDE 42

  • STEP 1: Choose a suitable Bayesian network 𝒩

– must be done in a differentially private way

  • STEP 2: Compute conditional distributions implied by 𝒩

– straightforward to do under differential privacy
– inject noise – Laplace mechanism

  • STEP 3: Generate synthetic data by sampling from 𝒩

– post-processing: no privacy issues
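Steps 2 and 3 can be sketched for a single parent-child edge. This is an illustration under stated assumptions rather than the PrivBayes procedure: noise is added to joint counts with scale 1/ε (add/remove-one-record neighbors, with `epsilon` being the share of budget given to this one distribution), negative counts are clipped, and all function and variable names are hypothetical.

```python
import random
from collections import Counter

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_conditional(records, child, parent, domains, epsilon, rng):
    """Step 2: estimate Pr[child | parent] from Laplace-noised joint counts."""
    counts = Counter((r[parent], r[child]) for r in records)
    cond = {}
    for p in domains[parent]:
        noisy = [max(counts[(p, c)] + laplace(1 / epsilon, rng), 0.0)
                 for c in domains[child]]
        total = sum(noisy)
        k = len(noisy)
        # Fall back to uniform if every noisy count was clipped to zero.
        cond[p] = [x / total if total > 0 else 1 / k for x in noisy]
    return cond

def sample_child(cond, parent_value, child_domain, rng):
    """Step 3: draw a child value given its already-sampled parent value."""
    return rng.choices(child_domain, weights=cond[parent_value])[0]
```

Sampling a full record repeats `sample_child` along the network in parent-before-child order. It reads only the noisy distributions, so it uses no further privacy budget.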
slide-43
SLIDE 43

  • Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of mutual information I of its edges ⇔ finding the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).
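A non-private sketch of the Chow-Liu construction follows. It computes empirical mutual information in bits and grows a maximum spanning tree with Prim's algorithm, starting from the first attribute rather than a random one. The function names and the toy records in the usage note are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(records, x, y):
    """Empirical mutual information I(x, y) in bits."""
    n = len(records)
    pxy = Counter((r[x], r[y]) for r in records)
    px = Counter(r[x] for r in records)
    py = Counter(r[y] for r in records)
    # (c/n) * log2( (c/n) / ((px/n) * (py/n)) ) summed over observed value pairs.
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def chow_liu_tree(records, attrs):
    """Return directed edges (parent, child) of a maximum spanning tree."""
    mi = {frozenset(p): mutual_information(records, *p)
          for p in combinations(attrs, 2)}
    in_tree, edges = [attrs[0]], []
    while len(in_tree) < len(attrs):
        # Best edge from the current tree to an attribute not yet in it.
        parent, child = max(
            ((u, v) for u in in_tree for v in attrs if v not in in_tree),
            key=lambda e: mi[frozenset(e)],
        )
        edges.append((parent, child))
        in_tree.append(child)
    return edges
```

With records in which B always equals A and C is independent of both, `chow_liu_tree(records, ["A", "B", "C"])` attaches B to A first, because that edge carries the most mutual information.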

slide-44
SLIDE 44

  • Build a 1-degree BN for database

[Table: binary attributes A, B, C, D for records Alan, Bob, Cykie, David, Eric, Frank, George, Helen, Ivan, Jack]

slide-45
SLIDE 45

  • Start from a random attribute A

[Figure: nodes A, B, C, D]

slide-46
SLIDE 46

  • Select next tree edge by its mutual information

[Figure: nodes A, B, C, D]

candidates: A → B, A → C, A → D

slide-47
SLIDE 47

  • Select next tree edge by its mutual information

[Figure: nodes A, B, C, D]

candidates: A → B (I = 1), A → C (I = 0.4), A → D (I = 0)

slide-48
SLIDE 48

  • Select next tree edge by its mutual information

[Figure: nodes A, B, C, D]

slide-49
SLIDE 49

  • Select next tree edge by its mutual information

[Figure: nodes A, B, C, D]

candidates: A → C (I = 0.4), A → D (I = 0), B → C (I = 0.2), B → D (I = 0)

slide-50
SLIDE 50

  • Select next tree edge by its mutual information

[Figure: nodes A, B, C, D]

DONE!

slide-51
SLIDE 51

  • Do it under Differential Privacy!
  • (Non-private) select the edge with maximum I
  • (Private) I is data-sensitive -> the best edge is also data-sensitive
slide-52
SLIDE 52

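  • Databases D, edges e: define $q(D, e) \to \mathbb{R}$, how good edge e is as the result of selection, given database D
  • Return e with probability:

$$\Pr[e] \propto \exp\!\left( \frac{\varepsilon}{2} \cdot \frac{q(D, e)}{\Delta(q)} \right)$$

where

$$\Delta(q) = \max_{D, D', e} \lVert q(D, e) - q(D', e) \rVert_1$$

(annotations: info, noise)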
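The selection rule above is the exponential mechanism. A minimal sketch is given below; the function name, the candidate labels, and the scores in the usage note are hypothetical, and `sensitivity` stands for Δ(q), which the caller must supply.

```python
import math
import random

def exponential_mechanism(candidates, quality, sensitivity, epsilon, rng):
    """Return one candidate with probability proportional to exp(eps/2 * q / sensitivity)."""
    scores = [quality(c) for c in candidates]
    # Subtracting the maximum score only rescales all weights by a common factor;
    # it leaves the selection probabilities unchanged and avoids overflow.
    top = max(scores)
    weights = [math.exp(epsilon / 2 * (s - top) / sensitivity) for s in scores]
    return rng.choices(candidates, weights=weights)[0]
```

For example, with toy scores `{"A->B": 1.0, "A->C": 0.4, "A->D": 0.0}` and a very large ε the best edge is returned almost surely. As ε approaches 0 the choice approaches uniform over the candidates.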

slide-53
SLIDE 53

Outline

  • Tabular data and histogram/range queries
  • Algorithms for low dimensional data

– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …

  • Algorithms for high dimensional data

– Copula functions
– Bayesian networks

slide-54
SLIDE 54

Open questions

  • High dimensional data
  • Robust and private algorithm selection
  • Error bounds for data-dependent algorithms


slide-55
SLIDE 55

References

  • [ACC12] Ács et al. Differentially private histogram publishing through lossy compression. In ICDM, 2012.
  • [BBDS12] Blocki et al. The Johnson-Lindenstrauss transform itself preserves differential privacy. In FOCS, 2012.
  • [BCDKMT07] Barak et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, 2007.
  • [BLR08] Blum et al. A learning theory approach to noninteractive database privacy. In STOC, 2008.
  • [DNRR15] Dwork et al. Pure Differential Privacy for Rectangle Queries via Private Partitions. In ASIACRYPT, 2015.
  • [CPSSY12] Cormode et al. Differentially Private Spatial Decompositions. In ICDE, 2012.
  • [GAHRW14] Gaboardi et al. Dual Query: Practical Private Query Release for High Dimensional Data. In ICML, 2014.
  • [HLM12] Hardt et al. A simple and practical algorithm for differentially private data release. In NIPS, 2012.
  • [HMMCZ16] Hay et al. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD, 2016.
  • [HRMS10] Hay et al. Boosting the accuracy of differentially private histograms through consistency. In PVLDB, 2010.
  • [LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB, 2014.
  • [LHRMM10] Li et al. Optimizing linear counting queries under differential privacy. In PODS, 2010.
  • [LM12] Li et al. An adaptive mechanism for accurate query answering under differential privacy. In PVLDB, 2012.
  • [LM13] Li et al. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT, 2013.
  • [LZWY11] Li et al. Compressive mechanism: utilizing sparse representation in differential privacy. In WPES, 2011.
  • [QYL13] Qardaji et al. Understanding hierarchical methods for differentially private histograms. In PVLDB, 2013.
  • [QYL13] Qardaji et al. Differentially private grids for geospatial data. In ICDE, 2013.
  • [RN10] Rastogi et al. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD, 2010.
  • [WWLTRD09] Wang et al. Privacy-preserving genomic computation through program specialization. In CCS, 2009.
  • [XWG10] Xiao et al. Differential privacy via wavelet transforms. In ICDE, 2010.
  • [ZCPSX14] Zhang et al. PrivBayes: private data release via Bayesian networks. In SIGMOD, 2014.
  • [ZXX16] Zhang et al. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD, 2016.
