
SLIDE 1

CS573 Data Privacy and Security Differential Privacy – Tabular Data

Li Xiong

SLIDE 2

Outline

Tabular data and histogram/range queries
Algorithms for low dimensional data
Algorithms for high dimensional data

SLIDE 3

Example: statistics/synthetic data for medical records

Histograms
Cohort discovery: range queries

– Select COUNT(*) from D Where A1 in I1 and A2 in I2 and … and Am in Im.
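A minimal sketch of the range-count query above, assuming each record is a dict and each interval Ij is a closed (low, high) pair. The function name `range_count` and the sample records are hypothetical, chosen only for illustration.

```python
def range_count(records, intervals):
    """Count records whose listed attributes all fall inside their closed intervals."""
    return sum(
        all(lo <= rec[attr] <= hi for attr, (lo, hi) in intervals.items())
        for rec in records
    )

# Hypothetical records (income in thousands).
records = [
    {"age": 42, "income": 30},
    {"age": 31, "income": 60},
    {"age": 28, "income": 20},
]

# SELECT COUNT(*) FROM D WHERE age IN [30, 45] AND income IN [25, 70]
print(range_count(records, {"age": (30, 45), "income": (25, 70)}))
```

This is the exact (non-private) answer; the algorithms in the rest of the deck are about releasing such counts with noise.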

SLIDE 4

A marginal over attributes A1, …, Ak reports the count for each combination of attribute values.

– aka cube, contingency table
– E.g. 2-way marginal on EmploymentStatus and Gender
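As an illustration of a k-way marginal, the sketch below counts records per combination of attribute values. The helper `marginal` and the sample records are hypothetical.

```python
from collections import Counter

def marginal(records, attrs):
    """Count records for each combination of values of the given attributes."""
    return Counter(tuple(rec[a] for a in attrs) for rec in records)

# Hypothetical records; only the two attributes of the marginal matter here.
records = [
    {"EmploymentStatus": "employed", "Gender": "F"},
    {"EmploymentStatus": "employed", "Gender": "M"},
    {"EmploymentStatus": "unemployed", "Gender": "F"},
    {"EmploymentStatus": "employed", "Gender": "F"},
]

two_way = marginal(records, ["EmploymentStatus", "Gender"])
for combo, count in sorted(two_way.items()):
    print(combo, count)
```

Combinations that never occur are simply absent from the `Counter`; a private release would have to enumerate the full domain so that empty cells also receive noise.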

U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of available attributes

Hundreds of marginals released

Example: statistical agencies: data publishing

Tutorial: Differential Privacy in the Wild 4

https://factfinder.census.gov/

Module 3

SLIDE 5

Scatter plot of input data

Example: range queries over spatial data


Input: sensitive data D

BeijingTaxi dataset[1]: 4,268,780 records of (lat,lon) pairs of taxi pickup locations in Beijing, China in 1 month.

[1]Raw data from: Taxi trajectory open dataset, Tsinghua university, China. http://sensor.ee.tsinghua.edu.cn, 2009.

Input: range query workload W (shown is a workload of 3 range queries)
Task: compute answers to workload W over private input D


SLIDE 6

Problem variant: offline vs. online

Offline (batch):

– Entire W given as input, answers computed in batch

Online (adaptive):

– W is a sequence q1, q2, … that arrives online
– Adaptive: analyst’s choice for qi can depend on answers a1, …, ai−1

SLIDE 7

Important aspects of problem: Data and query complexity

Data complexity

– Dimensionality: number of attributes
– Domain size: number of distinct attribute combinations
– Many techniques specialized for low dimensional data

Query complexity

– Given query workload vs. no query workload
– Classes of queries: histograms, count queries, linear queries (sum, average), median …

SLIDE 8

Solution variants: query answers vs. synthetic data

Two high-level approaches to solving problem

1. Direct:

– Output of the algorithm is list of query answers

2. Synthetic data:

– Algorithm constructs a synthetic dataset D’, which can be queried directly by analyst
– Analyst can pose additional queries on D’ (though answers may not be accurate)

SLIDE 9

Categories of Methods

Nonparametric methods – release empirical distributions, i.e. histograms, with differential privacy

Parametric methods – learn parameters of a distribution with differential privacy

SLIDE 10

Categories of Methods

Semi-parametric methods

– DP marginal histograms (non-parametric)
– Model dependence between attributes (parametric)

SLIDE 11

Outline

Tabular data and histogram/range queries
Algorithms for low dimensional data

– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …

Algorithms for high dimensional data

SLIDE 12

Baseline algorithm

1. Discretize attribute domain into cells
2. Add noise to cell counts (Laplace mechanism) – unit histogram
3. Use noisy counts to either…

– Answer queries directly (assume distribution is uniform within cell)
– Generate synthetic data (derive distribution from counts and sample)
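The baseline can be sketched for a single numeric attribute as follows. This is an illustrative sketch, not the tutorial's code: the function names, the equal-width discretization, and the add/remove-one-record notion of neighboring datasets (sensitivity 1) are all assumptions.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_unit_histogram(values, lo, hi, n_cells, epsilon, rng):
    """Discretize [lo, hi] into equal-width cells and add Laplace(1/epsilon) noise.

    Values are assumed to lie in [lo, hi].
    """
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for v in values:
        counts[min(int((v - lo) / width), n_cells - 1)] += 1
    # One record changes exactly one cell count by 1, so the sensitivity is 1.
    return [c + laplace(1.0 / epsilon, rng) for c in counts], width

def answer_range(noisy, lo, width, a, b):
    """Estimate the count in [a, b), assuming data is uniform within each cell."""
    total = 0.0
    for i, c in enumerate(noisy):
        cell_lo, cell_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, cell_hi) - max(a, cell_lo))
        total += c * overlap / width
    return total

rng = random.Random(7)
ages = [1, 2, 3, 11, 12, 25]
noisy, width = noisy_unit_histogram(ages, 0, 30, n_cells=3, epsilon=1.0, rng=rng)
print(answer_range(noisy, 0, width, 0, 15))
```

A query that cuts through a cell (here `[0, 15)` covers half of the second cell) is answered with a proportional share of that cell's noisy count, which is where the uniformity assumption enters.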


Scatter plot of input data

Limitations

Granularity of discretization

– Coarse: detail lost
– Fine: noise overwhelms signal

Noise accumulates: squared error grows linearly with range
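The linear growth follows from the variance of Laplace noise. A short worked derivation, assuming each cell receives independent Lap(1/ε) noise η_i and the range covers exactly k whole cells:

```latex
\mathbb{E}\Big[\Big(\sum_{i=1}^{k} \eta_i\Big)^{2}\Big]
  = \sum_{i=1}^{k} \operatorname{Var}(\eta_i)
  = k \cdot \frac{2}{\epsilon^{2}},
\qquad \eta_i \sim \mathrm{Lap}(1/\epsilon)\ \text{i.i.d.}
```

The cross terms vanish because the noise terms are independent with mean zero, and Var(Lap(b)) = 2b².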


SLIDE 13

September 20, 2016 13

DPCube: An early attempt [SDM 2010, ICDE 2012 demo]

Domain-based partitioning does not work very well

– Equi-width: equal bucket range
– Uniformity assumption

Data-driven partitioning

– V-optimal: with the least frequency variance
– Intuition: highest uniformity within each bucket

SLIDE 14


Histograms (review)

Divide data into buckets and store average (sum) for each bucket
Partitioning rules:

– Equi-width: equal bucket range
– Equi-depth: equal frequency
– V-optimal: with the least frequency variance
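To make the first two rules concrete, the sketch below computes bucket edges for a small skewed sample. The function names and the data are hypothetical, and V-optimal partitioning is left out because it needs a search over boundaries.

```python
def equi_width_edges(values, k):
    """k buckets that each span the same range of values."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]

def equi_depth_edges(values, k):
    """k buckets that each hold (roughly) the same number of values."""
    s = sorted(values)
    n = len(s)
    # Edge i sits at the (i*n/k)-th smallest value; the last edge is the maximum.
    return [s[min(i * n // k, n - 1)] for i in range(k + 1)]

values = [1, 2, 3, 4, 10, 20, 30, 100]
print(equi_width_edges(values, 2))
print(equi_depth_edges(values, 2))
```

On skewed data the equi-width split leaves almost everything in one bucket, while the equi-depth split moves the boundary to where the data actually is.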

SLIDE 15

DPCube [SecureDM 2010, ICDE 2012 demo]

[Figure: Original Records → DP Interface (ε/2-DP) → DP unit Histogram → Multi-dimensional partitioning → DP Interface (ε/2-DP) → DP V-optimal Histogram]

Original Records (sample):

Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
…      …    …       …

1. Compute unit histogram counts with differential privacy
2. Use DP unit histogram for partitioning
3. Compute V-optimal histogram counts with differential privacy

SLIDE 16

Use kd-tree for partitioning

Choose dimension and splitting point to split (minimize variance)

Repeat until:

– Count of this node less than threshold
– Variance or entropy of this node less than threshold

SLIDE 17

DPCube [SecureDM 2010, ICDE 2012 demo]

Limitations:

– DP unit histogram very noisy
– Affects the accuracy of partitioning

SLIDE 18

Private Spatial decompositions [CPSSY 12]

[Figure: quadtree and kd-tree decompositions]

– Build: partitioning with differential privacy
– Release: a private description of data distribution (in the form of bounding boxes and noisy counts)

SLIDE 19

Building a Private kd-tree

Process to build a private kd-tree

Input: maximum height h, minimum leaf size L, data set
Choose dimension to split
Get (private) median in this dimension
Create child nodes and add noise to the counts
Recurse until:

– Max height is reached
– Noisy count of this node less than L
– Budget along the root-leaf path has been used up

The entire PSD satisfies DP by the composition property
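The build process can be sketched as below for 2-D points. Several choices here are assumptions rather than the method of [CPSSY 12]: the median is chosen with the exponential mechanism over a fixed grid of candidates, the split dimension simply alternates, and the budget is divided uniformly over levels and equally between median and count.

```python
import math
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_median(values, lo, hi, epsilon, rng, n_candidates=32):
    """Choose a split in (lo, hi) with the exponential mechanism.

    A candidate's score is minus its rank distance from the middle, which
    changes by at most 1 when one record is added or removed.
    """
    step = (hi - lo) / (n_candidates + 1)
    candidates = [lo + (i + 1) * step for i in range(n_candidates)]
    n = len(values)
    scores = [-abs(sum(v < c for v in values) - n / 2) for c in candidates]
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / 2.0) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

def build(points, box, depth, h, min_leaf, eps_median, eps_count, rng):
    """Release a noisy count for this node, then split privately if allowed."""
    noisy = len(points) + laplace(1.0 / eps_count, rng)
    node = {"box": box, "count": noisy, "children": []}
    if depth == h or noisy < min_leaf:  # stopping uses only the noisy count
        return node
    dim = depth % 2  # alternate x and y; this choice does not look at the data
    lo, hi = box[dim]
    split = private_median([p[dim] for p in points], lo, hi, eps_median, rng)
    left_box, right_box = list(box), list(box)
    left_box[dim], right_box[dim] = (lo, split), (split, hi)
    halves = [
        ([p for p in points if p[dim] < split], tuple(left_box)),
        ([p for p in points if p[dim] >= split], tuple(right_box)),
    ]
    node["children"] = [
        build(pts, b, depth + 1, h, min_leaf, eps_median, eps_count, rng)
        for pts, b in halves
    ]
    return node

def private_kdtree(points, box, h, min_leaf, epsilon, rng):
    # Each of the h+1 levels gets epsilon/(h+1): half for medians, half for counts.
    # Nodes on one level are disjoint (parallel composition); levels along a
    # root-leaf path add up (sequential composition), so the total is <= epsilon.
    eps_level = epsilon / (h + 1)
    return build(points, box, 0, h, min_leaf, eps_level / 2, eps_level / 2, rng)

rng = random.Random(3)
points = [(rng.random(), rng.random()) for _ in range(500)]
tree = private_kdtree(points, ((0.0, 1.0), (0.0, 1.0)), h=3, min_leaf=20, epsilon=1.0, rng=rng)
print(round(tree["count"], 1), len(tree["children"]))
```

Leaves never spend their median share, so this uniform split is slightly wasteful; it is kept simple on purpose.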

SLIDE 20

Building PSDs – privacy budget allocation

Budget is split between medians and counts at each node

– Tradeoff accuracy of division with accuracy of counts

Budget is split across levels of the tree

– Privacy budget used along any root-leaf path should total ε
– Optimal budget allocation
– Post processing with consistency check
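One way to split ε across levels is geometrically, giving deeper levels more budget. The sketch below is illustrative only: the function name is hypothetical, and the default ratio 2^(1/3) is an assumption recalled from the quadtree analysis in [CPSSY 12], not something stated on the slide.

```python
def geometric_budget(epsilon, h, ratio=2 ** (1 / 3)):
    """Split epsilon over levels 0 (root) .. h (leaves), growing by `ratio` per level."""
    weights = [ratio ** depth for depth in range(h + 1)]
    total = sum(weights)
    return [epsilon * w / total for w in weights]

budgets = geometric_budget(1.0, h=4)
print([round(b, 3) for b in budgets], round(sum(budgets), 6))
```

Whatever the ratio, the per-level budgets sum to ε, so any root-leaf path spends exactly the total budget.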

SLIDE 21

Building PSDs – privacy budget allocation

– Sequential composition: budgets spent on nodes along a root-leaf path add up
– Parallel composition: nodes at the same level cover disjoint regions, so one level's budget covers all of them

SLIDE 22

Data Transformations

Can think of trees as a ‘data-dependent’ transform of input
Can apply other data transformations
General idea:

– Apply transform of data
– Add noise in the transformed space (based on sensitivity)
– Publish noisy coefficients, or invert transform (post-processing)

Goal: pick a transform that preserves good properties of data

– And which has low sensitivity, so noise does not corrupt

[Figure: Original Data → Transform → Coefficients → + Noise → Noisy Coefficients → Invert → Private Data]

SLIDE 23

Linear transformations

Approach

– Discretize domain to finest granularity cells
– Use Laplace mechanism to answer batch of queries, each of which is linear combination of cell counts

Examples

– Hierarchical: Trees [HRMS10,QYL13], full height quadtree [CPSSY12]
– Haar Wavelet [XWG10]
– Discrete Fourier transform [BCDKMT07]

Inverting transformation

– Some transformations (e.g. tree) have redundancy (over-constrained), so require pseudo-inverse
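As one instance of a linear transformation, the sketch below releases a binary tree of interval sums over the cell counts and answers a range query from as few noisy nodes as possible. It is a simplified illustration, not the algorithm of any cited paper: names are hypothetical, the number of cells must be a power of two, and the consistency (pseudo-inverse) step is omitted.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_tree(counts, epsilon, rng):
    """Noisy interval sums; levels[0] holds the cells, levels[-1] the root.

    len(counts) must be a power of two.
    """
    levels = [list(counts)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    # One record contributes to exactly one node per level, so the L1
    # sensitivity of the whole tree equals the number of levels.
    scale = len(levels) / epsilon
    return [[c + laplace(scale, rng) for c in level] for level in levels]

def range_sum(levels, a, b):
    """Estimate the sum of cells a..b-1 from O(log n) noisy nodes."""
    total, level = 0.0, 0
    while a < b:
        if a % 2 == 1:          # a is a right child: take it, move right
            total += levels[level][a]
            a += 1
        if b % 2 == 1:          # b-1 is a left child: take it, move left
            b -= 1
            total += levels[level][b]
        a //= 2
        b //= 2
        level += 1
    return total

rng = random.Random(11)
cells = [3, 1, 4, 1, 5, 9, 2, 6]
levels = noisy_tree(cells, epsilon=1.0, rng=rng)
print(range_sum(levels, 1, 6))
```

Each node carries more noise than in the unit histogram, but a long range touches only a logarithmic number of nodes instead of every cell it covers.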


SLIDE 24

Lossy transformations

Variants

– Drop “small” coefficients:

Quad-tree with early stopping (noisy count threshold)
Fourier coefficients: EFPA [ACC12], [RN10]

– Data-adaptive discretization:

PrivTree [ZXX16], KD-Tree [CPSSY12], DAWA [LHMY14], [DNRR15], [QYL13], [BLR08]

– Data-adaptive measurement:

MWEM [HLM12], DualQuery [GAHRW14]

– Randomized transforms: sketches and compressed sensing

JL Transform [BBDS12], Compressive mechanism [LZWY11]

“Inverting” transformation

– Because they are lossy, they are under-constrained and require estimation

Error rates depend on input

– Can be much lower (trades off small bias for lower variance)
– Warrants careful empirical evaluation; algorithms are “data dependent”

SLIDE 25

Empirical benchmarks

[HMMCZ16] propose a novel evaluation framework for standardized evaluation of privacy algorithms.

Study of algorithms for range query answering over 1 and 2D
Benchmark website www.dpcomp.org

One finding from [HMMCZ16]: Some data-dependent algorithms fail to offer benefits at larger scales (no. of tuples)

SLIDE 26

Outline

Tabular data and histogram/range queries
Algorithms for low dimensional data

– Baseline
– Partitioning algorithms: kd tree, quad tree, …
– Transformation: Wavelet, Fourier Transform, …

Algorithms for high dimensional data

– Copula functions [LXJ14]
– Bayesian networks [ZCPSX14]

SLIDE 27

DPCopula: Motivation

Parametric methods

– Fit the data to a distribution, make inferences about parameters
– e.g. PrivacyOnTheMap

Non-parametric methods

– Learn empirical distribution through histograms
– e.g. PSD, Privelet, FP, P-HP

[Figure: Original data → Histogram → Perturbation → Synthetic data]

SLIDE 28

DPCopula

A semi-parametric method

– Non-parametric estimation for each dimension (DP marginal histograms)
– Parametric estimation for dependence (dependence structure)

[Figure: Original data set → DP marginal histograms (Age, Hours/week, Income) + Dependence structure → DP synthetic data set]

Original data set (sample):

Age  Hours/week  Income
42   64          30K
31   82          60K
28   40          20K
43   36          80K
…    …           …

SLIDE 29

Privacy guarantee

[Figure: Original data set → DP marginal histograms + Dependence structure → DP synthetic data set]

– DP marginal histograms: ε1-differential privacy
– Dependence structure: ε2-differential privacy
– DP synthetic data set: ε-differential privacy, with ε = ε1 + ε2

SLIDE 30

Challenges

[Figure: Original data set → DP marginal histograms + Dependence structure → DP synthetic data set]

SLIDE 31

$C(u_1, u_2, \dots, u_m)$: a multi-Gaussian density for $(X_1, X_2, \dots, X_m)$, where $u_i = F_i(X_i)$

I is the identity matrix
P is a correlation matrix, e.g. for $m = 3$:

$$P = \begin{pmatrix} 1 & .053 & .108 \\ .053 & 1 & .132 \\ .108 & .132 & 1 \end{pmatrix}$$

SLIDE 32

Gaussian copula: models the dependence with arbitrary margins
Gaussian distribution: models the joint distribution
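The separation between dependence and margins can be illustrated by sampling: draw correlated standard normals, map them to uniforms with the normal CDF, then push each uniform through its own marginal histogram. This is a non-private sketch under stated assumptions: the function names, the correlation matrix, and the histograms are hypothetical, and histogram counts are assumed non-negative (noisy counts would need clipping first).

```python
import math
import random
from statistics import NormalDist

def cholesky(P):
    """Lower-triangular L with L * L^T == P (P must be positive definite)."""
    m = len(P)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(P[i][i] - s)
            else:
                L[i][j] = (P[i][j] - s) / L[j][j]
    return L

def inverse_cdf(hist, u):
    """Map u in (0, 1) to a bin index via the histogram's empirical CDF."""
    total = sum(hist)
    acc = 0.0
    for idx, c in enumerate(hist):
        acc += c / total
        if u <= acc:
            return idx
    return len(hist) - 1  # guard against rounding in the running sum

def sample_copula(P, hists, n, rng):
    """Draw n records: correlated normals -> uniforms -> marginal bins."""
    L = cholesky(P)
    std = NormalDist()
    m = len(P)
    out = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        x = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        u = [std.cdf(v) for v in x]
        out.append(tuple(inverse_cdf(h, ui) for h, ui in zip(hists, u)))
    return out

# Hypothetical correlation matrix and per-attribute histograms (bin counts).
P = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
hists = [[5, 3, 2], [1, 1], [2, 6, 2]]
print(sample_copula(P, hists, 5, random.Random(1)))
```

The margins can be any histograms; only the correlation matrix carries the dependence, which is the point of the copula construction.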

SLIDE 33

[Figure: Original data set (Age, Hours/week, Income) → DP marginal histograms + DP dependence structure (DP correlation matrix, via MLE) → DP synthetic data set]

Step 1: Computing DP marginal histograms
Step 2: Computing DP correlation matrix through DP MLE (Maximum Likelihood Estimation)
Step 3: Sampling DP synthetic data

DP correlation matrix:

$$\tilde{P} = \begin{pmatrix} 1 & .053 & .108 \\ .053 & 1 & .132 \\ .108 & .132 & 1 \end{pmatrix}$$

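SLIDE 34

Overview

[Figure: the original data set is partitioned on Gender; DPCopula runs on each partition (Gender = F, Gender = M) and produces synthetic records for it; noisy partition counts are computed with the Laplace mechanism]

Original data set (sample):

Age  Hours/week  Income  Gender
42   64          30K     F
31   82          60K     M
28   40          20K     F
43   36          80K     M
…    …           …       …

Partition Gender = F: (42, 64, 30K), (28, 40, 20K), …
Partition Gender = M: (31, 82, 60K), (43, 36, 80K), …

Synthetic output, first partition: (36, 62, 28K), (30, 42, 18K), …
Synthetic output, second partition: (34, 76, 52K), (31, 32, 69K), …

Noisy partition counts: $\tilde{n}_1 = n_1 + \mathrm{Lap}(1/\epsilon)$, $\tilde{n}_2 = n_2 + \mathrm{Lap}(1/\epsilon)$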

SLIDE 35

Datasets

– US Census data: 4 attributes, 100,000 records
– Brazil data: 8 attributes, 188,846 records
– Synthetic data

Comparison:

– PSD, Privelet+, FP, P-HP

Metrics:

– Random range-count queries with random query predicates covering all attributes
– Relative error
– Absolute error

SLIDE 36

Query accuracy vs. differential privacy budget

SLIDE 37

Query accuracy vs. query range size

SLIDE 38

Gaussian dependence assumption
Pair-wise attribute correlation does not scale with high dimensions

SLIDE 39

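[Figure: pipeline from sensitive database D to synthetic database D*. Step labels: convert, + noise, noisy distribution, sample. Low-dimensional route: D → (convert) → a set of low-dim distributions → (+ noise) → noisy low-dim distributions → (convert) → approximate full-dim tuple distribution → (sample) → D*]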

SLIDE 40

A 5-dimensional database:

age, workclass, education, title, income

Pr[age]
Pr[work | age]
Pr[edu | age]
Pr[title | work]
Pr[income | work]

SLIDE 41

A 5-dimensional database:

age, workclass, education, title, income

Pr[∗] ≈ Pr[age] ⋅ Pr[work | age] ⋅ Pr[edu | age] ⋅ Pr[title | work] ⋅ Pr[income | work]
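The factorization above can be evaluated directly once each conditional table is known. The sketch below uses binary attributes and made-up probabilities; `joint_probability`, the table layout, and all numbers are hypothetical.

```python
from itertools import product

def joint_probability(record, parents, cpts):
    """Multiply Pr[attr | parent] over all attributes of the network.

    parents maps attribute -> parent attribute (None for the root);
    cpts[attr][parent_value][value] is the conditional probability.
    """
    p = 1.0
    for attr, parent in parents.items():
        parent_value = record[parent] if parent is not None else None
        p *= cpts[attr][parent_value][record[attr]]
    return p

# Network structure from the slide: age is the root, work and edu depend on
# age, title and income depend on work.
parents = {"age": None, "work": "age", "edu": "age", "title": "work", "income": "work"}

# Hypothetical tables over binary values; one shared table for brevity.
root = {None: {0: 0.6, 1: 0.4}}
cond = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
cpts = {"age": root, "work": cond, "edu": cond, "title": cond, "income": cond}

record = dict.fromkeys(parents, 0)
print(joint_probability(record, parents, cpts))

# The factorized probabilities of all 2^5 records sum to 1.
total = sum(
    joint_probability(dict(zip(parents, vals)), parents, cpts)
    for vals in product([0, 1], repeat=5)
)
print(round(total, 12))
```

Five small tables stand in for a 32-cell joint distribution here; with more attributes and larger domains the saving is what makes noise injection feasible.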

SLIDE 42

STEP 1: Choose a suitable Bayesian network 𝒩

– must be done in a differentially private way

STEP 2: Compute conditional distributions implied by 𝒩

– straightforward to do under differential privacy
– inject noise – Laplace mechanism

STEP 3: Generate synthetic data by sampling from 𝒩

– post-processing: no privacy issues
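Steps 2 and 3 can be sketched for a single edge of the network. This is an illustration under assumptions, not the procedure of [ZCPSX14]: noise is added to raw joint counts with sensitivity 1 (add/remove one record), negative noisy counts are clipped to zero, and `epsilon` is whatever share of the budget this one conditional receives. All names are hypothetical.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_conditional(records, child, parent, child_dom, parent_dom, epsilon, rng):
    """Estimate Pr[child | parent] from Laplace-noised joint counts."""
    counts = {(pv, cv): 0 for pv in parent_dom for cv in child_dom}
    for rec in records:
        counts[(rec[parent], rec[child])] += 1
    cpt = {}
    for pv in parent_dom:
        # One record changes a single joint count by 1 (sensitivity 1).
        row = [max(0.0, counts[(pv, cv)] + laplace(1.0 / epsilon, rng)) for cv in child_dom]
        total = sum(row)
        # Fall back to uniform if noise wiped out the whole row.
        cpt[pv] = {
            cv: (r / total if total > 0 else 1.0 / len(child_dom))
            for cv, r in zip(child_dom, row)
        }
    return cpt

def sample_child(cpt, parent_value, rng):
    """Step 3 for one attribute: sample a child value given its parent's value."""
    values = list(cpt[parent_value])
    weights = [cpt[parent_value][v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(5)
records = [{"age": 0, "work": 0}] * 30 + [{"age": 0, "work": 1}] * 10 + [{"age": 1, "work": 1}] * 20
cpt = noisy_conditional(records, "work", "age", [0, 1], [0, 1], epsilon=1.0, rng=rng)
print(cpt)
print(sample_child(cpt, 0, rng))
```

Clipping and normalizing happen after the noise is added, so they are post-processing and cost no extra privacy budget.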

SLIDE 43

Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of mutual information I of its edges ⇔ finding the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).
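The maximum-spanning-tree construction can be sketched with Prim's algorithm over empirical mutual information. This is the non-private version only; function names and the toy records are hypothetical, and a differentially private variant would have to replace the exact `max` with a private selection.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(records, x, y):
    """Empirical mutual information I(X, Y) in nats."""
    n = len(records)
    pxy = Counter((r[x], r[y]) for r in records)
    px = Counter(r[x] for r in records)
    py = Counter(r[y] for r in records)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

def chow_liu_tree(records, attrs):
    """Maximum spanning tree over attributes (Prim), edges weighted by I(X, Y)."""
    weight = {
        frozenset(pair): mutual_information(records, *pair)
        for pair in combinations(attrs, 2)
    }
    in_tree, edges = {attrs[0]}, []
    while len(in_tree) < len(attrs):
        u, v = max(
            ((a, b) for a in in_tree for b in attrs if b not in in_tree),
            key=lambda e: weight[frozenset(e)],
        )
        edges.append((u, v))  # u becomes the parent of v
        in_tree.add(v)
    return edges

# Toy data: B copies A exactly, C is independent of both.
records = [
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 1},
]
print(chow_liu_tree(records, ["A", "B", "C"]))
```

Starting from the first attribute and directing each new edge away from the tree gives every attribute at most one parent, which is the in-degree-1 condition.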

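SLIDE 44

Build a 1-degree BN for database

[Table: records Alan, Bob, Cykie, David, Eric, Frank, George, Helen, Ivan, Jack over binary attributes A, B, C, D; the positions of the 1-entries are not recoverable from the extracted text]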

SLIDE 45

Start from a random attribute A