Sequence Data Continuous Aggregates Distance-based sampling - PowerPoint PPT Presentation

Sequence Data • Continuous Aggregates • Distance-based sampling • Transformation-based • Model-based filtering and sampling • Frequent sequential patterns

CS573 Data Privacy and Security • Differential Privacy – Sequence Data • Li Xiong

Continuous Aggregates with Differential Privacy • Example counts per bin over time (t1, t2, t3): a: 100, 90, 100 • b: 20, 50, 20 • c: 20, 10, 20

Baseline • Compute a perturbed histogram at each time point – large perturbation error

Baseline • Compute one perturbed histogram at a sampled time point – large sampling error
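
The first baseline can be sketched as follows. This is a minimal illustration, not the course's reference code: it assumes each individual contributes to one bin per time point (so one histogram has L1 sensitivity 1) and that the total budget `epsilon` is split evenly across the T time points. The function name and the example counts are hypothetical.

```python
import numpy as np

def perturb_each_time_point(counts, epsilon, rng):
    """Add Laplace noise to every histogram in a (T, B) array of counts.

    Assumptions (not from the slides): sensitivity 1 per histogram and an
    even split of the total budget epsilon over the T time points.
    """
    counts = np.asarray(counts, dtype=float)
    num_steps = counts.shape[0]
    per_step_epsilon = epsilon / num_steps
    scale = 1.0 / per_step_epsilon  # Laplace scale = sensitivity / budget
    return counts + rng.laplace(0.0, scale, size=counts.shape)

rng = np.random.default_rng(0)
counts = [[100, 20, 20], [90, 50, 10], [100, 20, 20]]  # rows are time points
noisy = perturb_each_time_point(counts, epsilon=1.0, rng=rng)
```

Because the per-step budget shrinks as T grows, the noise scale grows linearly with the number of time points, which is the large perturbation error the slide refers to.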

Distance-based Adaptive Sampling • Only generate a new histogram when the change is significant • Need to tune the distance threshold • Reference: Haoran Li, Li Xiong, Xiaoqian Jiang, Jinfei Liu. Differentially Private Histogram Publication for Dynamic Datasets: An Adaptive Sampling Approach. CIKM 2015
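
The sampling decision can be illustrated with a rough sketch. It is not the algorithm from the cited paper: the L1 distance, the per-step budgets `eps_dist` and `eps_hist`, and the function name are assumptions made for illustration, and overall privacy budget accounting across time points is deliberately left out.

```python
import numpy as np

def adaptive_release(counts, threshold, eps_dist, eps_hist, rng):
    """Release a new noisy histogram only when the noisy distance is large.

    counts: (T, B) array of true histograms. Returns the released series
    and the list of time indices at which a new histogram was generated.
    """
    counts = np.asarray(counts, dtype=float)
    released, sampled_at = [], []
    current = None
    for t, hist in enumerate(counts):
        if current is None:
            publish = True  # always publish the first histogram
        else:
            # L1 distance to the last release, perturbed before it is used
            dist = np.abs(hist - current).sum()
            dist += rng.laplace(0.0, 1.0 / eps_dist)
            publish = dist > threshold
        if publish:
            current = hist + rng.laplace(0.0, 1.0 / eps_hist, size=hist.shape)
            sampled_at.append(t)
        released.append(current.copy())  # otherwise repeat the last release
    return np.array(released), sampled_at
```

A small threshold makes the scheme behave like the per-time-point baseline, while a large threshold makes it behave like the single-sample baseline, which is why the threshold needs tuning.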

Transformation-Based • Represent the original counts over time as a time series • Baseline: Laplace perturbation at each time point

Discrete Fourier Transform [Rastogi and Nath, SIGMOD 10] • Pipeline: aggregate time series X → Discrete Fourier Transform → retain only the first d coefficients to reduce sensitivity → Laplace perturbation → inverse DFT → released series R • Higher accuracy (perturbation error O(d) + reconstruction error) • Offline or batch processing only
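
The pipeline can be sketched with NumPy's real FFT. This is a simplified illustration rather than the cited paper's exact mechanism: the noise `scale` is left as a parameter because the slides do not give its calibration, and adding independent Laplace noise to the real and imaginary parts is an assumption of this sketch.

```python
import numpy as np

def dft_perturb(series, d, scale, rng):
    """Keep the first d DFT coefficients, perturb them, and invert.

    scale is the Laplace noise scale (hypothetical parameter; its
    calibration to epsilon is not covered here).
    """
    x = np.asarray(series, dtype=float)
    coeffs = np.fft.rfft(x)               # DFT of a real-valued series
    kept = np.zeros_like(coeffs)          # discarded coefficients stay zero
    noise = rng.laplace(0.0, scale, size=d) + 1j * rng.laplace(0.0, scale, size=d)
    kept[:d] = coeffs[:d] + noise         # perturb only the retained ones
    return np.fft.irfft(kept, n=len(x))   # released series, same length as x
```

Only d values are perturbed instead of one per time point, at the cost of the reconstruction error introduced by dropping the remaining coefficients. The whole series must be available before the transform, which matches the offline-only limitation noted on the slide.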

Model-Based Filtering and Sampling • Represent the original counts over time as a time series • Use model-based prediction to de-noise

State-Space Model • Figure: the original series ($x_k$, $x_{k+1}$) evolves through the process, and the released series ($z_k$, $z_{k+1}$) is its perturbed version

Filtering: State-Space Model • Process model: $x_{k+1} = x_k + \omega$, with process noise $\omega \sim \mathcal{N}(0, Q)$ • Measurement model: $z_k = x_k + \nu$, with measurement noise $\nu \sim \mathrm{Lap}(\lambda)$ • Given the noisy measurement $z_k$, how do we estimate the true state $x_k$?

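Posterior Estimation • Denote $\mathbb{Z}_k = \{z_1, \ldots, z_k\}$, the noisy observations up to time $k$ • Posterior estimate: $\hat{x}_k = E(x_k \mid \mathbb{Z}_k)$ • Posterior distribution: $f(x_k \mid \mathbb{Z}_k) = \frac{f(x_k \mid \mathbb{Z}_{k-1})\, f(z_k \mid x_k)}{f(z_k \mid \mathbb{Z}_{k-1})}$ • Challenge: $f(z_k \mid \mathbb{Z}_{k-1})$ and $f(x_k \mid \mathbb{Z}_{k-1})$ are difficult to compute when $f(z_k \mid x_k) = f(\nu)$ is not Gaussian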

Filtering: Solutions • Option 1: approximate the measurement noise with a Gaussian, $\nu \sim \mathcal{N}(0, R)$ → the Kalman filter • Option 2: estimate the posterior density by the Monte Carlo method, $f(x_k \mid \mathbb{Z}_k) = \sum_{i=1}^{N} \pi_k^i\, \delta(x_k - x_k^i)$, where $\{x_k^i, \pi_k^i\}_{i=1}^{N}$ is a set of weighted samples (particles) → particle filters • Reference: Liyue Fan, Li Xiong. Adaptively Sharing Real-Time Aggregates with Differential Privacy. IEEE TKDE, 2013
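
Option 1 can be sketched as a scalar Kalman filter for a random-walk state observed through additive noise. This is an illustrative sketch, not the cited paper's implementation: here `Q` is the process noise variance, `R` is the variance of the Gaussian that stands in for the Laplace measurement noise, and `x0`, `P0` are a hypothetical initial estimate and its variance. How to pick `R` from the Laplace scale is not specified on the slide, so it is left to the caller.

```python
def kalman_filter(measurements, Q, R, x0, P0):
    """Posterior estimates of a random-walk state from noisy measurements."""
    x, P = x0, P0
    estimates = []
    for z in measurements:
        # Predict: the state carries over, its uncertainty grows by Q.
        x_prior, P_prior = x, P + Q
        # Correct: blend the prediction and the measurement via the gain K.
        K = P_prior / (P_prior + R)
        x = x_prior + K * (z - x_prior)
        P = (1.0 - K) * P_prior
        estimates.append(x)
    return estimates
```

A large `R` (heavy perturbation) pushes the gain toward 0, so the estimate leans on the prediction; a small `R` pushes the gain toward 1, so the estimate follows the noisy measurement.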

Adaptive Sampling • Adjust the sampling rate based on feedback (the error between the posterior and prior estimates)

Adaptive Sampling: PID Control • Feedback error: measures how well the data model describes the current trend • PID error ($\Delta$): a compound of proportional, integral, and derivative errors • Proportional: the current error • Integral: the integral of errors in a recent time window • Derivative: the rate of change of the error • Determines a new sampling interval: $I' = I + \theta\,(1 - e^{(\Delta - \xi)/\xi})$, where $\theta$ represents the magnitude of change and $\xi$ is the set point for the sampling process.
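
A small sketch of the controller follows, using delta for the PID error, theta for the magnitude of change, and xi for the set point. The gains `kp`, `ki`, `kd`, the averaging used for the integral term, and the clamping of the interval to a positive integer are assumptions of this sketch, not details given on the slide.

```python
import math

def pid_error(errors, times, kp, ki, kd, window):
    """Combine proportional, integral, and derivative feedback errors.

    errors[i] is the feedback error observed at sampling time times[i].
    """
    proportional = errors[-1]
    recent = errors[-window:]
    integral = sum(recent) / len(recent)  # average error in the recent window
    if len(errors) >= 2:
        derivative = (errors[-1] - errors[-2]) / (times[-1] - times[-2])
    else:
        derivative = 0.0
    return kp * proportional + ki * integral + kd * derivative

def next_interval(interval, delta, theta, xi):
    """New sampling interval I' = I + theta * (1 - exp((delta - xi) / xi))."""
    new = interval + theta * (1.0 - math.exp((delta - xi) / xi))
    return max(1, round(new))  # assumption: keep a positive integer interval
```

When the PID error equals the set point the interval is unchanged; a larger error shortens the interval (sample more often) and a smaller error lengthens it.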

Evaluation: Data Sets • Synthetic data with 1000 data points: • Linear: process model • Logistic: $x_k = A(1 + e^{-k})^{-1}$ • Sinusoidal: $x_k = A \sin(\omega k + \varphi)$ • Flu: CDC flu data 2006-2010, 209 data points • Traffic: UW/intelligent transportation systems research 2003-2004, 540 data points • Unemployment: St. Louis Federal Reserve Bank, 478 data points • Figure panels: Traffic, Flu
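
The three synthetic series could be generated along these lines. The parameter values (amplitude, angular frequency, phase, process variance, starting value) and the choice to start the index at k = 0 are hypothetical, since the slide does not state them.

```python
import numpy as np

def linear_series(n, q, x0, rng):
    """Random-walk process model: each step adds Gaussian noise with variance q."""
    steps = rng.normal(0.0, np.sqrt(q), size=n - 1)
    return x0 + np.concatenate(([0.0], np.cumsum(steps)))

def logistic_series(n, amplitude):
    """Logistic curve amplitude / (1 + exp(-k)) for k = 0, ..., n - 1."""
    k = np.arange(n)
    return amplitude / (1.0 + np.exp(-k))

def sinusoidal_series(n, amplitude, omega, phi):
    """Sinusoid amplitude * sin(omega * k + phi) for k = 0, ..., n - 1."""
    k = np.arange(n)
    return amplitude * np.sin(omega * k + phi)

rng = np.random.default_rng(0)
linear = linear_series(1000, q=1.0, x0=100.0, rng=rng)
logistic = logistic_series(1000, amplitude=5000.0)
sinusoidal = sinusoidal_series(1000, amplitude=2500.0, omega=0.01, phi=0.0)
```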

Illustration: Original data stream vs. released data stream • FAST provides less data volume and higher data utility/integrity with formal privacy guarantee

Results: Utility vs. Privacy • Figure panels: Flu, Traffic, Unemployment

Multi-dimensional time-series or spatial-temporal data: Traffic monitoring • Goal: release $\mathbf{R}_c$ for each cell $c$ • Approaches: temporal modeling and spatial partitioning • Reference: Liyue Fan, Li Xiong, Vaidy Sunderam. Differentially Private Multi-Dimensional Time-Series Release for Traffic Monitoring. DBSec, 2013 (best student paper award)

Temporal modeling and estimation • Univariate time-series modeling for each cell • Road network based process variance modeling: $x_{k+1} = x_k + \omega$, $\omega \sim \mathcal{N}(0, Q)$, with a small value of $Q$ for sparse cells and a large value for dense cells

Spatial Partitioning and Estimation • Data has sparse and uniform regions and is dynamically changing • Failed attempt: dynamic feedback-driven partitioning • Approach: road network density based quad-tree partitioning

Results on Brinkhoff moving objects data • Dataset: 500K objects at the beginning; 25K new objects at every timestamp; 100 timestamps

Frequent Sequential Patterns with Differential Privacy • Reference: S Xu, S Su, X Cheng, Z Li, L Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015

Frequent sequential pattern mining with medical data • Longitudinal observations • Genome sequences

Non-private frequent sequence mining (FSM) – An Example
• Database D (ID: record): 100: a → c → d; 200: b → c → d; 300: a → b → c → e → d; 400: d → b; 500: a → d → c → d
• C1, candidate 1-sequences (support after scanning D): {a} 3, {b} 3, {c} 4, {d} 4, {e} 1
• F1, frequent 1-sequences (support): {a} 3, {b} 3, {c} 4, {d} 4
• C2, candidate 2-sequences (support after scanning D): {a → a} 0, {a → b} 1, {a → c} 3, {a → d} 3, {b → a} 0, {b → b} 2, {b → c} 2, {b → d} 1, {c → a} 0, {c → b} 0, {c → c} 0, {c → d} 4, {d → a} 0, {d → b} 1, {d → c} 1, {d → d} 0
• F2, frequent 2-sequences (support): {a → c} 3, {a → d} 3, {c → d} 4
• C3, candidate 3-sequences: {a → b → c}
• F3, frequent 3-sequences (support after scanning D): {a → b → c} 3
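
The "Scan D" step amounts to counting, for each candidate, the records that contain it as a subsequence (items in order, not necessarily adjacent). A minimal sketch, with hypothetical helper names `contains` and `support`, applied to the five example records:

```python
def contains(record, pattern):
    """True if pattern occurs in record as an ordered subsequence."""
    remaining = iter(record)
    # Each 'item in remaining' consumes the iterator up to the match,
    # so later items must appear after earlier ones.
    return all(item in remaining for item in pattern)

def support(database, pattern):
    """Number of records that contain the pattern."""
    return sum(contains(record, pattern) for record in database)

D = [list("acd"), list("bcd"), list("abced"), list("db"), list("adcd")]

support_ac = support(D, "ac")  # records 100, 300, 500
support_cd = support(D, "cd")  # records 100, 200, 300, 500
```

Candidates whose support meets the minimum threshold become the frequent sequences used to generate the next, longer candidates.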

Naïve Private FSM
• Database D (ID: record): 100: a → c → d; 200: b → c → d; 300: a → b → c → e → d; 400: d → b; 500: a → d → c → d
• C1, candidate 1-sequences after scanning D, noise Lap(|C1| / ε1) (support, noise): {a} 3, 0.2; {b} 3, -0.4; {c} 4, 0.4; {d} 4, -0.5; {e} 1, 0.8
• F1, frequent 1-sequences (noisy support): {a} 3.2, {c} 4.4, {d} 3.5
• C2, candidate 2-sequences after scanning D, noise Lap(|C2| / ε2) (support, noise): {a → a} 0, 0.2; {a → c} 3, 0.3; {a → d} 3, 0.2; {c → a} 0, -0.5; {c → c} 0, 0.8; {c → d} 4, 0.2; {d → a} 0, 0.3; {d → c} 1, 2.1; {d → d} 0, -0.5
• F2, frequent 2-sequences (noisy support): {a → c} 3.3, {a → d} 3.2, {c → d} 4.2, {d → c} 3.1
• C3, candidate 3-sequences after scanning D, noise Lap(|C3| / ε3) (support, noise): {a → c → d} 3, 0; {a → d → c} 1, 0.3
• F3, frequent 3-sequences (noisy support): {a → c → d} 3

Challenges • The number of generated candidate sequences is large • This leads to a large amount of perturbation noise required by differential privacy
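
The link between the number of candidates and the amount of noise can be made concrete with a short sketch of the naïve approach: every candidate support receives Laplace noise with scale |C_k| / ε_k, so the scale grows linearly with the number of candidates. The function name and the threshold value are hypothetical.

```python
import numpy as np

def noisy_frequent(supports, epsilon_k, threshold, rng):
    """Perturb candidate supports and keep those at or above the threshold.

    supports: dict mapping each candidate sequence to its true support.
    The Laplace scale is the number of candidates divided by epsilon_k.
    """
    scale = len(supports) / epsilon_k
    noisy = {seq: sup + rng.laplace(0.0, scale) for seq, sup in supports.items()}
    return {seq: value for seq, value in noisy.items() if value >= threshold}

rng = np.random.default_rng(0)
c1 = {"a": 3, "b": 3, "c": 4, "d": 4, "e": 1}
f1 = noisy_frequent(c1, epsilon_k=1.0, threshold=3.0, rng=rng)
```

With 5 candidates and ε_k = 1 the noise scale is already 5, larger than every true support in the toy example, so frequent sequences can be dropped and infrequent ones kept.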

Observations • Observation 1 • The number of real frequent sequences is much smaller than the number of candidate sequences • In MSNBC, for θ = 0.02, n(2-seq) = 32 vs. n(cand 2-seq) = 225 • In BIBLE, for θ = 0.15, n(2-seq) = 21 vs. n(cand 2-seq) = 254 • Observation 2 • The frequencies of most patterns in a small part of the database are approximately equal to their frequencies in the original database • Reason: the frequency of a pattern can be considered as the probability of a record containing this pattern

PFS² Algorithm • Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning • Basic idea • Mining frequent sequences in order of increasing length • Using the k-th sample database for pruning candidate k-sequences • Partitioning the original database by random sampling • Figure: the original database is partitioned into the 1st, 2nd, …, m-th sample databases
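
The partitioning and pruning idea can be illustrated with a simplified sketch. It shows only random partitioning into m disjoint sample databases and frequency-based pruning on one sample; it adds no noise, so it is not itself differentially private and should not be read as the published algorithm. The names `partition`, `prune`, and `min_freq` are hypothetical.

```python
import random

def partition(database, m, seed=0):
    """Randomly split the database into m disjoint sample databases."""
    shuffled = list(database)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]

def contains(record, pattern):
    """True if pattern occurs in record as an ordered subsequence."""
    remaining = iter(record)
    return all(item in remaining for item in pattern)

def prune(candidates, sample, min_freq):
    """Keep candidates whose relative frequency in the sample reaches min_freq."""
    if not sample:
        return list(candidates)  # nothing to estimate from; keep everything
    kept = []
    for cand in candidates:
        freq = sum(contains(record, cand) for record in sample) / len(sample)
        if freq >= min_freq:
            kept.append(cand)
    return kept
```

Pruning on the k-th sample relies on Observation 2: a pattern's frequency in a random sample approximates its frequency in the full database, so far fewer candidates need their supports perturbed on the original data.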
