Change detection in multi-dimensional datasets and time series
Andrea De Simone (andrea.desimone@sissa.it)
Univ. Camerino, 2019-02-26
[DS, Jacques, arXiv:1807.06038]

Outline
1 Two-Sample Test: Intro & Motivation
2 Nearest Neighbors Two-Sample Test (NN2ST)
3 Gaussian Examples
4 Outlook: Time Series Data

Two-Sample Test
Two sets:
Trial: $T \equiv \{x_1, \ldots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$
Benchmark: $B \equiv \{x'_1, \ldots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$
with $x_i, x'_i \in \mathbb{R}^D$ and $p_B$, $p_T$ unknown.
[Figure: scatter plots of the Benchmark Sample and the Trial Sample in the $(x_1, x_2)$ plane]
« Are $B$, $T$ drawn from the same probability distribution? »
Sometimes easy ... sometimes hard!

Two-Sample Test
Why is it important?
• detect departures from benchmark
• find anomalous points (outliers)
• check if observed data are compatible with expectations
• detect changes in underlying distributions
• detect events/shifts in time series in real time

Two-Sample Test
Desiderata for a statistical test
(1) model-independent: no assumption about underlying physical model to interpret data → more general
(2) non-parametric: compare two samples as a whole (not just their means, etc.) → fewer assumptions, no maximum-likelihood estimation
(3) un-binned: high-dim feature space partitioned without rectangular bins → retain full multi-dim info of data

Two-Sample Test
Recipe
(1) Density Estimator → reconstruct PDF from samples
(2) Test Statistic (TS) → "measure distance" between PDFs
(3) TS distribution → associate probabilities to TS under null hypothesis $H_0: p_B = p_T$
(4) $p$-value → if $p < \alpha$ then reject $H_0$
Let's build the Nearest Neighbors Two-Sample Test (NN2ST)

1. Density Estimator
Divide space in square bins?
✓ easy
✓ can use simple statistics (e.g. $\chi^2$)
✗ hard/slow/impossible in high-$D$
Need un-binned, multi-variate approach.
Find PDF estimators $\hat{p}_B$, $\hat{p}_T$, e.g. based on densities of points:
$$\hat{p}_{B,T}(x) = \frac{\rho_{B,T}(x)}{N_{B,T}}$$
Nearest Neighbors! [Schilling 1986, Henze 1988] [Wang et al. 2005-2006, Perez-Cruz 2008]

1. Density Estimator
• Fix integer $K$.
• Choose query point $x_j$ in $T$ and draw it in $B$.
• Find the distance $r_{j,B}$ of the $K$th-NN of $x_j$ in $B$.
• Find the distance $r_{j,T}$ of the $K$th-NN of $x_j$ in $T$.
• Estimate PDFs:
$$\hat{p}_B(x_j) = \frac{K}{N_B} \, \frac{1}{\omega_D \, r_{j,B}^D}, \qquad \hat{p}_T(x_j) = \frac{K}{N_T - 1} \, \frac{1}{\omega_D \, r_{j,T}^D}$$
[Figure: query point $x_j$ drawn in $B$ and in $T$, with the distances $r_{j,B}$ and $r_{j,T}$]
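The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the NN2ST repository. The function names are hypothetical, the neighbor search is brute force (a KD-tree would be the usual choice for large samples), and $\omega_D$ is taken to be the volume of the unit ball in $\mathbb{R}^D$.

```python
import math


def unit_ball_volume(dim):
    """Volume omega_D of the unit ball in R^D (assumed meaning of omega_D)."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)


def kth_nn_distance(query, sample, k, skip_self=False):
    """Euclidean distance from `query` to its k-th nearest neighbor in `sample`.

    With skip_self=True the smallest distance is dropped first; use it when
    `query` is itself a member of `sample` (its self-distance is zero).
    """
    dists = sorted(math.dist(query, point) for point in sample)
    if skip_self:
        dists = dists[1:]
    return dists[k - 1]


def knn_densities(x_j, benchmark, trial, k):
    """Return (p_B_hat, p_T_hat) at a query point x_j taken from `trial`."""
    dim = len(x_j)
    omega = unit_ball_volume(dim)
    r_b = kth_nn_distance(x_j, benchmark, k)
    r_t = kth_nn_distance(x_j, trial, k, skip_self=True)
    p_b = k / len(benchmark) / (omega * r_b ** dim)
    p_t = k / (len(trial) - 1) / (omega * r_t ** dim)
    return p_b, p_t
```

The $N_T - 1$ in the trial estimate reflects that the query point is excluded from its own neighbor search, which is what `skip_self=True` does here.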

2. Test Statistic
• Measure the "distance" between 2 PDFs
• Define Test Statistic (to detect under-/over-densities):
$$\mathrm{TS}(T) \equiv \frac{1}{N_T} \sum_{j=1}^{N_T} \log \frac{\hat{p}_T(x_j)}{\hat{p}_B(x_j)}$$
• From NN-estimated PDFs:
$$\mathrm{TS}(T) = \frac{D}{N_T} \sum_{j=1}^{N_T} \log \frac{r_{j,B}}{r_{j,T}} + \log \frac{N_B}{N_T - 1}$$
• Related to Kullback-Leibler divergence as:
$$\mathrm{TS}(T) = \hat{D}_{\mathrm{KL}}(\hat{p}_T \,\|\, \hat{p}_B), \qquad D_{\mathrm{KL}}(p \,\|\, q) \equiv \int_{\mathbb{R}^D} p(x) \log \frac{p(x)}{q(x)} \, dx$$
• Theorem: this estimator converges to $D_{\mathrm{KL}}(p_T \,\|\, p_B)$ in the large sample limit [Wang et al. 2005, 2006]
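A possible direct transcription of the NN form of TS is sketched below. The names `nn2st_statistic` and `kth_nn_distance` are hypothetical, the neighbor search is brute force, and the sketch assumes continuous data without repeated points (a zero distance would break the logarithm).

```python
import math


def kth_nn_distance(query, sample, k, skip_self=False):
    """Distance from `query` to its k-th nearest neighbor in `sample`."""
    dists = sorted(math.dist(query, point) for point in sample)
    if skip_self:
        # Drop the zero self-distance when `query` belongs to `sample`.
        dists = dists[1:]
    return dists[k - 1]


def nn2st_statistic(benchmark, trial, k):
    """TS(T) = (D / N_T) * sum_j log(r_jB / r_jT) + log(N_B / (N_T - 1))."""
    dim = len(trial[0])
    n_b, n_t = len(benchmark), len(trial)
    log_ratio_sum = 0.0
    for x_j in trial:
        r_b = kth_nn_distance(x_j, benchmark, k)
        r_t = kth_nn_distance(x_j, trial, k, skip_self=True)
        log_ratio_sum += math.log(r_b / r_t)
    return dim * log_ratio_sum / n_t + math.log(n_b / (n_t - 1))
```

Note that $K$ and $\omega_D$ cancel in the density ratio, so only the two distances and the sample sizes enter.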

3. Test Statistic Distribution
How is TS distributed? Permutation test!
• Assume $p_B = p_T$. Union set $U = T \cup B$.
• Random reshuffle of $U$ into $(\tilde{B}, \tilde{T})$.
• Compute the test statistic $\mathrm{TS}_n$ on $(\tilde{B}, \tilde{T})$.
• Repeat many times.
Distribution of TS under $H_0$: $f(\mathrm{TS} \mid H_0) \leftarrow \{\mathrm{TS}_n\}$ [asymptotically normal with zero mean]
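The reshuffling loop might look as follows. This is a sketch under stated assumptions: `permutation_test` and `mean_shift` are hypothetical names, a toy difference-of-means statistic stands in for TS so that the block stays self-contained, and the $p$-value is obtained by empirical two-sided counting with a +1 correction rather than by using the asymptotic normal form quoted on the slide.

```python
import random


def mean_shift(benchmark, trial):
    """Toy stand-in statistic: difference of sample means (1-D data)."""
    return sum(trial) / len(trial) - sum(benchmark) / len(benchmark)


def permutation_test(benchmark, trial, statistic, n_perm, seed=0):
    """Return (ts_obs, null_ts, p_value) for H0: p_B = p_T."""
    rng = random.Random(seed)
    ts_obs = statistic(benchmark, trial)
    pooled = list(benchmark) + list(trial)  # union set U
    n_b = len(benchmark)
    null_ts = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reshuffle of U
        # Split into a pseudo-benchmark and a pseudo-trial of the original sizes.
        null_ts.append(statistic(pooled[:n_b], pooled[n_b:]))
    # Empirical two-sided p-value; the +1 terms avoid reporting exactly zero.
    extreme = sum(abs(ts) >= abs(ts_obs) for ts in null_ts)
    p_value = (extreme + 1) / (n_perm + 1)
    return ts_obs, null_ts, p_value
```

Any function of two samples can be passed as `statistic`; with this counting rule the smallest reportable $p$-value is $1 / (N_{\text{perm}} + 1)$.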

NN2ST: Summary
INPUT:
• Trial sample: $T \equiv \{x_1, \ldots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$
• Benchmark sample: $B \equiv \{x'_1, \ldots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$
  ($x_i, x'_i \in \mathbb{R}^D$; $p_B$, $p_T$ unknown)
• $K$: number of nearest neighbors
• $N_{\text{perm}}$: number of permutations
OUTPUT:
• $p$-value of the null hypothesis $H_0: p_B = p_T$
[check compatibility between 2 samples] [detect changes in underlying distributions]

NN2ST: Summary
[Figure: NN2ST pipeline. Benchmark sample and Trial sample → K-NN density ratio estimation → Test Statistic $\mathrm{TS}_{\text{obs}}$; permutation test → TS distribution, with $-|\mathrm{TS}_{\text{obs}}|$ and $|\mathrm{TS}_{\text{obs}}|$ marked → $p$-value]
Python code: github.com/de-simone/NN2ST
[DS, Jacques, arXiv:1807.06038]

NN2ST: Summary
✓ general, model-independent
✓ solid math foundations
✓ fast, no optimization
✓ sensitive to unspecified signals
✗ need to run for each sample pair
✗ permutation test is bottleneck

NN2ST on Gaussian Samples
Random samples from $D$-dimensional Gaussians, $D = 2$:
$$p_B = \mathcal{N}(\mu_B, \Sigma_B), \qquad p_T = \mathcal{N}(\mu_T, \Sigma_T),$$
$$\mu_B = \begin{pmatrix} 1.0 \\ 1.0 \end{pmatrix}, \qquad \mu_T = \begin{pmatrix} 1.2 \\ 1.2 \end{pmatrix}, \qquad \Sigma_B = \Sigma_T = I_2.$$
[Figure: TS vs. $N_B$ (from $10^2$ to $10^7$) for $K = 3$ and $K = 20$: convergence to exact KL divergence]
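For reference, the exact KL divergence for this setup has a closed form: for two Gaussians with identity covariance it is half the squared distance between the means. The short check below uses a hypothetical helper name and only covers the identity-covariance case.

```python
def kl_gaussians_identity_cov(mu_p, mu_q):
    """D_KL(N(mu_p, I) || N(mu_q, I)) = 0.5 * |mu_p - mu_q|^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))


mu_b = (1.0, 1.0)
mu_t = (1.2, 1.2)
kl_exact = kl_gaussians_identity_cov(mu_t, mu_b)
print(round(kl_exact, 6))  # 0.04
```

With equal covariances the divergence is symmetric in the two means, so the order of $p_T$ and $p_B$ does not matter in this particular example.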

NN2ST on Gaussian Samples
Dataset    μ           Σ
B          $1_D$       $I_D$
T_G0       $1_D$       $I_D$
T_G1       $1.12_D$    $I_D$
T_G2       $1_D$       $\begin{pmatrix} 0.95 & 0.1 & 0 \\ 0.1 & 0.8 & 0 \\ 0 & 0 & I_{D-2} \end{pmatrix}$
T_G3       $1.15_D$    $I_D$
Parameters: $N_B = N_T = 20\,000$, $K = 5$, $N_{\text{perm}} = 1\,000$.
[Figure: $p$-value vs. $N_B$ for the trial datasets $T_{G0}$ to $T_{G3}$, with $Z = 5$ marked: more data, more power]
[Figure: $p$-value vs. dimension $D$ for the trial datasets $T_{G0}$ to $T_{G3}$, with $Z = 5$ marked: higher $D$, more power]

Outlook: time series data
[Caveat Emptor: very preliminary!]
Real-time detection of changes in data streams: variation in underlying mechanism generating data.
$T$, $B$ samples: windows of time series data, ending at discrete times $t$, $t'$:
$$T_t = \{x_{t-N+1}, \ldots, x_t\}, \qquad B_{t'} = \{x_{t'-N+1}, \ldots, x_{t'}\}, \qquad (N_B = N_T \equiv N).$$
Trial window sliding forward with time. Benchmark window anchored or rolling.
• anchored $B$ window: $t' = N$ → $B_{t'} = \{x_1, \ldots, x_N\}$. Captures cumulative changes over time.
• adjacent windows: $t' = t - N$ → $B_{t'} = \{x_{t-2N+1}, \ldots, x_{t-N}\}$. Captures "rate of change" at current time.
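The two window choices translate into simple slices. The sketch below uses hypothetical function names and assumes the series is stored in a Python list with $x_1$ at index 0, so the time index $t$ is 1-based as on the slide.

```python
def trial_window(series, t, n):
    """T_t = {x_{t-N+1}, ..., x_t}; requires t >= N."""
    if t < n:
        raise ValueError("need t >= N for a full trial window")
    return series[t - n:t]


def anchored_benchmark(series, n):
    """B_{t'} with t' = N: the first N observations, fixed once and for all."""
    return series[:n]


def adjacent_benchmark(series, t, n):
    """B_{t'} with t' = t - N: the N observations just before the trial window."""
    if t < 2 * n:
        raise ValueError("need t >= 2N for adjacent windows")
    return series[t - 2 * n:t - n]
```

For example, with `series = [1, 2, ..., 20]`, `n = 3` and `t = 10`, the trial window is `[8, 9, 10]`, the anchored benchmark is `[1, 2, 3]` and the adjacent benchmark is `[5, 6, 7]`.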

Outlook: time series data
[Figure: adjacent vs. anchored windows]
• Feature space can be high-dimensional: prices (OHLC), prices of related markets, indicators, volumes, ...
• Reduce false alarms with persistence factor $\gamma$ ($\sim 1\%$).
  $H_0$ rejected $\gamma \cdot N$ times in a row → detected change in market conditions
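One way to read the persistence rule is sketched below. It assumes a stream of $p$-values, one per time step, and a significance level $\alpha$. The function name is hypothetical, and rounding $\gamma \cdot N$ to the nearest integer (with a minimum of 1) is an assumption, since the slide does not say how a non-integer product is handled.

```python
def first_persistent_rejection(p_values, alpha, gamma, n):
    """Index of the step that completes the first run of round(gamma * n)
    consecutive rejections (p < alpha), or None if no such run occurs."""
    needed = max(1, round(gamma * n))
    run = 0
    for step, p in enumerate(p_values):
        # A rejection extends the current run; anything else resets it.
        run = run + 1 if p < alpha else 0
        if run >= needed:
            return step
    return None
```

With $N = 300$ and $\gamma = 0.01$, three rejections in a row are required, so an isolated low $p$-value does not raise an alarm.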

Take-Home Messages
(1) Proposed a new statistical test: NN2ST
(2) Model-independent and suitable for high-$D$ data
(3) Excellent results on static datasets
(4) Promising applications for change detection in time series data
