Change detection in multi-dimensional datasets and time series Andrea De Simone andrea.desimone@sissa.it Univ. Camerino, 2019-02-26 [DS, Jacques – arXiv:1807.06038]
Outline
1. Two-Sample Test: Intro & Motivation
2. Nearest Neighbors Two-Sample Test (NN2ST)
3. Gaussian Examples
4. Outlook: Time Series Data
Two-Sample Test
Two sets:
Trial: $T \equiv \{x_1, \ldots, x_{N_T}\} \overset{\mathrm{iid}}{\sim} p_T$
Benchmark: $B \equiv \{x'_1, \ldots, x'_{N_B}\} \overset{\mathrm{iid}}{\sim} p_B$
with $x_i, x'_i \in \mathbb{R}^D$ and $p_B$, $p_T$ unknown.
[Figure: scatter plots of the Benchmark Sample and the Trial Sample in the $(x_1, x_2)$ plane.]
« Are $B$, $T$ drawn from the same probability distribution? »
easy … hard!
Two-Sample Test: Why is it important?
• detect departures from benchmark
• find anomalous points (outliers)
• check if observed data are compatible with expectations
• detect changes in underlying distributions
• detect events/shifts in time series in real time
Two-Sample Test: Desiderata for a statistical test
(1) model-independent: no assumption about the underlying physical model to interpret data → more general
(2) non-parametric: compare two samples as a whole (not just their means, etc.) → fewer assumptions, no maximum-likelihood estimation
(3) un-binned: high-dimensional feature space partitioned without rectangular bins → retain full multi-dimensional information of the data
Two-Sample Test: Recipe
(1) Density Estimator → reconstruct PDF from samples
(2) Test Statistic (TS) → "measure distance" between PDFs
(3) TS distribution → associate probabilities to TS under null hypothesis $H_0: p_B = p_T$
(4) p-value → if $p < \alpha$ then reject $H_0$
Let's build the Nearest Neighbors Two-Sample Test (NN2ST)
1. Density Estimator
Divide space in square bins?
✓ easy
✓ can use simple statistics (e.g. $\chi^2$)
✗ hard/slow/impossible in high-$D$
Need un-binned, multi-variate approach.
Find PDF estimators $\hat{p}_B$, $\hat{p}_T$, e.g. based on densities of points:
$\hat{p}_{B,T}(x) = \rho_{B,T}(x) / N_{B,T}$
Nearest Neighbors! [Schilling 1986, Henze 1988] [Wang et al. 2005-2006, Perez-Cruz 2008]
1. Density Estimator
• Fix integer $K$.
• Choose query point $x_j$ in $T$ and draw it in $B$.
• Find the distance $r_{j,B}$ of the $K$th-NN of $x_j$ in $B$.
• Find the distance $r_{j,T}$ of the $K$th-NN of $x_j$ in $T$.
• Estimate PDFs:
$\hat{p}_B(x_j) = \frac{K}{N_B} \frac{1}{\omega_D \, r_{j,B}^D}$
$\hat{p}_T(x_j) = \frac{K}{N_T - 1} \frac{1}{\omega_D \, r_{j,T}^D}$
[Figure: query point $x_j$ shown in $B$ and in $T$, with the radii $r_{j,B}$ and $r_{j,T}$.]
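The estimator steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the NN2ST repository: the function names are hypothetical, the neighbor search is brute force, and $\omega_D$ is assumed to be the volume of the unit ball in $\mathbb{R}^D$.

```python
import math
import numpy as np

def unit_ball_volume(dim):
    """Volume of the unit ball in R^dim (assumed meaning of omega_D)."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)

def kth_nn_distance(queries, sample, k, exclude_self=False):
    """Distance from each query point to its k-th nearest neighbor in sample."""
    diff = queries[:, None, :] - sample[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist.sort(axis=1)
    # If the queries are the sample itself, column 0 holds the zero
    # self-distance, so the k-th true neighbor sits in column k.
    col = k if exclude_self else k - 1
    return dist[:, col]

def knn_densities(trial, bench, k):
    """K-NN density estimates of p_B and p_T at every trial point x_j."""
    n_t, dim = trial.shape
    n_b = bench.shape[0]
    omega = unit_ball_volume(dim)
    r_b = kth_nn_distance(trial, bench, k)
    r_t = kth_nn_distance(trial, trial, k, exclude_self=True)
    p_b = k / (n_b * omega * r_b ** dim)
    p_t = k / ((n_t - 1) * omega * r_t ** dim)
    return p_b, p_t

# Toy data (hypothetical): both samples drawn from a 2D standard normal.
rng = np.random.default_rng(0)
bench = rng.normal(size=(500, 2))
trial = rng.normal(size=(400, 2))
p_b, p_t = knn_densities(trial, bench, k=5)
```

The brute-force distance matrix costs $O(N_T N_B)$ memory; for large samples a tree-based neighbor search would be the natural replacement.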
2. Test Statistic
• Measure the "distance" between 2 PDFs
• Define Test Statistic (to detect under-/over-densities):
$\mathrm{TS}(T) \equiv \frac{1}{N_T} \sum_{j=1}^{N_T} \log \frac{\hat{p}_T(x_j)}{\hat{p}_B(x_j)}$
• From NN-estimated PDFs:
$\mathrm{TS}(T) = \frac{D}{N_T} \sum_{j=1}^{N_T} \log \frac{r_{j,B}}{r_{j,T}} + \log \frac{N_B}{N_T - 1}$
• Related to Kullback-Leibler divergence as:
$\mathrm{TS}(T) = \hat{D}_{\mathrm{KL}}(\hat{p}_T \,\|\, \hat{p}_B)$, with $D_{\mathrm{KL}}(p \,\|\, q) \equiv \int_{\mathbb{R}^D} p(x) \log \frac{p(x)}{q(x)} \, dx$
• Theorem: this estimator converges to $D_{\mathrm{KL}}(p_T \,\|\, p_B)$ in the large sample limit [Wang et al. 2005, 2006]
3. Test Statistic Distribution
How is TS distributed? Permutation test!
• Assume $p_B = p_T$.
• Form the union set $U = T \cup B$.
• Randomly reshuffle $U$ into $(\tilde{B}, \tilde{T})$.
• Compute the test statistic $\mathrm{TS}_n$ on $(\tilde{B}, \tilde{T})$.
• Repeat many times.
Distribution of TS under $H_0$: $f(\mathrm{TS} \mid H_0) \leftarrow \{\mathrm{TS}_n\}$ [asymptotically normal with zero mean]
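4. p-value
• Find $\hat{\mu}$, $\hat{\sigma}^2$: mean, variance of $f(\mathrm{TS} \mid H_0)$
• Standardize the TS: $\mathrm{TS} \to \mathrm{TS}' \equiv \frac{\mathrm{TS} - \hat{\mu}}{\hat{\sigma}}$
• $\mathrm{TS}'$ distributed according to $f'(\mathrm{TS}' \mid H_0) = \hat{\sigma} \, f(\hat{\mu} + \hat{\sigma}\,\mathrm{TS}' \mid H_0)$
• Two-sided p-value: $p = 2 \int_{|\mathrm{TS}'_{\mathrm{obs}}|}^{\infty} f'(\mathrm{TS}' \mid H_0) \, d\mathrm{TS}'$
[Figure: the p-value as the area in the tails beyond $\pm|\mathrm{TS}'_{\mathrm{obs}}|$.]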
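One possible way to turn the permutation values into a p-value is sketched below. It assumes that $f'$ can be approximated by a standard normal (motivated by the asymptotic normality noted for the TS distribution), in which case the two-sided integral reduces to a complementary error function. The function name and the synthetic null sample are hypothetical.

```python
import math
import numpy as np

def two_sided_p_value(ts_obs, null_ts):
    """Standardize TS_obs with the null mean/std and return (p, TS'_obs).

    Assumption: the standardized null distribution f' is approximated
    by a standard normal, so 2 * integral_{|z|}^{inf} = erfc(|z| / sqrt(2)).
    """
    mu = np.mean(null_ts)
    sigma = np.std(null_ts, ddof=1)
    z = (ts_obs - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return p, z

# Synthetic null sample (hypothetical), standing in for the {TS_n} values.
rng = np.random.default_rng(2)
null_ts = rng.normal(loc=0.0, scale=0.02, size=1000)
mu_hat = np.mean(null_ts)
sigma_hat = np.std(null_ts, ddof=1)

# An observed TS that sits 1.96 null standard deviations above the null mean.
p, z = two_sided_p_value(mu_hat + 1.96 * sigma_hat, null_ts)
```

Under the normal approximation, very small p-values can be quoted even with a modest number of permutations; an empirical tail count would instead be limited to roughly $1/N_{\mathrm{perm}}$.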
NN2ST: Summary
INPUT:
• Trial sample: $T \equiv \{x_1, \ldots, x_{N_T}\} \overset{\mathrm{iid}}{\sim} p_T$
• Benchmark sample: $B \equiv \{x'_1, \ldots, x'_{N_B}\} \overset{\mathrm{iid}}{\sim} p_B$
  ($x_i, x'_i \in \mathbb{R}^D$; $p_B$, $p_T$ unknown)
• $K$: number of nearest neighbors
• $N_{\mathrm{perm}}$: number of permutations
OUTPUT:
• p-value of the null hypothesis $H_0: p_B = p_T$
[check compatibility between 2 samples] [detect changes in underlying distributions]
NN2ST: Summary
[Diagram: Benchmark sample, Trial sample → K-NN density ratio estimation → Test Statistic $\mathrm{TS}_{\mathrm{obs}}$ → permutation test → TS distribution → p-value (tails beyond $\pm|\mathrm{TS}_{\mathrm{obs}}|$).]
Python code: github.com/de-simone/NN2ST
[DS, Jacques – arXiv:1807.06038]
NN2ST: Summary
✓ general, model-independent
✓ solid math foundations
✓ fast, no optimization
✓ sensitive to unspecified signals
✗ need to run for each sample pair
✗ permutation test is bottleneck
NN2ST on Gaussian Samples
Random samples from $D$-dimensional Gaussians, $D = 2$:
$p_B = \mathcal{N}(\mu_B, \Sigma_B)$, $p_T = \mathcal{N}(\mu_T, \Sigma_T)$,
$\mu_B = \begin{pmatrix} 1.0 \\ 1.0 \end{pmatrix}$, $\mu_T = \begin{pmatrix} 1.2 \\ 1.2 \end{pmatrix}$, $\Sigma_B = \Sigma_T = I_2$.
[Figure: TS vs. $N_B$ for $K = 3$ and $K = 20$, showing convergence to the exact KL divergence.]
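For reference, the exact KL divergence that the estimates are compared against can be computed from the standard closed form for two multivariate normals. The sketch below is a small check with a hypothetical helper name; since both covariances are the identity here, the result reduces to half the squared distance between the means and is the same in either direction.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Closed-form D_KL(p || q) for multivariate normals p and q."""
    dim = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    delta = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + delta @ cov_q_inv @ delta
        - dim
        + logdet_q
        - logdet_p
    )

# Parameters of the D = 2 example: means (1.0, 1.0) and (1.2, 1.2), unit covariance.
mu_b = np.array([1.0, 1.0])
mu_t = np.array([1.2, 1.2])
eye = np.eye(2)

kl_exact = gaussian_kl(mu_t, eye, mu_b, eye)  # 0.5 * (0.2**2 + 0.2**2), about 0.04
```

A value of about 0.04 sits in the middle of the 0.00 to 0.08 range of the TS axis in the convergence plot.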
NN2ST on Gaussian Samples
Dataset    μ          Σ
B          $1_D$      $I_D$
T_G0       $1_D$      $I_D$
T_G1       $1.12_D$   $I_D$
T_G2       $1_D$      $\begin{pmatrix} A & 0 \\ 0 & I_{D-2} \end{pmatrix}$, with $A = \begin{pmatrix} 0.95 & 0.1 \\ 0.1 & 0.8 \end{pmatrix}$
T_G3       $1.15_D$   $I_D$
Parameters: $N_B = N_T = 20\,000$, $K = 5$, $N_{\mathrm{perm}} = 1\,000$.
[Figure: p-value vs. $N_B$ and p-value vs. dimension $D$ for T_G0, T_G1, T_G2, T_G3, with the $Z = 5$ level marked.]
more data, more power; higher $D$, more power
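A minimal sketch of how such datasets could be generated is given below. It rests on assumptions not spelled out on the slide: $1_D$ is read as the $D$-vector with every component equal to 1 (and $1.12_D$ likewise with 1.12), the T_G2 covariance is taken to be the identity with its upper-left $2 \times 2$ block replaced, and the helper names and the small sample size are hypothetical.

```python
import numpy as np

def tg2_covariance(dim):
    """Identity covariance with the upper-left 2x2 block replaced (assumed layout)."""
    cov = np.eye(dim)
    cov[:2, :2] = [[0.95, 0.1], [0.1, 0.8]]
    return cov

def sample_gaussian(mean_value, cov, size, rng):
    """Draw `size` points from N(mean_value * ones, cov)."""
    mean = np.full(cov.shape[0], mean_value)
    return rng.multivariate_normal(mean, cov, size=size)

rng = np.random.default_rng(0)
dim = 4          # illustrative dimension
n_points = 2000  # smaller than the 20 000 used on the slide, to keep the demo light

bench = sample_gaussian(1.0, np.eye(dim), n_points, rng)
trial_g0 = sample_gaussian(1.0, np.eye(dim), n_points, rng)
trial_g1 = sample_gaussian(1.12, np.eye(dim), n_points, rng)
trial_g2 = sample_gaussian(1.0, tg2_covariance(dim), n_points, rng)
trial_g3 = sample_gaussian(1.15, np.eye(dim), n_points, rng)
```

T_G0 shares the benchmark distribution, so it serves as the control case in which $H_0$ holds.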
Outlook: time series data
[Caveat Emptor: very preliminary!]
Real-time detection of changes in data streams: variation in underlying mechanism generating data.
$T$, $B$ samples: windows of time series data, ending at discrete times $t$, $t'$:
$T_t = \{x_{t-N+1}, \ldots, x_t\}$, $B_{t'} = \{x_{t'-N+1}, \ldots, x_{t'}\}$, $(N_B = N_T \equiv N)$.
Trial window sliding forward with time. Benchmark window anchored or rolling.
• anchored $B$ window: $t' = N$ → $B_{t'} = \{x_1, \ldots, x_N\}$. Captures cumulative changes over time.
• adjacent windows: $t' = t - N$ → $B_{t'} = \{x_{t-2N+1}, \ldots, x_{t-N}\}$. Captures "rate of change" at current time.
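The two windowing schemes translate into simple array slices. The sketch below is illustrative only: the function names are hypothetical, and times are taken to be 1-based as in the formulas, so $x_1$ is stored at array index 0.

```python
import numpy as np

def trial_window(series, t, n):
    """T_t = {x_{t-N+1}, ..., x_t} for a 1-based time index t."""
    if t < n:
        raise ValueError("need t >= N for a full trial window")
    return series[t - n:t]

def anchored_benchmark(series, n):
    """B_{t'} with t' = N: the first N points, fixed once and for all."""
    return series[:n]

def adjacent_benchmark(series, t, n):
    """B_{t'} with t' = t - N: the N points just before the trial window."""
    if t < 2 * n:
        raise ValueError("need t >= 2N for adjacent windows")
    return series[t - 2 * n:t - n]

# Toy series (hypothetical) with x_i = i, so the slices are easy to read.
series = np.arange(1, 21)
n, t = 5, 12

trial = trial_window(series, t, n)           # x_8 .. x_12
anchored = anchored_benchmark(series, n)     # x_1 .. x_5
adjacent = adjacent_benchmark(series, t, n)  # x_3 .. x_7
```

For multi-dimensional features the same slices apply to the first axis of an array of shape `(time, D)`.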
Outlook: time series data
[Figure: adjacent vs. anchored windows.]
• Feature space can be high-dimensional: prices (OHLC), prices of related markets, indicators, volumes, ...
• Reduce false alarms with persistence factor $\gamma$ ($\sim 1\%$): $H_0$ rejected $\gamma \cdot N$ times in a row → detected change in market conditions.
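The persistence rule can be sketched as a run-length check over the sequence of per-step test outcomes. This is an assumed reading of the rule: the function name is hypothetical, and rounding $\gamma \cdot N$ to the nearest integer (with a minimum of one) is a choice made for this example.

```python
def detect_change(rejections, gamma, n):
    """Index of the first step at which H0 has been rejected
    round(gamma * n) times in a row, or None if that never happens."""
    needed = max(1, round(gamma * n))
    run = 0
    for i, rejected in enumerate(rejections):
        # Extend the current run of rejections, or reset it on an acceptance.
        run = run + 1 if rejected else 0
        if run >= needed:
            return i
    return None

# Toy outcomes (hypothetical): True means "H0 rejected at this time step".
outcomes = [False, True, True, False, True, True, True, True]

# With N = 300 and gamma = 1%, three consecutive rejections are required.
first_alarm = detect_change(outcomes, gamma=0.01, n=300)
```

Here the isolated pair of rejections at steps 1 and 2 is ignored, and the alarm is raised at step 6, when the third consecutive rejection occurs.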
Take-Home Messages
(1) Proposed a new statistical test: NN2ST
(2) Model-independent and suitable for high-$D$ data
(3) Excellent results on static datasets
(4) Promising applications for change detection in time series data