Consistent Change-point Detection with Kernels
Damien Garreau (Inria, DI ENS) and Sylvain Arlot (Université Paris-Sud, Laboratoire de Mathématiques d'Orsay)
April 6, 2016
An example: shot detection in a movie
[Figure: a frame-wise signal computed on a movie, roughly 1400 frames, values between 0 and 0.7.]
An example: shot detection, cont.
[Figure: the same frame-wise signal, with the abrupt changes corresponding to shot boundaries.]
Plan
Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypothesis
  Dimension selection
  Localization of the change points
Conclusion
Goals
We want to:
◮ detect abrupt changes in the distribution of the data;
◮ deal with interesting (structured) data: each point is a curve, a graph, a histogram, a persistence diagram...
The change-point problem
◮ Let $\mathcal{X}$ be an arbitrary (measurable) set, $n < +\infty$, and $X_1, \dots, X_n \in \mathcal{X}$ a sequence of independent random variables.
◮ For every $i \in \{1, \dots, n\}$, write $P_{X_i}$ for the distribution of $X_i$.
The change-point problem can be formalized as follows:
◮ Given $(X_i)_{1 \le i \le n}$, we want to find the locations of the abrupt changes in the sequence $P_{X_1}, \dots, P_{X_n}$.
Notations
◮ For any $D \in \{1, \dots, n+1\}$, the set of sequences of $D - 1$ change-points is defined by
$$\mathcal{T}_n^D := \left\{ (\tau_0, \dots, \tau_D) \in \mathbb{N}^{D+1} \;:\; 0 = \tau_0 < \tau_1 < \cdots < \tau_D = n \right\}.$$
◮ $\tau_1, \dots, \tau_{D-1}$ are the change-points; $\tau$ is a segmentation of $\{1, \dots, n\}$ into $D_\tau$ segments.
◮ $\tau^\star$ is the true segmentation and $D^\star = D_{\tau^\star}$ the true number of segments (so there are $D^\star - 1$ true change-points).
In pictures
Here, $\mathcal{X} = \mathbb{R}$, $D^\star = 3$, and $\tau^\star = (0, 50, 70, 100)$.
[Figure: a real-valued sequence of length 100 with visible mean shifts at $t_1 = 50$ and $t_2 = 70$.]
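A toy sequence like the one pictured is easy to simulate. Below is a minimal sketch; the segment means and the noise level are hypothetical, chosen only to mimic the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# True segmentation tau_star = (0, 50, 70, 100): D_star = 3 segments,
# change-points at 50 and 70.
tau_star = [0, 50, 70, 100]
means = [1.5, -0.5, 0.5]  # hypothetical segment means (not from the talk)

X = np.concatenate([
    mu + 0.3 * rng.standard_normal(t1 - t0)  # noise sd 0.3 is an assumption
    for mu, (t0, t1) in zip(means, zip(tau_star[:-1], tau_star[1:]))
])
```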
In pictures, cont.
It is not an easy question:
[Figure: a similar sequence where, because of the noise, the change-point locations are ambiguous.]
Summary
◮ With a finite sample size, it is not easy to recover the true change-points in the presence of noise.
◮ When $\mathcal{X} = \mathbb{R}^d$ and the changes occur in the first moments of the distribution, the problem has already received considerable attention, cf. [Basseville and Nikiforov, 1993].
◮ Kernel change-point detection can tackle more subtle changes and less conventional data.
Kernels: a quick reminder
◮ Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite kernel.
◮ That is, $k$ is a measurable function such that, for all $x_1, \dots, x_m \in \mathcal{X}$, the matrix $(k(x_i, x_j))_{1 \le i,j \le m}$ is positive semidefinite. Think inner product.
◮ Examples include:
◮ the linear kernel $k(x, y) = \langle x, y \rangle$,
◮ the Gaussian kernel $k(x, y) = \exp\left(-\|x - y\|^2 / (2h^2)\right)$,
◮ the histogram kernel $k(x, y) = \sum_{k=1}^{p} \min(x_k, y_k)$,
◮ ...
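As a quick illustration, here is one way to compute the Gram matrix of each of these three kernels with NumPy; a sketch (function names are ours), assuming the data are stored as rows of an array.

```python
import numpy as np

def gram_linear(X):
    """Linear kernel k(x, y) = <x, y>; X has one observation per row."""
    return X @ X.T

def gram_gaussian(X, h=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * h ** 2))

def gram_histogram(X):
    """Histogram kernel k(x, y) = sum_k min(x_k, y_k) (entry-wise minima)."""
    return np.minimum(X[:, None, :], X[None, :, :]).sum(axis=-1)
```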
The kernel least-squares criterion
◮ Intuition: least-squares criterion
$$\widehat{\mathcal{R}}_n(\tau) := \frac{1}{n} \sum_{\ell=1}^{D_\tau} \sum_{i = \tau_{\ell-1}+1}^{\tau_\ell} \left( X_i - \overline{X}_{\tau_{\ell-1}+1, \tau_\ell} \right)^2,$$
where $\overline{X}_{a,b}$ is the empirical mean of $X_a, \dots, X_b$.
◮ Define
$$\widehat{\mathcal{R}}_n(\tau) := \frac{1}{n} \sum_{i=1}^{n} k(X_i, X_i) - \frac{1}{n} \sum_{\ell=1}^{D_\tau} \frac{1}{\tau_\ell - \tau_{\ell-1}} \sum_{i = \tau_{\ell-1}+1}^{\tau_\ell} \sum_{j = \tau_{\ell-1}+1}^{\tau_\ell} k(X_i, X_j).$$
◮ This is just a kernelized version: the two definitions coincide when $\mathcal{X} = \mathbb{R}$ and $k(x, y) = xy$.
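In code, the kernel criterion needs only the Gram matrix. A possible sketch (the function name is ours, not from the talk):

```python
import numpy as np

def kernel_criterion(K, tau):
    """Kernel least-squares criterion R_n(tau) computed from the Gram matrix K.

    tau = (tau_0, ..., tau_D) with tau_0 = 0 and tau_D = n; with 0-based
    slicing, segment ell covers rows/columns tau_{ell-1} .. tau_ell - 1.
    """
    n = K.shape[0]
    within = sum(
        K[t0:t1, t0:t1].sum() / (t1 - t0)  # (1 / segment length) * block sum
        for t0, t1 in zip(tau[:-1], tau[1:])
    )
    return np.trace(K) / n - within / n
```

With `K = np.outer(x, x)` (the linear kernel on real data), this returns the usual least-squares criterion, as claimed on the slide.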
Most important slide of the talk
We investigate the properties of
$$\widehat{\tau} \in \operatorname*{arg\,min}_{\tau \in \mathcal{T}_n} \Big\{ \underbrace{\widehat{\mathcal{R}}_n(\tau)}_{\text{least-squares criterion}} + \underbrace{\operatorname{pen}(\tau)}_{\text{penalty function}} \Big\},$$
where $\operatorname{pen}$ is a function increasing with $D_\tau$.
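Because the criterion is additive over segments, the minimization over all segmentations with a given number of segments is tractable by dynamic programming. Below is a sketch of such an implementation (our own outline, not the authors' code): it computes the best segmentation for every number of segments $D$, after which one picks the $D$ minimizing the penalized criterion.

```python
import numpy as np

def kcp_dynamic_programming(K, D_max):
    """Best segmentation into D segments, for every D <= D_max, in O(n^2 D_max).

    cost[D, t] = minimal (unnormalized) kernel cost of segmenting the first t
    points into D segments; back holds back-pointers to the change-points.
    """
    n = K.shape[0]
    diag = np.concatenate([[0.0], np.cumsum(np.diag(K))])
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)  # 2-D prefix sums

    def seg_cost(s, t):  # cost of the segment covering points s .. t-1 (0-based)
        block = S[t, t] - S[s, t] - S[t, s] + S[s, s]    # sum of K[s:t, s:t]
        return diag[t] - diag[s] - block / (t - s)

    cost = np.full((D_max + 1, n + 1), np.inf)
    back = np.zeros((D_max + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for D in range(1, D_max + 1):
        for t in range(D, n + 1):
            cands = [cost[D - 1, s] + seg_cost(s, t) for s in range(D - 1, t)]
            best = int(np.argmin(cands))
            cost[D, t] = cands[best]
            back[D, t] = best + D - 1
    return cost, back

def backtrack(back, D, n):
    """Recover the optimal (tau_0, ..., tau_D) from the back-pointers."""
    tau = [n]
    for d in range(D, 0, -1):
        tau.append(back[d, tau[-1]])
    return tau[::-1]

def select_dimension(cost, pen):
    """Pick D minimizing cost[D, n] / n + pen(D) for an increasing penalty pen."""
    n = cost.shape[1] - 1
    crits = [cost[D, n] / n + pen(D) for D in range(1, cost.shape[0])]
    return int(np.argmin(crits)) + 1
```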
Constant mean and variance
The distribution of $X_i$ is chosen among $\mathcal{B}(0.5)$, $\mathcal{N}(0.5, 0.25)$, and $\Gamma(1, 0.5)$, so that the mean and variance stay constant while the distribution changes.
[Figure: a sequence of length 1000 drawn this way; the changes are invisible to the eye. Courtesy of [Arlot et al., 2012].]
Constant mean and variance, cont. 0.6 0.6 0.5 0.5 Freq. of selected chgpts Freq. of selected chgpts 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 100 200 300 400 500 600 700 800 900 1000 100 200 300 400 500 600 700 800 900 1000 Position Position 0.6 Freq. of selected chgpts 0.5 0.4 0.3 0.2 0.1 0 100 200 300 400 500 600 700 800 900 1000 Position Linear, Hermite, and Gaussian kernels (courtesy of [Arlot et al., 2012]). 16
More notations
◮ Along with the kernel $k$ comes a reproducing kernel Hilbert space $\mathcal{H}$, endowed with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$.
◮ There exists a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that, for any $x, y \in \mathcal{X}$, $k(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}}$.
◮ The algorithm is looking for breaks in the "mean" of $Y_i := \Phi(X_i) \in \mathcal{H}$.
◮ Whenever possible, define $\mu_i^\star$ as the mean of $Y_i$; it satisfies
$$\forall g \in \mathcal{H}, \quad \langle \mu_i^\star, g \rangle_{\mathcal{H}} = \mathbb{E}[g(X_i)] = \mathbb{E}\left[\langle Y_i, g \rangle_{\mathcal{H}}\right].$$
◮ We write $Y_i = \mu_i^\star + \varepsilon_i$.
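Thanks to the kernel trick, the distance in $\mathcal{H}$ between the empirical means of two segments can be computed from the Gram matrix alone; a small sketch (names ours):

```python
import numpy as np

def rkhs_mean_gap(K, s, t, u):
    """Squared RKHS distance ||mu_hat_L - mu_hat_R||_H^2 between the empirical
    means of the segments covering points s..t-1 and t..u-1 (0-based)."""
    a = K[s:t, s:t].mean()  # <mu_hat_L, mu_hat_L>_H
    b = K[t:u, t:u].mean()  # <mu_hat_R, mu_hat_R>_H
    c = K[s:t, t:u].mean()  # <mu_hat_L, mu_hat_R>_H
    return a + b - 2.0 * c
```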
Hypothesis
◮ $\mathcal{H}$ is separable.
◮ Bounded data/kernel:
$$\exists M \in (0, +\infty), \quad \forall 1 \le i \le n, \quad k(X_i, X_i) \le M^2. \tag{Db}$$
◮ Finite variance:
$$\forall 1 \le i \le n, \quad v_i := \mathbb{E}\left[ \|\varepsilon_i\|_{\mathcal{H}}^2 \right] \le V < +\infty. \tag{V}$$
Under (Db), an oracle inequality has been proven → see [Arlot et al., 2012] for the result.
Dimension selection, light version
◮ Assume that (Db) holds true;
◮ suppose that $\operatorname{pen}(\cdot)$ is "large enough";
◮ suppose that $\Delta^2 \times \Gamma$ is "large enough", where $\Delta := \inf_{\mu_i^\star \ne \mu_{i+1}^\star} \|\mu_i^\star - \mu_{i+1}^\star\|_{\mathcal{H}}$ is the size of the smallest jump in $\mathcal{H}$, and $\Gamma$ depends only on the geometry of $\tau^\star$;
◮ then, with high probability, $D_{\widehat{\tau}} = D^\star$.
If $k$ is characteristic, we recover all the changes in $P_{X_i}$.
Dimension selection
Theorem. Let $y$ be a positive number. Assume that (Db) holds true and that
$$\forall \tau \in \mathcal{T}_n, \quad \operatorname{pen}(\tau) = \frac{C D_\tau M^2}{n} \left( 1 + 2 \sqrt{4 + y + \log \frac{n}{D_\tau}} \right)^{2},$$
with $C \ge (2 D^\star + 1)(5 + y + \log D^\star)$. Suppose that
$$\Delta^2 \times \Gamma \gtrsim \frac{C D^\star M^2}{n} \left( y + \log \frac{n}{D^\star} \right).$$
Then $\mathbb{P}\left( D_{\widehat{\tau}} = D^\star \right) \ge 1 - e^{-y}$.
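For concreteness, the penalty of the theorem, as reconstructed on this slide, can be coded in a few lines; the exact constants should be checked against the paper, so treat this as an assumption-laden sketch.

```python
import numpy as np

def pen(D, n, M, C, y):
    """Penalty of the theorem, as reconstructed above (constants assumed)."""
    return (C * D * M ** 2 / n) * (1.0 + 2.0 * np.sqrt(4.0 + y + np.log(n / D))) ** 2
```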
Distance between segmentations
◮ We consider only segmentations with the same number of segments $D^\star$.
◮ Several distances are possible; they are equivalent under assumptions on $\Lambda_\tau := \frac{1}{n} \min_{\lambda \in \tau} |\lambda|$.
◮ We focus on
$$d_\infty(\tau^1, \tau^2) := \max_{1 \le i \le D^\star - 1} \left| \tau_i^1 - \tau_i^2 \right|.$$
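Both quantities are straightforward to compute; a minimal sketch (function names ours):

```python
import numpy as np

def d_inf(tau1, tau2):
    """d_infty between two segmentations with the same number of segments:
    largest gap between matching change-points (the endpoints drop out)."""
    return np.max(np.abs(np.asarray(tau1[1:-1]) - np.asarray(tau2[1:-1])))

def Lambda(tau, n):
    """Normalized length of the smallest segment of tau."""
    return np.min(np.diff(np.asarray(tau))) / n
```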
Localization of the change-points, light version
◮ Assume that $D^\star$ is known and that (V) holds true.
◮ Take $\delta_n > 0$, and choose $\widehat{\tau}$ by minimizing $\widehat{\mathcal{R}}_n$ over
$$\mathcal{T}_n^{D^\star}(\delta_n) := \left\{ \tau \in \mathcal{T}_n \;:\; \Lambda_\tau \ge \delta_n, \; D_\tau = D^\star \right\}.$$
◮ Then, for any $0 < x < \Lambda_{\tau^\star}$,
$$\mathbb{P}\left( \frac{1}{n} d_\infty(\widehat{\tau}, \tau^\star) \ge x \right) \le \frac{V}{nx} \left( \frac{1}{\delta_n} + \frac{1}{x} \right).$$
◮ This goes to 0 whenever $\delta_n \to 0$ and $n \delta_n \to +\infty$.
Localization of the change-points
Theorem. Assume that (V) holds true. Take $\delta_n > 0$ and choose
$$\widehat{\tau} \in \operatorname*{arg\,min}_{\tau \in \mathcal{T}_n^{D^\star}(\delta_n)} \widehat{\mathcal{R}}_n(\tau).$$
Suppose that $\delta_n \le \Lambda_{\tau^\star}$. Then, for any $0 < x \le \Lambda_{\tau^\star}$,
$$\mathbb{P}\left( \frac{1}{n} d_\infty(\widehat{\tau}, \tau^\star) \ge x \right) \lesssim \frac{V D^\star}{n x \Delta^2} + \frac{(D^\star)^3}{n x \Delta^2 \delta_n}.$$
For instance, take $\delta_n = n^{-1/2}$: then $d_\infty(\widehat{\tau}, \tau^\star) = o_P(n^{-1/2})$.
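In the dynamic programming sketch given earlier, the constraint $\Lambda_\tau \ge \delta_n$ defining $\mathcal{T}_n^{D^\star}(\delta_n)$ simply restricts the positions allowed for each change-point; a possible helper (ours, hypothetical):

```python
import numpy as np

def feasible_last_breaks(D, t, n, delta_n):
    """Positions s allowed for the last change-point before t when every
    segment must contain at least ceil(delta_n * n) points: the D-1 earlier
    segments need (D-1)*m points, and the last segment (s, t] needs m."""
    m = int(np.ceil(delta_n * n))
    return range((D - 1) * m, t - m + 1)
```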
Take-away message
◮ Kernelized version of the change-point detection procedure of [Lebarbier, 2005].
◮ Detection of changes in the distribution, not only in the first moments.
◮ Possible to deal with structured data more efficiently.
◮ Under reasonable assumptions and for a class of penalty functions:
◮ we have an oracle inequality,
◮ the procedure is consistent,
◮ it recovers the true locations of the change-points.
Future work
◮ Exchange the hypotheses and still prove our results (in progress);
◮ tackle dependency structures within the $X_i$'s, as in [Lavielle and Moulines, 2000];
◮ learn how to choose the kernel;
◮ find interesting data!