Consistent Change-point Detection with Kernels
Damien Garreau (Inria, DI ENS) and Sylvain Arlot (Université Paris-Sud, Laboratoire de Mathématiques d'Orsay)
April 6, 2016
An example: shot detection in a movie
[Figure: a frame-wise signal computed on a movie, roughly 1400 frames, values between 0 and 0.7.]
An example: shot detection, cont.
[Figure: the same frame-wise signal, with the abrupt changes corresponding to shot boundaries.]
Plan
Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypothesis
  Dimension selection
  Localization of the change points
Conclusion
Goals
We want to:
◮ detect abrupt changes in the distribution of the data;
◮ deal with interesting (structured) data: each point is a curve, a graph, a histogram, a persistence diagram...
The change-point problem
◮ Let $\mathcal{X}$ be an arbitrary (measurable) set, $n < +\infty$, and $X_1, \dots, X_n \in \mathcal{X}$ a sequence of independent random variables.
◮ For every $i \in \{1, \dots, n\}$, write $P_{X_i}$ for the distribution of $X_i$.
The change-point problem can be formalized as follows:
◮ Given $(X_i)_{1 \le i \le n}$, we want to find the locations of the abrupt changes in the sequence $P_{X_1}, \dots, P_{X_n}$.
Notations
◮ For any $D \in \{1, \dots, n+1\}$, the set of sequences of $D - 1$ change-points is defined by
$$\mathcal{T}_n^D := \left\{ (\tau_0, \dots, \tau_D) \in \mathbb{N}^{D+1} \;:\; 0 = \tau_0 < \tau_1 < \cdots < \tau_D = n \right\}.$$
◮ $\tau_1, \dots, \tau_{D-1}$ are the change-points; $\tau$ is a segmentation of $\{1, \dots, n\}$ into $D_\tau$ segments.
◮ $\tau^\star$ is the true segmentation and $D^\star = D_{\tau^\star}$ the true number of segments (so there are $D^\star - 1$ true change-points).
In pictures
Here, $\mathcal{X} = \mathbb{R}$, $D^\star = 3$, and $\tau^\star = (0, 50, 70, 100)$.
[Figure: a real-valued sequence of length 100 with visible mean shifts at $t_1 = 50$ and $t_2 = 70$.]
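A toy sequence like the one pictured is easy to simulate. Below is a minimal sketch; the segment means and the noise level are hypothetical, chosen only to mimic the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# True segmentation tau_star = (0, 50, 70, 100): D_star = 3 segments,
# change-points at 50 and 70.
tau_star = [0, 50, 70, 100]
means = [1.5, -0.5, 0.5]  # hypothetical segment means (not from the talk)

X = np.concatenate([
    mu + 0.3 * rng.standard_normal(t1 - t0)  # noise sd 0.3 is an assumption
    for mu, (t0, t1) in zip(means, zip(tau_star[:-1], tau_star[1:]))
])
```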
In pictures, cont.
It is not an easy question:
[Figure: a similar sequence where, because of the noise, the change-point locations are ambiguous.]
Summary
◮ With a finite sample size, it is not easy to recover the true change-points in the presence of noise.
◮ When $\mathcal{X} = \mathbb{R}^d$ and the changes occur in the first moments of the distribution, the problem has already received considerable attention, cf. [Basseville and Nikiforov, 1993].
◮ Kernel change-point detection can tackle more subtle changes and less conventional data.
Kernels: a quick reminder
◮ Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite kernel.
◮ That is, $k$ is a measurable function such that, for all $x_1, \dots, x_m \in \mathcal{X}$, the matrix $(k(x_i, x_j))_{1 \le i,j \le m}$ is positive semidefinite. Think inner product.
◮ Examples include:
◮ the linear kernel $k(x, y) = \langle x, y \rangle$,
◮ the Gaussian kernel $k(x, y) = \exp\left(-\|x - y\|^2 / (2h^2)\right)$,
◮ the histogram kernel $k(x, y) = \sum_{k=1}^{p} \min(x_k, y_k)$,
◮ ...
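As a quick illustration, here is one way to compute the Gram matrix of each of these three kernels with NumPy; a sketch (function names are ours), assuming the data are stored as rows of an array.

```python
import numpy as np

def gram_linear(X):
    """Linear kernel k(x, y) = <x, y>; X has one observation per row."""
    return X @ X.T

def gram_gaussian(X, h=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * h ** 2))

def gram_histogram(X):
    """Histogram kernel k(x, y) = sum_k min(x_k, y_k) (entry-wise minima)."""
    return np.minimum(X[:, None, :], X[None, :, :]).sum(axis=-1)
```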
The kernel least-squares criterion
◮ Intuition: least-squares criterion
$$\widehat{\mathcal{R}}_n(\tau) := \frac{1}{n} \sum_{\ell=1}^{D_\tau} \sum_{i = \tau_{\ell-1}+1}^{\tau_\ell} \left( X_i - \overline{X}_{\tau_{\ell-1}+1, \tau_\ell} \right)^2,$$
where $\overline{X}_{a,b}$ is the empirical mean of $X_a, \dots, X_b$.
◮ Define
$$\widehat{\mathcal{R}}_n(\tau) := \frac{1}{n} \sum_{i=1}^{n} k(X_i, X_i) - \frac{1}{n} \sum_{\ell=1}^{D_\tau} \frac{1}{\tau_\ell - \tau_{\ell-1}} \sum_{i = \tau_{\ell-1}+1}^{\tau_\ell} \sum_{j = \tau_{\ell-1}+1}^{\tau_\ell} k(X_i, X_j).$$
◮ This is just a kernelized version: the two definitions coincide when $\mathcal{X} = \mathbb{R}$ and $k(x, y) = xy$.
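In code, the kernel criterion needs only the Gram matrix. A possible sketch (the function name is ours, not from the talk):

```python
import numpy as np

def kernel_criterion(K, tau):
    """Kernel least-squares criterion R_n(tau) computed from the Gram matrix K.

    tau = (tau_0, ..., tau_D) with tau_0 = 0 and tau_D = n; with 0-based
    slicing, segment ell covers rows/columns tau_{ell-1} .. tau_ell - 1.
    """
    n = K.shape[0]
    within = sum(
        K[t0:t1, t0:t1].sum() / (t1 - t0)  # (1 / segment length) * block sum
        for t0, t1 in zip(tau[:-1], tau[1:])
    )
    return np.trace(K) / n - within / n
```

With `K = np.outer(x, x)` (the linear kernel on real data), this returns the usual least-squares criterion, as claimed on the slide.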
Most important slide of the talk
We investigate the properties of
$$\widehat{\tau} \in \operatorname*{arg\,min}_{\tau \in \mathcal{T}_n} \Big\{ \underbrace{\widehat{\mathcal{R}}_n(\tau)}_{\text{least-squares criterion}} + \underbrace{\operatorname{pen}(\tau)}_{\text{penalty function}} \Big\},$$
where $\operatorname{pen}$ is a function increasing with $D_\tau$.
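Because the criterion is additive over segments, the minimization over all segmentations with a given number of segments is tractable by dynamic programming. Below is a sketch of such an implementation (our own outline, not the authors' code): it computes the best segmentation for every number of segments $D$, after which one picks the $D$ minimizing the penalized criterion.

```python
import numpy as np

def kcp_dynamic_programming(K, D_max):
    """Best segmentation into D segments, for every D <= D_max, in O(n^2 D_max).

    cost[D, t] = minimal (unnormalized) kernel cost of segmenting the first t
    points into D segments; back holds back-pointers to the change-points.
    """
    n = K.shape[0]
    diag = np.concatenate([[0.0], np.cumsum(np.diag(K))])
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)  # 2-D prefix sums

    def seg_cost(s, t):  # cost of the segment covering points s .. t-1 (0-based)
        block = S[t, t] - S[s, t] - S[t, s] + S[s, s]    # sum of K[s:t, s:t]
        return diag[t] - diag[s] - block / (t - s)

    cost = np.full((D_max + 1, n + 1), np.inf)
    back = np.zeros((D_max + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for D in range(1, D_max + 1):
        for t in range(D, n + 1):
            cands = [cost[D - 1, s] + seg_cost(s, t) for s in range(D - 1, t)]
            best = int(np.argmin(cands))
            cost[D, t] = cands[best]
            back[D, t] = best + D - 1
    return cost, back

def backtrack(back, D, n):
    """Recover the optimal (tau_0, ..., tau_D) from the back-pointers."""
    tau = [n]
    for d in range(D, 0, -1):
        tau.append(back[d, tau[-1]])
    return tau[::-1]

def select_dimension(cost, pen):
    """Pick D minimizing cost[D, n] / n + pen(D) for an increasing penalty pen."""
    n = cost.shape[1] - 1
    crits = [cost[D, n] / n + pen(D) for D in range(1, cost.shape[0])]
    return int(np.argmin(crits)) + 1
```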
Constant mean and variance
The distribution of $X_i$ is chosen among $\mathcal{B}(0.5)$, $\mathcal{N}(0.5, 0.25)$, and $\Gamma(1, 0.5)$, so that the mean and variance stay constant while the distribution changes.
[Figure: a sequence of length 1000 drawn this way; the changes are invisible to the eye. Courtesy of [Arlot et al., 2012].]
Constant mean and variance, cont. 0.6 0.6 0.5 0.5 Freq. of selected chgpts Freq. of selected chgpts 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 100 200 300 400 500 600 700 800 900 1000 100 200 300 400 500 600 700 800 900 1000 Position Position 0.6 Freq. of selected chgpts 0.5 0.4 0.3 0.2 0.1 0 100 200 300 400 500 600 700 800 900 1000 Position Linear, Hermite, and Gaussian kernels (courtesy of [Arlot et al., 2012]). 16
More notations
◮ Along with the kernel $k$ comes a reproducing kernel Hilbert space $\mathcal{H}$, endowed with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$.
◮ There exists a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that, for any $x, y \in \mathcal{X}$, $k(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}}$.
◮ The algorithm is looking for breaks in the "mean" of $Y_i := \Phi(X_i) \in \mathcal{H}$.
◮ Whenever possible, define $\mu_i^\star$ as the mean of $Y_i$; it satisfies
$$\forall g \in \mathcal{H}, \quad \langle \mu_i^\star, g \rangle_{\mathcal{H}} = \mathbb{E}[g(X_i)] = \mathbb{E}\left[\langle Y_i, g \rangle_{\mathcal{H}}\right].$$
◮ We write $Y_i = \mu_i^\star + \varepsilon_i$.
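Thanks to the kernel trick, the distance in $\mathcal{H}$ between the empirical means of two segments can be computed from the Gram matrix alone; a small sketch (names ours):

```python
import numpy as np

def rkhs_mean_gap(K, s, t, u):
    """Squared RKHS distance ||mu_hat_L - mu_hat_R||_H^2 between the empirical
    means of the segments covering points s..t-1 and t..u-1 (0-based)."""
    a = K[s:t, s:t].mean()  # <mu_hat_L, mu_hat_L>_H
    b = K[t:u, t:u].mean()  # <mu_hat_R, mu_hat_R>_H
    c = K[s:t, t:u].mean()  # <mu_hat_L, mu_hat_R>_H
    return a + b - 2.0 * c
```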
Hypothesis
◮ $\mathcal{H}$ is separable.
◮ Bounded data/kernel:
$$\exists M \in (0, +\infty), \quad \forall 1 \le i \le n, \quad k(X_i, X_i) \le M^2. \tag{Db}$$
◮ Finite variance:
$$\forall 1 \le i \le n, \quad v_i := \mathbb{E}\left[ \|\varepsilon_i\|_{\mathcal{H}}^2 \right] \le V < +\infty. \tag{V}$$
Under (Db), an oracle inequality has been proven → see [Arlot et al., 2012] for the result.
Dimension selection, light version
◮ Assume that (Db) holds true;
◮ suppose that $\operatorname{pen}(\cdot)$ is "large enough";
◮ suppose that $\Delta^2 \times \Gamma$ is "large enough", where $\Delta := \inf_{\mu_i^\star \ne \mu_{i+1}^\star} \|\mu_i^\star - \mu_{i+1}^\star\|_{\mathcal{H}}$ is the size of the smallest jump in $\mathcal{H}$, and $\Gamma$ depends only on the geometry of $\tau^\star$;
◮ then, with high probability, $D_{\widehat{\tau}} = D^\star$.
If $k$ is characteristic, we recover all the changes in $P_{X_i}$.
Dimension selection
Theorem. Let $y$ be a positive number. Assume that (Db) holds true and that
$$\forall \tau \in \mathcal{T}_n, \quad \operatorname{pen}(\tau) = \frac{C D_\tau M^2}{n} \left( 1 + 2 \sqrt{4 + y + \log \frac{n}{D_\tau}} \right)^{2},$$
with $C \ge (2 D^\star + 1)(5 + y + \log D^\star)$. Suppose that
$$\Delta^2 \times \Gamma \gtrsim \frac{C D^\star M^2}{n} \left( y + \log \frac{n}{D^\star} \right).$$
Then $\mathbb{P}\left( D_{\widehat{\tau}} = D^\star \right) \ge 1 - e^{-y}$.
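For concreteness, the penalty of the theorem, as reconstructed on this slide, can be coded in a few lines; the exact constants should be checked against the paper, so treat this as an assumption-laden sketch.

```python
import numpy as np

def pen(D, n, M, C, y):
    """Penalty of the theorem, as reconstructed above (constants assumed)."""
    return (C * D * M ** 2 / n) * (1.0 + 2.0 * np.sqrt(4.0 + y + np.log(n / D))) ** 2
```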
Distance between segmentations
◮ We consider only segmentations with the same number of segments $D^\star$.
◮ Several distances are possible; they are equivalent under assumptions on $\Lambda_\tau := \frac{1}{n} \min_{\lambda \in \tau} |\lambda|$.
◮ We focus on
$$d_\infty(\tau^1, \tau^2) := \max_{1 \le i \le D^\star - 1} \left| \tau_i^1 - \tau_i^2 \right|.$$
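Both quantities are straightforward to compute; a minimal sketch (function names ours):

```python
import numpy as np

def d_inf(tau1, tau2):
    """d_infty between two segmentations with the same number of segments:
    largest gap between matching change-points (the endpoints drop out)."""
    return np.max(np.abs(np.asarray(tau1[1:-1]) - np.asarray(tau2[1:-1])))

def Lambda(tau, n):
    """Normalized length of the smallest segment of tau."""
    return np.min(np.diff(np.asarray(tau))) / n
```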
Localization of the change-points, light version
◮ Assume that $D^\star$ is known and that (V) holds true.
◮ Take $\delta_n > 0$, and choose $\widehat{\tau}$ by minimizing $\widehat{\mathcal{R}}_n$ over
$$\mathcal{T}_n^{D^\star}(\delta_n) := \left\{ \tau \in \mathcal{T}_n \;:\; \Lambda_\tau \ge \delta_n, \; D_\tau = D^\star \right\}.$$
◮ Then, for any $0 < x < \Lambda_{\tau^\star}$,
$$\mathbb{P}\left( \frac{1}{n} d_\infty(\widehat{\tau}, \tau^\star) \ge x \right) \le \frac{V}{nx} \left( \frac{1}{\delta_n} + \frac{1}{x} \right).$$
◮ This goes to 0 whenever $\delta_n \to 0$ and $n \delta_n \to +\infty$.
Localization of the change-points
Theorem. Assume that (V) holds true. Take $\delta_n > 0$ and choose
$$\widehat{\tau} \in \operatorname*{arg\,min}_{\tau \in \mathcal{T}_n^{D^\star}(\delta_n)} \widehat{\mathcal{R}}_n(\tau).$$
Suppose that $\delta_n \le \Lambda_{\tau^\star}$. Then, for any $0 < x \le \Lambda_{\tau^\star}$,
$$\mathbb{P}\left( \frac{1}{n} d_\infty(\widehat{\tau}, \tau^\star) \ge x \right) \lesssim \frac{V D^\star}{n x \Delta^2} + \frac{(D^\star)^3}{n x \Delta^2 \delta_n}.$$
For instance, take $\delta_n = n^{-1/2}$: then $d_\infty(\widehat{\tau}, \tau^\star) = o_P(n^{-1/2})$.
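In the dynamic programming sketch given earlier, the constraint $\Lambda_\tau \ge \delta_n$ defining $\mathcal{T}_n^{D^\star}(\delta_n)$ simply restricts the positions allowed for each change-point; a possible helper (ours, hypothetical):

```python
import numpy as np

def feasible_last_breaks(D, t, n, delta_n):
    """Positions s allowed for the last change-point before t when every
    segment must contain at least ceil(delta_n * n) points: the D-1 earlier
    segments need (D-1)*m points, and the last segment (s, t] needs m."""
    m = int(np.ceil(delta_n * n))
    return range((D - 1) * m, t - m + 1)
```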
Take-away message
◮ Kernelized version of the change-point detection procedure of [Lebarbier, 2005].
◮ Detection of changes in the distribution, not only in the first moments.
◮ Possible to deal with structured data more efficiently.
◮ Under reasonable assumptions and for a class of penalty functions:
◮ we have an oracle inequality,
◮ the procedure is consistent,
◮ it recovers the true locations of the change-points.
Future work
◮ Exchange the hypotheses and still prove our results (in progress);
◮ tackle dependency structures within the $X_i$'s, as in [Lavielle and Moulines, 2000];
◮ learn how to choose the kernel;
◮ find interesting data!