

1. Segmentation of Counting Processes and Dynamical Models. PhD Thesis Defense, Mokhtar Zahdi Alaya, June 27, 2016.

2. Plan
   1. Motivations
   2. Learning the intensity of time events with change-points: piecewise constant intensity; estimation procedure; change-points detection + numerical experiments
   3. Binarsity: features binarization; binarsity penalization; generalized linear models + binarsity
   4. High-dimensional time-varying Aalen and Cox models: weighted (ℓ1 + ℓ1)-TV penalization; theoretical guarantees; algorithm + numerical experiments
   5. Conclusion + Perspectives

3. Plan (outline repeated from slide 2)

4. Weighted and unweighted TV
   For a chosen positive vector of weights $\hat w$, we define the (discrete) weighted total-variation (TV) by
   $$\|\beta\|_{\mathrm{TV},\hat w} = \sum_{j=2}^{p} \hat w_j\, |\beta_j - \beta_{j-1}|, \quad \text{for all } \beta \in \mathbb{R}^p.$$
   If $\hat w \equiv 1$, then we define the unweighted TV by
   $$\|\beta\|_{\mathrm{TV}} = \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|, \quad \text{for all } \beta \in \mathbb{R}^p.$$
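As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of the weighted and unweighted TV, assuming the weight vector stores one entry per consecutive difference:

```python
import numpy as np

def tv_norm(beta, weights=None):
    """Discrete (weighted) total-variation: sum_j w_j * |beta_j - beta_{j-1}|.

    `weights` has length len(beta) - 1 (one weight per consecutive difference);
    if None, the unweighted TV is returned.
    """
    diffs = np.abs(np.diff(beta))            # |beta_j - beta_{j-1}|, j = 2..p
    if weights is None:
        return diffs.sum()
    return np.dot(weights, diffs)

beta = np.array([1.0, 1.0, 3.0, 0.5])
print(tv_norm(beta))                                       # unweighted: 0 + 2 + 2.5 = 4.5
print(tv_norm(beta, weights=np.array([1.0, 0.5, 2.0])))    # weighted:   0 + 1 + 5   = 6.0
```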

5. Motivations for using TV
   Appropriate for multiple change-points estimation: partitioning a nonstationary signal into several contiguous stationary segments of variable duration [Harchaoui and Lévy-Leduc (2010)].
   Widely used in sparse signal processing and imaging (2D) [Chambolle et al. (2010)].
   Enforces sparsity in the discrete gradient, which is desirable for applications with features ordered in some meaningful way [Tibshirani et al. (2005)].

6. Plan (outline repeated from slide 2)

7. Counting process: stochastic setup
   $N = \{N(t)\}_{0 \le t \le 1}$ is a counting process.
   Doob-Meyer decomposition:
   $$N(t) = \underbrace{\Lambda_0(t)}_{\text{compensator}} + \underbrace{M(t)}_{\text{martingale}}, \quad 0 \le t \le 1.$$
   The intensity of $N$ is defined by
   $$\lambda_0(t)\,dt = d\Lambda_0(t) = \mathbb{P}\big[\,N \text{ has a jump in } [t, t+dt) \mid \mathcal{F}(t)\,\big],$$
   where $\mathcal{F}(t) = \sigma(N(s), s \le t)$.

8. Piecewise constant intensity
   Assume that
   $$\lambda_0(t) = \sum_{\ell=1}^{L_0} \beta_{0,\ell}\, \mathbf{1}_{(\tau_{0,\ell-1},\, \tau_{0,\ell}]}(t), \quad 0 \le t \le 1.$$
   $\{\tau_{0,0} = 0 < \tau_{0,1} < \cdots < \tau_{0,L_0-1} < \tau_{0,L_0} = 1\}$: set of true change-points.
   $\{\beta_{0,\ell} : 1 \le \ell \le L_0\}$: set of jump sizes of $\lambda_0$.
   $L_0$: number of true change-points.
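A small sketch of such a piecewise constant intensity, with hypothetical change-points and jump sizes (the values `tau0` and `beta0` below are made up for illustration):

```python
import numpy as np

# Hypothetical example: change-points (0, 0.3, 0.7, 1) and jump sizes (2, 5, 1),
# i.e. L_0 = 3 segments on [0, 1].
tau0 = np.array([0.0, 0.3, 0.7, 1.0])
beta0 = np.array([2.0, 5.0, 1.0])

def lambda0(t, tau=tau0, beta=beta0):
    """Evaluate lambda_0(t) = sum_l beta_l * 1_{(tau_{l-1}, tau_l]}(t)."""
    t = np.atleast_1d(t)
    # searchsorted with side='left' maps t in (tau_{l-1}, tau_l] to index l
    idx = np.searchsorted(tau, t, side="left")
    idx = np.clip(idx, 1, len(beta))          # t = 0 is assigned to the first segment
    return beta[idx - 1]

print(lambda0([0.1, 0.3, 0.5, 1.0]))          # -> [2. 2. 5. 1.]
```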

9. Assumption on observations
   Data: we observe $n$ i.i.d. copies of $N$ on $[0,1]$, denoted $N_1, \ldots, N_n$.
   We define $\bar N_n(I) = \frac{1}{n} \sum_{i=1}^n N_i(I)$, with $N_i(I) = \int_I dN_i(t)$, for any interval $I \subset [0,1]$.
   This assumption is equivalent to observing a single process $N$ with intensity $n\lambda_0$ (it is only used to have a notion of growing observations with an increasing $n$).
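A short sketch (with a hypothetical data layout, not specified on the slides) of how the averaged counts $\bar N_n(I_{j,m})$ over a regular grid of $m$ intervals can be computed when each copy $N_i$ is stored as an array of its event times in $[0,1]$:

```python
import numpy as np

def averaged_bin_counts(event_times_per_copy, m):
    """Return Nbar[j] = (1/n) * sum_i N_i(I_{j,m}) for I_{j,m} = ((j-1)/m, j/m].

    `event_times_per_copy` is a list of n arrays, each holding the jump times
    of one observed copy N_i on [0, 1].
    """
    n = len(event_times_per_copy)
    counts = np.zeros(m)
    edges = np.linspace(0.0, 1.0, m + 1)
    for times in event_times_per_copy:
        # np.histogram uses half-open bins [a, b); the boundary convention is
        # immaterial here since continuous event times hit a boundary with probability zero
        hist, _ = np.histogram(times, bins=edges)
        counts += hist
    return counts / n

# toy usage with n = 2 copies and m = 4 intervals
copies = [np.array([0.12, 0.45, 0.47]), np.array([0.80])]
print(averaged_bin_counts(copies, m=4))   # -> [0.5 1.  0.  0.5]
```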

10. A procedure based on total-variation penalization
   We introduce the least-squares functional
   $$R_n(\lambda) = \int_0^1 \lambda(t)^2\, dt - \frac{2}{n} \sum_{i=1}^n \int_0^1 \lambda(t)\, dN_i(t),$$
   [Reynaud-Bouret (2003, 2006), Gaïffas and Guilloux (2012)].
   Fix $m = m_n \ge 1$, an integer that shall go to infinity as $n \to \infty$. We approximate $\lambda_0$ in the set of nonnegative piecewise constant functions on $[0,1]$ given by
   $$\Lambda_m = \Big\{ \lambda_\beta = \sum_{j=1}^m \beta_{j,m}\, \lambda_{j,m} \;:\; \beta = [\beta_{j,m}]_{1 \le j \le m} \in \mathbb{R}_+^m \Big\},$$
   where $\lambda_{j,m} = \sqrt{m}\, \mathbf{1}_{I_{j,m}}$ and $I_{j,m} = \big(\tfrac{j-1}{m}, \tfrac{j}{m}\big]$.
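Because the $\lambda_{j,m}$ have disjoint supports of length $1/m$ and height $\sqrt{m}$, on $\Lambda_m$ the functional reduces to $R_n(\lambda_\beta) = \|\beta\|^2 - 2\langle\beta, \mathbf{N}\rangle$ with $\mathbf{N}_j = \sqrt{m}\,\bar N_n(I_{j,m})$, which matches (up to a constant and an overall factor) the least-squares form $\tfrac{1}{2}\|\mathbf{N}-\beta\|^2$ used on slide 20. A minimal sketch of that computation, reusing the hypothetical bin counts from the previous sketch:

```python
import numpy as np

def Rn(beta, nbar_bins):
    """R_n(lambda_beta) for lambda_beta in Lambda_m.

    With lambda_{j,m} = sqrt(m) * 1_{I_{j,m}} on disjoint intervals of length 1/m:
      int_0^1 lambda_beta(t)^2 dt      = ||beta||^2,
      (1/n) sum_i int lambda_beta dN_i = <beta, N>,   N_j = sqrt(m) * Nbar_n(I_{j,m}),
    so R_n(lambda_beta) = ||beta||^2 - 2 * <beta, N>.
    """
    m = len(beta)
    N = np.sqrt(m) * np.asarray(nbar_bins)
    return float(beta @ beta - 2.0 * (beta @ N))

# toy usage with m = 4 and the averaged bin counts from the previous sketch
nbar = np.array([0.5, 1.0, 0.0, 0.5])
print(Rn(np.array([1.0, 2.0, 0.0, 1.0]), nbar))   # -> -6.0
```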

11. A procedure based on total-variation penalization
   The estimator of $\lambda_0$ is defined by
   $$\hat\lambda = \lambda_{\hat\beta} = \sum_{j=1}^m \hat\beta_{j,m}\, \lambda_{j,m},$$
   where $\hat\beta$ is given by
   $$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}_+^m} \Big\{ R_n(\lambda_\beta) + \|\beta\|_{\mathrm{TV},\hat w} \Big\}.$$
   We consider the dominant term
   $$\hat w_j \approx \sqrt{\frac{\log m}{n}\, \bar N_n\Big(\Big[\tfrac{j-1}{m},\, 1\Big]\Big)}.$$
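If the weight formula is as reconstructed above, i.e. $\hat w_j$ proportional to the square root of $(\log m / n)$ times $\bar N_n([(j-1)/m, 1])$ (an assumption about the garbled slide), a sketch of their computation could look like this:

```python
import numpy as np

def tv_weights(event_times_per_copy, m):
    """Data-driven TV weights (one per difference beta_j - beta_{j-1}, j = 2..m):
    w_j ~ sqrt( (log m / n) * Nbar_n([(j-1)/m, 1]) )."""
    n = len(event_times_per_copy)
    # Nbar_n([(j-1)/m, 1]) = average number of events at or after (j-1)/m
    tail_counts = np.array([
        np.mean([np.sum(times >= (j - 1) / m) for times in event_times_per_copy])
        for j in range(2, m + 1)
    ])
    return np.sqrt(np.log(m) / n * tail_counts)

# toy usage with the two hypothetical copies from the earlier sketch
copies = [np.array([0.12, 0.45, 0.47]), np.array([0.80])]
print(tv_weights(copies, m=4))
```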

12. Oracle inequality with fast rate
   The linear space $\Lambda_m$ is endowed with the norm
   $$\|\lambda\| = \Big( \int_0^1 \lambda^2(t)\, dt \Big)^{1/2}.$$
   Let $\hat S$ be the support of the discrete gradient of $\hat\beta$,
   $$\hat S = \big\{ j : \hat\beta_{j,m} \ne \hat\beta_{j-1,m}, \text{ for } j = 2, \ldots, m \big\}.$$
   Let $\hat L$ be the estimated number of change-points, defined by $\hat L = |\hat S|$.

13. Oracle inequality with fast rate
   The estimator $\hat\lambda$ satisfies the following:
   Theorem 1. Fix $x > 0$ and let the data-driven weights $\hat w$ be defined as above. Assume that $\hat L$ satisfies $\hat L \le L_{\max}$. Then, we have
   $$\|\hat\lambda - \lambda_0\|^2 \le \inf_{\beta \in \mathbb{R}_+^m} \Big\{ \|\lambda_\beta - \lambda_0\|^2 + 6\big(L_{\max} + 2(L_0 - 1)\big) \max_{1 \le j \le m} \hat w_j^2 \Big\} + C_1 \|\lambda_0\|_\infty\, \frac{x + L_{\max}(1 + \log m)}{n} + C_2\, m \Big( \frac{x + L_{\max}(1 + \log m)}{n} \Big)^2,$$
   with probability larger than $1 - L_{\max}\, e^{-x}$.

14. Oracle inequality with fast rate
   Let $\Delta_{\beta,\max} = \max_{1 \le \ell, \ell' \le L_0} |\beta_{0,\ell} - \beta_{0,\ell'}|$ be the maximal jump size of $\lambda_0$.
   Corollary. We have
   $$\inf_{\beta \in \mathbb{R}_+^m} \|\lambda_\beta - \lambda_0\|^2 \le \frac{2(L_0 - 1)\, \Delta_{\beta,\max}^2}{m}.$$
   Our procedure has a fast rate of convergence of order $(L_{\max} \vee L_0)\, \frac{m \log m}{n}$.
   An optimal tradeoff between approximation and complexity is given by the following choice:
   If $L_{\max} = O(m)$, then $m \approx n^{1/3}$.
   If $L_{\max} = O(1)$, then $m \approx n^{1/2}$.

15. Consistency of change-points detection
   There is an unavoidable non-parametric bias of approximation.
   The approximate change-points sequence $(j_\ell / m)_{0 \le \ell \le L_0}$ is defined through the right-hand side boundary of the unique interval $I_{j_\ell, m}$ that contains the true change-point $\tau_{0,\ell}$:
   $$\tau_{0,\ell} \in \Big( \tfrac{j_\ell - 1}{m}, \tfrac{j_\ell}{m} \Big], \quad \text{for } \ell = 1, \ldots, L_0 - 1,$$
   where $j_0 = 0$ and $j_{L_0} = m$ by convention.
   [Figure: time axis showing the true change-points $\tau_{0,\ell-1}, \tau_{0,\ell}, \tau_{0,\ell+1}$, the estimate $\hat\tau_\ell$, and the intervals $I_{j_\ell - 1, m}, I_{j_\ell, m}, I_{j_\ell + 1, m}$.]
   Let $\hat S = \{\hat j_1, \ldots, \hat j_{\hat L}\}$ with $\hat j_1 < \cdots < \hat j_{\hat L}$, and set $\hat j_0 = 0$ and $\hat j_{\hat L + 1} = m$. We simply define $\hat\tau_\ell = \hat j_\ell / m$ for $\ell = 0, \ldots, \hat L + 1$.
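A small sketch of how $\hat S$, $\hat L$ and the $\hat\tau_\ell$ can be read off a fitted $\hat\beta$ (the numerical tolerance `tol` below is an implementation detail not discussed on the slides):

```python
import numpy as np

def change_points_from_beta(beta_hat, tol=1e-10):
    """Return (S_hat, L_hat, tau_hat) from a piecewise constant fit beta_hat.

    S_hat: indices j in {2, ..., m} with beta_hat_j != beta_hat_{j-1} (1-based),
    L_hat: |S_hat|, the estimated number of change-points,
    tau_hat: estimated change-point locations j / m for j in S_hat.
    """
    m = len(beta_hat)
    diffs = np.abs(np.diff(beta_hat))
    S_hat = np.where(diffs > tol)[0] + 2     # convert to 1-based indices j in {2, ..., m}
    L_hat = len(S_hat)
    tau_hat = S_hat / m
    return S_hat, L_hat, tau_hat

# toy usage: m = 8 with one jump between the 4th and 5th coefficients
beta_hat = np.array([2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0])
print(change_points_from_beta(beta_hat))     # -> (array([5]), 1, array([0.625]))
```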

16. Consistency of change-points detection
   We cannot recover the exact position of two change-points if they lie in the same interval $I_{j,m}$.
   Minimal distance between true change-points: assume that there is a positive constant $c \ge 8$ such that
   $$\min_{1 \le \ell \le L_0} |\tau_{0,\ell} - \tau_{0,\ell-1}| > \frac{c}{m}.$$
   → The change-points of $\lambda_0$ are sufficiently far apart.
   → There cannot be more than one change-point in the "high-resolution" intervals $I_{j,m}$.
   The procedure will then be able to recover the (unique) intervals $I_{j_\ell, m}$, for $\ell = 0, \ldots, L_0$, to which the change-points belong.

17. Consistency of change-points detection
   $\Delta_{j,\min} = \min_{1 \le \ell \le L_0 - 1} |j_{\ell+1} - j_\ell|$: the minimal distance between two consecutive terms of the approximate change-points sequence of $\lambda_0$.
   $\Delta_{\beta,\min} = \min_{1 \le q \le m-1} |\beta_{0,q+1,m} - \beta_{0,q,m}|$: the smallest jump size of the projection $\lambda_{0,m}$ of $\lambda_0$ onto $\Lambda_m$.
   $(\varepsilon_n)_{n \ge 1}$: a non-increasing, positive sequence that goes to zero as $n \to \infty$.
   Technical assumptions: we assume that $\Delta_{j,\min}$, $\Delta_{\beta,\min}$ and $(\varepsilon_n)_{n \ge 1}$ satisfy
   $$\frac{\sqrt{nm}\, \varepsilon_n\, \Delta_{\beta,\min}}{\sqrt{\log m}} \to \infty \quad \text{and} \quad \frac{\sqrt{n}\, \Delta_{j,\min}\, \Delta_{\beta,\min}}{\sqrt{m \log m}} \to \infty, \quad \text{as } n \to \infty.$$

18. Consistency of change-points detection
   Theorem 2. Under the given assumptions, and if $\hat L = L_0 - 1$, then the change-points estimators $\{\hat\tau_1, \ldots, \hat\tau_{\hat L}\}$ satisfy
   $$\mathbb{P}\Big[ \max_{1 \le \ell \le L_0 - 1} |\hat\tau_\ell - \tau_{0,\ell}| \le \varepsilon_n \Big] \to 1, \quad \text{as } n \to \infty.$$
   If $m \approx n^{1/3}$, Theorem 2 holds with $\varepsilon_n \approx n^{-1/3}$, $\Delta_{\beta,\min} = n^{-1/6}$ and $\Delta_{j,\min} \ge 6$.
   If $m \approx n^{1/2}$, Theorem 2 holds with $\varepsilon_n \approx n^{-1/2}$, $\Delta_{\beta,\min} = n^{-1/6}$ and $\Delta_{j,\min} \ge 6$.

19. Proximal operator + algorithm
   We are interested in computing a solution
   $$x^\star = \operatorname*{argmin}_{x \in \mathbb{R}^p} \{ g(x) + h(x) \},$$
   where $g$ is smooth and $h$ is simple (prox-calculable).
   The proximal operator $\operatorname{prox}_h$ of a proper, lower semi-continuous, convex function $h : \mathbb{R}^m \to (-\infty, \infty]$ is defined as
   $$\operatorname{prox}_h(v) = \operatorname*{argmin}_{x \in \mathbb{R}^m} \Big\{ \tfrac{1}{2} \|v - x\|_2^2 + h(x) \Big\}, \quad \text{for all } v \in \mathbb{R}^m.$$
   The proximal gradient descent (PGD) algorithm is based on the iteration
   $$x^{(k+1)} = \operatorname{prox}_{\varepsilon_k h}\big( x^{(k)} - \varepsilon_k \nabla g(x^{(k)}) \big).$$
   [Daubechies et al. (2004) (ISTA), Beck and Teboulle (2009) (FISTA)]
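A minimal sketch of the PGD/ISTA iteration, with a hypothetical quadratic $g$ and the ℓ1 prox (soft-thresholding) standing in for a generic simple $h$; the weighted-TV prox of the next slide would be plugged in the same way:

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, step, n_iter=200):
    """Generic PGD/ISTA loop: x_{k+1} = prox_{step*h}(x_k - step * grad_g(x_k))."""
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_h(x - step * grad_g(x), step)
    return x

# Hypothetical example: g(x) = 0.5 * ||A x - b||^2 and h(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
lam = 0.5

grad_g = lambda x: A.T @ (A @ x - b)
# prox of step * lam * ||.||_1 is entrywise soft-thresholding
prox_h = lambda v, step: np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of grad_g
x_star = proximal_gradient(grad_g, prox_h, np.zeros(5), step)
print(x_star)
```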

20. Proximal operator of the weighted TV penalization
   We have
   $$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}_+^m} \Big\{ \tfrac{1}{2} \| \mathbf{N} - \beta \|_2^2 + \|\beta\|_{\mathrm{TV},\hat w} \Big\},$$
   where $\mathbf{N} = [\mathbf{N}_j]_{1 \le j \le m} \in \mathbb{R}_+^m$ is given by
   $$\mathbf{N} = \big( \sqrt{m}\, \bar N_n(I_{1,m}), \ldots, \sqrt{m}\, \bar N_n(I_{m,m}) \big).$$
   Then $\hat\beta = \operatorname{prox}_{\|\cdot\|_{\mathrm{TV},\hat w}}(\mathbf{N})$.
   Modification of Condat's algorithm [Condat (2013)].
   If we have a feasible dual variable $\hat u$, we can compute the primal solution $\hat\beta$ by Fenchel duality.
   The Karush-Kuhn-Tucker (KKT) optimality conditions characterize the unique solutions $\hat\beta$ and $\hat u$.
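For illustration only: the sketch below computes the weighted-TV prox by projected gradient on its Fenchel dual (in the spirit of Beck and Teboulle's dual approach for TV), not by the modified Condat algorithm used in the thesis; the nonnegativity constraint on β is also omitted for simplicity.

```python
import numpy as np

def prox_weighted_tv(y, w, n_iter=2000):
    """argmin_x 0.5*||y - x||^2 + sum_j w_j * |x_j - x_{j-1}| (w: one entry per difference).

    Solved by projected gradient on the Fenchel dual
        min_{|u_j| <= w_j} 0.5 * ||y - D^T u||^2,
    where D is the (m-1) x m forward-difference operator; the primal solution
    is then recovered as x = y - D^T u.
    """
    y, w = np.asarray(y, float), np.asarray(w, float)
    u = np.zeros(len(y) - 1)

    def Dt(u):  # D^T u
        return np.concatenate(([-u[0]], u[:-1] - u[1:], [u[-1]]))

    step = 0.25                     # 1 / ||D D^T||_2, since ||D D^T||_2 < 4
    for _ in range(n_iter):
        x = y - Dt(u)                                # current primal point
        u = np.clip(u + step * np.diff(x), -w, w)    # gradient step + box projection
    return y - Dt(u)

# toy usage: N_j = sqrt(m) * Nbar_n(I_{j,m}) and data-driven weights as on slide 11
N = np.array([1.0, 1.1, 0.9, 3.0, 3.2, 2.9])
w = np.full(5, 0.5)
print(prox_weighted_tv(N, w))       # roughly two constant segments
```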
