  1. Algorithmes Gradient-Proximaux pour l'inférence statistique
  Gersende Fort, Institut de Mathématiques de Toulouse, CNRS, Toulouse, France

  2. Based on joint works with Yves Atchadé (Univ. Michigan, USA), Jean-François Aujol (IMB, Bordeaux, France), Eric Moulines (Ecole Polytechnique, France), Adeline Samson and Edouard Ollier (Univ. Grenoble Alpes, France), Charles Dossal (IMT), Laurent Risser (IMT).
  → On Perturbed Proximal-Gradient algorithms (JMLR, 2017)
  → Stochastic Proximal Gradient Algorithms for Penalized Mixed Models (Stat & Computing, 2018)
  → Acceleration for perturbed Proximal Gradient algorithms (work in progress)
  → Algorithmes Gradient Proximaux Stochastiques (GRETSI, 2017)

  3. Outline
  Motivations
    Pharmacokinetic
    General case: latent variable models
    Votes in the US Congress
    General case: discrete graphical models
    Conclusion, part I
  Penalized ML through perturbed Stochastic-Gradient algorithms
    Asymptotic behavior of the algorithm
    Numerical illustration

  4. Motivation 1: Pharmacokinetic (1/2)
  N patients. At time 0: dose D of a drug. For patient i, evolution of the concentration at times t_ij, 1 ≤ j ≤ J_i: observations {Y_ij, 1 ≤ j ≤ J_i}.
  Model:
    Y_ij = F(t_ij, X_i) + ε_ij,   ε_ij i.i.d. ~ N(0, σ²)
    X_i = Z_i β + d_i ∈ R^L,      d_i i.i.d. ~ N_L(0, Ω) and independent of ε
    Z_i known matrix s.t. each row of X_i has an intercept (fixed effect) and covariates.

  5. Motivation 1: Pharmacokinetic (1/2), continued
  Same model as on slide 4. Example of model F: one-compartment, with digestive absorption:
    F(t, [ln Cl, ln V, ln A]) = C(Cl, V, A, D) ( exp(-(Cl/V) t) - exp(-A t) )
  For each patient i:
    ln Cl_i = β_{0,Cl} + β_{1,Cl} Z_{i,1,Cl} + ... + β_{K,Cl} Z_{i,K,Cl} + d_{Cl,i}
    ln V_i: idem, with covariates Z_{i,k,V} and coefficients β_{k,V}, plus d_{V,i}
    ln A_i: idem, with covariates Z_{i,k,A} and coefficients β_{k,A}, plus d_{A,i}
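  As a concrete illustration, here is a minimal Python sketch of this one-compartment model with first-order (digestive) absorption and of one simulated patient. The prefactor C(Cl, V, A, D) is left unspecified on the slide; the sketch assumes the standard constant D·A / (V·(A - Cl/V)), and the dose, covariate, time grid and variances are illustrative placeholders.

```python
import numpy as np

def F(t, x, D=4.0):
    """Concentration at time t for individual parameters x = [ln Cl, ln V, ln A]."""
    Cl, V, A = np.exp(x)
    ke = Cl / V                              # elimination rate
    C = D * A / (V * (A - ke))               # assumed prefactor (the slide leaves C(Cl, V, A, D) unspecified)
    return C * (np.exp(-ke * t) - np.exp(-A * t))

rng = np.random.default_rng(0)
# beta stacks, for each of ln Cl, ln V, ln A, an intercept and one covariate coefficient
beta = np.array([np.log(2.0), 0.1,
                 np.log(30.0), 0.0,
                 np.log(1.5), 0.0])
z_i = 0.7                                            # covariate of patient i (e.g. standardized weight)
Z_i = np.kron(np.eye(3), np.array([1.0, z_i]))       # 3 x 6 design: intercept + covariate per row of X_i
d_i = rng.multivariate_normal(np.zeros(3), 0.01 * np.eye(3))   # random effect d_i ~ N_3(0, Omega)
X_i = Z_i @ beta + d_i                               # X_i = Z_i beta + d_i

t_ij = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])     # observation times of patient i
Y_ij = F(t_ij, X_i) + rng.normal(0.0, 0.1, size=t_ij.size)   # Y_ij = F(t_ij, X_i) + eps_ij
print(Y_ij)
```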

  6. Motivation 1: Pharmacokinetic (1/2), continued
  Same model as on slide 4. Statistical analysis:
    estimation of θ = (β, σ², Ω), under sparsity constraints on β
    selection of the covariates based on the estimate β̂
  → Penalized Maximum Likelihood

  7. Motivation 1: Pharmacokinetic (2/2)
  Model:
    Y_ij = f(t_ij, X_i) + ε_ij,   ε_ij i.i.d. ~ N(0, σ²)
    X_i = Z_i β + d_i ∈ R^L,      d_i i.i.d. ~ N_L(0, Ω) and independent of ε
    Z_i known matrix s.t. each row of X_i has an intercept (fixed effect) and covariates.
  Likelihoods:
    Complete likelihood: the distribution of {Y_ij, X_i; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} has an explicit expression:
      ∏_{i=1}^N ∏_{j=1}^{J_i} N(f(t_ij, X_i), σ²)[Y_ij]  ×  ∏_{i=1}^N N_L(Z_i β, Ω)[X_i]
    Likelihood: the distribution of {Y_ij; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} is not explicit.
  ML: here, the likelihood is not concave.
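  A small Python sketch of this complete log-likelihood, only meant to show that every factor above can be evaluated explicitly once the latent X_i are given. The mean function f, the designs Z_i and the values of (β, σ², Ω) below are illustrative placeholders, not the slide's pharmacokinetic model.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def f(t, x):
    # placeholder mean function (any explicit f works here); x = individual parameters
    return np.exp(x[0]) * np.exp(-np.exp(x[1]) * t)

def complete_loglik(theta, Y, t, X, Z):
    """Log of the complete likelihood above; theta = (beta, sigma2, Omega),
    Y, t: lists of per-patient arrays, X: (N, L) latent parameters, Z: list of design matrices."""
    beta, sigma2, Omega = theta
    ll = 0.0
    for Y_i, t_i, X_i, Z_i in zip(Y, t, X, Z):
        ll += norm.logpdf(Y_i, loc=f(t_i, X_i), scale=np.sqrt(sigma2)).sum()   # sum_j log N(f(t_ij, X_i), sigma^2)[Y_ij]
        ll += multivariate_normal.logpdf(X_i, mean=Z_i @ beta, cov=Omega)      # log N_L(Z_i beta, Omega)[X_i]
    return ll

rng = np.random.default_rng(1)
L = 2
Z = [np.eye(L) for _ in range(3)]                          # trivial designs: X_i = beta + d_i
beta, sigma2, Omega = np.zeros(L), 0.04, 0.1 * np.eye(L)
X = np.array([rng.multivariate_normal(Z_i @ beta, Omega) for Z_i in Z])
t = [np.array([0.5, 1.0, 2.0])] * 3
Y = [f(t_i, X_i) + rng.normal(0, np.sqrt(sigma2), t_i.size) for t_i, X_i in zip(t, X)]
print(complete_loglik((beta, sigma2, Omega), Y, t, X, Z))
```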

  8. General case: latent variable models
  The log-likelihood of the observations Y is of the form (dependence upon Y is omitted)
    θ ↦ log L(θ),   L(θ) = ∫_X p_θ(x) μ(dx),
  where μ is a σ-finite positive measure on a set X, and x collects the missing/latent data.
  Previous example: x ← (X_1, ..., X_N), μ ← Lebesgue measure on R^{LN}.
  In these models, the complete likelihood p_θ(x) can be evaluated explicitly, but the likelihood has no closed-form expression.
  The exact integral could be replaced by a Monte Carlo approximation, but this is known to be inefficient (see the sketch below). Numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches).
  → What about the gradient of the (log-)likelihood?
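  For illustration, a minimal sketch of the plain Monte Carlo alternative mentioned above: sample the latent data from their prior and average the conditional likelihood of the observations. The toy Gaussian model (and its closed-form marginal, used as a check) is an assumption for the example, not the slide's model; in realistic latent-variable models this estimator has a very large variance, which is the inefficiency referred to above.

```python
import numpy as np

def mc_marginal_likelihood(cond_lik, sample_prior, n_draws=10_000, seed=0):
    """Plain Monte Carlo estimate of L(theta) = ∫ cond_lik(x) prior(dx)."""
    rng = np.random.default_rng(seed)
    draws = sample_prior(rng, n_draws)
    return np.mean([cond_lik(x) for x in draws])

# Toy example: scalar latent X ~ N(0, 1), one observation Y | X ~ N(X, 0.5^2), observed Y = 1.2.
y, s = 1.2, 0.5
est = mc_marginal_likelihood(
    cond_lik=lambda x: np.exp(-0.5 * (y - x) ** 2 / s**2) / np.sqrt(2 * np.pi * s**2),
    sample_prior=lambda rng, n: rng.standard_normal(n),
)
exact = np.exp(-0.5 * y**2 / (1 + s**2)) / np.sqrt(2 * np.pi * (1 + s**2))   # marginal is N(0, 1 + s^2)
print(est, exact)
```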

  9. Latent variable model: Gradient of the likelihood
    log L(θ) = log ∫ p_θ(x) μ(dx)
  Under regularity conditions, θ ↦ log L(θ) is C¹ and
    ∇ log L(θ) = ∫ ∂_θ p_θ(x) μ(dx) / ∫ p_θ(z) μ(dz)
               = ∫ ∂_θ log p_θ(x) · [ p_θ(x) / ∫ p_θ(z) μ(dz) ] μ(dx),
  where the bracketed ratio is the a posteriori distribution of the latent data.

  10. Latent variable model: Gradient of the likelihood, continued
  With the computation of the previous slide, the gradient of the log-likelihood
    ∇_θ { log L(θ) } = ∫ H_θ(x) π_θ(dx)
  is an intractable expectation w.r.t. π_θ, the conditional distribution of the latent variable given the observations Y (known up to a normalizing constant).
  For all (x, θ), H_θ(x) can be evaluated.
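  Since π_θ is only known up to a constant, a natural way to approximate this expectation is to run a Markov chain targeting π_θ and average H_θ along the chain. Below is a minimal sketch with a random-walk Metropolis-Hastings sampler; the functions log_post_unnorm and H, the step size, the chain length and the Gaussian toy posterior (chosen so the exact expectation is known) are illustrative assumptions, not the specific samplers of the referenced papers.

```python
import numpy as np

def mcmc_gradient_estimate(log_post_unnorm, H, x0, n_iter=5_000, step=0.5, seed=0):
    """Estimate E_{pi_theta}[H_theta(X)] with a random-walk MH chain targeting
    pi_theta(x) proportional to exp(log_post_unnorm(x))."""
    rng = np.random.default_rng(seed)
    x, logp = np.asarray(x0, dtype=float), log_post_unnorm(x0)
    acc = np.zeros_like(H(x0))
    for _ in range(n_iter):
        prop = x + step * rng.standard_normal(x.shape)
        logp_prop = log_post_unnorm(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # MH acceptance step
            x, logp = prop, logp_prop
        acc += H(x)                                    # running sum of H_theta along the chain
    return acc / n_iter

# Toy check: posterior N(1, 0.5^2) known up to a constant and H(x) = x, so the exact expectation is 1.
est = mcmc_gradient_estimate(
    log_post_unnorm=lambda x: -0.5 * np.sum((x - 1.0) ** 2) / 0.25,
    H=lambda x: np.asarray(x),
    x0=np.zeros(1),
)
print(est)   # close to [1.]
```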

  11. Motivation 2: relationships in a graph (1/2)
  p nodes in a graph (e.g. p senators from the US Congress); each node takes values in {-1, 1} (e.g. each node codes no/yes in a vote); N snapshots of the graph (e.g. N votes).
  Model: each observation Y^(i) ∈ {-1, 1}^p; i.i.d. observations with distribution
    π_θ(y) ∝ exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij y_i y_j )
  Statistical analysis:
    estimation of θ, under penalty (sparse graph; regularization needed since N ≪ p²/2)
    classification of the nodes
  → Penalized Maximum Likelihood

  12. Motivation 2: relationships in a graph (2/2)
  Model: each observation Y^(n) ∈ {-1, 1}^p; i.i.d. observations with distribution
    π_θ(y) = (1/Z_θ) exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij y_i y_j )
  Log-likelihood: with Y := (Y^(1), ..., Y^(N)),
    ℓ(θ) = Σ_{i=1}^p θ_i ( Σ_{n=1}^N Y_i^(n) ) + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij ( Σ_{n=1}^N Y_i^(n) Y_j^(n) ) - N log Z_θ
         = ⟨θ, S(Y)⟩ - N log Z_θ
         = ⟨Ψ(θ), S(Y)⟩ + Φ(θ)
  Likelihood: not explicit, since
    Z_θ := Σ_{y ∈ {-1,1}^p} exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij y_i y_j )
  ML: here, the likelihood is concave.
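  To make the combinatorial obstruction concrete, here is a sketch that evaluates Z_θ and ℓ(θ) by enumerating all 2^p configurations; this is only feasible for very small p, which is why the likelihood is "not explicit" in realistic settings. The parameter values and data below are simulated placeholders.

```python
import itertools
import numpy as np

def unnorm_logprob(y, theta_main, theta_pair):
    """sum_i theta_i y_i + sum_{i<j} theta_ij y_i y_j; theta_pair is strictly upper triangular."""
    y = np.asarray(y, dtype=float)
    return float(theta_main @ y + y @ theta_pair @ y)

def log_Z(theta_main, theta_pair):
    p = theta_main.size
    vals = np.array([unnorm_logprob(y, theta_main, theta_pair)
                     for y in itertools.product((-1, 1), repeat=p)])
    m = vals.max()
    return m + np.log(np.exp(vals - m).sum())          # log-sum-exp over the 2^p configurations

def loglik(theta_main, theta_pair, Y):
    """ell(theta) = <theta, S(Y)> - N log Z_theta for observations Y of shape (N, p)."""
    stat = sum(unnorm_logprob(y, theta_main, theta_pair) for y in Y)   # <theta, S(Y)>
    return stat - Y.shape[0] * log_Z(theta_main, theta_pair)

rng = np.random.default_rng(0)
p, N = 8, 50                                           # brute force: only 2^8 = 256 terms here
theta_main = 0.2 * rng.standard_normal(p)
theta_pair = np.triu(0.1 * rng.standard_normal((p, p)), k=1)
Y = rng.choice([-1, 1], size=(N, p))
print(loglik(theta_main, theta_pair, Y))
```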

  13. General case: discrete graphical models
  N independent observations of an undirected graph with p nodes; each node takes values in a finite alphabet X. N i.i.d. observations Y^(i) in X^p with distribution
    y = (y_1, ..., y_p) ↦ π_θ(y) := (1/Z_θ) exp( Σ_{k=1}^p θ_kk B(y_k, y_k) + Σ_{1 ≤ j < k ≤ p} θ_kj B(y_k, y_j) )
                                  = (1/Z_θ) exp( ⟨θ, B̄(y)⟩ )
  where B is a symmetric function and θ is a symmetric p × p matrix.
  The normalizing constant (partition function) Z_θ cannot be computed: it is a sum over |X|^p terms.
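  A short sketch of the statistic B̄(y) and of the inner product ⟨θ, B̄(y)⟩ for a finite alphabet. The slide only requires B to be symmetric; the Potts-type choice B(x, y) = 1{x = y} and the numerical values below are illustrative assumptions.

```python
import numpy as np

def B_bar(y, B):
    """p x p matrix with entries B(y_k, y_j), as in the definition of pi_theta."""
    p = len(y)
    return np.array([[B(y[k], y[j]) for j in range(p)] for k in range(p)])

def log_unnorm(theta, y, B):
    """sum_k theta_kk B(y_k, y_k) + sum_{j<k} theta_kj B(y_k, y_j), i.e. <theta, B_bar(y)>
    restricted to the diagonal and one triangle (theta and B_bar are both symmetric)."""
    return float(np.sum(np.triu(theta * B_bar(y, B))))

B = lambda a, b: float(a == b)                        # symmetric B, Potts-type (assumed)
rng = np.random.default_rng(0)
p = 4
theta = 0.3 * rng.standard_normal((p, p)); theta = (theta + theta.T) / 2
y = rng.choice([0, 1, 2], size=p)                     # finite alphabet X = {0, 1, 2}
print(log_unnorm(theta, y, B))
```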

  14. Markov random field: Likelihood
  Likelihood of the form (the scalar product between matrices is the Frobenius inner product)
    (1/N) log L(θ) = ⟨ θ, (1/N) Σ_{i=1}^N B̄(Y_i) ⟩ - log Z_θ
  The likelihood is intractable.

  15. Markov random field: Gradient of the likelihood
  Gradient of the form
    ∇_θ { (1/N) log L(θ) } = (1/N) Σ_{i=1}^N B̄(Y_i) - ∫_{X^p} B̄(y) π_θ(y) μ(dy)
  with
    π_θ(y) := (1/Z_θ) exp( ⟨θ, B̄(y)⟩ ).
  The gradient of the (log-)likelihood is intractable.
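  The second term is an expectation under π_θ and is typically estimated by MCMC. Below is a minimal sketch specialized to the {-1, 1} model of Motivation 2, where B̄(y) collects the y_i and the products y_i y_j: a single-site Gibbs sampler targets π_θ and the moments are averaged along the chain (no burn-in, for brevity). The parameter values and chain length are illustrative assumptions.

```python
import numpy as np

def gibbs_moments(theta_main, theta_pair, n_sweeps=2_000, seed=0):
    """Estimate E[y] and E[y y^T] under pi_theta by single-site Gibbs sampling;
    theta_pair is strictly upper triangular, as on slide 12."""
    rng = np.random.default_rng(seed)
    p = theta_main.size
    J = theta_pair + theta_pair.T                 # symmetric interaction matrix, zero diagonal
    y = rng.choice([-1.0, 1.0], size=p)
    m1, m2 = np.zeros(p), np.zeros((p, p))
    for _ in range(n_sweeps):
        for k in range(p):
            field = theta_main[k] + J[k] @ y      # local field; J_kk = 0 so there is no self-term
            # conditional law: P(y_k = 1 | y_-k) = 1 / (1 + exp(-2 * field))
            y[k] = 1.0 if rng.uniform() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1.0
        m1 += y
        m2 += np.outer(y, y)
    return m1 / n_sweeps, m2 / n_sweeps

rng = np.random.default_rng(1)
p = 8
theta_main = 0.2 * rng.standard_normal(p)
theta_pair = np.triu(0.1 * rng.standard_normal((p, p)), k=1)
E_y, E_yy = gibbs_moments(theta_main, theta_pair)
# the (i, j) pairwise component of the gradient is then (1/N) sum_n Y_i^(n) Y_j^(n) - E_yy[i, j]
print(E_y)
```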
