Inférence pénalisée dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés
(Penalized inference in models with intractable likelihood via perturbed proximal-gradient algorithms)

Gersende Fort
Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier, Toulouse, France
Based on joint works with:
Yves Atchadé (Univ. Michigan, USA) and Eric Moulines (Ecole Polytechnique, France)
  ↪ On Perturbed Proximal-Gradient algorithms (JMLR, 2016)
Edouard Ollier (ENS Lyon, France) and Adeline Samson (Univ. Grenoble Alpes, France)
  ↪ Penalized inference in Mixed Models by Proximal Gradient methods (work in progress)
Jean-François Aujol (IMB, Bordeaux, France) and Charles Dossal (IMB, Bordeaux, France)
  ↪ Acceleration for perturbed Proximal Gradient algorithms (work in progress)
Motivation: Pharmacokinetics (1/2)

N patients. For patient i, the observations $\{Y_{ij},\ 1 \le j \le J\}$ record the evolution of the concentration at times $t_{ij}$, $1 \le j \le J$; initial dose $D$.

Model:
$Y_{ij} = f(t_{ij}, X_i) + \epsilon_{ij}$, with $\epsilon_{ij}$ i.i.d. $\sim \mathcal{N}(0, \sigma^2)$
$X_i = Z_i \beta + d_i \in \mathbb{R}^L$, with $d_i$ i.i.d. $\sim \mathcal{N}_L(0, \Omega)$ and independent of $\epsilon$
$Z_i$ is a known matrix such that each row of $X_i$ has an intercept (fixed effect) and covariates.

Statistical analysis:
estimation of $(\beta, \sigma^2, \Omega)$, under sparsity constraints on $\beta$
selection of the covariates based on $\hat{\beta}$
↪ Penalized Maximum Likelihood
Motivation: Pharmacokinetics (2/2)

Model (as before):
$Y_{ij} = f(t_{ij}, X_i) + \epsilon_{ij}$, with $\epsilon_{ij}$ i.i.d. $\sim \mathcal{N}(0, \sigma^2)$
$X_i = Z_i \beta + d_i \in \mathbb{R}^L$, with $d_i$ i.i.d. $\sim \mathcal{N}_L(0, \Omega)$ and independent of $\epsilon$
$Z_i$ is a known matrix such that each row of $X_i$ has an intercept (fixed effect) and covariates.

Likelihoods:
The distribution of $\{Y_{ij}, X_i;\ 1 \le i \le N,\ 1 \le j \le J\}$ has an explicit expression.
The distribution of $\{Y_{ij};\ 1 \le i \le N,\ 1 \le j \le J\}$ does not have an explicit expression; it is the marginal of the previous one, obtained by integrating out the random effects $d_i$.
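For concreteness, here is a minimal simulation sketch of this nonlinear mixed-effects model. The particular curve f (a one-compartment profile), the construction of $Z_i$ (intercept plus one scalar covariate per patient) and all numerical values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(t, x, dose=4.0):
    # Hypothetical one-compartment curve; x = (log ka, log V, log Cl) are the
    # individual PK parameters. This specific f is an assumption for illustration.
    ka, V, Cl = np.exp(x)
    ke = Cl / V
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

N, J, L = 20, 8, 3                      # patients, times per patient, dim(X_i)
t = np.linspace(0.5, 12.0, J)           # sampling times, here common to all patients
beta = np.array([0.7, 0.2, 2.0, 0.0, 0.3, 0.0])   # fixed effects (sparse by design)
Omega = 0.05 * np.eye(L)                # covariance of the random effects d_i
sigma2 = 0.01

Y = np.empty((N, J))
for i in range(N):
    cov_i = rng.normal()                                   # one covariate for patient i
    Z_i = np.kron(np.eye(L), np.array([1.0, cov_i]))       # each row: intercept + covariate
    d_i = rng.multivariate_normal(np.zeros(L), Omega)
    X_i = Z_i @ beta + d_i                                  # X_i = Z_i beta + d_i
    Y[i] = f(t, X_i) + np.sqrt(sigma2) * rng.normal(size=J)
```

The goal stated on the slide is then to estimate $(\beta, \sigma^2, \Omega)$ from the observations $Y$ alone, with a sparsity-inducing penalty on $\beta$, even though integrating out the $d_i$ leaves no closed-form likelihood.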
Outline

Penalized Maximum Likelihood inference in models with intractable likelihood
  Example 1: Latent variable models
  Example 2: Discrete graphical model (Markov random field)
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
Convergence analysis
Penalized Maximum Likelihood inference with intractable likelihood

N observations: $Y = (Y_1, \cdots, Y_N)$.
A parametric statistical model, $\theta \in \Theta \subseteq \mathbb{R}^d$: $\theta \mapsto L(\theta)$, the likelihood of the observations.
A penalty on the parameter $\theta$: $\theta \mapsto g(\theta)$, e.g. for sparsity constraints on $\theta$; usually $g$ is non-smooth and convex.

Goal: computation of
\[
  \operatorname{argmin}_{\theta \in \Theta} \Big\{ -\frac{1}{N} \log L(\theta) + g(\theta) \Big\}
\]
when the likelihood $L$ has no closed-form expression and cannot be evaluated.
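If the gradient of the smooth part $-N^{-1}\log L$ were available, this composite objective would be the natural target of the proximal gradient algorithm, $\theta_{k+1} = \operatorname{prox}_{\gamma g}\big(\theta_k + \gamma \nabla (N^{-1}\log L)(\theta_k)\big)$. Below is a minimal sketch of one such iteration, assuming the $\ell_1$ penalty $g(\theta) = \lambda \|\theta\|_1$ (an illustrative choice whose proximal map is soft-thresholding).

```python
import numpy as np

def prox_l1(theta, threshold):
    # Proximal operator of g(theta) = lam * ||theta||_1, i.e. componentwise
    # soft-thresholding with threshold = gamma * lam.
    return np.sign(theta) * np.maximum(np.abs(theta) - threshold, 0.0)

def proximal_gradient_step(theta, grad_loglik, gamma, lam):
    # One iteration theta_{k+1} = prox_{gamma g}( theta_k + gamma * grad_loglik ),
    # where grad_loglik is an evaluation (or an approximation) of
    # (1/N) * grad log L at theta_k, and gamma > 0 is the step size.
    return prox_l1(theta + gamma * grad_loglik, gamma * lam)
```

In the setting of this talk the gradient term cannot be computed exactly; the following slides replace it with a Monte Carlo approximation, which is what makes the proximal gradient algorithm "perturbed".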
Example 1: Latent variable model

The log-likelihood of the observations $Y$ is of the form $\theta \mapsto \log L(\theta)$ with
\[
  L(\theta) = \int_{\mathsf{X}} p_\theta(x)\, \mu(dx),
\]
where $\mu$ is a positive $\sigma$-finite measure on a set $\mathsf{X}$; $x$ are the missing/latent data and $(x, Y)$ are the complete data.

In these models, the complete likelihood $p_\theta(x)$ can be evaluated explicitly, but the likelihood has no closed-form expression.

The exact integral could be replaced by a Monte Carlo approximation; this is known to be inefficient. Numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches).
↪ What about the gradient of the (log-)likelihood?
Gradient of the likelihood in a latent variable model

\[
  \log L(\theta) = \log \int p_\theta(x)\, \mu(dx)
\]

Under regularity conditions, $\theta \mapsto \log L(\theta)$ is $C^1$ and
\[
  \nabla \log L(\theta)
  = \frac{\int \partial_\theta p_\theta(x)\, \mu(dx)}{\int p_\theta(z)\, \mu(dz)}
  = \int \partial_\theta \log p_\theta(x)\; \underbrace{\frac{p_\theta(x)\, \mu(dx)}{\int p_\theta(z)\, \mu(dz)}}_{\text{the a posteriori distribution}} .
\]

Hence the gradient of the normalized log-likelihood
\[
  \nabla_\theta \Big( \frac{1}{N} \log L(\theta) \Big) = \int H_\theta(x)\, \pi_\theta(dx)
\]
is an intractable expectation w.r.t. $\pi_\theta$, the conditional distribution of the latent variable given the observations $Y$. For all $(x, \theta)$, $H_\theta(x)$ can be evaluated.
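As a sanity check (a toy example not taken from the slides), consider a single observation $Y$ with latent variable $x \sim \mathcal{N}(\theta, 1)$ and $Y \mid x \sim \mathcal{N}(x, \sigma^2)$; the identity above can then be verified in closed form. Writing $\varphi_v$ for the $\mathcal{N}(0, v)$ density,
\[
  \partial_\theta \log p_\theta(x)
  = \partial_\theta \big[ \log \varphi_{\sigma^2}(Y - x) + \log \varphi_{1}(x - \theta) \big]
  = x - \theta ,
\]
so that
\[
  \nabla \log L(\theta) = \mathbb{E}_{\pi_\theta}[\,x - \theta\,]
  = \frac{Y/\sigma^2 + \theta}{1/\sigma^2 + 1} - \theta
  = \frac{Y - \theta}{1 + \sigma^2},
\]
which agrees with differentiating $\log L(\theta) = \log \mathcal{N}(Y;\ \theta,\ 1 + \sigma^2)$ directly.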
Approximation of the gradient

\[
  \nabla_\theta \Big( \frac{1}{N} \log L(\theta) \Big) = \int_{\mathsf{X}} H_\theta(x)\, \pi_\theta(dx)
\]

1. Quadrature techniques: poor behavior w.r.t. the dimension of $\mathsf{X}$.
2. Use i.i.d. samples from $\pi_\theta$ to define a Monte Carlo approximation: not possible, in general.
3. Use $m$ samples from a non-stationary Markov chain $\{X_{j,\theta},\ j \ge 0\}$ with unique stationary distribution $\pi_\theta$, and define a Monte Carlo approximation. MCMC samplers provide such a chain.

Stochastic approximation of the gradient: a biased approximation,
\[
  \mathbb{E}[h(X_{j,\theta})] \neq \int h(x)\, \pi_\theta(dx).
\]
If the Markov chain is ergodic "enough", the bias vanishes when $j \to \infty$.
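A minimal sketch of option 3, assuming a random-walk Metropolis sampler and user-supplied functions `log_post` (log of an unnormalized density of $\pi_\theta$, computable since $p_\theta(x)$ is explicit) and `H` (evaluating $H_\theta(x)$); the function names, the Gaussian proposal and the step size are all illustrative assumptions.

```python
import numpy as np

def mcmc_gradient_estimate(theta, log_post, H, x0, m, step=0.5, rng=None):
    # Biased Monte Carlo estimate of  int H_theta(x) pi_theta(dx)  using m states of a
    # random-walk Metropolis chain whose stationary distribution is pi_theta.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    total = np.zeros_like(np.asarray(H(x, theta), dtype=float))
    for _ in range(m):
        prop = x + step * rng.normal(size=x.shape)
        if np.log(rng.uniform()) < log_post(prop, theta) - log_post(x, theta):
            x = prop                         # accept; otherwise keep the current state
        total += H(x, theta)
    return total / m                         # bias vanishes as the chain forgets x0
```

Plugging such an estimate in place of the exact gradient in the proximal gradient iteration sketched earlier yields a perturbed proximal gradient algorithm, whose perturbation is biased but whose bias shrinks as the chain is run longer.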
Example 2: Discrete graphical model (Markov random field)

N independent observations of an undirected graph with $p$ nodes; each node takes values in a finite alphabet $\mathsf{X}$.
The $N$ i.i.d. observations $Y_i$ take values in $\mathsf{X}^p$, with distribution
\[
  y = (y_1, \cdots, y_p) \mapsto \pi_\theta(y)
  = \frac{1}{Z_\theta} \exp\Big( \sum_{k=1}^{p} \theta_{kk} B(y_k, y_k) + \sum_{1 \le j < k \le p} \theta_{kj} B(y_k, y_j) \Big)
  = \frac{1}{Z_\theta} \exp\big( \langle \theta, \bar{B}(y) \rangle \big)
\]
where
$B$ is a symmetric function,
$\theta$ is a symmetric $p \times p$ matrix,
the normalizing constant (partition function) $Z_\theta$ cannot be computed: it is a sum over $|\mathsf{X}|^p$ terms.
Likelihood and its gradient in a Markov random field

The likelihood is of the form (the scalar product between matrices being the Frobenius inner product)
\[
  \frac{1}{N} \log L(\theta) = \Big\langle \theta,\ \frac{1}{N} \sum_{i=1}^{N} \bar{B}(Y_i) \Big\rangle - \log Z_\theta .
\]
The likelihood is intractable.
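The extracted slide stops at the likelihood. By the standard exponential-family identity (consistent with the gradient-as-intractable-expectation theme above), $\nabla_\theta\big(N^{-1}\log L(\theta)\big) = N^{-1}\sum_{i} \bar{B}(Y_i) - \mathbb{E}_{\pi_\theta}[\bar{B}(Y)]$, where the second term is again an intractable expectation that can be approximated by MCMC. The sketch below does this along a Gibbs chain for the binary case $\mathsf{X} = \{0,1\}$ with $B(u,v) = uv$; this choice of $B$, the sampler and the symmetric-matrix bookkeeping are illustrative assumptions, not the talk's method.

```python
import numpy as np

def bbar(y):
    # Matrix bar{B}(y) for B(u, v) = u*v on {0,1}: off-diagonal entries are halved so
    # that the Frobenius product <theta, bbar(y)> equals
    # sum_k theta_kk y_k + sum_{j<k} theta_kj y_k y_j  for symmetric theta.
    M = 0.5 * np.outer(y, y)
    np.fill_diagonal(M, y * y)
    return M

def gibbs_sweep(y, theta, rng):
    # One systematic-scan Gibbs sweep: the conditional law of y_k given the rest is
    # Bernoulli with logit  theta_kk + sum_{j != k} theta_kj y_j.
    for k in range(len(y)):
        logit = theta[k, k] + theta[k] @ y - theta[k, k] * y[k]
        y[k] = 1.0 if rng.uniform() < 1.0 / (1.0 + np.exp(-logit)) else 0.0
    return y

def grad_loglik_estimate(theta, Y, m, rng=None):
    # (1/N) grad log L(theta) = (1/N) sum_i bbar(Y_i) - E_{pi_theta}[ bbar(Y) ],
    # with the intractable expectation replaced by an average along a Gibbs chain.
    rng = np.random.default_rng() if rng is None else rng
    empirical = np.mean([bbar(y) for y in Y], axis=0)
    x = rng.integers(0, 2, size=theta.shape[0]).astype(float)   # arbitrary initial state
    model_term = np.zeros_like(theta, dtype=float)
    for _ in range(m):
        x = gibbs_sweep(x, theta, rng)
        model_term += bbar(x)
    return empirical - model_term / m        # biased MCMC estimate of the gradient
```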