Convergence of perturbed Proximal Gradient algorithms

Gersende Fort
Institut de Mathématiques de Toulouse
CNRS and Univ. Paul Sabatier, Toulouse, France
Based on joint works with

Yves Atchadé (Univ. Michigan, USA) and Eric Moulines (École Polytechnique, France)
  → On Perturbed Proximal-Gradient algorithms (JMLR, 2016)

Jean-François Aujol (IMB, Bordeaux, France) and Charles Dossal (IMB, Bordeaux, France)
  → Acceleration for perturbed Proximal Gradient algorithms (work in progress)

Édouard Ollier (ENS Lyon, France) and Adeline Samson (Univ. Grenoble Alpes, France)
  → Penalized inference in Mixed Models by Proximal Gradient methods (work in progress)
Motivation: Pharmacokinetics (1/2)

N patients. At time 0: dose D of a drug. For patient i, observations {Y_ij, 1 ≤ j ≤ J_i}: evolution of the concentration at times t_ij, 1 ≤ j ≤ J_i.

Model:
  Y_ij = F(t_ij, X_i) + ε_ij,   ε_ij i.i.d. ∼ N(0, σ²)
  X_i = Z_i β + d_i ∈ R^L,   d_i i.i.d. ∼ N_L(0, Ω) and independent of ε
  Z_i a known matrix such that each row of X_i has an intercept (fixed effect) and covariates.

Example of model F: one-compartment, oral administration
  F(t, [ln Cl, ln V, ln A]) = C(Cl, V, A, D) ( exp(-(Cl/V) t) - exp(-A t) )
For each patient i:
  ln Cl_i = β_{0,Cl} + β_{1,Cl} Z_{i,1,Cl} + ... + β_{K,Cl} Z_{i,K,Cl} + d_{Cl,i}
  ln V_i:  idem, with covariates Z_{i,k,V} and coefficients β_{k,V}
  ln A_i:  idem, with covariates Z_{i,k,A} and coefficients β_{k,A}

Statistical analysis:
  estimation of θ = (β, σ², Ω), under sparsity constraints on β
  selection of the covariates based on the estimate β̂
  → Penalized Maximum Likelihood
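To make the data-generating mechanism concrete, here is a minimal simulation sketch of one patient. The constant C(Cl, V, A, D) is left unspecified on the slide; below it is taken as D·A / (V·(A − Cl/V)), the usual one-compartment oral-absorption formula, and the design matrix, coefficient values, sampling times and noise levels are all illustrative assumptions of mine, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(3)

def F(t, log_Cl, log_V, log_A, D=100.0):
    """One-compartment oral-absorption curve.  The constant C(Cl, V, A, D)
    is assumed to be D*A / (V*(A - Cl/V)); the slides leave it unspecified."""
    Cl, V, A = np.exp([log_Cl, log_V, log_A])
    C = D * A / (V * (A - Cl / V))
    return C * (np.exp(-(Cl / V) * t) - np.exp(-A * t))

# Simulate one hypothetical patient: X_i = Z_i beta + d_i, then noisy concentrations.
t = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])          # sampling times t_ij
Z_i = np.array([[1.0, 0.4], [1.0, -0.2], [1.0, 0.9]])  # intercept + one covariate per row (simplified design)
beta = np.array([0.5, 0.3])                            # fixed effects (illustrative values)
Omega = 0.05 * np.eye(3)                               # random-effect covariance
X_i = Z_i @ beta + rng.multivariate_normal(np.zeros(3), Omega)   # [ln Cl, ln V, ln A]
Y_i = F(t, *X_i) + rng.normal(0.0, 0.1, size=t.size)             # Y_ij = F(t_ij, X_i) + eps_ij
print(Y_i)
```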
Motivation: Pharmacokinetics (2/2)

Model:
  Y_ij = F(t_ij, X_i) + ε_ij,   ε_ij i.i.d. ∼ N(0, σ²)
  X_i = Z_i β + d_i ∈ R^L,   d_i i.i.d. ∼ N_L(0, Ω) and independent of ε
  Z_i a known matrix such that each row of X_i has an intercept (fixed effect) and covariates.

Likelihoods:
  Likelihood: not explicit.
  Complete likelihood: the distribution of {Y_ij, X_i; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} has an explicit expression.
  ML: here, the likelihood is not concave.
Outline

Penalized Maximum Likelihood inference in models with intractable likelihood
  Example 1: Latent variable models
  Example 2: Discrete graphical model (Markov random field)
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
Convergence analysis
Conclusion
Penalized Maximum Likelihood inference with intractable likelihood

N observations: Y = (Y_1, ..., Y_N).
A parametric statistical model indexed by θ ∈ Θ ⊆ R^d:
  θ ↦ L(θ), the likelihood of the observations (the dependence upon Y is omitted).
A penalty term on the parameter θ:
  θ ↦ g(θ) ≥ 0, for sparsity constraints on θ. Usually, g is non-smooth and convex.

Goal: computation of
  argmax_{θ ∈ Θ} { (1/N) log L(θ) - g(θ) }
when the likelihood L has no closed-form expression and cannot be evaluated.
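To make the objective concrete, here is a minimal sketch of a plain (non-perturbed) proximal gradient iteration for this criterion, under the illustrative assumptions that g(θ) = λ‖θ‖₁ (whose proximal map is soft-thresholding) and that the gradient of θ ↦ (1/N) log L(θ) can be evaluated exactly. The function names, the toy quadratic at the end and the step-size choice are mine, not from the slides.

```python
import numpy as np

def prox_l1(theta, gamma, lam):
    """Proximal operator of theta -> lam * ||theta||_1 with step gamma
    (component-wise soft-thresholding)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - gamma * lam, 0.0)

def proximal_gradient(grad_f, theta0, gamma, lam, n_iter=1000):
    """Iterate theta <- prox_{gamma*g}(theta + gamma * grad_f(theta)) to solve
    argmax_theta f(theta) - lam * ||theta||_1, f playing the role of (1/N) log L."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = prox_l1(theta + gamma * grad_f(theta), gamma, lam)
    return theta

# Toy usage with a concave quadratic f(theta) = -0.5 * ||A theta - b||^2
rng = np.random.default_rng(4)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
grad_f = lambda theta: -A.T @ (A @ theta - b)
gamma = 1.0 / np.linalg.norm(A.T @ A, 2)   # step <= 1/L, L = Lipschitz constant of grad_f
theta_hat = proximal_gradient(grad_f, np.zeros(10), gamma, lam=1.0)
print(theta_hat)                           # many coordinates are driven exactly to zero
```

In the models of this talk the exact gradient in this iteration is not available; it is replaced by a Monte Carlo approximation, which is what makes the proximal gradient algorithm "perturbed".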
Example 1: Latent variable model

The log-likelihood of the observations Y is of the form θ ↦ log L(θ), with
  L(θ) = ∫_X p_θ(x) µ(dx),
where µ is a positive σ-finite measure on a set X, and x collects the missing/latent data.

In these models,
  the complete likelihood p_θ(x) can be evaluated explicitly,
  the likelihood has no closed-form expression.

The exact integral could be replaced by a Monte Carlo approximation; this is known to be inefficient. Numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches).

→ What about the gradient of the (log-)likelihood?
Gradient of the likelihood in a latent variable model

  log L(θ) = log ∫_X p_θ(x) µ(dx)

Under regularity conditions, θ ↦ log L(θ) is C¹ and
  ∇ log L(θ) = ∫_X ∂_θ p_θ(x) µ(dx) / ∫_X p_θ(z) µ(dz)
             = ∫_X ∂_θ log p_θ(x) π_θ(dx),
where π_θ(dx) = p_θ(x) µ(dx) / ∫_X p_θ(z) µ(dz) is the a posteriori distribution of the latent variable given the observations Y.

Hence the gradient of the log-likelihood,
  ∇_θ log L(θ) = ∫_X ∂_θ log p_θ(x) π_θ(dx),
is an intractable expectation w.r.t. this conditional distribution, while ∂_θ log p_θ(x) can be evaluated for all (x, θ).
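A quick numerical check of this identity on a toy model (my own choice of example, not one from the slides): take X ∼ N(θ, 1) and Y | X ∼ N(X, 1), so that both the posterior π_θ and the exact gradient (Y − θ)/2 are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable model (illustrative choice, not from the slides):
#   X ~ N(theta, 1)  (latent),   Y | X ~ N(X, 1)  (observed),
# so the marginal likelihood is N(theta, 2) and everything is explicit.
theta, Y = 0.3, 1.7

# Closed-form gradient of the log-likelihood: (Y - theta) / 2
exact_grad = (Y - theta) / 2.0

# Fisher's identity: average d/dtheta log p_theta(x, Y) = (x - theta)
# over the posterior X | Y ~ N((Y + theta)/2, 1/2).
post_samples = rng.normal((Y + theta) / 2.0, np.sqrt(0.5), size=100_000)
mc_grad = np.mean(post_samples - theta)

print(exact_grad, mc_grad)   # the two values agree up to Monte Carlo error
```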
Approximation of the gradient

  ∇_θ log L(θ) = ∫_X ∂_θ log p_θ(x) π_θ(dx)

1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Use i.i.d. samples from π_θ to define a Monte Carlo approximation: not possible, in general.
3. Use m samples from a non-stationary Markov chain {X_{j,θ}, j ≥ 0} with unique stationary distribution π_θ, and define a Monte Carlo approximation. MCMC samplers provide such a chain.

Stochastic approximation of the gradient: this is a biased approximation, since for MCMC samples X_{j,θ},
  E[h(X_{j,θ})] ≠ ∫ h(x) π_θ(dx).
If the Markov chain is ergodic, the bias vanishes when j → ∞.
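Continuing the toy model above, here is a minimal sketch of option 3: a random-walk Metropolis chain targeting π_θ, from which the gradient is estimated via Fisher's identity. The proposal scale, chain length and starting point are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy model as above: posterior pi_theta = N((Y + theta)/2, 1/2).
# We pretend only its unnormalized log-density is available, as in a real latent model.
theta, Y = 0.3, 1.7
def log_post(x):                       # unnormalized log pi_theta(x)
    return -(x - (Y + theta) / 2.0) ** 2   # = -(x - m)^2 / (2 * 1/2)

# Random-walk Metropolis chain {X_j} with stationary distribution pi_theta
m, x, chain = 5_000, 0.0, []
for _ in range(m):
    prop = x + 0.5 * rng.standard_normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(x):
        x = prop
    chain.append(x)

# Biased Monte Carlo approximation of the gradient via Fisher's identity:
# average of d/dtheta log p_theta(x, Y) = (x - theta) over the (non-i.i.d.) chain.
grad_estimate = np.mean(np.array(chain) - theta)
print(grad_estimate, (Y - theta) / 2.0)   # MCMC estimate vs exact value
```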
Example 2: Discrete graphical model (Markov random field)

N independent observations of an undirected graph with p nodes; each node takes values in a finite alphabet X.

N i.i.d. observations Y_i in X^p with distribution
  π_θ(y) = (1/Z_θ) exp( Σ_{k=1}^p θ_{kk} B(y_k, y_k) + Σ_{1 ≤ j < k ≤ p} θ_{kj} B(y_k, y_j) )
         = (1/Z_θ) exp( ⟨θ, B̄(y)⟩ ),   y = (y_1, ..., y_p),
where
  B is a symmetric function,
  θ is a symmetric p × p matrix,
  the normalizing constant (partition function) Z_θ cannot be computed: it is a sum over |X|^p terms.
Likelihood and its gradient in a Markov random field

Likelihood of the form (the scalar product between matrices is the Frobenius inner product)
  (1/N) log L(θ) = ⟨θ, (1/N) Σ_{i=1}^N B̄(Y_i)⟩ - log Z_θ
The likelihood is intractable.
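A minimal sketch of where the intractability sits, on a binary Ising-type instance with B(a, b) = a·b: the statistic B̄(y) and the Frobenius inner product are cheap, but log Z_θ requires enumerating all |X|^p = 2^p configurations, so it is computable below only because p is tiny. The concrete values (p, θ, the fake sample Y) are illustrative, not from the talk.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Tiny Ising-type illustration: alphabet X = {-1, +1}, B(a, b) = a*b,
# p small enough that Z_theta can still be enumerated exactly.
p = 8
theta = rng.normal(scale=0.2, size=(p, p))
theta = (theta + theta.T) / 2.0                 # symmetric p x p matrix

def Bbar(y):
    """Sufficient-statistic matrix: <theta, Bbar(y)>_F equals
    sum_k theta_kk B(y_k, y_k) + sum_{j<k} theta_kj B(y_k, y_j)."""
    return np.tril(np.outer(y, y))              # keep the k >= j entries only

def log_Z(theta):
    """Partition function by brute force: a sum over |X|^p = 2^p configurations,
    infeasible as soon as p grows."""
    energies = [np.sum(theta * Bbar(np.array(y)))
                for y in itertools.product([-1, 1], repeat=p)]
    return np.log(np.sum(np.exp(energies)))

# Normalized log-likelihood (1/N) log L(theta) for a fake sample of size N = 50
Y = rng.choice([-1, 1], size=(50, p))
mean_stat = np.mean([Bbar(y) for y in Y], axis=0)
loglik = np.sum(theta * mean_stat) - log_Z(theta)
print(loglik)
```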