Stochastic Perturbations of Proximal-Gradient methods for nonsmooth convex optimization: the price of Markovian perturbations (PowerPoint PPT Presentation)


1. Stochastic Perturbations of Proximal-Gradient methods for nonsmooth convex optimization: the price of Markovian perturbations
Gersende Fort (LTCI, CNRS and Telecom ParisTech, Paris, France)
Based on joint works with Eric Moulines (Ecole Polytechnique, France), Yves Atchadé (Univ. Michigan, USA), Jean-François Aujol (Univ. Bordeaux, France) and Charles Dossal (Univ. Bordeaux, France).
→ On Perturbed Proximal-Gradient algorithms (2016-v3, arXiv)

2. Outline
- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

3. Penalized Maximum Likelihood inference in a latent variable model
N observations: $Y = (Y_1, \dots, Y_N)$.
The negative normalized log-likelihood of the observations $Y$ in a latent variable model:
$\theta \mapsto -\frac{1}{N} \log L(Y, \theta)$, where $L(Y, \theta) = \int p_\theta(x, Y)\, \mu(dx)$ and $\theta \in \Theta \subset \mathbb{R}^d$.
A penalty term on the parameter: $\theta \mapsto g(\theta)$, used for sparsity constraints on $\theta$; usually non-smooth and convex.
Goal: computation of
$\operatorname{argmin}_{\theta \in \Theta} \left\{ -\frac{1}{N} \log L(Y, \theta) + g(\theta) \right\}$
when the likelihood $L$ has no closed-form expression and cannot be evaluated.

4. Latent variable model: example (Generalized Linear Mixed Models)
GLMM: $Y_1, \dots, Y_N$ are independent observations from a Generalized Linear Model, with linear predictor
$\eta_i = \underbrace{\sum_{k=1}^{p} X_{i,k}\, \beta_k}_{\text{fixed effect}} + \underbrace{\sum_{\ell=1}^{q} Z_{i,\ell}\, U_\ell}_{\text{random effect}}$
where $X, Z$ are covariate matrices, $\beta \in \mathbb{R}^p$ is the fixed-effect parameter, and $U \in \mathbb{R}^q$ is the random-effect vector.

5. Latent variable model: example (continued)
Example: logistic regression. $Y_1, \dots, Y_N$ are binary independent observations: Bernoulli r.v. with mean $p_i = \exp(\eta_i) / (1 + \exp(\eta_i))$. Conditionally on $U$,
$(Y_1, \dots, Y_N) \mid U \;\sim\; \prod_{i=1}^{N} \frac{\exp(Y_i \eta_i)}{1 + \exp(\eta_i)}$
Gaussian random effect: $U \sim \mathcal{N}_q$.
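The logistic GLMM above can be sketched numerically. A minimal simulation sketch, assuming arbitrary dimensions and standard-normal covariates and parameters (all sizes and names below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 100, 5, 3          # assumed sizes: observations, fixed effects, random effects
X = rng.normal(size=(N, p))  # fixed-effect covariate matrix
Z = rng.normal(size=(N, q))  # random-effect covariate matrix
beta = rng.normal(size=p)    # fixed-effect parameter
U = rng.normal(size=q)       # Gaussian random effect

eta = X @ beta + Z @ U                 # linear predictor eta_i
p_i = 1.0 / (1.0 + np.exp(-eta))       # Bernoulli means exp(eta_i)/(1 + exp(eta_i))
Y = rng.binomial(1, p_i)               # binary observations

# Conditional log-likelihood of (Y_1, ..., Y_N) given U:
# sum_i [ Y_i * eta_i - log(1 + exp(eta_i)) ]
loglik = np.sum(Y * eta - np.log1p(np.exp(eta)))
```

Marginalizing this conditional likelihood over $U$ is exactly the integral $\int p_\theta(x, Y)\, \mu(dx)$ that has no closed form.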

6. Gradient of the log-likelihood
$\log L(Y, \theta) = \log \int p_\theta(x, Y)\, \mu(dx)$
Under regularity conditions, $\theta \mapsto \log L(Y, \theta)$ is $C^1$ and
$\nabla_\theta \log L(Y, \theta) = \frac{\int \partial_\theta p_\theta(x, Y)\, \mu(dx)}{\int p_\theta(z, Y)\, \mu(dz)} = \int \partial_\theta \log p_\theta(x, Y)\; \underbrace{\frac{p_\theta(x, Y)}{\int p_\theta(z, Y)\, \mu(dz)}}_{\text{the a posteriori distribution}}\; \mu(dx)$

7. Gradient of the log-likelihood (continued)
The gradient of the normalized log-likelihood,
$\nabla_\theta \left( -\frac{1}{N} \log L(Y, \theta) \right) = \int H_\theta(x)\, \pi_\theta(dx),$
is an intractable expectation w.r.t. $\pi_\theta$, the conditional distribution of the latent variable given the observations $Y$. For all $(x, \theta)$, $H_\theta(x)$ can be evaluated.

8. Approximation of the gradient
$\nabla_\theta \left( -\frac{1}{N} \log L(Y, \theta) \right) = \int_X H_\theta(x)\, \pi_\theta(dx)$
1. Quadrature techniques: poor behavior w.r.t. the dimension of $X$.
2. Monte Carlo approximation with i.i.d. samples: not possible, in general.
3. Markov chain Monte Carlo approximation: sample a Markov chain $\{X_{m,\theta}, m \ge 0\}$ with stationary distribution $\pi_\theta(dx)$ and set
$\int_X H_\theta(x)\, \pi_\theta(dx) \approx \frac{1}{M} \sum_{m=1}^{M} H_\theta(X_{m,\theta})$

9. Approximation of the gradient (continued)
Stochastic approximation of the gradient:
- it is a biased approximation: $\mathbb{E}\left[ \frac{1}{M} \sum_{m=1}^{M} H_\theta(X_{m,\theta}) \right] \neq \int H_\theta(x)\, \pi_\theta(dx)$;
- if the chain is ergodic "enough", the bias vanishes when $M \to \infty$.
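The MCMC estimate above can be sketched as follows. This is a toy illustration, assuming a one-dimensional Gaussian target $\pi_\theta = \mathcal{N}(\theta, 1)$ and $H_\theta(x) = x - \theta$ (so the exact expectation is 0); the random-walk Metropolis sampler here is an illustrative choice, not the sampler used in the talk:

```python
import numpy as np

def mcmc_gradient_estimate(theta, M, rng, burn_in=0):
    """Estimate int H_theta d(pi_theta) by averaging H_theta along a Markov
    chain with stationary distribution pi_theta (random-walk Metropolis)."""
    log_target = lambda x: -0.5 * (x - theta) ** 2   # pi_theta = N(theta, 1), up to a constant
    H = lambda x: x - theta                          # exact expectation under pi_theta is 0

    x = theta                                        # arbitrary starting point
    total, n_kept = 0.0, 0
    for m in range(burn_in + M):
        prop = x + rng.normal()                      # symmetric random-walk proposal
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop                                 # Metropolis accept/reject
        if m >= burn_in:
            total += H(x)
            n_kept += 1
    return total / n_kept                            # biased for finite M; bias vanishes as M grows

rng = np.random.default_rng(1)
est = mcmc_gradient_estimate(theta=0.5, M=20000, rng=rng)
```

Because the chain's samples are neither independent nor exactly distributed according to $\pi_\theta$, the average is a biased estimate for any finite $M$, matching the point made on this slide.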

10. To summarize
Problem: $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$ with $F(\theta) = f(\theta) + g(\theta)$, $\Theta \subseteq \mathbb{R}^d$, where
- $g$ is a convex non-smooth function (explicit);
- $f$ is $C^1$ and its gradient is of the form
$\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx) \approx \frac{1}{M} \sum_{m=1}^{M} H_\theta(X_{m,\theta})$
where $\{X_{m,\theta}, m \ge 0\}$ is the output of an MCMC sampler with target $\pi_\theta$.

11. To summarize (continued)
Difficulties:
- biased stochastic perturbation of the gradient;
- gradient-based methods in the Stochastic Approximation framework (a fixed number of Monte Carlo samples per iteration);
- weaker conditions on the stochastic perturbation.
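Since $g$ is non-smooth, the proximal-gradient methods discussed in the following sections handle it through its proximal operator rather than a gradient. A minimal sketch, assuming the common sparsity penalty $g(\theta) = \lambda \|\theta\|_1$, whose proximal map is componentwise soft-thresholding (the specific choice of $g$ is an illustrative assumption):

```python
import numpy as np

def prox_l1(theta, step, lam):
    """prox_{step * lam * ||.||_1}(theta): componentwise soft-thresholding."""
    return np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)

theta = np.array([3.0, -0.2, 0.5, -1.5])
shrunk = prox_l1(theta, step=1.0, lam=0.4)
# componentwise: [2.6, 0.0, 0.1, -1.1] -- entries smaller than step*lam are set to zero
```

This is what makes the penalty enforce sparsity: coordinates with magnitude below the threshold are mapped exactly to zero, while the others are shrunk toward zero.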

12. Outline (current section: Stochastic Gradient methods, case g = 0)
- Application: Penalized Maximum Likelihood inference in latent variable models
- Stochastic Gradient methods (case g = 0)
- Stochastic Proximal Gradient methods
- Rates of convergence
- High-dimensional logistic regression with random effects

13. Perturbed gradient algorithm
Algorithm: given a stepsize/learning-rate sequence $\{\gamma_n, n \ge 0\}$:
Initialisation: $\theta_0 \in \Theta$.
Repeat: compute $H_{n+1}$, an approximation of $\nabla f(\theta_n)$; set $\theta_{n+1} = \theta_n - \gamma_{n+1} H_{n+1}$.
References:
- M. Benaïm. Dynamics of stochastic approximation algorithms. Séminaire de Probabilités de Strasbourg (1999).
- A. Benveniste, M. Métivier and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York, 1990.
- V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univ. Press, 2008.
- M. Duflo. Random Iterative Systems. Appl. Math. 34, Springer-Verlag, Berlin, 1997.
- H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2003.
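The iteration above can be sketched directly. A minimal sketch with $\gamma_n = 1/n$, a toy quadratic $f$, and additive Gaussian noise standing in for the approximation error in $H_{n+1}$ (all of these are illustrative assumptions, not choices made in the talk):

```python
import numpy as np

def perturbed_gradient(grad, theta0, n_iter, rng, noise_scale=0.1):
    """theta_{n+1} = theta_n - gamma_{n+1} * H_{n+1}, where H_{n+1} is a noisy
    approximation of grad(theta_n) and gamma_n = 1/n is a decreasing stepsize."""
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n                                               # stepsize gamma_{n+1}
        H = grad(theta) + noise_scale * rng.normal(size=theta.shape)  # perturbed gradient
        theta = theta - gamma * H
    return theta

# Toy smooth objective f(theta) = 0.5 * ||theta - 1||^2, minimizer theta* = (1, 1)
grad = lambda th: th - 1.0
rng = np.random.default_rng(2)
theta_hat = perturbed_gradient(grad, theta0=np.zeros(2), n_iter=5000, rng=rng)
```

With zero-mean noise and a summable-squares stepsize sequence such as $\gamma_n = 1/n$, the iterates approach the minimizer; the talk's focus is precisely on what survives of such guarantees when the perturbation is biased and Markovian rather than zero-mean i.i.d.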
