STOCHASTIC FISTA ALGORITHMS: SO FAST?

G. Fort (1), L. Risser (1), Y. Atchadé (2), E. Moulines (3)

(1) IMT, Université de Toulouse & CNRS, F-31062 Toulouse, France.
(2) Department of Statistics, Univ. of Michigan, 1085 South University Ave, Ann Arbor 48109, MI, USA.
(3) CMAP, École Polytechnique, Route de Saclay, 91128 Palaiseau Cedex, France.

This work is partially supported by ANR-11-LABX-0040-CIMI within the program ANR-11-IDEX-0002-02.

ABSTRACT

Motivated by challenges in Computational Statistics, such as Penalized Maximum Likelihood inference in statistical models with intractable likelihoods, we analyze the convergence of a stochastic perturbation of the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) when the stochastic approximation relies on a biased Monte Carlo estimation, as happens when the points are drawn from a Markov chain Monte Carlo (MCMC) sampler. We first motivate this general framework and then show a convergence result for the perturbed FISTA algorithm. We discuss the convergence rate of this algorithm and the computational cost of the Monte Carlo approximation needed to reach a given precision. Finally, through a numerical example, we explore new directions for a better understanding of these Proximal-Gradient based stochastic optimization algorithms.

Index Terms — Computational Statistics, Stochastic Approximation, Markov chain Monte Carlo, Proximal-Gradient algorithms, Nesterov acceleration.

1. INTRODUCTION

In various analyses, we are faced with solving

\[
\operatorname{argmin}_{\theta \in \Theta} \, \big( f(\theta) + g(\theta) \big),
\tag{1}
\]

where the set Θ and the functions f, g satisfy:

A1. g : R^d → [0, +∞] is convex, not identically +∞, and lower semi-continuous; f : R^d → R ∪ {+∞} is continuously differentiable on Θ := {θ ∈ R^d : g(θ) + |f(θ)| < ∞} and its gradient is L-Lipschitz on Θ;

A2. for any θ ∈ R^d, ∇f(θ) = ∫_X H(θ, x) π_θ(dx), where X is a topological space endowed with its Borel σ-field, π_θ is a probability measure on X, and H : R^d × X → R^d is measurable; in addition, x ↦ H(θ, x) is π_θ-integrable for any θ ∈ R^d;

and the gradient ∇f is not explicit. Motivated by situations arising in Computational Statistics (see the examples in Section 2), we consider the case when only an approximation of ∇f(θ) is available, possibly a stochastic approximation and, if so, possibly a biased one. In the present paper, our main contribution is a convergence analysis of a numerical tool to solve Eq. (1), namely a stochastic perturbation of FISTA (see [1]), in the challenging situation when the perturbation comes from a stochastic and biased approximation of ∇f.
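To fix ideas, the following sketch shows what such a stochastic perturbation of FISTA looks like: the classical recursion of [1], with the exact gradient of f replaced by a (possibly biased) estimator. This is a minimal illustration under assumptions of our own, not the exact scheme analyzed in the paper: g is taken to be the ℓ1 penalty λ‖·‖_1 (so the proximal map is componentwise soft-thresholding), the step size γ is constant (typically 1/L), and grad_estimate is a hypothetical user-supplied callable standing for the Monte Carlo approximation of ∇f.

```python
import numpy as np

def soft_threshold(v, a):
    # Proximal map of a * ||.||_1: componentwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - a, 0.0)

def stochastic_fista(grad_estimate, theta0, gamma, lam, n_iter):
    # FISTA recursion where the exact gradient of f is replaced by the
    # (possibly biased, possibly stochastic) estimator `grad_estimate`.
    theta = np.asarray(theta0, dtype=float)
    tau = theta.copy()   # Nesterov extrapolated point
    t = 1.0
    for _ in range(n_iter):
        g_hat = grad_estimate(tau)                                    # approximation of grad f(tau)
        theta_new = soft_threshold(tau - gamma * g_hat, gamma * lam)  # proximal-gradient step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))              # momentum sequence
        tau = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)   # Nesterov acceleration
        theta, t = theta_new, t_new
    return theta
```

When grad_estimate returns the exact gradient, this reduces to the deterministic FISTA of [1]; in the examples of Section 2 below, it would wrap an MCMC average of the form sketched at the end of that section.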
2. PENALIZED MAXIMUM LIKELIHOOD ESTIMATION IN MODELS WITH INTRACTABLE LIKELIHOOD

In this section, two classes of problems arising in Computational Statistics, and illustrating the question (1) in the framework A1-A2, are presented. The first situation corresponds to the computation of the Penalized Maximum Likelihood, or equivalently the Bayesian Maximum a Posteriori estimator, in latent variable models. In that case, g stands for the penalty term on the parameter θ (in the Bayesian context, the prior on the parameter), while f is the normalized negative log-likelihood: for latent variable models, it is of the form (see e.g. [2])

\[
f(\theta) = -\ell_N(\theta) := -\frac{1}{N} \log \int_{\mathsf{X}} p(x, \theta)\, \mathrm{d}\mu(x),
\tag{2}
\]

where, for any θ, p(·, θ) dµ is the complete data likelihood and the latent variables x take values in X (µ is a positive σ-finite measure, such as the Lebesgue measure when X ⊆ R^p or the counting measure when X is countable). In (2), the dependence upon the N observations is omitted. Under regularity conditions on the model,

\[
\nabla f(\theta) = -\frac{1}{N} \int_{\mathsf{X}} \partial_\theta \log p(x, \theta)\, \mathrm{d}\pi_\theta(x),
\tag{3}
\]

where

\[
\mathrm{d}\pi_\theta(x) := \frac{p(x, \theta)\, \mathrm{d}\mu(x)}{\int_{\mathsf{X}} p(u, \theta)\, \mathrm{d}\mu(u)} = \frac{p(x, \theta)\, \mathrm{d}\mu(x)}{\exp(N \ell_N(\theta))}
\tag{4}
\]

is the a posteriori distribution (of the latent variables, given the observations, when the parameter is θ), which is known up to a normalizing constant. In this example, the computation of the gradient ∇f is numerically intractable: the gradient is an expectation with respect to a distribution known up to a normalizing constant. This integral can be approximated by a Monte Carlo sum computed from the output of an MCMC sampler (see e.g. [3, Chapter 6]), thus providing a biased stochastic approximation of the exact gradient. Note indeed that if {X_{j,θ}, j ≥ 0} is a (non-stationary) ergodic Markov chain produced by an MCMC sampler with target dπ_θ, then for any positive measurable function h,

\[
\mathbb{E}\Big[ \frac{1}{m} \sum_{j=1}^{m} h(X_{j,\theta}) \Big] - \int_{\mathsf{X}} h \, \mathrm{d}\pi_\theta \neq 0,
\]

but this bias vanishes when m → ∞ (see e.g. [4, Chapter 13]); a schematic implementation of such an estimator of (3) is sketched below.

The second situation corresponds to the computation of the Penalized Maximum Likelihood estimator in a binary graphical model.
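Returning to the first example, here is a minimal sketch, under assumptions of our own, of the biased Monte Carlo estimator of ∇f(θ) in (3): the output of a random-walk Metropolis-Hastings chain targeting π_θ, which requires p(·, θ) only up to its normalizing constant, is averaged. The names log_p (for x ↦ log p(x, θ)) and H (for the integrand in A2, here (θ, x) ↦ −(1/N) ∂_θ log p(x, θ)) are hypothetical placeholders.

```python
import numpy as np

def mcmc_gradient_estimate(theta, log_p, H, x0, m, step=0.5, rng=None):
    # Estimate grad f(theta) = \int H(theta, x) dpi_theta(x) (cf. A2 and (3))
    # by averaging H along a random-walk Metropolis chain whose invariant
    # law is pi_theta, known only up to a normalizing constant.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    running_sum = np.zeros_like(H(theta, x))
    for _ in range(m):
        proposal = x + step * rng.standard_normal(x.shape)
        # Acceptance ratio: the normalizing constant exp(N * l_N(theta)) cancels.
        if np.log(rng.uniform()) < log_p(proposal, theta) - log_p(x, theta):
            x = proposal
        running_sum = running_sum + H(theta, x)
    # The chain starts from an arbitrary x0 and no burn-in is discarded,
    # so the average is biased for finite m; the bias vanishes as m grows.
    return running_sum / m
```

Plugged into the stochastic_fista sketch of Section 1 as grad_estimate (with θ set to the current extrapolated point), this yields the kind of biased stochastic perturbation analyzed in this paper.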