Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

Julien Mairal, INRIA LEAR, Grenoble
Gargantua workshop, LJK, November 2013
A simple optimization principle

[Figure: a surrogate $g(\theta)$ majorizing the objective $f(\theta)$ and touching it at the current estimate $\kappa$.]

Objective: $\min_{\theta \in \Theta} f(\theta)$.

This principle is called Majorization-Minimization [Lange et al., 2000]; it is quite popular in statistics and signal processing.
In this work

[Figure: the same surrogate picture, $g(\theta)$ majorizing $f(\theta)$ at $\kappa$.]

- scalable Majorization-Minimization algorithms;
- for convex or non-convex, smooth or non-smooth problems.

References
- J. Mairal. Optimization with First-Order Surrogate Functions. ICML'13.
- J. Mairal. Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization. NIPS'13.
Setting: First-Order Surrogate Functions

[Figure: the objective $f(\theta)$, a surrogate $g(\theta)$, and the approximation error $h(\theta) = g(\theta) - f(\theta)$ around the point $\kappa$.]

- $g(\theta') \geq f(\theta')$ for all $\theta'$ in $\arg\min_{\theta \in \Theta} g(\theta)$;
- the approximation error $h \triangleq g - f$ is differentiable, and $\nabla h$ is $L$-Lipschitz; moreover, $h(\kappa) = 0$ and $\nabla h(\kappa) = 0$.
The Basic MM Algorithm

Algorithm 1: Basic Majorization-Minimization Scheme
1: Input: $\theta_0 \in \Theta$ (initial estimate); $N$ (number of iterations).
2: for $n = 1, \ldots, N$ do
3:   Compute a surrogate $g_n$ of $f$ near $\theta_{n-1}$;
4:   Minimize $g_n$ and update the solution: $\theta_n \in \arg\min_{\theta \in \Theta} g_n(\theta)$;
5: end for
6: Output: $\theta_N$ (final estimate).
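As a concrete illustration, here is a minimal Python sketch of Algorithm 1, assuming the caller supplies the two surrogate-handling routines; the function and argument names are hypothetical, not part of the original slides.

```python
# Minimal sketch of the generic MM scheme (Algorithm 1). Two hypothetical
# callbacks are assumed:
#   build_surrogate(f, theta): returns a first-order surrogate g_n of f near theta
#   minimize_surrogate(g, Theta): returns a minimizer of g over the set Theta
def mm(f, theta0, N, build_surrogate, minimize_surrogate, Theta=None):
    theta = theta0
    for n in range(N):
        g = build_surrogate(f, theta)         # majorizes f and is tight at theta_{n-1}
        theta = minimize_surrogate(g, Theta)  # theta_n in argmin_{theta in Theta} g(theta)
    return theta
```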
Examples of First-Order Surrogate Functions

Lipschitz Gradient Surrogates: $f$ is $L$-smooth (differentiable with $L$-Lipschitz gradient).
$$g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top (\theta - \kappa) + \frac{L}{2}\|\theta - \kappa\|_2^2.$$
Minimizing $g$ yields a gradient descent step $\theta \leftarrow \kappa - \frac{1}{L}\nabla f(\kappa)$.

Proximal Gradient Surrogates: $f = f_1 + f_2$ with $f_1$ smooth.
$$g: \theta \mapsto f_1(\kappa) + \nabla f_1(\kappa)^\top (\theta - \kappa) + \frac{L}{2}\|\theta - \kappa\|_2^2 + f_2(\theta).$$
Minimizing $g$ amounts to one step of the forward-backward, ISTA, or proximal gradient descent algorithm [Beck and Teboulle, 2009, Combettes and Pesquet, 2010, Wright et al., 2008, Nesterov, 2007].
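To make the proximal gradient surrogate concrete, here is a small NumPy sketch for the special case where $f_1$ is a least-squares term and $f_2 = \lambda\|\cdot\|_1$, so that minimizing $g$ reduces to a soft-thresholded gradient step (one ISTA iteration); all names are illustrative.

```python
import numpy as np

# Sketch: minimizing the proximal gradient surrogate when
# f1(theta) = 0.5*||X theta - y||^2 and f2 = lam*||theta||_1.
# The minimizer of g is a soft-thresholded gradient step (one ISTA iteration).
def ista_step(theta, X, y, lam, L):
    grad = X.T @ (X @ theta - y)   # gradient of f1 at the anchor kappa = theta
    z = theta - grad / L           # minimizer of the smooth part of the surrogate
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1

# L can be taken as the largest eigenvalue of X^T X so that grad f1 is L-Lipschitz;
# iterating this step with that choice is exactly ISTA for the Lasso.
```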
Examples of First-Order Surrogate Functions

Linearizing Concave Functions and DC-Programming: $f = f_1 + f_2$ with $f_2$ smooth and concave.
$$g: \theta \mapsto f_1(\theta) + f_2(\kappa) + \nabla f_2(\kappa)^\top (\theta - \kappa).$$
When $f_1$ is convex, the algorithm is called DC-programming.

Quadratic Surrogates: $f$ is twice differentiable, and $H$ is a uniform upper bound of $\nabla^2 f$:
$$g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top (\theta - \kappa) + \frac{1}{2}(\theta - \kappa)^\top H (\theta - \kappa).$$
These are actually a big deal in statistics and machine learning [Böhning and Lindsay, 1988, Khan et al., 2010, Jebara and Choromanska, 2012].
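As an illustration of the quadratic surrogate, the following sketch assumes a logistic-regression loss together with the classical curvature bound $H = \frac{1}{4} X^\top X$ (in the spirit of Böhning and Lindsay); the code and names are illustrative, not taken from the talk.

```python
import numpy as np

# Sketch of the quadratic-surrogate MM update for logistic regression with
# f(theta) = sum_i log(1 + exp(-y_i x_i^T theta)), y_i in {-1, +1}, and the
# uniform bound H = (1/4) X^T X on the Hessian. Minimizing g gives
# theta <- kappa - H^{-1} grad f(kappa).
def logistic_mm_step(kappa, X, y, H_inv):
    margins = y * (X @ kappa)
    grad = -X.T @ (y / (1.0 + np.exp(margins)))   # gradient of f at kappa
    return kappa - H_inv @ grad                    # minimizer of the quadratic surrogate

# H_inv would typically be precomputed once, e.g. np.linalg.inv(0.25 * X.T @ X),
# assuming X^T X is invertible.
```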
Examples of First-Order Surrogate Functions

More Exotic Surrogates: consider a smooth approximation of the trace (nuclear) norm
$$f_\mu: \theta \mapsto \operatorname{Tr}\big((\theta^\top \theta + \mu I)^{1/2}\big) = \sum_{i=1}^p \sqrt{\lambda_i(\theta^\top \theta) + \mu}.$$
The function $f': H \mapsto \operatorname{Tr}(H^{1/2})$ is concave on the set of p.d. matrices and $\nabla f'(H) = \frac{1}{2} H^{-1/2}$. Linearizing it gives the surrogate
$$g_\mu: \theta \mapsto f_\mu(\kappa) + \frac{1}{2}\operatorname{Tr}\big((\kappa^\top \kappa + \mu I)^{-1/2}(\theta^\top \theta - \kappa^\top \kappa)\big),$$
which yields the algorithm of Mohan and Fazel [2012].
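For completeness, here is a small SciPy sketch evaluating the smoothed nuclear norm $f_\mu$ and the surrogate $g_\mu$ from the formulas above; it only spells out the definitions and makes no claim about the actual implementation of Mohan and Fazel [2012].

```python
import numpy as np
from scipy.linalg import sqrtm

# Sketch: evaluating f_mu and its first-order surrogate g_mu at a fixed anchor
# kappa, following the slide's formulas. theta and kappa are matrices with the
# same number of columns; mu > 0 is the smoothing parameter.
def f_mu(theta, mu):
    M = theta.T @ theta + mu * np.eye(theta.shape[1])
    return np.trace(sqrtm(M)).real

def g_mu(theta, kappa, mu):
    K = kappa.T @ kappa + mu * np.eye(kappa.shape[1])
    A = np.linalg.inv(sqrtm(K)).real   # (kappa^T kappa + mu I)^{-1/2}
    return f_mu(kappa, mu) + 0.5 * np.trace(A @ (theta.T @ theta - kappa.T @ kappa))
```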
Examples of First-Order Surrogate Functions

Variational Surrogates: $f(\theta_1) \triangleq \min_{\theta_2 \in \Theta_2} \tilde f(\theta_1, \theta_2)$, where $\tilde f$ is "smooth" w.r.t. $\theta_1$ and strongly convex w.r.t. $\theta_2$:
$$g: \theta_1 \mapsto \tilde f(\theta_1, \kappa_2^\star) \quad \text{with} \quad \kappa_2^\star \triangleq \arg\min_{\theta_2 \in \Theta_2} \tilde f(\kappa_1, \theta_2).$$

Saddle-Point Surrogates: $f(\theta_1) \triangleq \max_{\theta_2 \in \Theta_2} \tilde f(\theta_1, \theta_2)$, where $\tilde f$ is "smooth" w.r.t. $\theta_1$ and strongly concave w.r.t. $\theta_2$:
$$g: \theta_1 \mapsto \tilde f(\theta_1, \kappa_2^\star) + \frac{L''}{2}\|\theta_1 - \kappa_1\|_2^2.$$

Jensen Surrogates: $f(\theta) \triangleq \tilde f(x^\top \theta)$, where $\tilde f$ is $L$-smooth. Choose a weight vector $w$ in $\mathbb{R}_+^p$ such that $\|w\|_1 = 1$ and $w_i \neq 0$ whenever $x_i \neq 0$:
$$g: \theta \mapsto \sum_{i=1}^p w_i \,\tilde f\Big(\frac{x_i}{w_i}(\theta_i - \kappa_i) + x^\top \kappa\Big).$$
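A possible instantiation of the Jensen surrogate, assuming the standard weight choice $w_i = |x_i|/\|x\|_1$ and a vectorized callable for $\tilde f$; everything here is illustrative.

```python
import numpy as np

# Sketch of a Jensen surrogate for f(theta) = ftilde(x^T theta), where ftilde is
# an L-smooth scalar function applied elementwise (vectorized callable).
# Weights w_i = |x_i| / ||x||_1 satisfy ||w||_1 = 1 and w_i != 0 whenever x_i != 0.
def jensen_surrogate(ftilde, x, kappa):
    w = np.abs(x) / np.sum(np.abs(x))
    xk = x @ kappa
    def g(theta):
        active = w > 0   # coordinates with x_i = 0 contribute nothing
        z = x[active] / w[active] * (theta[active] - kappa[active]) + xk
        return np.sum(w[active] * ftilde(z))
    return g
```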
Theoretical Guarantees

- for non-convex problems: $f(\theta_n)$ monotonically decreases and
$$\liminf_{n \to +\infty}\; \inf_{\theta \in \Theta} \frac{\nabla f(\theta_n, \theta - \theta_n)}{\|\theta - \theta_n\|_2} \geq 0,$$
which is an asymptotic stationary point condition (here $\nabla f(\theta_n, \theta - \theta_n)$ denotes the directional derivative of $f$ at $\theta_n$ in the direction $\theta - \theta_n$);
- for convex problems: $f(\theta_n) - f^\star = O(1/n)$;
- for $\mu$-strongly convex problems: the convergence rate is linear, $O((1 - \mu/L)^n)$.

The convergence rates and the proof techniques are the same as for proximal gradient methods [Nesterov, 2007, Beck and Teboulle, 2009].
New Majorization-Minimization Algorithms

Given $f: \mathbb{R}^p \to \mathbb{R}$ and $\Theta \subseteq \mathbb{R}^p$, our goal is to solve $\min_{\theta \in \Theta} f(\theta)$.

We introduce algorithms for non-convex and convex optimization:
- a block coordinate scheme for separable surrogates;
- an incremental algorithm dubbed MISO for separable functions $f$;
- a stochastic algorithm for minimizing expectations.

Also several variants for convex optimization:
- an accelerated one (Nesterov-like);
- a "Frank-Wolfe" majorization-minimization algorithm.
Incremental Optimization: MISO

Suppose that $f$ splits into many components:
$$f(\theta) = \frac{1}{T}\sum_{t=1}^T f^t(\theta).$$

Recipe
- incrementally update an approximate surrogate $\frac{1}{T}\sum_{t=1}^T g^t$;
- add some heuristics for practical implementations.

Related (Inspiring) Work for Convex Problems
- related to SAG [Schmidt et al., 2013] and SDCA [Shalev-Shwartz and Zhang, 2012], but offers different update rules.
Incremental Optimization: MISO

Algorithm 2: Incremental Scheme MISO
1: Input: $\theta_0 \in \Theta$; $N$ (number of iterations).
2: Choose surrogates $g_0^t$ of $f^t$ near $\theta_0$ for all $t$;
3: for $n = 1, \ldots, N$ do
4:   Randomly pick one index $\hat t_n$ and choose a surrogate $g_n^{\hat t_n}$ of $f^{\hat t_n}$ near $\theta_{n-1}$; set $g_n^t \triangleq g_{n-1}^t$ for $t \neq \hat t_n$;
5:   Update the solution: $\theta_n \in \arg\min_{\theta \in \Theta} \frac{1}{T}\sum_{t=1}^T g_n^t(\theta)$;
6: end for
7: Output: $\theta_N$ (final estimate).
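A schematic Python version of Algorithm 2, with one hypothetical Surrogate object per component; it only mirrors the control flow of the pseudocode and leaves the surrogate construction and minimization abstract.

```python
import random

# High-level sketch of MISO (Algorithm 2): one surrogate object per component f^t
# is kept in memory; each iteration refreshes a single surrogate at the current
# iterate and minimizes the average surrogate. `surrogate_list[t]` is assumed to
# expose update(theta), and minimize_average(surrogates) stands in for step 5.
def miso(surrogate_list, minimize_average, theta0, N):
    theta = theta0
    for n in range(N):
        t = random.randrange(len(surrogate_list))  # pick one index uniformly at random
        surrogate_list[t].update(theta)            # refresh g^t near theta_{n-1}; others unchanged
        theta = minimize_average(surrogate_list)   # theta_n minimizes (1/T) sum_t g^t_n
    return theta
```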
Incremental Optimization: MISO

Update Rule for Proximal Gradient Surrogates
We want to minimize $\frac{1}{T}\sum_{t=1}^T f_1^t(\theta) + f_2(\theta)$.

$$\theta_n = \arg\min_{\theta \in \Theta} \frac{1}{T}\sum_{t=1}^T \Big[ f_1^t(\kappa^t) + \nabla f_1^t(\kappa^t)^\top(\theta - \kappa^t) + \frac{L}{2}\|\theta - \kappa^t\|_2^2 \Big] + f_2(\theta)$$
$$\phantom{\theta_n} = \arg\min_{\theta \in \Theta} \frac{1}{2}\Big\|\theta - \Big(\frac{1}{T}\sum_{t=1}^T \kappa^t - \frac{1}{LT}\sum_{t=1}^T \nabla f_1^t(\kappa^t)\Big)\Big\|_2^2 + \frac{1}{L} f_2(\theta).$$

Then, randomly draw one index $\hat t_n$, and update $\kappa^{\hat t_n} \leftarrow \theta_n$.

Remark
- removing $f_2$ and replacing $\frac{1}{T}\sum_{t=1}^T \kappa^t$ by $\theta_{n-1}$ yields SAG [Schmidt et al., 2013];
- replacing $L$ by $\mu$ is "close" to SDCA [Shalev-Shwartz and Zhang, 2012].
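A sketch of this update rule for the unconstrained $\ell_1$-regularized case ($f_2 = \lambda\|\cdot\|_1$, $\Theta = \mathbb{R}^p$), keeping running averages of the anchors $\kappa^t$ and of the stored gradients so that one iteration costs a single gradient evaluation plus a soft-thresholding; the bookkeeping and names are illustrative.

```python
import numpy as np

# One MISO iteration with proximal gradient surrogates and f2 = lam*||.||_1.
# kappa: (T, p) array of anchors; grads: (T, p) array of stored gradients
# grad f1^t(kappa^t); avg_kappa, avg_grad: their running averages (p,).
# grad_f1(t, theta) is a hypothetical oracle returning grad f1^t(theta).
def miso_prox_step(t, kappa, grads, avg_kappa, avg_grad, grad_f1, lam, L):
    T = kappa.shape[0]
    # minimize the average surrogate: soft-threshold the averaged gradient step
    z = avg_kappa - avg_grad / L
    theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    # refresh component t (drawn by the caller) at the new iterate: kappa^t <- theta_n
    new_grad = grad_f1(t, theta)
    avg_kappa = avg_kappa + (theta - kappa[t]) / T
    avg_grad = avg_grad + (new_grad - grads[t]) / T
    kappa[t], grads[t] = theta, new_grad
    return theta, avg_kappa, avg_grad
```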
Incremental Optimization: MISO

Theoretical Guarantees
- for non-convex problems, the guarantees are the same as for the generic MM algorithm, with probability one;
- for convex problems and proximal gradient surrogates, the expected convergence rate becomes $O(T/n)$;
- for $\mu$-strongly convex problems and proximal gradient surrogates, the expected convergence rate is linear, $O((1 - \mu/(TL))^n)$.

Remarks
- for $\mu$-strongly convex problems, the rates of SDCA and SAG are better: $\mu/(LT)$ is replaced by $O(\min(\mu/L, 1/T))$;
- MISO with minorizing surrogates is close to SDCA, with "similar" convergence rates (details yet to be written).