Restarting accelerated gradient methods with a rough strong convexity estimate

Olivier Fercoq
Joint work with Zheng Qu
20 March 2017

Sections: Setup, Restarting FISTA, Restarting APPROX, Adaptive restart

Minimisation of composite functions

Minimise the “strongly” convex composite function F:

$$\min_{x \in \mathbb{R}^N} \{ F(x) = f(x) + \psi(x) \}$$

• $f : \mathbb{R}^N \to \mathbb{R}$, convex, differentiable, with $L$-Lipschitz gradient
• $\psi : \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$, convex, with simple proximal operator
  $$\mathrm{prox}_\psi(x) = \arg\min_{y \in \mathbb{R}^N} \; \psi(y) + \tfrac{1}{2}\|x - y\|_L^2$$
• $F = f + \psi$ features some kind of strong convexity
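
To make "simple proximal operator" concrete, here is a minimal sketch for the case ψ = λ‖·‖₁, assuming that ‖·‖_L is the Euclidean norm scaled by a scalar L. The function name `prox_l1` and its parameters are illustrative and do not come from the slides.

```python
import math

def prox_l1(x, lam, L=1.0):
    """Return argmin_y lam*||y||_1 + (L/2)*||x - y||^2 (soft-thresholding)."""
    t = lam / L  # the threshold shrinks as the curvature L grows
    # Shrink each coordinate towards zero by t, keeping its sign.
    return [math.copysign(max(abs(xi) - t, 0.0), xi) for xi in x]

shrunk = prox_l1([3.0, -0.5, 1.0], lam=1.0)
```

The operator is separable and costs O(N), which is the sense in which the prox step is cheap compared with a gradient evaluation.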

The local error bound property

Let $X^*$ be the set of optimal solutions, i.e. $\forall x^* \in X^*$, $\forall x \in \mathbb{R}^N$, $F^* = F(x^*) \le F(x)$.

Assumption
There exist $s > 0$ and $\mu_F(s) > 0$ such that if $\mathrm{dist}_L(x, X^*) \le s$, then
$$F(x) \ge F^* + \frac{\mu_F(s)}{2}\,\mathrm{dist}_L(x, X^*)^2$$

Examples:
- $F(x) = \varphi(Ax)$ with $\nabla^2 \varphi(x) > 0$, $\forall x$
- $F(x) = \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$

Local error bound for some $s > 0$ ⇒ local error bound on every compact set
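
As a sanity check on the assumption (an added remark, not taken from the slides), global strong convexity is the simplest special case. The sketch below assumes that F is μ-strongly convex with respect to ‖·‖_L, so that X* reduces to a single point x*.

```latex
% Assumption: F is \mu-strongly convex w.r.t. \|\cdot\|_L, hence X^* = \{x^*\}.
F(x) \ge F(x^*) + \langle g, x - x^* \rangle + \frac{\mu}{2}\|x - x^*\|_L^2
\quad \text{for all } x \text{ and all } g \in \partial F(x^*).
% Optimality of x^* means 0 \in \partial F(x^*); taking g = 0 gives
F(x) \ge F^* + \frac{\mu}{2}\,\mathrm{dist}_L(x, X^*)^2 \quad \text{for all } x,
% so the local error bound holds for every s > 0 with \mu_F(s) = \mu.
```

The two examples on the slide are not strongly convex in general (A may have a nontrivial kernel), which is why the weaker, local property is the one assumed.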

Algorithms: FISTA

Choose $x_0 \in \mathrm{dom}\,\psi$. Set $\theta_0 = 1$ and $z_0 = x_0$.
for $k \ge 0$ do
    $y_k = (1 - \theta_k) x_k + \theta_k z_k$
    $x_{k+1} = \arg\min_{x \in \mathbb{R}^N} \big\{ \langle \nabla f(y_k), x - y_k \rangle + \tfrac{1}{2}\|x - y_k\|_L^2 + \psi(x) \big\}$
    $z_{k+1} = z_k + \tfrac{1}{\theta_k}(x_{k+1} - y_k)$
    $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
end for
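
The FISTA iteration above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: ‖·‖_L is taken to be the Euclidean norm scaled by a scalar L, and the toy problem (diagonal quadratic plus an ℓ1 term, with the names `a`, `b`, `lam`) is hypothetical and chosen because its minimiser is known in closed form.

```python
import math

def next_theta(theta):
    """theta_{k+1} = (sqrt(theta_k^4 + 4 theta_k^2) - theta_k^2) / 2."""
    return (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2

def fista(grad_f, prox, x0, L, iters):
    """Run `iters` FISTA steps; prox(v, L) solves argmin_u psi(u) + (L/2)||u - v||^2."""
    x, z, theta = list(x0), list(x0), 1.0
    for _ in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        g = grad_f(y)
        # The x-update is a proximal gradient step taken from y.
        x = prox([yi - gi / L for yi, gi in zip(y, g)], L)
        z = [zi + (xi - yi) / theta for zi, xi, yi in zip(z, x, y)]
        theta = next_theta(theta)
    return x

# Hypothetical toy problem: f(x) = 0.5*sum(a_i*(x_i - b_i)^2), psi = lam*||x||_1.
a, b, lam = [1.0, 0.1], [2.0, -3.0], 0.1
L = max(a)  # Lipschitz constant of grad f

def grad_f(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]

def prox(v, L):
    return [math.copysign(max(abs(vi) - lam / L, 0.0), vi) for vi in v]

def F(x):
    return sum(0.5 * ai * (xi - bi) ** 2 + lam * abs(xi) for ai, xi, bi in zip(a, x, b))

# The problem is separable, so the minimiser is a coordinate-wise soft-threshold.
x_star = [math.copysign(max(abs(bi) - lam / ai, 0.0), bi) for ai, bi in zip(a, b)]
x = fista(grad_f, prox, [0.0, 0.0], L, iters=300)
gap = F(x) - F(x_star)
```

Note that the x-update never needs an explicit arg min solver: completing the square turns it into one gradient evaluation followed by one prox call.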

Algorithms: APG

Choose $x_0 \in \mathrm{dom}\,\psi$. Set $\theta_0 = 1$ and $z_0 = x_0$.
for $k \ge 0$ do
    $y_k = (1 - \theta_k) x_k + \theta_k z_k$
    $z_{k+1} = \arg\min_{z \in \mathbb{R}^N} \big\{ \langle \nabla f(y_k), z - y_k \rangle + \tfrac{\theta_k}{2}\|z - z_k\|_L^2 + \psi(z) \big\}$
    $x_{k+1} = y_k + \theta_k (z_{k+1} - z_k)$
    $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
end for

Algorithms: APPROX

Choose $x_0 \in \mathrm{dom}\,\psi$. Set $\theta_0 = \tfrac{\tau}{n}$ and $z_0 = x_0$.
for $k \ge 0$ do
    $y_k = (1 - \theta_k) x_k + \theta_k z_k$
    Randomly generate $S_k \sim \hat S$
    for $i \in S_k$ do
        $z^i_{k+1} = \arg\min_{z \in \mathbb{R}^{n_i}} \big\{ \langle \nabla_i f(y_k), z - y^i_k \rangle + \tfrac{\theta_k n v_i}{2\tau}|z - z^i_k|^2 + \psi_i(z) \big\}$
    end for
    $x_{k+1} = y_k + \tfrac{n}{\tau}\theta_k (z_{k+1} - z_k)$
    $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
end for

Accelerated gradient methods

              µ_F = 0                 µ_F > 0 is known
FISTA         Beck & Teboulle         Vandenberghe
APG           Nesterov                Nesterov
dual APG      Nesterov                Nesterov
APPROX        Fercoq & Richtárik      Lin, Lu & Xiao
Rate          $O(1/k^2)$              $O((1 - \sqrt{\mu_F})^k)$

The algorithms that guarantee linear convergence depend explicitly on $\mu_F$ (e.g. $\theta_k = \sqrt{\mu_F}$, $\forall k$).

Restart when µ_F is known

Proposition (Nesterov: Conditional restarting at $x_k$)
Let $(x_k, z_k)$ be the iterates of FISTA. We have
$$F(x_k) - F(x^*) \le \frac{\theta_{k-1}^2}{\mu_F}\,(F(x_0) - F(x^*)).$$
Moreover, given $\alpha < 1$, if
$$k \ge 2\sqrt{\frac{1}{\alpha \mu_F}} - 1,$$
then
$$F(x_k) - F(x^*) \le \alpha\,(F(x_0) - F(x^*)).$$
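
The second part of the proposition follows from the first through the standard bound θ_k ≤ 2/(k+2). The short sketch below computes the smallest admissible restart period and checks numerically that θ²_{K−1}/µ_F falls below α; the helper names `restart_period` and `theta_after` and the sample values of µ_F and α are illustrative.

```python
import math

def restart_period(mu_F, alpha):
    """Smallest integer k with k >= 2*sqrt(1/(alpha*mu_F)) - 1."""
    return math.ceil(2 * math.sqrt(1 / (alpha * mu_F)) - 1)

def theta_after(k):
    """Return theta_k for the recursion of the slides, started at theta_0 = 1."""
    theta = 1.0
    for _ in range(k):
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return theta

mu_F, alpha = 0.01, 0.5  # sample values, chosen only for illustration
K = restart_period(mu_F, alpha)
# Factor multiplying F(x_0) - F(x*) after K iterations, per the proposition.
factor = theta_after(K - 1) ** 2 / mu_F
```

The period grows like 1/√µ_F, so an estimate of µ_F that is too small by a factor of 100 makes each restart cycle about 10 times longer than necessary.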

FISTA with restart

Choose $x_0 \in \mathrm{dom}\,\psi$. Set $\theta_0 = 1$ and $z_0 = x_0$.
for $k \ge 0$ do
    $y_k = (1 - \theta_k) x_k + \theta_k z_k$
    $x_{k+1} = \arg\min_{x \in \mathbb{R}^N} \big\{ \langle \nabla f(y_k), x - y_k \rangle + \tfrac{1}{2}\|x - y_k\|_L^2 + \psi(x) \big\}$
    $z_{k+1} = z_k + \tfrac{1}{\theta_k}(x_{k+1} - y_k)$
    $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
    if $k \equiv 0 \mod \Big\lceil 2\sqrt{\tfrac{1}{\alpha\mu_F}} - 1 \Big\rceil$ then
        $\theta_{k+1} = \theta_0$
        $z_{k+1} = x_{k+1}$
    end if
end for

Issue: the algorithm still depends on $\mu_F$.
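
A minimal Python sketch of the restarted scheme follows, again assuming a scalar L and a hypothetical separable toy problem whose minimiser is known. The restart test `k % period == 0` mirrors the slide, so the first restart happens right after the first iteration and every later cycle has exactly `period` iterations.

```python
import math

def fista_restart(grad_f, prox, x0, L, iters, period):
    """FISTA in which theta and z are reset whenever k is a multiple of `period`."""
    x, z, theta = list(x0), list(x0), 1.0
    for k in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        g = grad_f(y)
        x = prox([yi - gi / L for yi, gi in zip(y, g)], L)
        z = [zi + (xi - yi) / theta for zi, xi, yi in zip(z, x, y)]
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
        if k % period == 0:
            theta, z = 1.0, list(x)  # restart: drop the accumulated momentum
    return x

# Hypothetical toy problem: f(x) = 0.5*sum(a_i*(x_i - b_i)^2), psi = lam*||x||_1.
a, b, lam = [1.0, 0.1], [2.0, -3.0], 0.1
L, mu_F, alpha = max(a), min(a) / max(a), 0.5
grad_f = lambda x: [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]
prox = lambda v, L: [math.copysign(max(abs(vi) - lam / L, 0.0), vi) for vi in v]
F = lambda x: sum(0.5 * ai * (xi - bi) ** 2 + lam * abs(xi) for ai, xi, bi in zip(a, x, b))

period = math.ceil(2 * math.sqrt(1 / (alpha * mu_F)) - 1)
x = fista_restart(grad_f, prox, [0.0, 0.0], L, iters=321, period=period)
gap = F(x) - F([1.9, -2.0])  # [1.9, -2.0] is the closed-form minimiser here
```

With 321 iterations the run contains 40 full cycles, so the proposition predicts a gap of at most α⁴⁰ times the initial one. The value of `mu_F` is passed in by hand, which is exactly the dependence the slide flags as the issue.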

Methods when µ_F is not known

• Dual APG with adaptive restart [Nesterov]
  1. Start with $x_0$ and an estimate $\mu$ of $\mu_F$.
  2. Perform periodic restart as if $\mu$ were smaller than $\mu_F$.
  3. If the “gradient” is not small enough at the time of restart, decrease $\mu$ and go back to step 1.
  → Annoying transient phase (go back to $x_0$)
• Heuristic adaptive restart [O’Donoghue & Candès]
  - If $F(x_{k+1}) > F(x_k)$, then restart
  → Works well in practice but no guarantee
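
The function-value heuristic can be sketched as follows. This is one possible variant and not necessarily the scheme of O'Donoghue & Candès: as an assumption of the sketch, a step that increases F is discarded and the method restarts from x_k, which keeps F(x_k) non-increasing. The toy problem and all names are hypothetical.

```python
import math

def fista_adaptive(F, grad_f, prox, x0, L, iters):
    """FISTA with a function-value restart test (heuristic, no rate guarantee)."""
    x, z, theta = list(x0), list(x0), 1.0
    restarts = 0
    for _ in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        g = grad_f(y)
        x_new = prox([yi - gi / L for yi, gi in zip(y, g)], L)
        if F(x_new) > F(x):
            # Assumption of this sketch: reject the step and restart from x_k.
            theta, z = 1.0, list(x)
            restarts += 1
            continue
        z = [zi + (xn - yi) / theta for zi, xn, yi in zip(z, x_new, y)]
        x = x_new
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return x, restarts

# Hypothetical toy problem: f(x) = 0.5*sum(a_i*(x_i - b_i)^2), psi = lam*||x||_1.
a, b, lam = [1.0, 0.1], [2.0, -3.0], 0.1
grad_f = lambda x: [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]
prox = lambda v, L: [math.copysign(max(abs(vi) - lam / L, 0.0), vi) for vi in v]
F = lambda x: sum(0.5 * ai * (xi - bi) ** 2 + lam * abs(xi) for ai, xi, bi in zip(a, x, b))

x0 = [0.0, 0.0]
x, restarts = fista_adaptive(F, grad_f, prox, x0, L=max(a), iters=1000)
gap = F(x) - F([1.9, -2.0])  # [1.9, -2.0] is the closed-form minimiser here
```

No estimate of µ_F appears anywhere in the code, which is the appeal of the heuristic; the price is one extra evaluation of F per iteration.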

Our goal

• Perform periodic restart with an arbitrary frequency
• Show convergence at a linear rate
• Result for FISTA, APG and APPROX

Complexity without restart

Proposition
The iterates of FISTA and APG satisfy for all $k \ge 1$,
$$\frac{1}{\theta_{k-1}^2}\,(F(x_k) - F^*) + \frac{1}{2}\,\mathrm{dist}_L(z_k, X^*)^2 \le \frac{1}{2}\,\mathrm{dist}_L(x_0, X^*)^2$$
$$\frac{1}{2}\,\mathrm{dist}_L(x_k, X^*)^2 \le \frac{1}{2}\,\mathrm{dist}_L(x_0, X^*)^2$$

→ The first inequality is a direct consequence of classical results, using $\mathrm{dist}_L(z_k, X^*) \le \|z_k - x^*\|_L$
→ The second is a stability result

Unconditional restarting

Theorem (Restarting for FISTA and APG)
Let $(x_k, z_k)$ be the iterates of FISTA or APG. Let $\sigma \in [0, 1]$ and $\bar x_k = (1 - \sigma) x_k + \sigma z_k$. We have, for $\mu_F = \mu_F(\mathrm{dist}_L(x_0, X^*))$,
$$\frac{1}{2}\,\mathrm{dist}_L(\bar x_k, X^*)^2 \le \frac{1}{2}\,\max\Big(\sigma,\; 1 - \frac{\sigma\mu_F}{\theta_{k-1}^2}\Big)\,\mathrm{dist}_L(x_0, X^*)^2$$
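
To see why the theorem gives a contraction for any restart period K, one can evaluate the factor max(σ, 1 − σµ_F/θ²_{K−1}) numerically. The sketch below also uses the value of σ at which both terms of the max coincide, σ = 1/(1 + µ_F/θ²_{K−1}); this balancing choice is a derivation added here for illustration and is not stated on the slide, and the function names are hypothetical.

```python
import math

def theta_after(k):
    """Return theta_k for the recursion of the slides, started at theta_0 = 1."""
    theta = 1.0
    for _ in range(k):
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return theta

def contraction(sigma, mu_F, K):
    """Factor on dist_L^2 in the theorem: max(sigma, 1 - sigma*mu_F/theta_{K-1}^2)."""
    c = mu_F / theta_after(K - 1) ** 2
    return max(sigma, 1 - sigma * c)

def balanced_sigma(mu_F, K):
    """Value of sigma at which both terms of the max are equal."""
    c = mu_F / theta_after(K - 1) ** 2
    return 1 / (1 + c)

mu_F = 0.01  # sample value, chosen only for illustration
for K in (1, 5, 20, 80):
    s = balanced_sigma(mu_F, K)
    print(K, s, contraction(s, mu_F, K))
```

For every K ≥ 1 the balanced factor is strictly below 1, whereas the endpoints σ = 0 (restart at x_k) and σ = 1 (restart at z_k) only give the trivial factor 1 in this bound.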

Proof

$$
\begin{aligned}
\tfrac{1}{2}\,\mathrm{dist}_L(\bar x_k, X^*)^2
&\le \tfrac{1-\sigma}{2}\,\mathrm{dist}_L(x_k, X^*)^2 + \tfrac{\sigma}{2}\,\mathrm{dist}_L(z_k, X^*)^2 \\
&= \Big(1 - \sigma - \tfrac{\sigma\mu_F}{\theta_{k-1}^2}\Big)\tfrac{1}{2}\,\mathrm{dist}_L(x_k, X^*)^2
 + \sigma\Big(\tfrac{1}{\theta_{k-1}^2}\,\tfrac{\mu_F}{2}\,\mathrm{dist}_L(x_k, X^*)^2 + \tfrac{1}{2}\,\mathrm{dist}_L(z_k, X^*)^2\Big) \\
&\le \max\Big(0,\; 1 - \sigma - \tfrac{\sigma\mu_F}{\theta_{k-1}^2}\Big)\tfrac{1}{2}\,\mathrm{dist}_L(x_k, X^*)^2
 + \sigma\Big(\tfrac{1}{\theta_{k-1}^2}\,(F(x_k) - F^*) + \tfrac{1}{2}\,\mathrm{dist}_L(z_k, X^*)^2\Big) \\
&\le \max\Big(0,\; 1 - \sigma - \tfrac{\sigma\mu_F}{\theta_{k-1}^2}\Big)\tfrac{1}{2}\,\mathrm{dist}_L(x_0, X^*)^2
 + \tfrac{\sigma}{2}\,\mathrm{dist}_L(x_0, X^*)^2 \\
&= \max\Big(\sigma,\; 1 - \tfrac{\sigma\mu_F}{\theta_{k-1}^2}\Big)\tfrac{1}{2}\,\mathrm{dist}_L(x_0, X^*)^2
\end{aligned}
$$

Justification of each step:
1. Definition of $\bar x_k = (1 - \sigma) x_k + \sigma z_k$, together with convexity of the squared distance to the convex set $X^*$
2. Rearrange
3. $\max(0, x) \ge x$ and local error bound
4. Complexity of FISTA/APG + stability
5. $\max(0, a) + \sigma = \max(\sigma, a + \sigma)$

Number of iterations to reach $F(x_k) - F(x^*) \le 10^{-10}$

Problem: $\min_{x \in \mathbb{R}^N} \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$, $N = 4$ (iris dataset)

µ_est                            1      0.1    0.01   1e-3   1e-4   1e-5   1e-6   1e-8
Dual APG with adaptive restart   447    398    265    162    163    163    163    156
FISTA-µ                          751    352    170    173    264    291    277    277
FISTA restarted:
  at x, Proposition              751    687    297    160    198    278    278    278
  at x̄, Theorem                  633    274    168    211    278    278    278    278
  if F(x_{k+1}) > F(x_k)         121 (single value, independent of µ_est)
APG-µ                            751    351    340    882    2580   7453   >1e4   >1e4
APG restarted:
  at x, Proposition              751    684    297    189    311    894    1471   4488
  at x̄, Theorem                  632    275    173    281    794    1310   3977   >1e4
  if F(x_{k+1}) > F(x_k)         >1e4 (single value, independent of µ_est)

Reference: 751: Proximal gradient; >1e4: APG

Restarting accelerated coordinate descent

Expected separable overapproximation (with $\mathbb{E}[|\hat S|] = \tau$):
$$\mathbb{E}\big[f(x + h_{[\hat S]})\big] \le f(x) + \frac{\tau}{n}\Big(\langle \nabla f(x), h \rangle + \frac{1}{2}\|h\|_v^2\Big)$$

Choose $x_0 \in \mathrm{dom}\,\psi$. Set $\theta_0 = \tfrac{\tau}{n}$ and $z_0 = x_0$.
for $k \ge 0$ do
    $y_k = (1 - \theta_k) x_k + \theta_k z_k$
    Randomly generate $S_k \sim \hat S$
    for $i \in S_k$ do
        $z^i_{k+1} = \arg\min_{z \in \mathbb{R}^{n_i}} \big\{ \langle \nabla_i f(y_k), z - y^i_k \rangle + \tfrac{\theta_k n v_i}{2\tau}|z - z^i_k|^2 + \psi_i(z) \big\}$
    end for
    $x_{k+1} = y_k + \tfrac{n}{\tau}\theta_k (z_{k+1} - z_k)$
    $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
end for

Complexity of APPROX without restart

$$\Delta(x) = \frac{1 - \theta_0}{\theta_0^2}\,(F(x) - F^*) + \frac{1}{2\theta_0^2}\,\mathrm{dist}_v(x, X^*)^2$$

Proposition
The iterates of APPROX satisfy for all $k \ge 1$,
$$\mathbb{E}\Big[\frac{1}{\theta_{k-1}^2}\,(F(x_k) - F^*) + \frac{1}{2\theta_0^2}\,\mathrm{dist}_v(z_k, X^*)^2\Big] \le \Delta(x_0)$$
$$\mathbb{E}[\Delta(x_k)] \le \Delta(x_0) - \sum_{i=0}^{k} \gamma_k^i\,\frac{\mathbb{E}[F(x_i) - F^*]}{\theta_{i-1}^2} + \frac{1 - \theta_0}{\theta_0^2}\,\mathbb{E}[F(x_k) - F^*]$$
where $\gamma_k^i \ge 0$, $\sum_i \gamma_k^i = 1$ and $x_k = \sum_i \gamma_k^i z_i$.