Katalyst: Boosting Convex Katyusha for Non-Convex Problems with a Large Condition Number


  1. Katalyst: Boosting Convex Katyusha for Non-Convex Problems with a Large Condition Number
     Zaiyi Chen, Yi Xu, Haoyuan Hu, Tianbao Yang
     zaiyi.czy@alibaba-inc.com
     2019-06-10

  2. Overview
     1. Introduction
     2. Katalyst Algorithm and Theoretical Guarantee
     3. Experiments

  3. Problem Definition
     min_{x ∈ R^d} φ(x) := (1/n) Σ_{i=1}^n f_i(x) + ψ(x)    (1)
     We can obtain a better gradient complexity w.r.t. the sample size n and the accuracy ε via variance-reduced methods (Johnson & Zhang, 2013) (SVRG-type).
     We name the proposed algorithm Katalyst after Katyusha (Allen-Zhu, 2017) and Catalyst (Lin et al., 2015).
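A minimal Python sketch of the composite structure in (1). The synthetic data, the squared hinge loss f_i, and the ℓ1 choice of ψ below are illustrative assumptions only, not the setting used in the paper's experiments:

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
A = rng.standard_normal((n, d))        # rows are the feature vectors a_i
b = rng.choice([-1.0, 1.0], size=n)    # labels b_i

def f_i(x, i):
    # Smooth component: squared hinge loss on example i.
    return 0.5 * max(0.0, 1.0 - b[i] * (A[i] @ x)) ** 2

def psi(x, lam=1e-2):
    # Convex, possibly non-smooth regularizer (here the l1 norm).
    return lam * np.abs(x).sum()

def phi(x):
    # Composite objective (1): phi(x) = (1/n) sum_i f_i(x) + psi(x).
    return np.mean([f_i(x, i) for i in range(n)]) + psi(x)

print(phi(np.zeros(d)))    # 0.5 at x = 0: every margin term equals max(0, 1)^2 / 2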


  5. Assumptions
     {f_i} are L-smooth. ψ can be non-smooth but convex. φ is µ-weakly convex.
     Definition 1 (L-smoothness). A function f is Lipschitz smooth with constant L if its gradient is Lipschitz continuous with constant L, that is,
       ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,  ∀ x, y ∈ R^d.
     Definition 2 (Weak convexity). A function φ is µ-weakly convex if φ(x) + (µ/2)‖x‖² is convex.
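A small numerical illustration of Definition 2 (not from the slides): the 1-smooth, non-convex function f(x) = −cos(x) is 1-weakly convex, since g(x) = f(x) + (µ/2)x² with µ = 1 has g''(x) = cos(x) + 1 ≥ 0 everywhere.

import numpy as np

xs = np.linspace(-10.0, 10.0, 100001)
g_second = np.cos(xs) + 1.0          # second derivative of g(x) = -cos(x) + x^2 / 2
print(bool(g_second.min() >= 0.0))   # True: the regularized function is convex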

  6. Comparisons with Related Work
     Table 1: Comparison of gradient complexities of variance-reduction-based algorithms for finding an ε-stationary point of (1). * marks a result that is only valid when L/µ ≤ √n.

     Algorithm                             | L/µ ≥ Ω(n)               | L/µ ≤ O(n)                         | Non-smooth ψ
     SAGA (Reddi et al., 2016)             | O(n^{2/3} L / ε²)        | O(n^{2/3} L / ε²)                  | Yes
     RapGrad (Lan & Yang, 2018)            | O(√(nLµ) / ε²)           | O((µn + √(nLµ)) / ε²)              | indicator function
     SVRG (Reddi et al., 2016)             | O(n^{2/3} L / ε²)        | O(n^{2/3} L / ε²)                  | Yes
     Natasha1 (Allen-Zhu, 2017)            | NA                       | O(n^{2/3} L^{2/3} µ^{1/3} / ε²)*   | Yes
     RepeatSVRG (Allen-Zhu, 2017)          | O(n^{3/4} √(Lµ) / ε²)    | O((µn + n^{3/4} √(Lµ)) / ε²)       | Yes
     4WD-Catalyst (Paquette et al., 2018)  | O(nL / ε²)               | O(nL / ε²)                         | Yes
     SPIDER (Fang et al., 2018)            | O(n^{1/2} L / ε²)        | O(n^{1/2} L / ε²)                  | No
     SNVRG (Zhou et al., 2018)             | O(n^{1/2} L / ε²)        | O(n^{1/2} L / ε²)                  | No
     Katalyst (this work)                  | O(√(nLµ) / ε²)           | O((µn + L) / ε²)                   | Yes

     Our bound is proved optimal up to a logarithmic factor by a recent work (Zhou & Gu, 2019).

  7. Overview
     1. Introduction
     2. Katalyst Algorithm and Theoretical Guarantee
     3. Experiments

  8. Interpretation - Our Basic Idea
     [Figure: one-dimensional illustration of the first stage, moving from x_0 to x_1. Step 1]

  9. Interpretation - Our Basic Idea
     [Figure: one-dimensional illustration of a later stage, moving from x_1 to x_2. Step > 1]

  10. A Unified Framework - Meta Algorithm
      Algorithm 1: Stagewise-SA(x_0, {η_s}, µ, {w_s})
      Input: a non-increasing sequence {w_s}, x_0 ∈ dom(ψ), γ = (2µ)^{-1}
      1: for s = 1, ..., S do
      2:   f_s(·) = φ(·) + (1/(2γ)) ‖· − x_{s−1}‖²
      3:   x_s = Katyusha(f_s, x_{s−1}, K_s, µ, L + µ)   // x_s is usually an averaged solution
      4: end for
      Output: x_τ, where τ is randomly chosen from {0, ..., S} according to the probabilities p_τ = w_{τ+1} / Σ_{k=0}^{S} w_{k+1}, τ = 0, ..., S
      The subproblem decomposes as
        f_s(x) = (1/n) Σ_{i=1}^n [ f_i(x) + (µ/2) ‖x − x_{s−1}‖² ] + ψ(x) + ((γ^{-1} − µ)/2) ‖x − x_{s−1}‖²,
      where the bracketed term defines f̂_i(x) and the remaining terms define ψ̂(x).
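A minimal Python sketch of the outer loop of Algorithm 1, assuming per-component gradient and proximal oracles and an inner solver with the hypothetical signature katyusha(grad_hat_f_i, prox_hat_psi, x_init, K, sigma, L_hat, n, rng); a fixed epoch count K_s is used for simplicity. This is a structural illustration, not the authors' implementation:

import numpy as np

def stagewise_sa(grad_f_i, prox_psi, x0, mu, L, S, K_s, alpha, n, katyusha, rng):
    # Outer loop of Algorithm 1 (Stagewise-SA); gamma = (2*mu)^{-1}.
    gamma = 1.0 / (2.0 * mu)
    rho = 1.0 / gamma - mu                 # extra strong convexity moved into psi-hat
    xs = [np.asarray(x0, dtype=float)]     # x_0, x_1, ..., x_S
    for s in range(1, S + 1):
        center = xs[-1]                    # x_{s-1}
        # f_s(x) = phi(x) + (1/(2*gamma)) ||x - x_{s-1}||^2, split into
        #   hat_f_i(x)  = f_i(x)  + (mu/2)  * ||x - center||^2   (smooth part)
        #   hat_psi(x)  = psi(x)  + (rho/2) * ||x - center||^2   (prox-friendly part)
        grad_hat_f_i = lambda x, i, c=center: grad_f_i(x, i) + mu * (x - c)
        # Standard prox identity for psi plus a quadratic centered at c.
        prox_hat_psi = lambda v, eta, c=center: prox_psi(
            (v + eta * rho * c) / (1.0 + eta * rho), eta / (1.0 + eta * rho))
        xs.append(katyusha(grad_hat_f_i, prox_hat_psi, center, K_s, mu, L + mu, n, rng))
    # Output: x_tau with tau drawn from p_tau proportional to w_{tau+1} = (tau+1)^alpha.
    p = np.array([(t + 1.0) ** alpha for t in range(S + 1)])
    return xs[rng.choice(S + 1, p=p / p.sum())]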

  11. Algorithm
      Algorithm 2: Katyusha(f, x_0, K, σ, L̂)
      Initialize: τ_2 = 1/2, τ_1 = min{ √(nσ/(3L̂)), 1/2 }, η = 1/(3 τ_1 L̂), θ = 1 + ησ,
                  m = ⌈ log(2τ_1 + 2/θ − 1) / log θ ⌉ + 1, y_0 = ζ_0 = x̃_0 ← x_0
      1: for k = 0, ..., K − 1 do
      2:   u_k = ∇f̂(x̃_k)
      3:   for t = 0, ..., m − 1 do
      4:     j = km + t
      5:     x_{j+1} = τ_1 ζ_j + τ_2 x̃_k + (1 − τ_1 − τ_2) y_j
      6:     ∇̃_{j+1} = u_k + ∇f̂_i(x_{j+1}) − ∇f̂_i(x̃_k), where i is sampled uniformly from {1, ..., n}
      7:     ζ_{j+1} = argmin_ζ { (1/(2η)) ‖ζ − ζ_j‖² + ⟨∇̃_{j+1}, ζ⟩ + ψ̂(ζ) }
      8:     y_{j+1} = argmin_y { (3L̂/2) ‖y − x_{j+1}‖² + ⟨∇̃_{j+1}, y⟩ + ψ̂(y) }
      9:   end for
      10:  x̃_{k+1} = ( Σ_{t=0}^{m−1} θ^t y_{km+t+1} ) / ( Σ_{t=0}^{m−1} θ^t )
      11: end for
      Output: x̃_K
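A correspondingly minimal sketch of Algorithm 2, using the same hypothetical gradient and prox oracles as the Stagewise-SA sketch above; it is an illustration of the structure, not the authors' implementation:

import numpy as np

def katyusha(grad_hat_f_i, prox_hat_psi, x0, K, sigma, L_hat, n, rng):
    # Inner solver (Algorithm 2) for hat_f(x) = (1/n) sum_i hat_f_i(x) + hat_psi(x).
    tau2 = 0.5
    tau1 = min(np.sqrt(n * sigma / (3.0 * L_hat)), 0.5)
    eta = 1.0 / (3.0 * tau1 * L_hat)
    theta = 1.0 + eta * sigma
    m = int(np.ceil(np.log(2.0 * tau1 + 2.0 / theta - 1.0) / np.log(theta))) + 1
    y = zeta = x_tilde = np.asarray(x0, dtype=float)
    for k in range(K):
        # Full gradient at the snapshot point x_tilde (variance-reduction anchor).
        u = np.mean([grad_hat_f_i(x_tilde, i) for i in range(n)], axis=0)
        y_epoch = []
        for t in range(m):
            i = rng.integers(n)                                    # sampled component
            x = tau1 * zeta + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            g = u + grad_hat_f_i(x, i) - grad_hat_f_i(x_tilde, i)  # variance-reduced gradient
            zeta = prox_hat_psi(zeta - eta * g, eta)               # mirror-descent step
            y = prox_hat_psi(x - g / (3.0 * L_hat), 1.0 / (3.0 * L_hat))  # gradient step
            y_epoch.append(y)
        weights = theta ** np.arange(m)
        x_tilde = weights @ np.asarray(y_epoch) / weights.sum()    # theta-weighted average
    return x_tilde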

  12. Theory
      Theorem 3. Let w_s = s^α with α > 0, γ = (2µ)^{-1}, L̂ = L + µ, σ = µ, and in each call of Katyusha let
      τ_1 = min{ √(nσ/(3L̂)), 1/2 }, step size η = 1/(3 τ_1 L̂), τ_2 = 1/2, θ = 1 + ησ, K_s = ⌈ log(D_s) / (m log θ) ⌉, and
      m = ⌈ log(2τ_1 + 2/θ − 1) / log θ ⌉ + 1, where D_s = max{ 4L̂/µ, L̂³/µ³, L² s/µ² }. Then
        max{ E[‖∇φ_γ(x_{τ+1})‖²], E[L² ‖x_{τ+1} − z_{τ+1}‖²] } ≤ 34 µ Δ_φ (α + 1) / (S + 1) + 98 µ Δ_φ (α + 1) / (S + 1)^α · I_{α<1},
      where z = prox_{γφ}(x) and τ is randomly chosen from {0, ..., S} according to the probabilities p_τ = w_{τ+1} / Σ_{k=0}^{S} w_{k+1}, τ = 0, ..., S.
      Furthermore, the total gradient complexity for finding x_{τ+1} such that max( E[‖∇φ_γ(x_{τ+1})‖²], L² E[‖x_{τ+1} − z_{τ+1}‖²] ) ≤ ε² is
        N(ε) = O( (µn + √(nµL)) / ε² · log( L/(µε) ) ),   if n ≥ 3L/(4µ),
        N(ε) = O( √(nLµ) / ε² · log( L/(µε) ) ),          if n ≤ 3L/(4µ).
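As a rough worked example of the two regimes, the sketch below evaluates the reconstructed bound above with all constants in the O(·) dropped, so the numbers only indicate scaling and are not quantitative guarantees:

import numpy as np

def katalyst_complexity(n, L, mu, eps):
    # Theorem 3's two-regime bound with O(.) constants dropped; scaling only.
    log_factor = np.log(L / (mu * eps))
    if n >= 3.0 * L / (4.0 * mu):          # L/mu = O(n): well-conditioned regime
        return (mu * n + np.sqrt(n * mu * L)) / eps**2 * log_factor
    return np.sqrt(n * L * mu) / eps**2 * log_factor  # L/mu = Omega(n): large condition number

print(f"{katalyst_complexity(n=1e5, L=1.0, mu=1e-3, eps=1e-2):.3e}")  # n >= 3L/(4 mu)
print(f"{katalyst_complexity(n=1e3, L=1.0, mu=1e-6, eps=1e-2):.3e}")  # n <= 3L/(4 mu)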

  13. Theory
      Theorem 4. Suppose ψ = 0. Use the same parameter values as in Theorem 3 except that K = ⌈ log(D) / (m log θ) ⌉, where D = max{ 48L̂/µ, 2L̂³/µ³ }. Then the total gradient complexity for finding x_{τ+1} such that E[‖∇φ(x_{τ+1})‖²] ≤ ε² is
        N(ε) = O( (µn + √(nµL)) / ε² · log( L/µ ) ),   if n ≥ 3L/(4µ),
        N(ε) = O( √(nLµ) / ε² · log( L/µ ) ),          if n ≤ 3L/(4µ).

  14. Overview
      1. Introduction
      2. Katalyst Algorithm and Theoretical Guarantee
      3. Experiments

  15. Experiments I
      Squared hinge loss + (log-sum penalty (LSP) / transformed ℓ1 penalty (TL1)).
      [Figure 1: Comparison of different algorithms for the two tasks on the rcv1 and realsim datasets, with λ = 1/n (top row) and λ = 0.1/n (bottom row). Each panel plots log10(objective) against the number of gradients/n for Katalyst, proxSVRG, proxSVRG-mb, and 4WD-Catalyst.]
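A small sketch of the loss and the two non-convex penalties named on this slide. The parameterizations of LSP and TL1 below (θ, a) are common forms assumed for illustration, not necessarily the exact constants used in the experiments:

import numpy as np

def squared_hinge(A, b, x):
    # (1/n) sum_i max(0, 1 - b_i * a_i^T x)^2 / 2
    margins = np.maximum(0.0, 1.0 - b * (A @ x))
    return 0.5 * np.mean(margins ** 2)

def lsp(x, lam, theta=1.0):
    # Log-sum penalty: lam * sum_j log(1 + |x_j| / theta)
    return lam * np.sum(np.log1p(np.abs(x) / theta))

def tl1(x, lam, a=1.0):
    # Transformed l1 penalty: lam * sum_j (a + 1)|x_j| / (a + |x_j|)
    return lam * np.sum((a + 1.0) * np.abs(x) / (a + np.abs(x)))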

  16. Experiments II
      We use the smoothed SCAD penalty given in (Lan & Yang, 2018):
        R_{λ,γ,ε}(x) = λ (x² + ε)^{1/2},                                        if (x² + ε)^{1/2} ≤ λ,
        R_{λ,γ,ε}(x) = ( 2γλ (x² + ε)^{1/2} − (x² + ε) − λ² ) / ( 2(γ − 1) ),   if λ < (x² + ε)^{1/2} < γλ,
        R_{λ,γ,ε}(x) = λ² (γ + 1) / 2,                                          otherwise,
      where γ > 2, λ > 0, and ε > 0. Then the problem is
        min_{x ∈ R^d} φ(x) := (1/n) Σ_{i=1}^n (1/2)(a_i^⊤ x − b_i)² + ρ Σ_{i=1}^d R_{λ,γ,ε}(x_i).
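A direct transcription of the smoothed SCAD penalty and the resulting least-squares objective above, vectorized over coordinates; intended only as a sketch of the experimental objective:

import numpy as np

def smoothed_scad(x, lam, gamma, eps):
    # Elementwise R_{lam, gamma, eps}(x) from (Lan & Yang, 2018), as written above.
    u = np.sqrt(x ** 2 + eps)                                   # (x^2 + eps)^{1/2}
    small = lam * u                                             # u <= lam
    mid = (2 * gamma * lam * u - (x ** 2 + eps) - lam ** 2) / (2 * (gamma - 1))
    large = lam ** 2 * (gamma + 1) / 2                          # u >= gamma * lam
    return np.where(u <= lam, small, np.where(u < gamma * lam, mid, large))

def objective(A, b, x, rho, lam, gamma, eps):
    # phi(x) = (1/n) sum_i (a_i^T x - b_i)^2 / 2 + rho * sum_j R_{lam,gamma,eps}(x_j)
    r = A @ x - b
    return 0.5 * np.mean(r ** 2) + rho * np.sum(smoothed_scad(x, lam, gamma, eps))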
