Safe screening for the generalized conditional gradient method

  1. Safe screening for the generalized conditional gradient method
     Yifan Sun, joint work with Francis Bach, INRIA Paris, March 2020

  2. Problem statement
     minimize_x  f(x) + φ(κ(x))
     f : R^n → R  convex and smooth
     κ : R^n → R_+  gauge function (norm-like function that promotes sparsity)
     φ : R_+ → R_+  convex, monotonically nondecreasing
     Examples
       min_x  Σ_i log(1 + exp(−b_i a_i^T x)) + ‖x‖_1^2              sparse logistic regression
       min_x  ½ x^T Q x − 1^T x   s.t.  0 ≤ x ≤ C                   support vector machine
       min_x  L(Kx − b)   s.t.  Σ_{i=2}^n |x_i − x_{i−1}| ≤ ε        image denoising
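As a concrete illustration of the template f(x) + φ(κ(x)) (not from the slides), the sketch below builds a sparse-logistic-regression-style objective with φ(ξ) = ξ^2 and κ = ‖·‖_1; the data A, b and all names are placeholders of my own choosing.

    import numpy as np

    def f(x, A, b):
        # smooth loss: logistic loss with labels b in {-1, +1} and features A
        return np.log1p(np.exp(-b * (A @ x))).sum()

    def h(x):
        # sparsity term phi(kappa(x)) with phi(t) = t^2 and kappa = l1 norm
        return np.abs(x).sum() ** 2

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    b = np.sign(rng.standard_normal(50))
    x = rng.standard_normal(10)
    print(f(x, A, b) + h(x))   # value of the composite objective at x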

  3. Generalized conditional gradient method (gCGM)
     minimize_x  f(x) + h(x),   where h(x) := φ(κ(x))
     Scheme
       s^(t) = argmin_s ∇f(x^(t))^T s + h(s)                      generalized LMO
       x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)                   convex merging
     If h(x) constrains x ∈ P → vanilla CGM
     Easy LMO for many sparse norms (1-norm, nuclear norm, group norm, ...)
     CGM: Frank & Wolfe '56, Dunn & Harshbarger '78, Clarkson '10, Lacoste-Julien & Jaggi '13, ...
     gCGM: Yu et al. '17, Harchaoui et al. '15, Bredies and Lorenz '08, Bach '12, ...
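A minimal sketch of the gCGM loop (my own rendering of the scheme above; grad_f and generalized_lmo are assumed callables, and the 2/(t+2) step size is the standard open-loop choice, not necessarily the one used in the talk):

    import numpy as np

    def gcgm(grad_f, generalized_lmo, x0, iters=200):
        # generalized_lmo(g) should return argmin_s  g^T s + h(s)   (generalized LMO)
        x = np.asarray(x0, dtype=float).copy()
        for t in range(iters):
            s = generalized_lmo(grad_f(x))      # generalized LMO step
            theta = 2.0 / (t + 2.0)             # open-loop step size (assumed)
            x = (1.0 - theta) * x + theta * s   # convex merging
        return x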

  4. Goals
     min_x  f(x) + φ(κ(x))        (f: loss,  φ(κ(·)): sparsity)
     Screening: given x^(t) → x*, where x* is a sparse combination of essential atoms, can I omit nonessential atoms based on x^(t)?
     Method: is gCGM well-defined? What kinds of φ can we use?
     Support recovery of screened method: is there a finite time t when we can exactly recover the support of x* from x^(t)?

  6. Generalized sparsity
     Atoms: P = {p_1, ..., p_m} ⊂ R^n (e.g. vertices of a polytope, basis vectors, eigenvectors)
     Gauge function: κ_P(x) is the optimal value of
       minimize_{c_i ≥ 0}  Σ_{i=1}^m c_i   s.t.   Σ_{i=1}^m c_i p_i = x
     Includes norms, seminorms, convex cone indicators; positively homogeneous, subadditive, not necessarily symmetric
     Examples
       P = columns of an invertible matrix P, then κ_P(x) = ‖P^{-1} x‖_1
       P = {β e_1, ..., β e_n}, then κ_P(x) = (1/β) ‖x‖_1 + δ_+(x)
       P = {±(e_2 − e_1), ..., ±(e_n − e_{n−1})}, then κ_P(x) = Σ_{i=2}^n |x_i − x_{i−1}|   (smoothing)
     Freund '87, Chandrasekaran '12, Friedlander '13
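Since κ_P(x) is defined by a small linear program, it can be evaluated directly; a sketch with scipy is below (illustrative only: the atom matrix, the function name, and the ±e_i sanity check are my own choices).

    import numpy as np
    from scipy.optimize import linprog

    def gauge(atoms, x):
        # atoms: (n, m) matrix whose columns are the atoms p_1, ..., p_m.
        # Solves  min sum(c)  s.t.  atoms @ c = x,  c >= 0, which is kappa_P(x).
        n, m = atoms.shape
        res = linprog(c=np.ones(m), A_eq=atoms, b_eq=x,
                      bounds=[(0, None)] * m, method="highs")
        return res.fun if res.success else np.inf   # +inf if x is not in the cone of P

    # Sanity check: atoms = {+e_i, -e_i} recover the l1 norm.
    n = 4
    atoms = np.hstack([np.eye(n), -np.eye(n)])
    x = np.array([1.0, -2.0, 0.0, 0.5])
    print(gauge(atoms, x), np.abs(x).sum())   # both ≈ 3.5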

  7. Generalized screening
       minimize_{c_i ≥ 0}  Σ_{i=1}^m c_i   s.t.   Σ_{i=1}^m c_i p_i = x        (1)
     Support (may not be unique): supp_P(x) = {p_i ∈ P : c_i > 0 in (1)}
     Screening: given x ≈ x*, can we safely guarantee some p ∈ P is not in supp_P(x*)?
     Ghaoui et al. '12, Fercoq et al. '15, Xiang and Ramadge '12, Wang et al. '14, Liu et al. '13, Malti and Herzet '16, Ndiaye et al. '15, Bonnefoy et al. '15, Zhou and Zhao '15, ... (many)

  8. Support recovery optimality condition
     minimize_x  f(x) + φ(κ_P(x))
     Property: if p ∈ supp_P(x*) then −∇f(x*)^T p = max_{p' ∈ P} −∇f(x*)^T p'.
     Example (squared penalty):  min_x  f(x) + ½ ‖x‖_1^2
     Then at optimality, with z* := −∇f(x*),
       |z*_i| = ‖x*‖_1   if x*_i ≠ 0
       |z*_i| ≤ ‖x*‖_1   if x*_i = 0.
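A short derivation of the squared-penalty example (my own filling-in of the subdifferential step, not on the slide):

    0 \in \nabla f(x^*) + \partial\big(\tfrac12\|x^*\|_1^2\big)
        = \nabla f(x^*) + \|x^*\|_1\,\partial\|x^*\|_1
    \;\Longrightarrow\;
    z^* := -\nabla f(x^*) \in \|x^*\|_1\,\partial\|x^*\|_1 .

Since the i-th component of ∂‖x*‖_1 is sign(x*_i) when x*_i ≠ 0 and the interval [−1, 1] when x*_i = 0, this gives |z*_i| = ‖x*‖_1 on the support and |z*_i| ≤ ‖x*‖_1 off it.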

  9. Support recovery optimality condition
     minimize_x  f(x) + φ(κ_P(x))
     Property: if p ∈ supp_P(x*) then −∇f(x*)^T p = max_{p' ∈ P} −∇f(x*)^T p'.
     Example (polyhedral constraint):  min_x  f(x)   subject to  ‖Ax‖_1 ≤ c
     Then P = col(A^{-1}) and, writing a_i^T for the rows of A and b_i for the columns of A^{-1},
       a_i^T x* ≠ 0  ⇒  |b_i^T ∇f(x*)| = ‖B ∇f(x*)‖_∞,   where B is the matrix with rows b_i^T.
     Proof: the normal cone condition gives −∇f(x*)^T y ≤ −∇f(x*)^T x* for all y with ‖Ay‖_1 ≤ c.
     At optimality −∇f(x*) = A^T u for some u, so
       max_{‖Ay‖_1 ≤ c} u^T A y = ‖u‖_∞ · c        (attainable),
     which requires u^T A x* = ‖u‖_∞ ‖Ax*‖_1, i.e. entries of Ax* can be nonzero only where |u_i| = ‖u‖_∞.

  10. Observation for gradient screening
      minimize_x  f(x) + φ(‖x‖_1)
      Optimality condition:  |∇f(x*)_i| < ‖∇f(x*)‖_∞  ⇒  x*_i = 0
      If x^(t) ≈ x*, then by smoothness ∇f(x^(t)) ≈ ∇f(x*).
      If ‖∇f(x^(t)) − ∇f(x*)‖_∞ ≤ ε, then
        ‖∇f(x^(t))‖_∞ − |∇f(x^(t))_i| > 2ε  ⇒  x*_i = 0
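The l1 screening rule above is a one-liner in code; a minimal sketch, assuming grad is the gradient at the current iterate and eps is a valid bound on the sup-norm gradient error:

    import numpy as np

    def screen_l1(grad, eps):
        # Returns a boolean mask of coordinates certified to be zero at the
        # optimum (safe to discard): ||grad||_inf - |grad_i| > 2*eps.
        g = np.abs(grad)
        return g.max() - g > 2 * eps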

  11. Observation for gradient screening (generalized)
      minimize_x  f(x) + φ(κ_P(x))
      Optimality condition:  −∇f(x*)^T p < max_{p' ∈ P} −∇f(x*)^T p'  ⇒  p ∉ supp_P(x*)
      Measure of gradient error ("dual gauge"), with P̃ := P ∪ −P:
        σ_P̃(z − z*) := max_{p ∈ P̃} (z − z*)^T p
      If σ_P̃(∇f(x) − ∇f(x*)) ≤ ε, then
        max_{p' ∈ P} |∇f(x)^T p'| − |∇f(x)^T p| > 2ε  ⇒  p ∉ supp_P(x*)
      If I know ε, this condition does not depend on x*!
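The gradient-error measure above is just a support function over the symmetrized atom set; a one-line sketch (atom matrix and names are illustrative):

    import numpy as np

    def dual_gauge(atoms, d):
        # atoms: (n, m) matrix whose columns are the atoms of P.
        # Support function of P ∪ -P evaluated at d, i.e. max over ±atoms of d^T p.
        return np.abs(atoms.T @ d).max()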

  12. Gap bounds gradient error
      Primal-dual pair
        (P)  min_x  Ψ(x) := f(x) + h(x)
        (D)  max_z  Ω(z) := −f*(−z) − h*(z)
      where f*(z) = sup_x x^T z − f(x) is the convex conjugate.
      Gap bounds gradient error:
        gap(x, −∇f(x)) = Ψ(x) − Ω(−∇f(x))
                       ≥ (x − x*)^T (∇f(x) − ∇f(x*))
                       ≥ (1/L) σ_P̃(∇f(x*) − ∇f(x))^2          (smooth f)
      Smoothness: for P̃ = P ∪ −P,
        f(x) − f(y) ≤ ∇f(y)^T (x − y) + (L/2) κ_P̃(x − y)^2    for all x, y.
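To make the gap concrete, the sketch below evaluates gap(x, −∇f(x)) = f(x) + h(x) + f*(∇f(x)) + h*(−∇f(x)) for one simple instance of my own choosing, f(x) = ½‖x − b‖^2 and h(x) = ½‖x‖_1^2, whose conjugates are f*(y) = b^T y + ½‖y‖^2 and h*(y) = ½‖y‖_∞^2:

    import numpy as np

    def primal_dual_gap(x, b):
        grad = x - b                                   # grad f(x)
        f_val = 0.5 * np.dot(x - b, x - b)
        h_val = 0.5 * np.abs(x).sum() ** 2
        f_conj = b @ grad + 0.5 * grad @ grad          # f*(grad f(x))
        h_conj = 0.5 * np.abs(-grad).max() ** 2        # h*(-grad f(x))
        return f_val + h_val + f_conj + h_conj         # Psi(x) - Omega(-grad f(x))

    x = np.zeros(3); b = np.array([1.0, -2.0, 0.5])
    print(primal_dual_gap(x, b))   # nonnegative; shrinks to 0 as x approaches the minimizer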

  13. Theorem (Screening). For any x and any p ∈ P,
        max_{p' ∈ P} −∇f(x)^T (p' − p) > 2 √( L · gap(x, −∇f(x)) )
      implies p ∉ supp_P(x*).
      Theorem (Support recovery). The support is recovered when
        min_{i ≤ t} gap(x^(i), −∇f(x^(i))) < δ^2 / (4L),
      where δ = min_{p ∈ supp_P(x*), p' ∉ supp_P(x*)} −∇f(x*)^T (p − p').
      See also Ghaoui et al. '12, Ndiaye et al. '15, ...
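A minimal sketch of the screening test in the first theorem, assuming the atoms are stored as columns of a matrix, grad is ∇f(x), gap is the duality gap at x, and L is the smoothness constant used above:

    import numpy as np

    def screened_atoms(atoms, grad, gap, L):
        # Returns a boolean mask of atoms certified NOT to be in supp_P(x*),
        # using the test  max_{p'} -grad^T (p' - p) > 2*sqrt(L*gap).
        scores = -(atoms.T @ grad)              # -grad^T p for each atom p
        return scores.max() - scores > 2.0 * np.sqrt(L * gap)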

  14. Goals
      min_x  f(x) + φ(κ(x))        (f: loss,  φ(κ(·)): sparsity)
      Screening: given x^(t) → x*, where x* is a sparse combination of essential atoms, can I omit nonessential atoms based on x^(t)?
      Method: is gCGM well-defined? What kinds of φ can we use?
      Support recovery of screened method: is there a finite time t when we can exactly recover the support of x* from x^(t)?

  16. Conditional gradient method (warm-up)
      minimize_{x ∈ P}  f(x)
      Vanilla CGM (P = {x : ‖x‖_1 ≤ 1})
        s^(t) = argmin_{s ∈ P} ∇f(x^(t))^T s                        (LMO)
              = −sign(∇f(x^(t))_k) e_k,   where |∇f(x^(t))_k| = ‖∇f(x^(t))‖_∞
        x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)
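The l1-ball LMO on this slide is a single coordinate pick; a minimal sketch (function name and radius argument are my own):

    import numpy as np

    def lmo_l1_ball(grad, radius=1.0):
        # Minimizer of grad^T s over ||s||_1 <= radius: a signed, scaled
        # coordinate vector at the largest-magnitude gradient entry.
        k = np.argmax(np.abs(grad))
        s = np.zeros_like(grad)
        s[k] = -radius * np.sign(grad[k])
        return s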

  18. Generalized conditional gradient method
      minimize_x  f(x) + φ(κ_P(x))
        s^(t) = argmin_s ∇f(x^(t))^T s + φ(κ_P(s)) = ξ ŝ            generalized LMO
          ŝ = argmin_{ŝ ∈ P} ∇f(x^(t))^T ŝ                          direction (write ν := ∇f(x^(t))^T ŝ)
          ξ = argmin_{ξ ≥ 0} ξν + φ(ξ)                              magnitude
        x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)                   convex merging
      Computational complexity: ≈ same as vanilla CGM (LMO + 1-D optimization)
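A minimal sketch of the direction/magnitude split (my own rendering; atoms are stored as matrix columns, and magnitude is a user-supplied 1-D solver, e.g. the closed form (φ*)'(ν) from the next slide; here ν is taken with the sign convention ν = −∇f(x)^T ŝ):

    import numpy as np

    def generalized_lmo(atoms, grad, magnitude):
        # Direction: ordinary LMO over the atom set, as in vanilla CGM.
        scores = atoms.T @ grad
        k = np.argmin(scores)
        s_hat = atoms[:, k]                 # argmin_{p in P} grad^T p
        nu = -scores[k]                     # nu = -grad^T s_hat
        # Magnitude: xi = argmin_{xi >= 0}  -xi*nu + phi(xi).
        xi = magnitude(nu)
        return xi * s_hat

For example, with φ(ξ) = ½ξ^2 one would pass magnitude = lambda nu: max(nu, 0.0).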

  20. Computing magnitude
        ξ = argmin_{ξ ≥ 0}  −ξ·ν + φ(ξ)  =  (φ*)'(ν)
      where φ*(ν) = sup_ξ ξ·ν − φ(ξ) is the convex conjugate.
      Example: φ(ξ) = ξ^p / p
        p = 1:    ξ = 0 if ν ≤ 1, +∞ otherwise         (doesn't work for LASSO)
        p = 2:    ξ = ν                                 (not bounded)
        p = +∞:   ξ = 1                                 (vanilla CGM)
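When (φ*)' has no convenient closed form, the magnitude is still just a 1-D convex problem; a numeric fallback sketch (function name and the bound on ξ are my own choices):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def magnitude(nu, phi, upper=1e6):
        # Numerically solve  xi = argmin_{xi >= 0}  -xi*nu + phi(xi).
        res = minimize_scalar(lambda xi: -xi * nu + phi(xi),
                              bounds=(0.0, upper), method="bounded")
        return res.x

    # phi(xi) = 0.5*xi^2 gives xi ≈ nu, matching the closed form (phi*)'(nu) = nu.
    print(magnitude(3.0, lambda xi: 0.5 * xi ** 2))   # ≈ 3.0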

  21. Assumptions on φ
      minimize_x  f(x) + φ(κ_P(x))
      Assumption: the function φ : R_+ → R_+ is monotonically nondecreasing over all ξ ≥ 0, with
        subdifferentials not upper-bounded:  sup{ν : ν ∈ ∂φ(ξ)} → +∞ as ξ → +∞      (a finite ξ always exists)
        a quadratic lower bound:  φ(ξ) ≥ μ_φ ξ^2 − φ_0  for some μ_φ > 0 and φ_0 ∈ R   (ξ doesn't grow too fast)
      For example, φ(ξ) = ½ξ^2 satisfies both (μ_φ = ½, φ_0 = 0), while φ(ξ) = ξ violates the quadratic lower bound.
