Safe screening for the generalized conditional gradient method

  1. Safe screening for the generalized conditional gradient method
     Yifan Sun, joint work with Francis Bach, INRIA Paris, March 2020

  2. Problem statement
     minimize_x  f(x) + φ(κ(x))
     f : R^n → R  convex and smooth
     κ : R^n → R_+  gauge function (norm-like function that promotes sparsity)
     φ : R_+ → R_+  convex, monotonically nondecreasing
     Examples
       min_x  Σ_i log(1 + exp(−b_i a_i^T x)) + ‖x‖_1^2              sparse logistic regression
       min_x  ½ x^T Q x − 1^T x   s.t.  0 ≤ x ≤ C                   support vector machine
       min_x  L(Kx − b)   s.t.  Σ_{i=2}^n |x_i − x_{i−1}| ≤ ε        image denoising
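As a concrete illustration of the template f(x) + φ(κ(x)) (not from the slides), the sketch below builds a sparse-logistic-regression-style objective with φ(ξ) = ξ^2 and κ = ‖·‖_1; the data A, b and all names are placeholders of my own choosing.

    import numpy as np

    def f(x, A, b):
        # smooth loss: logistic loss with labels b in {-1, +1} and features A
        return np.log1p(np.exp(-b * (A @ x))).sum()

    def h(x):
        # sparsity term phi(kappa(x)) with phi(t) = t^2 and kappa = l1 norm
        return np.abs(x).sum() ** 2

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    b = np.sign(rng.standard_normal(50))
    x = rng.standard_normal(10)
    print(f(x, A, b) + h(x))   # value of the composite objective at x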

  3. Generalized conditional gradient method (gCGM)
     minimize_x  f(x) + h(x),   where h(x) := φ(κ(x))
     Scheme
       s^(t) = argmin_s ∇f(x^(t))^T s + h(s)                      generalized LMO
       x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)                   convex merging
     If h(x) constrains x ∈ P → vanilla CGM
     Easy LMO for many sparse norms (1-norm, nuclear norm, group norm, ...)
     CGM: Frank & Wolfe '56, Dunn & Harshbarger '78, Clarkson '10, Lacoste-Julien & Jaggi '13, ...
     gCGM: Yu et al. '17, Harchaoui et al. '15, Bredies and Lorenz '08, Bach '12, ...
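A minimal sketch of the gCGM loop (my own rendering of the scheme above; grad_f and generalized_lmo are assumed callables, and the 2/(t+2) step size is the standard open-loop choice, not necessarily the one used in the talk):

    import numpy as np

    def gcgm(grad_f, generalized_lmo, x0, iters=200):
        # generalized_lmo(g) should return argmin_s  g^T s + h(s)   (generalized LMO)
        x = np.asarray(x0, dtype=float).copy()
        for t in range(iters):
            s = generalized_lmo(grad_f(x))      # generalized LMO step
            theta = 2.0 / (t + 2.0)             # open-loop step size (assumed)
            x = (1.0 - theta) * x + theta * s   # convex merging
        return x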

  4. Goals
     min_x  f(x) + φ(κ(x))        (f: loss,  φ(κ(·)): sparsity)
     Screening: given x^(t) → x*, where x* is a sparse combination of essential atoms, can I omit nonessential atoms based on x^(t)?
     Method: is gCGM well-defined? What kinds of φ can we use?
     Support recovery of screened method: is there a finite time t when we can exactly recover the support of x* from x^(t)?

  6. Generalized sparsity
     Atoms: P = {p_1, ..., p_m} ⊂ R^n (e.g. vertices of a polytope, basis vectors, eigenvectors)
     Gauge function: κ_P(x) is the optimal value of
       minimize_{c_i ≥ 0}  Σ_{i=1}^m c_i   s.t.   Σ_{i=1}^m c_i p_i = x
     Includes norms, seminorms, convex cone indicators; positively homogeneous, subadditive, not necessarily symmetric
     Examples
       P = columns of an invertible matrix P, then κ_P(x) = ‖P^{-1} x‖_1
       P = {β e_1, ..., β e_n}, then κ_P(x) = (1/β) ‖x‖_1 + δ_+(x)
       P = {±(e_2 − e_1), ..., ±(e_n − e_{n−1})}, then κ_P(x) = Σ_{i=2}^n |x_i − x_{i−1}|   (smoothing)
     Freund '87, Chandrasekaran '12, Friedlander '13
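Since κ_P(x) is defined by a small linear program, it can be evaluated directly; a sketch with scipy is below (illustrative only: the atom matrix, the function name, and the ±e_i sanity check are my own choices).

    import numpy as np
    from scipy.optimize import linprog

    def gauge(atoms, x):
        # atoms: (n, m) matrix whose columns are the atoms p_1, ..., p_m.
        # Solves  min sum(c)  s.t.  atoms @ c = x,  c >= 0, which is kappa_P(x).
        n, m = atoms.shape
        res = linprog(c=np.ones(m), A_eq=atoms, b_eq=x,
                      bounds=[(0, None)] * m, method="highs")
        return res.fun if res.success else np.inf   # +inf if x is not in the cone of P

    # Sanity check: atoms = {+e_i, -e_i} recover the l1 norm.
    n = 4
    atoms = np.hstack([np.eye(n), -np.eye(n)])
    x = np.array([1.0, -2.0, 0.0, 0.5])
    print(gauge(atoms, x), np.abs(x).sum())   # both ≈ 3.5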

  7. Generalized screening
       minimize_{c_i ≥ 0}  Σ_{i=1}^m c_i   s.t.   Σ_{i=1}^m c_i p_i = x        (1)
     Support (may not be unique): supp_P(x) = {p_i ∈ P : c_i > 0 in (1)}
     Screening: given x ≈ x*, can we safely guarantee some p ∈ P is not in supp_P(x*)?
     Ghaoui et al. '12, Fercoq et al. '15, Xiang and Ramadge '12, Wang et al. '14, Liu et al. '13, Malti and Herzet '16, Ndiaye et al. '15, Bonnefoy et al. '15, Zhou and Zhao '15, ... (many)

  8. Support recovery optimality condition
     minimize_x  f(x) + φ(κ_P(x))
     Property: if p ∈ supp_P(x*) then −∇f(x*)^T p = max_{p' ∈ P} −∇f(x*)^T p'.
     Example (squared penalty):  min_x  f(x) + ½ ‖x‖_1^2
     Then at optimality, with z* := −∇f(x*),
       |z*_i| = ‖x*‖_1   if x*_i ≠ 0
       |z*_i| ≤ ‖x*‖_1   if x*_i = 0.
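A short derivation of the squared-penalty example (my own filling-in of the subdifferential step, not on the slide):

    0 \in \nabla f(x^*) + \partial\big(\tfrac12\|x^*\|_1^2\big)
        = \nabla f(x^*) + \|x^*\|_1\,\partial\|x^*\|_1
    \;\Longrightarrow\;
    z^* := -\nabla f(x^*) \in \|x^*\|_1\,\partial\|x^*\|_1 .

Since the i-th component of ∂‖x*‖_1 is sign(x*_i) when x*_i ≠ 0 and the interval [−1, 1] when x*_i = 0, this gives |z*_i| = ‖x*‖_1 on the support and |z*_i| ≤ ‖x*‖_1 off it.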

  9. Support recovery optimality condition
     minimize_x  f(x) + φ(κ_P(x))
     Property: if p ∈ supp_P(x*) then −∇f(x*)^T p = max_{p' ∈ P} −∇f(x*)^T p'.
     Example (polyhedral constraint):  min_x  f(x)   subject to  ‖Ax‖_1 ≤ c
     Then P = col(A^{-1}) and, writing a_i^T for the rows of A and b_i for the columns of A^{-1},
       a_i^T x* ≠ 0  ⇒  |b_i^T ∇f(x*)| = ‖B ∇f(x*)‖_∞,   where B is the matrix with rows b_i^T.
     Proof: the normal cone condition gives −∇f(x*)^T y ≤ −∇f(x*)^T x* for all y with ‖Ay‖_1 ≤ c.
     At optimality −∇f(x*) = A^T u for some u, so
       max_{‖Ay‖_1 ≤ c} u^T A y = ‖u‖_∞ · c        (attainable),
     which requires u^T A x* = ‖u‖_∞ ‖Ax*‖_1, i.e. entries of Ax* can be nonzero only where |u_i| = ‖u‖_∞.

  10. Observation for gradient screening
      minimize_x  f(x) + φ(‖x‖_1)
      Optimality condition:  |∇f(x*)_i| < ‖∇f(x*)‖_∞  ⇒  x*_i = 0
      If x^(t) ≈ x*, then by smoothness ∇f(x^(t)) ≈ ∇f(x*).
      If ‖∇f(x^(t)) − ∇f(x*)‖_∞ ≤ ε, then
        ‖∇f(x^(t))‖_∞ − |∇f(x^(t))_i| > 2ε  ⇒  x*_i = 0
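The l1 screening rule above is a one-liner in code; a minimal sketch, assuming grad is the gradient at the current iterate and eps is a valid bound on the sup-norm gradient error:

    import numpy as np

    def screen_l1(grad, eps):
        # Returns a boolean mask of coordinates certified to be zero at the
        # optimum (safe to discard): ||grad||_inf - |grad_i| > 2*eps.
        g = np.abs(grad)
        return g.max() - g > 2 * eps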

  11. Observation for gradient screening (generalized)
      minimize_x  f(x) + φ(κ_P(x))
      Optimality condition:  −∇f(x*)^T p < max_{p' ∈ P} −∇f(x*)^T p'  ⇒  p ∉ supp_P(x*)
      Measure of gradient error ("dual gauge"), with P̃ := P ∪ −P:
        σ_P̃(z − z*) := max_{p ∈ P̃} (z − z*)^T p
      If σ_P̃(∇f(x) − ∇f(x*)) ≤ ε, then
        max_{p' ∈ P} |∇f(x)^T p'| − |∇f(x)^T p| > 2ε  ⇒  p ∉ supp_P(x*)
      If I know ε, this condition does not depend on x*!
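The gradient-error measure above is just a support function over the symmetrized atom set; a one-line sketch (atom matrix and names are illustrative):

    import numpy as np

    def dual_gauge(atoms, d):
        # atoms: (n, m) matrix whose columns are the atoms of P.
        # Support function of P ∪ -P evaluated at d, i.e. max over ±atoms of d^T p.
        return np.abs(atoms.T @ d).max()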

  12. Gap bounds gradient error
      Primal-dual pair
        (P)  min_x  Ψ(x) := f(x) + h(x)
        (D)  max_z  Ω(z) := −f*(−z) − h*(z)
      where f*(z) = sup_x x^T z − f(x) is the convex conjugate.
      Gap bounds gradient error:
        gap(x, −∇f(x)) = Ψ(x) − Ω(−∇f(x))
                       ≥ (x − x*)^T (∇f(x) − ∇f(x*))
                       ≥ (1/L) σ_P̃(∇f(x*) − ∇f(x))^2          (smooth f)
      Smoothness: for P̃ = P ∪ −P,
        f(x) − f(y) ≤ ∇f(y)^T (x − y) + (L/2) κ_P̃(x − y)^2    for all x, y.
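To make the gap concrete, the sketch below evaluates gap(x, −∇f(x)) = f(x) + h(x) + f*(∇f(x)) + h*(−∇f(x)) for one simple instance of my own choosing, f(x) = ½‖x − b‖^2 and h(x) = ½‖x‖_1^2, whose conjugates are f*(y) = b^T y + ½‖y‖^2 and h*(y) = ½‖y‖_∞^2:

    import numpy as np

    def primal_dual_gap(x, b):
        grad = x - b                                   # grad f(x)
        f_val = 0.5 * np.dot(x - b, x - b)
        h_val = 0.5 * np.abs(x).sum() ** 2
        f_conj = b @ grad + 0.5 * grad @ grad          # f*(grad f(x))
        h_conj = 0.5 * np.abs(-grad).max() ** 2        # h*(-grad f(x))
        return f_val + h_val + f_conj + h_conj         # Psi(x) - Omega(-grad f(x))

    x = np.zeros(3); b = np.array([1.0, -2.0, 0.5])
    print(primal_dual_gap(x, b))   # nonnegative; shrinks to 0 as x approaches the minimizer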

  13. Theorem (Screening). For any x and any p ∈ P,
        max_{p' ∈ P} −∇f(x)^T (p' − p) > 2 √( L · gap(x, −∇f(x)) )
      implies p ∉ supp_P(x*).
      Theorem (Support recovery). The support is recovered when
        min_{i ≤ t} gap(x^(i), −∇f(x^(i))) < δ^2 / (4L),
      where δ = min_{p ∈ supp_P(x*), p' ∉ supp_P(x*)} −∇f(x*)^T (p − p').
      See also Ghaoui et al. '12, Ndiaye et al. '15, ...
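A minimal sketch of the screening test in the first theorem, assuming the atoms are stored as columns of a matrix, grad is ∇f(x), gap is the duality gap at x, and L is the smoothness constant used above:

    import numpy as np

    def screened_atoms(atoms, grad, gap, L):
        # Returns a boolean mask of atoms certified NOT to be in supp_P(x*),
        # using the test  max_{p'} -grad^T (p' - p) > 2*sqrt(L*gap).
        scores = -(atoms.T @ grad)              # -grad^T p for each atom p
        return scores.max() - scores > 2.0 * np.sqrt(L * gap)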

  14. Goals
      min_x  f(x) + φ(κ(x))        (f: loss,  φ(κ(·)): sparsity)
      Screening: given x^(t) → x*, where x* is a sparse combination of essential atoms, can I omit nonessential atoms based on x^(t)?
      Method: is gCGM well-defined? What kinds of φ can we use?
      Support recovery of screened method: is there a finite time t when we can exactly recover the support of x* from x^(t)?

  16. Conditional gradient method (warm-up)
      minimize_{x ∈ P}  f(x)
      Vanilla CGM (P = {x : ‖x‖_1 ≤ 1})
        s^(t) = argmin_{s ∈ P} ∇f(x^(t))^T s                        (LMO)
              = −sign(∇f(x^(t))_k) e_k,   where |∇f(x^(t))_k| = ‖∇f(x^(t))‖_∞
        x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)
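The l1-ball LMO on this slide is a single coordinate pick; a minimal sketch (function name and radius argument are my own):

    import numpy as np

    def lmo_l1_ball(grad, radius=1.0):
        # Minimizer of grad^T s over ||s||_1 <= radius: a signed, scaled
        # coordinate vector at the largest-magnitude gradient entry.
        k = np.argmax(np.abs(grad))
        s = np.zeros_like(grad)
        s[k] = -radius * np.sign(grad[k])
        return s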

  18. Generalized conditional gradient method
      minimize_x  f(x) + φ(κ_P(x))
        s^(t) = argmin_s ∇f(x^(t))^T s + φ(κ_P(s)) = ξ ŝ            generalized LMO
          ŝ = argmin_{ŝ ∈ P} ∇f(x^(t))^T ŝ                          direction (write ν := ∇f(x^(t))^T ŝ)
          ξ = argmin_{ξ ≥ 0} ξν + φ(ξ)                              magnitude
        x^(t+1) = (1 − θ^(t)) x^(t) + θ^(t) s^(t)                   convex merging
      Computational complexity: ≈ same as vanilla CGM (LMO + 1-D optimization)
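A minimal sketch of the direction/magnitude split (my own rendering; atoms are stored as matrix columns, and magnitude is a user-supplied 1-D solver, e.g. the closed form (φ*)'(ν) from the next slide; here ν is taken with the sign convention ν = −∇f(x)^T ŝ):

    import numpy as np

    def generalized_lmo(atoms, grad, magnitude):
        # Direction: ordinary LMO over the atom set, as in vanilla CGM.
        scores = atoms.T @ grad
        k = np.argmin(scores)
        s_hat = atoms[:, k]                 # argmin_{p in P} grad^T p
        nu = -scores[k]                     # nu = -grad^T s_hat
        # Magnitude: xi = argmin_{xi >= 0}  -xi*nu + phi(xi).
        xi = magnitude(nu)
        return xi * s_hat

For example, with φ(ξ) = ½ξ^2 one would pass magnitude = lambda nu: max(nu, 0.0).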

  20. Computing magnitude
        ξ = argmin_{ξ ≥ 0}  −ξ·ν + φ(ξ)  =  (φ*)'(ν)
      where φ*(ν) = sup_ξ ξ·ν − φ(ξ) is the convex conjugate.
      Example: φ(ξ) = ξ^p / p
        p = 1:    ξ = 0 if ν ≤ 1, +∞ otherwise         (doesn't work for LASSO)
        p = 2:    ξ = ν                                 (not bounded)
        p = +∞:   ξ = 1                                 (vanilla CGM)
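When (φ*)' has no convenient closed form, the magnitude is still just a 1-D convex problem; a numeric fallback sketch (function name and the bound on ξ are my own choices):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def magnitude(nu, phi, upper=1e6):
        # Numerically solve  xi = argmin_{xi >= 0}  -xi*nu + phi(xi).
        res = minimize_scalar(lambda xi: -xi * nu + phi(xi),
                              bounds=(0.0, upper), method="bounded")
        return res.x

    # phi(xi) = 0.5*xi^2 gives xi ≈ nu, matching the closed form (phi*)'(nu) = nu.
    print(magnitude(3.0, lambda xi: 0.5 * xi ** 2))   # ≈ 3.0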

  21. Assumptions on φ
      minimize_x  f(x) + φ(κ_P(x))
      Assumption: the function φ : R_+ → R_+ is monotonically nondecreasing over all ξ ≥ 0, with
        subdifferentials not upper-bounded:  sup{ν : ν ∈ ∂φ(ξ)} → +∞ as ξ → +∞      (a finite ξ always exists)
        a quadratic lower bound:  φ(ξ) ≥ μ_φ ξ^2 − φ_0  for some μ_φ > 0 and φ_0 ∈ R   (ξ doesn't grow too fast)
      For example, φ(ξ) = ½ξ^2 satisfies both (μ_φ = ½, φ_0 = 0), while φ(ξ) = ξ violates the quadratic lower bound.
