Frank-Wolfe Algorithms for Saddle Point Problems



  1. Frank-Wolfe Algorithms for Saddle Point Problems. Author: Gauthier Gidel. Supervisors: Simon Lacoste-Julien & Tony Jebara. INRIA Paris, Sierra Team & Columbia University. September 15th, 2016.

  2. Overview ◮ Machine learning needs to tackle complicated optimization problems ⇒ ML needs optimization. ◮ The Frank-Wolfe algorithm (FW) has gained popularity over the last couple of years. ◮ It is a convex optimization algorithm for solving constrained problems. ◮ We extended FW to saddle-point optimization, which is non-trivial (we partially answered a 30-year-old conjecture).

  3. Motivations: games. Zero-sum games with two players: ◮ Player 1 has actions {1, ..., I} available. ◮ Player 2 has actions {1, ..., J} available. ◮ If Player 1 plays action i and Player 2 plays action j, Player 1 receives the reward M_ij. ◮ When both players play randomly, x ∈ Δ_I, y ∈ Δ_J, the expected reward is E[M_ij] = xᵀMy. Nash equilibrium: (x*, y*) ∈ X × Y such that (x*)ᵀMy ≤ (x*)ᵀMy* ≤ xᵀMy* for all (x, y) ∈ X × Y.
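As a quick illustration (a hypothetical sketch, not from the slides): for a matrix game min_x max_y xᵀMy, the distance to a Nash equilibrium can be measured by the duality gap max_{y'} xᵀMy' − min_{x'} x'ᵀMy, which vanishes exactly at equilibrium. The payoff matrix below is an illustrative choice:

```python
import numpy as np

def nash_gap(M, x, y):
    """Duality gap of the matrix game min_x max_y x^T M y.

    Over simplices the inner optima are attained at vertices, so the gap
    is max_j (x^T M)_j - min_i (M y)_i; it is zero iff (x, y) is a Nash
    equilibrium.
    """
    return np.max(x @ M) - np.min(M @ y)

# Matching pennies: the uniform strategies form the equilibrium.
M = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = y = np.array([0.5, 0.5])
print(nash_gap(M, x, y))  # 0.0
```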

  4. Saddle-point setting. Let L : X × Y → ℝ, where X and Y are convex and compact. • Intuition from two-player games: ◮ L is a score function. ◮ P1 chooses an action in X and wants to minimize the score. ◮ P2 chooses an action in Y and wants to maximize the score. ◮ The saddle point is the pair of best choices for the two players. • L is said to be convex-concave if: 1. ∀ y ∈ Y, x ↦ L(x, y) is convex. 2. ∀ x ∈ X, y ↦ L(x, y) is concave. • A saddle point is a pair (x*, y*) such that, ∀ (x, y) ∈ X × Y, L(x*, y) ≤ L(x*, y*) ≤ L(x, y*).

  5. Motivations: more applications. Robust learning:¹ we want to learn

  min_{θ∈Θ} (1/n) Σ_{i=1}^n ℓ(f_θ(x_i), y_i) + λΩ(θ)    (1)

  with an uncertainty regarding the data:

  min_{θ∈Θ} max_{w∈Δ_n} Σ_{i=1}^n w_i ℓ(f_θ(x_i), y_i) + λΩ(θ)    (2)

  ¹ Junfeng Wen, Chun-Nam Yu, and Russell Greiner. "Robust Learning under Uncertain Test Distributions: Relating Covariate Shift to Model Misspecification". In: ICML. 2014, pp. 631–639.
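A one-line consequence of (2), shown as a hedged sketch with made-up loss values: when w ranges over the whole simplex Δ_n, the inner maximization puts all its mass on the worst sample, so the adversarial term equals the maximum per-sample loss.

```python
import numpy as np

# Inner max of (2) over the full simplex: max_{w in Δ_n} Σ_i w_i * loss_i
# concentrates on the largest loss, so it equals max_i loss_i.
# (Practical robust-learning variants restrict w to a subset of Δ_n.)
losses = np.array([0.2, 1.5, 0.7])   # illustrative per-sample losses
w_star = np.zeros_like(losses)
w_star[np.argmax(losses)] = 1.0      # worst-case weighting
assert w_star @ losses == losses.max()
```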

  6. Standard approaches in the literature. The standard algorithm to solve saddle-point problems is the projected gradient algorithm:

  x^(t+1) = P_X(x^(t) − η ∇_x L(x^(t), y^(t)))
  y^(t+1) = P_Y(y^(t) + η ∇_y L(x^(t), y^(t)))

  When the gradient is uniformly bounded, the averaged iterates converge:

  (1/T) Σ_{t=1}^T (x^(t), y^(t)) → (x*, y*) as T → ∞    (3)
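A minimal runnable sketch of this scheme on the bilinear game L(x, y) = xᵀMy over two simplices; the matrix M, step size η, and horizon T are illustrative choices, not from the slides. Note that it is the averaged iterates of (3) that converge; the last iterates typically cycle on bilinear games.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
x, y = np.full(5, 0.2), np.full(5, 0.2)
x_avg, y_avg = np.zeros(5), np.zeros(5)
eta, T = 0.1, 2000
for t in range(T):
    # Simultaneous projected gradient step: descent in x, ascent in y.
    x, y = (proj_simplex(x - eta * (M @ y)),
            proj_simplex(y + eta * (M.T @ x)))
    x_avg += x / T
    y_avg += y / T
```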

  7. The FW algorithm. Initialize x^(0). For t = 0, ..., T do: ◮ Compute s^(t) := argmin_{s∈X} ⟨s, ∇f(x^(t))⟩. ◮ Let γ_t = 2/(2+t). ◮ Update x^(t+1) = x^(t) + γ_t (s^(t) − x^(t)). End for. Figure: one step of the FW algorithm.
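A minimal sketch of these three steps on a toy problem, min over the simplex of f(x) = ½‖x − b‖² (the objective, b, and the horizon are illustrative, not from the slides); the LMO over a simplex simply returns a vertex.

```python
import numpy as np

d = 10
b = np.linspace(0.0, 1.0, d)           # illustrative target
grad = lambda x: x - b                 # gradient of f(x) = 0.5*||x - b||^2

x = np.full(d, 1.0 / d)                # x(0): uniform starting point
for t in range(200):
    g = grad(x)
    s = np.zeros(d)
    s[np.argmin(g)] = 1.0              # LMO over the simplex: best vertex
    gamma = 2.0 / (2.0 + t)            # the universal step size
    x = x + gamma * (s - x)            # convex combination stays feasible
```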

  8. SP-FW. A saddle-point version of the Frank-Wolfe algorithm: ◮ Let z^(0) = (x^(0), y^(0)) ∈ X × Y. ◮ For t = 0, ..., T: ◮ Compute G = (∇_x L(x^(t), y^(t)), −∇_y L(x^(t), y^(t))). ◮ Compute s^(t) := argmin_{s∈X×Y} ⟨s, G⟩. ◮ Let γ_t = 2/(2+t). ◮ Update z^(t+1) := (1 − γ_t) z^(t) + γ_t s^(t). ◮ Return (x^(T), y^(T)).
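A minimal SP-FW sketch on the bilinear game L(x, y) = xᵀMy over a product of simplices (M is an illustrative random payoff matrix; the LMO over a product set splits into one vertex per block). Note that the convergence theory on slide 12 requires strong convex-concavity, which a bilinear game lacks, so this only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6))
x, y = np.full(4, 0.25), np.full(6, 1.0 / 6)

for t in range(500):
    gx, gy = M @ y, -(M.T @ x)                        # G = (grad_x L, -grad_y L)
    sx = np.zeros_like(x); sx[np.argmin(gx)] = 1.0    # LMO, x-block
    sy = np.zeros_like(y); sy[np.argmin(gy)] = 1.0    # LMO, y-block
    gamma = 2.0 / (2.0 + t)
    x = (1.0 - gamma) * x + gamma * sx
    y = (1.0 - gamma) * y + gamma * sy
```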

  9. Advantages of SP-FW. Why would we use SP-FW? ◮ Only an LMO (linear minimization oracle) is needed. ◮ Gap certificate for free. ◮ Simplicity of implementation. ◮ Universal step size γ_t = 2/(2+t), adaptive step size γ_t = g_t/(2C_L), ... ◮ Sparsity of the solution. ◮ Lots of improvements easily available: block-coordinate, away steps, ... When the constraint set is a "complicated" polytope, the projection can be very hard whereas the LMO might be tractable.
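The gap certificate really is a by-product: with the iterate z, the LMO output s, and the stacked gradient G from slide 8, one extra inner product gives g_t. A hedged one-liner, with names matching the SP-FW sketch above:

```python
# g_t = <z - s, G>; in the sketch above, with z = (x, y) and s = (sx, sy):
g_t = (x - sx) @ gx + (y - sy) @ gy   # >= 0, and 0 only at a saddle point
```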

  10. Problems with hard projection. The structured SVM:

  min_ω λΩ(ω) + (1/n) Σ_{i=1}^n H̃_i(ω)

  where H̃_i(ω) = max_{y∈Y_i} L_i(y) − ⟨ω, φ_i(y)⟩ is the structured hinge loss. Then we can rewrite the problem as

  min_{Ω(ω)≤β} max_{y_i∈Y_i} (1/n) Σ_{i=1}^n (L_iᵀ y_i − ωᵀ M_i y_i)

  and, since the function is bilinear,

  min_{Ω(ω)≤β} max_{α∈Δ(|Y|)} bᵀα − ωᵀMα

  If Ω(·) is a group-lasso norm with overlapping groups, the projection is hard, and projecting onto Y is intractable.
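For intuition, an illustrative sketch (not the paper's oracles): LMOs over simple sets return extreme points. Over a simplex the LMO picks the single best structure (loss-augmented decoding); over an ℓ2 ball of radius β, standing in here for a generic norm ball Ω(ω) ≤ β, it scales the negative gradient. The true overlapping group-lasso LMO differs.

```python
import numpy as np

def lmo_simplex(scores):
    """argmin_{s in Δ} <s, scores>: the vertex of the best coordinate.

    For the α-block this is loss-augmented decoding: pick one structure.
    """
    s = np.zeros_like(scores)
    s[np.argmin(scores)] = 1.0
    return s

def lmo_l2_ball(grad, beta):
    """argmin_{||w||_2 <= beta} <w, grad> = -beta * grad / ||grad||_2."""
    return -beta * grad / max(np.linalg.norm(grad), 1e-12)
```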

  11. Problems with hard projection. University game: 1. A game between two universities (A and B). 2. Each admits d students and has to assign pairs of students to dorms. 3. The game has a payoff matrix M belonging to ℝ^{(d(d−1)/2) × (d(d−1)/2)}. 4. M_{ij,kl} is the expected tuition that B gets (or A gives up) if A pairs student i with j and B pairs student k with l. 5. Here the actions both live in the marginal polytope of all perfect unipartite matchings. It is hard to project onto this polytope, whereas the LMO can be solved efficiently with the blossom algorithm.² ² J. Edmonds. "Paths, Trees, and Flowers". In: Canadian Journal of Mathematics (1965).
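A hedged sketch of that LMO (the pair encoding and the use of networkx's blossom implementation are my illustrative choices): minimizing ⟨s, G⟩ over the matching polytope amounts to a maximum-weight perfect matching with weights −G.

```python
import itertools
import networkx as nx
import numpy as np

def lmo_matching(grad, d):
    """LMO over the perfect-matching polytope on d students (d even).

    grad[i, j] is the gradient entry for pairing students i and j; the
    minimizer of <s, grad> over the polytope is a vertex, i.e. the
    maximum-weight perfect matching for weights -grad, found by
    Edmonds' blossom algorithm (here via networkx).
    """
    G = nx.Graph()
    for i, j in itertools.combinations(range(d), 2):
        G.add_edge(i, j, weight=-grad[i, j])
    match = nx.max_weight_matching(G, maxcardinality=True)
    s = np.zeros((d, d))          # return the matching as a 0/1 vertex
    for i, j in match:
        s[i, j] = s[j, i] = 1.0
    return s

# Illustrative use on a random gradient for d = 6 students.
rng = np.random.default_rng(0)
print(lmo_matching(rng.standard_normal((6, 6)), 6))
```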

  12. Our contributions. Theoretical contributions: ◮ We introduced an SP extension of FW with away steps and proved its convergence over a polytope under some conditions (the strong convexity of the function must be big enough), partially answering a 30-year-old conjecture.³ ◮ With a step size γ_t ∼ g_t, the suboptimality satisfies

  h_t = O((1 − ρ)^{t/3})    (4)

  ³ Janice H. Hammond. "Solving Asymmetric Variational Inequality Problems and Systems of Equations with Generalized Nonlinear Programming Algorithms". PhD thesis. Massachusetts Institute of Technology, 1984.

  13. Toy experiments. Figure: SP-AFW on a toy example, d = 30 (duality gap vs. iteration). Figure: SP-AFW on a toy example, d = 30, with heuristic step size (duality gap vs. iteration). The curves compare the step-size rules γ = 2/(2+k), γ heuristic, and γ adaptive for various values of τ.

  14. Experiments. Figure: SP-FW on the university game (duality gap vs. iteration, for dimensions d = 28 up to d = 32640). Figure: structural SVM with the OCR dataset, highly regularized (primal suboptimality vs. effective passes; SP-FW with γ = 2/(2+k) and γ = 1/(1+k), SP-BCFW with γ = 2n/(2n+k), and subgradient/SSG baselines).

  15. Conclusion ◮ There already exist a lot of saddle-point problems in the machine learning literature, and most of the time they are solved by a trick. ◮ There exist only a few algorithms that solve SP problems directly (and they are not well known)! ◮ SP-FW works directly on SPs and is the only existing algorithm able to solve some of these problems.

  16. Thank You!
