Boosting Frank-Wolfe by Chasing Gradients
Cyrille W. Combettes, with Sebastian Pokutta
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA
37th International Conference on Machine Learning, July 12–18, 2020
Outline
1. Introduction
2. The Frank-Wolfe algorithm
3. Boosting Frank-Wolfe
4. Computational experiments
Introduction

Let H be a Euclidean space (e.g., ℝ^n or ℝ^{m×n}) and consider

  min f(x)  s.t.  x ∈ C

where
• f : H → ℝ is a smooth convex function
• C ⊂ H is a compact convex set, C = conv(V)

Examples
• Sparse logistic regression:

  min_{x ∈ ℝ^n}  (1/m) ∑_{i=1}^{m} ln(1 + exp(−y_i a_i^⊤ x))   s.t.  ‖x‖_1 ≤ τ

• Low-rank matrix completion:

  min_{X ∈ ℝ^{m×n}}  (1/(2|I|)) ∑_{(i,j) ∈ I} (Y_{i,j} − X_{i,j})^2   s.t.  ‖X‖_nuc ≤ τ
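To make the first example concrete, here is a minimal NumPy sketch of the sparse logistic regression objective and its gradient. The function name and the data layout (A ∈ ℝ^{m×n} with rows a_i, labels y ∈ {−1, +1}^m) are our own illustration, not from the talk; the ℓ1-ball constraint is left to the solver.

```python
import numpy as np

def logistic_loss_and_grad(x, A, y):
    """f(x) = (1/m) * sum_i ln(1 + exp(-y_i * <a_i, x>)).

    A: (m, n) data matrix with rows a_i; y: labels in {-1, +1}.
    The constraint ||x||_1 <= tau is enforced by the solver, not here.
    """
    m = A.shape[0]
    margins = -y * (A @ x)                        # z_i = -y_i <a_i, x>
    loss = np.mean(np.logaddexp(0.0, margins))    # stable ln(1 + exp(z_i))
    # d/dx ln(1 + exp(z_i)) = sigmoid(z_i) * dz_i/dx, with dz_i/dx = -y_i a_i
    coeffs = -y / (1.0 + np.exp(-margins))        # sigmoid(z_i) * (-y_i)
    grad = A.T @ coeffs / m
    return loss, grad
```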
Introduction

• A natural approach is to use any efficient method and add projections back onto C to ensure feasibility

[Figure: the gradient step x_t − γ_t ∇f(x_t) leaves C and is projected back onto C to obtain x_{t+1}]

• However, in many situations projections onto C are very expensive
• This is an issue with the method of projections, not necessarily with the geometry of C: linear minimizations over C can still be relatively cheap

  Feasible region C             Linear minimization    Projection
  ℓ1/ℓ2/ℓ∞-ball                 O(n)                   O(n)
  ℓp-ball, p ∈ ]1,∞[ \ {2}      O(n)                   N/A
  Nuclear norm-ball             O(nnz)                 O(mn min{m,n})
  Flow polytope                 O(n)                   O(n^3.5)
  Birkhoff polytope             O(n^3)                 N/A
  Matroid polytope              O(n ln(n))             O(poly(n))

  N/A: no closed form exists and the solution must be computed via nontrivial optimization

• Can we avoid projections?
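To illustrate the first table row, a linear minimization over the ℓ1-ball reduces to one pass over the gradient: a linear function over conv(V) is minimized at a vertex, here a signed, scaled coordinate vector. A minimal sketch (the function name is ours):

```python
import numpy as np

def lmo_l1_ball(grad, tau):
    """argmin_{||v||_1 <= tau} <grad, v>, computed in O(n).

    The l1-ball is conv({+/- tau * e_1, ..., +/- tau * e_n}), so the
    minimizer is the vertex opposing the largest-magnitude gradient entry.
    """
    i = np.argmax(np.abs(grad))       # best coordinate: a single O(n) pass
    v = np.zeros_like(grad)
    v[i] = -tau * np.sign(grad[i])    # oppose the sign of that entry
    return v
```

By contrast, projecting onto, e.g., the nuclear norm-ball requires a full SVD, which is where the gap in the table comes from.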
The Frank-Wolfe algorithm

The Frank-Wolfe algorithm (Frank & Wolfe, 1956), a.k.a. the conditional gradient algorithm (Levitin & Polyak, 1966):

Algorithm Frank-Wolfe (FW)
Input: x_0 ∈ C, step-sizes γ_t ∈ [0, 1]
1: for t = 0 to T − 1 do
2:   v_t ← argmin_{v ∈ V} ⟨∇f(x_t), v⟩
3:   x_{t+1} ← x_t + γ_t (v_t − x_t)

[Figure: from x_t, the vertex v_t minimizes ⟨∇f(x_t), ·⟩ over C, and the iterate moves along v_t − x_t to x_{t+1}]

• x_{t+1} is obtained by convex combination of x_t ∈ C and v_t ∈ C, thus x_{t+1} ∈ C
• FW uses linear minimizations (the "FW oracle") instead of projections
• FW = pick a vertex (using gradient information) and move in that direction
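Putting the pieces together, here is a minimal sketch of the FW loop. It uses the standard open-loop schedule γ_t = 2/(t + 2) from the FW literature (the algorithm above only requires γ_t ∈ [0, 1]); grad_f and lmo are placeholders, e.g., the two sketches above.

```python
def frank_wolfe(grad_f, lmo, x0, T):
    """Vanilla Frank-Wolfe: T linear minimizations, no projections.

    grad_f: x -> gradient of f at x.
    lmo:    g -> argmin_{v in V} <g, v> (the FW oracle).
    x0:     a feasible starting point in C.
    """
    x = x0.copy()
    for t in range(T):
        v = lmo(grad_f(x))         # line 2: pick a vertex via the gradient
        gamma = 2.0 / (t + 2.0)    # open-loop step-size in [0, 1]
        x = x + gamma * (v - x)    # line 3: convex combination, stays in C
    return x

# Illustrative use: sparse logistic regression over the l1-ball
# x_hat = frank_wolfe(lambda x: logistic_loss_and_grad(x, A, y)[1],
#                     lambda g: lmo_l1_ball(g, tau),
#                     x0=np.zeros(n), T=1000)
```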