Block Conditional Gradient Algorithms
E. Pauwels, joint work with A. Beck and S. Sabach.
GdT Mathématiques de l'apprentissage, September 24, 2015
Context: large scale convex optimization

Two old ideas have received renewed attention in the past years:
- Block decomposition, $x = (x_1, \dots, x_N)$: coordinate descent methods, suited to large dimension and distributed data.
- Linear oracles, $\min_{x \in X} \langle x, c \rangle$: conditional gradient methods, suited to "complex constraints", with a primal-dual interpretation.

Theoretical properties and empirical performances?
Scope of the presentation

Most results in the literature hold for random block selection rules. Lacoste-Julien and co-authors analyzed the random block conditional gradient method (RBCG):
- Block-Coordinate Frank-Wolfe Optimization for Structural SVMs (ICML 2013).
We propose a convergence analysis for the cyclic block variant (CBCG).

This presentation focuses on machine learning related aspects:
- General introduction to linear oracle based optimization methods.
- Specialization to (regularized) empirical risk minimization (ERM).
- Details about the application to structured SVM (Taskar et al., 2003; Tsochantaridis et al., 2005).
Outline
1. Context
2. Conditional Gradient algorithm
3. CG and convex duality
4. Block CG and L2-regularized ERM
5. Results
Main idea

Optimization setting: $f : \mathbb{R}^n \to \mathbb{R}$ is convex, $C^1$ with $L$-Lipschitz gradient over $X \subset \mathbb{R}^n$, which is convex and compact.
$$\bar f := \min_{x \in X} f(x)$$

Start with $x^0 \in X$ and repeat
$$p^k \in \operatorname{argmax}_{y \in X} \langle \nabla f(x^k), x^k - y \rangle,$$
$$x^{k+1} = (1 - \alpha_k) x^k + \alpha_k p^k, \qquad 0 \le \alpha_k \le 1.$$

Step size:
- Open loop: $\alpha_k = \frac{2}{k+2}$.
- Exact line search: $x^{k+1} = \operatorname{argmin}_{y \in [x^k, p^k]} f(y)$.
- Approximate line search: $x^{k+1} = \operatorname{argmin}_{y \in [x^k, p^k]} Q(x^k, y)$, where
$$f(y) \le Q(x, y) := f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_2^2$$
(tangent quadratic upper bound, descent lemma).
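A minimal illustrative Python sketch (not from the slides) of this iteration, assuming the user supplies the gradient `grad_f` and a linear minimization oracle `lmo` returning $\operatorname{argmin}_{y \in X} \langle g, y \rangle$ (equivalently, the maximizer of $\langle g, x^k - y \rangle$), with the open-loop step size $\alpha_k = 2/(k+2)$:

```python
import numpy as np

def conditional_gradient(grad_f, lmo, x0, num_iters=1000):
    """Generic conditional gradient (Frank-Wolfe) loop with open-loop steps."""
    x = x0.copy()
    for k in range(num_iters):
        g = grad_f(x)
        p = lmo(g)                            # p^k maximizes <g, x^k - y> over y in X
        alpha = 2.0 / (k + 2.0)               # open-loop step size
        x = (1.0 - alpha) * x + alpha * p     # convex combination stays in X
    return x

# Hypothetical example: minimize (1/2)||x - c||^2 over the unit simplex,
# whose linear oracle returns a vertex (a coordinate basis vector).
c = np.array([0.2, 0.5, 0.3])
grad_f = lambda x: x - c
lmo = lambda g: np.eye(len(g))[np.argmin(g)]
x_star = conditional_gradient(grad_f, lmo, x0=np.array([1.0, 0.0, 0.0]))
```

With the open-loop step the classical $O(1/k)$ rate recalled on the next slide applies.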
Historical remarks

Fifty years ago:
- First appearance for quadratic programs (Frank, Wolfe, 1956).
- $f(x^k) - \bar f = O(1/k)$ (Polyak, Dunn, Dem'yanov, ..., 60's).
- For any $\epsilon > 0$, the rate cannot be $O(1/k^{1+\epsilon})$ (Canon, Cullum, Polyak, 60's).

Recent developments (illustrations follow):
- Revival for large scale problems.
- Primal-dual interpretation (Bach 2015) and convergence analysis (Jaggi 2013).
- Block decomposition variants (Lacoste-Julien et al. 2013).
Why is it interesting?

$O(1/k^2)$ can be achieved by using projections (Beck, Teboulle 2009), and the conditional gradient method does not compete in practice. However, in some situations projection does not constitute a practical alternative, while linear programs over compact convex sets attain their value at extreme points, which are often cheap to compute.

Trace norm example: for $M \in \mathbb{R}^{m \times n}$, $\|M\|_* = \sum_i \sigma_i$, where $\{\sigma_i\}$ is the set of singular values of $M$.
- Projection onto the trace norm ball is a thresholding of singular values, which requires a full SVD.
- Linear programming over the trace norm ball amounts to finding the largest singular value, i.e., a leading singular vector pair.
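For concreteness, here is an illustrative NumPy/SciPy sketch (not from the slides) of the linear oracle over the trace norm ball of radius `tau`: it only needs a leading singular vector pair, whereas projection onto the same ball would start from a full SVD.

```python
import numpy as np
from scipy.sparse.linalg import svds

def trace_norm_lmo(G, tau):
    """Linear minimization oracle over {Y : ||Y||_* <= tau}.

    argmin_Y <G, Y> is attained at -tau * u1 v1^T, where (u1, v1) is a
    leading singular vector pair of G, so a rank-1 computation suffices.
    """
    u, s, vt = svds(np.asarray(G, dtype=float), k=1)   # leading singular triplet
    return -tau * np.outer(u[:, 0], vt[0, :])

# By contrast, projection requires all singular values:
#   U, s, Vt = np.linalg.svd(G)   # full SVD, then threshold s
G = np.random.randn(50, 30)
Y = trace_norm_lmo(G, tau=1.0)
```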
Outline
1. Context
2. Conditional Gradient algorithm
3. CG and convex duality
4. Block CG and L2-regularized ERM
5. Results
Convex duality

Recall that $X$ is convex and compact. Define its support function $g : \mathbb{R}^n \to \mathbb{R}$,
$$g : w \mapsto \max_{x \in X} \langle x, w \rangle.$$

Given $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$, consider the problems
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{1}{2} \|w\|_2^2 + g(-Aw + b) \qquad (= P(w))$$
$$\bar d = \min_{x \in X} \frac{1}{2} \|A^T x\|_2^2 - \langle x, b \rangle \qquad (= D(x))$$

Weak duality: for any $w \in \mathbb{R}^m$ and $x \in X$, $P(w) + D(x) \ge 0$.
Strong duality holds: $\bar p + \bar d = 0$.
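As an illustrative sanity check (not part of the slides), take $X$ to be the Euclidean unit ball, so that the support function is $g(w) = \|w\|_2$; the sketch below draws random primal and dual points and verifies the weak duality inequality $P(w) + D(x) \ge 0$ numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 5
A, b = rng.standard_normal((n, m)), rng.standard_normal(n)

# X = Euclidean unit ball, so g(w) = max_{x in X} <x, w> = ||w||_2.
g = lambda w: np.linalg.norm(w)
P = lambda w: 0.5 * w @ w + g(-A @ w + b)
D = lambda x: 0.5 * (A.T @ x) @ (A.T @ x) - x @ b

for _ in range(1000):
    w = rng.standard_normal(m)
    x = rng.standard_normal(n)
    x /= max(1.0, np.linalg.norm(x))    # keep x feasible, i.e. x in X
    assert P(w) + D(x) >= -1e-12        # weak duality: nonnegative gap
```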
Primal subgradient and dual conditional gradient

$$g : w \mapsto \max_{x \in X} \langle x, w \rangle \qquad (x \in \operatorname{argmax} \Leftrightarrow x \in \partial g(w))$$
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{1}{2} \|w\|_2^2 + g(-Aw + b) \qquad (= P(w))$$
$$\bar d = \min_{x \in X} \frac{1}{2} \|A^T x\|_2^2 - \langle x, b \rangle \qquad (= D(x))$$

A conditional gradient step in the dual computes $p^k$:
$$\max_{y \in X} \langle AA^T x^k - b, x^k - y \rangle = \|A^T x^k\|_2^2 - \langle b, x^k \rangle + g(-AA^T x^k + b) = P(A^T x^k) + D(x^k).$$

Consider the primal variable $w^k = A^T x^k$: we have $p^k \in \partial g(-Aw^k + b)$, hence $w^k - A^T p^k = A^T(x^k - p^k) \in \partial P(w^k)$, and
$$w^{k+1} - w^k = \alpha_k A^T(-x^k + p^k) \in -\alpha_k \, \partial P(w^k).$$

Implicit subgradient steps in the primal!
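The identity above can be checked numerically. The sketch below (an illustration assuming $X = [-1, 1]^n$, so that $g(w) = \|w\|_1$ and the linear oracle is a sign pattern) runs conditional gradient on $D$ and verifies at each step that the Frank-Wolfe gap at $x^k$ equals the primal-dual gap $P(A^T x^k) + D(x^k)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 10
A, b = rng.standard_normal((n, m)), rng.standard_normal(n)

# X = [-1, 1]^n, hence g(w) = max_{x in X} <x, w> = ||w||_1.
g = lambda w: np.abs(w).sum()
P = lambda w: 0.5 * w @ w + g(-A @ w + b)
D = lambda x: 0.5 * (A.T @ x) @ (A.T @ x) - x @ b

x = np.zeros(n)                          # feasible starting point in X
for k in range(200):
    grad = A @ (A.T @ x) - b             # gradient of D at x^k
    p = -np.sign(grad)                   # argmax_{y in X} <grad, x^k - y>
    fw_gap = grad @ (x - p)              # Frank-Wolfe gap at x^k
    w = A.T @ x                          # primal variable w^k = A^T x^k
    assert np.isclose(fw_gap, P(w) + D(x))   # gap equals primal-dual gap
    alpha = 2.0 / (k + 2.0)
    x = (1.0 - alpha) * x + alpha * p
```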
Primal subgradient and dual conditional gradient

- The primal-dual interpretation holds in much more general settings (Bach 2015).
- Primal-dual convergence analysis: $\min_{i=1,\dots,k} P(w^i) + D(x^i) = O(1/k)$ (Jaggi 2013).
- Automatic step size tuning for subgradient descent in the primal.
Outline
1. Context
2. Conditional Gradient algorithm
3. CG and convex duality
4. Block CG and L2-regularized ERM
5. Results
L2-regularized ERM

Consider a problem of the form:
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{N} \sum_{i=1}^N g(-A_i w + b_i) \qquad (= P(w))$$
$$\bar d = \min_{x_i \in X,\ i=1,\dots,N} \frac{\lambda}{2} \Big\| \frac{1}{N\lambda} \sum_{i=1}^N A_i^T x_i \Big\|_2^2 - \frac{1}{N} \sum_{i=1}^N \langle x_i, b_i \rangle \qquad (= D(x))$$
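To make the block structure explicit, here is a hedged Python sketch of one way to run cyclic block conditional gradient on this dual, under the illustrative assumptions that every block feasible set is the box $[-1, 1]^{n_i}$ (so each block linear oracle is a sign pattern), that the data are random, and that the block step size is the open-loop choice $2N/(k + 2N)$; the maintained vector `w` is the primal candidate $\frac{1}{N\lambda} \sum_i A_i^T x_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_i, m, lam = 5, 4, 6, 0.1
A = [rng.standard_normal((n_i, m)) for _ in range(N)]    # one block A_i per example
b = [rng.standard_normal(n_i) for _ in range(N)]
x = [np.zeros(n_i) for _ in range(N)]                    # feasible block iterates
w = sum(A[i].T @ x[i] for i in range(N)) / (lam * N)     # primal candidate

for t in range(50):                                      # passes over the blocks
    for i in range(N):                                   # cyclic block order (CBCG)
        grad_i = (A[i] @ w - b[i]) / N                   # partial gradient of D in block i
        p_i = -np.sign(grad_i)                           # block LMO over [-1, 1]^{n_i}
        k = t * N + i
        alpha = 2.0 * N / (k + 2.0 * N)                  # assumed open-loop block step size
        # Update block i and keep w = (1/(lam*N)) * sum_j A_j^T x_j consistent.
        w += A[i].T @ (alpha * (p_i - x[i])) / (lam * N)
        x[i] = (1.0 - alpha) * x[i] + alpha * p_i
```

Drawing the block index $i$ uniformly at random instead of cycling gives the RBCG variant of Lacoste-Julien and co-authors.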