Block-Coordinate Frank-Wolfe Optimization, with Applications to Structured Prediction
Martin Jaggi (CMAP, École Polytechnique, Paris)
Optimization and Big Data Workshop, Edinburgh, May 2, 2013
Co-authors: Simon Lacoste-Julien, Mark Schmidt, and Patrick Pletscher
Outline
• Two Old First-Order Optimization Algorithms
  • Coordinate Descent
  • The Frank-Wolfe Algorithm
• Duality for Constrained Convex Optimization
• Combining Frank-Wolfe and Coordinate Descent
• Applications: Large Margin Prediction
  • binary SVMs
  • structural SVMs
Coordinate Descent
Coordinate Descent

[Figure: coordinate-wise descent steps on $f(x)$, $x \in \mathbb{R}^d$.]

Selection of the next coordinate:
• the one of steepest descent
• cyclic (hard to analyze!)
• random sampling
The Frank-Wolfe Algorithm [Frank and Wolfe (1956)]
The constrained convex problem: $\min_{x \in D} f(x)$, over a compact convex domain $D \subset \mathbb{R}^d$.
The linearized problem at $x$: $\min_{s' \in D}\; f(x) + \langle s' - x, \nabla f(x) \rangle$

Algorithm 1: Frank-Wolfe
  for $k = 0 \ldots K$ do
    Compute $s := \arg\min_{s' \in D} \langle s', \nabla f(x^{(k)}) \rangle$
    Let $\gamma := \frac{2}{k+2}$
    Update $x^{(k+1)} := (1-\gamma)\, x^{(k)} + \gamma\, s$
  end for
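To make the iteration concrete, here is a minimal NumPy sketch of Algorithm 1 with the unit simplex as $D$ and a least-squares objective; the problem data and function names are illustrative, not part of the original slides.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, K):
    """Algorithm 1 with D = unit simplex: the linearized problem
    argmin_{s in Delta_n} <s, g> is solved by the vertex e_i with
    i = argmin_i g_i, so no projection is ever needed."""
    x = x0.copy()
    for k in range(K):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0            # vertex minimizing <s, g>
        gamma = 2.0 / (k + 2)            # standard step size
        x = (1 - gamma) * x + gamma * s  # convex combination stays in D
    return x

# illustrative problem: min_{x in Delta_5} ||Mx - b||^2
rng = np.random.default_rng(0)
M, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
x = frank_wolfe_simplex(lambda x: 2 * M.T @ (M @ x - b), np.ones(5) / 5, K=200)
print(x, x.sum())  # iterate sums to 1 by construction
```

Note that the iterate after $k$ steps is a convex combination of at most $k+1$ vertices, which is the sparsity property contrasted with gradient descent below.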
The linearized problem: $\min_{s' \in D}\; f(x) + \langle s' - x, \nabla f(x) \rangle$

| | Frank-Wolfe | Gradient Descent |
|---|---|---|
| Cost per step | (approx.) solve the linearized problem on $D$ | projection back onto $D$ |
| Sparse solutions (in terms of used vertices) | ✓ | ✗ |
Some Examples of Atomic Domains Suitable for Frank-Wolfe Optimization

| Atoms $\mathcal{A}$ | $\mathcal{X}$ | $D = \operatorname{conv}(\mathcal{A})$ | One FW iteration: $\sup_{s \in D} \langle s, y \rangle$ | Complexity |
|---|---|---|---|---|
| Sparse vectors | $\mathbb{R}^n$ | $\lVert . \rVert_1$-ball | $\lVert y \rVert_\infty$ | $O(n)$ |
| Sign-vectors | $\mathbb{R}^n$ | $\lVert . \rVert_\infty$-ball | $\lVert y \rVert_1$ | $O(n)$ |
| $\ell_p$-sphere | $\mathbb{R}^n$ | $\lVert . \rVert_p$-ball | $\lVert y \rVert_q$ | $O(n)$ |
| Sparse non-neg. vectors | $\mathbb{R}^n$ | simplex $\Delta_n$ | $\max_i \{y_i\}$ | $O(n)$ |
| Latent group sparse vectors | $\mathbb{R}^n$ | $\lVert . \rVert_{\mathcal{G}}$-ball | $\max_{g \in \mathcal{G}} \lVert y_{(g)} \rVert_g^*$ | $O(\sum_{g \in \mathcal{G}} \lvert g \rvert)$ |
| Matrix trace norm | $\mathbb{R}^{m \times n}$ | $\lVert . \rVert_{\mathrm{tr}}$-ball | $\lVert y \rVert_{\mathrm{op}} = \sigma_1(y)$ | $\tilde{O}(N_f / \sqrt{\varepsilon'})$ (Lanczos) |
| Matrix operator norm | $\mathbb{R}^{m \times n}$ | $\lVert . \rVert_{\mathrm{op}}$-ball | $\lVert y \rVert_{\mathrm{tr}} = \lVert (\sigma_i(y)) \rVert_1$ | SVD |
| Schatten matrix norms | $\mathbb{R}^{m \times n}$ | $\lVert (\sigma_i(.)) \rVert_p$-ball | $\lVert (\sigma_i(y)) \rVert_q$ | SVD |
| Matrix max-norm | $\mathbb{R}^{m \times n}$ | $\lVert . \rVert_{\max}$-ball | – | $\tilde{O}(N_f\, (n+m)^{1.5} / \varepsilon'^{2.5})$ |
| Permutation matrices | $\mathbb{R}^{n \times n}$ | Birkhoff polytope | – | $O(n^3)$ |
| Rotation matrices | $\mathbb{R}^{n \times n}$ | – | – | SVD (Procrustes prob.) |
| Rank-1 PSD matrices of unit trace | $\mathbb{S}^{n \times n}$ | $\{x \succeq 0,\, \operatorname{Tr}(x) = 1\}$ | $\lambda_{\max}(y)$ | $\tilde{O}(N_f / \sqrt{\varepsilon'})$ (Lanczos) |
| PSD matrices of bounded diagonal | $\mathbb{S}^{n \times n}$ | $\{x \succeq 0,\, x_{ii} \le 1\}$ | – | $\tilde{O}(N_f\, n^{1.5} / \varepsilon'^{2.5})$ |

Table 1: Some examples of atomic domains suitable for optimization using the Frank-Wolfe algorithm [J. 2013]. Here SVD refers to the complexity of computing a singular value decomposition, which is $O(\min\{mn^2, m^2 n\})$. $N_f$ is the number of non-zero entries in the gradient of the objective function $f$, and $\varepsilon' = \frac{2 \delta C_f}{k+2}$ is the required accuracy for the linear subproblems. For any $p \in [1, \infty]$, the conjugate value $q$ is meant to satisfy $\frac{1}{p} + \frac{1}{q} = 1$, allowing $q = \infty$ for $p = 1$ and vice versa. [Dudík et al. 2011, Tewari et al. 2011, J. 2011]
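As a concrete illustration of two rows of Table 1, here is a minimal NumPy sketch of the linear subproblems (the per-iteration oracles) for the $\ell_1$-ball and the trace-norm ball; the function names are illustrative, and a full SVD stands in for the Lanczos method mentioned in the table.

```python
import numpy as np

def lmo_l1_ball(y, radius=1.0):
    """argmin_{||s||_1 <= radius} <s, y>: a signed, scaled unit vector
    at the largest-magnitude entry of y -- the O(n) row of Table 1."""
    i = np.argmax(np.abs(y))
    s = np.zeros_like(y)
    s[i] = -radius * np.sign(y[i])
    return s

def lmo_trace_ball(Y, radius=1.0):
    """argmin_{||S||_tr <= radius} <S, Y>: minus radius times the top
    singular vector pair of Y (Lanczos in practice; full SVD here)."""
    U, _, Vt = np.linalg.svd(Y)
    return -radius * np.outer(U[:, 0], Vt[0, :])
```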
Convergence

Primal convergence: the algorithm obtains $f(x^{(k)}) - f(x^*) \le O(1/k)$ after $k$ steps. [Frank & Wolfe 1956]

Primal-dual convergence: the algorithm obtains $\operatorname{gap}(x^{(k)}) \le O(1/k)$ after $k$ steps. [Clarkson 2008, J. 2013]
A Simple Optimization Duality

Original problem: $\min_{x \in D} f(x)$

The dual value: $\omega(x) := f(x) + \min_{s' \in D} \langle s' - x, \nabla f(x) \rangle$

Duality gap: $\operatorname{gap}(x) = f(x) - \omega(x)$

Weak duality: $\omega(x) \le f(x^*) \le f(x')$ for any $x, x' \in D$.
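A minimal sketch of how this gap is evaluated inside a Frank-Wolfe loop (names are illustrative; `s` is the solution of the linearized problem already computed for the current step, so the certificate is free):

```python
def fw_gap(x, s, g):
    """Duality gap gap(x) = f(x) - omega(x) = <x - s, grad f(x)>,
    where s = argmin_{s' in D} <s', g>. By weak duality it certifies
    f(x) - f(x*) <= fw_gap(x, s, g)."""
    return float((x - s) @ g)
```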
Block-Separable Optimization Problems

$\min_{x \in D^{(1)} \times \cdots \times D^{(n)}} f(x)$, where $x = (x_{(1)}, \ldots, x_{(n)})$ and each block domain satisfies $D^{(i)} \subset \mathbb{R}^{d_i}$.
Algorithm 2: Uniform Coordinate Descent
  Let $x^{(0)} \in D$
  for $k = 0 \ldots K$ do
    Pick $i \in [n]$ uniformly at random
    Compute $s_{(i)} := \arg\min_{s_{(i)} \in D^{(i)}} \langle s_{(i)}, \nabla_{(i)} f(x^{(k)}) \rangle + \frac{L_i}{2} \lVert s_{(i)} - x_{(i)}^{(k)} \rVert^2$
    Update $x_{(i)}^{(k+1)} := x_{(i)}^{(k)} + (s_{(i)} - x_{(i)}^{(k)})$
  end for
[Nesterov (2012); Richtárik, Takáč (2012): "huge-scale" coordinate descent]

Algorithm 3: Block-Coordinate "Frank-Wolfe"
  Let $x^{(0)} \in D$
  for $k = 0 \ldots K$ do
    Pick $i \in [n]$ uniformly at random
    Compute $s_{(i)} := \arg\min_{s_{(i)} \in D^{(i)}} \langle s_{(i)}, \nabla_{(i)} f(x^{(k)}) \rangle$
    Let $\gamma := \frac{2n}{k+2n}$, or optimize $\gamma$ by line-search
    Update $x_{(i)}^{(k+1)} := x_{(i)}^{(k)} + \gamma\, (s_{(i)} - x_{(i)}^{(k)})$
  end for

Theorem: the algorithm obtains accuracy $\le O\!\left(\frac{2n}{k+2n}\right)$ after $k$ steps (also in the duality gap, and with inexact subproblems). Hidden constant: the curvature, bounded by $\sum_i L_i \operatorname{diam}^2(D^{(i)})$.
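A minimal sketch of Algorithm 3 for a product of unit simplices (an illustrative choice of the $D^{(i)}$; all names and data are hypothetical):

```python
import numpy as np

def bcfw(grad_block, blocks, K, seed=0):
    """Algorithm 3 sketch: `blocks` is a list of n block iterates, each
    living in its own unit simplex D^(i); grad_block(blocks, i) returns
    the partial gradient for block i."""
    n = len(blocks)
    rng = np.random.default_rng(seed)
    for k in range(K):
        i = rng.integers(n)                    # pick a block u.a.r.
        g = grad_block(blocks, i)
        s = np.zeros_like(g)
        s[np.argmin(g)] = 1.0                  # block-wise linear minimizer
        gamma = 2.0 * n / (k + 2.0 * n)        # step size from Algorithm 3
        blocks[i] = (1 - gamma) * blocks[i] + gamma * s
    return blocks

# toy usage: n decoupled quadratics, one per simplex block
targets = np.random.default_rng(1).standard_normal((4, 3))
g = lambda bl, i: 2 * (bl[i] - targets[i])
blocks = bcfw(g, [np.ones(3) / 3 for _ in range(4)], K=500)
```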
Applications: Large Margin Prediction

• Binary Support Vector Machine (no bias)
• also: Ranking SVM

Margin constraints: $\langle w, \phi(x_i) \rangle\, y_i \ge 1 - \xi_i$

Primal problem:
$\min_{w}\; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - \langle w, \phi(x_i) \rangle\, y_i\}$
Binary SVM

Margin constraints: $\langle w, \phi(x_i) \rangle\, y_i \ge 1 - \xi_i$

Primal: $\min_{w \in \mathbb{R}^d}\; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - \langle w, \phi(x_i) \rangle\, y_i\}$
(the term $\phi(x_i)\, y_i$ appears, up to scaling, as the $i$-th column of $A$)

Dual: $\min_{\alpha \in \mathbb{R}^n}\; f(\alpha) := \frac{\lambda}{2} \lVert A\alpha \rVert^2 - b^T \alpha$ s.t. $0 \le \alpha_i \le 1 \;\forall i \in [n]$

• Primal: $d$-dimensional, unconstrained, non-smooth, strongly convex
• Dual: $n$-dimensional, box-constrained, smooth, not strongly convex
Structural SVM

"Joint" feature map: $\phi: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$

[Figure: feature vectors $\phi(\text{image}, y)$ for digit images with candidate labels, e.g. $\phi(\cdot, 2)$, $\phi(\cdot, 7)$, $\phi(\cdot, 4)$, $\phi(\cdot, 0)$, $\phi(\cdot, 1)$, $\phi(\cdot, 3)$.]

Large-margin "separation": $\langle w, \phi(x_i, y_i) - \phi(x_i, y) \rangle \ge L(y, y_i) - \xi_i \quad \forall y$

Primal problem:
$\min_{w \in \mathbb{R}^d}\; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \{ L(y_i, y) - \langle w, \phi(x_i, y_i) - \phi(x_i, y) \rangle \}$
(the term $\phi(x_i, y_i) - \phi(x_i, y)$ appears, up to scaling, as the $(i, y)$-th column of $A$)
Structural SVM

"Joint" feature map $\phi: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$; large-margin "separation": $\langle w, \phi(x_i, y_i) - \phi(x_i, y) \rangle \ge L(y, y_i) - \xi_i \;\forall y$

[Figure: OCR sequence-labeling example, with feature vectors $\phi(\text{word image}, y)$ for candidate labelings such as "expected" versus scrambled alternatives.]

Primal problem (as above):
$\min_{w \in \mathbb{R}^d}\; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \{ L(y_i, y) - \langle w, \phi(x_i, y_i) - \phi(x_i, y) \rangle \}$

The inner $\max_y$ requires a decoding oracle: e.g. for the word "donaudampfschifffahrtsgesellschaftskapitän", the output space has size $|\mathcal{Y}| = 26^{42}$.
Binary SVM
  Primal: $\min_{w \in \mathbb{R}^d}\; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max\{0,\, 1 - \langle w, \phi(x_i)\, y_i \rangle\}$
  Dual: $\min_{\alpha \in \mathbb{R}^n}\; f(\alpha) := \frac{\lambda}{2} \lVert A\alpha \rVert^2 - b^T \alpha$ s.t. $0 \le \alpha_i \le 1 \;\forall i \in [n]$

Primal-dual correspondence: $w = A\alpha$

Structural SVM
  Primal: $\min_{w \in \mathbb{R}^d}\; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \{ L(y_i, y) - \langle w, \phi(x_i, y_i) - \phi(x_i, y) \rangle \}$
  Dual: $\min_{\alpha \in \mathbb{R}^{n \cdot |\mathcal{Y}|}}\; f(\alpha) := \frac{\lambda}{2} \lVert A\alpha \rVert^2 - b^T \alpha$ s.t. $\sum_{y \in \mathcal{Y}} \alpha_i(y) = 1 \;\forall i \in [n]$ and $\alpha_i(y) \ge 0 \;\forall i \in [n], \forall y \in \mathcal{Y}$
  (the term $\phi(x_i, y_i) - \phi(x_i, y)$ appears, up to scaling, as the $(i, y)$-th column of $A$)
Binary SVM (primal and dual as above)
• Primal: $d$-dim, unconstrained, non-smooth, strongly convex
• Dual: $n$-dim, box-constrained, smooth, not strongly convex

Optimization algorithms:

| | primal | dual |
|---|---|---|
| batch ($n$ cost per iteration) | subgradient descent, $O(\frac{R^2}{\lambda \varepsilon})$ | Frank-Wolfe = cutting plane (SVM-light) |
| online ($1$ cost per iteration) | stochastic subgradient (SGD, Pegasos) | coordinate descent (Hsieh 2008) = block-coordinate descent = block-coordinate Frank-Wolfe |
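For the binary SVM dual, each block is a single interval $[0,1]$, and a block-coordinate Frank-Wolfe step with exact line search clips to the box exactly as a single-coordinate minimization does, which is why the table identifies it with the dual coordinate descent of Hsieh et al. (2008). A minimal sketch follows, assuming the column scaling $a_i = y_i \phi(x_i)/(\lambda n)$ and $b = \frac{1}{n}\mathbf{1}$ (this scaling, $\phi(x_i) = x_i$, and all names are assumptions of the sketch):

```python
import numpy as np

def bcfw_binary_svm(X, y, lam, K, seed=0):
    """Block-coordinate Frank-Wolfe on the binary SVM dual
    f(alpha) = (lam/2)||A alpha||^2 - b^T alpha, 0 <= alpha_i <= 1,
    assuming columns a_i = y_i x_i / (lam n) and b = (1/n) 1."""
    n, d = X.shape
    A = (y[:, None] * X).T / (lam * n)       # d x n; i-th column is a_i
    alpha, w = np.zeros(n), np.zeros(d)      # maintain w = A alpha
    rng = np.random.default_rng(seed)
    for k in range(K):
        i = rng.integers(n)
        grad_i = lam * (A[:, i] @ w) - 1.0 / n
        s_i = 1.0 if grad_i < 0 else 0.0     # linear minimizer on [0, 1]
        delta = s_i - alpha[i]
        denom = lam * delta**2 * (A[:, i] @ A[:, i])
        if denom <= 0.0:
            continue                         # already at the block optimum
        # exact line search of the quadratic, clipped to [0, 1]:
        gamma = min(max(-grad_i * delta / denom, 0.0), 1.0)
        alpha[i] += gamma * delta
        w += gamma * delta * A[:, i]         # keep w = A alpha in sync
    return w, alpha

# illustrative data
rng = np.random.default_rng(2)
X, y = rng.standard_normal((100, 5)), np.sign(rng.standard_normal(100))
w, alpha = bcfw_binary_svm(X, y, lam=0.1, K=2000)
```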
Structural SVM (primal and dual as above)
• Primal: $d$-dim, unconstrained, non-smooth, strongly convex
• Dual: $n|\mathcal{Y}|$-dim, block-constrained, smooth, not strongly convex

Optimization algorithms:

| | primal | dual |
|---|---|---|
| batch ($n$ cost per iteration) | subgradient descent, $O(\frac{R^2}{\lambda \varepsilon})$ | Frank-Wolfe = cutting plane (SVM-struct) |
| online ($1$ cost per iteration) | stochastic subgradient (SGD, Pegasos) | block coordinate descent (Nesterov); block-coordinate Frank-Wolfe |
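For the structural SVM dual, the $\alpha$ blocks are exponentially large but never need to be stored: each block-coordinate Frank-Wolfe step touches one training example and makes a single call to the loss-augmented decoding oracle. Here is a minimal sketch in the spirit of Lacoste-Julien et al. (2013), using the fixed step size of Algorithm 3 and a toy multiclass instance; the toy setup and all names are assumptions of the sketch, not the original method's code.

```python
import numpy as np

def bcfw_struct_svm(phi, loss, decode, examples, lam, d, K, seed=0):
    """BCFW for the structural SVM dual, storing only primal quantities.
    `decode(w, x, y_true)` is assumed to return the loss-augmented
    maximizer argmax_y loss(y_true, y) + <w, phi(x, y)>."""
    n = len(examples)
    w = np.zeros(d)
    w_blocks = [np.zeros(d) for _ in range(n)]  # block i's share of w
    rng = np.random.default_rng(seed)
    for k in range(K):
        i = rng.integers(n)
        x, y_true = examples[i]
        y_star = decode(w, x, y_true)           # one oracle call per step
        # corner of the i-th simplex block, mapped to the primal:
        w_s = (phi(x, y_true) - phi(x, y_star)) / (lam * n)
        gamma = 2.0 * n / (k + 2.0 * n)         # step size from Algorithm 3
        w_new_i = (1 - gamma) * w_blocks[i] + gamma * w_s
        w += w_new_i - w_blocks[i]              # maintain w = sum_i w_i
        w_blocks[i] = w_new_i
    return w

# toy multiclass instance: Y = {0,1,2}, phi stacks x into label y's block
rng = np.random.default_rng(3)
X, Y = rng.standard_normal((30, 4)), rng.integers(3, size=30)
phi = lambda x, y: np.concatenate([x if c == y else np.zeros(4) for c in range(3)])
loss = lambda yt, y: float(yt != y)
decode = lambda w, x, yt: max(range(3), key=lambda y: loss(yt, y) + w @ phi(x, y))
w = bcfw_struct_svm(phi, loss, decode, list(zip(X, Y)), lam=0.1, d=12, K=1000)
```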