
Composite Objective Mirror Descent
John C. Duchi (1,3), Shai Shalev-Shwartz (2), Yoram Singer (3), Ambuj Tewari (4)
(1) University of California, Berkeley; (2) Hebrew University of Jerusalem, Israel; (3) Google Research; (4) Toyota Technological Institute, Chicago


  1. Composite Objective Mirror Descent (title slide). John C. Duchi (1,3), Shai Shalev-Shwartz (2), Yoram Singer (3), Ambuj Tewari (4). June 29, 2010.

  2. Large-scale logistic regression
      Problem (n huge, n >> 1):
          \min_x \; \frac{1}{n} \sum_{i=1}^n \log\bigl(1 + \exp(\langle a_i, x \rangle)\bigr) + \lambda \|x\|_1
      (the first, smooth term is f(x)).
      "Usual" approach: online gradient descent (Zinkevich '03). Let g_t = \nabla \log(1 + \exp(\langle a_t, x_t \rangle)) and take
          x_{t+1} = x_t - \eta_t g_t - \eta_t \lambda \, \mathrm{sign}(x_t),
      then perform online-to-batch conversion.
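
For concreteness, here is a minimal NumPy sketch of this "usual" step; the names (logistic_grad, ogd_l1_step, eta, lam) and the synthetic data are illustrative, not from the slides.

```python
import numpy as np

def logistic_grad(a, x):
    # Gradient of log(1 + exp(<a, x>)) with respect to x.
    return a / (1.0 + np.exp(-np.dot(a, x)))

def ogd_l1_step(x, a, eta, lam):
    # "Usual" step: subgradient of the loss plus lambda * sign(x)
    # for the l1 term, all folded into a single gradient step.
    return x - eta * logistic_grad(a, x) - eta * lam * np.sign(x)

# Toy run on synthetic data: note the iterates are essentially never sparse.
rng = np.random.default_rng(0)
x = np.zeros(5)
for t in range(100):
    x = ogd_l1_step(x, rng.normal(size=5), eta=0.1, lam=0.01)
print(x)
```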

  3. Problems with usual approach
      ◮ Regret bound/convergence rate: set G = \max_t \|g_t + \lambda\,\mathrm{sign}(x_t)\|_2; then
            f(x_T) + \lambda\|x_T\|_1 = f(x^*) + \lambda\|x^*\|_1 + O\!\Bigl(\frac{\|x^*\|_2\, G}{\sqrt{T}}\Bigr)
        But G = \Theta(\sqrt{d}): an extra penalty from the sign(x_t) term
      ◮ No sparsity in x_T

  4. Problems with usual approach
      ◮ Regret bound/convergence rate: set G = \max_t \|g_t + \lambda\,\mathrm{sign}(x_t)\|_2; then
            f(x_T) + \lambda\|x_T\|_1 = f(x^*) + \lambda\|x^*\|_1 + O\!\Bigl(\frac{\|x^*\|_2\, G}{\sqrt{T}}\Bigr)
        But G = \Theta(\sqrt{d}): an extra penalty from the sign(x_t) term
      ◮ No sparsity in x_T
      ◮ Why should we suffer from the ‖·‖₁ term?

  5. Online Gradient Descent
      Let g_t = \nabla \log(1 + \exp(\langle a_t, x_t \rangle)) + \lambda\,\mathrm{sign}(x_t). OGD step (Zinkevich '03):
          x_{t+1} = x_t - \eta g_t = \arg\min_x \Bigl\{ \eta \langle g_t, x \rangle + \tfrac{1}{2}\|x - x_t\|_2^2 \Bigr\}
      [Figure: the objective f(x) + \lambda\|x\|_1 and its linearization f(x_t) + \langle g_t, x - x_t \rangle at x_t]

  6. Online Gradient Descent
      Let g_t = \nabla \log(1 + \exp(\langle a_t, x_t \rangle)) + \lambda\,\mathrm{sign}(x_t). OGD step (Zinkevich '03):
          x_{t+1} = x_t - \eta g_t = \arg\min_x \Bigl\{ \eta \langle g_t, x \rangle + \tfrac{1}{2}\|x - x_t\|_2^2 \Bigr\}
      [Figure: the mirror-descent view, minimizing \langle g_t, x \rangle + B_\psi(x, x_t)]

  7. Problems with Subgradient Methods
      ◮ Subgradients are non-informative at singularities

  10. Composite Objective Approach
      Let g_t = \nabla \log(1 + \exp(\langle a_t, x_t \rangle)). Truncated gradient (Langford et al. '08, Duchi & Singer '09):
          x_{t+1} = \arg\min_x \Bigl\{ \tfrac{1}{2}\|x - x_t\|_2^2 + \eta \langle g_t, x \rangle + \eta\lambda\|x\|_1 \Bigr\} = \mathrm{sign}(x_t - \eta g_t) \odot \bigl[\,|x_t - \eta g_t| - \eta\lambda\,\bigr]_+
      [Figure: the shrinkage operator mapping x_t - \eta g_t to [\,|x_t - \eta g_t| - \eta\lambda\,]_+]
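
The closed-form update above is just a gradient step on the loss alone followed by coordinatewise soft-thresholding; a small sketch (names are illustrative):

```python
import numpy as np

def truncated_gradient_step(x, g, eta, lam):
    # Gradient step on the loss only, then soft-threshold by eta*lam:
    # x_{t+1} = sign(x_t - eta*g) * [|x_t - eta*g| - eta*lam]_+
    y = x - eta * g
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)
```

Coordinates whose magnitude falls below ηλ are set exactly to zero, which is where the sparsity comes from.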

  11. Composite Objective Approach
      Update is x_{t+1} = \mathrm{sign}(x_t - \eta g_t) \odot [\,|x_t - \eta g_t| - \eta\lambda\,]_+. Two nice things:
      ◮ Sparsity from [\,\cdot\,]_+
      ◮ Convergence rate: let G = \max_t \|g_t\|_2; then
            f(x_T) + \lambda\|x_T\|_1 = f(x^*) + \lambda\|x^*\|_1 + O\!\Bigl(\frac{\|x^*\|_2\, G}{\sqrt{T}}\Bigr)
        No extra penalty from the \lambda\|x\|_1 term!

  12. Abstraction to Regularized Online Convex Optimization
      Repeat:
      ◮ Learner plays point x_t
      ◮ Receive f_t + ϕ (ϕ known)
      ◮ Suffer loss f_t(x_t) + ϕ(x_t)
      Goal: attain small regret
          R(T) := \sum_{t=1}^T \bigl[f_t(x_t) + \varphi(x_t)\bigr] - \inf_{x \in \mathcal{X}} \sum_{t=1}^T \bigl[f_t(x) + \varphi(x)\bigr]
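
One way to picture the protocol is as a loop plus a regret bookkeeping function. A minimal sketch, where the learner interface (play/update), the list of loss functions, and x_star standing in for the infimum are all hypothetical placeholders:

```python
def run_protocol(learner, losses, phi):
    # Regularized online convex optimization: the learner plays x_t,
    # then sees f_t; the regularizer phi is known throughout.
    played = []
    for f_t in losses:
        x_t = learner.play()
        played.append(x_t)
        learner.update(f_t)          # e.g. a Comid step on grad f_t(x_t)
    return played

def regret(played, losses, phi, x_star):
    # R(T) = sum_t [f_t(x_t) + phi(x_t)] - sum_t [f_t(x_star) + phi(x_star)]
    return (sum(f(x) + phi(x) for f, x in zip(losses, played))
            - sum(f(x_star) + phi(x_star) for f in losses))
```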

  13. Composite Objective MIrror Descent
      Let g_t = \nabla f_t(x_t). Comid step:
          x_{t+1} = \arg\min_{x \in \mathcal{X}} \bigl\{ B_\psi(x, x_t) + \eta \langle g_t, x \rangle + \eta\varphi(x) \bigr\}
      [Figure: the loss f(x), the regularizer ϕ(x), and the composite objective f(x) + ϕ(x)]
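
A generic Comid step has no closed form for arbitrary ψ and ϕ. The sketch below only makes the definition concrete by solving the argmin numerically; scipy's general-purpose minimizer stands in for the specialized closed-form solutions on the later slides, and psi, grad_psi, phi are caller-supplied placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def bregman(psi, grad_psi, x, y):
    # B_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

def comid_step(x_t, g_t, eta, psi, grad_psi, phi):
    # x_{t+1} = argmin_x  B_psi(x, x_t) + eta*<g_t, x> + eta*phi(x)
    # (unconstrained illustration; the slides allow a feasible set X).
    obj = lambda x: bregman(psi, grad_psi, x, x_t) + eta * np.dot(g_t, x) + eta * phi(x)
    return minimize(obj, x_t, method="Nelder-Mead").x
```

With ψ(x) = ½‖x‖² and ϕ(x) = λ‖x‖₁ this reduces to the truncated-gradient step from the previous slides.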

  14. Composite Objective MIrror Descent
      Let g_t = \nabla f_t(x_t). Comid step:
          x_{t+1} = \arg\min_{x \in \mathcal{X}} \bigl\{ B_\psi(x, x_t) + \eta \langle g_t, x \rangle + \eta\varphi(x) \bigr\}
      [Figure: the Comid model B_\psi(x, x_t) + \langle g, x \rangle + \varphi(x) overlaid on f(x) + ϕ(x)]

  15. Convergence Results
      Old (online gradient/mirror descent). Theorem: For any x^* \in \mathcal{X},
          \sum_{t=1}^T \bigl[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\bigr] \le \frac{1}{\eta} B_\psi(x^*, x_1) + \frac{\eta}{2} \sum_{t=1}^T \|\nabla f_t(x_t) + \nabla\varphi(x_t)\|_*^2

  16. Convergence Results
      Old (online gradient/mirror descent). Theorem: For any x^* \in \mathcal{X},
          \sum_{t=1}^T \bigl[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\bigr] \le \frac{1}{\eta} B_\psi(x^*, x_1) + \frac{\eta}{2} \sum_{t=1}^T \|\nabla f_t(x_t) + \nabla\varphi(x_t)\|_*^2
      New (Comid). Theorem: For any x^* \in \mathcal{X},
          \sum_{t=1}^T \bigl[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\bigr] \le \frac{1}{\eta} B_\psi(x^*, x_1) + \frac{\eta}{2} \sum_{t=1}^T \|\nabla f_t(x_t)\|_*^2
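
To recover the O(1/√T) rates quoted earlier from the new bound, bound ‖∇f_t(x_t)‖_* by a constant G and balance the two terms in η, a standard step the slide leaves implicit:

```latex
\sum_{t=1}^T \bigl[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\bigr]
  \;\le\; \frac{1}{\eta} B_\psi(x^*, x_1) + \frac{\eta}{2}\, T G^2,
\qquad
\eta = \sqrt{\frac{2\, B_\psi(x^*, x_1)}{T G^2}}
\;\;\Longrightarrow\;\;
R(T) \;\le\; G \sqrt{2\, B_\psi(x^*, x_1)\, T}.
```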

  17. Derived Algorithms
      ◮ FOBOS (Duchi & Singer, 2009)
      ◮ p-norm divergences
      ◮ Mixed-norm regularization
      ◮ Matrix Comid

  18. p-norms
      Better ℓ1 algorithms: ϕ(x) = λ‖x‖₁

  19. p-norms
      Better ℓ1 algorithms: ϕ(x) = λ‖x‖₁
      ◮ Idea: non-Euclidean geometry (e.g. dense gradients, sparse x^*)
      ◮ Recall that for 1 < p ≤ 2, \frac{1}{2(p-1)}\|x\|_p^2 is strongly convex over R^d w.r.t. ℓ_p
      ◮ Take \psi(x) = \frac{1}{2(p-1)}\|x\|_p^2
      Corollary: When \|f'_t(x_t)\|_\infty \le G_\infty, take p = 1 + 1/\log d to get
          R(T) = O\!\bigl(\|x^*\|_1 \, G_\infty \sqrt{T \log d}\bigr)
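
The p-norm divergence enters the algorithm through the link function ∇ψ and its inverse. Below is a sketch of the standard maps for ψ(x) = ½‖x‖_p² (the 1/(p−1) factor only rescales the gradient and is dropped here for simplicity); the function names are ours.

```python
import numpy as np

def pnorm_link(x, p):
    # Gradient of 0.5 * ||x||_p^2:
    # [grad psi(x)]_j = sign(x_j) * |x_j|^(p-1) / ||x||_p^(p-2)
    nrm = np.linalg.norm(x, ord=p)
    if nrm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) / nrm ** (p - 2)

def pnorm_link_inv(theta, p):
    # The inverse map is the same formula with the dual exponent q = p/(p-1).
    return pnorm_link(theta, p / (p - 1.0))
```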

  20. Derived p-norm algorithms
      SMIDAS (Shalev-Shwartz & Tewari 2009): take ϕ(x) = λ‖x‖₁. Assume sign([∇ψ(x)]_j) = sign(x_j), and define S_λ(z) = sign(z) · [|z| − λ]_+. Then
          x_{t+1} = (\nabla\psi)^{-1}\bigl( S_{\eta\lambda}\bigl( \nabla\psi(x_t) - \eta f'_t(x_t) \bigr) \bigr)
      [Figure: soft-thresholding of the intermediate point x_{t+1/2} by ηλ to obtain x_{t+1}]
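
Putting the pieces together, a hedged sketch of the SMIDAS-style update; the link maps can be the ones from the previous sketch, and all names here are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    # S_lam(z) = sign(z) * [|z| - lam]_+
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def smidas_step(x_t, g_t, eta, lam, grad_psi, grad_psi_inv):
    # Mirror step on the loss gradient, soft-threshold in the dual,
    # then map back: x_{t+1} = (grad psi)^{-1}( S_{eta*lam}( grad psi(x_t) - eta*g_t ) )
    theta = grad_psi(x_t) - eta * g_t
    return grad_psi_inv(soft_threshold(theta, eta * lam))

# Example wiring (p chosen as on the previous slide, link maps from that sketch):
# p = 1 + 1 / np.log(d)
# grad_psi     = lambda x: pnorm_link(x, p)
# grad_psi_inv = lambda t: pnorm_link_inv(t, p)
```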

  21. Comid with mixed norms
          \varphi(X) = \|X\|_{\ell_1/\ell_q} = \sum_{j=1}^d \|x^j\|_q, \qquad X = \begin{bmatrix} x^1 \\ x^2 \\ \vdots \\ x^d \end{bmatrix} \;\Longrightarrow\; \begin{bmatrix} \|x^1\|_q \\ \|x^2\|_q \\ \vdots \\ \|x^d\|_q \end{bmatrix}
      ◮ Separable and solvable using previous methods
      ◮ Multitask and multiclass learning
      ◮ x^j associated with feature j
      ◮ Penalize x^j once
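
In code, the ℓ1/ℓq penalty is just a q-norm per feature row followed by a sum, which is what makes the resulting Comid step separable across rows. A small sketch; here X stacks one row per feature and one column per task or class.

```python
import numpy as np

def l1_lq_norm(X, q):
    # phi(X) = sum_j ||x^j||_q, where x^j is row j of X
    # (all parameters tied to feature j across tasks/classes).
    return float(np.sum(np.linalg.norm(X, ord=q, axis=1)))
```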

  22. Mixed-norm p-norm algorithms
      Specialize problem to
          \min_x \; \langle v, x \rangle + \tfrac{1}{2}\|x\|_p^2 + \lambda\|x\|_\infty
      ◮ Closed form? No.

  23. Mixed-norm p-norm algorithms
      Specialize problem to
          \min_x \; \langle v, x \rangle + \tfrac{1}{2}\|x\|_p^2 + \lambda\|x\|_\infty
      ◮ Closed form? No.
      ◮ Dual problem (x^* = v - \beta):
          \min_\beta \; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda

  24. Mixed-norm p-norm algorithms
      Problem:
          \min_\beta \; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda
      Observation: monotonicity of β, so v_i ≥ v_j implies β_i ≥ β_j

  25. Mixed-norm p-norm algorithms
      Problem:
          \min_\beta \; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda
      Observation: monotonicity of β, so v_i ≥ v_j implies β_i ≥ β_j
      Root-finding problem:
          \lambda = \sum_{i=1}^d \beta_i(\theta) = \sum_{i=1}^d \bigl[\, v_i - \theta^{1/(q-1)} \,\bigr]_+
      Solve with median-like search
      [Figure: β_i(θ) as a function of θ, with breakpoints at values such as v_4, v_6, v_8]
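
A hedged sketch of this root-finding step: it uses plain bisection on θ rather than the faster median-like search named on the slide, assumes v has nonnegative entries (signs can be stripped off and restored separately), and takes β_i(θ) in the reconstructed form above, which should be checked against the paper.

```python
import numpy as np

def solve_linf_dual(v, lam, q, tol=1e-10, max_iter=200):
    # Find theta >= 0 with sum_i [v_i - theta^(1/(q-1))]_+ = lam,
    # then return beta(theta). Assumes v >= 0 entrywise and q > 1.
    if np.sum(v) <= lam:
        return v.copy()                            # constraint ||beta||_1 <= lam is slack
    beta = lambda theta: np.maximum(v - theta ** (1.0 / (q - 1.0)), 0.0)
    lo, hi = 0.0, float(np.max(v)) ** (q - 1.0)    # at hi, beta(hi) == 0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.sum(beta(mid)) > lam:
            lo = mid                               # need a larger theta
        else:
            hi = mid
        if hi - lo < tol:
            break
    return beta(hi)
```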

  26. Matrix Comid
      Idea: get sparsity in the spectrum of X ∈ R^{d_1 × d_2}. Take
          \varphi(X) = |||X|||_1 = \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)

  27. Matrix Comid
      Idea: get sparsity in the spectrum of X ∈ R^{d_1 × d_2}. Take
          \varphi(X) = |||X|||_1 = \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)
      Schatten p-norms: apply p-norms to the singular values of X ∈ R^{d_1 × d_2}:
          |||X|||_p = \|\sigma(X)\|_p = \Bigl( \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)^p \Bigr)^{1/p}
      Important fact: for 1 < p ≤ 2,
          \psi(X) = \frac{1}{2(p-1)} |||X|||_p^2
      is strongly convex w.r.t. |||·|||_p (Ball et al., 1994)
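
For intuition only, here is the simplest matrix instance: with the Frobenius (Euclidean) geometry rather than the Schatten-p divergence the slide advocates, the Comid step with ϕ(X) = |||X|||₁ soft-thresholds the singular values. A sketch, not the general matrix Comid.

```python
import numpy as np

def nuclear_prox(Y, tau):
    # argmin_X 0.5*||X - Y||_F^2 + tau*|||X|||_1 : soft-threshold the spectrum.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def matrix_comid_step_frobenius(X_t, G_t, eta, lam):
    # Euclidean special case of the Comid step with phi = trace norm.
    return nuclear_prox(X_t - eta * G_t, eta * lam)
```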
