Efficient Bregman Projections Onto the Simplex

Walid Krichene, Syrine Krichene, Alexandre Bayen
Electrical Engineering and Computer Sciences, UC Berkeley; ENSIMAG and Criteo Labs, France


1. Title slide: Efficient Bregman Projections Onto the Simplex. December 16, 2015.

2. Outline
1 Introduction
2 Projection Algorithms
3 Numerical experiments



5. Bregman projections onto the simplex
Bregman projections are the building block of mirror descent (Nemirovski and Yudin) and of dual averaging (Nesterov).
Convex optimization: min_{x ∈ X} f(x). Online learning (regret minimization).

Algorithm 2 Mirror descent method
1: for τ ∈ ℕ do
2:   Query a subgradient vector g^(τ) ∈ ∂f(x^(τ)) (or loss vector)
3:   Update
       x^(τ+1) = arg min_{x ∈ X} D_ψ(x, (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ g^(τ)))    (1)

ψ: strongly convex distance-generating function. D_ψ: Bregman divergence.
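When ψ is the negative entropy, update (1) has a closed form: the multiplicative-weights rule. A minimal Python sketch; the toy objective f(x) = ½‖x − c‖² and the step size η = 0.5 are illustrative assumptions, not from the slides:

```python
import numpy as np

def entropic_mirror_descent_step(x, g, eta):
    """One mirror descent step on the simplex with the entropy DGF.

    With psi(x) = sum_i x_i ln x_i, update (1) reduces to the
    multiplicative-weights rule: x_i <- x_i * exp(-eta * g_i), renormalized.
    """
    w = x * np.exp(-eta * g)
    return w / w.sum()

# Toy objective f(x) = 0.5 * ||x - c||^2, whose minimizer over the simplex is c
c = np.array([0.2, 0.5, 0.3])
x = np.ones(3) / 3
for _ in range(500):
    x = entropic_mirror_descent_step(x, x - c, eta=0.5)  # gradient of f is x - c
```

The iterates stay on the simplex by construction, since each step renormalizes.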

6. Illustration of Bregman projections
[Figure: illustration of a mirror descent iteration. The gradient step ∇ψ(x^(τ)) − η_τ g^(τ) is taken in the dual space E*, then mapped back to the primal space E by (∇ψ)^{-1} and projected onto X to obtain x^(τ+1).]
x^(τ+1) = arg min_{x ∈ X} D_ψ(x, (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ g^(τ)))


8. More precisely
The feasible set is the simplex (or a Cartesian product of simplexes):
  Δ = { x ∈ ℝ^d_+ : Σ_i x_i = 1 }
Motivation: online learning, optimization over probability distributions.
The distance-generating function is induced by a potential:
  ψ(x) = Σ_i f(x_i),   f(x) = ∫_1^x φ^{-1}(u) du,
where φ is increasing, called the potential.
Consequence: known expressions for ∇ψ and (∇ψ)^{-1}.
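For instance (an illustrative check, not from the slides): the entropy case corresponds to the potential φ(u) = e^{u−1}, so φ^{-1}(x) = 1 + ln x, f(x) = ∫_1^x (1 + ln u) du = x ln x, and ∇ψ and (∇ψ)^{-1} act coordinatewise through φ^{-1} and φ:

```python
import numpy as np

# Entropy potential: phi(u) = exp(u - 1), with inverse phi^{-1}(x) = 1 + ln(x).
# Then f(x) = x ln x, so psi(x) = sum_i x_i ln x_i (negative entropy), and
# grad(psi) acts coordinatewise via phi^{-1}, its inverse via phi.
def phi(u):
    return np.exp(u - 1.0)

def phi_inv(x):
    return 1.0 + np.log(x)

x = np.array([0.2, 0.5, 0.3])
grad_psi = phi_inv(x)                 # grad psi(x)_i = 1 + ln(x_i)
assert np.allclose(phi(grad_psi), x)  # (grad psi)^{-1} inverts grad psi
```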


10. Projection algorithms
General strategy:
1. Derive optimality conditions.
2. Design an algorithm to satisfy those conditions.


12. Optimality conditions
  x* = arg min_{x ∈ X} D_ψ(x, (∇ψ)^{-1}(∇ψ(x̄) − ḡ))
Optimality conditions: x* is optimal if and only if there exists ν* ∈ ℝ such that
  ∀i, x*_i = (φ(φ^{-1}(x̄_i) − ḡ_i + ν*))_+ ,  and  Σ_{i=1}^d x*_i = 1.
Proof: write the KKT conditions, then eliminate complementary slackness.
Comments:
- This reduces a problem in dimension d to a problem in dimension 1.
- The function c : ν ↦ Σ_i (φ(φ^{-1}(x̄_i) − ḡ_i + ν))_+ is increasing.
- One can therefore solve c(ν*) = 1 for ν* by bisection.

13. Bisection algorithm for general divergences
Algorithm 3 Bisection method to compute the projection x* with precision ε.
1: Input: x̄, ḡ, ε.
2: Initialize the bracket (upper bound ν̄, lower bound ν):
     ν̄ = φ^{-1}(1) − max_i (φ^{-1}(x̄_i) − ḡ_i)
     ν = φ^{-1}(1/d) − max_i (φ^{-1}(x̄_i) − ḡ_i)
3: while c(ν̄) − c(ν) > ε do
4:   ν⁺ ← (ν̄ + ν)/2
5:   if c(ν⁺) > 1 then
6:     ν̄ ← ν⁺
7:   else
8:     ν ← ν⁺
9: Return x̃(ν̄)_i = (φ(φ^{-1}(x̄_i) − ḡ_i + ν̄))_+

Theorem: the algorithm terminates after O(ln(1/ε)) iterations, and outputs x̃ such that ‖x̃(ν̄) − x*‖ ≤ ε.
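A minimal Python sketch of Algorithm 3 (the function names are illustrative; the entropy potential used in the check at the end is an assumption chosen because that case has a known closed form):

```python
import numpy as np

def bregman_projection_bisection(x_bar, g_bar, phi, phi_inv, tol=1e-8):
    """Bisection on the dual variable nu (Algorithm 3).

    c(nu) = sum_i (phi(phi^{-1}(x_bar_i) - g_bar_i + nu))_+ is increasing,
    so we bracket the root of c(nu) = 1 and bisect.
    """
    d = len(x_bar)
    base = phi_inv(x_bar) - g_bar
    m = base.max()
    nu_hi = phi_inv(1.0) - m        # guarantees c(nu_hi) >= 1
    nu_lo = phi_inv(1.0 / d) - m    # guarantees c(nu_lo) <= 1

    def c(nu):
        return np.maximum(phi(base + nu), 0.0).sum()

    while c(nu_hi) - c(nu_lo) > tol:
        nu_mid = 0.5 * (nu_hi + nu_lo)
        if c(nu_mid) > 1.0:
            nu_hi = nu_mid
        else:
            nu_lo = nu_mid
    return np.maximum(phi(base + nu_hi), 0.0)  # x~(nu_hi), feasible up to tol

# Check against the entropy case, where the projection has a closed form
phi = lambda u: np.exp(u - 1.0)
phi_inv = lambda x: 1.0 + np.log(x)
x_bar = np.array([0.2, 0.5, 0.3])
g_bar = np.array([1.0, -0.5, 0.3])
x = bregman_projection_bisection(x_bar, g_bar, phi, phi_inv)
closed = x_bar * np.exp(-g_bar)
closed /= closed.sum()
```

Each evaluation of c costs O(d), and the bracket halves at every iteration, matching the O(ln(1/ε)) bound.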


17. Exact projections for exponential divergences
Special case 1: ψ(x) = ‖x‖², for which the solution can be computed exactly [1].
Special case 2: exponential divergence, with potential
  φ_ε : (−∞, +∞) → (−ε, +∞),  u ↦ e^{u−1} − ε.
For ε = 0: ψ(x) = H(x) = Σ_i x_i ln x_i (negative entropy), and D_ψ(x, y) = D_KL(x, y).
For ε > 0: ψ(x) = H(x + ε), and D_ψ(x, y) = D_KL(x + ε, y + ε).
[1] J. Duchi, S. Shalev-Shwartz, Y. Singer, T. Chandra, Efficient Projections onto the ℓ1 Ball for Learning in High Dimensions, ICML 2008.


19. Motivation
Bregman projection with the KL divergence:
- Hedge algorithm in online learning.
- Multiplicative weights algorithm.
- Exponentiated gradient descent.
- Has a closed-form solution in O(d).
However:
- D_KL(x, y) is unbounded on the simplex (problematic for stochastic mirror descent).
- H(x) is not a smooth function (problematic for accelerated mirror descent).
Taking ε > 0 solves both issues.
[Figure: comparison on [0, 1] of D_KL(x, y₀) and D_KL,ε(x, y₀) with the quadratic bounds (ℓ_ε/2)‖x − y₀‖₁² and (L_ε/2)‖x − y₀‖₁².]

20. Optimality conditions
Recall the general optimality condition: x*_i = (φ(φ^{-1}(x̄_i) − ḡ_i + ν*))_+.
Optimality conditions with the exponential divergence: let x* be the solution and I = { i : x*_i > 0 } its support. Then
  ∀i ∈ I,  x*_i = −ε + (x̄_i + ε) e^{−ḡ_i} / Z*,   with   Z* = Σ_{i ∈ I} (x̄_i + ε) e^{−ḡ_i} / (1 + |I| ε).    (2)
Furthermore, if ȳ_i = (x̄_i + ε) e^{−ḡ_i}, then (i ∈ I and ȳ_j > ȳ_i) ⟹ j ∈ I.

21. A sorting-based algorithm
Algorithm 4 Sorting method to compute the Bregman projection with D_ε
1: Input: x̄, ḡ
2: Output: x*
3: Form the vector ȳ_i = (x̄_i + ε) e^{−ḡ_i}
4: Sort ȳ; let ȳ_σ(i) be the i-th smallest element of ȳ.
5: Let j* be the smallest index for which
     (1 + ε(d − j + 1)) ȳ_σ(j) − ε Σ_{i ≥ j} ȳ_σ(i) > 0
6: Set Z = Σ_{i ≥ j*} ȳ_σ(i) / (1 + ε(d − j* + 1))
7: Set x*_i = (−ε + ȳ_i / Z)_+
Complexity: O(d ln d).
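A direct Python transcription of Algorithm 4 (vectorized with NumPy; with 0-based indices the slide's d − j + 1 becomes d − j):

```python
import numpy as np

def exp_divergence_projection_sort(x_bar, g_bar, eps):
    """Bregman projection onto the simplex for D_eps by sorting, O(d log d)."""
    d = len(x_bar)
    y = (x_bar + eps) * np.exp(-g_bar)
    y_sorted = np.sort(y)                     # y_sorted[j] = (j+1)-th smallest
    suffix = np.cumsum(y_sorted[::-1])[::-1]  # suffix[j] = sum_{i >= j} y_sorted[i]
    # Smallest (0-based) j with (1 + eps*(d - j)) * y_sorted[j] - eps * suffix[j] > 0
    cond = (1.0 + eps * (d - np.arange(d))) * y_sorted - eps * suffix > 0
    j_star = int(np.argmax(cond))             # index of the first True entry
    Z = suffix[j_star] / (1.0 + eps * (d - j_star))
    return np.maximum(-eps + y / Z, 0.0)
```

For eps = 0 the condition selects every index, Z is the total mass of ȳ, and the output reduces to the multiplicative-weights update ȳ / Σ_i ȳ_i, as expected from the ε = 0 special case.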

22. A randomized-pivot algorithm
Adapted from the QuickSelect algorithm: select the i-th smallest element of a vector ȳ.
One can sort and then return the i-th element: O(d ln d).
QuickSelect: expected O(d), worst case O(d²).
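The selection subroutine itself, as plain QuickSelect with a random pivot (how the slides adapt it to locate j* without fully sorting is not spelled out here, so this is only the generic routine):

```python
import random

def quickselect(arr, k):
    """Return the k-th smallest element (1-based) via random pivoting.

    Expected O(n): each partition discards, on average, a constant fraction
    of the array. Worst case O(n^2) when pivots are consistently unlucky.
    """
    pivot = random.choice(arr)
    lo = [v for v in arr if v < pivot]
    eq = [v for v in arr if v == pivot]
    hi = [v for v in arr if v > pivot]
    if k <= len(lo):
        return quickselect(lo, k)
    if k <= len(lo) + len(eq):
        return pivot
    return quickselect(hi, k - len(lo) - len(eq))
```

On the slides' example array (9 1 4 8 7 2 3 5 6) with k = 5, the routine returns 5.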

23. A randomized-pivot algorithm
[Animation over several slides: QuickSelect on the array (9 1 4 8 7 2 3 5 6) with k = 5. A random pivot partitions the array into (1 2) and (9 4 8 7 3 5 6); since the left part holds the 2 smallest elements, the search recurses on the right part with k = 3.]
