Efficient Bregman Projections Onto the Simplex
Walid Krichene, Syrine Krichene, Alexandre Bayen
Electrical Engineering and Computer Sciences, UC Berkeley; ENSIMAG and Criteo Labs, France
December 16, 2015
Outline
1. Introduction
2. Projection Algorithms
3. Numerical experiments
Bregman Projections onto the Simplex

Bregman projections are the building block of mirror descent (Nemirovski and Yudin) and dual averaging (Nesterov).
- Convex optimization: $\min_{x \in X} f(x)$
- Online learning (regret minimization).

Algorithm 2: Mirror descent method
1: for $\tau \in \mathbb{N}$ do
2:   Query a subgradient vector $g^{(\tau)} \in \partial f(x^{(\tau)})$ (or loss vector)
3:   Update
$$x^{(\tau+1)} = \arg\min_{x \in X} D_\psi\!\big(x,\ (\nabla\psi)^{-1}(\nabla\psi(x^{(\tau)}) - \eta_\tau g^{(\tau)})\big) \qquad (1)$$

$\psi$: strongly convex distance-generating function. $D_\psi$: Bregman divergence.
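For concreteness, here is a minimal sketch (written for this note, not from the talk) of one mirror descent step in the special case of the entropic DGF on the simplex, where update (1) has a closed form; the function name and the step size are illustrative assumptions.

```python
import numpy as np

def mirror_descent_step(x, g, eta):
    """One iterate of (1) for psi(x) = sum_i x_i ln x_i (entropic DGF).

    Here grad psi(x)_i = 1 + ln x_i, so the dual step followed by the
    inverse map gives x_i * exp(-eta * g_i), and the Bregman (KL)
    projection onto the simplex reduces to a normalization.
    """
    w = x * np.exp(-eta * g)
    return w / w.sum()

x = np.full(4, 0.25)                 # uniform starting point
g = np.array([1.0, 0.0, -1.0, 0.5])  # a subgradient / loss vector
x_next = mirror_descent_step(x, g, eta=0.1)
```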
Illustration of Bregman projections

[Figure: illustration of a mirror descent iteration. The iterate $x^{(\tau)} \in X$ is mapped to the dual space $E^*$ by $\nabla\psi$, shifted by $-\eta_\tau g^{(\tau)}$, mapped back to the primal space $E$ by $(\nabla\psi)^{-1}$, then Bregman-projected onto $X$ to obtain $x^{(\tau+1)}$.]

$$x^{(\tau+1)} = \arg\min_{x \in X} D_\psi\!\big(x,\ (\nabla\psi)^{-1}(\nabla\psi(x^{(\tau)}) - \eta_\tau g^{(\tau)})\big)$$
More precisely

Feasible set is the simplex (or a Cartesian product of simplexes)
$$\Delta = \Big\{ x \in \mathbb{R}^d_+ : \sum_i x_i = 1 \Big\}$$
Motivation: online learning, optimization over probability distributions.

The DGF is induced by a potential:
$$\psi(x) = \sum_i f(x_i), \qquad f(x) = \int_1^x \phi^{-1}(u)\, du,$$
with $\phi$ increasing, called the potential.
Consequence: known expressions for $\nabla\psi$ and $(\nabla\psi)^{-1}$.
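As a concrete instance of this construction (a standard example, spelled out here for reference rather than taken from the slide), the exponential potential recovers the entropic setup:

$$\phi(u) = e^{u-1}, \quad \phi^{-1}(v) = 1 + \ln v, \quad f(x) = \int_1^x (1 + \ln u)\,du = x \ln x,$$

so $\psi(x) = \sum_i x_i \ln x_i$, $\nabla\psi(x)_i = 1 + \ln x_i$, and $(\nabla\psi)^{-1}(z)_i = e^{z_i - 1}$.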
Projection algorithms

General strategy:
- Derive optimality conditions.
- Design an algorithm that satisfies these conditions.
Optimality conditions

$$x^\star = \arg\min_{x \in X} D_\psi\!\big(x,\ (\nabla\psi)^{-1}(\nabla\psi(\bar x) - \bar g)\big)$$

Optimality conditions: $x^\star$ is optimal if and only if $\exists\, \nu^\star \in \mathbb{R}$ such that
$$\forall i, \quad x^\star_i = \big(\phi(\phi^{-1}(\bar x_i) - \bar g_i + \nu^\star)\big)_+, \qquad \sum_{i=1}^d x^\star_i = 1.$$

Proof: write the KKT conditions, eliminate complementary slackness.

Comments:
- Reduces a problem in dimension $d$ to a problem in dimension 1.
- The function $c : \nu \mapsto \sum_i \big(\phi(\phi^{-1}(\bar x_i) - \bar g_i + \nu)\big)_+$ is increasing.
- Can solve for $\nu^\star$ using bisection.
Bisection algorithm for general divergences

Algorithm 3: Bisection method to compute the projection $x^\star$ with precision $\epsilon$
1: Input: $\bar x$, $\bar g$, $\epsilon$.
2: Initialize
   $\bar\nu = \phi^{-1}(1) - \max_i \big(\phi^{-1}(\bar x_i) - \bar g_i\big)$
   $\underline\nu = \phi^{-1}(1/d) - \max_i \big(\phi^{-1}(\bar x_i) - \bar g_i\big)$
3: while $c(\bar\nu) - c(\underline\nu) > \epsilon$ do
4:   Let $\nu^+ = (\bar\nu + \underline\nu)/2$
5:   if $c(\nu^+) > 1$ then
6:     $\bar\nu \leftarrow \nu^+$
7:   else
8:     $\underline\nu \leftarrow \nu^+$
9: Return $\tilde x(\bar\nu)_i = \big(\phi(\phi^{-1}(\bar x_i) - \bar g_i + \bar\nu)\big)_+$

Theorem: the algorithm terminates after $O(\ln \frac{1}{\epsilon})$ iterations, and outputs $\tilde x$ such that $\|\tilde x(\bar\nu) - x^\star\| \le \epsilon$.
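A minimal sketch of this bisection, assuming the potential $\phi$ and its inverse are available as vectorized callables (the exponential potential is used as a concrete example; all names are illustrative):

```python
import numpy as np

def c(nu, x_bar, g_bar, phi, phi_inv):
    # c(nu) = sum_i (phi(phi^{-1}(x_bar_i) - g_bar_i + nu))_+ , increasing in nu
    return np.maximum(phi(phi_inv(x_bar) - g_bar + nu), 0.0).sum()

def project_bisection(x_bar, g_bar, phi, phi_inv, tol=1e-10):
    d = len(x_bar)
    m = np.max(phi_inv(x_bar) - g_bar)
    nu_hi = phi_inv(1.0) - m       # c(nu_hi) >= 1 (the largest term equals 1)
    nu_lo = phi_inv(1.0 / d) - m   # c(nu_lo) <= 1 (each term is at most 1/d)
    while (c(nu_hi, x_bar, g_bar, phi, phi_inv)
           - c(nu_lo, x_bar, g_bar, phi, phi_inv)) > tol:
        nu = 0.5 * (nu_hi + nu_lo)
        if c(nu, x_bar, g_bar, phi, phi_inv) > 1.0:
            nu_hi = nu
        else:
            nu_lo = nu
    return np.maximum(phi(phi_inv(x_bar) - g_bar + nu_hi), 0.0)

# Example with the exponential potential phi(u) = exp(u - 1):
phi = lambda u: np.exp(u - 1.0)
phi_inv = lambda v: 1.0 + np.log(v)
x = project_bisection(np.full(3, 1/3), np.array([0.5, -0.2, 0.1]), phi, phi_inv)
```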
Exact projections for exponential divergences

Special case 1: $\psi(x) = \|x\|^2$: can compute the solution exactly [1].

Special case 2: exponential divergence, with potential
$$\phi_\epsilon : (-\infty, +\infty) \to (-\epsilon, +\infty), \qquad u \mapsto e^{u-1} - \epsilon.$$

For $\epsilon = 0$: $\psi(x) = H(x) = \sum_i x_i \ln x_i$ (negative entropy), and $D_\psi(x, y) = D_{KL}(x, y)$.
For $\epsilon > 0$: $\psi(x) = H(x + \epsilon)$, and $D_\psi(x, y) = D_{KL}(x + \epsilon, y + \epsilon)$.

[1] J. Duchi, S. Shalev-Shwartz, Y. Singer, T. Chandra. Efficient Projections onto the $\ell_1$-Ball for Learning in High Dimensions. ICML 2008.
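A quick verification of the $\epsilon > 0$ claim (filled in here; the slide states the result without the computation):

$$\phi_\epsilon^{-1}(v) = 1 + \ln(v + \epsilon), \qquad f(x) = \int_1^x \big(1 + \ln(u+\epsilon)\big)\,du = (x+\epsilon)\ln(x+\epsilon) - (1+\epsilon)\ln(1+\epsilon),$$

so $\psi(x) = \sum_i f(x_i) = H(x+\epsilon)$ up to a constant, and since affine terms do not change a Bregman divergence, $D_\psi(x, y) = D_{KL}(x+\epsilon,\, y+\epsilon)$ on the simplex.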
Motivation

Bregman projection with the KL divergence:
- Hedge algorithm in online learning
- Multiplicative weights algorithm
- Exponentiated gradient descent
Has a closed-form solution in $O(d)$.

However:
- $D_{KL}(x, y)$ is unbounded on the simplex (problematic for stochastic mirror descent).
- $H(x)$ is not a smooth function (problematic for accelerated mirror descent).
Taking $\epsilon > 0$ solves these issues.

[Figure: plot of $D_{KL}(x, y^0)$ and $D_{KL,\epsilon}(x, y^0)$ against the quadratic bounds $\frac{\ell_\epsilon}{2}\|x - y^0\|_1^2$ and $\frac{L_\epsilon}{2}\|x - y^0\|_1^2$.]
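A small numerical illustration of the unboundedness point (written for this note, not taken from the talk): the KL divergence blows up as the second argument approaches the boundary of the simplex, while its $\epsilon$-shifted version stays bounded.

```python
import numpy as np

def d_kl(x, y):
    # KL divergence between two vectors with equal sums
    return float(np.sum(x * np.log(x / y)))

x, eps = np.array([0.5, 0.5]), 0.1
for t in [1e-2, 1e-6, 1e-12]:
    y = np.array([t, 1.0 - t])          # y approaches a vertex of the simplex
    print(d_kl(x, y), d_kl(x + eps, y + eps))
# The first column grows without bound; the second stays bounded.
```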
Optimality conditions

Recall the general optimality condition: $x^\star_i = \big(\phi(\phi^{-1}(\bar x_i) - \bar g_i + \nu^\star)\big)_+$.

Optimality conditions with exponential divergence: let $x^\star$ be the solution and $I = \{i : x^\star_i > 0\}$ its support. Then
$$\forall i \in I, \quad x^\star_i = -\epsilon + \frac{(\bar x_i + \epsilon)\, e^{-\bar g_i}}{Z^\star}, \qquad Z^\star = \frac{\sum_{i \in I} (\bar x_i + \epsilon)\, e^{-\bar g_i}}{1 + |I|\,\epsilon}. \qquad (2)$$

Furthermore, if $\bar y_i = (\bar x_i + \epsilon)\, e^{-\bar g_i}$, then $(i \in I$ and $\bar y_j > \bar y_i) \Rightarrow j \in I$.
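To see that $Z^\star$ is the right normalization (a one-line check, implicit on the slide): summing (2) over the support,

$$\sum_{i \in I} x^\star_i = -|I|\,\epsilon + \frac{\sum_{i \in I} (\bar x_i + \epsilon)\, e^{-\bar g_i}}{Z^\star} = -|I|\,\epsilon + (1 + |I|\,\epsilon) = 1.$$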
A sorting-based algorithm

Algorithm 4: Sorting method to compute the Bregman projection with $D_\epsilon$
1: Input: $\bar x$, $\bar g$
2: Output: $x^\star$
3: Form the vector $\bar y_i = (\bar x_i + \epsilon)\, e^{-\bar g_i}$
4: Sort $\bar y$; let $\bar y_{\sigma(i)}$ be the $i$-th smallest element of $\bar y$.
5: Let $j^\star$ be the smallest index for which
   $$\big(1 + \epsilon\,(d - j + 1)\big)\, \bar y_{\sigma(j)} - \epsilon \sum_{i \ge j} \bar y_{\sigma(i)} > 0$$
6: Set $Z = \dfrac{\sum_{i \ge j^\star} \bar y_{\sigma(i)}}{1 + \epsilon\,(d - j^\star + 1)}$
7: Set $x^\star_i = \Big(-\epsilon + \dfrac{\bar y_i}{Z}\Big)_+$

Complexity: $O(d \ln d)$
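A minimal sketch of Algorithm 4 (indices are 0-based here, so $d - j$ plays the role of $d - j^\star + 1$ on the slide; variable names are mine):

```python
import numpy as np

def project_sort(x_bar, g_bar, eps):
    d = len(x_bar)
    y = (x_bar + eps) * np.exp(-g_bar)        # step 3
    y_sorted = np.sort(y)                      # step 4, ascending order
    suffix = np.cumsum(y_sorted[::-1])[::-1]   # suffix[j] = sum_{i >= j} y_sorted[i]
    for j in range(d):                         # step 5: smallest feasible index
        if (1.0 + eps * (d - j)) * y_sorted[j] - eps * suffix[j] > 0.0:
            break
    Z = suffix[j] / (1.0 + eps * (d - j))      # step 6
    return np.maximum(-eps + y / Z, 0.0)       # step 7

# With eps = 0 this reduces to the closed-form KL projection.
x = project_sort(np.full(4, 0.25), np.array([1.0, 0.0, -1.0, 0.5]), eps=0.1)
assert abs(x.sum() - 1.0) < 1e-9               # result lies on the simplex
```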
A randomized-pivot algorithm

Adapted from the QuickSelect algorithm: select the $i$-th smallest element of a vector $\bar y$.
- Can sort, then return the $i$-th element: $O(d \ln d)$.
- QuickSelect: expected $O(d)$, worst case $O(d^2)$.
A randomized-pivot algorithm (illustration)

Selecting the $k = 5$-th smallest element of $(9, 1, 4, 8, 7, 2, 3, 5, 6)$:
- Pick a pivot (here 2) and partition: $(1, 2 \mid 9, 4, 8, 7, 3, 5, 6)$.
- Two elements fall at or below the pivot, so recurse on the right part $(9, 4, 8, 7, 3, 5, 6)$ with $k = 5 - 2 = 3$.
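A minimal sketch of the QuickSelect primitive illustrated above ($k$ is 1-indexed; the projection algorithm in the talk adapts this pivot-and-recurse idea, but only the selection primitive is sketched here):

```python
import random

def quickselect(y, k):
    """Return the k-th smallest element of y (k is 1-indexed)."""
    pivot = random.choice(y)
    left = [v for v in y if v < pivot]    # strictly below the pivot
    mid = [v for v in y if v == pivot]
    right = [v for v in y if v > pivot]
    if k <= len(left):
        return quickselect(left, k)
    if k <= len(left) + len(mid):
        return pivot
    # Discard the left part and the pivots, shifting k as in the
    # k = 5 -> k = 3 step of the illustration.
    return quickselect(right, k - len(left) - len(mid))

assert quickselect([9, 1, 4, 8, 7, 2, 3, 5, 6], 5) == 5
```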