Generalized gradient descent
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725
Remember subgradient method

We want to solve
  min_{x ∈ R^n} f(x)
for f convex, not necessarily differentiable.

Subgradient method: choose initial x^(0) ∈ R^n, repeat:
  x^(k) = x^(k-1) − t_k · g^(k-1),  k = 1, 2, 3, ...
where g^(k-1) is a subgradient of f at x^(k-1).

If f is Lipschitz on a bounded set containing its minimizer, then the subgradient method has convergence rate O(1/√k). Downside: can be very slow!
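The update above can be sketched in a few lines. The diminishing step sizes t_k = 1/√k and the example objective f(x) = |x| are illustrative choices for this sketch, not taken from the slides.

```python
import numpy as np

def subgradient_method(subgrad, x0, steps):
    """Minimize a convex f given a subgradient oracle.

    subgrad(x) returns any subgradient of f at x;
    steps is a sequence of step sizes t_k.
    """
    x = np.asarray(x0, dtype=float)
    for t in steps:
        # the defining update: step along a negative subgradient
        x = x - t * subgrad(x)
    return x

# Example: minimize f(x) = |x|; sign(x) is a subgradient everywhere.
x = subgradient_method(np.sign, x0=np.array([5.0]),
                       steps=[1.0 / np.sqrt(k) for k in range(1, 200)])
```

Note that, unlike gradient descent, the iterates oscillate around the minimizer rather than settling, which is why diminishing step sizes are needed.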
Outline

Today:
• Generalized gradient descent
• Convergence analysis
• ISTA, matrix completion
• Special cases
Decomposable functions

Suppose f(x) = g(x) + h(x)
• g is convex, differentiable
• h is convex, not necessarily differentiable

If f were differentiable, the gradient descent update would be:
  x⁺ = x − t ∇f(x)

Recall the motivation: minimize the quadratic approximation to f around x, replacing ∇²f(x) by (1/t) I:
  x⁺ = argmin_z  f(x) + ∇f(x)ᵀ(z − x) + (1/2t)‖z − x‖²  =: f̃_t(z)
In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone? I.e., update
  x⁺ = argmin_z  g̃_t(z) + h(z)
     = argmin_z  g(x) + ∇g(x)ᵀ(z − x) + (1/2t)‖z − x‖² + h(z)
     = argmin_z  (1/2t)‖z − (x − t∇g(x))‖² + h(z)
The term (1/2t)‖z − (x − t∇g(x))‖² keeps z close to the gradient update for g; the h(z) term also makes h small.
Generalized gradient descent

Define
  prox_t(x) = argmin_{z ∈ R^n}  (1/2t)‖x − z‖² + h(z)

Generalized gradient descent: choose initial x^(0), repeat:
  x^(k) = prox_{t_k}(x^(k-1) − t_k ∇g(x^(k-1))),  k = 1, 2, 3, ...

To make the update step look familiar, we can write it as
  x^(k) = x^(k-1) − t_k · G_{t_k}(x^(k-1))
where G_t is the generalized gradient,
  G_t(x) = (x − prox_t(x − t∇g(x))) / t
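The iteration can be sketched generically, with grad_g and prox supplied as callables. The example below uses h = 0 (identity prox) on g(x) = ‖x‖²/2, an illustrative choice to show the recursion; it is not from the slides.

```python
import numpy as np

def generalized_gradient_descent(grad_g, prox, x0, t, iters):
    """Generalized (proximal) gradient descent for f = g + h.

    grad_g(x): gradient of the smooth part g.
    prox(v, t): prox_t(v) = argmin_z ||v - z||^2/(2t) + h(z).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # gradient step on g, then prox step on h
        x = prox(x - t * grad_g(x), t)
    return x

# With h = 0 the prox is the identity and we recover plain gradient
# descent, e.g. on g(x) = ||x||^2 / 2:
x = generalized_gradient_descent(lambda x: x, lambda v, t: v,
                                 x0=np.array([4.0, -2.0]), t=0.5, iters=50)
```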
What good did this do?

You have a right to be suspicious ... it looks like we just swapped one minimization problem for another.

The point is that the prox function prox_t(·) can be computed analytically for a lot of important functions h. Note:
• prox_t doesn't depend on g at all
• g can be very complicated, as long as we can compute its gradient

Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates prox_t(·) once, and this can be cheap or expensive, depending on h.
ISTA

Consider the lasso criterion
  f(x) = (1/2)‖y − Ax‖² + λ‖x‖₁
with g(x) = (1/2)‖y − Ax‖² and h(x) = λ‖x‖₁.

The prox function is now
  prox_t(x) = argmin_{z ∈ R^n}  (1/2t)‖x − z‖² + λ‖z‖₁
            = S_{λt}(x)
where S_λ(x) is the soft-thresholding operator,
  [S_λ(x)]_i =  x_i − λ   if x_i > λ
                0         if −λ ≤ x_i ≤ λ
                x_i + λ   if x_i < −λ
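The soft-thresholding operator above has a one-line vectorized form; a minimal sketch:

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-thresholding S_lam(x): shrink toward 0 by lam,
    zeroing entries with |x_i| <= lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

s = soft_threshold(np.array([3.0, 0.5, -2.0]), 1.0)  # [2., 0., -1.]
```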
Recall ∇g(x) = −Aᵀ(y − Ax). Hence the generalized gradient update step is:
  x⁺ = S_{λt}(x + tAᵀ(y − Ax))
The resulting algorithm is called ISTA (Iterative Soft-Thresholding Algorithm): a very simple algorithm to compute a lasso solution.

[Figure: f(k) − fstar versus iteration k, comparing generalized gradient descent (ISTA) against the subgradient method.]
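The full ISTA loop is just the gradient step on g followed by soft-thresholding; a minimal sketch (a production version would add a stopping criterion):

```python
import numpy as np

def ista(A, y, lam, t, iters):
    """ISTA for the lasso: minimize ||y - Ax||^2/2 + lam*||x||_1.

    t should satisfy t <= 1/L, with L the largest eigenvalue of A^T A.
    """
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x + t * A.T @ (y - A @ x)                        # gradient step on g
        x = np.sign(x) * np.maximum(np.abs(x) - lam * t, 0)  # prox: S_{lam*t}
    return x
```

With A = I the lasso solution is exactly S_λ(y), which gives an easy correctness check.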
Convergence analysis

We have f(x) = g(x) + h(x), and assume:
• g is convex, differentiable, and ∇g is Lipschitz continuous with constant L > 0
• h is convex, and prox_t(x) = argmin_z {‖x − z‖²/(2t) + h(z)} can be evaluated

Theorem: Generalized gradient descent with fixed step size t ≤ 1/L satisfies
  f(x^(k)) − f(x⋆) ≤ ‖x^(0) − x⋆‖² / (2tk)

I.e., generalized gradient descent has convergence rate O(1/k). Same as gradient descent! But remember, this counts the number of iterations, not the number of operations.
Proof

Similar to the proof for gradient descent, but with the generalized gradient G_t replacing the gradient ∇f. Main steps:

• ∇g Lipschitz with constant L implies
    f(y) ≤ g(x) + ∇g(x)ᵀ(y − x) + (L/2)‖y − x‖² + h(y)   for all x, y

• Plugging in y = x⁺ = x − tG_t(x),
    f(x⁺) ≤ g(x) − t∇g(x)ᵀG_t(x) + (Lt²/2)‖G_t(x)‖² + h(x − tG_t(x))

• By definition of prox,
    x − tG_t(x) = argmin_{z ∈ R^n}  (1/2t)‖z − (x − t∇g(x))‖² + h(z)
  which implies
    ∇g(x) − G_t(x) + v = 0  for some v ∈ ∂h(x − tG_t(x))
• Using G_t(x) − ∇g(x) ∈ ∂h(x − tG_t(x)), and convexity of g,
    f(x⁺) ≤ f(z) + G_t(x)ᵀ(x − z) − (1 − Lt/2) t ‖G_t(x)‖²   for all z

• Letting t ≤ 1/L and z = x⋆,
    f(x⁺) ≤ f(x⋆) + G_t(x)ᵀ(x⋆ − x) − (t/2)‖G_t(x)‖²
          = f(x⋆) + (1/2t)(‖x − x⋆‖² − ‖x⁺ − x⋆‖²)

The proof proceeds just as with gradient descent.
Backtracking line search

Same as with gradient descent, just replace ∇f with the generalized gradient G_t. I.e.,
• Fix 0 < β < 1
• Then at each iteration, start with t = 1, and while
    f(x − tG_t(x)) > f(x) − (t/2)‖G_t(x)‖²,
  update t = βt

Theorem: Generalized gradient descent with backtracking line search satisfies
  f(x^(k)) − f(x⋆) ≤ ‖x^(0) − x⋆‖² / (2 t_min k)
where t_min = min{1, β/L}
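The backtracking rule above can be sketched directly; f is the full objective g + h, and the inner while loop shrinks t until the slide's condition holds. The quadratic test problem at the end is an illustrative choice, not from the slides.

```python
import numpy as np

def prox_grad_backtracking(f, grad_g, prox, x0, beta=0.5, iters=20):
    """Generalized gradient descent with backtracking line search.

    f(x): full objective g(x) + h(x); grad_g: gradient of the smooth part;
    prox(v, t): prox operator of h.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        t = 1.0
        while True:
            z = prox(x - t * grad_g(x), t)
            Gt = (x - z) / t                   # generalized gradient G_t(x)
            # accept t once f(x - t*G_t(x)) <= f(x) - (t/2)||G_t(x)||^2
            if f(z) <= f(x) - (t / 2) * (Gt @ Gt):
                break
            t *= beta                          # otherwise shrink t and retry
        x = z
    return x

# Smooth test problem: f(x) = 2||x||^2 (so L = 4 and t = 1 must backtrack),
# with h = 0, i.e. an identity prox.
x = prox_grad_backtracking(lambda x: 2 * x @ x, lambda x: 4 * x,
                           lambda v, t: v, np.array([3.0, -1.0]))
```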
Matrix completion

Given a matrix A, m × n, we only observe entries A_ij, (i, j) ∈ Ω. We want to fill in the missing entries, so we solve:
  min_{X ∈ R^{m×n}}  (1/2) Σ_{(i,j) ∈ Ω} (A_ij − X_ij)² + λ‖X‖_∗

Here ‖X‖_∗ is the nuclear norm of X,
  ‖X‖_∗ = Σ_{i=1}^r σ_i(X)
where r = rank(X) and σ_1(X), ..., σ_r(X) are its singular values.
Define P_Ω, the projection operator onto the observed set:
  [P_Ω(X)]_ij =  X_ij  if (i, j) ∈ Ω
                 0     if (i, j) ∉ Ω

The criterion is
  f(X) = (1/2)‖P_Ω(A) − P_Ω(X)‖_F² + λ‖X‖_∗
with g(X) = (1/2)‖P_Ω(A) − P_Ω(X)‖_F² and h(X) = λ‖X‖_∗.

Two things needed for generalized gradient descent:
• Gradient: ∇g(X) = −(P_Ω(A) − P_Ω(X))
• Prox function:
    prox_t(X) = argmin_{Z ∈ R^{m×n}}  (1/2t)‖X − Z‖_F² + λ‖Z‖_∗
Claim: prox_t(X) = S_{λt}(X), where the matrix soft-thresholding operator S_λ(X) is defined by
  S_λ(X) = U Σ_λ Vᵀ
where X = U Σ Vᵀ is a singular value decomposition, and Σ_λ is diagonal with
  (Σ_λ)_ii = max{Σ_ii − λ, 0}

Why? Note prox_t(X) = Z, where Z satisfies
  0 ∈ Z − X + λt · ∂‖Z‖_∗

Fact: if Z = U Σ Vᵀ, then
  ∂‖Z‖_∗ = { UVᵀ + W : W ∈ R^{m×n}, ‖W‖ ≤ 1, UᵀW = 0, WV = 0 }

Now plug in Z = S_{λt}(X) and check that we can get 0.
Hence the generalized gradient update step is:
  X⁺ = S_{λt}(X + t(P_Ω(A) − P_Ω(X)))

Note that ∇g(X) is Lipschitz continuous with L = 1, so we can choose fixed step size t = 1. The update step is now:
  X⁺ = S_λ(P_Ω(A) + P_Ω^⊥(X))
where P_Ω^⊥ projects onto the unobserved set, P_Ω(X) + P_Ω^⊥(X) = X.

This is the soft-impute algorithm¹, a simple and effective method for matrix completion.

¹ Mazumder et al. (2011), Spectral regularization algorithms for learning large incomplete matrices
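The t = 1 update above can be sketched with a boolean mask standing in for Ω and a full SVD for the matrix soft-thresholding; this is a minimal illustration of the update, not the authors' reference implementation (which exploits low-rank structure to avoid full SVDs).

```python
import numpy as np

def soft_impute(A, mask, lam, iters=100):
    """Soft-impute for matrix completion with fixed step t = 1.

    A: matrix with observed entries (values outside mask are ignored);
    mask: boolean array, True where A_ij is observed.
    """
    X = np.zeros_like(A, dtype=float)
    for _ in range(iters):
        # P_Omega(A) + P_Omega^perp(X): observed entries from A, rest from X
        Y = np.where(mask, A, X)
        # matrix soft-thresholding S_lam: shrink singular values by lam
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - lam, 0.0)) @ Vt
    return X
```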
Why "generalized"?

Special cases of generalized gradient descent, on f = g + h:
• h = 0 → gradient descent
• h = I_C → projected gradient descent
• g = 0 → proximal minimization algorithm

Therefore these algorithms all have O(1/k) convergence rate.
Projected gradient descent

Given a closed, convex set C ⊆ R^n,
  min_{x ∈ C} g(x)  ⇔  min_x g(x) + I_C(x)
where I_C(x) = 0 if x ∈ C, ∞ if x ∉ C, is the indicator function of C.

Hence
  prox_t(x) = argmin_z  (1/2t)‖x − z‖² + I_C(z)
            = argmin_{z ∈ C}  ‖x − z‖²
I.e., prox_t(x) = P_C(x), the projection operator onto C.
Therefore the generalized gradient update step is:
  x⁺ = P_C(x − t∇g(x))
i.e., perform the usual gradient update and then project back onto C. This is called projected gradient descent.

[Figure: a gradient step that lands outside C, followed by projection back onto C.]
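As code, this is the generalized gradient iteration with the prox replaced by a projection. The example minimizes (1/2)‖x − a‖² over the nonnegative orthant, an illustrative choice whose solution is simply the projection of a onto R^n_+.

```python
import numpy as np

def projected_gradient(grad_g, project, x0, t, iters):
    """Projected gradient descent: gradient step on g, then project onto C."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project(x - t * grad_g(x))
    return x

# Example: minimize ||x - a||^2 / 2 over C = {x : x >= 0}.
a = np.array([1.0, -2.0])
x = projected_gradient(lambda x: x - a, lambda v: np.maximum(v, 0.0),
                       x0=np.zeros(2), t=1.0, iters=20)
# x converges to [1, 0], the projection of a onto the nonnegative orthant
```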
What sets C are easy to project onto? Lots, e.g.:
• Affine images C = {Ax + b : x ∈ R^n}
• Solution set of a linear system C = {x ∈ R^n : Ax = b}
• Nonnegative orthant C = {x ∈ R^n : x ≥ 0} = R^n_+
• Norm balls C = {x ∈ R^n : ‖x‖_p ≤ 1}, for p = 1, 2, ∞
• Some simple polyhedra and simple cones

Warning: it is easy to write down a seemingly simple set C for which P_C turns out to be very hard! E.g., it is generally hard to project onto the solution set of arbitrary linear inequalities, i.e., an arbitrary polyhedron C = {x ∈ R^n : Ax ≤ b}.
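Two of the easy cases above have closed forms worth writing out: the Euclidean ball (rescale if outside) and the affine set Ax = b (one linear solve, assuming A has full row rank). A minimal sketch:

```python
import numpy as np

def project_l2_ball(x, r=1.0):
    """Project onto {x : ||x||_2 <= r}: rescale to the boundary if outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def project_affine(x, A, b):
    """Project onto {x : Ax = b}, assuming A has full row rank:
    P_C(x) = x - A^T (A A^T)^{-1} (Ax - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)
```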
Proximal minimization algorithm

Consider, for h convex (not necessarily differentiable),
  min_x h(x)

The generalized gradient update step is just a prox update:
  x⁺ = argmin_z  (1/2t)‖x − z‖² + h(z)

This is called the proximal minimization algorithm. Faster than the subgradient method, but not implementable unless we know the prox in closed form.