

  1. ADMM and Mirror Descent
  Geoff Gordon & Ryan Tibshirani (I am Aaditya Ramdas and I approve this lecture)
  Optimization 10-725 / 36-725, Oct 30, 2012

  2. Recap of Dual Ascent
  For problems like $\min_x f(x)$ s.t. $Ax = b$, we defined the Lagrangian
  $$L(x, u) = f(x) + u^\top (Ax - b)$$
  and the Lagrange dual function $g(u) = \inf_x L(x, u)$.
  If $x^+$ minimizes $L(x, u)$, then $Ax^+ - b \in \partial g(u)$.

  3. Recap of Dual Ascent
  Dual problem: maximize $g(u)$, so use subgradient ascent! This gives us the algorithm
  $$x^{t+1} = \arg\min_x L(x, u^t)$$
  $$u^{t+1} = u^t + \eta_t (Ax^{t+1} - b)$$
  Under strong duality, $x^* = \arg\min_x L(x, u^*)$, provided it is unique.
  For appropriate $\eta_t$ (and some conditions), $x^t, u^t$ converge to an optimal primal and dual point.
  If $g$ is not differentiable, convergence is not monotone, i.e. sometimes $g(u^{t+1}) < g(u^t)$.
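The two updates above can be run numerically. A minimal sketch, under assumptions not in the slides: $f(x) = \tfrac{1}{2}\|x\|_2^2$ (so the $x$-minimization has the closed form $x^+ = -A^\top u$) and small random data for $A$, $b$:

```python
import numpy as np

# Dual ascent sketch for  min ½‖x‖²  s.t.  Ax = b.
# With f(x) = ½ xᵀx, the x-minimization step is closed form:
#   arg min_x ½xᵀx + uᵀ(Ax - b) = -Aᵀu.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # toy data, not from the slides
b = rng.standard_normal(3)

eta = 1.0 / np.linalg.norm(A, 2) ** 2    # safe fixed step size 1/λmax(AAᵀ)
u = np.zeros(3)
for t in range(10000):
    x = -A.T @ u                         # x-minimization step
    u = u + eta * (A @ x - b)            # dual (sub)gradient ascent step

residual = np.linalg.norm(A @ x - b)     # should shrink toward 0
```

The fixed step size is a convenience choice; the slide's $\eta_t$ would in general be a diminishing schedule.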

  4. Recap of Dual Decomposition Ascent
  Suppose $f(x) = \sum_i f_i(x_i)$, where the blocks $x_i \in \mathbb{R}^{n_i}$ are disjoint. Write $Ax = \sum_i A_i x_i$, and so
  $$L(x, u) = \sum_i L_i(x_i, u) = \sum_i \left[ f_i(x_i) + u^\top A_i x_i - (1/N)\, u^\top b \right]$$
  The $x$-minimization step in dual ascent decomposes:
  $$x_i^{t+1} = \arg\min_{x_i} L_i(x_i, u^t)$$
  $$u^{t+1} = u^t + \eta_t (Ax^{t+1} - b)$$

  5. Recap of Augmented Lagrangian, Method of Multipliers
  $$L_\rho(x, u) = f(x) + u^\top (Ax - b) + (\rho/2) \|Ax - b\|_2^2$$
  This is the Lagrangian of $\min_x f(x) + (\rho/2)\|Ax - b\|_2^2$ s.t. $Ax = b$, with associated dual function $g_\rho(u) = \min_x L_\rho(x, u)$. Applying dual ascent:
  $$x^{t+1} = \arg\min_x L_\rho(x, u^t)$$
  $$u^{t+1} = u^t + \rho (Ax^{t+1} - b)$$
  More robust than dual ascent (converges even if $f$ is not strictly convex or can be infinite). However, decomposability is lost.
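The same toy problem as before illustrates the method of multipliers; a sketch assuming $f(x) = \tfrac{1}{2}\|x\|_2^2$, so the $x$-update is a single linear solve of $\nabla_x L_\rho = x + A^\top u + \rho A^\top(Ax - b) = 0$:

```python
import numpy as np

# Method of multipliers for  min ½‖x‖²  s.t.  Ax = b.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))          # toy data, not from the slides
b = rng.standard_normal(3)

rho = 10.0
u = np.zeros(3)
for t in range(500):
    # x-update: solve (I + ρAᵀA) x = Aᵀ(ρb - u)
    x = np.linalg.solve(np.eye(5) + rho * A.T @ A, A.T @ (rho * b - u))
    u = u + rho * (A @ x - b)            # dual update with step size ρ

residual = np.linalg.norm(A @ x - b)
```

Note the step size in the dual update is $\rho$ itself, matching the slide, rather than a tuned $\eta_t$.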

  6. Alternating Direction Method of Multipliers
  The augmented Lagrangian for $f(x) = f_1(x_1) + f_2(x_2)$ is
  $$L_\rho(x_1, x_2, u) = f_1(x_1) + f_2(x_2) + u^\top (A_1 x_1 + A_2 x_2 - b) + (\rho/2) \|A_1 x_1 + A_2 x_2 - b\|_2^2$$
  "Alternating direction" minimization:
  $$x_1^{t+1} = \arg\min_{x_1} L_\rho(x_1, x_2^t, u^t)$$
  $$x_2^{t+1} = \arg\min_{x_2} L_\rho(x_1^{t+1}, x_2, u^t)$$
  $$u^{t+1} = u^t + \rho (A_1 x_1^{t+1} + A_2 x_2^{t+1} - b)$$
  The normal method of multipliers would've done the joint minimization
  $$(x_1^{t+1}, x_2^{t+1}) = \arg\min_{x_1, x_2} L_\rho(x_1, x_2, u^t)$$
  $$u^{t+1} = u^t + \rho (A_1 x_1^{t+1} + A_2 x_2^{t+1} - b)$$
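The alternating updates can be seen on a tiny two-block instance. A sketch with toy data not from the slides: minimize $\tfrac{1}{2}\|x_1 - a\|^2 + \tfrac{1}{2}\|x_2 - c\|^2$ subject to $x_1 - x_2 = 0$ (so $A_1 = I$, $A_2 = -I$, $b = 0$), whose solution is $x_1 = x_2 = (a + c)/2$:

```python
import numpy as np

# ADMM on:  min ½‖x1 - a‖² + ½‖x2 - c‖²  s.t.  x1 - x2 = 0.
# Both x-updates have closed forms obtained by setting gradients to zero.
a = np.array([1.0, 3.0])                 # toy data (assumed)
c = np.array([5.0, -1.0])
rho = 1.0

x1 = np.zeros(2); x2 = np.zeros(2); u = np.zeros(2)
for t in range(100):
    x1 = (a - u + rho * x2) / (1 + rho)  # minimize L_ρ over x1, x2 fixed
    x2 = (c + u + rho * x1) / (1 + rho)  # minimize L_ρ over x2, using new x1
    u = u + rho * (x1 - x2)              # dual update
```

Using the freshly updated $x_1$ inside the $x_2$-step is exactly what distinguishes ADMM from the joint minimization of the ordinary method of multipliers.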

  7. Convergence Guarantees of ADMM
  Assumption 1: $f_1, f_2$ are closed, proper, convex (epigraphs are closed, nonempty, convex).
  Assumption 2: the unaugmented Lagrangian $L_0(x_1, x_2, u)$ has a saddle point $(x_1^S, x_2^S, u^S)$:
  $$L_0(x_1^S, x_2^S, u) \le L_0(x_1^S, x_2^S, u^S) \le L_0(x_1, x_2, u^S)$$
  Then we get:
  Residual convergence: $r^t = A_1 x_1^t + A_2 x_2^t - b \to 0$
  Objective convergence: $f_1(x_1^t) + f_2(x_2^t) \to f^*$
  Dual variable convergence: $u^t \to u^*$
  The primal variables needn't converge (more assumptions needed).

  8. Example: Generalized Lasso with Repeated Ridge
  $$\min_x \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|Fx\|_1$$
  In ADMM form,
  $$\min_{x, z} \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|z\|_1 \quad \text{s.t.} \quad Fx - z = 0$$
  ADMM updates:
  $$x^{t+1} = (A^\top A + \rho F^\top F)^{-1} (A^\top b + \rho F^\top (z^t - u^t))$$
  $$z^{t+1} = S_{\lambda/\rho}(Fx^{t+1} + u^t)$$
  $$u^{t+1} = u^t + Fx^{t+1} - z^{t+1}$$
  For the group lasso (penalty $\lambda \sum_i \|x_i\|_2$ for disjoint $x_i \in \mathbb{R}^{n_i}$), ADMM uses the vector soft-thresholding operator
  $$S_\kappa(a) = \left(1 - \frac{\kappa}{\|a\|_2}\right)_+ a$$
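These three updates translate almost line for line into code. A sketch with assumptions not in the slides: $F = I$ (so this is the plain lasso), toy sparse data, and fixed $\lambda$, $\rho$:

```python
import numpy as np

# ADMM for the generalized lasso, specialized to F = I (plain lasso).
def soft_threshold(a, kappa):
    # elementwise S_κ(a) = sign(a) · max(|a| - κ, 0)
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]            # sparse ground truth (toy)
b = A @ x_true + 0.01 * rng.standard_normal(30)

lam, rho = 0.5, 1.0
x = np.zeros(10); z = np.zeros(10); u = np.zeros(10)
AtA, Atb = A.T @ A, A.T @ b
for t in range(500):
    x = np.linalg.solve(AtA + rho * np.eye(10), Atb + rho * (z - u))
    z = soft_threshold(x + u, lam / rho)  # z-update: soft thresholding
    u = u + x - z                         # scaled dual update
```

Since $A^\top A + \rho F^\top F$ does not change across iterations, a practical implementation would factor it once; the repeated solve is kept here for clarity.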

  9. Break!

  10. Bregman Divergence $\Delta_g$
  If $g$ is strongly convex w.r.t. a norm $\|\cdot\|$, define
  $$\Delta_g(x, y) = g(x) - [g(y) + \nabla g(y)^\top (x - y)]$$
  Read: "the distance between $x$ and $y$ as measured by the function $g$".
  E.g. $g(x) = \|x\|_2^2$, strongly convex w.r.t. $\|\cdot\|_2$:
  $$\Delta_g(x, y) = \|x - y\|_2^2$$
  E.g. $g(x) = \sum_i (x_i \log x_i - x_i)$, strongly convex w.r.t. $\|\cdot\|_1$:
  $$\Delta_g(x, y) = \sum_i \left[ x_i \log\left(\frac{x_i}{y_i}\right) + y_i - x_i \right]$$
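The definition and both examples can be checked numerically. A small sketch (the `bregman` helper and the test points are illustrative, not from the slides):

```python
import numpy as np

# ∆_g(x, y) = g(x) - [g(y) + ∇g(y)ᵀ(x - y)], evaluated for the two
# example mirror maps above.
def bregman(g, grad_g, x, y):
    return g(x) - g(y) - grad_g(y) @ (x - y)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

# g(x) = ‖x‖²₂  →  ∆_g(x, y) = ‖x - y‖²₂
sq = bregman(lambda v: v @ v, lambda v: 2.0 * v, x, y)

# g(x) = Σᵢ (xᵢ log xᵢ - xᵢ)  →  generalized KL divergence
gkl = bregman(lambda v: np.sum(v * np.log(v) - v),
              lambda v: np.log(v), x, y)
```

Note that $\nabla g(x) = \log x$ for the (un-normalized) negative entropy, which is what makes the second closed form come out as the generalized KL divergence.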

  11. Properties of Bregman Divergence
  For a $\lambda$-strongly convex function $g$, we defined
  $$\Delta_g(x, y) = g(x) - [g(y) + \nabla g(y)^\top (x - y)]$$
  So $\Delta_g(x, x) = 0$, and by strong convexity,
  $$\Delta_g(x, y) \ge \tfrac{\lambda}{2} \|x - y\|^2 \ge 0$$
  Derivatives:
  $$\nabla_x \Delta_g(x, y) = \nabla g(x) - \nabla g(y), \qquad \nabla_x^2 \Delta_g(x, y) = \nabla^2 g(x) \succeq \lambda I$$
  Triangle inequality (kinda):
  $$\Delta_g(x, y) + \Delta_g(y, z) = \Delta_g(x, z) + (\nabla g(z) - \nabla g(y))^\top (x - y)$$

  12. Recap of Gradient Descent
  Consider the problem $\min_{x \in S} f(x)$ for $S \subseteq \mathbb{R}^n$. Gradient descent minimizes the quadratic approximation of $f$ at $x^t$ (with $H_t = I$):
  $$x^{t+1} = \arg\min_x f(x^t) + \partial f(x^t)^\top (x - x^t) + \tfrac{1}{2} \|x - x^t\|_2^2$$
  From HW2 (via regret): for projected subgradient descent,
  $$f(x^t) - f(x^*) \le \frac{L_2 D_2}{\sqrt{T}}$$
  where $\max_{x \in S} \|\partial f(x)\|_2 \le L_2$ and $\max_{x, y \in S} \|x - y\|_2 \le D_2$.
  How does this scale with $n$? It depends on $L_2(f, S)$ and $D_2(S)$.

  13. Mirror Descent
  Given a norm $\|\cdot\|$ over the domain $S$,
  $$x^{t+1} = \arg\min_x f(x^t) + \partial f(x^t)^\top (x - x^t) + \Delta_g(x, x^t)$$
  where $g$ is strongly convex w.r.t. $\|\cdot\|$. Alternatively,
  $$x^{t+1} = \arg\min_x x^\top (\partial f(x^t) - \nabla g(x^t)) + g(x)$$
  Hence $\partial f(x^t) + \nabla g(x^{t+1}) - \nabla g(x^t) = 0$, so we sometimes see
  $$x^{t+1} = \nabla g^{-1}(\nabla g(x^t) - \eta_t \partial f(x^t))$$

  14. Convergence Guarantees
  Let $\|\partial f(x)\|_* \le L_{\|\cdot\|}$, or equivalently $f(x) - f(y) \le L_{\|\cdot\|} \|x - y\|$.
  If $x_g = \arg\min_{x \in S} g(x)$, let $D_{g,\|\cdot\|} = \sqrt{2 \max_y \Delta_g(x_g, y) / \lambda}$; then $\|x - x_g\| \le D_{g,\|\cdot\|}$.
  Choosing $\eta_t = \frac{\lambda D_{g,\|\cdot\|}}{\|\partial f(x^t)\|_* \sqrt{T}}$,
  $$f(x^T) - f(x^*) \le \frac{L_{\|\cdot\|} D_{g,\|\cdot\|}}{\sqrt{T}}$$
  Remember (HW2): $\eta_t = \frac{D_2}{L_2 \sqrt{T}}$, with $D_2 = \max_{x, y \in S} \|x - y\|_2$.

  15. Example: Probability Simplex and $\|\cdot\|_1$
  The $n$-dimensional simplex: $x \ge 0$, $\mathbf{1}^\top x = 1$. Functions are Lipschitz w.r.t. $\|\cdot\|_1$: $\max_x \|\partial f(x)\|_\infty \le L_1$.
  If $g(x) = \sum_i x_i \log x_i - x_i$, we get exponentiated gradient:
  $$x^{t+1} = x^t \circ \exp(-\eta_t \nabla f(x^t))$$
  $D_{g,\|\cdot\|_1} \le \sqrt{2 \log n}$, yielding a rate of $\sqrt{\log n / T}$.
  By contrast, $g(x) = \|x\|_2^2$ (gradient descent) gives $\sqrt{n / T}$ ($D_2 = 1$, $L_2 \le \sqrt{n} L_1$).
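The exponentiated-gradient update can be sketched on a toy linear objective $f(x) = c^\top x$. The cost vector and step size here are illustrative; a renormalization step, which is implicit when the entropy is restricted to the simplex, is made explicit:

```python
import numpy as np

# Exponentiated gradient (mirror descent with the entropy mirror map)
# on the probability simplex, minimizing f(x) = cᵀx, so ∇f(x) = c.
c = np.array([3.0, 1.0, 2.0])       # toy cost vector (assumed)
x = np.ones(3) / 3                  # start at the uniform distribution
eta = 0.1                           # fixed step size (a tuning choice)

for t in range(500):
    x = x * np.exp(-eta * c)        # multiplicative update x ∘ exp(-η∇f)
    x = x / x.sum()                 # renormalize onto the simplex

# x concentrates on the coordinate with the smallest cost
```

Every iterate stays strictly positive and sums to one, so no explicit projection onto the simplex is ever needed; that is the practical appeal of this mirror map.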

  16. References
  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Boyd, Parikh, Chu, Peleato, and Eckstein, 2010.
  Lecture Notes on Modern Convex Optimization. Ben-Tal and Nemirovski, 2012.
