

  1. Convex Optimization (EE227A: UC Berkeley), Lecture 18 (Proximal methods; Incremental methods I), 21 March 2013. Suvrit Sra

  2. Douglas-Rachford method. Problem: $0 \in \partial f(x) + \partial g(x)$.
     DR method: given $z^0$, iterate for $k \ge 0$:
       $x^k = \operatorname{prox}_g(z^k)$
       $v^k = \operatorname{prox}_f(2x^k - z^k)$
       $z^{k+1} = z^k + \gamma_k (v^k - x^k)$
     For $\gamma_k = 1$, we have $z^{k+1} = z^k + v^k - x^k$, i.e.,
       $z^{k+1} = z^k + \operatorname{prox}_f(2\operatorname{prox}_g(z^k) - z^k) - \operatorname{prox}_g(z^k)$.
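To make the iteration concrete, here is a minimal sketch (not from the slides) of the DR updates on an assumed toy instance where both proximity operators have closed forms: $f(x) = \|x\|_1$ and $g(x) = \tfrac12\|x - a\|^2$. The vector `a`, the iteration count, and $\gamma_k = 1$ are illustration-only choices.

```python
import numpy as np

# Assumed toy instance: minimize f(x) + g(x) with
#   f(x) = ||x||_1            -> prox_f is componentwise soft-thresholding
#   g(x) = 0.5 * ||x - a||^2  -> prox_g(v) = (v + a) / 2
a = np.array([3.0, -0.4, 0.0, 1.5])

def prox_f(v, t=1.0):
    # prox of t * ||.||_1: soft-thresholding at level t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_g(v):
    # prox of 0.5 * ||. - a||^2 (unit step)
    return (v + a) / 2.0

z = np.zeros_like(a)          # z^0
gamma = 1.0                   # relaxation parameter gamma_k
for k in range(200):
    x = prox_g(z)             # x^k   = prox_g(z^k)
    v = prox_f(2 * x - z)     # v^k   = prox_f(2 x^k - z^k)
    z = z + gamma * (v - x)   # z^{k+1} = z^k + gamma_k (v^k - x^k)

print(x)   # approaches the minimizer of f + g, here soft-thresholding of a
```

With $\gamma_k = 1$ the $z$-sequence tends to a fixed point $z^\ast$, and $x = \operatorname{prox}_g(z^\ast)$ solves $\min f + g$; for this toy instance the limit is componentwise soft-thresholding of $a$.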

  3. Douglas-Rachford method (cont.)
       $z^{k+1} = z^k + \operatorname{prox}_f(2\operatorname{prox}_g(z^k) - z^k) - \operatorname{prox}_g(z^k)$
     Dropping superscripts, we have the fixed-point iteration $z \leftarrow Tz$ with
       $T = I + P_f(2P_g - I) - P_g$,
     where $P_f := \operatorname{prox}_f$ and $P_g := \operatorname{prox}_g$.
     Lemma: DR can be written as $z \leftarrow \tfrac12 (R_f R_g + I)\, z$, where $R_f$ denotes the reflection operator $2P_f - I$ (similarly $R_g$). Exercise: Prove this claim.
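For reference, the algebra behind the lemma (essentially the stated exercise) is a short expansion using only $R_f = 2P_f - I$ and $R_g = 2P_g - I$:

```latex
\begin{align*}
\tfrac{1}{2}\,(R_f R_g + I)
  &= \tfrac{1}{2}\bigl((2P_f - I)(2P_g - I) + I\bigr) \\
  &= P_f(2P_g - I) - \tfrac{1}{2}(2P_g - I) + \tfrac{1}{2} I \\
  &= I + P_f(2P_g - I) - P_g \;=\; T .
\end{align*}
```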

  4. Proximity for several functions. Optimizing sums of functions:
       $f(x) := \tfrac12 \|x - y\|_2^2 + \sum_i f_i(x)$
       $f(x) := \sum_i f_i(x)$
     DR does not work immediately.

  5. Product space trick
     ◮ Original problem is over $\mathcal{H} = \mathbb{R}^n$
     ◮ Suppose we have $\sum_{i=1}^m f_i(x)$
     ◮ Introduce new variables $(x_1, \ldots, x_m)$
     ◮ Now the problem is over the domain $\mathcal{H}^m := \mathcal{H} \times \mathcal{H} \times \cdots \times \mathcal{H}$ ($m$ times)
     ◮ New constraint: $x_1 = x_2 = \cdots = x_m$
       $\min_{(x_1, \ldots, x_m)} \sum_i f_i(x_i)$  s.t.  $x_1 = x_2 = \cdots = x_m$.

  6. Product space trick (cont.)
       $\min_{x}\; f(x) + I_{\mathcal{B}}(x)$, where $x \in \mathcal{H}^m$ and $\mathcal{B} = \{ z \in \mathcal{H}^m \mid z = (x, x, \ldots, x) \}$
     ◮ Let $y = (y_1, \ldots, y_m)$
     ◮ $\operatorname{prox}_f(y) = (\operatorname{prox}_{f_1}(y_1), \ldots, \operatorname{prox}_{f_m}(y_m))$
     ◮ $P_{\mathcal{B}}(y)$ can be computed as follows:
         $\min_{z \in \mathcal{B}} \tfrac12 \|z - y\|_2^2 \;=\; \min_{x \in \mathcal{H}} \sum_i \tfrac12 \|x - y_i\|_2^2 \;\Longrightarrow\; x = \tfrac{1}{m} \sum_i y_i$
     Exercise: Work out the details of DR with the above ideas (cf. the sketch below). Note: this trick works for all other situations!
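A minimal sketch of the product-space DR iteration, under the assumption that each $f_i(x) = \tfrac12\|x - a_i\|^2$, so that $\operatorname{prox}_{f_i}$ is explicit and the true minimizer of the sum is the average of the $a_i$. All names, sizes, and the iteration count are illustrative.

```python
import numpy as np

# Assumed toy instance: m quadratic pieces f_i(x) = 0.5 * ||x - a_i||^2,
# so prox_{f_i}(v) = (v + a_i) / 2 and the minimizer of sum_i f_i is mean(a_i).
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.normal(size=(m, n))       # row i is a_i

def prox_f(Y):
    # prox of the separable sum acts blockwise: (prox_{f_1}(y_1), ..., prox_{f_m}(y_m))
    return (Y + A) / 2.0

def proj_B(Y):
    # projection onto B = {(x, ..., x)}: replace every block by the average
    return np.tile(Y.mean(axis=0), (m, 1))

Z = np.zeros((m, n))              # z^0 in H^m
for k in range(300):
    X = proj_B(Z)                 # x^k = prox_{I_B}(z^k) = P_B(z^k)
    V = prox_f(2 * X - Z)         # v^k = prox_f(2 x^k - z^k), blockwise
    Z = Z + (V - X)               # z^{k+1}, with gamma_k = 1

print(X[0], A.mean(axis=0))       # the common block approaches mean of the a_i
```

Note the two prox computations per iteration: a blockwise (embarrassingly parallel) prox of the separable sum, and the projection onto the consensus set $\mathcal{B}$, which is simply averaging the blocks.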

  7. Proximity operator for sums
       $\min_x\; \tfrac12 \|x - y\|_2^2 + g(x) + h(x)$
     Usually $\operatorname{prox}_{f+g} \neq \operatorname{prox}_f \circ \operatorname{prox}_g$.
     Proximal-Dykstra method
       1. Let $x^0 = y$; $u^0 = 0$, $z^0 = 0$
       2. $k$-th iteration ($k \ge 0$):
            $w^k = \operatorname{prox}_g(x^k + u^k)$
            $u^{k+1} = x^k + u^k - w^k$
            $x^{k+1} = \operatorname{prox}_h(w^k + z^k)$
            $z^{k+1} = w^k + z^k - x^{k+1}$
     Why does it work? After the break...!
     Exercise: Use the product-space trick to extend this to a parallel Dykstra-like method for $m \ge 3$ functions.
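A minimal sketch of the proximal-Dykstra iteration on an assumed instance where $\operatorname{prox}_{g+h}$ is known in closed form, which makes the result easy to check: $g$ the indicator of the nonnegative orthant and $h(x) = \lambda\|x\|_1$, so $\operatorname{prox}_{g+h}(y) = \max(y - \lambda, 0)$ componentwise. The data $y$, $\lambda$, and the iteration count are illustration-only choices.

```python
import numpy as np

# Assumed instance: g = indicator of {x >= 0}, h = lam * ||.||_1.
# Here prox_{g+h}(y) = max(y - lam, 0), which lets us verify the method.
lam = 1.0
y = np.array([2.5, -1.0, 0.3, 4.0])

def prox_g(v):
    # prox of the indicator of {x >= 0}: projection onto the orthant
    return np.maximum(v, 0.0)

def prox_h(v):
    # prox of lam * ||.||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

x = y.copy()                      # x^0 = y
u = np.zeros_like(y)              # u^0 = 0
z = np.zeros_like(y)              # z^0 = 0
for k in range(100):
    w = prox_g(x + u)             # w^k
    u = x + u - w                 # u^{k+1}
    x_new = prox_h(w + z)         # x^{k+1}
    z = w + z - x_new             # z^{k+1}
    x = x_new

print(x, np.maximum(y - lam, 0.0))   # x converges to prox_{g+h}(y)
```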

  8. Incremental methods

  9. Separable objectives
       $\min_x\; f(x) = \sum_{i=1}^m f_i(x) + \lambda r(x)$
     Gradient / subgradient methods:
       $x^{k+1} = x^k - \alpha_k \nabla f(x^k)$            (smooth case, $\lambda = 0$)
       $x^{k+1} = x^k - \alpha_k g(x^k)$, with $g(x^k) \in \partial f(x^k) + \lambda\, \partial r(x^k)$
       $x^{k+1} = \operatorname{prox}_{\alpha_k r}(x^k - \alpha_k \nabla f(x^k))$
     How much computation does one iteration take?
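The third update rule is the proximal-gradient step. Here is a minimal sketch for an assumed instance with $f(x) = \tfrac12\|Ax - b\|^2$ and $r(x) = \|x\|_1$; the data, the regularization weight, and the $1/L$ step size are illustration-only choices.

```python
import numpy as np

# Assumed instance for the proximal-gradient update:
#   f(x) = 0.5 * ||A x - b||^2  (smooth part),  r(x) = ||x||_1
rng = np.random.default_rng(1)
m, n, lam = 50, 10, 0.1
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

def grad_f(x):
    return A.T @ (A @ x - b)

def prox_r(v, t):
    # prox of t * lam * ||.||_1: soft-thresholding at level t * lam
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

alpha = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L, L = ||A||_2^2
x = np.zeros(n)
for k in range(500):
    # x^{k+1} = prox_{alpha r}(x^k - alpha * grad f(x^k))
    x = prox_r(x - alpha * grad_f(x), alpha)
print(x)
```

Every such iteration touches all $m$ components of $f$ through $\nabla f(x^k)$, which motivates the incremental methods below.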

  10. Incremental gradient methods
     What if at iteration $k$ we randomly pick an integer $i(k) \in \{1, 2, \ldots, m\}$ and instead just perform the update
       $x^{k+1} = x^k - \alpha_k \nabla f_{i(k)}(x^k)$?
     ◮ The update requires only the gradient of $f_{i(k)}$
     ◮ One iteration is now $m$ times faster than with $\nabla f(x)$
     But does this make sense?
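A minimal sketch of the incremental (stochastic) update for an assumed least-squares objective $f(x) = \sum_i \tfrac12 (a_i^\top x - b_i)^2$; the diminishing step-size schedule is an illustrative choice, not prescribed by the slides.

```python
import numpy as np

# Assumed least-squares pieces: f_i(x) = 0.5 * (a_i^T x - b_i)^2, so
# grad f_i(x) = a_i * (a_i^T x - b_i) touches only one data point per iteration.
rng = np.random.default_rng(2)
m, n = 1000, 10
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true                              # consistent (noiseless) system

x = np.zeros(n)
for k in range(20000):
    i = rng.integers(m)                     # randomly pick i(k) in {0, ..., m-1}
    g_i = A[i] * (A[i] @ x - b[i])          # gradient of f_{i(k)} alone
    alpha = 0.02 / (1 + k / 5000)           # diminishing step size alpha_k
    x = x - alpha * g_i                     # x^{k+1} = x^k - alpha_k grad f_{i(k)}(x^k)

print(np.linalg.norm(x - x_true))           # small: each step costs O(n), not O(mn)
```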

  11. Incremental gradient methods (cont.)
     ♥ Old idea; it has been used extensively as backpropagation in neural networks, Widrow-Hoff least mean squares, gradient methods with errors, stochastic gradient methods, etc.
     ♥ Can be used effectively to "stream" through data: go through the components one by one, say cyclically instead of randomly (see the sketch below)
     ♥ If $m$ is very large, many of the $f_i(x)$ may have similar minimizers; by using the $f_i$ only individually, we hope to take advantage of this fact and greatly speed up convergence.
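A sketch of the cyclic "streaming" order mentioned in the second bullet, using the same assumed least-squares pieces as in the previous example: each epoch makes one in-order sweep through the $m$ components.

```python
import numpy as np

# Same assumed pieces f_i(x) = 0.5 * (a_i^T x - b_i)^2, streamed cyclically.
rng = np.random.default_rng(3)
m, n = 1000, 10
A = rng.normal(size=(m, n))
b = A @ rng.normal(size=n)

x = np.zeros(n)
for epoch in range(30):
    alpha = 0.02 / (1 + epoch)              # diminishing step size per sweep
    for i in range(m):                      # cyclic order: i(k) = k mod m
        x = x - alpha * A[i] * (A[i] @ x - b[i])

print(np.linalg.norm(A @ x - b))            # residual after 30 sweeps
```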
