  1. Complexity of Composite Optimization. Guanghui (George) Lan, University of Florida, Georgia Institute of Technology (from 1/2016). NIPS Optimization for Machine Learning Workshop, December 11, 2015.

  2. Background · Complex composite problems · Finite-sum problems · Summary
General CP methods. Problem: Ψ* = min_{x ∈ X} Ψ(x), where X is closed and convex and Ψ is convex. Goal: find an ε-solution, i.e., x̄ ∈ X s.t. Ψ(x̄) − Ψ* ≤ ε. Complexity: the number of (sub)gradient evaluations of Ψ. If Ψ is smooth: O(1/√ε). If Ψ is nonsmooth: O(1/ε²). If Ψ is strongly convex (and smooth): O(log(1/ε)).

  3. Composite optimization problems. We consider composite problems which can be modeled as Ψ* = min_{x ∈ X} { Ψ(x) := f(x) + h(x) }. Here f : X → R is a smooth and expensive term (data fitting), h : X → R is a nonsmooth regularization term (solution structure), and X is a closed convex feasible set. Three challenging cases: (a) h or X is not necessarily simple; (b) f is given by the summation of many terms; (c) f (or h) is nonconvex and possibly stochastic.
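
To make the template concrete, a standard instance is the lasso problem, with f the least-squares fit and h an ℓ1 regularizer. The sketch below is illustrative only (it is not from the talk, and the helper name `make_lasso` is hypothetical):

```python
import numpy as np

# Illustrative composite instance (lasso): f(x) = 1/2*||Ax - b||^2 is the
# smooth data-fitting term, h(x) = lam*||x||_1 is the nonsmooth regularizer,
# and X = R^n.
def make_lasso(A, b, lam):
    f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
    grad_f = lambda x: A.T @ (A @ x - b)        # gradient of the smooth term
    h = lambda x: lam * np.linalg.norm(x, 1)    # nonsmooth term
    psi = lambda x: f(x) + h(x)                 # composite objective Psi
    return f, grad_f, h, psi

f, grad_f, h, psi = make_lasso(np.eye(2), np.array([1.0, 0.0]), 1.0)
x = np.array([1.0, 1.0])   # f(x) = 0.5, h(x) = 2.0, psi(x) = 2.5
```

Evaluating grad_f is the expensive part once A holds a large data set, which is exactly the asymmetry the talk exploits.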

  4. Existing complexity results. Problem: Ψ* := min_{x ∈ X} { Ψ(x) := f(x) + h(x) }. First-order methods: iterative methods which operate with the gradients (subgradients) of f and h. Complexity: number of iterations needed to find an ε-solution, i.e., a point x̄ ∈ X s.t. Ψ(x̄) − Ψ* ≤ ε. Easy case (h simple, X simple): Pr_{X,h}(y) := argmin_{x ∈ X} ‖y − x‖² + h(x) is easy to compute (e.g., compressed sensing). Complexity: O(1/√ε) (Nesterov 07, Tseng 08, Beck and Teboulle 09).
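
For h(x) = λ‖x‖₁ and X = Rⁿ, the operator Pr_{X,h} above has a closed form, which is why this case counts as easy. A minimal sketch (illustrative; note the slide's unhalved quadratic shifts the soft-threshold level from the usual λ to λ/2):

```python
import numpy as np

def prox_l1(y, lam):
    """Pr_{X,h}(y) = argmin_x ||y - x||^2 + lam*||x||_1 over X = R^n.
    With the unhalved quadratic as written on the slide, the minimizer
    is coordinate-wise soft-thresholding at level lam/2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

z = prox_l1(np.array([3.0, -0.2, 1.0]), 1.0)   # -> [2.5, 0.0, 0.5]
```

Each evaluation is O(n), so the prox step is negligible next to a gradient of f.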


  6. More difficult cases. (i) h general, X simple: h is a general nonsmooth function; P_X(y) := argmin_{x ∈ X} ‖y − x‖² is easy to compute (e.g., total variation). Complexity: O(1/ε²). (ii) h structured, X simple: h is structured, e.g., h(x) = max_{y ∈ Y} ⟨Ax, y⟩; P_X is easy to compute (e.g., total variation). Complexity: O(1/ε). (iii) h simple, X complicated: L_{X,h}(y) := argmin_{x ∈ X} ⟨y, x⟩ + h(x) is easy to compute (e.g., matrix completion). Complexity: O(1/ε).
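
In case (iii) only a linear optimization oracle over X is required. As a toy illustration (not from the talk), take h ≡ 0 and X an ℓ1-ball; for the slide's matrix-completion example the analogous oracle over a nuclear-norm ball returns a rank-one matrix from the top singular pair:

```python
import numpy as np

def lo_l1_ball(y, r=1.0):
    """Illustrative linear optimization oracle: argmin_{||x||_1 <= r} <y, x>.
    The minimizer puts all mass on one largest-magnitude coordinate of y,
    with the opposite sign (a vertex of the l1-ball)."""
    i = int(np.argmax(np.abs(y)))
    x = np.zeros_like(y, dtype=float)
    x[i] = -r * np.sign(y[i])
    return x

v = lo_l1_ball(np.array([1.0, -3.0, 2.0]))   # -> [0.0, 1.0, 0.0]
```

Such oracles are much cheaper than projections when X is complicated, which is what conditional-gradient-type methods exploit at the O(1/ε) rate quoted above.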



  9. Motivation. Convergence rate vs. problem structure (illustrative gradient-evaluation counts, consistent with ε = 10⁻⁴):
     h simple, X simple:      O(1/√ε)  →  ~10²
     h general, X simple:     O(1/ε²)  →  ~10⁸
     h structured, X simple:  O(1/ε)   →  ~10⁴
     h simple, X complicated: O(1/ε)   →  ~10⁴
More general h or more complicated X ⇒ slow convergence of first-order algorithms ⇒ a large number of gradient evaluations of ∇f. Question: Can we skip the computation of ∇f?

  11. Composite problems. Ψ* = min_{x ∈ X} { Ψ(x) := f(x) + h(x) }. f is smooth, i.e., ∃ L > 0 s.t. ∀ x, y ∈ X, ‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖. h is nonsmooth, i.e., ∃ M > 0 s.t. ∀ x, y ∈ X, |h(x) − h(y)| ≤ M‖y − x‖. P_X is simple to compute. Question: How many gradient evaluations of ∇f and subgradient evaluations of h′ are needed to find an ε-solution?
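
For the least-squares / ℓ1 pair these constants are L = λ_max(AᵀA) and M = λ√n. A quick randomized sanity check of the two conditions (illustrative only, not from the talk):

```python
import numpy as np

# Randomized check that f(x) = 1/2*||Ax-b||^2 is L-smooth with
# L = lambda_max(A^T A), and that h(x) = lam*||x||_1 is M-Lipschitz
# with M = lam*sqrt(n) (since ||.||_1 <= sqrt(n)*||.||_2).
rng = np.random.default_rng(0)
m, n, lam = 30, 8, 0.5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
L = np.linalg.norm(A.T @ A, 2)      # spectral norm of A^T A
M = lam * np.sqrt(n)
grad_f = lambda x: A.T @ (A @ x - b)
h = lambda x: lam * np.linalg.norm(x, 1)

ok = True
for _ in range(200):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    ok = ok and np.linalg.norm(grad_f(y) - grad_f(x)) <= L * np.linalg.norm(y - x) + 1e-9
    ok = ok and abs(h(x) - h(y)) <= M * np.linalg.norm(y - x) + 1e-9
```

Both inequalities hold on every sampled pair, as the choice of L and M guarantees.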

  12. Existing results. Existing algorithms evaluate ∇f and h′ together at each iteration. Mirror-prox method (Juditsky, Nemirovski and Tauvel, 11): O( L/ε + M²/ε² ). Accelerated stochastic approximation (Lan, 12): O( √(L/ε) + M²/ε² ). Issue: whenever the second term dominates, the number of gradient evaluations of ∇f is O(1/ε²).

  13. Bottleneck for composite problems. The computation of ∇f, however, is often the bottleneck in comparison with that of h′: the computation of ∇f involves a large data set, while that of h′ only involves a very sparse matrix. In total variation minimization, computing the gradient costs O(m × n), while computing the subgradient costs O(n). Can we reduce the number of gradient evaluations of ∇f from O(1/ε²) to O(1/√ε), while still maintaining the optimal O(1/ε²) bound on subgradient evaluations of h′?

  14. The gradient sliding algorithm.
Algorithm 1: The gradient sliding (GS) algorithm
  Input: initial point x₀ ∈ X and iteration limit N. Let β_k ≥ 0, γ_k ≥ 0, and T_k ≥ 0 be given and set x̄₀ = x₀.
  for k = 1, 2, ..., N do
    1. Set x̲_k = (1 − γ_k) x̄_{k−1} + γ_k x_{k−1} and g_k = ∇f(x̲_k).
    2. Set (x_k, x̃_k) = PS(g_k, x_{k−1}, β_k, T_k).
    3. Set x̄_k = (1 − γ_k) x̄_{k−1} + γ_k x̃_k.
  end for
  Output: x̄_N.
PS: the prox-sliding procedure.

  15. The PS procedure.
Procedure (x⁺, x̃⁺) = PS(g, x, β, T)
  Let the parameters p_t > 0 and θ_t ∈ [0, 1], t = 1, ..., be given. Set u₀ = ũ₀ = x.
  for t = 1, 2, ..., T do
    u_t = argmin_{u ∈ X} ⟨g + h′(u_{t−1}), u⟩ + (β/2) ‖u − x‖² + (β p_t/2) ‖u − u_{t−1}‖²,
    ũ_t = (1 − θ_t) ũ_{t−1} + θ_t u_t.
  end for
  Set x⁺ = u_T and x̃⁺ = ũ_T.
Note: each ‖· − ·‖²/2 can be replaced by the more general Bregman distance V(x, u) = ω(u) − ω(x) − ⟨∇ω(x), u − x⟩.
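
The outer loop and the PS procedure fit in a few lines of NumPy. The sketch below is a minimal illustration under simplifying assumptions (Euclidean prox, X = Rⁿ so the projection is trivial, and the parameter choices from the convergence results later in the talk); all function names are hypothetical:

```python
import numpy as np

def ps(g, x, beta, T, h_sub, proj):
    """Prox-sliding: approximately solve argmin_{u in X} <g,u> + h(u) + (beta/2)*||u-x||^2
    using T subgradient steps on h and no new gradients of f."""
    u = x.copy()
    u_tilde = x.copy()
    for t in range(1, T + 1):
        p_t = t / 2.0
        theta_t = 2.0 * (t + 1) / (t * (t + 3))
        # closed-form minimizer of <g + h'(u_{t-1}), u> + (beta/2)*||u-x||^2
        # + (beta*p_t/2)*||u-u_{t-1}||^2, projected onto X
        u = proj((beta * x + beta * p_t * u - (g + h_sub(u))) / (beta * (1.0 + p_t)))
        u_tilde = (1.0 - theta_t) * u_tilde + theta_t * u
    return u, u_tilde

def gradient_sliding(grad_f, h_sub, proj, x0, L, M, N, D_tilde=1.0):
    """GS outer loop: one gradient of f per outer iteration, T_k subgradients of h."""
    x = x0.copy()
    x_bar = x0.copy()
    for k in range(1, N + 1):
        beta_k = 2.0 * L / k
        gamma_k = 2.0 / (k + 1)
        T_k = max(1, int(np.ceil(M ** 2 * N * k ** 2 / (D_tilde * L ** 2))))
        x_md = (1.0 - gamma_k) * x_bar + gamma_k * x   # search point
        g_k = grad_f(x_md)                             # the only f-gradient this iteration
        x, x_tilde = ps(g_k, x, beta_k, T_k, h_sub, proj)
        x_bar = (1.0 - gamma_k) * x_bar + gamma_k * x_tilde
    return x_bar

# Small lasso demo on X = R^n (projection is the identity)
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
lam = 0.1
L = np.linalg.norm(A.T @ A, 2)
M = lam * np.sqrt(10)
grad_f = lambda x: A.T @ (A @ x - b)
h_sub = lambda x: lam * np.sign(x)
psi = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1)
x0 = np.zeros(10)
x_out = gradient_sliding(grad_f, h_sub, lambda x: x, x0, L, M, N=30)
```

The point of the structure, visible in the code, is that grad_f is called once per outer iteration while h_sub is called T_k times inside PS.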

  16. Remarks. When supplied with g(·), x ∈ X, β, and T, the PS procedure computes a pair of approximate solutions (x⁺, x̃⁺) ∈ X × X for the problem argmin_{u ∈ X} { Φ(u) := ⟨g, u⟩ + h(u) + (β/2) ‖u − x‖² }. In each outer iteration of GS, the subproblem is argmin_{u ∈ X} { Φ_k(u) := ⟨g_k, u⟩ + h(u) + (β_k/2) ‖u − x_{k−1}‖² }, which requires no further evaluations of ∇f inside PS.

  17. Convergence of the PS procedure.
Proposition. If {p_t} and {θ_t} in the PS procedure satisfy p_t = t/2 and θ_t = 2(t+1)/(t(t+3)), then for any t ≥ 1 and u ∈ X,
  Φ(ũ_t) − Φ(u) + [β(t+1)(t+2) / (2t(t+3))] ‖u_t − u‖² ≤ [β / (t(t+3))] ‖u₀ − u‖² + M² / (β(t+3)).

  18. Convergence of the GS algorithm.
Theorem. Suppose that the previous conditions on {p_t} and {θ_t} hold, and that N is given a priori. If β_k = 2L/k, γ_k = 2/(k+1), and T_k = ⌈ M²Nk² / (D̃L²) ⌉ for some D̃ > 0, then
  Ψ(x̄_N) − Ψ(x*) ≤ [L / (N(N+1))] ( 3‖x₀ − x*‖²/2 + 2D̃ ).
Remark: N need NOT be given a priori if X is bounded.
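
Under this parameter choice the total oracle effort can be tallied directly: N gradient evaluations of ∇f and Σ_k T_k subgradient evaluations of h′. A small counting sketch (illustrative; it assumes the theorem's bound is used to pick N, with `dist` standing in for ‖x₀ − x*‖):

```python
import math

def gs_oracle_counts(L, M, eps, D_tilde=1.0, dist=1.0):
    """Pick N so that L*(3*dist^2/2 + 2*D_tilde)/(N*(N+1)) <= eps, then tally
    gradient and subgradient evaluations with T_k = ceil(M^2*N*k^2/(D_tilde*L^2))."""
    N = math.ceil(math.sqrt(L * (1.5 * dist ** 2 + 2.0 * D_tilde) / eps))
    grad_evals = N                      # one gradient of f per outer iteration: O(sqrt(L/eps))
    subgrad_evals = sum(
        max(1, math.ceil(M ** 2 * N * k ** 2 / (D_tilde * L ** 2)))
        for k in range(1, N + 1)
    )                                   # roughly M^2*N^4/(3*D_tilde*L^2) = O(M^2/eps^2)
    return grad_evals, subgrad_evals

g1, s1 = gs_oracle_counts(L=100.0, M=1.0, eps=1e-2)
g2, s2 = gs_oracle_counts(L=100.0, M=1.0, eps=1e-4)
```

Shrinking ε by 100 multiplies the gradient count by about 10 (the O(1/√ε) regime) while the subgradient count grows by roughly 10⁴ (the O(1/ε²) regime), which is exactly the separation the talk is after.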
