Summary: Linear Convergence, Strictly Convex f
Steepest descent: linear rate approx (1 − 2/κ);
Heavy-ball: linear rate approx (1 − 2/√κ). Big difference!
To reduce ‖x_k − x*‖ by a factor ε, need k large enough that
(1 − 2/κ)^k ≤ ε  ⇐  k ≥ (κ/2)|log ε|   (steepest descent)
(1 − 2/√κ)^k ≤ ε  ⇐  k ≥ (√κ/2)|log ε|   (heavy-ball)
A factor of √κ difference; e.g. if κ = 1000 (not at all uncommon in inverse problems), need ∼30 times fewer steps.
Conjugate Gradient
Basic conjugate gradient (CG) step is
x_{k+1} = x_k + α_k p_k,   p_k = −∇f(x_k) + γ_k p_{k−1}.
Can be identified with heavy-ball, with β_k = (α_k/α_{k−1}) γ_k.
However, CG can be implemented in a way that doesn't require knowledge (or estimation) of L and µ:
Choose α_k to (approximately) minimize f along p_k;
Choose γ_k by a variety of formulae (Fletcher-Reeves, Polak-Ribière, etc.), all of which are equivalent if f is convex quadratic; e.g.
γ_k = ‖∇f(x_k)‖² / ‖∇f(x_{k−1})‖².
Conjugate Gradient
Nonlinear CG: variants include Fletcher-Reeves, Polak-Ribière, Hestenes.
Restarting periodically with p_k = −∇f(x_k) is useful (e.g. every n iterations, or when p_k is not a descent direction).
For quadratic f, convergence analysis is based on eigenvalues of A and Chebyshev polynomials, min-max arguments. Get
Finite termination in as many iterations as there are distinct eigenvalues;
Asymptotic linear convergence with rate approx 1 − 2/√κ (like heavy-ball). (Nocedal and Wright, 2006)
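A minimal NumPy sketch of nonlinear CG with the Fletcher-Reeves formula and periodic restarts, as described above. The function names, the Armijo backtracking (standing in for an exact line minimization), and the tolerances are illustrative assumptions, not part of the slides.

```python
import numpy as np

def nonlinear_cg_fr(f, grad, x0, iters=200, restart_every=None):
    # Sketch: Fletcher-Reeves gamma, backtracking line search, periodic restart.
    x = x0.copy()
    g = grad(x)
    p = -g
    restart_every = restart_every or len(x0)
    for k in range(iters):
        if k % restart_every == 0 or g.dot(p) >= 0:
            p = -g                                   # restart with steepest descent
        alpha, fx = 1.0, f(x)
        while f(x + alpha * p) > fx + 1e-4 * alpha * g.dot(p) and alpha > 1e-12:
            alpha *= 0.5                             # crude stand-in for minimizing f along p
        x = x + alpha * p
        g_new = grad(x)
        gamma = g_new.dot(g_new) / g.dot(g)          # Fletcher-Reeves formula from the slide
        p = -g_new + gamma * p
        g = g_new
    return x
```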
Accelerated First-Order Methods
Accelerate the rate to 1/k² for weakly convex, while retaining the linear rate (related to √κ) for the strongly convex case.
Nesterov (1983) describes a method that requires κ.
Initialize: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0.
Iterate: x_{k+1} ← y_k − (1/L)∇f(y_k);   (*short-step*)
  find α_{k+1} ∈ (0, 1):  α²_{k+1} = (1 − α_{k+1}) α²_k + α_{k+1}/κ;
  set β_k = α_k(1 − α_k) / (α²_k + α_{k+1});
  set y_{k+1} ← x_{k+1} + β_k (x_{k+1} − x_k).
Still works for weakly convex (κ = ∞).
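A sketch of the scheme above in NumPy; the choice of α_0, the iteration count, and the handling of κ = ∞ (weakly convex case) are assumptions for illustration.

```python
import numpy as np

def nesterov_accel(grad, x0, L, kappa=np.inf, iters=500):
    # Sketch of Nesterov's constant-step scheme described on the slide.
    x = x0.copy()
    y = x0.copy()
    alpha = 0.9                                  # any alpha_0 in (0, 1)
    q = 0.0 if np.isinf(kappa) else 1.0 / kappa
    for _ in range(iters):
        x_new = y - grad(y) / L                  # short gradient step
        # alpha_{k+1} solves a^2 = (1 - a) alpha^2 + a / kappa (positive root)
        b = alpha ** 2 - q
        alpha_new = 0.5 * (-b + np.sqrt(b ** 2 + 4 * alpha ** 2))
        beta = alpha * (1 - alpha) / (alpha ** 2 + alpha_new)
        y = x_new + beta * (x_new - x)           # momentum step
        x, alpha = x_new, alpha_new
    return x
```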
[Figure: iterates x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2}]
Separates the “gradient descent” and “momentum” step components.
Convergence Results: Nesterov
If α_0 ≥ 1/√κ, have
f(x_k) − f(x*) ≤ c_1 min{ (1 − 1/√κ)^k , 4L/(√L + c_2 k)² },
where constants c_1 and c_2 depend on x_0, α_0, L.
Linear convergence at the “heavy-ball” rate for strongly convex f; 1/k² sublinear rate otherwise.
In the special case of α_0 = 1/√κ, this scheme yields α_k ≡ 1/√κ, β_k ≡ 1 − 2/(√κ + 1).
Beck and Teboulle
Beck and Teboulle (2009) propose a similar algorithm, with a fairly short and elementary analysis (though still not intuitive).
Initialize: Choose x_0; set y_1 = x_0, t_1 = 1.
Iterate: x_k ← y_k − (1/L)∇f(y_k);
  t_{k+1} ← (1/2)[1 + √(1 + 4t²_k)];
  y_{k+1} ← x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}).
For (weakly) convex f, converges with f(x_k) − f(x*) ∼ 1/k².
When L is not known, increase an estimate of L until it’s big enough.
Beck and Teboulle (2009) do the convergence analysis in 2-3 pages; elementary, but “technical.”
A Non-Monotone Gradient Method: Barzilai-Borwein
Barzilai and Borwein (1988) (BB) proposed an unusual choice of α_k. Allows f to increase (sometimes a lot) on some steps: non-monotone.
x_{k+1} = x_k − α_k ∇f(x_k),   α_k := arg min_α ‖s_k − α z_k‖²,
where s_k := x_k − x_{k−1},  z_k := ∇f(x_k) − ∇f(x_{k−1}).
Explicitly, we have α_k = s_k^T z_k / z_k^T z_k.
Note that for f(x) = (1/2) x^T A x, we have
α_k = s_k^T A s_k / s_k^T A² s_k ∈ [1/L, 1/µ].
BB can be viewed as a quasi-Newton method, with the Hessian approximated by α_k^{−1} I.
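A small NumPy sketch of the BB iteration above; the initial step α_0 and the fallback when z_k^T z_k vanishes are assumptions.

```python
import numpy as np

def bb_gradient(grad, x0, alpha0=1.0, iters=200):
    # Non-monotone Barzilai-Borwein gradient sketch.
    x = x0.copy()
    g = grad(x)
    alpha = alpha0
    for _ in range(iters):
        x_new = x - alpha * g
        g_new = grad(x_new)
        s, z = x_new - x, g_new - g
        zz = z.dot(z)
        alpha = s.dot(z) / zz if zz > 0 else alpha0   # alpha_k = s'z / z'z from the slide
        x, g = x_new, g_new
    return x
```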
Comparison: BB vs Greedy Steepest Descent
There Are Many BB Variants
use α_k = s_k^T s_k / s_k^T z_k in place of α_k = s_k^T z_k / z_k^T z_k;
alternate between these two formulae;
hold α_k constant for a number (2, 3, 5) of successive steps;
take α_k to be the steepest descent step from the previous iteration.
Nonmonotonicity appears essential to performance. Some variants get global convergence by requiring a sufficient decrease in f over the worst of the last M (say 10) iterates.
The original 1988 analysis in BB’s paper is nonstandard and illuminating (just for a 2-variable quadratic). In fact, most analyses of BB and related methods are nonstandard, and consider only special cases. The precursor of such analyses is Akaike (1959). More recently, see Ascher, Dai, Fletcher, Hager and others.
Extending to the Constrained Case: x ∈ Ω
How to change these methods to handle the constraint x ∈ Ω (assuming that Ω is a closed convex set)?
Some algorithms and theory stay much the same, ...if we can involve the constraint x ∈ Ω explicitly in the subproblems.
Example: Nesterov’s constant step scheme requires just one calculation to be changed from the unconstrained version.
Initialize: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0.
Iterate: x_{k+1} ← arg min_{y∈Ω} (1/2)‖y − [y_k − (1/L)∇f(y_k)]‖²_2;
  find α_{k+1} ∈ (0, 1):  α²_{k+1} = (1 − α_{k+1}) α²_k + α_{k+1}/κ;
  set β_k = α_k(1 − α_k) / (α²_k + α_{k+1});
  set y_{k+1} ← x_{k+1} + β_k (x_{k+1} − x_k).
Convergence theory is unchanged.
Regularized Optimization
How to change these methods to handle regularized optimization?
min_x f(x) + τψ(x),
where f is convex and smooth, while ψ is convex but usually nonsmooth.
Often, all that is needed is to change the update step to
x_{k+1} = arg min_x ‖x − Φ(x_k)‖²_2 + λψ(x),
where Φ(x_k) is a gradient descent step, or something more complicated; e.g., an accelerated method with Φ(x_k, x_{k−1}) instead of Φ(x_k).
This is the shrinkage/thresholding step; how to solve it with nonsmooth ψ? That’s the topic of the following slides.
Handling Nonsmoothness (e.g. ℓ1 Norm)
Convexity ⇒ continuity (on the domain of the function).
Convexity ⇏ differentiability (e.g., ψ(x) = ‖x‖_1).
Subgradients generalize gradients for general convex functions:
v is a subgradient of f at x if f(x′) ≥ f(x) + v^T(x′ − x)   (a linear lower bound).
Subdifferential: ∂f(x) = {all subgradients of f at x}.
If f is differentiable, ∂f(x) = {∇f(x)}.
[Figures: linear lower bound; nondifferentiable case]
More on Subgradients and Subdifferentials
The subdifferential is a set-valued function:
f : R^d → R  ⇒  ∂f : R^d → 2^{R^d}  (power set of R^d).
Example:
f(x) = −2x − 1 for x ≤ −1;  −x for −1 < x ≤ 0;  x²/2 for x > 0.
∂f(x) = {−2} for x < −1;  [−2, −1] at x = −1;  {−1} for −1 < x < 0;  [−1, 0] at x = 0;  {x} for x > 0.
Fermat’s Rule: x ∈ arg min_x f(x) ⇔ 0 ∈ ∂f(x).
A Key Tool: Moreau’s Proximity Operators
Moreau (1962) proximity operator:
prox_ψ(y) := arg min_x (1/2)‖x − y‖²_2 + ψ(x)
...well defined for convex ψ, since ‖· − y‖²_2 is coercive and strictly convex.
Example (seen above): prox_{τ|·|}(y) = soft(y, τ) = sign(y) max{|y| − τ, 0}.
Block separability: x = (x_1, ..., x_N) (a partition of the components of x):
ψ(x) = Σ_i ψ_i(x_i)  ⇒  (prox_ψ(y))_i = prox_{ψ_i}(y_i).
Relationship with subdifferential: z = prox_ψ(y) ⇔ y − z ∈ ∂ψ(z).
Resolvent: z = prox_ψ(y) ⇔ 0 ∈ ∂ψ(z) + (z − y) ⇔ y ∈ (∂ψ + I)(z), i.e. prox_ψ(y) = (∂ψ + I)^{−1}(y).
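A minimal NumPy sketch of the soft-threshold example above; by the block-separability property it is also prox_{τ‖·‖_1} applied componentwise. The function name is an illustrative choice.

```python
import numpy as np

def soft(y, tau):
    # prox of tau*|.| applied componentwise: sign(y) * max(|y| - tau, 0)
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# e.g. soft(np.array([3.0, -0.5, 1.2]), 1.0) -> array([ 2. , -0. ,  0.2])
```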
Important Proximity Operators
Soft-thresholding is the proximity operator of the ℓ1 norm.
Consider the indicator ι_S of a convex set S:
prox_{ι_S}(u) = arg min_x (1/2)‖x − u‖²_2 + ι_S(x) = arg min_{x∈S} (1/2)‖x − u‖²_2 = P_S(u)
...the Euclidean projection on S.
Squared Euclidean norm (separable, smooth):
prox_{τ‖·‖²_2}(y) = arg min_x (1/2)‖x − y‖²_2 + τ‖x‖²_2 = y/(1 + 2τ).
Euclidean norm (not separable, nonsmooth):
prox_{τ‖·‖_2}(y) = (y/‖y‖_2)(‖y‖_2 − τ)  if ‖y‖_2 > τ;   0  if ‖y‖_2 ≤ τ.
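A sketch of the three prox operators above in NumPy; the box example stands in for a generic convex set S (projection onto a box), and the scaling in prox_sq_l2 follows the (1/2)‖x − y‖² convention used on the slide.

```python
import numpy as np

def prox_sq_l2(y, tau):
    # prox of tau*||.||_2^2: shrink toward the origin
    return y / (1.0 + 2.0 * tau)

def prox_l2(y, tau):
    # prox of tau*||.||_2: block soft-threshold (0 if ||y||_2 <= tau)
    norm = np.linalg.norm(y)
    return np.zeros_like(y) if norm <= tau else (1.0 - tau / norm) * y

def prox_indicator_box(y, lo, hi):
    # prox of the indicator of a box = Euclidean projection onto it
    return np.clip(y, lo, hi)
```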
More Proximity Operators (Combettes and Pesquet, 2011)
Another Key Tool: Fenchel-Legendre Conjugates
The Fenchel-Legendre conjugate of a proper convex function f — denoted by f* : R^n → R̄ — is defined by
f*(u) = sup_x x^T u − f(x).
Main properties and relationship with proximity operators:
Biconjugation: if f is convex, proper, and lower semicontinuous, f** = f.
Moreau’s decomposition: prox_f(u) + prox_{f*}(u) = u
...meaning that, if you know prox_f, you know prox_{f*}, and vice-versa.
Conjugate of an indicator: if f(x) = ι_C(x), where C is a convex set,
f*(u) = sup_x x^T u − ι_C(x) = sup_{x∈C} x^T u ≡ σ_C(u)  (the support function of C).
From Conjugates to Proximity Operators
Notice that |u| = sup_{x∈[−1,1]} xu = σ_{[−1,1]}(u), thus |·|* = ι_{[−1,1]}.
Using Moreau’s decomposition, we easily derive the soft-threshold:
prox_{τ|·|} = I − prox_{ι_{[−τ,τ]}} = I − P_{[−τ,τ]} = soft(·, τ).
Conjugate of a norm: if f(x) = τ‖x‖_p then f* = ι_{{x : ‖x‖_q ≤ τ}}, where 1/p + 1/q = 1 (a Hölder pair, or Hölder conjugates).
That is, ‖·‖_p and ‖·‖_q are dual norms:
‖z‖_q = sup{x^T z : ‖x‖_p ≤ 1} = sup_{x∈B_p(1)} x^T z = σ_{B_p(1)}(z).
From Conjugates to Proximity Operators
Proximity operator of a norm: prox_{τ‖·‖_p} = I − P_{B_q(τ)},
where B_q(τ) = {x : ‖x‖_q ≤ τ} and 1/p + 1/q = 1.
Example: computing prox_{‖·‖_∞} (notice ℓ∞ is not separable):
Since 1/∞ + 1/1 = 1,  prox_{τ‖·‖_∞} = I − P_{B_1(τ)}
...the proximity operator of the ℓ∞ norm is the residual of the projection on an ℓ1 ball.
Projection on the ℓ1 ball has no closed form, but there are efficient (linear cost) algorithms (Brucker, 1984), (Maculan and de Paula, 1989).
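A sketch of this construction in NumPy. The ℓ1-ball projection below is a standard sort-based O(n log n) routine rather than the linear-cost algorithms cited on the slide; the function names are illustrative.

```python
import numpy as np

def project_l1_ball(y, tau):
    # Euclidean projection of y onto {x : ||x||_1 <= tau} (sort-based, O(n log n)).
    if np.abs(y).sum() <= tau:
        return y.copy()
    u = np.sort(np.abs(y))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - tau))[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)
    return np.sign(y) * np.maximum(np.abs(y) - theta, 0.0)

def prox_linf(y, tau):
    # prox of tau*||.||_inf via Moreau: identity minus projection onto the l1 ball of radius tau
    return y - project_l1_ball(y, tau)
```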
Geometry and Effect of prox ℓ∞
Whereas ℓ1 promotes sparsity, ℓ∞ promotes equality (in absolute value).
From Conjugates to Proximity Operators
The dual of the ℓ2 norm is the ℓ2 norm.
Group Norms and their Prox Operators
Group-norm regularizer: ψ(x) = Σ_{m=1}^{M} λ_m ‖x_{G_m}‖_p.
In the non-overlapping case (G_1, ..., G_M is a partition of {1, ..., n}), simply use separability:
(prox_ψ(u))_{G_m} = prox_{λ_m‖·‖_p}(u_{G_m}).
In the tree-structured case, can get a complete ordering of the groups: G_1 ⪯ G_2 ⪯ ... ⪯ G_M, where (G ⪯ G′) ⇔ (G ⊂ G′) or (G ∩ G′ = ∅).
Define Π_m : R^n → R^n by
(Π_m(u))_{G_m} = prox_{λ_m‖·‖_p}(u_{G_m}),   (Π_m(u))_{Ḡ_m} = u_{Ḡ_m},  where Ḡ_m = {1, ..., n} \ G_m.
Then prox_ψ = Π_M ∘ ··· ∘ Π_2 ∘ Π_1
...only valid for p ∈ {1, 2, ∞} (Jenatton et al., 2011).
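A NumPy sketch of the non-overlapping case for p = 2: apply the block soft-threshold group by group, as the separability formula above indicates. The data structures (a list of index arrays and per-group weights) are assumptions.

```python
import numpy as np

def prox_group_l2(u, groups, lambdas):
    # Non-overlapping group prox, p = 2; 'groups' is a list of index arrays
    # partitioning {0,...,n-1}, 'lambdas' the per-group weights lambda_m.
    x = u.copy()
    for G, lam in zip(groups, lambdas):
        norm = np.linalg.norm(u[G])
        x[G] = 0.0 if norm <= lam else (1.0 - lam / norm) * u[G]
    return x
```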
Matrix Nuclear Norm and its Prox Operator
Recall the trace/nuclear norm: ‖X‖_* = Σ_{i=1}^{min{m,n}} σ_i.
The dual of a Schatten p-norm is a Schatten q-norm, with 1/p + 1/q = 1. Thus, the dual of the nuclear norm is the spectral norm: ‖X‖_∞ = max{σ_1, ..., σ_{min{m,n}}}.
If Y = UΛV^T is the SVD of Y, we have
prox_{τ‖·‖_*}(Y) = UΛV^T − P_{{X : max{σ_1,...,σ_{min{m,n}}} ≤ τ}}(UΛV^T) = U soft(Λ, τ) V^T.
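A short NumPy sketch of singular value thresholding, i.e. the prox of the nuclear norm written above; the function name is an illustrative choice.

```python
import numpy as np

def prox_nuclear(Y, tau):
    # Singular value thresholding: soft-threshold the singular values of Y.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```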
Atomic Norms: A Unified View
Another Use of Fenchel-Legendre Conjugates
The original problem: min_x f(x) + ψ(x).
Often this has the form: min_x g(Ax) + ψ(x).
Using the definition of the conjugate, g(Ax) = sup_u u^T Ax − g*(u):
min_x g(Ax) + ψ(x) = inf_x sup_u u^T Ax − g*(u) + ψ(x)
  = sup_u (−g*(u)) + inf_x [u^T Ax + ψ(x)]
  = sup_u (−g*(u)) − sup_x [−x^T A^T u − ψ(x)]   (the inner sup is ψ*(−A^T u))
  = −inf_u [g*(u) + ψ*(−A^T u)].
The problem inf_u g*(u) + ψ*(−A^T u) is sometimes easier to handle.
Basic Proximal-Gradient Algorithm
Use the basic structure:
x_{k+1} = arg min_x ‖x − Φ(x_k)‖²_2 + ψ(x),
with Φ(x_k) a simple gradient descent step, thus
x_{k+1} = prox_{α_k ψ}( x_k − α_k ∇f(x_k) ).
This approach goes by different names, such as “proximal gradient algorithm” (PGA), “iterative shrinkage/thresholding” (IST or ISTA), “forward-backward splitting” (FBS).
It has been proposed in different communities: optimization, PDEs, convex analysis, signal processing, machine learning.
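A generic NumPy sketch of the iteration above; the caller supplies ∇f and a prox routine, and the fixed step size and iteration count are assumptions.

```python
import numpy as np

def proximal_gradient(grad_f, prox_psi, x0, alpha, iters=500):
    # IST/PGA/FBS sketch: gradient step on f, then prox of alpha*psi.
    # prox_psi(v, s) is assumed to return prox_{s*psi}(v).
    x = x0.copy()
    for _ in range(iters):
        x = prox_psi(x - alpha * grad_f(x), alpha)
    return x

# For the l2-l1 case treated later: f(x) = 0.5*||Bx - b||^2, psi = tau*||.||_1, e.g.
#   soft = lambda y, t: np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
#   x = proximal_gradient(lambda x: B.T @ (B @ x - b),
#                         lambda y, a: soft(y, a * tau), x0, alpha=1.0 / L)
```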
Convergence of the Proximal-Gradient Algorithm
Basic algorithm: x_{k+1} = prox_{α_k ψ}( x_k − α_k ∇f(x_k) ).
Generalized (possibly inexact) version:
x_{k+1} = (1 − λ_k) x_k + λ_k [ prox_{α_k ψ}( x_k − α_k ∇f(x_k) + b_k ) + a_k ],
where a_k and b_k are “errors” in computing the prox and the gradient; λ_k is an over-relaxation parameter.
Convergence is guaranteed (Combettes and Wajs, 2006) if
0 < inf α_k ≤ sup α_k < 2/L;
λ_k ∈ (0, 1], with inf λ_k > 0;
Σ_k ‖a_k‖ < ∞ and Σ_k ‖b_k‖ < ∞.
Proximal-Gradient Algorithm: Quadratic Case
Consider the quadratic case (of great interest): f(x) = (1/2)‖Bx − b‖²_2.
Here, ∇f(x) = B^T(Bx − b) and the IST/PGA/FBS algorithm is
x_{k+1} = prox_{α_k ψ}( x_k − α_k B^T(Bx_k − b) );
it requires only matrix-vector multiplications with B and B^T.
Very important in signal/image processing, where fast algorithms exist to compute these products (e.g. FFT, DWT).
In this case, some more refined convergence results are available.
Even more refined results are available if ψ(x) = τ‖x‖_1.
More on IST/FBS/PGA for the ℓ2-ℓ1 Case
Problem: x̂ ∈ G = arg min_{x∈R^n} (1/2)‖Bx − b‖²_2 + τ‖x‖_1   (recall B^T B ⪯ LI).
IST/FBS/PGA becomes x_{k+1} = soft( x_k − α B^T(Bx_k − b), ατ ), with α < 2/L.
The zero set Z ⊆ {1, ..., n}: x̂ ∈ G ⇒ x̂_Z = 0.
Zeros are found in a finite number of iterations (Hale et al., 2008): after a finite number of iterations, (x_k)_Z = 0.
After that, if B_Z̄^T B_Z̄ ⪰ µI on the complement Z̄ = {1, ..., n} \ Z, with µ > 0 (thus κ(B_Z̄^T B_Z̄) = L/µ < ∞), we have linear convergence
‖x_{k+1} − x̂‖_2 ≤ ((κ − 1)/(κ + 1)) ‖x_k − x̂‖_2
for the optimal choice α = 2/(L + µ) (see unconstrained theory).
Heavy Ball Acceleration: FISTA
FISTA (fast iterative shrinkage-thresholding algorithm) is a heavy-ball-type acceleration of IST (based on Nesterov (1983)) (Beck and Teboulle, 2009).
Initialize: Choose α ≤ 1/L, x_0; set y_1 = x_0, t_1 = 1.
Iterate: x_k ← prox_{ατψ}( y_k − α∇f(y_k) );
  t_{k+1} ← (1/2)[1 + √(1 + 4t²_k)];
  y_{k+1} ← x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}).
Acceleration:
FISTA: f(x_k) − f(x̂) ∼ O(1/k²);   IST: f(x_k) − f(x̂) ∼ O(1/k).
When L is not known, increase an estimate of L until it’s big enough.
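A NumPy sketch following the FISTA pseudocode above; prox_psi(v, s) is assumed to compute prox_{s·τψ}(v), and the iteration count is arbitrary.

```python
import numpy as np

def fista(grad_f, prox_psi, x0, alpha, iters=500):
    # FISTA sketch: proximal-gradient step at y_k, then momentum extrapolation.
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = prox_psi(y - alpha * grad_f(y), alpha)        # prox-gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)       # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev
```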
Heavy Ball Acceleration: TwIST
TwIST (two-step iterative shrinkage-thresholding (Bioucas-Dias and Figueiredo, 2007)) is a heavy-ball-type acceleration of IST, for
min_x (1/2)‖Bx − b‖²_2 + τψ(x).
Iterations (with α < 2/L):
x_{k+1} = (γ − β) x_k + (1 − γ) x_{k−1} + β prox_{ατψ}( x_k − α B^T(Bx_k − b) ).
Analysis in the strongly convex case: µI ⪯ B^T B ⪯ LI, with µ > 0. Condition number κ = L/µ < ∞.
Optimal parameters γ = ρ² + 1 and β = 2α/(µ + L), where ρ = (√κ − 1)/(√κ + 1), yield linear convergence
‖x_{k+1} − x̂‖_2 ≤ ((√κ − 1)/(√κ + 1)) ‖x_k − x̂‖_2,
versus (κ − 1)/(κ + 1) for IST.
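A sketch of the two-step iteration written above; the parameter values (α, β, γ), the prox routine (which should compute prox_{ατψ}), and the iteration count are left to the caller and are assumptions here.

```python
import numpy as np

def twist(B, b, prox, alpha, beta, gamma, x0, iters=300):
    # TwIST-style two-step iteration: combine the two previous iterates with
    # one prox-gradient step, as in the update formula on the slide.
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(iters):
        grad_step = x - alpha * (B.T @ (B @ x - b))
        x_new = (gamma - beta) * x + (1.0 - gamma) * x_prev + beta * prox(grad_step)
        x_prev, x = x, x_new
    return x
```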
Illustration of the TwIST Acceleration
Acceleration via Larger Steps: SpaRSA
The standard step-size α_k ≤ 2/L in IST is too timid.
The SpaRSA (sparse reconstruction by separable approximation) framework proposes bolder choices of α_k (Wright et al., 2009):
Barzilai-Borwein, to mimic Newton steps;
keep increasing α_k until monotonicity is violated: backtrack.
Convergence to critical points (minima in the convex case) is guaranteed for a safeguarded version: ensure sufficient decrease w.r.t. the worst value in the previous M iterations.
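A SpaRSA-flavoured sketch under simplifying assumptions: a Barzilai-Borwein step length backtracked until the full objective decreases. The slides describe a nonmonotone safeguard over the worst of the last M iterates; the monotone acceptance test below is a simplification, and the helper names are illustrative.

```python
import numpy as np

def sparsa_sketch(obj, grad_f, prox_psi, x0, iters=200):
    # obj(x) = f(x) + tau*psi(x); prox_psi(v, s) computes prox_{s*psi}(v).
    x = x0.copy()
    g = grad_f(x)
    alpha = 1.0
    for _ in range(iters):
        step = alpha
        x_new = prox_psi(x - step * g, step)
        while obj(x_new) > obj(x) and step > 1e-12:
            step *= 0.5                                   # backtrack on the step length
            x_new = prox_psi(x - step * g, step)
        g_new = grad_f(x_new)
        s, z = x_new - x, g_new - g
        sz, zz = s.dot(z), z.dot(z)
        alpha = sz / zz if zz > 0 and sz > 0 else 1.0     # BB choice for the next step
        x, g = x_new, g_new
    return x
```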
Another Approach: Gradient Projection
min_x (1/2)‖Bx − b‖²_2 + τ‖x‖_1 can be written as a standard QP:
min_{u,v} (1/2)‖B(u − v) − b‖²_2 + τ u^T 1 + τ v^T 1   s.t. u ≥ 0, v ≥ 0,
where u_i = max{0, x_i} and v_i = max{0, −x_i}.
With z = [u; v], the problem can be written in canonical form:
min_z (1/2) z^T Q z + c^T z   s.t. z ≥ 0.
Solving this problem with projected gradient using Barzilai-Borwein steps: GPSR (gradient projection for sparse reconstruction) (Figueiredo et al., 2007).
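A sketch of the split-variable QP solved by plain projected gradient with a fixed step; GPSR itself uses Barzilai-Borwein steps with backtracking, so this fixed-step version and its step-size choice are simplifying assumptions.

```python
import numpy as np

def gradient_projection_l1(B, b, tau, iters=300):
    # x = u - v with u, v >= 0; projected gradient on the bound-constrained QP above.
    m, n = B.shape
    u = np.zeros(n)
    v = np.zeros(n)
    L = np.linalg.norm(B, 2) ** 2          # largest eigenvalue of B^T B
    step = 1.0 / (2.0 * L)                 # conservative fixed step for the QP
    for _ in range(iters):
        r = B @ (u - v) - b
        grad_u = B.T @ r + tau             # gradient of the QP w.r.t. u
        grad_v = -B.T @ r + tau            # gradient of the QP w.r.t. v
        u = np.maximum(u - step * grad_u, 0.0)   # projection onto the nonnegative orthant
        v = np.maximum(v - step * grad_v, 0.0)
    return u - v
```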
Speed Comparisons
Lorenz (2011) proposed a way of generating problem instances with known solution x̂: useful for speed comparison. Define
R_k = ‖x_k − x̂‖_2 / ‖x̂‖_2   and   r_k = (L(x_k) − L(x̂)) / L(x̂),   where L(x) = f(x) + τψ(x).
More Speed Comparisons
Even More Speed Comparisons
Acceleration by Continuation
IST/FBS/PGA can be very slow if τ is very small and/or f is poorly conditioned.
A very simple acceleration strategy: continuation/homotopy.
Initialization: Set τ_0 ≫ τ, starting point x̄, factor σ ∈ (0, 1), and k = 0.
Iterations:
  Find an approx solution x(τ_k) of min_x f(x) + τ_k ψ(x), starting from x̄;
  if τ_k = τ, STOP;
  Set τ_{k+1} ← max(τ, στ_k) and x̄ ← x(τ_k).
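A minimal wrapper implementing the continuation loop above; the inner routine solve(tau, x_start), assumed to run an IST/FISTA-type solver to moderate accuracy with a warm start, is an illustrative placeholder.

```python
def continuation(solve, tau_target, tau0, x_init, sigma=0.5):
    # Warm-started continuation: decrease tau geometrically toward the target value.
    tau = tau0
    x = x_init
    while True:
        x = solve(tau, x)                      # approx solution of min f + tau*psi, warm started
        if tau == tau_target:
            return x
        tau = max(tau_target, sigma * tau)     # tau_{k+1} <- max(tau, sigma * tau_k)
```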