Subgradient method
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725
Remember gradient descent

We want to solve

    min_{x ∈ R^n} f(x)

for f convex and differentiable.

Gradient descent: choose initial x^(0) ∈ R^n, repeat:

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, ...

If ∇f is Lipschitz, gradient descent has convergence rate O(1/k)

Downsides:
• Can be slow ← later
• Doesn't work for nondifferentiable functions ← today
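For reference, a minimal sketch of this iteration in Python (assuming NumPy; the fixed step size and the quadratic test function below are my own illustration, not part of the slides):

```python
import numpy as np

def gradient_descent(grad, x0, t=0.1, iters=100):
    """Run x^(k) = x^(k-1) - t * grad f(x^(k-1)) with a fixed step size t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# Example: f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b, minimizer x = b
b = np.array([1.0, 2.0])
x_min = gradient_descent(lambda x: x - b, x0=np.zeros(2))
```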
Outline

Today:
• Subgradients
• Examples and properties
• Subgradient method
• Convergence rate
Subgradients

Remember that for convex and differentiable f : R^n → R,

    f(y) ≥ f(x) + ∇f(x)^T (y − x)   for all x, y

I.e., the linear approximation always underestimates f.

A subgradient of convex f : R^n → R at x is any g ∈ R^n such that

    f(y) ≥ f(x) + g^T (y − x)   for all y

• Always exists
• If f is differentiable at x, then g = ∇f(x) uniquely
• In fact, the same definition works for nonconvex f (however, a subgradient need not exist)
Examples

Consider f : R → R, f(x) = |x|

[Plot: f(x) = |x| over x ∈ [−2, 2]]

• For x ≠ 0, unique subgradient g = sign(x)
• For x = 0, subgradient g is any element of [−1, 1]
Consider f : R^n → R, f(x) = ‖x‖ (Euclidean norm)

[Plot: f(x) = ‖x‖ over (x_1, x_2)]

• For x ≠ 0, unique subgradient g = x/‖x‖
• For x = 0, subgradient g is any element of {z : ‖z‖ ≤ 1}
Consider f : R^n → R, f(x) = ‖x‖_1

[Plot: f(x) = ‖x‖_1 over (x_1, x_2)]

• For x_i ≠ 0, the ith component is uniquely g_i = sign(x_i)
• For x_i = 0, the ith component g_i is any element of [−1, 1]
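As a small illustration (my own sketch, assuming NumPy), the componentwise rule above translates directly into code; note that np.sign returns 0 at 0, which is one valid choice in [−1, 1]:

```python
import numpy as np

def subgrad_l1(x):
    """Return one valid subgradient of f(x) = ||x||_1 at x:
    sign(x_i) where x_i != 0, and 0 (an element of [-1, 1]) where x_i == 0."""
    return np.sign(x)

g = subgrad_l1(np.array([1.5, 0.0, -2.0]))   # -> array([ 1.,  0., -1.])
```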
Let f_1, f_2 : R^n → R be convex and differentiable, and consider f(x) = max{f_1(x), f_2(x)}

[Plot: pointwise maximum of two convex functions over x ∈ [−2, 2]]

• For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)
• For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)
• For f_1(x) = f_2(x), subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x)
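A minimal code sketch of this case (the function and parameter names are mine, not from the slides); where the two functions tie, any convex combination of their gradients is a valid subgradient:

```python
def subgrad_max(f1, grad1, f2, grad2, x, theta=0.5):
    """Return one subgradient of f(x) = max(f1(x), f2(x)) at x."""
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return grad1(x)                                 # g = grad f1(x)
    if v2 > v1:
        return grad2(x)                                 # g = grad f2(x)
    return theta * grad1(x) + (1 - theta) * grad2(x)    # any point on the segment

# Example: f1(x) = x^2 and f2(x) = (x - 1)^2 tie at x = 0.5
g = subgrad_max(lambda x: x**2, lambda x: 2*x,
                lambda x: (x - 1)**2, lambda x: 2*(x - 1), 0.5)   # g = 0.0
```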
Subdifferential

The set of all subgradients of convex f is called the subdifferential:

    ∂f(x) = {g ∈ R^n : g is a subgradient of f at x}

• ∂f(x) is closed and convex (even for nonconvex f)
• Nonempty (can be empty for nonconvex f)
• If f is differentiable at x, then ∂f(x) = {∇f(x)}
• If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g
Connection to convex geometry

For a convex set C ⊆ R^n, consider the indicator function I_C : R^n → R ∪ {∞},

    I_C(x) = I{x ∈ C} = 0 if x ∈ C,  ∞ if x ∉ C

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x,

    N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}

Why? Recall the definition of a subgradient g,

    I_C(y) ≥ I_C(x) + g^T (y − x)   for all y

• For y ∉ C, I_C(y) = ∞ and the inequality holds trivially
• For y ∈ C, this means 0 ≥ g^T (y − x)
Subgradient calculus

Basic rules for convex functions:
• Scaling: ∂(af) = a · ∂f, provided a > 0
• Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
• Affine composition: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
• Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then

      ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),

  the convex hull of the union of subdifferentials of all active functions at x
• General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then

      ∂f(x) ⊇ cl( conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ) ),

  and under some regularity conditions (on S, f_s), we get equality
• Norms: important special case, f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1. Then

      ∂f(x) = {y : ‖y‖_q ≤ 1 and y^T x = max_{‖z‖_q ≤ 1} z^T x}

  Why is this a special case? Note that

      ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x
Why subgradients?

Subgradients are important for two reasons:
• Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality
• Convex optimization: if you can compute subgradients, then you can minimize (almost) any convex function
Optimality condition

For convex f,

    f(x⋆) = min_{x ∈ R^n} f(x)   ⇔   0 ∈ ∂f(x⋆)

I.e., x⋆ is a minimizer if and only if 0 is a subgradient of f at x⋆

Why? Easy: g = 0 being a subgradient means that for all y,

    f(y) ≥ f(x⋆) + 0^T (y − x⋆) = f(x⋆)

Note the analogy to the differentiable case, where ∂f(x) = {∇f(x)} and the condition reads ∇f(x⋆) = 0
Soft-thresholding

The lasso problem can be parametrized as

    min_x  (1/2)‖y − Ax‖² + λ‖x‖_1

where λ ≥ 0. Consider the simplified problem with A = I:

    min_x  (1/2)‖y − x‖² + λ‖x‖_1

Claim: the solution of the simple problem is x⋆ = S_λ(y), where S_λ is the soft-thresholding operator:

    [S_λ(y)]_i = y_i − λ   if y_i > λ
                 0         if −λ ≤ y_i ≤ λ
                 y_i + λ   if y_i < −λ
Why? Subgradients of f(x) = (1/2)‖y − x‖² + λ‖x‖_1 are

    g = x − y + λs,

where s_i = sign(x_i) if x_i ≠ 0 and s_i ∈ [−1, 1] if x_i = 0

Now just plug in x = S_λ(y) and check that we can get g = 0

[Plot: soft-thresholding in one variable, [S_λ(y)]_i as a function of y_i]
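A small code sketch of the soft-thresholding operator, assuming NumPy (the zero-subgradient check at the end is my own illustration, not from the slides):

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding operator S_lambda(y)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Check that x = S_lambda(y) admits a zero subgradient of
# f(x) = 0.5*||y - x||^2 + lam*||x||_1, i.e. 0 = x - y + lam*s for some valid s:
y, lam = np.array([2.0, 0.3, -1.5]), 1.0
x = soft_threshold(y, lam)
s = np.where(x != 0, np.sign(x), np.clip((y - x) / lam, -1, 1))  # each s_i in [-1, 1]
assert np.allclose(x - y + lam * s, 0.0)
```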
Subgradient method

Given convex f : R^n → R, not necessarily differentiable

Subgradient method: just like gradient descent, but replacing gradients with subgradients. I.e., initialize x^(0), then repeat

    x^(k) = x^(k−1) − t_k · g^(k−1),   k = 1, 2, 3, ...,

where g^(k−1) is any subgradient of f at x^(k−1)

The subgradient method is not necessarily a descent method, so we keep track of the best iterate x^(k)_best among x^(1), ..., x^(k) so far, i.e.,

    f(x^(k)_best) = min_{i=1,...,k} f(x^(i))
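A minimal sketch of the method in Python (assuming NumPy; the function names, the step-size argument, and the ℓ1 example at the end are my own choices, not from the slides):

```python
import numpy as np

def subgradient_method(subgrad, f, x0, step, iters=1000):
    """Run x^(k) = x^(k-1) - t_k * g^(k-1) and track the best iterate.

    subgrad(x): returns any subgradient of f at x
    f(x):       objective value (only used to track the best iterate)
    step(k):    step size t_k (pre-specified, not adaptively computed)
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = x - step(k) * subgrad(x)
        if f(x) < f_best:                    # not a descent method, so keep the best
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example: minimize f(x) = ||x - y||_1, whose minimizer is y itself
y = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.abs(x - y))
subgrad = lambda x: np.sign(x - y)           # one valid subgradient at each x
x_best, f_best = subgradient_method(subgrad, f, np.zeros(3), step=lambda k: 1.0 / k)
```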
Step size choices

• Fixed step size: t_k = t for all k = 1, 2, 3, ...
• Diminishing step size: choose t_k to satisfy

      Σ_{k=1}^∞ t_k² < ∞,   Σ_{k=1}^∞ t_k = ∞,

  i.e., square summable but not summable. It is important that step sizes go to zero, but not too fast

There are other options too, but an important difference to gradient descent: all step size options are pre-specified, not adaptively computed
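For concreteness, the two schedules might look like this in code (the particular constants are my own illustration); either can be passed as the step argument of the subgradient_method sketch above:

```python
fixed_step = lambda k: 0.01            # t_k = t for all k
diminishing_step = lambda k: 1.0 / k   # sum of t_k^2 finite, sum of t_k infinite

# e.g. x_best, f_best = subgradient_method(subgrad, f, x0, step=diminishing_step)
```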
Convergence analysis

Assume that f : R^n → R is convex, and also:
• f is Lipschitz continuous with constant G > 0:

      |f(x) − f(y)| ≤ G‖x − y‖   for all x, y

  Equivalently: ‖g‖ ≤ G for any subgradient of f at any x
• ‖x^(1) − x⋆‖ ≤ R (i.e., the distance from the initial point to an optimum is bounded)

Theorem: For a fixed step size t, the subgradient method satisfies

    lim_{k→∞} f(x^(k)_best) ≤ f(x⋆) + G²t/2

Theorem: For diminishing step sizes, the subgradient method satisfies

    lim_{k→∞} f(x^(k)_best) = f(x⋆)
Basic inequality

Can prove both results from the same basic inequality. Key steps:
• Using the definition of a subgradient,

      ‖x^(k+1) − x⋆‖² ≤ ‖x^(k) − x⋆‖² − 2t_k (f(x^(k)) − f(x⋆)) + t_k² ‖g^(k)‖²

• Iterating the last inequality,

      ‖x^(k+1) − x⋆‖² ≤ ‖x^(1) − x⋆‖² − 2 Σ_{i=1}^k t_i (f(x^(i)) − f(x⋆)) + Σ_{i=1}^k t_i² ‖g^(i)‖²
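Filling in the algebra behind the first bullet (a sketch; this is the standard expansion of the update):

```latex
\[
\begin{aligned}
\|x^{(k+1)} - x^\star\|^2
  &= \|x^{(k)} - t_k g^{(k)} - x^\star\|^2 \\
  &= \|x^{(k)} - x^\star\|^2 - 2 t_k\, g^{(k)T}(x^{(k)} - x^\star) + t_k^2 \|g^{(k)}\|^2 \\
  &\le \|x^{(k)} - x^\star\|^2 - 2 t_k \big( f(x^{(k)}) - f(x^\star) \big) + t_k^2 \|g^{(k)}\|^2,
\end{aligned}
\]
where the last step uses the subgradient inequality
$f(x^\star) \ge f(x^{(k)}) + g^{(k)T}(x^\star - x^{(k)})$, i.e.
$g^{(k)T}(x^{(k)} - x^\star) \ge f(x^{(k)}) - f(x^\star)$.
```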
• Using ‖x^(k+1) − x⋆‖² ≥ 0 and ‖x^(1) − x⋆‖ ≤ R,

      2 Σ_{i=1}^k t_i (f(x^(i)) − f(x⋆)) ≤ R² + Σ_{i=1}^k t_i² ‖g^(i)‖²

• Introducing f(x^(k)_best),

      2 Σ_{i=1}^k t_i (f(x^(i)) − f(x⋆)) ≥ 2 (f(x^(k)_best) − f(x⋆)) Σ_{i=1}^k t_i

• Plugging this in and using ‖g^(i)‖ ≤ G,

      f(x^(k)_best) − f(x⋆) ≤ (R² + G² Σ_{i=1}^k t_i²) / (2 Σ_{i=1}^k t_i)
Convergence proofs

For a constant step size t, the basic bound is

    (R² + G²t²k) / (2tk) → G²t/2   as k → ∞

For diminishing step sizes t_k,

    Σ_{i=1}^∞ t_i² < ∞,   Σ_{i=1}^∞ t_i = ∞,

we get

    (R² + G² Σ_{i=1}^k t_i²) / (2 Σ_{i=1}^k t_i) → 0   as k → ∞
Convergence rate

After k iterations, what is the complexity of the error f(x^(k)_best) − f(x⋆)?

Consider taking t_i = R/(G√k) for all i = 1, ..., k. Then the basic bound is

    (R² + G² Σ_{i=1}^k t_i²) / (2 Σ_{i=1}^k t_i) = RG/√k

Can show this choice is the best we can do (i.e., it minimizes the bound)

I.e., the subgradient method has convergence rate O(1/√k)

I.e., to get f(x^(k)_best) − f(x⋆) ≤ ε, we need O(1/ε²) iterations
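A sketch of why this step size is the best constant choice: with t_i = t for all i, the basic bound is a function of t that can be minimized in closed form,

```latex
\[
h(t) = \frac{R^2 + G^2 t^2 k}{2tk} = \frac{R^2}{2tk} + \frac{G^2 t}{2},
\qquad
h'(t) = -\frac{R^2}{2t^2 k} + \frac{G^2}{2} = 0
\;\Longrightarrow\;
t = \frac{R}{G\sqrt{k}},
\]
\[
h\!\left(\frac{R}{G\sqrt{k}}\right)
= \frac{RG}{2\sqrt{k}} + \frac{RG}{2\sqrt{k}}
= \frac{RG}{\sqrt{k}}.
\]
```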
Intersection of sets

Example from Boyd's lecture notes: suppose we want to find

    x⋆ ∈ C_1 ∩ ... ∩ C_m,

i.e., find a point in the intersection of closed, convex sets C_1, ..., C_m

First define f(x) = max_{i=1,...,m} dist(x, C_i), and now solve

    min_{x ∈ R^n} f(x)

Note that f(x⋆) = 0 ⇒ x⋆ ∈ C_1 ∩ ... ∩ C_m

Recall the distance to a set C: dist(x, C) = min{‖x − u‖ : u ∈ C}
For closed, convex C, there is a unique point minimizing ‖x − u‖ over u ∈ C, denoted u⋆ = P_C(x), so dist(x, C) = ‖x − P_C(x)‖

[Figure: a point x and its projection P_C(x) onto a convex set C]

Let f_i(x) = dist(x, C_i) for each i. Then f(x) = max_{i=1,...,m} f_i(x), and
• For each i, and x ∉ C_i,

      ∇f_i(x) = (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖

• If f(x) = f_i(x) ≠ 0, then

      (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖ ∈ ∂f(x)
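A minimal end-to-end sketch of this setup in Python for two sets with closed-form projections, a Euclidean ball and a halfspace (the specific sets, the diminishing step size, and all names are my own illustration, assuming NumPy):

```python
import numpy as np

# C1 = {x : ||x|| <= 1},  C2 = {x : a^T x >= b}  (their intersection is nonempty)
a, b = np.array([1.0, 1.0]), 1.2

def proj_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def proj_halfspace(x):
    slack = a @ x - b
    return x if slack >= 0 else x - slack * a / (a @ a)

projections = [proj_ball, proj_halfspace]

def f(x):
    """f(x) = max_i dist(x, C_i)."""
    return max(np.linalg.norm(x - P(x)) for P in projections)

def subgrad(x):
    """(x - P_Ci(x)) / ||x - P_Ci(x)|| for an i achieving the max; 0 if x is in every set."""
    dists = [np.linalg.norm(x - P(x)) for P in projections]
    i = int(np.argmax(dists))
    if dists[i] == 0:
        return np.zeros_like(x)          # x is already in the intersection
    return (x - projections[i](x)) / dists[i]

# Subgradient method with a diminishing step size t_k = 1/k
x = np.array([3.0, -2.0])
for k in range(1, 2001):
    x = x - (1.0 / k) * subgrad(x)
# f(x) should now be close to 0, i.e. x is (nearly) in the intersection
```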