Proof
Assume $\vec{g}$ is a subgradient of $f$ at $\vec{x}$. For any $\alpha \geq 0$,
$$f(\vec{x} + \alpha \vec{e}_i) \geq f(\vec{x}) + \alpha\, \vec{g}^T \vec{e}_i = f(\vec{x}) + \vec{g}[i]\,\alpha,$$
$$f(\vec{x} - \alpha \vec{e}_i) \geq f(\vec{x}) - \alpha\, \vec{g}^T \vec{e}_i = f(\vec{x}) - \vec{g}[i]\,\alpha.$$
Combining both inequalities,
$$\frac{f(\vec{x}) - f(\vec{x} - \alpha \vec{e}_i)}{\alpha} \leq \vec{g}[i] \leq \frac{f(\vec{x} + \alpha \vec{e}_i) - f(\vec{x})}{\alpha}.$$
Letting $\alpha \to 0$ implies $\vec{g}[i] = \frac{\partial f(\vec{x})}{\partial \vec{x}[i]}$.
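As a sanity check, here is a minimal numerical sketch of the sandwich inequality above for a smooth convex function; the matrix `A`, vector `b`, step `alpha`, and tolerances are arbitrary assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

f = lambda x: np.sum((A @ x - b) ** 2)          # smooth convex function
grad = lambda x: 2 * A.T @ (A @ x - b)          # its gradient (the only subgradient)

x = rng.standard_normal(3)
g = grad(x)
alpha = 1e-6
for i in range(3):
    e = np.zeros(3)
    e[i] = 1.0
    lower = (f(x) - f(x - alpha * e)) / alpha   # left end of the sandwich
    upper = (f(x + alpha * e) - f(x)) / alpha   # right end of the sandwich
    assert lower - 1e-6 <= g[i] <= upper + 1e-6
```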
Subgradient
A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if and only if it has a subgradient at every point $\vec{x} \in \mathbb{R}^n$.
It is strictly convex if and only if for all $\vec{x} \in \mathbb{R}^n$ there exists $\vec{g} \in \mathbb{R}^n$ such that
$$f(\vec{y}) > f(\vec{x}) + \vec{g}^T(\vec{y} - \vec{x}) \quad \text{for all } \vec{y} \neq \vec{x}.$$
Optimality condition for nondifferentiable functions
If $\vec{0}$ is a subgradient of $f$ at $\vec{x}$, then
$$f(\vec{y}) \geq f(\vec{x}) + \vec{0}^T(\vec{y} - \vec{x}) = f(\vec{x}) \quad \text{for all } \vec{y} \in \mathbb{R}^n,$$
so $\vec{x}$ is a minimizer of $f$. Under strict convexity the minimum is unique.
Sum of subgradients
Let $\vec{g}_1$ and $\vec{g}_2$ be subgradients at $\vec{x} \in \mathbb{R}^n$ of $f_1: \mathbb{R}^n \to \mathbb{R}$ and $f_2: \mathbb{R}^n \to \mathbb{R}$.
Then $\vec{g} := \vec{g}_1 + \vec{g}_2$ is a subgradient of $f := f_1 + f_2$ at $\vec{x}$.

Proof: For any $\vec{y} \in \mathbb{R}^n$,
$$f(\vec{y}) = f_1(\vec{y}) + f_2(\vec{y}) \geq f_1(\vec{x}) + \vec{g}_1^T(\vec{y} - \vec{x}) + f_2(\vec{x}) + \vec{g}_2^T(\vec{y} - \vec{x}) = f(\vec{x}) + \vec{g}^T(\vec{y} - \vec{x}).$$
Subgradient of scaled function
Let $\vec{g}_1$ be a subgradient at $\vec{x} \in \mathbb{R}^n$ of $f_1: \mathbb{R}^n \to \mathbb{R}$. For any $\eta \geq 0$,
$\vec{g}_2 := \eta\, \vec{g}_1$ is a subgradient of $f_2 := \eta f_1$ at $\vec{x}$.

Proof: For any $\vec{y} \in \mathbb{R}^n$,
$$f_2(\vec{y}) = \eta f_1(\vec{y}) \geq \eta \left( f_1(\vec{x}) + \vec{g}_1^T(\vec{y} - \vec{x}) \right) = f_2(\vec{x}) + \vec{g}_2^T(\vec{y} - \vec{x}).$$
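Both calculus rules can be checked numerically. The sketch below uses an assumed toy function $h = f_1 + \eta f_2$ with $f_1$ the $\ell_1$ norm (whose subgradient $\operatorname{sign}(\vec{x})$ is justified on the following slides) and $f_2$ the squared $\ell_2$ norm, combines their subgradients via the sum and scaling rules, and verifies the subgradient inequality at random points.

```python
import numpy as np

rng = np.random.default_rng(1)
eta = 0.5

f1 = lambda x: np.sum(np.abs(x))     # l1 norm; sign(x) is a subgradient (next slides)
f2 = lambda x: np.sum(x ** 2)        # smooth, gradient 2x
h = lambda x: f1(x) + eta * f2(x)

x = rng.standard_normal(4)
g = np.sign(x) + eta * 2 * x         # sum rule + scaling rule

for _ in range(1000):
    y = rng.standard_normal(4)
    # subgradient inequality for h at x, up to floating-point slack
    assert h(y) >= h(x) + g @ (y - x) - 1e-9
```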
Subdifferential of absolute value
Consider $f(x) = |x|$. At $x \neq 0$, $f$ is differentiable, so the only subgradient is $g = \operatorname{sign}(x)$.
At $x = 0$, we need
$$f(0 + y) \geq f(0) + g\,(y - 0), \quad \text{i.e.} \quad |y| \geq g\, y \text{ for all } y,$$
which holds if and only if $|g| \leq 1$.
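A short numerical illustration of this characterization, on an assumed grid of test points: every slope with $|g| \leq 1$ satisfies $|y| \geq g\,y$ everywhere, while any $|g| > 1$ violates it somewhere.

```python
import numpy as np

ys = np.linspace(-1.0, 1.0, 201)        # grid of test points y
for g in (-1.0, -0.3, 0.0, 0.7, 1.0):   # slopes with |g| <= 1 are subgradients at 0
    assert np.all(np.abs(ys) >= g * ys)
for g in (1.5, -2.0):                   # slopes with |g| > 1 fail somewhere
    assert np.any(np.abs(ys) < g * ys)
```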
Subdifferential of $\ell_1$ norm
$\vec{g}$ is a subgradient of the $\ell_1$ norm at $\vec{x} \in \mathbb{R}^n$ if and only if
$$\vec{g}[i] = \operatorname{sign}(\vec{x}[i]) \text{ if } \vec{x}[i] \neq 0, \qquad |\vec{g}[i]| \leq 1 \text{ if } \vec{x}[i] = 0.$$
Proof
$\vec{g}$ is a subgradient of $\|\cdot\|_1$ at $\vec{x}$ if and only if $\vec{g}[i]$ is a subgradient of $|\cdot|$ at $\vec{x}[i]$ for all $1 \leq i \leq n$.
First, if $\vec{g}$ is a subgradient of $\|\cdot\|_1$ at $\vec{x}$, then for any $y \in \mathbb{R}$,
$$|y| = |\vec{x}[i]| + \|\vec{x} + (y - \vec{x}[i])\,\vec{e}_i\|_1 - \|\vec{x}\|_1 \geq |\vec{x}[i]| + \|\vec{x}\|_1 + \vec{g}^T (y - \vec{x}[i])\,\vec{e}_i - \|\vec{x}\|_1 = |\vec{x}[i]| + \vec{g}[i]\,(y - \vec{x}[i]),$$
so $\vec{g}[i]$ is a subgradient of $|\cdot|$ at $\vec{x}[i]$ for all $1 \leq i \leq n$.
Conversely, if $\vec{g}[i]$ is a subgradient of $|\cdot|$ at $\vec{x}[i]$ for $1 \leq i \leq n$, then for any $\vec{y} \in \mathbb{R}^n$,
$$\|\vec{y}\|_1 = \sum_{i=1}^{n} |\vec{y}[i]| \geq \sum_{i=1}^{n} \left( |\vec{x}[i]| + \vec{g}[i]\,(\vec{y}[i] - \vec{x}[i]) \right) = \|\vec{x}\|_1 + \vec{g}^T(\vec{y} - \vec{x}),$$
so $\vec{g}$ is a subgradient of $\|\cdot\|_1$ at $\vec{x}$.
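The characterization can be checked directly. The sketch below uses an assumed toy vector, builds a subgradient of the $\ell_1$ norm from the sign pattern, fills the zero entries with arbitrary values in $[-1, 1]$, and verifies the subgradient inequality at random points.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([1.5, 0.0, -0.2, 0.0])                    # assumed toy vector

g = np.sign(x)                                         # sign(x[i]) on the support
g[x == 0] = rng.uniform(-1, 1, size=np.sum(x == 0))    # any value in [-1, 1] off it

for _ in range(1000):
    y = rng.standard_normal(x.size)
    # ||y||_1 >= ||x||_1 + g^T (y - x)
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-9
```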
Subdifferential of the nuclear norm
Let $X \in \mathbb{R}^{m \times n}$ be a rank-$r$ matrix with SVD $U S V^T$, where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ and $S \in \mathbb{R}^{r \times r}$.
A matrix $G$ is a subgradient of the nuclear norm at $X$ if and only if
$$G := U V^T + W,$$
where $W$ satisfies
$$\|W\| \leq 1, \qquad U^T W = 0, \qquad W V = 0.$$
Proof
By Pythagoras' theorem, for any $\vec{x} \in \mathbb{R}^n$ with unit $\ell_2$ norm we have
$$\left\| P_{\operatorname{row}(X)}\, \vec{x} \right\|_2^2 + \left\| P_{\operatorname{row}(X)^\perp}\, \vec{x} \right\|_2^2 = \|\vec{x}\|_2^2 = 1.$$
The rows of $U V^T$ are in $\operatorname{row}(X)$ and the rows of $W$ are in $\operatorname{row}(X)^\perp$; moreover $U^T W = 0$ makes $U V^T \vec{x}$ and $W \vec{x}$ orthogonal. Therefore
$$\|G\|^2 := \max_{\{\vec{x} \in \mathbb{R}^n \,|\, \|\vec{x}\|_2 = 1\}} \|G \vec{x}\|_2^2
= \max_{\{\vec{x} \,|\, \|\vec{x}\|_2 = 1\}} \left\| U V^T \vec{x} \right\|_2^2 + \left\| W \vec{x} \right\|_2^2$$
$$= \max_{\{\vec{x} \,|\, \|\vec{x}\|_2 = 1\}} \left\| U V^T P_{\operatorname{row}(X)}\, \vec{x} \right\|_2^2 + \left\| W P_{\operatorname{row}(X)^\perp}\, \vec{x} \right\|_2^2
\leq \left\| U V^T \right\|^2 \left\| P_{\operatorname{row}(X)}\, \vec{x} \right\|_2^2 + \|W\|^2 \left\| P_{\operatorname{row}(X)^\perp}\, \vec{x} \right\|_2^2
\leq 1.$$
Hölder's inequality for matrices
For any matrix $A \in \mathbb{R}^{m \times n}$,
$$\|A\|_* = \sup_{\{B \in \mathbb{R}^{m \times n} \,|\, \|B\| \leq 1\}} \langle A, B \rangle.$$
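A quick numerical check of this duality, on an assumed random matrix: matrices rescaled to unit spectral norm never exceed the nuclear norm in the inner product, and $B = U V^T$ from the SVD of $A$ attains it.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
nuc = s.sum()                                   # ||A||_* = sum of singular values

for _ in range(500):
    B = rng.standard_normal((4, 3))
    B /= np.linalg.norm(B, 2)                   # rescale to spectral norm 1
    assert np.sum(A * B) <= nuc + 1e-9          # <A, B> <= ||A||_*

B_star = U @ Vt                                 # attains the supremum
assert np.isclose(np.sum(A * B_star), nuc)
```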
Proof
$U^T W = 0$ implies
$$\langle W, X \rangle = \langle W, U S V^T \rangle = \langle U^T W, S V^T \rangle = 0,$$
and
$$\langle U V^T, X \rangle = \operatorname{tr}\left( V U^T X \right) = \operatorname{tr}\left( V U^T U S V^T \right) = \operatorname{tr}\left( V^T V S \right) = \operatorname{tr}(S) = \|X\|_*.$$
By Hölder's inequality, since $\|G\| \leq 1$, for any matrix $Y \in \mathbb{R}^{m \times n}$
$$\|Y\|_* \geq \langle G, Y \rangle = \langle G, X \rangle + \langle G, Y - X \rangle = \langle U V^T, X \rangle + \langle W, X \rangle + \langle G, Y - X \rangle = \|X\|_* + \langle G, Y - X \rangle,$$
so $G$ is a subgradient of the nuclear norm at $X$.
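The construction and both steps of the proof can be verified numerically. The sketch below uses assumed dimensions and random data: it forms $G = U V^T + W$ with $W$ supported on the orthogonal complements of $\operatorname{col}(X)$ and $\operatorname{row}(X)$ and spectral norm below one, then checks $\|G\| \leq 1$ and the subgradient inequality for the nuclear norm.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 6, 5, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))    # rank-r matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, Vt = U[:, :r], Vt[:r, :]                                      # compact SVD factors

# W with U^T W = 0, W V = 0 and ||W|| < 1: project a random matrix onto the
# orthogonal complements of col(X) and row(X), then rescale.
M = rng.standard_normal((m, n))
W = (np.eye(m) - U @ U.T) @ M @ (np.eye(n) - Vt.T @ Vt)
W *= 0.9 / np.linalg.norm(W, 2)

G = U @ Vt + W
assert np.linalg.norm(G, 2) <= 1 + 1e-9                          # ||G|| <= 1
assert np.allclose(U.T @ W, 0) and np.allclose(W @ Vt.T, 0)

nuc = lambda A: np.linalg.svd(A, compute_uv=False).sum()
for _ in range(200):
    Y = rng.standard_normal((m, n))
    # subgradient inequality: ||Y||_* >= ||X||_* + <G, Y - X>
    assert nuc(Y) >= nuc(X) + np.sum(G * (Y - X)) - 1e-9
```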
Sparse linear regression with 2 features
$$\vec{y} := \alpha\, \vec{x}_1 + \vec{z}, \qquad X := \begin{bmatrix} \vec{x}_1 & \vec{x}_2 \end{bmatrix}, \qquad \|\vec{x}_1\|_2 = 1, \quad \|\vec{x}_2\|_2 = 1, \quad \langle \vec{x}_1, \vec{x}_2 \rangle = \rho.$$
Analysis of lasso estimator
Let $\alpha \geq 0$. Then
$$\vec{\beta}_{\text{lasso}} = \begin{bmatrix} \alpha + \vec{x}_1^T \vec{z} - \lambda \\ 0 \end{bmatrix}
\qquad \text{as long as} \qquad
\frac{\vec{x}_2^T \vec{z} - \rho\, \vec{x}_1^T \vec{z}}{1 - |\rho|} \leq \lambda \leq \alpha + \vec{x}_1^T \vec{z}.$$
Lasso estimator
[Figure: lasso coefficients as a function of the regularization parameter.]
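The closed form above can be reproduced numerically. The sketch below uses assumed toy values throughout: it builds the two-feature design with correlation $\rho$, solves the lasso with a simple proximal-gradient (ISTA) loop, and compares the result with $\beta_1 = \alpha + \vec{x}_1^T \vec{z} - \lambda$, $\beta_2 = 0$; sweeping $\lambda$ over a grid reproduces a coefficient path like the one in the figure.

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha, rho = 50, 1.0, 0.5                      # assumed toy parameters

Q = np.linalg.qr(rng.standard_normal((n, 2)))[0]  # orthonormal pair of columns
x1 = Q[:, 0]
x2 = rho * Q[:, 0] + np.sqrt(1 - rho ** 2) * Q[:, 1]
X = np.column_stack([x1, x2])                     # ||x1|| = ||x2|| = 1, <x1, x2> = rho
z = 0.01 * rng.standard_normal(n)
y = alpha * x1 + z

def lasso_ista(X, y, lam, iters=5000):
    """Minimize 0.5 * ||X b - y||_2^2 + lam * ||b||_1 by proximal gradient."""
    b = np.zeros(X.shape[1])
    t = 1.0 / np.linalg.norm(X, 2) ** 2           # step size 1 / ||X||^2
    for _ in range(iters):
        g = b - t * X.T @ (X @ b - y)             # gradient step on the quadratic
        b = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)   # soft thresholding
    return b

lam = 0.1                                         # should lie in the stated range here
b = lasso_ista(X, y, lam)
print(b, alpha + x1 @ z - lam)                    # b[0] ~ closed form, b[1] ~ 0
```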
Recall the optimality condition for nondifferentiable functions: if $\vec{0}$ is a subgradient of $f$ at $\vec{x}$, then
$$f(\vec{y}) \geq f(\vec{x}) + \vec{0}^T(\vec{y} - \vec{x}) = f(\vec{x}) \quad \text{for all } \vec{y} \in \mathbb{R}^n,$$
and under strict convexity the minimum is unique.
Proof
The cost function is strictly convex if $n \geq 2$ and $\rho \neq 1$.
Aim: show that there is a subgradient equal to $\vec{0}$ at a 1-sparse solution.
The gradient of the quadratic term
$$q(\vec{\beta}) := \frac{1}{2} \left\| X \vec{\beta} - \vec{y} \right\|_2^2$$
at $\vec{\beta}_{\text{lasso}}$ equals
$$\nabla q\left( \vec{\beta}_{\text{lasso}} \right) = X^T \left( X \vec{\beta}_{\text{lasso}} - \vec{y} \right).$$