Introduction to Convex Optimization for Machine Learning
John Duchi, University of California, Berkeley
Practical Machine Learning, Fall 2009
Outline
◮ What is Optimization
◮ Convex Sets
◮ Convex Functions
◮ Convex Optimization Problems
◮ Lagrange Duality
◮ Optimization Algorithms
◮ Take Home Messages
What is Optimization (and why do we care?)
What is Optimization?
◮ Finding the minimizer of a function subject to constraints:
    minimize_x f_0(x)
    s.t. f_i(x) ≤ 0, i = 1, ..., k
         h_j(x) = 0, j = 1, ..., l
◮ Example: Stock market. “Minimize variance of return subject to getting at least $50.”
Why do we care?
Optimization is at the heart of many (most practical?) machine learning algorithms.
◮ Linear regression:
    minimize_w ‖Xw − y‖²
◮ Classification (logistic regression or SVM):
    minimize_w Σ_{i=1}^n log(1 + exp(−y_i x_iᵀw))
  or
    minimize_w ‖w‖² + C Σ_{i=1}^n ξ_i  s.t. ξ_i ≥ 1 − y_i x_iᵀw, ξ_i ≥ 0.
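As a minimal sketch of the first problem (NumPy; the data and variable names are illustrative, not from the slides), the linear-regression objective has a closed-form minimizer via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless targets, so the minimizer is exactly w_true

# minimize_w ||Xw - y||^2: set the gradient to zero, giving X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With noisy targets the same solve gives the least-squares fit rather than the exact weights.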
We still care...
◮ Maximum likelihood estimation:
    maximize_θ Σ_{i=1}^n log p_θ(x_i)
◮ Collaborative filtering:
    minimize_w Σ_{i≺j} log(1 + exp(wᵀx_i − wᵀx_j))
◮ k-means:
    minimize_{μ_1,...,μ_k} J(μ) = Σ_{j=1}^k Σ_{i∈C_j} ‖x_i − μ_j‖²
◮ And more (graphical models, feature selection, active learning, control)
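As an illustrative sketch (NumPy, synthetic well-separated data; all names are ours), the k-means objective J(μ) can be evaluated directly, and one Lloyd update, recomputing each μ_j as the mean of its cluster, never increases it:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two tight clusters, around 0 and around 5
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])

def kmeans_objective(X, mu, assign):
    # J(mu) = sum_j sum_{i in C_j} ||x_i - mu_j||^2
    return sum(np.sum((X[assign == j] - mu[j]) ** 2) for j in range(len(mu)))

mu = X[[0, -1]].copy()                         # initial centers, one per cluster
dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
assign = dists.argmin(1)                       # nearest-center assignment
J_before = kmeans_objective(X, mu, assign)

mu_new = np.array([X[assign == j].mean(0) for j in range(2)])
J_after = kmeans_objective(X, mu_new, assign)  # J_after <= J_before
```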
But generally speaking... We’re screwed.
◮ Local (non-global) minima of f_0
◮ All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0
[Figure: surface plot of a highly non-convex objective]
◮ Go for convex problems!
Convex Sets
Definition
A set C ⊆ ℝⁿ is convex if for any x, y ∈ C and any α ∈ [0, 1],
    αx + (1 − α)y ∈ C.
Examples
◮ All of ℝⁿ (obvious)
◮ Non-negative orthant ℝⁿ₊: let x ⪰ 0, y ⪰ 0; clearly αx + (1 − α)y ⪰ 0.
◮ Norm balls: let ‖x‖ ≤ 1, ‖y‖ ≤ 1; then
    ‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α‖x‖ + (1 − α)‖y‖ ≤ 1.
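The norm-ball argument can be checked numerically; a small sketch (NumPy, illustrative points):

```python
import numpy as np

rng = np.random.default_rng(2)

# Project a point into the unit l2 ball.
def into_ball(v):
    return v / max(1.0, np.linalg.norm(v))

x = into_ball(rng.normal(size=5))
y = into_ball(rng.normal(size=5))

# Every convex combination stays inside the ball (triangle inequality).
ok = all(np.linalg.norm(a * x + (1 - a) * y) <= 1 + 1e-12
         for a in np.linspace(0, 1, 11))
```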
Examples
◮ Affine subspaces: if Ax = b and Ay = b, then
    A(αx + (1 − α)y) = αAx + (1 − α)Ay = αb + (1 − α)b = b.
[Figure: an affine plane in ℝ³]
More examples
◮ Arbitrary intersections of convex sets: let C_i be convex for i ∈ I and C = ∩_i C_i; then
    x ∈ C, y ∈ C ⇒ αx + (1 − α)y ∈ C_i for all i ∈ I,
so αx + (1 − α)y ∈ C.
More examples
◮ PSD matrices, a.k.a. the positive semidefinite cone Sⁿ₊ ⊂ ℝⁿˣⁿ. A ∈ Sⁿ₊ means xᵀAx ≥ 0 for all x ∈ ℝⁿ. For A, B ∈ Sⁿ₊,
    xᵀ(αA + (1 − α)B)x = αxᵀAx + (1 − α)xᵀBx ≥ 0.
◮ [Figure] S²₊ = { (x, y, z) : [x z; z y] ⪰ 0 } = { (x, y, z) : x ≥ 0, y ≥ 0, xy ≥ z² }
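A quick numerical check of the PSD-cone argument (NumPy sketch; `random_psd` is our own helper): a convex combination of PSD matrices has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_psd(n):
    # M M^T is positive semidefinite for any real M.
    M = rng.normal(size=(n, n))
    return M @ M.T

A, B = random_psd(4), random_psd(4)
alpha = 0.3
C = alpha * A + (1 - alpha) * B

# C is PSD iff its smallest eigenvalue is >= 0 (up to rounding).
min_eig = np.linalg.eigvalsh(C).min()
```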
Convex Functions
Definition
A function f : ℝⁿ → ℝ is convex if for x, y ∈ dom f and any α ∈ [0, 1],
    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).
[Figure: the chord αf(x) + (1 − α)f(y) lies above the graph of f]
First-order convexity conditions
Theorem. Suppose f : ℝⁿ → ℝ is differentiable. Then f is convex if and only if for all x, y ∈ dom f,
    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).
[Figure: the tangent f(x) + ∇f(x)ᵀ(y − x) lies below f]
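For a concrete instance of the theorem, take f(x) = ‖x‖² with ∇f(x) = 2x, and check the inequality on random pairs (NumPy sketch, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)

f = lambda x: float(x @ x)   # f(x) = ||x||^2, convex
grad = lambda x: 2 * x       # its gradient

# First-order condition: f(y) >= f(x) + grad f(x)^T (y - x) for all x, y.
# Here f(y) - f(x) - 2x^T(y - x) = ||y - x||^2 >= 0, so it always holds.
holds = all(
    f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
    for x, y in ((rng.normal(size=3), rng.normal(size=3)) for _ in range(100))
)
```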
Actually, more general than that
Definition. The subgradient set, or subdifferential set, ∂f(x) of f at x is
    ∂f(x) = { g : f(y) ≥ f(x) + gᵀ(y − x) for all y }.
Theorem. f : ℝⁿ → ℝ is convex if and only if it has a non-empty subdifferential set everywhere.
[Figure: a supporting line f(x) + gᵀ(y − x) at (x, f(x))]
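The classic example is f(x) = |x|, which is not differentiable at 0 but has subdifferential ∂f(0) = [−1, 1]: every g in that interval gives a line g·y lying below |y| everywhere. A small sketch (our own helper names):

```python
# Check the subgradient inequality f(y) >= f(x) + g*(y - x) at sample points.
def is_subgradient(g, x, f, test_points):
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in test_points)

pts = [-2.0, -0.5, 0.0, 0.3, 1.7]

# Every g in [-1, 1] is a subgradient of abs at 0...
ok_zero = all(is_subgradient(g, 0.0, abs, pts) for g in (-1.0, -0.5, 0.0, 0.5, 1.0))

# ...but g = 1.5 is not (the line 1.5*y crosses above |y| for y > 0).
bad = is_subgradient(1.5, 0.0, abs, pts)
```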
Second-order convexity conditions
Theorem. Suppose f : ℝⁿ → ℝ is twice differentiable. Then f is convex if and only if for all x ∈ dom f,
    ∇²f(x) ⪰ 0.
[Figure: a convex bowl-shaped surface]
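In one dimension the condition is just f″(x) ≥ 0. A sketch using f(x) = log(1 + eˣ), whose second derivative eˣ/(1 + eˣ)² is strictly positive (finite differences are our own illustrative device, not from the slides):

```python
import math

# Central finite-difference estimate of the second derivative.
def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

f = lambda x: math.log1p(math.exp(x))   # log(1 + e^x), convex

# Estimated curvature is positive at every sample point.
curvatures = [second_derivative(f, x) for x in (-3.0, -1.0, 0.0, 1.0, 3.0)]
```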
Convex sets and convex functions
Definition. The epigraph of a function f is the set of points
    epi f = { (x, t) : f(x) ≤ t }.
◮ epi f is convex if and only if f is convex.
◮ Sublevel sets { x : f(x) ≤ a } are convex for convex f.
Examples
◮ Linear/affine functions: f(x) = bᵀx + c.
◮ Quadratic functions: f(x) = ½xᵀAx + bᵀx + c for A ⪰ 0. For regression:
    ½‖Xw − y‖² = ½wᵀXᵀXw − yᵀXw + ½yᵀy.
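The regression expansion can be verified term by term on random data; a NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)

# (1/2)||Xw - y||^2 expanded as a quadratic in w:
lhs = 0.5 * np.linalg.norm(X @ w - y) ** 2
rhs = 0.5 * w @ X.T @ X @ w - y @ X @ w + 0.5 * y @ y
```

Here A = XᵀX, which is always PSD, so the objective is convex in w.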
More examples
◮ Norms (like ℓ₁ or ℓ₂ for regularization):
    ‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α‖x‖ + (1 − α)‖y‖.
◮ Composition with an affine function, f(Ax + b):
    f(A(αx + (1 − α)y) + b) = f(α(Ax + b) + (1 − α)(Ay + b)) ≤ αf(Ax + b) + (1 − α)f(Ay + b)
◮ Log-sum-exp (via ∇²f(x) PSD):
    f(x) = log( Σ_{i=1}^n exp(x_i) )
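A direct Jensen-style check of log-sum-exp convexity on random inputs (NumPy sketch; the stable `logsumexp` helper is our own):

```python
import numpy as np

rng = np.random.default_rng(6)

def logsumexp(x):
    # Numerically stable log(sum_i exp(x_i)): shift by the max first.
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) for random pairs and weights.
holds = all(
    logsumexp(a * x + (1 - a) * y)
    <= a * logsumexp(x) + (1 - a) * logsumexp(y) + 1e-12
    for x, y, a in ((rng.normal(size=4), rng.normal(size=4), rng.uniform())
                    for _ in range(200))
)
```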
Important examples in Machine Learning
◮ SVM loss: f(w) = [1 − y_i x_iᵀw]₊
◮ Binary logistic loss: f(w) = log(1 + exp(−y_i x_iᵀw))
[Figure: the hinge loss [1 − x]₊ and the logistic loss as functions of the margin]
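Both losses are functions of the margin m = y_i x_iᵀw and are convex surrogates for the 0-1 loss; a small sketch (our own helper names), evaluating each at a few margins:

```python
import math

def hinge(m):
    # SVM (hinge) loss [1 - m]_+
    return max(0.0, 1.0 - m)

def logistic(m):
    # Binary logistic loss log(1 + e^{-m})
    return math.log1p(math.exp(-m))

margins = [-2.0, 0.0, 2.0]
hinge_vals = [hinge(m) for m in margins]       # decreasing, zero once m >= 1
logistic_vals = [logistic(m) for m in margins] # smooth, strictly positive
```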