Direction of maximum variation
∇f(x) is the direction of maximum increase; −∇f(x) is the direction of maximum decrease.
For any unit-norm vector u ∈ R^n,
  f'_u(x) = ∇f(x)^T u
          ≤ ||∇f(x)||_2 ||u||_2   (Cauchy-Schwarz inequality)
          = ||∇f(x)||_2
Equality holds if and only if u = ± ∇f(x) / ||∇f(x)||_2.
Gradient
First-order approximation
The first-order or linear approximation of f : R^n → R at x is
  f^1_x(y) := f(x) + ∇f(x)^T (y − x)
If f is continuously differentiable at x,
  lim_{y → x} ( f(y) − f^1_x(y) ) / ||y − x||_2 = 0
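As a quick numerical illustration (not from the slides), here is a Python/NumPy sketch for a hypothetical example function f(x) = x[0]² + exp(x[1]) with its analytical gradient; the ratio below shrinks as y approaches x, as the definition requires:

import numpy as np

f = lambda x: x[0] ** 2 + np.exp(x[1])
grad_f = lambda x: np.array([2 * x[0], np.exp(x[1])])

x = np.array([1.0, 0.5])
f1 = lambda y: f(x) + grad_f(x) @ (y - x)   # first-order approximation at x

# The approximation error vanishes faster than ||y - x||_2 as y -> x
for eps in [1e-1, 1e-2, 1e-3]:
    y = x + eps * np.array([1.0, -1.0]) / np.sqrt(2)
    print(eps, abs(f(y) - f1(y)) / np.linalg.norm(y - x))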
First-order approximation
[Figure: a function f(y) and its first-order approximation f^1_x(y) at x]
Convexity
A differentiable function f : R^n → R is convex if and only if for every x, y ∈ R^n
  f(y) ≥ f(x) + ∇f(x)^T (y − x)
It is strictly convex if and only if for every y ≠ x
  f(y) > f(x) + ∇f(x)^T (y − x)
Optimality condition
If f is convex and ∇f(x) = 0, then for any y ∈ R^n
  f(y) ≥ f(x)
If f is strictly convex, then for any y ≠ x
  f(y) > f(x)
Epigraph
The epigraph of f : R^n → R is
  epi(f) := { x ∈ R^{n+1} | f( x[1], ..., x[n] ) ≤ x[n+1] }
Epigraph
[Figure: the epigraph epi(f) of a function f]
Supporting hyperplane
A hyperplane H is a supporting hyperplane of a set S at x if
- H and S intersect at x
- S is contained in one of the half-spaces bounded by H
Geometric intuition
Geometrically, f is convex if and only if for every x the hyperplane
  H_{f,x} := { y | y[n+1] = f^1_x( y[1], ..., y[n] ) }
is a supporting hyperplane of the epigraph at x.
If ∇f(x) = 0, the hyperplane is horizontal.
Convexity
[Figure: the first-order approximation f^1_x(y) supporting the epigraph of a convex function f at x]
Hessian matrix
If f has a Hessian matrix at every point, it is twice differentiable:
  ∇²f(x) = [ ∂²f(x)/∂x[1]²       ∂²f(x)/∂x[1]∂x[2]   ···   ∂²f(x)/∂x[1]∂x[n] ]
           [ ∂²f(x)/∂x[1]∂x[2]   ∂²f(x)/∂x[2]²       ···   ∂²f(x)/∂x[2]∂x[n] ]
           [ ···                                                              ]
           [ ∂²f(x)/∂x[1]∂x[n]   ∂²f(x)/∂x[2]∂x[n]   ···   ∂²f(x)/∂x[n]²     ]
Curvature
The second directional derivative f''_u of f at x equals
  f''_u(x) = u^T ∇²f(x) u
for any unit-norm vector u ∈ R^n.
Second-order approximation
The second-order or quadratic approximation of f at x is
  f^2_x(y) := f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(x) (y − x)
Second-order approximation
[Figure: a function f(y) and its second-order approximation f^2_x(y) at x]
Quadratic form
A second-order polynomial in several dimensions:
  q(x) := x^T A x + b^T x + c
parametrized by a symmetric matrix A ∈ R^{n×n}, a vector b ∈ R^n and a constant c.
Quadratic approximation
The quadratic approximation f^2_x : R^n → R at x ∈ R^n of a twice-continuously differentiable function f : R^n → R satisfies
  lim_{y → x} ( f(y) − f^2_x(y) ) / ||y − x||_2² = 0
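Continuing the same hypothetical example function from above, a sketch that checks this property numerically using the analytical Hessian (the function and expansion point are again only for illustration):

import numpy as np

f = lambda x: x[0] ** 2 + np.exp(x[1])
grad_f = lambda x: np.array([2 * x[0], np.exp(x[1])])
hess_f = lambda x: np.array([[2.0, 0.0], [0.0, np.exp(x[1])]])

x = np.array([1.0, 0.5])

def f2(y):
    """Second-order approximation of f at x."""
    d = y - x
    return f(x) + grad_f(x) @ d + 0.5 * d @ hess_f(x) @ d

# The approximation error vanishes faster than ||y - x||_2^2 as y -> x
for eps in [1e-1, 1e-2, 1e-3]:
    y = x + eps * np.array([1.0, 1.0]) / np.sqrt(2)
    print(eps, abs(f(y) - f2(y)) / np.linalg.norm(y - x) ** 2)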
Eigendecomposition of symmetric matrices
Let A = U Λ U^T be the eigendecomposition of a symmetric matrix A
Eigenvalues: λ_1 ≥ ··· ≥ λ_n (which can be negative or 0)
Eigenvectors: u_1, ..., u_n, an orthonormal basis
  λ_1 = max { x^T A x | ||x||_2 = 1, x ∈ R^n }
  u_1 = argmax { x^T A x | ||x||_2 = 1, x ∈ R^n }
  λ_n = min { x^T A x | ||x||_2 = 1, x ∈ R^n }
  u_n = argmin { x^T A x | ||x||_2 = 1, x ∈ R^n }
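A short NumPy illustration of these facts, using numpy.linalg.eigh (which returns eigenvalues in ascending order) on an arbitrary randomly generated symmetric matrix:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # arbitrary symmetric matrix

eigvals, U = np.linalg.eigh(A)         # ascending eigenvalues, orthonormal eigenvectors
lam_n, lam_1 = eigvals[0], eigvals[-1]
u_n, u_1 = U[:, 0], U[:, -1]

# The extreme eigenvalues bound the quadratic form on the unit sphere
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
assert lam_n - 1e-9 <= x @ A @ x <= lam_1 + 1e-9
# The corresponding eigenvectors attain the bounds
assert np.isclose(u_1 @ A @ u_1, lam_1) and np.isclose(u_n @ A @ u_n, lam_n)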
Maximum and minimum curvature
Let ∇²f(x) = U Λ U^T be the eigendecomposition of the Hessian at x
Direction of maximum curvature: u_1
Direction of minimum curvature (or maximum negative curvature): u_n
Positive semidefinite matrices
For any x,
  x^T A x = x^T U Λ U^T x = Σ_{i=1}^n λ_i ⟨u_i, x⟩²
All eigenvalues are nonnegative if and only if
  x^T A x ≥ 0 for all x
In that case the matrix is positive semidefinite.
Positive (negative) (semi)definite matrices
Positive (semi)definite: all eigenvalues are positive (nonnegative), equivalently
  x^T A x > (≥) 0 for all x
Quadratic form: all directions have positive curvature
Negative (semi)definite: all eigenvalues are negative (nonpositive), equivalently
  x^T A x < (≤) 0 for all x
Quadratic form: all directions have negative curvature
Convexity
A twice-differentiable function g : R → R is convex if and only if g''(x) ≥ 0 for all x ∈ R
A twice-differentiable function f : R^n → R is convex if and only if its Hessian is positive semidefinite at every point
If the Hessian is positive definite at every point, the function is strictly convex
Second-order approximation
[Figure: a function f(y) and its second-order approximation f^2_x(y) at x]
[Figure: example functions that are convex, concave, and neither]
Convexity · Differentiable convex functions · Minimizing differentiable convex functions
Problem
Challenge: minimizing differentiable convex functions
  minimize_{x ∈ R^n} f(x)
Gradient descent
Intuition: make local progress in the steepest-descent direction −∇f(x)
Set the initial point x^(0) to an arbitrary value
Update by setting
  x^(k+1) := x^(k) − α_k ∇f( x^(k) )
where α_k > 0 is the step size, until a stopping criterion is met
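A minimal Python/NumPy sketch of this update loop, assuming the gradient grad_f is supplied as a function; the fixed step size and gradient-norm stopping criterion are illustrative choices, not prescribed by the slides:

import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, tol=1e-6, max_iters=10_000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:    # stopping criterion
            break
        x = x - step_size * g
    return x

# Example: minimize f(x) = ||x||_2^2 / 2, whose gradient is x
x_min = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))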
Gradient descent
[Figure: gradient descent iterates on a contour plot of the objective function]
Small step size
[Figure: gradient descent iterates with a small step size]
Large step size
[Figure: gradient descent iterates with a large step size]
Line search
Idea: find the minimum of the objective along the descent direction
  α_k := argmin_{α ∈ R} h(α) = argmin_{α ∈ R} f( x^(k) − α ∇f( x^(k) ) )
Backtracking line search with Armijo rule
Given α_0 ≥ 0 and β, η ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that
  x^(k+1) := x^(k) − α_k ∇f( x^(k) )
satisfies
  f( x^(k+1) ) ≤ f( x^(k) ) − (1/2) α_k || ∇f( x^(k) ) ||_2²
a condition known as the Armijo rule
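A sketch of one backtracking step in Python, assuming f and grad_f are given; the factor 1/2 in the sufficient-decrease test follows the condition above, while alpha0, beta and the cap on halvings are illustrative defaults:

import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, beta=0.5, max_halvings=50):
    """Shrink the step size alpha0 * beta**i until the Armijo condition holds."""
    g = grad_f(x)
    g_norm_sq = np.dot(g, g)
    alpha = alpha0
    for _ in range(max_halvings):
        x_new = x - alpha * g
        if f(x_new) <= f(x) - 0.5 * alpha * g_norm_sq:   # Armijo rule
            return x_new, alpha
        alpha *= beta
    return x_new, alpha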
Backtracking line search with Armijo rule
[Figure: gradient descent iterates with backtracking line search on a contour plot]
Gradient descent for least squares
Aim: use n examples (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n)) to fit a linear model by minimizing the least-squares cost function
  minimize_{β ∈ R^p} || y − X β ||_2²
Gradient descent for least squares
The gradient of the quadratic function
  f(β) := || y − X β ||_2² = β^T X^T X β − 2 β^T X^T y + y^T y
equals
  ∇f(β) = 2 X^T X β − 2 X^T y
Gradient descent updates are
  β^(k+1) = β^(k) + 2 α_k X^T ( y − X β^(k) )
          = β^(k) + 2 α_k Σ_{i=1}^n ( y^(i) − ⟨ x^(i), β^(k) ⟩ ) x^(i)
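A minimal NumPy sketch of these updates, assuming the rows of X hold the examples x^(i); the synthetic data and fixed step size are illustrative (in practice the step would come from line search or from the Lipschitz constant of the gradient, 2 ||X^T X||_2):

import numpy as np

def least_squares_gd(X, y, step_size, n_iters=1000):
    """Gradient descent on f(beta) = ||y - X beta||_2^2."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ beta                     # y - X beta^(k)
        beta = beta + 2 * step_size * X.T @ residual
    return beta

# Synthetic example; step size 1/L with L = 2 ||X^T X||_2
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
step = 1.0 / (2 * np.linalg.norm(X.T @ X, 2))
beta_hat = least_squares_gd(X, y, step)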
Gradient ascent for logistic regression
Aim: use n examples (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n)) to fit a logistic-regression model by maximizing the log-likelihood
  f(β) := Σ_{i=1}^n y^(i) log g( ⟨ x^(i), β ⟩ ) + ( 1 − y^(i) ) log( 1 − g( ⟨ x^(i), β ⟩ ) )
where
  g(t) = 1 / ( 1 + exp(−t) )
Gradient ascent for logistic regression
Since
  g'(t) = g(t) ( 1 − g(t) ),   ( 1 − g(t) )' = −g(t) ( 1 − g(t) )
the gradient of the log-likelihood equals
  ∇f(β) = Σ_{i=1}^n y^(i) ( 1 − g( ⟨ x^(i), β ⟩ ) ) x^(i) − ( 1 − y^(i) ) g( ⟨ x^(i), β ⟩ ) x^(i)
The gradient ascent updates are
  β^(k+1) := β^(k) + α_k Σ_{i=1}^n [ y^(i) ( 1 − g( ⟨ x^(i), β^(k) ⟩ ) ) − ( 1 − y^(i) ) g( ⟨ x^(i), β^(k) ⟩ ) ] x^(i)
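A minimal NumPy sketch of this gradient ascent, using the algebraic simplification y^(i)(1 − g) − (1 − y^(i)) g = y^(i) − g(⟨x^(i), β⟩); the fixed step size, iteration count and synthetic data are illustrative choices:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_ga(X, y, step_size=0.01, n_iters=1000):
    """Gradient ascent on the logistic-regression log-likelihood.

    The gradient simplifies to sum_i (y_i - g(<x_i, beta>)) x_i.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ beta))
        beta = beta + step_size * grad
    return beta

# Synthetic example with labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = (sigmoid(X @ np.array([1.5, -1.0, 0.5])) > rng.random(200)).astype(float)
beta_hat = logistic_regression_ga(X, y)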
Convergence of gradient descent
Does the method converge? How fast (or slow)? For what step sizes?
The answers depend on the function.
Lipschitz continuity
A function f : R^n → R^m is Lipschitz continuous if for any x, y ∈ R^n
  || f(y) − f(x) ||_2 ≤ L || y − x ||_2
L is the Lipschitz constant.
Lipschitz-continuous gradients
If ∇f is Lipschitz continuous with Lipschitz constant L,
  || ∇f(y) − ∇f(x) ||_2 ≤ L || y − x ||_2
then for any x, y ∈ R^n we have a quadratic upper bound
  f(y) ≤ f(x) + ∇f(x)^T ( y − x ) + (L/2) || y − x ||_2²
Local progress of gradient descent
  x^(k+1) := x^(k) − α_k ∇f( x^(k) )
By the quadratic upper bound,
  f( x^(k+1) ) ≤ f( x^(k) ) + ∇f( x^(k) )^T ( x^(k+1) − x^(k) ) + (L/2) || x^(k+1) − x^(k) ||_2²
              = f( x^(k) ) − α_k ( 1 − α_k L / 2 ) || ∇f( x^(k) ) ||_2²
If α_k ≤ 1/L,
  f( x^(k+1) ) ≤ f( x^(k) ) − (α_k / 2) || ∇f( x^(k) ) ||_2²
Convergence of gradient descent
If
- f is convex
- ∇f is L-Lipschitz continuous
- there exists a point x* at which f achieves a finite minimum
- the step size is set to α_k := α ≤ 1/L
then
  f( x^(k) ) − f( x* ) ≤ || x^(0) − x* ||_2² / ( 2 α k )
Convergence of gradient descent
For α_k ≤ 1/L,
  f( x^(k) ) ≤ f( x^(k−1) ) − (α_k / 2) || ∇f( x^(k−1) ) ||_2²
By the first-order characterization of convexity,
  f( x^(k−1) ) + ∇f( x^(k−1) )^T ( x* − x^(k−1) ) ≤ f( x* )
Combining the two inequalities bounds f( x^(k) ) − f( x* ):
  f( x^(k) ) − f( x* ) ≤ ∇f( x^(k−1) )^T ( x^(k−1) − x* ) − (α_k / 2) || ∇f( x^(k−1) ) ||_2²
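A small numerical check of the convergence bound on a least-squares objective, building on the earlier sketches; the problem sizes and random data are illustrative, and the closed-form minimizer from numpy.linalg.lstsq stands in for x*:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)

f = lambda b: np.sum((y - X @ b) ** 2)
grad_f = lambda b: 2 * X.T @ (X @ b - y)

L = 2 * np.linalg.norm(X.T @ X, 2)               # Lipschitz constant of the gradient
alpha = 1.0 / L
b_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizer x*

b = np.zeros(5)                                  # x^(0)
for k in range(1, 101):
    b = b - alpha * grad_f(b)
    # Bound ||x^(0) - x*||_2^2 / (2 alpha k) with x^(0) = 0
    bound = np.sum(b_star ** 2) / (2 * alpha * k)
    assert f(b) - f(b_star) <= bound + 1e-9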