Convex Optimization
DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda

Outline: Convexity · Differentiable convex functions · Minimizing differentiable convex functions


  1. Direction of maximum variation. $\nabla f(\vec{x})$ is the direction of maximum increase and $-\nabla f(\vec{x})$ the direction of maximum decrease: for any unit-norm $\vec{u}$, $f'_{\vec{u}}(\vec{x}) = \nabla f(\vec{x})^T \vec{u} \leq \|\nabla f(\vec{x})\|_2 \|\vec{u}\|_2$ (Cauchy-Schwarz inequality) $= \|\nabla f(\vec{x})\|_2$, with equality if and only if $\vec{u} = \pm \nabla f(\vec{x}) / \|\nabla f(\vec{x})\|_2$.

  2. Gradient

  3. First-order approximation. The first-order or linear approximation of $f: \mathbb{R}^n \to \mathbb{R}$ at $\vec{x}$ is $f^1_{\vec{x}}(\vec{y}) := f(\vec{x}) + \nabla f(\vec{x})^T (\vec{y} - \vec{x})$. If $f$ is continuously differentiable at $\vec{x}$, then $\lim_{\vec{y} \to \vec{x}} \frac{f(\vec{y}) - f^1_{\vec{x}}(\vec{y})}{\|\vec{y} - \vec{x}\|_2} = 0$.
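As a quick numerical illustration of this limit (not part of the slides; the function, point, and direction below are arbitrary choices), the error of the linear approximation shrinks faster than $\|\vec{y} - \vec{x}\|_2$:

```python
import numpy as np

# Illustrative smooth function f(x) = exp(x[0]) + x[1]^2 with its gradient.
f = lambda x: np.exp(x[0]) + x[1] ** 2
grad_f = lambda x: np.array([np.exp(x[0]), 2 * x[1]])

x = np.array([0.5, -1.0])
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    y = x + t * np.array([1.0, 2.0])                   # approach x along a fixed direction
    f1 = f(x) + grad_f(x) @ (y - x)                    # first-order approximation f^1_x(y)
    print(t, abs(f(y) - f1) / np.linalg.norm(y - x))   # ratio tends to 0 as y -> x
```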

  4. First-order approximation. [Figure: $f(\vec{y})$ and its linear approximation $f^1_{\vec{x}}(\vec{y})$ at $\vec{x}$.]

  5. Convexity. A differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if and only if for every $\vec{x}, \vec{y} \in \mathbb{R}^n$, $f(\vec{y}) \geq f(\vec{x}) + \nabla f(\vec{x})^T (\vec{y} - \vec{x})$. It is strictly convex if and only if $f(\vec{y}) > f(\vec{x}) + \nabla f(\vec{x})^T (\vec{y} - \vec{x})$ for all $\vec{y} \neq \vec{x}$.

  6. Optimality condition. If $f$ is convex and $\nabla f(\vec{x}) = 0$, then for any $\vec{y} \in \mathbb{R}^n$, $f(\vec{y}) \geq f(\vec{x})$. If $f$ is strictly convex, then for any $\vec{y} \neq \vec{x}$, $f(\vec{y}) > f(\vec{x})$.

  7. Epigraph. The epigraph of $f: \mathbb{R}^n \to \mathbb{R}$ is $\mathrm{epi}(f) := \{ \vec{x} \in \mathbb{R}^{n+1} \mid f(\vec{x}[1], \dots, \vec{x}[n]) \leq \vec{x}[n+1] \}$.

  8. Epigraph. [Figure: the region $\mathrm{epi}(f)$ above the graph of $f$.]

  9. Supporting hyperplane. A hyperplane $H$ is a supporting hyperplane of a set $S$ at $\vec{x}$ if ◮ $H$ and $S$ intersect at $\vec{x}$, and ◮ $S$ is contained in one of the half-spaces bounded by $H$.

  10. Geometric intuition. Geometrically, $f$ is convex if and only if, for every $\vec{x}$, the hyperplane $H_{f,\vec{x}} := \{ \vec{y} \in \mathbb{R}^{n+1} \mid \vec{y}[n+1] = f^1_{\vec{x}}(\vec{y}[1], \dots, \vec{y}[n]) \}$ is a supporting hyperplane of the epigraph at $\vec{x}$. If $\nabla f(\vec{x}) = 0$, the hyperplane is horizontal.

  11. Convexity. [Figure: $f(\vec{y})$ lies above its linear approximation $f^1_{\vec{x}}(\vec{y})$ at every $\vec{x}$.]

  12. Hessian matrix. If $f$ is twice differentiable, it has a Hessian matrix at every point:
  $$\nabla^2 f(\vec{x}) = \begin{bmatrix} \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[1]^2} & \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[1]\,\partial \vec{x}[2]} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[1]\,\partial \vec{x}[n]} \\ \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[1]\,\partial \vec{x}[2]} & \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[2]^2} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[2]\,\partial \vec{x}[n]} \\ \vdots & & & \vdots \\ \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[1]\,\partial \vec{x}[n]} & \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[2]\,\partial \vec{x}[n]} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial \vec{x}[n]^2} \end{bmatrix}$$

  13. Curvature. The second directional derivative $f''_{\vec{u}}$ of $f$ at $\vec{x}$ equals $f''_{\vec{u}}(\vec{x}) = \vec{u}^T \nabla^2 f(\vec{x})\, \vec{u}$ for any unit-norm vector $\vec{u} \in \mathbb{R}^n$.

  14. Second-order approximation. The second-order or quadratic approximation of $f$ at $\vec{x}$ is $f^2_{\vec{x}}(\vec{y}) := f(\vec{x}) + \nabla f(\vec{x})^T (\vec{y} - \vec{x}) + \frac{1}{2} (\vec{y} - \vec{x})^T \nabla^2 f(\vec{x}) (\vec{y} - \vec{x})$.

  15. Second-order approximation. [Figure: $f(\vec{y})$ and its quadratic approximation $f^2_{\vec{x}}(\vec{y})$ at $\vec{x}$.]

  16. Quadratic form. A second-order polynomial in several dimensions, $q(\vec{x}) := \vec{x}^T A \vec{x} + \vec{b}^T \vec{x} + c$, parametrized by a symmetric matrix $A \in \mathbb{R}^{n \times n}$, a vector $\vec{b} \in \mathbb{R}^n$, and a constant $c$.

  17. Quadratic approximation. The quadratic approximation $f^2_{\vec{x}}: \mathbb{R}^n \to \mathbb{R}$ at $\vec{x} \in \mathbb{R}^n$ of a twice-continuously differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ satisfies $\lim_{\vec{y} \to \vec{x}} \frac{f(\vec{y}) - f^2_{\vec{x}}(\vec{y})}{\|\vec{y} - \vec{x}\|_2^2} = 0$.
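The same kind of check (again illustrative, reusing the function from the earlier sketch with a hand-coded Hessian) shows the quadratic approximation error vanishing faster than $\|\vec{y} - \vec{x}\|_2^2$:

```python
import numpy as np

# Same illustrative function; its Hessian happens to be diagonal.
f = lambda x: np.exp(x[0]) + x[1] ** 2
grad_f = lambda x: np.array([np.exp(x[0]), 2 * x[1]])
hess_f = lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])

x = np.array([0.5, -1.0])
for t in [1e-1, 1e-2, 1e-3]:
    d = t * np.array([1.0, 2.0])
    y = x + d
    f2 = f(x) + grad_f(x) @ d + 0.5 * d @ hess_f(x) @ d    # f^2_x(y)
    print(t, abs(f(y) - f2) / np.linalg.norm(d) ** 2)      # ratio tends to 0 as y -> x
```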

  18. Eigendecomposition of symmetric matrices. Let $A = U \Lambda U^T$ be the eigendecomposition of a symmetric matrix $A$. Eigenvalues: $\lambda_1 \geq \cdots \geq \lambda_n$ (which can be negative or 0). Eigenvectors: $\vec{u}_1, \dots, \vec{u}_n$, an orthonormal basis. Then $\lambda_1 = \max_{\{\|\vec{x}\|_2 = 1,\ \vec{x} \in \mathbb{R}^n\}} \vec{x}^T A \vec{x}$, $\vec{u}_1 = \arg\max_{\{\|\vec{x}\|_2 = 1,\ \vec{x} \in \mathbb{R}^n\}} \vec{x}^T A \vec{x}$, $\lambda_n = \min_{\{\|\vec{x}\|_2 = 1,\ \vec{x} \in \mathbb{R}^n\}} \vec{x}^T A \vec{x}$, and $\vec{u}_n = \arg\min_{\{\|\vec{x}\|_2 = 1,\ \vec{x} \in \mathbb{R}^n\}} \vec{x}^T A \vec{x}$.
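A minimal sketch (assuming NumPy; the symmetric matrix is random and purely illustrative) verifying that the extreme eigenvectors attain the maximum and minimum of $\vec{x}^T A \vec{x}$ over the unit sphere:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                       # symmetrize

eigvals, U = np.linalg.eigh(A)          # eigh returns eigenvalues in ascending order
lam_n, lam_1 = eigvals[0], eigvals[-1]
u_n, u_1 = U[:, 0], U[:, -1]

print(lam_1, u_1 @ A @ u_1)             # lambda_1 equals the quadratic form at u_1
print(lam_n, u_n @ A @ u_n)             # lambda_n equals the quadratic form at u_n

# Any other unit vector gives a value between lambda_n and lambda_1.
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
assert lam_n - 1e-9 <= x @ A @ x <= lam_1 + 1e-9
```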

  19. Maximum and minimum curvature. Let $\nabla^2 f(\vec{x}) = U \Lambda U^T$ be the eigendecomposition of the Hessian at $\vec{x}$. Direction of maximum curvature: $\vec{u}_1$. Direction of minimum curvature (or maximum negative curvature): $\vec{u}_n$.

  20. Positive semidefinite matrices. For any $\vec{x}$, $\vec{x}^T A \vec{x} = \vec{x}^T U \Lambda U^T \vec{x} = \sum_{i=1}^n \lambda_i \langle \vec{u}_i, \vec{x} \rangle^2$. All eigenvalues are nonnegative if and only if $\vec{x}^T A \vec{x} \geq 0$ for all $\vec{x}$; the matrix is then said to be positive semidefinite.

  21. Positive (negative) (semi)definite matrices. Positive (semi)definite: all eigenvalues are positive (nonnegative); equivalently, $\vec{x}^T A \vec{x} > (\geq)\ 0$ for all $\vec{x}$. Quadratic form: all directions have positive curvature. Negative (semi)definite: all eigenvalues are negative (nonpositive); equivalently, $\vec{x}^T A \vec{x} < (\leq)\ 0$ for all $\vec{x}$. Quadratic form: all directions have negative curvature.

  22. Convexity. A twice-differentiable function $g: \mathbb{R} \to \mathbb{R}$ is convex if and only if $g''(x) \geq 0$ for all $x \in \mathbb{R}$. A twice-differentiable function on $\mathbb{R}^n$ is convex if and only if its Hessian is positive semidefinite at every point. If the Hessian is positive definite, the function is strictly convex.
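As a concrete convexity check (a sketch, assuming NumPy; the design matrix is synthetic), the least-squares cost used later has Hessian $2X^TX$, whose eigenvalues are nonnegative, so the cost is convex, and strictly convex when $X$ has full column rank:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
H = 2 * X.T @ X                   # Hessian of ||y - X b||_2^2 (constant in b)

eigvals = np.linalg.eigvalsh(H)
print(eigvals)                    # nonnegative -> positive semidefinite -> convex
print(np.all(eigvals > 0))        # strictly positive -> positive definite -> strictly convex
```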

  23. Second-order approximation. [Figure: $f(\vec{y})$ and its quadratic approximation $f^2_{\vec{x}}(\vec{y})$ at $\vec{x}$.]

  24. Convex

  25. Concave

  26. Neither

  27. Convexity · Differentiable convex functions · Minimizing differentiable convex functions

  28. Problem. Challenge: minimize a differentiable convex function, $\min_{\vec{x} \in \mathbb{R}^n} f(\vec{x})$.

  29. Gradient descent. Intuition: make local progress in the steepest-descent direction $-\nabla f(\vec{x})$. Set the initial point $\vec{x}^{(0)}$ to an arbitrary value and update by setting $\vec{x}^{(k+1)} := \vec{x}^{(k)} - \alpha_k \nabla f(\vec{x}^{(k)})$, where $\alpha_k > 0$ is the step size, until a stopping criterion is met.
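A minimal sketch of this update rule (assuming NumPy; the function name `gradient_descent`, the constant step size, and the stopping rule are placeholder choices, not the course's code):

```python
import numpy as np

def gradient_descent(f_grad, x0, step=0.1, max_iters=1000, tol=1e-8):
    """Iterate x^(k+1) = x^(k) - alpha_k * grad f(x^(k)) with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = f_grad(x)
        if np.linalg.norm(g) < tol:    # stopping criterion: gradient close to zero
            break
        x = x - step * g
    return x

# Example: minimize f(x) = ||x - c||_2^2, whose gradient is 2 (x - c).
c = np.array([1.0, -2.0, 3.0])
print(gradient_descent(lambda x: 2 * (x - c), np.zeros(3)))   # approaches c
```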

  30. Gradient descent

  31. Gradient descent. [Figure.]

  32. Small step size. [Figure.]

  33. Large step size. [Figure.]

  34. Line search. Idea: find the minimum of $h(\alpha) := f(\vec{x}^{(k)} - \alpha \nabla f(\vec{x}^{(k)}))$, i.e. $\alpha_k := \arg\min_{\alpha \in \mathbb{R}} h(\alpha) = \arg\min_{\alpha \in \mathbb{R}} f(\vec{x}^{(k)} - \alpha \nabla f(\vec{x}^{(k)}))$.

  35. Backtracking line search with Armijo rule. Given $\alpha_0 \geq 0$ and $\beta, \eta \in (0, 1)$, set $\alpha_k := \alpha_0 \beta^i$ for the smallest $i$ such that $\vec{x}^{(k+1)} := \vec{x}^{(k)} - \alpha_k \nabla f(\vec{x}^{(k)})$ satisfies $f(\vec{x}^{(k+1)}) \leq f(\vec{x}^{(k)}) - \frac{1}{2} \alpha_k \|\nabla f(\vec{x}^{(k)})\|_2^2$, a condition known as the Armijo rule.
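A sketch of one backtracking step under this rule (assuming NumPy; $\alpha_0 = 1$ and $\beta = 1/2$ are placeholder values, and the constant $1/2$ in the sufficient-decrease test follows the slide):

```python
import numpy as np

def backtracking_step(f, grad, x, alpha0=1.0, beta=0.5, max_halvings=50):
    """Return x^(k+1) with alpha_k = alpha0 * beta**i for the smallest i satisfying the Armijo rule."""
    g = grad(x)
    gnorm2 = g @ g
    alpha = alpha0
    for _ in range(max_halvings):
        x_new = x - alpha * g
        if f(x_new) <= f(x) - 0.5 * alpha * gnorm2:   # Armijo sufficient decrease
            return x_new, alpha
        alpha *= beta                                  # shrink the step and retry
    return x - alpha * g, alpha

# Example on f(x) = ||x||_2^2 starting from an arbitrary point.
f = lambda x: x @ x
grad = lambda x: 2 * x
x_next, alpha = backtracking_step(f, grad, np.array([3.0, -4.0]))
print(alpha, x_next)
```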

  36. Backtracking line search with Armijo rule. [Figure.]

  37. Gradient descent for least squares. Aim: use $n$ examples $(\vec{x}^{(1)}, y^{(1)}), (\vec{x}^{(2)}, y^{(2)}), \dots, (\vec{x}^{(n)}, y^{(n)})$ to fit a linear model by minimizing the least-squares cost function $\text{minimize}_{\vec{\beta} \in \mathbb{R}^p}\ \|\vec{y} - X \vec{\beta}\|_2^2$.

  38–41. Gradient descent for least squares. The gradient of the quadratic function $f(\vec{\beta}) := \|\vec{y} - X \vec{\beta}\|_2^2 = \vec{\beta}^T X^T X \vec{\beta} - 2 \vec{\beta}^T X^T \vec{y} + \vec{y}^T \vec{y}$ equals $\nabla f(\vec{\beta}) = 2 X^T X \vec{\beta} - 2 X^T \vec{y}$. The gradient descent updates are $\vec{\beta}^{(k+1)} = \vec{\beta}^{(k)} + 2 \alpha_k X^T (\vec{y} - X \vec{\beta}^{(k)}) = \vec{\beta}^{(k)} + 2 \alpha_k \sum_{i=1}^n \big( y^{(i)} - \langle \vec{x}^{(i)}, \vec{\beta}^{(k)} \rangle \big)\, \vec{x}^{(i)}$.
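A minimal sketch of these updates on synthetic data (assuming NumPy; the data, noise level, and iteration count are placeholders, and the constant step $1/L$ anticipates the Lipschitz discussion below):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Step size 1/L with L = 2 * lambda_max(X^T X), the Lipschitz constant of the gradient.
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)

beta = np.zeros(p)
for _ in range(5000):
    beta = beta + 2 * step * X.T @ (y - X @ beta)   # beta^(k+1) = beta^(k) + 2 alpha_k X^T (y - X beta^(k))

print(beta)                                         # close to the least-squares solution
print(np.linalg.lstsq(X, y, rcond=None)[0])
```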

  42. Gradient ascent for logistic regression. Aim: use $n$ examples $(\vec{x}^{(1)}, y^{(1)}), (\vec{x}^{(2)}, y^{(2)}), \dots, (\vec{x}^{(n)}, y^{(n)})$ to fit a logistic-regression model by maximizing the log-likelihood cost function $f(\vec{\beta}) := \sum_{i=1}^n y^{(i)} \log g(\langle \vec{x}^{(i)}, \vec{\beta} \rangle) + (1 - y^{(i)}) \log \big(1 - g(\langle \vec{x}^{(i)}, \vec{\beta} \rangle)\big)$, where $g(t) = \frac{1}{1 + \exp(-t)}$.

  43–46. Gradient ascent for logistic regression. Since $g'(t) = g(t)(1 - g(t))$ and $(1 - g(t))' = -g(t)(1 - g(t))$, the gradient of the cost function equals $\nabla f(\vec{\beta}) = \sum_{i=1}^n y^{(i)} \big(1 - g(\langle \vec{x}^{(i)}, \vec{\beta} \rangle)\big)\, \vec{x}^{(i)} - (1 - y^{(i)})\, g(\langle \vec{x}^{(i)}, \vec{\beta} \rangle)\, \vec{x}^{(i)}$. The gradient ascent updates are $\vec{\beta}^{(k+1)} := \vec{\beta}^{(k)} + \alpha_k \sum_{i=1}^n \Big( y^{(i)} \big(1 - g(\langle \vec{x}^{(i)}, \vec{\beta}^{(k)} \rangle)\big) - (1 - y^{(i)})\, g(\langle \vec{x}^{(i)}, \vec{\beta}^{(k)} \rangle) \Big)\, \vec{x}^{(i)}$.
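A sketch of these ascent updates on synthetic labels (assuming NumPy; the Bernoulli data, fixed step size, and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)   # Bernoulli labels

beta = np.zeros(p)
step = 0.01
for _ in range(2000):
    g = sigmoid(X @ beta)
    grad = X.T @ (y * (1 - g) - (1 - y) * g)   # gradient of the log-likelihood
    beta = beta + step * grad                   # ascent step

print(beta)                                     # roughly recovers beta_true
```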

  47–48. Convergence of gradient descent. Does the method converge? How fast (or slow)? For what step sizes? The answers depend on the function.

  49. Lipschitz continuity. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz continuous if for any $\vec{x}, \vec{y} \in \mathbb{R}^n$, $\|f(\vec{y}) - f(\vec{x})\|_2 \leq L \|\vec{y} - \vec{x}\|_2$. $L$ is the Lipschitz constant.

  50. Lipschitz-continuous gradients. If $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$, i.e. $\|\nabla f(\vec{y}) - \nabla f(\vec{x})\|_2 \leq L \|\vec{y} - \vec{x}\|_2$, then for any $\vec{x}, \vec{y} \in \mathbb{R}^n$ we have the quadratic upper bound $f(\vec{y}) \leq f(\vec{x}) + \nabla f(\vec{x})^T (\vec{y} - \vec{x}) + \frac{L}{2} \|\vec{y} - \vec{x}\|_2^2$.
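As a concrete instance (an observation consistent with the least-squares example, though not stated on the slides), the gradient $2X^TX\vec{\beta} - 2X^T\vec{y}$ is Lipschitz with $L = 2\lambda_{\max}(X^TX)$, and the quadratic upper bound can be checked numerically (a sketch, assuming NumPy, with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

f = lambda b: np.sum((y - X @ b) ** 2)
grad = lambda b: 2 * X.T @ X @ b - 2 * X.T @ y
L = 2 * np.linalg.norm(X, 2) ** 2               # 2 * lambda_max(X^T X)

# Check f(b2) <= f(b1) + grad(b1)^T (b2 - b1) + (L/2) ||b2 - b1||_2^2 at random points.
b1, b2 = rng.standard_normal(4), rng.standard_normal(4)
upper = f(b1) + grad(b1) @ (b2 - b1) + 0.5 * L * np.sum((b2 - b1) ** 2)
print(f(b2) <= upper + 1e-9)                    # True
```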

  51–54. Local progress of gradient descent. For the update $\vec{x}^{(k+1)} := \vec{x}^{(k)} - \alpha_k \nabla f(\vec{x}^{(k)})$, the quadratic upper bound gives $f(\vec{x}^{(k+1)}) \leq f(\vec{x}^{(k)}) + \nabla f(\vec{x}^{(k)})^T (\vec{x}^{(k+1)} - \vec{x}^{(k)}) + \frac{L}{2} \|\vec{x}^{(k+1)} - \vec{x}^{(k)}\|_2^2 = f(\vec{x}^{(k)}) - \alpha_k \left(1 - \frac{\alpha_k L}{2}\right) \|\nabla f(\vec{x}^{(k)})\|_2^2$. If $\alpha_k \leq \frac{1}{L}$, then $f(\vec{x}^{(k+1)}) \leq f(\vec{x}^{(k)}) - \frac{\alpha_k}{2} \|\nabla f(\vec{x}^{(k)})\|_2^2$.

  55. Convergence of gradient descent. Assume ◮ $f$ is convex, ◮ $\nabla f$ is $L$-Lipschitz continuous, ◮ there exists a point $\vec{x}^*$ at which $f$ achieves a finite minimum, and ◮ the step size is set to $\alpha_k := \alpha \leq 1/L$. Then $f(\vec{x}^{(k)}) - f(\vec{x}^*) \leq \frac{\|\vec{x}^{(0)} - \vec{x}^*\|_2^2}{2 \alpha k}$.
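A numerical sanity check of this guarantee on the least-squares example above (a sketch, assuming NumPy; the synthetic data and the choice $\alpha = 1/L$ are placeholders):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

f = lambda b: np.sum((y - X @ b) ** 2)
grad = lambda b: 2 * X.T @ (X @ b - y)
L = 2 * np.linalg.norm(X, 2) ** 2
alpha = 1.0 / L

b_star = np.linalg.lstsq(X, y, rcond=None)[0]          # minimizer x*
f_star = f(b_star)

b = np.zeros(3)                                         # x^(0)
for k in range(1, 201):
    b = b - alpha * grad(b)
    bound = np.sum(b_star ** 2) / (2 * alpha * k)       # ||x^(0) - x*||_2^2 / (2 alpha k)
    assert f(b) - f_star <= bound + 1e-9                # the O(1/k) bound holds
```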

  56. Convergence of gradient descent (proof sketch). By the local-progress bound, $f(\vec{x}^{(k)}) \leq f(\vec{x}^{(k-1)}) - \frac{\alpha_k}{2} \|\nabla f(\vec{x}^{(k-1)})\|_2^2$. By convexity, $f(\vec{x}^{(k-1)}) + \nabla f(\vec{x}^{(k-1)})^T (\vec{x}^* - \vec{x}^{(k-1)}) \leq f(\vec{x}^*)$. Combining the two, $f(\vec{x}^{(k)}) - f(\vec{x}^*) \leq \nabla f(\vec{x}^{(k-1)})^T (\vec{x}^{(k-1)} - \vec{x}^*) - \frac{\alpha_k}{2} \|\nabla f(\vec{x}^{(k-1)})\|_2^2$.
