
Numerical Optimization
Shan-Hung Wu <shwu@cs.nthu.edu.tw>
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Optimization Problems

An optimization problem is to minimize a cost function f : R^d → R:

    min_x f(x)   subject to x ∈ C,

where C ⊆ R^d is called the feasible set, containing the feasible points.
- Maximizing an objective function f is equivalent to minimizing −f.
- If C = R^d, we say the optimization problem is unconstrained.
- C can be a set of function constraints, i.e., C = {x : g^(i)(x) ≤ 0, ∀i}.
- Sometimes we single out the equality constraints: C = {x : g^(i)(x) ≤ 0, h^(j)(x) = 0, ∀i, j}. Each equality constraint can be written as two inequality constraints.

Minimums and Optimal Points

- Critical points: {x : f′(x) = 0}.
- Minima: {x : f′(x) = 0 and H(f)(x) ⪰ O}, where H(f)(x) is the Hessian matrix (containing the curvatures) of f at point x.
- Maxima: {x : f′(x) = 0 and H(f)(x) ⪯ O}.
- Plateau or saddle points: {x : f′(x) = 0 and H(f)(x) = O or indefinite}.
- y* = min_{x∈C} f(x) ∈ R is called the global minimum. Global minima vs. local minima.
- x* = argmin_{x∈C} f(x) is called an optimal point.

Convex Optimization Problems

An optimization problem is convex iff
- f is convex by having a "convex hull" surface, i.e., H(f)(x) ⪰ O, ∀x, and
- the g_i(x)'s are convex and the h_j(x)'s are affine.

Convex problems are "easier" since
- local minima are necessarily global minima, and
- there are no saddle points,
so we can get the global minimum by solving f′(x) = 0.

Analytical Solutions vs. Numerical Solutions I

Consider the problem:

    argmin_x (1/2)(‖Ax − b‖² + λ‖x‖²)

Analytical solutions?
- The cost function f(x) = (1/2) x^T (A^T A + λI) x − b^T Ax + (1/2)‖b‖² is convex.
- Solving f′(x) = x^T (A^T A + λI) − b^T A = 0, we have

    x* = (A^T A + λI)^{−1} A^T b

Analytical Solutions vs. Numerical Solutions II

Problem (A ∈ R^{n×d}, b ∈ R^n, λ ∈ R):

    argmin_{x∈R^d} (1/2)(‖Ax − b‖² + λ‖x‖²)

Analytical solution: x* = (A^T A + λI)^{−1} A^T b.

- In practice, we may not be able to solve f′(x) = 0 analytically and get x in a closed form, e.g., when λ = 0 and n < d.
- Even if we can, the computation cost may be too high; e.g., inverting A^T A + λI ∈ R^{d×d} takes O(d³) time.
- Numerical methods: since numerical errors are inevitable anyway, why not just obtain an approximation of x*?
  - Start from x^(0) and iteratively calculate x^(1), x^(2), … such that f(x^(1)) ≥ f(x^(2)) ≥ ….
  - This usually requires much less time to reach a good enough x^(t) ≈ x*.
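The closed form above is easy to check numerically. The sketch below is a hypothetical d = 2 instance in plain Python (the toy A, b, and λ are my own choices, not from the slides): it builds A^T A + λI and A^T b, inverts the 2×2 matrix explicitly, and verifies that the gradient (A^T A + λI)x* − A^T b vanishes at the solution.

```python
# Closed-form ridge solution x* = (A^T A + λI)^{-1} A^T b for d = 2.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy A, shape n x 2
b = [1.0, 2.0, 3.0]                         # toy b, length n
lam = 0.1

# Accumulate M = A^T A + λI and v = A^T b.
M = [[lam, 0.0], [0.0, lam]]
v = [0.0, 0.0]
for (a1, a2), bi in zip(A, b):
    M[0][0] += a1 * a1; M[0][1] += a1 * a2
    M[1][0] += a2 * a1; M[1][1] += a2 * a2
    v[0] += a1 * bi;    v[1] += a2 * bi

# Invert the 2x2 matrix M explicitly and apply it to v.
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
x = [( M[1][1] * v[0] - M[0][1] * v[1]) / det,
     (-M[1][0] * v[0] + M[0][0] * v[1]) / det]

# The gradient (A^T A + λI)x - A^T b should vanish at the optimum.
grad = [M[0][0] * x[0] + M[0][1] * x[1] - v[0],
        M[1][0] * x[0] + M[1][1] * x[1] - v[1]]
print(x, grad)
```

For realistic d one would solve the linear system with a numerical library rather than invert the matrix, which is exactly the O(d³) cost the slide warns about.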

Outline

1. Numerical Computation
2. Optimization Problems
3. Unconstrained Optimization
   - Gradient Descent
   - Newton's Method
4. Optimization in ML: Stochastic Gradient Descent
   - Perceptron
   - Adaline
   - Stochastic Gradient Descent
5. Constrained Optimization
6. Optimization in ML: Regularization
   - Linear Regression
   - Polynomial Regression
   - Generalizability & Regularization
7. Duality*

Unconstrained Optimization

Problem:

    min_{x∈R^d} f(x),

where f : R^d → R is not necessarily convex.

General Descent Algorithm

Input: x^(0) ∈ R^d, an initial guess
repeat
    Determine a descent direction d^(t) ∈ R^d;
    Line search: choose a step size, or learning rate, η^(t) > 0 such that f(x^(t) + η^(t) d^(t)) is minimal along the ray x^(t) + η d^(t), η > 0;
    Update rule: x^(t+1) ← x^(t) + η^(t) d^(t);
until the convergence criterion is satisfied;

- Convergence criteria: ‖x^(t+1) − x^(t)‖ ≤ ε, ‖∇f(x^(t+1))‖ ≤ ε, etc.
- The line-search step can be skipped by letting η^(t) be a small constant.


Gradient Descent I

By Taylor's theorem, we can approximate f locally at point x^(t) using a linear function f̃, i.e.,

    f(x) ≈ f̃(x; x^(t)) = f(x^(t)) + ∇f(x^(t))^T (x − x^(t))

for x close enough to x^(t).
- This implies that if we pick a close x^(t+1) that decreases f̃, we are likely to decrease f as well.
- We can pick x^(t+1) = x^(t) − η∇f(x^(t)) for some small η > 0, since

    f̃(x^(t+1)) = f(x^(t)) − η‖∇f(x^(t))‖² ≤ f̃(x^(t))

Gradient Descent II

Input: x^(0) ∈ R^d, an initial guess; a small η > 0
repeat
    x^(t+1) ← x^(t) − η∇f(x^(t));
until the convergence criterion is satisfied;
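The loop above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy quadratic of my own choosing, with the line search skipped in favor of a small constant η, as the earlier slide permits:

```python
# Gradient descent with a constant learning rate on the toy function
# f(x) = (x1 - 1)^2 + 2*(x2 + 2)^2, whose minimum is at (1, -2).
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]

x = [5.0, 5.0]   # x^(0), an arbitrary initial guess
eta = 0.1        # small constant step size (no line search)
for _ in range(200):
    g = grad_f(x)
    x = [x[0] - eta * g[0], x[1] - eta * g[1]]
print(x)
```

A production loop would also test a convergence criterion such as ‖∇f(x^(t))‖ ≤ ε instead of running a fixed number of iterations.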

Is Negative Gradient a Good Direction? I

Update rule: x^(t+1) ← x^(t) − η∇f(x^(t))
- Yes, as ∇f(x^(t)) ∈ R^d is the steepest ascent direction of f at point x^(t), and −∇f(x^(t)) ∈ R^d is the steepest descent direction.
- But why?

Is Negative Gradient a Good Direction? II

Consider the slope of f in a given direction u at point x^(t).
- This is the directional derivative of f, i.e., the derivative of the function f(x^(t) + εu) with respect to ε, evaluated at ε = 0.
- By the chain rule, we have ∂/∂ε f(x^(t) + εu) = ∇f(x^(t) + εu)^T u, which equals ∇f(x^(t))^T u when ε = 0.

Theorem (Chain Rule)
Let g : R → R^d and f : R^d → R. Then

    (f ∘ g)′(x) = f′(g(x)) g′(x) = ∇f(g(x))^T [g′_1(x), …, g′_d(x)]^T.
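To make the identity concrete, here is a small numeric check (toy function, point, and direction are my own choices): the chain-rule value ∇f(x)^T u should match a finite-difference estimate of the derivative of ε ↦ f(x + εu) at ε = 0.

```python
import math

# Directional derivative of f(x) = x1^2 + 3*x1*x2 at x = (1, 2)
# along the unit direction u = (1, 1)/sqrt(2), computed two ways.
def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

x = [1.0, 2.0]
g = [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]    # ∇f(x) = (2*x1 + 3*x2, 3*x1)
u = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]

analytic = g[0] * u[0] + g[1] * u[1]         # chain rule: ∇f(x)^T u
eps = 1e-6                                   # finite-difference step
numeric = (f([x[0] + eps * u[0], x[1] + eps * u[1]]) - f(x)) / eps
print(analytic, numeric)
```

The two values agree up to the finite-difference error, which is what the chain-rule theorem predicts.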

Is Negative Gradient a Good Direction? III

To find the direction that decreases f fastest at x^(t), we solve the problem:

    argmin_{u, ‖u‖=1} ∇f(x^(t))^T u = argmin_{u, ‖u‖=1} ‖∇f(x^(t))‖ ‖u‖ cos θ,

where θ is the angle between u and ∇f(x^(t)).
- This amounts to solving argmin_u cos θ.
- So u* = −∇f(x^(t)) / ‖∇f(x^(t))‖, i.e., the steepest descent direction of f at point x^(t) points along −∇f(x^(t)).

How to Set Learning Rate η? I

- Too small an η results in slow descent and many iterations.
- Too large an η may overshoot the optimal point along the gradient and go uphill.
- One way to set a better η is to leverage the curvature of f: the more curved f is at point x^(t), the smaller η should be.

How to Set Learning Rate η? II

By Taylor's theorem, we can approximate f locally at point x^(t) using a quadratic function f̃:

    f(x) ≈ f̃(x; x^(t)) = f(x^(t)) + ∇f(x^(t))^T (x − x^(t)) + (1/2)(x − x^(t))^T H(f)(x^(t)) (x − x^(t))

for x close enough to x^(t), where H(f)(x^(t)) ∈ R^{d×d} is the (symmetric) Hessian matrix of f at x^(t).

Line search at step t:

    argmin_η f̃(x^(t) − η∇f(x^(t))) = argmin_η f(x^(t)) − η ∇f(x^(t))^T ∇f(x^(t)) + (η²/2) ∇f(x^(t))^T H(f)(x^(t)) ∇f(x^(t))

If ∇f(x^(t))^T H(f)(x^(t)) ∇f(x^(t)) > 0, we can solve (∂/∂η) f̃(x^(t) − η∇f(x^(t))) = 0 and get:

    η^(t) = (∇f(x^(t))^T ∇f(x^(t))) / (∇f(x^(t))^T H(f)(x^(t)) ∇f(x^(t)))
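For a quadratic f this η^(t) is the exact line-search minimizer. A sketch with an assumed diagonal Hessian H = diag(1, 10) (my own toy setup): after the step, the new gradient is orthogonal to the previous search direction, the defining property of an exact line search.

```python
# Exact line-search step for f(x) = 0.5*(h1*x1^2 + h2*x2^2),
# whose Hessian is diag(h1, h2) and gradient is (h1*x1, h2*x2).
h = [1.0, 10.0]                              # diagonal Hessian entries
x = [10.0, 1.0]
g = [h[0] * x[0], h[1] * x[1]]               # ∇f(x)
gg = g[0] ** 2 + g[1] ** 2                   # ∇f^T ∇f
gHg = h[0] * g[0] ** 2 + h[1] * g[1] ** 2    # ∇f^T H ∇f
eta = gg / gHg                               # η^(t) from the formula above

x_new = [x[0] - eta * g[0], x[1] - eta * g[1]]
g_new = [h[0] * x_new[0], h[1] * x_new[1]]
# At the exact line-search minimum, the new gradient is orthogonal
# to the old search direction.
print(eta, g_new[0] * g[0] + g_new[1] * g[1])
```

This orthogonality is also the source of the zig-zag behavior discussed next: successive steps are forced to turn 90 degrees.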

Problems of Gradient Descent

- Gradient descent is designed to find the steepest descent direction at step x^(t); it is not aware of the conditioning of the Hessian matrix H(f)(x^(t)).
- If H(f)(x^(t)) has a large condition number, then f is curved in some directions but flat in others at x^(t). E.g., suppose f is a quadratic function whose Hessian has a large condition number:
  - A gradient descent step may overshoot the optimal point along the flat directions.
  - It "zig-zags" around a narrow valley.
- Why not take conditioning into account when picking descent directions?


Newton's Method I

By Taylor's theorem, we can approximate f locally at point x^(t) using a quadratic function f̃, i.e.,

    f(x) ≈ f̃(x; x^(t)) = f(x^(t)) + ∇f(x^(t))^T (x − x^(t)) + (1/2)(x − x^(t))^T H(f)(x^(t)) (x − x^(t))

for x close enough to x^(t).
- If f is strictly convex (i.e., H(f)(a) ≻ O, ∀a), we can find the x^(t+1) that minimizes f̃ in order to decrease f.
- Solving ∇f̃(x^(t+1); x^(t)) = 0, we have

    x^(t+1) = x^(t) − H(f)(x^(t))^{−1} ∇f(x^(t))

- H(f)(x^(t))^{−1} acts as a "corrector" to the negative gradient.

Newton's Method II

Input: x^(0) ∈ R^d, an initial guess; η > 0
repeat
    x^(t+1) ← x^(t) − η H(f)(x^(t))^{−1} ∇f(x^(t));
until the convergence criterion is satisfied;

In practice, we multiply the shift by a small η > 0 to make sure that x^(t+1) stays close to x^(t).

Newton's Method III

If f is a positive definite quadratic, then only one (full, η = 1) step is required.
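A quick check of the one-step property on an assumed positive definite quadratic (my own toy example, with a diagonal Hessian so the inverse is trivial):

```python
# One full Newton step (η = 1) on the positive definite quadratic
# f(x) = 0.5*(x1 - 3)^2 + 2*(x2 + 1)^2, with Hessian H = diag(1, 4).
x = [10.0, 10.0]                        # x^(0), an arbitrary start
g = [x[0] - 3.0, 4.0 * (x[1] + 1.0)]    # ∇f(x^(0))
h_inv = [1.0, 0.25]                     # H^{-1}, trivial for diagonal H
x = [x[0] - h_inv[0] * g[0], x[1] - h_inv[1] * g[1]]
print(x)   # lands exactly on the minimizer (3, -1)
```

Since f̃ equals f exactly for a quadratic, minimizing the Taylor model in one shot minimizes f itself.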

General Functions

Update rule: x^(t+1) ← x^(t) − η H(f)(x^(t))^{−1} ∇f(x^(t))
- What if f is not strictly convex? H(f)(x^(t)) may be ⪯ O or indefinite.
- The Levenberg–Marquardt extension:

    x^(t+1) = x^(t) − η (H(f)(x^(t)) + αI)^{−1} ∇f(x^(t))

  for some α > 0. With a large α, this degenerates into gradient descent with a small learning rate (η/α).

Input: x^(0) ∈ R^d, an initial guess; η > 0, α > 0
repeat
    x^(t+1) ← x^(t) − η (H(f)(x^(t)) + αI)^{−1} ∇f(x^(t));
until the convergence criterion is satisfied;
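A 1-D sketch of the damped update (the function, α, and starting point are my own choices): near x = 0 the second derivative of f(x) = x⁴ − 3x² is negative, so a plain Newton step from x = 0.3 would move toward the local maximum at 0, while adding α keeps f″(x) + α > 0 everywhere and the iteration descends to a true minimum at ±√(3/2).

```python
# Levenberg–Marquardt damping in 1-D on the non-convex f(x) = x^4 - 3x^2.
# Minima at x = ±sqrt(3/2); local maximum at x = 0 where f''(x) < 0.
def fp(x):   # f'(x)
    return 4.0 * x ** 3 - 6.0 * x

def fpp(x):  # f''(x); negative near x = 0
    return 12.0 * x ** 2 - 6.0

x, eta, alpha = 0.3, 1.0, 8.0   # alpha chosen so f''(x) + alpha > 0 for all x
for _ in range(100):
    x = x - eta * fp(x) / (fpp(x) + alpha)
print(x)   # converges to the minimizer sqrt(3/2)
```

In practice α is often adapted per step (increased when a step fails to decrease f, decreased when it succeeds) rather than held fixed as in this sketch.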

Gradient Descent vs. Newton's Method

[Figure: steps of gradient descent when f is Rosenbrock's banana function, compared with the steps of Newton's method, which takes only 6 steps in total.]

Problems of Newton's Method

- Computing H(f)(x^(t))^{−1} is slow: it takes O(d³) time at each step, which is much slower than the O(d) per step of gradient descent.
- The step x^(t+1) = x^(t) − η H(f)(x^(t))^{−1} ∇f(x^(t)) is imprecise due to numerical errors, since H(f)(x^(t)) may have a large condition number.
- It is attracted to saddle points (when f is not convex): the x^(t+1) solved from ∇f̃(x^(t+1); x^(t)) = 0 is merely a critical point.


Who is Afraid of Non-convexity?

- In ML, the function to optimize is usually the cost function C(w) of a model F = {f : f parametrized by w}.
- Many ML models have convex cost functions in order to take advantage of convex optimization, e.g., the perceptron, linear regression, logistic regression, SVMs, etc.
- However, in deep learning, the cost function of a neural network is typically not convex. We will discuss techniques that tackle non-convexity later.

Assumption on Cost Functions

In ML, we usually assume that the (real-valued) cost function C is Lipschitz continuous and/or has Lipschitz continuous derivatives, i.e., the rate of change of C is bounded by a Lipschitz constant K:

    |C(w^(1)) − C(w^(2))| ≤ K ‖w^(1) − w^(2)‖, ∀w^(1), w^(2)


Perceptron & Neurons

- The perceptron, proposed in the 1950s by Rosenblatt, is one of the first ML algorithms for binary classification.
- It was inspired by the McCulloch-Pitts (MCP) neuron, published in 1943:
  - Our brains consist of interconnected neurons.
  - Each neuron takes signals from other neurons as input.
  - If the accumulated signal exceeds a certain threshold, an output signal is generated.

Model

Binary classification problem:
- Training dataset: X = {(x^(i), y^(i))}_i, where x^(i) ∈ R^D and y^(i) ∈ {1, −1}.
- Output: a function f(x) = ŷ such that ŷ is close to the true label y.

Model: {f : f(x; w, b) = sign(w^T x − b)}, where sign(a) = 1 if a ≥ 0 and −1 otherwise.

For simplicity, we use the shorthand f(x; w) = sign(w^T x), where w = [−b, w_1, …, w_D]^T and x = [1, x_1, …, x_D]^T.

Iterative Training Algorithm I

1. Initialize w^(0) and the learning rate η > 0.
2. Epoch: for each example (x^(t), y^(t)), update w by

       w^(t+1) = w^(t) + η (y^(t) − ŷ^(t)) x^(t),

   where ŷ^(t) = f(x^(t); w^(t)) = sign(w^(t)T x^(t)).
3. Repeat the epoch several times (or until convergence).

Iterative Training Algorithm II

Update rule: w^(t+1) = w^(t) + η (y^(t) − ŷ^(t)) x^(t)
- If ŷ^(t) is correct, we have w^(t+1) = w^(t).
- If ŷ^(t) is incorrect, we have w^(t+1) = w^(t) + 2η y^(t) x^(t).
  - If y^(t) = 1, the updated prediction is more likely to be positive, as sign(w^(t+1)T x^(t)) = sign(w^(t)T x^(t) + c) for some c > 0.
  - If y^(t) = −1, the updated prediction is more likely to be negative.
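The full training loop can be sketched in plain Python on a hypothetical linearly separable toy set (data and η are my own choices), with the bias folded into w via a constant leading feature of 1, as in the model's shorthand:

```python
# Perceptron training on a linearly separable toy set; labels are ±1.
def sign(a):
    return 1 if a >= 0 else -1

# Each example starts with a constant 1 so the bias lives inside w.
X = [[1.0, 2.0, 1.0], [1.0, 1.0, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]
w = [0.0, 0.0, 0.0]   # w^(0)
eta = 0.5

for _ in range(20):                      # epochs
    for xi, yi in zip(X, y):
        yhat = sign(sum(wj * xj for wj, xj in zip(w, xi)))
        if yhat != yi:                   # only mistakes change w
            w = [wj + eta * (yi - yhat) * xj for wj, xj in zip(w, xi)]

preds = [sign(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
print(w, preds)
```

On separable data the perceptron convergence theorem guarantees this loop stops updating after finitely many mistakes; a fixed epoch count stands in for a convergence check here.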
