
Numerical Optimization
Shan-Hung Wu <shwu@cs.nthu.edu.tw>
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Optimization Problems

An optimization problem is to minimize a cost function f : R^d → R:

    min_x f(x)   subject to x ∈ C,

where C ⊆ R^d is called the feasible set, containing the feasible points.
- Maximizing an objective function f is equivalent to minimizing −f.
- If C = R^d, we say the optimization problem is unconstrained.
- C can be a set of function constraints, i.e., C = {x : g^(i)(x) ≤ 0, ∀i}.
- Sometimes we single out the equality constraints: C = {x : g^(i)(x) ≤ 0, h^(j)(x) = 0, ∀i, j}. Each equality constraint can be written as two inequality constraints.

Minimums and Optimal Points

- Critical points: {x : f′(x) = 0}.
- Minima: {x : f′(x) = 0 and H(f)(x) ⪰ O}, where H(f)(x) is the Hessian matrix (containing the curvatures) of f at point x.
- Maxima: {x : f′(x) = 0 and H(f)(x) ⪯ O}.
- Plateau or saddle points: {x : f′(x) = 0 and H(f)(x) = O or indefinite}.
- y* = min_{x∈C} f(x) ∈ R is called the global minimum. Global minima vs. local minima.
- x* = argmin_{x∈C} f(x) is called an optimal point.

Convex Optimization Problems

An optimization problem is convex iff
- f is convex by having a "convex hull" surface, i.e., H(f)(x) ⪰ O, ∀x, and
- the g_i(x)'s are convex and the h_j(x)'s are affine.

Convex problems are "easier" since
- local minima are necessarily global minima, and
- there are no saddle points,
so we can get the global minimum by solving f′(x) = 0.

Analytical Solutions vs. Numerical Solutions I

Consider the problem:

    argmin_x (1/2)(‖Ax − b‖² + λ‖x‖²)

Analytical solutions?
- The cost function f(x) = (1/2) x^T (A^T A + λI) x − b^T Ax + (1/2)‖b‖² is convex.
- Solving f′(x) = x^T (A^T A + λI) − b^T A = 0, we have

    x* = (A^T A + λI)^{−1} A^T b

Analytical Solutions vs. Numerical Solutions II

Problem (A ∈ R^{n×d}, b ∈ R^n, λ ∈ R):

    argmin_{x∈R^d} (1/2)(‖Ax − b‖² + λ‖x‖²)

Analytical solution: x* = (A^T A + λI)^{−1} A^T b.

- In practice, we may not be able to solve f′(x) = 0 analytically and get x in a closed form, e.g., when λ = 0 and n < d.
- Even if we can, the computation cost may be too high; e.g., inverting A^T A + λI ∈ R^{d×d} takes O(d³) time.
- Numerical methods: since numerical errors are inevitable anyway, why not just obtain an approximation of x*?
  - Start from x^(0) and iteratively calculate x^(1), x^(2), … such that f(x^(1)) ≥ f(x^(2)) ≥ ….
  - This usually requires much less time to reach a good enough x^(t) ≈ x*.
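The closed form above is easy to check numerically. The sketch below is a hypothetical d = 2 instance in plain Python (the toy A, b, and λ are my own choices, not from the slides): it builds A^T A + λI and A^T b, inverts the 2×2 matrix explicitly, and verifies that the gradient (A^T A + λI)x* − A^T b vanishes at the solution.

```python
# Closed-form ridge solution x* = (A^T A + λI)^{-1} A^T b for d = 2.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy A, shape n x 2
b = [1.0, 2.0, 3.0]                         # toy b, length n
lam = 0.1

# Accumulate M = A^T A + λI and v = A^T b.
M = [[lam, 0.0], [0.0, lam]]
v = [0.0, 0.0]
for (a1, a2), bi in zip(A, b):
    M[0][0] += a1 * a1; M[0][1] += a1 * a2
    M[1][0] += a2 * a1; M[1][1] += a2 * a2
    v[0] += a1 * bi;    v[1] += a2 * bi

# Invert the 2x2 matrix M explicitly and apply it to v.
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
x = [( M[1][1] * v[0] - M[0][1] * v[1]) / det,
     (-M[1][0] * v[0] + M[0][0] * v[1]) / det]

# The gradient (A^T A + λI)x - A^T b should vanish at the optimum.
grad = [M[0][0] * x[0] + M[0][1] * x[1] - v[0],
        M[1][0] * x[0] + M[1][1] * x[1] - v[1]]
print(x, grad)
```

For realistic d one would solve the linear system with a numerical library rather than invert the matrix, which is exactly the O(d³) cost the slide warns about.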

Outline

1. Numerical Computation
2. Optimization Problems
3. Unconstrained Optimization
   - Gradient Descent
   - Newton's Method
4. Optimization in ML: Stochastic Gradient Descent
   - Perceptron
   - Adaline
   - Stochastic Gradient Descent
5. Constrained Optimization
6. Optimization in ML: Regularization
   - Linear Regression
   - Polynomial Regression
   - Generalizability & Regularization
7. Duality*

Unconstrained Optimization

Problem:

    min_{x∈R^d} f(x),

where f : R^d → R is not necessarily convex.

General Descent Algorithm

Input: x^(0) ∈ R^d, an initial guess
repeat
    Determine a descent direction d^(t) ∈ R^d;
    Line search: choose a step size, or learning rate, η^(t) > 0 such that f(x^(t) + η^(t) d^(t)) is minimal along the ray x^(t) + η d^(t), η > 0;
    Update rule: x^(t+1) ← x^(t) + η^(t) d^(t);
until the convergence criterion is satisfied;

- Convergence criteria: ‖x^(t+1) − x^(t)‖ ≤ ε, ‖∇f(x^(t+1))‖ ≤ ε, etc.
- The line-search step can be skipped by letting η^(t) be a small constant.


Gradient Descent I

By Taylor's theorem, we can approximate f locally at point x^(t) using a linear function f̃, i.e.,

    f(x) ≈ f̃(x; x^(t)) = f(x^(t)) + ∇f(x^(t))^T (x − x^(t))

for x close enough to x^(t).
- This implies that if we pick a close x^(t+1) that decreases f̃, we are likely to decrease f as well.
- We can pick x^(t+1) = x^(t) − η∇f(x^(t)) for some small η > 0, since

    f̃(x^(t+1)) = f(x^(t)) − η‖∇f(x^(t))‖² ≤ f̃(x^(t))

Gradient Descent II

Input: x^(0) ∈ R^d, an initial guess; a small η > 0
repeat
    x^(t+1) ← x^(t) − η∇f(x^(t));
until the convergence criterion is satisfied;
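The loop above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy quadratic of my own choosing, with the line search skipped in favor of a small constant η, as the earlier slide permits:

```python
# Gradient descent with a constant learning rate on the toy function
# f(x) = (x1 - 1)^2 + 2*(x2 + 2)^2, whose minimum is at (1, -2).
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]

x = [5.0, 5.0]   # x^(0), an arbitrary initial guess
eta = 0.1        # small constant step size (no line search)
for _ in range(200):
    g = grad_f(x)
    x = [x[0] - eta * g[0], x[1] - eta * g[1]]
print(x)
```

A production loop would also test a convergence criterion such as ‖∇f(x^(t))‖ ≤ ε instead of running a fixed number of iterations.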

Is Negative Gradient a Good Direction? I

Update rule: x^(t+1) ← x^(t) − η∇f(x^(t))
- Yes, as ∇f(x^(t)) ∈ R^d is the steepest ascent direction of f at point x^(t), and −∇f(x^(t)) ∈ R^d is the steepest descent direction.
- But why?

Is Negative Gradient a Good Direction? II

Consider the slope of f in a given direction u at point x^(t).
- This is the directional derivative of f, i.e., the derivative of the function f(x^(t) + εu) with respect to ε, evaluated at ε = 0.
- By the chain rule, we have ∂/∂ε f(x^(t) + εu) = ∇f(x^(t) + εu)^T u, which equals ∇f(x^(t))^T u when ε = 0.

Theorem (Chain Rule)
Let g : R → R^d and f : R^d → R. Then

    (f ∘ g)′(x) = f′(g(x)) g′(x) = ∇f(g(x))^T [g′_1(x), …, g′_d(x)]^T.
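To make the identity concrete, here is a small numeric check (toy function, point, and direction are my own choices): the chain-rule value ∇f(x)^T u should match a finite-difference estimate of the derivative of ε ↦ f(x + εu) at ε = 0.

```python
import math

# Directional derivative of f(x) = x1^2 + 3*x1*x2 at x = (1, 2)
# along the unit direction u = (1, 1)/sqrt(2), computed two ways.
def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

x = [1.0, 2.0]
g = [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]    # ∇f(x) = (2*x1 + 3*x2, 3*x1)
u = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]

analytic = g[0] * u[0] + g[1] * u[1]         # chain rule: ∇f(x)^T u
eps = 1e-6                                   # finite-difference step
numeric = (f([x[0] + eps * u[0], x[1] + eps * u[1]]) - f(x)) / eps
print(analytic, numeric)
```

The two values agree up to the finite-difference error, which is what the chain-rule theorem predicts.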

Is Negative Gradient a Good Direction? III

To find the direction that decreases f fastest at x^(t), we solve the problem:

    argmin_{u, ‖u‖=1} ∇f(x^(t))^T u = argmin_{u, ‖u‖=1} ‖∇f(x^(t))‖ ‖u‖ cos θ,

where θ is the angle between u and ∇f(x^(t)).
- This amounts to solving argmin_u cos θ.
- So u* = −∇f(x^(t)) / ‖∇f(x^(t))‖, i.e., the steepest descent direction of f at point x^(t) points along −∇f(x^(t)).

How to Set Learning Rate η? I

- Too small an η results in slow descent and many iterations.
- Too large an η may overshoot the optimal point along the gradient and go uphill.
- One way to set a better η is to leverage the curvature of f: the more curved f is at point x^(t), the smaller η should be.

How to Set Learning Rate η? II

By Taylor's theorem, we can approximate f locally at point x^(t) using a quadratic function f̃:

    f(x) ≈ f̃(x; x^(t)) = f(x^(t)) + ∇f(x^(t))^T (x − x^(t)) + (1/2)(x − x^(t))^T H(f)(x^(t)) (x − x^(t))

for x close enough to x^(t), where H(f)(x^(t)) ∈ R^{d×d} is the (symmetric) Hessian matrix of f at x^(t).

Line search at step t:

    argmin_η f̃(x^(t) − η∇f(x^(t))) = argmin_η f(x^(t)) − η ∇f(x^(t))^T ∇f(x^(t)) + (η²/2) ∇f(x^(t))^T H(f)(x^(t)) ∇f(x^(t))

If ∇f(x^(t))^T H(f)(x^(t)) ∇f(x^(t)) > 0, we can solve (∂/∂η) f̃(x^(t) − η∇f(x^(t))) = 0 and get:

    η^(t) = (∇f(x^(t))^T ∇f(x^(t))) / (∇f(x^(t))^T H(f)(x^(t)) ∇f(x^(t)))
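For a quadratic f this η^(t) is the exact line-search minimizer. A sketch with an assumed diagonal Hessian H = diag(1, 10) (my own toy setup): after the step, the new gradient is orthogonal to the previous search direction, the defining property of an exact line search.

```python
# Exact line-search step for f(x) = 0.5*(h1*x1^2 + h2*x2^2),
# whose Hessian is diag(h1, h2) and gradient is (h1*x1, h2*x2).
h = [1.0, 10.0]                              # diagonal Hessian entries
x = [10.0, 1.0]
g = [h[0] * x[0], h[1] * x[1]]               # ∇f(x)
gg = g[0] ** 2 + g[1] ** 2                   # ∇f^T ∇f
gHg = h[0] * g[0] ** 2 + h[1] * g[1] ** 2    # ∇f^T H ∇f
eta = gg / gHg                               # η^(t) from the formula above

x_new = [x[0] - eta * g[0], x[1] - eta * g[1]]
g_new = [h[0] * x_new[0], h[1] * x_new[1]]
# At the exact line-search minimum, the new gradient is orthogonal
# to the old search direction.
print(eta, g_new[0] * g[0] + g_new[1] * g[1])
```

This orthogonality is also the source of the zig-zag behavior discussed next: successive steps are forced to turn 90 degrees.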

Problems of Gradient Descent

- Gradient descent is designed to find the steepest descent direction at step x^(t); it is not aware of the conditioning of the Hessian matrix H(f)(x^(t)).
- If H(f)(x^(t)) has a large condition number, then f is curved in some directions but flat in others at x^(t). E.g., suppose f is a quadratic function whose Hessian has a large condition number:
  - A gradient descent step may overshoot the optimal point along the flat directions.
  - It "zig-zags" around a narrow valley.
- Why not take conditioning into account when picking descent directions?


Newton's Method I

By Taylor's theorem, we can approximate f locally at point x^(t) using a quadratic function f̃, i.e.,

    f(x) ≈ f̃(x; x^(t)) = f(x^(t)) + ∇f(x^(t))^T (x − x^(t)) + (1/2)(x − x^(t))^T H(f)(x^(t)) (x − x^(t))

for x close enough to x^(t).
- If f is strictly convex (i.e., H(f)(a) ≻ O, ∀a), we can find the x^(t+1) that minimizes f̃ in order to decrease f.
- Solving ∇f̃(x^(t+1); x^(t)) = 0, we have

    x^(t+1) = x^(t) − H(f)(x^(t))^{−1} ∇f(x^(t))

- H(f)(x^(t))^{−1} acts as a "corrector" to the negative gradient.

Newton's Method II

Input: x^(0) ∈ R^d, an initial guess; η > 0
repeat
    x^(t+1) ← x^(t) − η H(f)(x^(t))^{−1} ∇f(x^(t));
until the convergence criterion is satisfied;

In practice, we multiply the shift by a small η > 0 to make sure that x^(t+1) stays close to x^(t).

Newton's Method III

If f is a positive definite quadratic, then only one (full, η = 1) step is required.
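A quick check of the one-step property on an assumed positive definite quadratic (my own toy example, with a diagonal Hessian so the inverse is trivial):

```python
# One full Newton step (η = 1) on the positive definite quadratic
# f(x) = 0.5*(x1 - 3)^2 + 2*(x2 + 1)^2, with Hessian H = diag(1, 4).
x = [10.0, 10.0]                        # x^(0), an arbitrary start
g = [x[0] - 3.0, 4.0 * (x[1] + 1.0)]    # ∇f(x^(0))
h_inv = [1.0, 0.25]                     # H^{-1}, trivial for diagonal H
x = [x[0] - h_inv[0] * g[0], x[1] - h_inv[1] * g[1]]
print(x)   # lands exactly on the minimizer (3, -1)
```

Since f̃ equals f exactly for a quadratic, minimizing the Taylor model in one shot minimizes f itself.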

General Functions

Update rule: x^(t+1) ← x^(t) − η H(f)(x^(t))^{−1} ∇f(x^(t))
- What if f is not strictly convex? H(f)(x^(t)) may be ⪯ O or indefinite.
- The Levenberg–Marquardt extension:

    x^(t+1) = x^(t) − η (H(f)(x^(t)) + αI)^{−1} ∇f(x^(t))

  for some α > 0. With a large α, this degenerates into gradient descent with a small learning rate (η/α).

Input: x^(0) ∈ R^d, an initial guess; η > 0, α > 0
repeat
    x^(t+1) ← x^(t) − η (H(f)(x^(t)) + αI)^{−1} ∇f(x^(t));
until the convergence criterion is satisfied;
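A 1-D sketch of the damped update (the function, α, and starting point are my own choices): near x = 0 the second derivative of f(x) = x⁴ − 3x² is negative, so a plain Newton step from x = 0.3 would move toward the local maximum at 0, while adding α keeps f″(x) + α > 0 everywhere and the iteration descends to a true minimum at ±√(3/2).

```python
# Levenberg–Marquardt damping in 1-D on the non-convex f(x) = x^4 - 3x^2.
# Minima at x = ±sqrt(3/2); local maximum at x = 0 where f''(x) < 0.
def fp(x):   # f'(x)
    return 4.0 * x ** 3 - 6.0 * x

def fpp(x):  # f''(x); negative near x = 0
    return 12.0 * x ** 2 - 6.0

x, eta, alpha = 0.3, 1.0, 8.0   # alpha chosen so f''(x) + alpha > 0 for all x
for _ in range(100):
    x = x - eta * fp(x) / (fpp(x) + alpha)
print(x)   # converges to the minimizer sqrt(3/2)
```

In practice α is often adapted per step (increased when a step fails to decrease f, decreased when it succeeds) rather than held fixed as in this sketch.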

Gradient Descent vs. Newton's Method

[Figure: steps of gradient descent when f is Rosenbrock's banana function, compared with the steps of Newton's method, which takes only 6 steps in total.]

Problems of Newton's Method

- Computing H(f)(x^(t))^{−1} is slow: it takes O(d³) time at each step, which is much slower than the O(d) per step of gradient descent.
- The step x^(t+1) = x^(t) − η H(f)(x^(t))^{−1} ∇f(x^(t)) is imprecise due to numerical errors, since H(f)(x^(t)) may have a large condition number.
- It is attracted to saddle points (when f is not convex): the x^(t+1) solved from ∇f̃(x^(t+1); x^(t)) = 0 is merely a critical point.


Who is Afraid of Non-convexity?

- In ML, the function to optimize is usually the cost function C(w) of a model F = {f : f parametrized by w}.
- Many ML models have convex cost functions in order to take advantage of convex optimization, e.g., the perceptron, linear regression, logistic regression, SVMs, etc.
- However, in deep learning, the cost function of a neural network is typically not convex. We will discuss techniques that tackle non-convexity later.

Assumption on Cost Functions

In ML, we usually assume that the (real-valued) cost function C is Lipschitz continuous and/or has Lipschitz continuous derivatives, i.e., the rate of change of C is bounded by a Lipschitz constant K:

    |C(w^(1)) − C(w^(2))| ≤ K ‖w^(1) − w^(2)‖, ∀w^(1), w^(2)


Perceptron & Neurons

- The perceptron, proposed in the 1950s by Rosenblatt, is one of the first ML algorithms for binary classification.
- It was inspired by the McCulloch-Pitts (MCP) neuron, published in 1943:
  - Our brains consist of interconnected neurons.
  - Each neuron takes signals from other neurons as input.
  - If the accumulated signal exceeds a certain threshold, an output signal is generated.

Model

Binary classification problem:
- Training dataset: X = {(x^(i), y^(i))}_i, where x^(i) ∈ R^D and y^(i) ∈ {1, −1}.
- Output: a function f(x) = ŷ such that ŷ is close to the true label y.

Model: {f : f(x; w, b) = sign(w^T x − b)}, where sign(a) = 1 if a ≥ 0 and −1 otherwise.

For simplicity, we use the shorthand f(x; w) = sign(w^T x), where w = [−b, w_1, …, w_D]^T and x = [1, x_1, …, x_D]^T.

Iterative Training Algorithm I

1. Initialize w^(0) and the learning rate η > 0.
2. Epoch: for each example (x^(t), y^(t)), update w by

       w^(t+1) = w^(t) + η (y^(t) − ŷ^(t)) x^(t),

   where ŷ^(t) = f(x^(t); w^(t)) = sign(w^(t)T x^(t)).
3. Repeat the epoch several times (or until convergence).

Iterative Training Algorithm II

Update rule: w^(t+1) = w^(t) + η (y^(t) − ŷ^(t)) x^(t)
- If ŷ^(t) is correct, we have w^(t+1) = w^(t).
- If ŷ^(t) is incorrect, we have w^(t+1) = w^(t) + 2η y^(t) x^(t).
  - If y^(t) = 1, the updated prediction is more likely to be positive, as sign(w^(t+1)T x^(t)) = sign(w^(t)T x^(t) + c) for some c > 0.
  - If y^(t) = −1, the updated prediction is more likely to be negative.
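The full training loop can be sketched in plain Python on a hypothetical linearly separable toy set (data and η are my own choices), with the bias folded into w via a constant leading feature of 1, as in the model's shorthand:

```python
# Perceptron training on a linearly separable toy set; labels are ±1.
def sign(a):
    return 1 if a >= 0 else -1

# Each example starts with a constant 1 so the bias lives inside w.
X = [[1.0, 2.0, 1.0], [1.0, 1.0, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]
w = [0.0, 0.0, 0.0]   # w^(0)
eta = 0.5

for _ in range(20):                      # epochs
    for xi, yi in zip(X, y):
        yhat = sign(sum(wj * xj for wj, xj in zip(w, xi)))
        if yhat != yi:                   # only mistakes change w
            w = [wj + eta * (yi - yhat) * xj for wj, xj in zip(w, xi)]

preds = [sign(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
print(w, preds)
```

On separable data the perceptron convergence theorem guarantees this loop stops updating after finitely many mistakes; a fixed epoch count stands in for a convergence check here.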
