

  1. Machine Learning - MT 2016, 6. Optimisation. Varun Kanade, University of Oxford. October 26, 2016

  2. Outline
Most machine learning methods can (ultimately) be cast as optimization problems.
◮ Linear Programming
◮ Basics: Gradients, Hessians
◮ Gradient Descent
◮ Stochastic Gradient Descent
◮ Constrained Optimization
Most machine learning packages, such as scikit-learn, tensorflow, octave, torch, etc., will have optimization methods implemented. But you will have to understand the basics of optimization to use them effectively.

  3. Linear Programming
We look for solutions $x \in \mathbb{R}^n$ to the following optimization problem:
    minimize    $c^T x$
    subject to: $a_i^T x \leq b_i$, $i = 1, \ldots, m$
                $\bar{a}_i^T x = \bar{b}_i$, $i = 1, \ldots, l$
◮ No analytic solution
◮ ‘‘Efficient’’ algorithms exist

  4. Linear Model with Absolute Loss
Suppose we have data $\langle (x_i, y_i) \rangle_{i=1}^N$ and that we want to minimise the objective:
    $L(w) = \sum_{i=1}^N | x_i^T w - y_i |$
Let us introduce a variable $\zeta_i$, one for each datapoint, and consider the linear program in the $D + N$ variables $w_1, \ldots, w_D, \zeta_1, \ldots, \zeta_N$:
    minimize    $\sum_{i=1}^N \zeta_i$
    subject to: $w^T x_i - y_i \leq \zeta_i$, $i = 1, \ldots, N$
                $y_i - w^T x_i \leq \zeta_i$, $i = 1, \ldots, N$
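As a rough illustration (not part of the slides), here is a minimal sketch of this LP using NumPy and `scipy.optimize.linprog`. The function name `absolute_loss_fit` and the synthetic data are my own choices; the variables are stacked as $[w, \zeta]$ exactly as on the slide.

```python
import numpy as np
from scipy.optimize import linprog

def absolute_loss_fit(X, y):
    """Fit a linear model under absolute loss by solving the LP
    min sum_i zeta_i  s.t.  x_i^T w - y_i <= zeta_i  and  y_i - x_i^T w <= zeta_i.
    Decision variables are stacked as z = [w_1..w_D, zeta_1..zeta_N]."""
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])   # objective: sum of the zetas
    A_ub = np.block([[ X, -np.eye(N)],              #  x_i^T w - zeta_i <= y_i
                     [-X, -np.eye(N)]])             # -x_i^T w - zeta_i <= -y_i
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * D + [(0, None)] * N   # w free, zeta >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:D]

# tiny usage example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.laplace(scale=0.1, size=50)
print(absolute_loss_fit(X, y))   # should be close to w_true
```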

  5. Minimising the Lasso Objective
For the Lasso objective, i.e., a linear model with $\ell_1$-regularisation, we have
    $L_{lasso}(w) = \sum_{i=1}^N (w^T x_i - y_i)^2 + \lambda \sum_{i=1}^D |w_i|$
◮ The quadratic part of the loss function can’t be framed as linear programming
◮ Lasso regularization does not allow for closed form solutions
◮ Must resort to general optimisation methods

  6. Calculus Background: Gradients
    $z = f(w_1, w_2) = \frac{w_1^2}{a^2} + \frac{w_2^2}{b^2}$
    $\frac{\partial f}{\partial w_1} = \frac{2 w_1}{a^2}$,  $\frac{\partial f}{\partial w_2} = \frac{2 w_2}{b^2}$
    $\nabla_w f = \begin{pmatrix} \partial f / \partial w_1 \\ \partial f / \partial w_2 \end{pmatrix} = \begin{pmatrix} 2 w_1 / a^2 \\ 2 w_2 / b^2 \end{pmatrix}$
◮ Gradient vectors are orthogonal to contour curves
◮ The gradient points in the direction of steepest increase
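A quick sketch (my own, not from the slides) of this gradient in NumPy, with a finite-difference check; the values of `a` and `b` are arbitrary choices.

```python
import numpy as np

a, b = 2.0, 1.0                       # example axis scales (chosen arbitrarily)

def f(w):
    return w[0]**2 / a**2 + w[1]**2 / b**2

def grad_f(w):
    return np.array([2 * w[0] / a**2, 2 * w[1] / b**2])

# finite-difference check of the analytic gradient
w = np.array([1.0, -0.5])
eps = 1e-6
num_grad = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
print(grad_f(w), num_grad)            # the two should agree closely
```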

  7. Calculus Background: Hessians
    $z = f(w_1, w_2) = \frac{w_1^2}{a^2} + \frac{w_2^2}{b^2}$
    $\nabla_w f = \begin{pmatrix} \partial f / \partial w_1 \\ \partial f / \partial w_2 \end{pmatrix} = \begin{pmatrix} 2 w_1 / a^2 \\ 2 w_2 / b^2 \end{pmatrix}$
    $H = \begin{pmatrix} \frac{\partial^2 f}{\partial w_1^2} & \frac{\partial^2 f}{\partial w_1 \partial w_2} \\ \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2^2} \end{pmatrix} = \begin{pmatrix} 2 / a^2 & 0 \\ 0 & 2 / b^2 \end{pmatrix}$
◮ As long as all second derivatives exist (and are continuous), the Hessian $H$ is symmetric
◮ The Hessian captures the curvature of the surface

  8. Calculus Background: Chain Rule
    $z = f(w_1(\theta_1, \theta_2), w_2(\theta_1, \theta_2))$
[Figure: computation graph with $\theta_1, \theta_2$ feeding into $w_1$ and $w_2$, which feed into $f = z$.]
    $\frac{\partial f}{\partial \theta_1} = \frac{\partial f}{\partial w_1} \cdot \frac{\partial w_1}{\partial \theta_1} + \frac{\partial f}{\partial w_2} \cdot \frac{\partial w_2}{\partial \theta_1}$
We will use this a lot when we study neural networks and backpropagation.
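A small worked example (mine, not from the slides): the inner functions `w1`, `w2` and the outer `f` below are hypothetical, chosen only to illustrate the chain rule, and the result is checked numerically.

```python
import numpy as np

# hypothetical inner functions and outer f, chosen only to illustrate the rule
def w1(t1, t2): return t1 * t2
def w2(t1, t2): return t1 + t2
def f(a, b):    return a**2 + np.sin(b)

def df_dtheta1(t1, t2):
    # chain rule: df/dtheta1 = df/dw1 * dw1/dtheta1 + df/dw2 * dw2/dtheta1
    return 2 * w1(t1, t2) * t2 + np.cos(w2(t1, t2)) * 1.0

t1, t2, eps = 0.7, -1.3, 1e-6
num = (f(w1(t1 + eps, t2), w2(t1 + eps, t2)) -
       f(w1(t1 - eps, t2), w2(t1 - eps, t2))) / (2 * eps)
print(df_dtheta1(t1, t2), num)   # analytic and numerical values should match
```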

  9. General Form for Gradient and Hessian
Suppose $w \in \mathbb{R}^D$ and $f : \mathbb{R}^D \to \mathbb{R}$.
The gradient vector contains all first-order partial derivatives:
    $\nabla_w f(w) = \begin{pmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_D} \end{pmatrix}$
The Hessian matrix of $f$ contains all second-order partial derivatives:
    $H = \begin{pmatrix} \frac{\partial^2 f}{\partial w_1^2} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_D} \\ \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2^2} & \cdots & \frac{\partial^2 f}{\partial w_2 \partial w_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_D \partial w_1} & \frac{\partial^2 f}{\partial w_D \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_D^2} \end{pmatrix}$

  10. Gradient Descent Algorithm
Gradient descent is one of the simplest, yet very general, algorithms for optimization.
It is an iterative algorithm, producing a new iterate $w_{t+1}$ at each iteration:
    $w_{t+1} = w_t - \eta_t g_t = w_t - \eta_t \nabla f(w_t)$
We denote the gradient at $w_t$ by $g_t$; $\eta_t > 0$ is the learning rate or step size.
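A minimal sketch of this update loop (not from the slides); `gradient_descent` is a hypothetical helper and, for simplicity, it uses a fixed step size even though the slide allows $\eta_t$ to vary with $t$.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, num_iters=100):
    """Generic gradient descent: w_{t+1} = w_t - eta * grad(w_t).
    A fixed step size is used here for simplicity."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad(w)
    return w

# usage: minimise f(w) = w1^2/a^2 + w2^2/b^2 from the earlier slides
a, b = 2.0, 1.0
w_min = gradient_descent(lambda w: np.array([2*w[0]/a**2, 2*w[1]/b**2]),
                         w0=[3.0, -2.0])
print(w_min)   # converges towards the minimiser at the origin
```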

  11. Gradient Descent for Least Squares Regression
    $L(w) = (Xw - y)^T (Xw - y) = \sum_{i=1}^N (x_i^T w - y_i)^2$
We can compute the gradient of $L$ with respect to $w$:
    $\nabla_w L = 2 \left( X^T X w - X^T y \right)$
◮ Why would you want to use gradient descent instead of directly plugging in the formula?
◮ If $N$ and $D$ are both very large:
◮ Computational complexity of the matrix formula is $O\left(\min\{N^2 D, N D^2\}\right)$
◮ Each gradient calculation is $O(ND)$
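A sketch of gradient descent on this objective (my own illustration, with an arbitrarily chosen step size and synthetic data); the gradient is evaluated as $2 X^T (Xw - y)$ so each iteration costs $O(ND)$, as the slide notes.

```python
import numpy as np

def least_squares_gd(X, y, eta=1e-3, num_iters=1000):
    """Gradient descent on L(w) = ||Xw - y||^2 using grad = 2(X^T X w - X^T y),
    computed as 2 X^T (Xw - y) so each iteration costs O(ND)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        g = 2 * X.T @ (X @ w - y)      # O(ND) per iteration
        w -= eta * g
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.arange(1.0, 6.0) + 0.01 * rng.normal(size=200)
print(least_squares_gd(X, y))          # close to [1, 2, 3, 4, 5]
```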

  12. Choosing a Step Size
◮ Choosing a good step size is important
◮ If the step size is too large, the algorithm may never converge
◮ If the step size is too small, convergence may be very slow
◮ We may want a time-varying step size

  13. Newton’s Method (Second Order Method)
Newton’s method:
◮ uses second derivatives
◮ uses a degree-2 Taylor approximation around the current point
Gradient descent:
◮ uses only the first derivative
◮ uses a local linear approximation around the current point

  14. Newton’s Method in High Dimensions
The update depends on the gradient $g_t$ and the Hessian $H_t$ at the point $w_t$:
    $w_{t+1} = w_t - H_t^{-1} g_t$
Approximate $f$ around $w_t$ using a second-order Taylor approximation:
    $f_{quad}(w) = f(w_t) + g_t^T (w - w_t) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$
We move directly to the (unique) stationary point of $f_{quad}$. The gradient of $f_{quad}$ is given by:
    $\nabla_w f_{quad} = g_t + H_t (w - w_t)$
Setting $\nabla_w f_{quad} = 0$ to get $w_{t+1}$, we have $w_{t+1} = w_t - H_t^{-1} g_t$.
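A minimal sketch of the Newton update (not from the slides); `newton_method` is a hypothetical helper, and it solves the linear system $H_t d = g_t$ rather than forming $H_t^{-1}$ explicitly.

```python
import numpy as np

def newton_method(grad, hess, w0, num_iters=20):
    """Newton updates w_{t+1} = w_t - H_t^{-1} g_t.
    Solve H_t d = g_t rather than inverting H_t explicitly."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - np.linalg.solve(hess(w), grad(w))
    return w

# usage on the quadratic f(w) = w1^2/a^2 + w2^2/b^2: one step reaches the minimum
a, b = 2.0, 1.0
grad = lambda w: np.array([2*w[0]/a**2, 2*w[1]/b**2])
hess = lambda w: np.diag([2/a**2, 2/b**2])
print(newton_method(grad, hess, [3.0, -2.0], num_iters=1))
```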

  15. Newton’s Method Gives Stationary Points
[Figure: three surfaces, for $H$ with all positive eigenvalues (a minimum), all negative eigenvalues (a maximum), and mixed eigenvalues (a saddle point).]
The Hessian will tell you which kind of stationary point has been found.
Newton’s method can be computationally expensive in high dimensions: we need to compute and invert a Hessian at each iteration.
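As a small illustration of "the Hessian tells you which kind of stationary point was found" (my own sketch; `classify_stationary_point` is a hypothetical helper):

```python
import numpy as np

def classify_stationary_point(H):
    """Use the Hessian's eigenvalues at a stationary point to decide what was found."""
    eigvals = np.linalg.eigvalsh(H)       # H is symmetric, so eigvalsh is appropriate
    if np.all(eigvals > 0):
        return "local minimum (all eigenvalues positive)"
    if np.all(eigvals < 0):
        return "local maximum (all eigenvalues negative)"
    return "saddle point (mixed eigenvalues)"

print(classify_stationary_point(np.diag([2.0, 0.5])))    # minimum
print(classify_stationary_point(np.diag([2.0, -0.5])))   # saddle
```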

  16. Minimising the Lasso Objective
For the Lasso objective, i.e., a linear model with $\ell_1$-regularisation, we have
    $L_{lasso}(w) = \sum_{i=1}^N (w^T x_i - y_i)^2 + \lambda \sum_{i=1}^D |w_i|$
◮ The quadratic part of the loss function can’t be framed as linear programming
◮ Lasso regularization does not allow for closed form solutions
◮ Must resort to general optimisation methods
◮ We still have the problem that the objective function is not differentiable!

  17. Sub-gradient Descent
Focus on the case when $f$ is convex:
    $f(\alpha x + (1 - \alpha) y) \leq \alpha f(x) + (1 - \alpha) f(y)$  for all $x, y$ and $\alpha \in [0, 1]$
In one dimension: $f(x) \geq f(x_0) + g (x - x_0)$, where $g$ is a sub-derivative at $x_0$.
In higher dimensions: $f(x) \geq f(x_0) + g^T (x - x_0)$, where $g$ is a sub-gradient at $x_0$.
Any $g$ satisfying the above inequality is called a sub-gradient at $x_0$.

  18. Sub-gradient Descent
    $f(w) = |w_1| + |w_2| + |w_3| + |w_4|$  for $w \in \mathbb{R}^4$
What is a sub-gradient at the point $w = [2, -3, 0, 1]^T$?
    $g = \begin{pmatrix} 1 \\ -1 \\ \gamma \\ 1 \end{pmatrix}$  for any $\gamma \in [-1, 1]$
[Figure: plot of $f(x) = \max(x, 0)$.] The sub-derivative of $f(x) = \max(x, 0)$ at $x = 0$ is $[0, 1]$.
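A small sketch (not from the slides) computing one valid sub-gradient of the $\ell_1$ term; `l1_subgradient` is a hypothetical helper, and `gamma = 0` is just one admissible choice for the non-differentiable coordinates.

```python
import numpy as np

def l1_subgradient(w, gamma=0.0):
    """A sub-gradient of f(w) = sum_i |w_i|: sign(w_i) where w_i != 0,
    and any value in [-1, 1] (here the argument gamma) where w_i == 0."""
    g = np.sign(w).astype(float)
    g[w == 0] = gamma
    return g

w = np.array([2.0, -3.0, 0.0, 1.0])
print(l1_subgradient(w))        # [ 1. -1.  0.  1.], one valid choice (gamma = 0)
```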

  19. Optimization Algorithms for Machine Learning
We have data $D = \langle (x_i, y_i) \rangle_{i=1}^N$. We are minimizing the objective function:
    $L(w; D) = \frac{1}{N} \sum_{i=1}^N \ell(w; x_i, y_i) + \underbrace{\lambda R(w)}_{\text{Regularisation Term}}$
The gradient of the objective function is
    $\nabla_w L = \frac{1}{N} \sum_{i=1}^N \nabla_w \ell(w; x_i, y_i) + \lambda \nabla_w R(w)$
For Ridge Regression we have
    $L_{ridge}(w) = \frac{1}{N} \sum_{i=1}^N (w^T x_i - y_i)^2 + \lambda w^T w$
    $\nabla_w L_{ridge} = \frac{1}{N} \sum_{i=1}^N 2 (w^T x_i - y_i) x_i + 2 \lambda w$
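A vectorised sketch of the ridge objective and its gradient (my own; `ridge_objective_and_grad` is a hypothetical helper), with a finite-difference sanity check on one coordinate.

```python
import numpy as np

def ridge_objective_and_grad(w, X, y, lam):
    """Ridge objective (1/N) sum_i (w^T x_i - y_i)^2 + lam * w^T w and its gradient
    (1/N) sum_i 2 (w^T x_i - y_i) x_i + 2 lam w, written in vectorised form."""
    N = X.shape[0]
    r = X @ w - y                                   # residuals w^T x_i - y_i
    obj = (r @ r) / N + lam * (w @ w)
    grad = (2.0 / N) * (X.T @ r) + 2.0 * lam * w
    return obj, grad

# quick finite-difference sanity check of the first gradient coordinate
rng = np.random.default_rng(2)
X, y, w = rng.normal(size=(30, 4)), rng.normal(size=30), rng.normal(size=4)
obj, grad = ridge_objective_and_grad(w, X, y, lam=0.1)
eps, e0 = 1e-6, np.eye(4)[0]
num = (ridge_objective_and_grad(w + eps*e0, X, y, 0.1)[0] -
       ridge_objective_and_grad(w - eps*e0, X, y, 0.1)[0]) / (2*eps)
print(grad[0], num)                                 # should agree closely
```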

  20. Stochastic Gradient Descent
As part of the learning algorithm, we calculate the following gradient:
    $\nabla_w L = \frac{1}{N} \sum_{i=1}^N \nabla_w \ell(w; x_i, y_i) + \lambda \nabla_w R(w)$
Suppose we pick a random datapoint $(x_i, y_i)$ and evaluate $g_i = \nabla_w \ell(w; x_i, y_i)$. What is $E[g_i]$?
    $E[g_i] = \frac{1}{N} \sum_{i=1}^N \nabla_w \ell(w; x_i, y_i)$
Instead of computing the entire gradient, we can compute the gradient at just a single datapoint! In expectation, $g_i$ points in the same direction as the entire gradient (except for the regularisation term).

  21. Online Learning: Stochastic Gradient Descent
◮ Using stochastic gradient descent it is possible to learn ‘‘online’’, i.e., when we receive the data a little at a time
◮ The cost of computing the gradient in Stochastic Gradient Descent (SGD) is significantly lower than that of computing the gradient on the full dataset
◮ Learning rates should be chosen by (cross-)validation

  22. Batch/Offline Learning
    $w_{t+1} = w_t - \eta \left( \frac{1}{N} \sum_{i=1}^N \nabla_w \ell(w; x_i, y_i) + \lambda \nabla_w R(w) \right)$
Online Learning
    $w_{t+1} = w_t - \eta \left( \nabla_w \ell(w; x_i, y_i) + \lambda \nabla_w R(w) \right)$
Minibatch Online Learning
    $w_{t+1} = w_t - \eta \left( \frac{1}{b} \sum_{i=1}^b \nabla_w \ell(w; x_i, y_i) + \lambda \nabla_w R(w) \right)$
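A minimal minibatch-SGD sketch (not from the slides); `minibatch_sgd` is a hypothetical helper, and it cycles through a shuffled permutation of the data each epoch rather than sampling with replacement, which is one common choice. The usage example plugs in the ridge per-example loss gradient.

```python
import numpy as np

def minibatch_sgd(grad_single, grad_reg, w0, data, eta=0.01, batch_size=10,
                  num_epochs=20, lam=0.1, seed=0):
    """Minibatch SGD: at each step average grad_single over a batch of b datapoints,
    add the regularisation gradient, and take a step of size eta."""
    X, y = data
    N = X.shape[0]
    w = np.asarray(w0, dtype=float)
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        for idx in np.array_split(rng.permutation(N), N // batch_size):
            g = np.mean([grad_single(w, X[i], y[i]) for i in idx], axis=0)
            w = w - eta * (g + lam * grad_reg(w))
    return w

# usage with squared loss per point plus an L2 regulariser (here lam = 0)
grad_single = lambda w, x, y: 2 * (w @ x - y) * x
grad_reg = lambda w: 2 * w
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)); y = X @ np.array([1.0, 2.0, 3.0])
print(minibatch_sgd(grad_single, grad_reg, np.zeros(3), (X, y), lam=0.0))
```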

  23. Many Optimisation Techniques (Tricks)
First Order Methods / (Sub-)Gradient Methods
◮ Nesterov’s Accelerated Gradient
◮ Line-Search to Find the Step Size
◮ Momentum-based Methods
◮ AdaGrad, AdaDelta, Adam, RMSProp
Second Order / Newton / Quasi-Newton Methods
◮ Conjugate Gradient Method
◮ BFGS and L-BFGS

  24. Adagrad: Example Application for Text Data
Example text: ‘‘Heathrow: Will Boris Johnson lie down in front of the bulldozers? He was happy to lie down the side of a bus. ... On his part, Johnson has already sought to clarify the comments, telling Sky News that what he in fact said was not that he would lie down in front of the bulldozers, but that he would lie down the side. And he never actually said bulldozers, he said bus.’’
Feature matrix (labels $y$ and binary word features $x_1, \ldots, x_4$):
     y   x1  x2  x3  x4
     1    1   0   0   1
    -1    1   1   0   0
    -1    1   1   1   0
    ...
     1    1   1   0   0
     1    1   0   0   0
    -1    1   1   1   0
     1    1   1   0   0
     1    1   1   0   1
     1    1   1   0   0
Adagrad Update:
    $w_{t+1,i} \leftarrow w_{t,i} - \frac{\eta}{\sqrt{\sum_{s=1}^t g_{s,i}^2}} \, g_{t,i}$
Rare features (which are 0 in most datapoints) can be the most predictive.
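A minimal sketch of the Adagrad update (my own, not from the slides); the small `eps` added inside the square root is a standard numerical-stability detail that does not appear on the slide, and the step size and test function are arbitrary choices.

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, num_iters=100, eps=1e-8):
    """Adagrad: per-coordinate step w_{t+1,i} = w_{t,i} - eta * g_{t,i} / sqrt(sum_s g_{s,i}^2).
    A small eps is added inside the square root for numerical stability."""
    w = np.asarray(w0, dtype=float)
    sum_sq = np.zeros_like(w)        # running sum of squared gradients, per coordinate
    for _ in range(num_iters):
        g = grad(w)
        sum_sq += g**2
        w = w - eta * g / np.sqrt(sum_sq + eps)
    return w

# usage on the earlier quadratic; coordinates with historically small gradients
# (analogous to rare features) automatically get larger effective step sizes
a, b = 10.0, 1.0
print(adagrad(lambda w: np.array([2*w[0]/a**2, 2*w[1]/b**2]), [3.0, -2.0]))
```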
