

  1. Applied Machine Learning: Gradient Descent Methods. Siamak Ravanbakhsh, COMP 551 (Fall 2020)

  2. Learning objectives
     - basic idea of gradient descent
     - stochastic gradient descent
     - method of momentum
     - using an adaptive learning rate
     - sub-gradient
     - application to linear regression and classification

  3. Optimization in ML
     Inference and learning of a model often involve optimization. Optimization is a huge field; this class considers continuous variables and unconstrained problems:
     - discrete (combinatorial) vs. continuous variables
     - constrained vs. unconstrained
     For continuous optimization in ML:
     - convex vs. non-convex
     - looking for local vs. global optima?
     - analytic gradient? analytic Hessian?
     - stochastic vs. batch
     - smooth vs. non-smooth

  4. Gradient
     Recall: for a multivariate function $J(w_0, w_1)$ we use partial derivatives instead of the derivative. A partial derivative is the derivative when the other variables are held fixed:
     $\frac{\partial J(w_0, w_1)}{\partial w_1} \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}$
     We can estimate this numerically if needed (use a small epsilon in the formula above).
     The gradient is the vector of all partial derivatives:
     $\nabla J(w) = \left[\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)\right]^\top$
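
     Since the slide notes that partial derivatives can be estimated numerically, here is a minimal sketch of that finite-difference estimate; the function name numerical_gradient and the default eps are illustrative, not from the slides:

         import numpy as np

         def numerical_gradient(J, w, eps=1e-6):
             # estimate each partial derivative with the finite-difference formula above
             grad = np.zeros_like(w)
             for d in range(len(w)):
                 w_step = w.copy()
                 w_step[d] += eps                       # perturb one coordinate
                 grad[d] = (J(w_step) - J(w)) / eps     # (J(w + eps*e_d) - J(w)) / eps
             return grad

         # usage: the gradient of J(w) = w_0^2 + 3*w_1 at w = (1, 2) is roughly (2, 3)
         print(numerical_gradient(lambda w: w[0]**2 + 3*w[1], np.array([1.0, 2.0])))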

  5. Gradient descent
     An iterative algorithm for optimization. It starts from some $w^{\{0\}}$ (new notation: the superscript $\{t\}$ indexes the iteration) and updates using the gradient:
     $w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$
     where $-\nabla J(w) = -\left[\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)\right]^\top$ is the steepest descent direction, $\alpha$ is the learning rate, and $J$ is the cost function (for maximization: the objective function). Gradient descent converges to a local minimum.
     image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

  6. Convex function
     A convex subset of $\mathbb{R}^N$ intersects any line in at most one line segment. (figure: a convex and a non-convex set)
     A convex function is a function whose epigraph (the set of all points above the graph) is a convex set. Equivalently, for $0 < \lambda < 1$:
     $f(\lambda w + (1 - \lambda) w') \le \lambda f(w) + (1 - \lambda) f(w')$
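
     As a quick numeric illustration of this inequality, one can check it for a known convex function; the choice $f(w) = w^2$ and the grid of $\lambda$ values below are assumptions for the example:

         import numpy as np

         f = lambda w: w**2                       # a known convex function
         w, w_prime = -1.0, 3.0
         for lam in np.linspace(0.01, 0.99, 50):
             lhs = f(lam * w + (1 - lam) * w_prime)
             rhs = lam * f(w) + (1 - lam) * f(w_prime)
             assert lhs <= rhs + 1e-12            # the chord lies above the graph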

  7. Minimum of a convex function
     Convex functions are easier to minimize: critical points are global minima, and gradient descent, $w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$, can find them. For a non-convex $J(w)$, gradient descent may only find a local optimum.
     A concave function is the negative of a convex function (and is easy to maximize).
     image: https://www.willamette.edu/~gorr/classes/cs449/momrate.html

  8. Recognizing convex functions
     - a constant function $f(x) = c$ is convex
     - a linear function $f(x) = w^\top x$ is convex
     - $f$ is convex if its second derivative is non-negative everywhere: $\frac{d^2}{dx^2} f(x) \ge 0 \;\; \forall x$
     Examples: $x^2$, $e^x$, $-\log(x)$, $-\sqrt{x}$.

  9. Recognizing convex functions
     A sum of convex functions is convex.
     Example: the sum of squared errors
     $J(w) = \|Xw - y\|_2^2 = \sum_n (w^\top x^{(n)} - y^{(n)})^2$

  10. Recognizing convex functions
     A maximum of convex functions is convex.
     Example: $f(y) = \max_{x \in [0,3]} xy = \max(0, 3y)$ is convex in $y$, as a maximum of functions that are linear (hence convex) in $y$. Note that convexity here is in $y$, not in the maximization variable $x$.

  11. Recognizing convex functions
     A composition of convex functions is generally not convex; example: $(-\log(x))^2$.
     However, if $f$ and $g$ are convex and $g$ is non-decreasing, then $g(f(x))$ is convex; example: $e^{f(x)}$ for convex $f$.

  12. Recognizing convex functions
     Is the logistic regression cost function convex in the model parameters $w$?
     $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{w^\top x^{(n)}})$
     The argument $w^\top x^{(n)}$ is linear in $w$ and the coefficients $y^{(n)}$ and $1 - y^{(n)}$ are non-negative, so by the same argument as before (checking the second derivative):
     $\frac{\partial^2}{\partial z^2} \log(1 + e^z) = \frac{e^{-z}}{(1 + e^{-z})^2} \ge 0$
     each term is convex, and $J$ is a sum of convex functions, hence convex.

  13. Gradient
     Recall that for linear and logistic regression, in both cases
     $\nabla J(w) = \sum_n x^{(n)} (\hat{y}^{(n)} - y^{(n)}) = X^\top (\hat{y} - y)$
     - linear regression: $\hat{y} = w^\top x$
     - logistic regression: $\hat{y} = \sigma(w^\top x)$

         import numpy as np

         def logistic(z):
             # the sigmoid sigma(z); helper assumed, not shown on the slide
             return 1 / (1 + np.exp(-z))

         def gradient(x, y, w):
             N, D = x.shape
             yh = logistic(np.dot(x, w))        # predictions yhat = sigma(Xw)
             grad = np.dot(x.T, yh - y) / N     # X^T (yhat - y), averaged over N
             return grad

     Time complexity: $O(ND)$ (two matrix multiplications). Compared to the $O(ND^2 + D^3)$ direct solution for linear regression, gradient descent can be much faster for large $D$.

  14. Gradient Descent
     Implementing gradient descent is easy!

         def GradientDescent(x,         # N x D
                             y,         # N
                             lr=.01,    # learning rate
                             eps=1e-2,  # termination condition
                             ):
             N, D = x.shape
             w = np.zeros(D)
             g = np.inf
             while np.linalg.norm(g) > eps:
                 g = gradient(x, y, w)  # code on the previous page
                 w = w - lr*g
             return w

     Some termination conditions:
     - some max #iterations
     - small gradient
     - a small change in the objective
     - increasing error on a validation set: early stopping (one way to avoid overfitting)
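
     As a sketch of running this loop on the linear regression example of the next slide: since the slide's gradient() is the logistic version, a squared-error gradient (here called lin_gradient, an assumed name) is swapped in, and the synthetic data below is illustrative:

         import numpy as np

         def lin_gradient(x, y, w):
             # gradient of the mean squared error: X^T (Xw - y) / N
             return np.dot(x.T, np.dot(x, w) - y) / len(y)

         np.random.seed(0)
         x = np.random.randn(100, 1)                    # N=100, D=1
         y = -3 * x[:, 0] + 0.5 * np.random.randn(100)  # y = -3x + noise

         w = np.zeros(1)
         for _ in range(200):                  # a fixed iteration budget instead of eps
             w = w - 0.1 * lin_gradient(x, y, w)
         print(w)                              # close to the direct solution, roughly -3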

  15. GD for linear regression
     Example: data points $(x^{(n)}, -3x^{(n)} + \text{noise})$ and model $y = wx$ (true line $y = -3x$).
     Using the direct solution method: $w = (X^\top X)^{-1} X^\top y \approx -3.2$

  16. GD for linear regression
     Example (continued): after 22 steps of $w^{\{t+1\}} \leftarrow w^{\{t\}} - .01 \nabla J(w^{\{t\}})$, starting from $w^{\{0\}} = 0$, we reach $w^{\{22\}} \approx -3.2$.
     (figure: the fit $y = wx$ in data space, and the cost function $J(w)$)

  17. Learning rate
     The learning rate $\alpha$ has a significant effect on GD:
     - too small: it may take a long time to converge
     - too large: it overshoots
     (figure: $J(w)$ vs. $w$ for $\alpha = .01$ and $\alpha = .05$)

  18. Learning rate
     The learning rate $\alpha$ has a significant effect on GD:
     - too small: it may take a long time to converge
     - too large: it overshoots
     Example: linear regression, D=2, 50 gradient steps.

  19. Stochastic Gradient Descent
     We can write the cost function as an average over instances:
     $J(w) = \frac{1}{N} \sum_{n=1}^{N} J_n(w)$, where $J_n$ is the cost for a single data point.
     E.g., for linear regression: $J_n(w) = \frac{1}{2} (w^\top x^{(n)} - y^{(n)})^2$
     The same is true for the partial derivatives:
     $\frac{\partial}{\partial w_j} J(w) = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial w_j} J_n(w)$
     Therefore $\nabla J(w) = \mathbb{E}_n[\nabla J_n(w)]$ for $n$ drawn uniformly from the dataset $\mathcal{D}$.
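
     A small sanity check of this identity for the linear regression cost (the random data below is an assumed example):

         import numpy as np

         np.random.seed(0)
         x, y, w = np.random.randn(8, 3), np.random.randn(8), np.random.randn(3)

         # the batch gradient (1/N) X^T (Xw - y) ...
         batch = np.dot(x.T, np.dot(x, w) - y) / len(y)
         # ... equals the mean of the per-example gradients (w^T x_n - y_n) x_n
         per_example = np.mean([(np.dot(w, x[n]) - y[n]) * x[n] for n in range(len(y))], axis=0)
         assert np.allclose(batch, per_example)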

  20. Stochastic Gradient Descent
     Idea: use stochastic approximations $\nabla J_n(w)$ in gradient descent.
     With the batch gradient update $w \leftarrow w - \alpha \nabla J(w)$ and a small learning rate, improvement is guaranteed at each step.
     (figure: contour plot of the cost function with batch gradient updates)
     image: https://jaykanidan.wordpress.com

  21. Stochastic Gradient Descent
     Idea: use stochastic approximations $\nabla J_n(w)$ in gradient descent.
     Using the stochastic gradient, $w \leftarrow w - \alpha \nabla J_n(w)$:
     - the steps are "on average" in the right direction
     - each step uses the gradient of a different cost $J_n(w)$
     - each update costs (1/N) of a batch gradient update; e.g., for linear regression, $\nabla J_n(w) = (w^\top x^{(n)} - y^{(n)}) x^{(n)}$ is $O(D)$
     image: https://jaykanidan.wordpress.com
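
     A minimal SGD sketch for this linear regression case; the function name, the fixed iteration budget, and the uniform sampling of $n$ are assumptions:

         import numpy as np

         def SGD(x, y, lr=.01, n_iters=1000):
             N, D = x.shape
             w = np.zeros(D)
             for _ in range(n_iters):
                 n = np.random.randint(N)                      # pick one example at random
                 w = w - lr * (np.dot(w, x[n]) - y[n]) * x[n]  # O(D) update
             return w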

  22. SGD for logistic regression
     Example: logistic regression on the Iris dataset (D=2, $\alpha = .1$).
     (figure: batch gradient vs. stochastic gradient trajectories)

  23. Convergence of SGD
     Stochastic gradients are not zero even at the optimum $w^*$, so how do we guarantee convergence?
     Idea: schedule a smaller learning rate over time. The sequence of learning rates should satisfy the Robbins-Monro conditions:
     - $\sum_{t=0}^{\infty} \alpha^{\{t\}} = \infty$; otherwise, for a large $\|w^{\{0\}} - w^*\|$ we can't reach the minimum
     - $\sum_{t=0}^{\infty} (\alpha^{\{t\}})^2 < \infty$; the steps should go to zero
     Example: $\alpha^{\{t\}} = t^{-.51}$
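
     A sketch of plugging the example schedule into the SGD loop above; everything except the schedule itself is an illustrative assumption:

         import numpy as np

         def SGD_decay(x, y, n_iters=1000):
             N, D = x.shape
             w = np.zeros(D)
             for t in range(1, n_iters + 1):
                 lr = t ** -0.51     # sum of lr diverges while sum of lr^2 converges
                 n = np.random.randint(N)
                 w = w - lr * (np.dot(w, x[n]) - y[n]) * x[n]
             return w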

  24. Minibatch SGD
     Use a minibatch $B \subseteq \{1, \ldots, N\}$, a subset of the dataset, to produce gradient estimates:
     $\nabla J_B = \sum_{n \in B} \nabla J_n(w)$
     (figure: GD (full batch) vs. SGD with minibatch size 16 vs. SGD with minibatch size 1)
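
     A minibatch variant of the SGD sketch above, again for linear regression; the sampling scheme and defaults are assumptions, and the minibatch gradient is averaged here rather than summed:

         import numpy as np

         def MinibatchSGD(x, y, lr=.01, batch_size=16, n_iters=1000):
             N, D = x.shape
             w = np.zeros(D)
             for _ in range(n_iters):
                 B = np.random.choice(N, batch_size, replace=False)       # a random subset of the data
                 g = np.dot(x[B].T, np.dot(x[B], w) - y[B]) / batch_size  # averaged minibatch gradient
                 w = w - lr * g
             return w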

  25. Oscillations
     Gradient descent can oscillate a lot! Each gradient step is perpendicular to the isocontours. In SGD this is worsened by the noisy gradient estimates.

  26. Momentum
     To help with oscillations: use a running average of gradients, where more recent gradients have higher weights:
     $\Delta w^{\{t\}} \leftarrow \beta \Delta w^{\{t-1\}} + (1 - \beta) \nabla J_B(w^{\{t\}})$
     $w^{\{t\}} \leftarrow w^{\{t-1\}} - \alpha \Delta w^{\{t\}}$
     Here $(1 - \beta)$ is the weight of the most recent gradient; a momentum of $\beta = 0$ reduces to SGD, and a common value is $\beta > .9$. The weight of the older gradient from step $t$ is $(1 - \beta)\beta^{T-t}$, so this is effectively an exponential moving average:
     $\Delta w^{\{T\}} = \sum_{t=1}^{T} \beta^{T-t} (1 - \beta) \nabla J_B(w^{\{t\}})$
     There are other variations of momentum with a similar idea.
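
     A sketch of these two updates combined with the minibatch linear regression gradient used earlier; the names and defaults are illustrative:

         import numpy as np

         def MomentumSGD(x, y, lr=.5, beta=.99, batch_size=8, n_iters=1000):
             N, D = x.shape
             w, dw = np.zeros(D), np.zeros(D)
             for _ in range(n_iters):
                 B = np.random.choice(N, batch_size, replace=False)
                 g = np.dot(x[B].T, np.dot(x[B], w) - y[B]) / batch_size
                 dw = beta * dw + (1 - beta) * g    # exponential moving average of gradients
                 w = w - lr * dw
             return w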

  27. Momentum
     Example: logistic regression.
     - no momentum: $\alpha = .5$, $\beta = 0$, $|B| = 8$
     - with momentum: $\alpha = .5$, $\beta = .99$, $|B| = 8$
     See the beautiful demo at Distill: https://distill.pub/2017/momentum/

  28. Adagrad (Adaptive gradient), optional
     Use a different learning rate for each parameter $w_d$, and also make the learning rate adaptive:
     $S_d^{\{t\}} \leftarrow S_d^{\{t-1\}} + \left( \frac{\partial}{\partial w_d} J(w^{\{t-1\}}) \right)^2$
     (the sum of squares of the derivatives over all iterations so far, for each individual parameter)
     $w_d^{\{t\}} \leftarrow w_d^{\{t-1\}} - \frac{\alpha}{\sqrt{S_d^{\{t\}} + \epsilon}} \frac{\partial}{\partial w_d} J(w^{\{t-1\}})$
     The learning rate is adapted to previous updates; $\epsilon$ avoids numerical issues. This is useful when parameters are updated at different rates (e.g., when some features are often zero when using SGD).
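
     A per-parameter sketch of these updates, with the single-example linear regression gradient assumed:

         import numpy as np

         def Adagrad(x, y, lr=.1, eps=1e-8, n_iters=1000):
             N, D = x.shape
             w = np.zeros(D)
             S = np.zeros(D)                          # per-parameter sum of squared derivatives
             for _ in range(n_iters):
                 n = np.random.randint(N)
                 g = (np.dot(w, x[n]) - y[n]) * x[n]
                 S = S + g ** 2
                 w = w - lr * g / np.sqrt(S + eps)    # per-parameter adaptive step size
             return w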

  29. Adagrad (Adaptive gradient), optional
     A different learning rate for each parameter $w_d$; the learning rate is adaptive.
     (figure: Adagrad with $\alpha = .1$, $|B| = 1$, $T = 80{,}000$, $\epsilon = 1\mathrm{e}{-8}$ vs. SGD with $\alpha = .1$, $|B| = 1$, $T = 80{,}000$)
     Problem: the learning rate goes to zero too quickly.
