
Applied Machine Learning: Gradient Descent - PowerPoint PPT Presentation

Applied Machine Learning: Gradient Descent Methods. Siamak Ravanbakhsh, COMP 551 (winter 2020). Learning objectives: basic idea of gradient descent


  1. Applied Machine Learning: Gradient Descent Methods
     Siamak Ravanbakhsh, COMP 551 (winter 2020)

  2. Learning objectives
     - Basic idea of gradient descent
       - stochastic gradient descent
       - method of momentum
       - using adaptive learning rate
       - sub-gradient
     - Application to linear regression and classification

  3. Optimization in ML
     Inference and learning of a model often involve optimization. Optimization is a huge field; on the slide, bold marks the setting considered in this class.
     - discrete (combinatorial) vs. continuous variables
     - constrained vs. unconstrained
     - for continuous optimization in ML:
       - convex vs. non-convex
       - looking for local vs. global optima?
       - analytic gradient? analytic Hessian?
       - stochastic vs. batch
       - smooth vs. non-smooth

  4. Gradient
     For a multivariate function $J(w_0, w_1)$ we use partial derivatives instead of the derivative. A partial derivative is the derivative when the other variables are held fixed:
     $$\frac{\partial}{\partial w_1} J(w_0, w_1) \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}$$
     We can estimate this numerically if needed (use a small $\epsilon$ in the formula above).
     Gradient: the vector of all partial derivatives,
     $$\nabla J(w) = \left[ \frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w) \right]^\top$$
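
The numerical estimate mentioned on this slide can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `numerical_gradient` and the test function `J` are assumptions made for the example.

```python
import numpy as np

def numerical_gradient(J, w, eps=1e-6):
    """Estimate the gradient of J at w with forward differences."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for d in range(w.size):
        w_eps = w.copy()
        w_eps[d] += eps                     # perturb one coordinate, keep the others fixed
        grad[d] = (J(w_eps) - J(w)) / eps   # difference quotient from the definition
    return grad

# Hypothetical test function: J(w) = w0^2 + 3*w0*w1,
# whose exact gradient is [2*w0 + 3*w1, 3*w0].
J = lambda w: w[0] ** 2 + 3 * w[0] * w[1]
g = numerical_gradient(J, [1.0, 2.0])
print(g)  # close to the exact gradient [8, 3]
```

A smaller `eps` reduces the approximation error until floating-point round-off starts to dominate, so values around `1e-6` are a common compromise.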

  5. Gradient descent
     An iterative algorithm for optimization:
     - starts from some $w^{\{0\}}$
     - updates using the gradient: $w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$
       - $-\nabla J$: steepest descent direction
       - $\alpha$: learning rate
       - $J$: cost function (for maximization: objective function)
     - converges to a local minimum
     Recall $\nabla J(w) = \left[ \frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w) \right]^\top$.
     image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
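
As a rough sketch of the update rule on this slide, the snippet below runs gradient descent on a simple quadratic. The cost function, the fixed number of steps, and the names `gradient_descent` and `grad_J` are assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, num_steps=100):
    """Repeat w <- w - alpha * grad_J(w) for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - alpha * grad_J(w)
    return w

# Hypothetical cost: J(w) = (w0 - 1)^2 + 2*(w1 + 2)^2, minimized at (1, -2).
grad_J = lambda w: np.array([2 * (w[0] - 1), 4 * (w[1] + 2)])
w_final = gradient_descent(grad_J, [0.0, 0.0])
print(w_final)  # approaches [1, -2]
```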

  6. Convex function
     - A convex subset of $\mathbb{R}^N$ intersects any line in at most one line segment. (Figure: a convex set and a set that is not convex.)
     - A convex function is a function for which the epigraph is a convex set.
       - epigraph: the set of all points above the graph
     - Equivalently, for all $w, w'$ and $0 < \lambda < 1$:
       $$f(\lambda w + (1 - \lambda) w') \leq \lambda f(w) + (1 - \lambda) f(w')$$
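
The inequality on this slide can be probed numerically. The sketch below samples random pairs of points and reports whether any of them violates the inequality. The helper name `violates_convexity` and the sampling interval are assumptions; a sampled check can only disprove convexity, it cannot prove it.

```python
import numpy as np

def violates_convexity(f, low, high, trials=10000, seed=0):
    """Return True if some sampled (w, w', lambda) breaks the convexity inequality."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(low, high, trials)
    w2 = rng.uniform(low, high, trials)
    lam = rng.uniform(0, 1, trials)
    lhs = f(lam * w + (1 - lam) * w2)
    rhs = lam * f(w) + (1 - lam) * f(w2)
    return bool(np.any(lhs > rhs + 1e-9))   # small tolerance for round-off

print(violates_convexity(np.square, -5, 5))  # False: w^2 is convex
print(violates_convexity(np.sin, -5, 5))     # True: sin is not convex on [-5, 5]
```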

  7. Convex function
     Convex functions are easier to minimize:
     - critical points are global minima
     - gradient descent can find them: $w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$
     Non-convex: gradient descent may find a local optimum.
     A concave function is the negative of a convex function (easy to maximize).
     (Figure: $J(w)$ against $w$ for a convex and a non-convex cost.)
     image: https://www.willamette.edu/~gorr/classes/cs449/momrate.html
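
To illustrate the non-convex case, the sketch below runs gradient descent from two starting points on a one-dimensional cost with two minima. The cost $J(w) = w^4 - 3w^2 + w$ is a hypothetical example chosen for this illustration, not one from the lecture.

```python
def J(w):
    # Hypothetical non-convex cost with two local minima
    return w ** 4 - 3 * w ** 2 + w

def grad_J(w):
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w, alpha=0.01, num_steps=1000):
    for _ in range(num_steps):
        w = w - alpha * grad_J(w)
    return w

w_left = gradient_descent(-2.0)    # ends near the global minimum (w is about -1.30)
w_right = gradient_descent(2.0)    # ends near a local minimum (w is about 1.13)
print(w_left, J(w_left))
print(w_right, J(w_right))
```

Both runs stop at a point where the gradient vanishes, but only the first reaches the lower cost, so the result depends on the initialization.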

  8. Recognizing convex functions
     - A linear function is convex: $w^\top x$
     - A function is convex if its second derivative is non-negative everywhere: $\frac{d^2}{dx^2} f \geq 0$
       - examples: $x^{2d}$, $e^x$, $-\log(x)$, $-\sqrt{x}$
     - A sum of convex functions is convex
       - example: $||WX - Y||_2^2 + \lambda ||w||_2^2$
     - A maximum of convex functions is convex
       - example: $f(y) = \max_{x \in [1,5]} (\ldots)$, where the maximized expression in $x$ and $y$ did not survive extraction; note this is not convex in $x$
     - A composition of convex functions is generally not convex
       - example: $(-\log(x))^2$
       - however, if $f, g$ are convex and $g$ is non-decreasing, then $g(f(x))$ is convex
       - example: $e^{f(x)}$ for convex $f$
     Winter 2020 | Applied Machine Learning (COMP551)
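
The composition example on this slide can be checked with the second-derivative test. The short derivation below is a worked illustration, writing $h = g \circ f$ with $f(x) = -\log(x)$ and $g(u) = u^2$ (the names $f$, $g$, $h$ are chosen here for the example).

```latex
\begin{align*}
h(x)   &= g(f(x)) = (-\log x)^2 = (\log x)^2, \qquad x > 0 \\
h'(x)  &= \frac{2 \log x}{x} \\
h''(x) &= \frac{2 (1 - \log x)}{x^2}
\end{align*}
% h''(x) < 0 whenever \log x > 1, i.e. for x > e, so h is not convex on (0, \infty),
% even though f and g are each convex. The rule for compositions does not apply
% because g(u) = u^2 is decreasing for u < 0, and f takes negative values for x > 1.
```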

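  9. Gradient for linear and logistic regression
     In both cases: $\nabla J(w) = X^\top (\hat{y} - y)$, where $X^\top$ is $D \times N$ and $\hat{y} - y$ is $N \times 1$.
     - linear regression: $\hat{y} = Xw$ (with $X$ of size $N \times D$ and $w$ of size $D \times 1$)
     - logistic regression: $\hat{y} = \sigma(Xw)$

         import numpy as np

         def logistic(z):
             # sigmoid; the slides call this helper without showing it, so this definition is assumed
             return 1. / (1. + np.exp(-z))

         def gradient(X, y, w):
             N, D = X.shape
             yh = logistic(np.dot(X, w))        # predictions, shape N
             grad = np.dot(X.T, yh - y) / N     # divided by N: gradient of the average cost
             return grad

     Time complexity: $O(ND)$ (two matrix multiplications), compared to $O(ND^2 + D^3)$ for the direct solution for linear regression. Gradient descent can be much faster for large $D$.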

  10. Gradient Descent
      Implementing gradient descent is easy!

          import numpy as np

          def GradientDescent(X,         # N x D
                              y,         # N
                              lr=.01,    # learning rate
                              eps=1e-2,  # termination condition
                              ):
              N, D = X.shape
              w = np.zeros(D)
              g = np.inf
              while np.linalg.norm(g) > eps:
                  g = gradient(X, y, w)  # code on the previous slide
                  w = w - lr*g
              return w

      Some termination conditions:
      - some max #iterations
      - small gradient
      - a small change in the objective
      - increasing error on validation set: early stopping (one way to avoid overfitting)
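
The first three termination conditions listed on this slide can be combined in one loop. The sketch below does this for linear regression; the names `cost`, `gradient_descent`, `tol` and `max_iters`, and the noise-free toy data, are assumptions made for the example. Early stopping on a validation set is left out because it needs a held-out split.

```python
import numpy as np

def cost(X, y, w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def gradient(X, y, w):
    return X.T @ (X @ w - y) / X.shape[0]

def gradient_descent(X, y, lr=0.01, eps=1e-2, tol=1e-10, max_iters=10000):
    """Return (w, number of updates, reason for stopping)."""
    w = np.zeros(X.shape[1])
    prev = cost(X, y, w)
    for t in range(max_iters):                 # condition 1: max #iterations
        g = gradient(X, y, w)
        if np.linalg.norm(g) <= eps:           # condition 2: small gradient
            return w, t, "small gradient"
        w = w - lr * g
        cur = cost(X, y, w)
        if abs(prev - cur) <= tol:             # condition 3: small change in the objective
            return w, t + 1, "small change in objective"
        prev = cur
    return w, max_iters, "max iterations"

# Noise-free toy data with true weight -3 (an assumption for this sketch)
X = np.linspace(1, 10, 20)[:, None]
y = X @ np.array([-3.0])
w, steps, reason = gradient_descent(X, y)
print(w, steps, reason)
```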

  11. Example: GD for Linear Regression
      Applying this to fit toy data. `GradientDescent` is the same function as on slide 10; the gradient for linear regression is:

          def gradient(X, y, w):
              N, D = X.shape
              yh = np.dot(X, w)                  # linear predictions
              grad = np.dot(X.T, yh - y) / N     # gradient of the average cost
              return grad

  12. Example: GD for Linear Regression
      Applying this to fit toy data: a single feature (intercept is zero), with data points $(x^{(n)}, -3 x^{(n)} + \text{noise})$.

          # D = 1
          N = 20
          X = np.linspace(1, 10, N)[:, None]
          y_truth = np.dot(X, np.array([-3.]))
          y = y_truth + 10*np.random.randn(N)

      Using the direct solution method: $w = (X^\top X)^{-1} X^\top y \approx -3.2$
      (Plot: the lines $y = wx$ and $y = -3x$.)
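
The direct solution quoted on this slide can be computed as below. This is a sketch under assumptions: the random seed is arbitrary, so the estimate will differ from the $-3.2$ reported for the lecture's own noise draw, and `np.linalg.solve` is used on the normal equations instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, not from the slides
N = 20
X = np.linspace(1, 10, N)[:, None]        # single feature, no intercept
y = X @ np.array([-3.0]) + 10 * rng.standard_normal(N)

# Normal equations: solve (X^T X) w = X^T y
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_direct, w_lstsq)
```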

  13. Example: GD for Linear Regression
      After 22 iterations of gradient descent with $w^{\{t+1\}} \leftarrow w^{\{t\}} - .01 \nabla J(w^{\{t\}})$: starting from $w^{\{0\}} = 0$, we reach $w^{\{22\}} \approx -3.2$.
      (Figure panels: data space, showing $y = wx$; cost function, showing $J(w)$ against $w$.)

  14. Learning rate
      The learning rate $\alpha$ has a significant effect on GD:
      - too small: may take a long time to converge
      - too large: it overshoots
      (Figure panels: $\alpha = .01$ and $\alpha = .05$, each showing $J(w)$ against $w$.)
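
The effect of the learning rate can be seen on a one-dimensional quadratic. The sketch below uses the hypothetical cost $J(w) = w^2$ and three illustrative learning rates; these values are assumptions for the example and are not the ones plotted on the slide.

```python
def run_gd(alpha, w0=1.0, num_steps=20):
    """Gradient descent on J(w) = w^2; returns the visited values of w."""
    w = w0
    path = [w]
    for _ in range(num_steps):
        w = w - alpha * 2 * w      # the gradient of w^2 is 2w
        path.append(w)
    return path

slow = run_gd(0.01)    # too small: still far from 0 after 20 steps
good = run_gd(0.4)     # reasonable: close to 0 after a few steps
large = run_gd(1.1)    # too large: overshoots, flips sign, and grows
print(slow[-1], good[-1], large[-1])
```

For this cost each update multiplies $w$ by $1 - 2\alpha$, so the iterates shrink only when $0 < \alpha < 1$ and change sign at every step once $\alpha > 0.5$.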

  15. GD for Logistic Regression
      Example: logistic regression for the Iris dataset (D=2, lr=.01). `GradientDescent` is the same function as on slide 10; the gradient for logistic regression is:

          def gradient(X, y, w):
              yh = logistic(np.dot(X, w))    # predicted probabilities
              grad = np.dot(X.T, yh - y)     # here the gradient is not divided by N
              return grad

  16. Stochastic Gradient Descent
      We can write the cost function as an average over instances:
      $$J(w) = \frac{1}{N} \sum_{n=1}^{N} J_n(w)$$
      where $J_n(w)$ is the cost for a single data point, e.g. for linear regression
      $$J_n(w) = \frac{1}{2} (w^\top x^{(n)} - y^{(n)})^2$$
      The same is true for the partial derivatives:
      $$\frac{\partial}{\partial w_j} J(w) = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial w_j} J_n(w)$$
      Therefore $\nabla J(w) = \mathbb{E}[\nabla J_n(w)]$.
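
The identity on this slide can be verified numerically for linear regression: the average of the per-instance gradients equals the gradient of the average cost. The random data and the variable names below are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
w = rng.standard_normal(D)

# Gradient of J_n(w) = 0.5 * (w^T x_n - y_n)^2 for every n, one row per instance
per_instance = (X @ w - y)[:, None] * X

# Gradient of the average cost J(w) = (1/N) * sum_n J_n(w)
full = X.T @ (X @ w - y) / N

print(np.allclose(per_instance.mean(axis=0), full))  # True
```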

  17. Stochastic Gradient Descent
      Idea: use stochastic approximations $\nabla J_n(w)$ in gradient descent.
      Batch gradient update: $w \leftarrow w - \alpha \nabla J(w)$
      - with a small learning rate: guaranteed improvement at each step
      (Figure: contour plot of the cost function with batch gradient updates; axes $w_0$ and $w_1$.)
      image: https://jaykanidan.wordpress.com

  18. Stochastic Gradient Descent
      Idea: use stochastic approximations $\nabla J_n(w)$ in gradient descent.
      Using the stochastic gradient: $w \leftarrow w - \alpha \nabla J_n(w)$
      - the steps are "on average" in the right direction
      - each step uses the gradient of a different cost $J_n(w)$
      - each update is (1/N) of the cost of a batch gradient update
      - e.g., for linear regression each update is $O(D)$: $\nabla J_n(w) = (w^\top x^{(n)} - y^{(n)}) x^{(n)}$
      (Figure: contour plot of the cost function with stochastic gradient updates; axes $w_0$ and $w_1$.)
      image: https://jaykanidan.wordpress.com
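
A minimal sketch of the stochastic update for linear regression follows, using the single-instance gradient given on this slide. The function name, the fixed step count, the learning rate, and the noise-free toy data are all assumptions for the example. With noise-free data a constant learning rate is enough to converge; with noisy data the iterates keep fluctuating around the optimum.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.005, num_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_steps):
        n = rng.integers(N)                 # pick one instance uniformly at random
        g = (X[n] @ w - y[n]) * X[n]        # gradient of J_n, O(D) work per update
        w = w - lr * g
    return w

# Noise-free toy data with true weight -3 (an assumption for this sketch)
X = np.linspace(1, 10, 20)[:, None]
y = X @ np.array([-3.0])
w_sgd = sgd_linear_regression(X, y)
print(w_sgd)  # approaches [-3.]
```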

  19. Example: SGD for logistic regression
      Setting 1: using the batch gradient. Logistic regression for the Iris dataset (D=2, $\alpha = .1$), starting from $w = (0, 0)$; the plot shows the result after 8000 iterations. `GradientDescent` is the same function as on slide 10; the gradient is:

          def gradient(X, y, w):
              N, D = X.shape
              yh = logistic(np.dot(X, w))        # predicted probabilities
              grad = np.dot(X.T, yh - y) / N     # gradient of the average cost
              return grad

  20. Example: SGD for logistic regression
      Setting 2: using the stochastic gradient. Logistic regression for the Iris dataset (D=2, $\alpha = .1$), starting from $w^{\{t=0\}} = (0, 0)$.

          def StochasticGradientDescent(
                  X,         # N x D
                  y,         # N
                  lr=.01,    # learning rate
                  eps=1e-2,  # termination condition
                  ):
              N, D = X.shape
              w = np.zeros(D)
              g = np.inf
              while np.linalg.norm(g) > eps:
                  n = np.random.randint(N)             # pick one instance at random
                  g = gradient(X[[n],:], y[[n]], w)    # gradient on that single instance
                  w = w - lr*g
              return w

          def gradient(X, y, w):
              N, D = X.shape                     # N is 1 when called with a single instance
              yh = logistic(np.dot(X, w))
              grad = np.dot(X.T, yh - y) / N
              return grad

  21. Convergence of SGD
      Stochastic gradients are not zero at the optimum. How do we guarantee convergence?
      Schedule a smaller learning rate over time. Robbins-Monro: the sequence we use should satisfy
      - $\sum_{t=0}^{\infty} \alpha^{\{t\}} = \infty$; otherwise, for large $||w^{\{0\}} - w^*||$ we can't reach the minimum
      - $\sum_{t=0}^{\infty} (\alpha^{\{t\}})^2 < \infty$; the steps should go to zero
      Examples: $\alpha^{\{t\}} = \frac{10}{t}$, $\alpha^{\{t\}} = t^{-.51}$
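
A small illustration of why a decaying schedule helps: the sketch below uses SGD to estimate the mean of noisy samples, with the hypothetical per-sample cost $J_t(w) = \frac{1}{2}(w - x_t)^2$. The data, the schedules, and the function name `sgd_mean` are assumptions for the example. With $\alpha^{\{t\}} = 1/t$, which satisfies both conditions, the iterate equals the running average of the samples; with a constant learning rate the sum of squares diverges and the iterate keeps fluctuating.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 2.0 + rng.standard_normal(5000)   # noisy observations around 2.0

def sgd_mean(samples, schedule):
    w = 0.0
    for t, x in enumerate(samples, start=1):
        w = w - schedule(t) * (w - x)       # the gradient of 0.5 * (w - x)^2 is (w - x)
    return w

w_decay = sgd_mean(samples, lambda t: 1.0 / t)   # decaying schedule
w_const = sgd_mean(samples, lambda t: 0.5)       # constant learning rate
print(w_decay, w_const, samples.mean())
```

The slide's other example, $t^{-.51}$, can be tried by passing `lambda t: t ** -0.51` as the schedule.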
