Applied Machine Learning
Gradient Descent Methods
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives

- basic idea of gradient descent
- stochastic gradient descent
- method of momentum
- using an adaptive learning rate
- sub-gradient
- application to linear regression and classification
Optimization in ML

Inference and learning of a model often involve optimization. Optimization is a huge field; some of the main distinctions:

- discrete (combinatorial) vs. continuous variables
- constrained vs. unconstrained
- looking for local vs. global optima?
- analytic gradient available? analytic Hessian?
- stochastic vs. batch
- smooth vs. non-smooth

For continuous optimization in ML, a further distinction is convex vs. non-convex. The setting considered in this class (marked in bold on the original slide) is continuous, unconstrained optimization.
Gradient

For a multivariate function $J(w_0, w_1)$ we use partial derivatives instead of the derivative. A partial derivative is the derivative when all other variables are held fixed:

$\frac{\partial}{\partial w_1} J(w_0, w_1) \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}$

We can estimate this numerically if needed (use a small epsilon in the formula above).

The gradient is the vector of all partial derivatives:

$\nabla J(w) = [\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)]^T$
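As a minimal sketch of the numerical estimate mentioned above (the function name, the test function, and the choice of epsilon are illustrative, not from the slides):

    import numpy as np

    def numerical_gradient(J, w, eps=1e-6):
        """Estimate the gradient of J at w by one-sided finite differences."""
        grad = np.zeros_like(w)
        for j in range(len(w)):
            w_plus = w.copy()
            w_plus[j] += eps                      # perturb one coordinate
            grad[j] = (J(w_plus) - J(w)) / eps    # the limit formula with small eps
        return grad

    # example: J(w) = w_0^2 + 3 w_1 has gradient [2 w_0, 3]
    J = lambda w: w[0]**2 + 3*w[1]
    print(numerical_gradient(J, np.array([1.0, 2.0])))   # roughly [2., 3.]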
Gradient descent

Gradient descent is an iterative algorithm for optimization:

- start from some $w^{\{0\}}$
- update using the gradient: $w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$

Here $-\nabla J(w^{\{t\}})$ is the steepest descent direction, $\alpha$ is the learning rate, and $J$ is the cost function. Gradient descent converges to a local minimum (for maximization of an objective function, ascend the gradient instead).

image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
Convex function

A convex subset of $\mathbb{R}^N$ intersects any line in at most one line segment.

A convex function is a function whose epigraph (the set of all points above the graph) is a convex set; equivalently,

$f(\lambda w + (1 - \lambda) w') \leq \lambda f(w) + (1 - \lambda) f(w') \quad \text{for } 0 < \lambda < 1$
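A quick numerical illustration of this definition (the function and test points are my own, not from the slides): for a convex function, the chord between any two points lies on or above the graph.

    import numpy as np

    f = lambda w: w**2              # a convex function
    w, w_prime = -1.0, 3.0          # two test points

    for lam in np.linspace(0.01, 0.99, 5):
        lhs = f(lam*w + (1 - lam)*w_prime)         # function value at the mixture
        rhs = lam*f(w) + (1 - lam)*f(w_prime)      # mixture of function values
        assert lhs <= rhs                          # holds for every lambda in (0,1)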
Convex function

Convex functions are easier to minimize:

- critical points are global minima
- gradient descent, $w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$, can find them

On a non-convex cost $J(w)$, gradient descent may only find a local optimum.

A concave function is the negative of a convex function (and is therefore easy to maximize).

image: https://www.willamette.edu/~gorr/classes/cs449/momrate.html
Recognizing convex functions

- a linear function $w^T x$ is convex
- a function is convex if its second derivative is non-negative everywhere, $\frac{d^2}{dx^2} f(x) \geq 0$; examples: $x^2$, $e^x$, $-\log(x)$, $-\sqrt{x}$
- a sum of convex functions is convex; example: $\|WX - Y\|_2^2 + \lambda \|w\|_2^2$
- a maximum of convex functions is convex; example: $f(y) = \max_{x \in [1,5]} x^4 y$ (note this is not convex in $x$)
- a composition of convex functions is generally not convex; example: $(-\log(x))^2$. However, if $f$ and $g$ are convex and $g$ is non-decreasing, then $g(f(x))$ is convex; example: $e^{f(x)}$ for convex $f$
Gradient for linear and logistic regression

In both cases the gradient has the same form:

$\nabla J(w) = \frac{1}{N} X^T (\hat{y} - y)$

where $X^T$ is $D \times N$ and $\hat{y} - y$ is $N \times 1$, and

- linear regression: $\hat{y} = Xw$
- logistic regression: $\hat{y} = \sigma(Xw)$

    import numpy as np
    logistic = lambda z: 1 / (1 + np.exp(-z))   # sigmoid

    def gradient(X, y, w):                      # logistic regression version
        N, D = X.shape
        yh = logistic(np.dot(X, w))
        grad = np.dot(X.T, yh - y) / N
        return grad

Time complexity: $O(ND)$ for the two matrix-vector multiplications, compared to $O(ND^2 + D^3)$ for the direct solution of linear regression; gradient descent can be much faster for large $D$.
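To sanity-check this analytic gradient, it can be compared against the numerical estimate from earlier (a sketch; the random data is illustrative, and it assumes the numerical_gradient, logistic, and gradient functions defined above):

    N, D = 10, 3
    rng = np.random.default_rng(1)
    X = rng.normal(size=(N, D))
    y = (rng.random(N) > .5).astype(float)      # binary labels
    w = rng.normal(size=D)

    # mean cross-entropy cost, whose gradient is (1/N) X^T (sigma(Xw) - y)
    J = lambda w: -np.mean(y*np.log(logistic(np.dot(X, w)))
                           + (1-y)*np.log(1 - logistic(np.dot(X, w))))

    # the analytic and numerical gradients should agree to ~1e-5
    print(np.max(np.abs(gradient(X, y, w) - numerical_gradient(J, w))))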
Gradient Descent

Implementing gradient descent is easy!

    def GradientDescent(X,          # N x D
                        y,          # N
                        lr=.01,     # learning rate
                        eps=1e-2,   # termination condition
                        ):
        N, D = X.shape
        w = np.zeros(D)
        g = np.inf
        while np.linalg.norm(g) > eps:
            g = gradient(X, y, w)   # code on the previous page
            w = w - lr*g
        return w

Some termination conditions:

- some max number of iterations
- small gradient
- a small change in the objective
- increasing error on a validation set: early stopping (one way to avoid overfitting)
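As a sketch of how the other termination conditions could be wired in (the max_iters and tol parameters and the cost argument are my additions, not from the slides):

    def GradientDescentMaxIter(X, y, cost, lr=.01, eps=1e-2,
                               max_iters=10000, tol=1e-8):
        N, D = X.shape
        w = np.zeros(D)
        prev_J = np.inf
        for t in range(max_iters):               # cap on number of iterations
            g = gradient(X, y, w)
            if np.linalg.norm(g) < eps:          # small gradient
                break
            w = w - lr*g
            J = cost(X, y, w)
            if abs(prev_J - J) < tol:            # small change in the objective
                break
            prev_J = J
        return w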
Example: GD for Linear Regression

Applying this to fit toy data, using the same GradientDescent loop with the linear-regression gradient:

    def gradient(X, y, w):
        N, D = X.shape
        yh = np.dot(X, w)
        grad = np.dot(X.T, yh - y) / N
        return grad
Example: GD for Linear Regression

Toy data: a single feature (the intercept is zero), with data points $(x^{(n)}, -3x^{(n)} + \text{noise})$:

    D = 1
    N = 20
    X = np.linspace(1, 10, N)[:, None]       # single feature
    y_truth = np.dot(X, np.array([-3.]))     # true slope is -3
    y = y_truth + 10*np.random.randn(N)

Using the direct solution method, $w = (X^T X)^{-1} X^T y \approx -3.2$. (Figure: the fitted line $y = wx$ vs. the truth $y = -3x$.)
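For reference, the direct solution quoted above can be computed in one line (a sketch; np.linalg.solve is used instead of an explicit inverse for numerical stability):

    w_direct = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
    print(w_direct)    # roughly [-3.2] on this noisy sample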
Example: GD for Linear Regression

After 22 iterations of gradient descent, $w^{\{t+1\}} \leftarrow w^{\{t\}} - .01 \nabla J(w^{\{t\}})$, starting from $w^{\{0\}} = 0$, we reach $w^{\{22\}} \approx -3.2$.

(Figure: left, the data space with the fitted line $y = wx$; right, the cost function $J(w)$ with the iterates.)
Learning rate

The learning rate $\alpha$ has a significant effect on gradient descent:

- too small: it may take a long time to converge
- too large: it overshoots

(Figure: the cost $J(w)$ over the iterates for $\alpha = .01$ vs. $\alpha = .05$.)
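A sketch of how one might observe this on the toy problem above (the iteration budget and the cost-recording helper are my own choices, not from the slides):

    def run_gd(X, y, lr, iters=50):
        N, D = X.shape
        w = np.zeros(D)
        costs = []
        for _ in range(iters):
            yh = np.dot(X, w)
            costs.append(np.mean((yh - y)**2) / 2)      # cost at each step
            w = w - lr * np.dot(X.T, yh - y) / N
        return costs

    # compare a small vs. a large learning rate on the toy data
    for lr in [.01, .05]:
        print(lr, run_gd(X, y, lr)[-1])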
GD for Logistic Regression

Example: logistic regression for the Iris dataset (D=2, lr=.01), using the same GradientDescent loop with the logistic gradient:

    def gradient(X, y, w):
        N, D = X.shape
        yh = logistic(np.dot(X, w))
        grad = np.dot(X.T, yh - y) / N
        return grad
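A sketch of how this might be run end to end (the choice of the first two features and a binary subset of Iris is my assumption about the slide's setup, and it assumes the GradientDescent and logistic definitions above):

    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    mask = iris.target < 2                    # keep two classes for binary labels
    X = iris.data[mask][:, :2]                # D = 2 features
    y = iris.target[mask].astype(float)

    w = GradientDescent(X, y, lr=.01)
    acc = np.mean((logistic(np.dot(X, w)) > .5) == y)
    print(w, acc)                             # learned weights and training accuracy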
Stochastic Gradient Descent

We can write the cost function as an average over instances:

$J(w) = \frac{1}{N} \sum_{n=1}^{N} J_n(w)$

where $J_n(w)$ is the cost for a single data point; e.g., for linear regression $J_n(w) = \frac{1}{2}(w^T x^{(n)} - y^{(n)})^2$.

The same is true for the partial derivatives:

$\frac{\partial}{\partial w_j} J(w) = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial w_j} J_n(w)$

Therefore $\nabla J(w) = \mathbb{E}_n[\nabla J_n(w)]$, where the expectation is over a uniformly random choice of $n$.
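A quick numeric check of this identity for linear regression (the data and names are illustrative):

    import numpy as np

    N, D = 20, 3
    rng = np.random.default_rng(0)
    X, y, w = rng.normal(size=(N, D)), rng.normal(size=N), rng.normal(size=D)

    batch = np.dot(X.T, np.dot(X, w) - y) / N                     # full gradient
    per_n = [(np.dot(w, X[n]) - y[n]) * X[n] for n in range(N)]   # per-example gradients
    assert np.allclose(batch, np.mean(per_n, axis=0))             # they agree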
Stochastic Gradient Descent

Idea: use stochastic approximations $\nabla J_n(w)$ in gradient descent.

The batch gradient update is $w \leftarrow w - \alpha \nabla J(w)$: with a small learning rate, improvement is guaranteed at each step. (Figure: contour plot of the cost function over $(w_0, w_1)$ with the batch-gradient path.)

image: https://jaykanidan.wordpress.com
Stochastic Gradient Descent

Idea: use stochastic approximations $\nabla J_n(w)$ in gradient descent.

Using the stochastic gradient, $w \leftarrow w - \alpha \nabla J_n(w)$:

- the steps are "on average" in the right direction
- each step uses the gradient of a different cost $J_n(w)$
- each update costs (1/N) of a batch gradient update; e.g., for linear regression $\nabla J_n(w) = (w^T x^{(n)} - y^{(n)}) x^{(n)}$ costs $O(D)$

(Figure: contour plot over $(w_0, w_1)$ with the noisier stochastic-gradient path.)

image: https://jaykanidan.wordpress.com
Example: SGD for logistic regression

Setting 1: using the batch gradient. Logistic regression for the Iris dataset (D=2, $\alpha = .1$), starting from $w^{\{0\}} = (0, 0)$ and shown after 8000 iterations, using the same GradientDescent loop and gradient function as before.
Example: SGD for logistic regression

Setting 2: using the stochastic gradient. Logistic regression for the Iris dataset (D=2, $\alpha = .1$), starting from $w^{\{0\}} = (0, 0)$:

    def StochasticGradientDescent(X,        # N x D
                                  y,        # N
                                  lr=.01,   # learning rate
                                  eps=1e-2, # termination condition
                                  ):
        N, D = X.shape
        w = np.zeros(D)
        g = np.inf
        while np.linalg.norm(g) > eps:
            n = np.random.randint(N)              # pick one instance at random
            g = gradient(X[[n],:], y[[n]], w)     # gradient on that instance only
            w = w - lr*g
        return w
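A common middle ground between the two settings is a minibatch; a sketch (the batch_size parameter is my addition, not from the slides):

    def MinibatchGradientDescent(X, y, lr=.01, eps=1e-2, batch_size=8):
        N, D = X.shape
        w = np.zeros(D)
        g = np.inf
        while np.linalg.norm(g) > eps:
            idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
            g = gradient(X[idx, :], y[idx], w)    # gradient averaged over the minibatch
            w = w - lr*g
        return w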
Convergence of SGD

Stochastic gradients are not zero at the optimum, so how do we guarantee convergence? Schedule the learning rate to shrink over time.

Robbins-Monro conditions: the sequence of learning rates we use should satisfy

$\sum_{t=0}^{\infty} \alpha^{\{t\}} = \infty$

(otherwise, for large $\|w^{\{0\}} - w^*\|$ we can't reach the minimum), and

$\sum_{t=0}^{\infty} (\alpha^{\{t\}})^2 < \infty$

(the steps should go to zero). Examples: $\alpha^{\{t\}} = \frac{10}{t}$, $\alpha^{\{t\}} = t^{-.51}$.
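A sketch of SGD with such a schedule, using $\alpha^{\{t\}} = \alpha^{\{0\}} t^{-.51}$ (the fixed iteration budget is my addition, since the stochastic gradient norm is no longer a reliable stopping test; assumes the gradient function above):

    def SGDWithSchedule(X, y, a0=.1, max_iters=10000):
        N, D = X.shape
        w = np.zeros(D)
        for t in range(1, max_iters + 1):
            lr = a0 * t**(-.51)                  # satisfies the Robbins-Monro conditions
            n = np.random.randint(N)
            w = w - lr * gradient(X[[n],:], y[[n]], w)
        return w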