Example: GD for Linear Regression

Applying gradient descent to fit toy data.

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    N, D = X.shape
    yh = np.dot(X, w)
    grad = np.dot(X.T, yh - y) / N
    return grad
Example: GD for Linear Regression

Toy data: a single feature, no intercept (the true intercept is zero).

# D = 1 (single feature), N = 20
N = 20
X = np.linspace(1, 10, N)[:, None]
y_truth = np.dot(X, np.array([-3.]))
y = y_truth + 10*np.random.randn(N)

The data points (x^{(n)}, y^{(n)}) follow y = −3x plus noise; we fit the model y = wx.

Using the direct solution method, w = (X^T X)^{-1} X^T y ≈ −3.2.
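The ≈ −3.2 estimate from the direct solution can be reproduced in a few NumPy lines. A minimal sketch; the random seed is an assumption added for reproducibility, and the exact value depends on the noise draw:

import numpy as np

np.random.seed(0)                      # assumed seed, not from the slides
N = 20
X = np.linspace(1, 10, N)[:, None]
y = np.dot(X, np.array([-3.])) + 10*np.random.randn(N)

# direct (least-squares) solution: w = (X^T X)^{-1} X^T y
w_direct = np.linalg.solve(X.T @ X, X.T @ y)
print(w_direct)                        # roughly -3; the slide's noise draw gives about -3.2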
Example: GD for Linear Regression

After 22 iterations of gradient descent, w^{t+1} ← w^{t} − .01 ∇J(w^{t}), starting from w^{0} = 0, the estimate reaches w^{22} ≈ −3.2.

(Figure: the cost function J(w) on the left; the data space with the fitted line y = w^{t} x on the right.)
Learning rate

The learning rate α has a significant effect on gradient descent:
too small: it may take a long time to converge
too large: it overshoots

(Figure: J(w) versus w for α = .01 and α = .05.)
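To see this effect concretely, one can rerun the same update with different step sizes on the toy data. A small sketch, assuming numpy as np and the toy X, y from above; the iteration cap is an added safeguard, not part of the original code:

import numpy as np

def gd_path(X, y, lr, max_iter=500, eps=1e-2):
    # same update as GradientDescent above, but records w at every step
    # and stops after max_iter iterations in case the step size is too large
    N, D = X.shape
    w = np.zeros(D)
    path = [w.copy()]
    for _ in range(max_iter):
        g = np.dot(X.T, np.dot(X, w) - y) / N
        if np.linalg.norm(g) <= eps:
            break
        w = w - lr*g
        path.append(w.copy())
    return np.array(path)

path_small = gd_path(X, y, lr=.01)   # smooth, steady progress toward the minimum
path_large = gd_path(X, y, lr=.05)   # overshoots and oscillates, converging slowly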
GD for Logistic Regression

Example: logistic regression on the Iris dataset (D=2, lr=.01).

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y)
    return grad
Stochastic Gradient Descent

We can write the cost function as an average over instances:

J(w) = (1/N) ∑_{n=1}^{N} J_n(w)

where J_n is the cost for a single data point, e.g. for linear regression
J_n(w) = ½ (w^T x^{(n)} − y^{(n)})²

The same is true for the partial derivatives:

∂/∂w_j J(w) = (1/N) ∑_{n=1}^{N} ∂/∂w_j J_n(w)

Therefore ∇J(w) = E_n[∇J_n(w)], where n is drawn uniformly from the training set.
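As a quick numerical sanity check of this identity, a sketch using the linear-regression J_n above and the toy X, y (numpy assumed as np):

import numpy as np

def grad_n(x_n, y_n, w):
    # per-example gradient for linear regression: (w^T x_n - y_n) x_n
    return (np.dot(x_n, w) - y_n) * x_n

w = np.array([1.0])
per_example = np.array([grad_n(X[n], y[n], w) for n in range(len(y))])
batch = np.dot(X.T, np.dot(X, w) - y) / len(y)

print(np.allclose(per_example.mean(axis=0), batch))   # True: the batch gradient is the average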
Stochastic Gradient Descent

Idea: use the stochastic approximation ∇J_n(w) in place of ∇J(w) in gradient descent.

Batch gradient update: w ← w − α ∇J(w). With a small learning rate, each step gives a guaranteed improvement.

(Figure: contour plot of the cost function over (w_0, w_1) with the batch gradient path. Image: https://jaykanidan.wordpress.com)
Stochastic Gradient Descent

Using the stochastic gradient: w ← w − α ∇J_n(w)
the steps are "on average" in the right direction
each step uses the gradient of a different cost J_n(w)
each update costs roughly (1/N) of a batch gradient update
e.g., for linear regression, ∇J_n(w) = (w^T x^{(n)} − y^{(n)}) x^{(n)}, which is O(D) per step (see the sketch below)

(Figure: contour plot of the cost over (w_0, w_1) with a noisy SGD path. Image: https://jaykanidan.wordpress.com)
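A minimal sketch of plain SGD for linear regression, one example per step with a fixed learning rate; the fixed iteration budget T is an assumption, since without it the noisy gradient never falls below a small threshold:

import numpy as np

def sgd_linreg(X, y, lr=.01, T=1000):
    N, D = X.shape
    w = np.zeros(D)
    for t in range(T):
        n = np.random.randint(N)                   # pick one example uniformly at random
        grad_n = (np.dot(X[n], w) - y[n]) * X[n]   # O(D) per-example gradient
        w = w - lr*grad_n
    return w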
Example: SGD for Logistic Regression

Setting 1: batch gradient. Logistic regression on the Iris dataset (D=2, α = .1), starting from w^{t=0} = (0, 0); shown after 8000 iterations.

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    return grad
Example: SGD for Logistic Regression

Setting 2: stochastic gradient. Logistic regression on the Iris dataset (D=2, α = .1), starting from w^{t=0} = (0, 0).

def StochasticGradientDescent(X,        # N x D
                              y,        # N
                              lr=.01,   # learning rate
                              eps=1e-2, # termination condition
                              ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        n = np.random.randint(N)
        g = gradient(X[[n],:], y[[n]], w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    return grad
Convergence of SGD

Stochastic gradients are not zero at the optimum, so how do we guarantee convergence?
Schedule the learning rate to become smaller over time.

Robbins-Monro conditions: the sequence of learning rates should satisfy
∑_{t=0}^{∞} α^{t} = ∞       otherwise, for a large initial distance ||w^{0} − w*||, we can't reach the minimum
∑_{t=0}^{∞} (α^{t})² < ∞     the step sizes should go to zero

Example schedules: α^{t} = 10/t or α^{t} = t^{−.51}.
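A sketch of SGD with a Robbins-Monro style schedule on the linear-regression toy data; the base rate, the (t+1)^{−.51} decay, and the iteration budget are illustrative choices, not values from the slides:

import numpy as np

def sgd_decaying_lr(X, y, lr0=.01, T=5000):
    # learning rate schedule satisfying sum(lr_t) = inf and sum(lr_t^2) < inf
    N, D = X.shape
    w = np.zeros(D)
    for t in range(T):
        lr = lr0 * (t + 1)**(-.51)                 # matches the t^{-.51} example above
        n = np.random.randint(N)
        grad_n = (np.dot(X[n], w) - y[n]) * X[n]   # per-example linear-regression gradient
        w = w - lr*grad_n
    return w                                       # drifts toward the minimum; noise shrinks with the step size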
Minibatch SGD

Use a minibatch B ⊆ {1, …, N}, a random subset of the dataset, to produce the gradient estimate
∇J_B(w) = ∑_{n ∈ B} ∇J_n(w)

def MinibatchSGD(X, y, lr=.01, eps=1e-2, bsize=8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        w = w - lr*g
    return w

(Figure: optimization paths for full-batch GD, SGD with minibatch size 1, and SGD with minibatch size 16.)
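Assuming the logistic-regression gradient function defined earlier and suitable features X and binary labels y, the routine can be called with different batch sizes to trade gradient noise against per-step cost; illustrative values:

w_batch1  = MinibatchSGD(X, y, lr=.01, bsize=1)    # noisy, cheap steps (plain SGD)
w_batch16 = MinibatchSGD(X, y, lr=.01, bsize=16)   # smoother estimates, 16x the cost per step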
Momentum

To help with the oscillations of SGD (or even full-batch GD), use a running average of gradients, where more recent gradients have higher weights:

Δw^{t} ← β Δw^{t−1} + (1 − β) ∇J_B(w^{t−1})
w^{t} ← w^{t−1} − α Δw^{t}

A momentum of β = 0 reduces to SGD; a common value is β > .9.

The update direction is effectively an exponential moving average of past gradients:
Δw^{T} = (1 − β) ∑_{t=1}^{T} β^{T−t} ∇J_B(w^{t−1})

There are other variations of momentum based on a similar idea.
Momentum

Minibatch SGD with momentum:

def MinibatchSGD(X, y, lr=.01, eps=1e-2, bsize=8, beta=.99):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    dw = 0
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        dw = (1-beta)*g + beta*dw
        w = w - lr*dw
    return w
Momentum

Example: logistic regression with the momentum update
Δw^{t} ← β Δw^{t−1} + (1 − β) ∇J_B(w^{t−1})
w^{t} ← w^{t−1} − α Δw^{t}

(Figure: optimization paths without momentum, α = .5, β = 0, |B| = 8, and with momentum, α = .5, β = .99, |B| = 8.)

See the beautiful demo at Distill: https://distill.pub/2017/momentum/
Adagrad (Adaptive gradient)

Use a different learning rate for each parameter w_d, and make the learning rate adaptive.

S_d^{t} ← S_d^{t−1} + (∂/∂w_d J(w^{t−1}))²
the sum of squares of the derivatives over all iterations so far (for each individual parameter)

w_d^{t} ← w_d^{t−1} − α / √(S_d^{t} + ϵ) · ∂/∂w_d J(w^{t−1})
the learning rate is adapted to previous updates; ϵ avoids numerical issues

This is useful when parameters are updated at different rates (e.g., NLP, where rare features receive few updates).
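The slides give code for RMSprop but not for Adagrad; a minimal sketch in the same style, assuming the minibatch gradient function used above and numpy as np:

def Adagrad(X, y, lr=.01, eps=1e-2, bsize=8, epsilon=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    S = np.zeros(D)                          # running sum of squared derivatives, per parameter
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        S = S + g**2                         # accumulates forever, so the effective step shrinks
        w = w - lr*g/np.sqrt(S + epsilon)    # per-parameter effective learning rate
    return w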
Adagrad (Adaptive gradient)

(Figure: optimization paths for SGD with α = .1, |B| = 1, T = 80,000 and for Adagrad with α = .1, |B| = 1, T = 80,000, ϵ = 1e−8.)

Problem: the learning rate goes to zero too quickly.
RMSprop (Root Mean Squared propagation)

Solves the problem of Adagrad's diminishing step size by using an exponential moving average instead of a sum (similar to momentum):

S^{t} ← γ S^{t−1} + (1 − γ) ∇J(w^{t−1})²
w^{t} ← w^{t−1} − α / √(S^{t} + ϵ) · ∇J(w^{t−1})     otherwise identical to Adagrad

def RMSprop(X, y, lr=.01, eps=1e-2, bsize=8, gamma=.9, epsilon=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    S = 0
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        S = (1-gamma)*g**2 + gamma*S
        w = w - lr*g/np.sqrt(S + epsilon)
    return w
Adam (Adaptive Moment Estimation)

Two ideas so far, both using exponential moving averages:
1. use momentum to smooth out the oscillations
2. use an adaptive, per-parameter learning rate

Adam combines the two:
M^{t} ← β_1 M^{t−1} + (1 − β_1) ∇J(w^{t−1})       moving average of the first moment (identical to momentum)
S^{t} ← β_2 S^{t−1} + (1 − β_2) ∇J(w^{t−1})²      moving average of the second moment (identical to RMSprop)
w^{t} ← w^{t−1} − α M̂^{t} / √(Ŝ^{t} + ϵ)

Since M and S are initialized to zero, they are biased towards zero at early stages, so Adam uses bias-corrected estimates:
M̂^{t} = M^{t} / (1 − β_1^t),   Ŝ^{t} = S^{t} / (1 − β_2^t)
For large time steps the correction has no effect; for small t it scales the estimates up.
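A minimal Adam sketch in the same style as the RMSprop code above, assuming the same minibatch gradient function; the β_1 = .9, β_2 = .999 defaults are commonly recommended values, not taken from the slides:

def Adam(X, y, lr=.01, eps=1e-2, bsize=8, beta1=.9, beta2=.999, epsilon=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    M, S, t = np.zeros(D), np.zeros(D), 0
    while np.linalg.norm(g) > eps:
        t += 1
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        M = beta1*M + (1-beta1)*g             # first moment (momentum)
        S = beta2*S + (1-beta2)*g**2          # second moment (RMSprop)
        Mhat = M / (1 - beta1**t)             # bias correction for the zero initialization
        Shat = S / (1 - beta2**t)
        w = w - lr*Mhat/(np.sqrt(Shat) + epsilon)
    return w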
In practice

The list of such methods keeps growing. They come with recommended ranges for their parameters (learning rate, momentum, etc.), but some hyper-parameter tuning may still be needed (see the logistic regression examples above).
These are all first-order methods: they only need the first derivative. Second-order methods can be much more effective, but are also much more expensive.
(Image: Alec Radford)
Adding L2 regularization

Add the L2 penalty to the cost, but do not penalize the bias w_0.

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * w[1:]      # weight decay; skip the bias term w[0]
    return grad

The L2 penalty makes the optimization easier too. Note that the optimal w_1 shrinks as λ grows.

(Figure: optimization paths over (w_0, w_1) for λ = 0, λ = .01, and λ = .1.)
Subderivatives

The L1 penalty is no longer smooth or differentiable (at 0), so we extend the notion of derivative to non-smooth functions.

The sub-differential is the set of all sub-derivatives at a point:
∂f(ŵ) = [ lim_{w → ŵ⁻} (f(w) − f(ŵ)) / (w − ŵ),  lim_{w → ŵ⁺} (f(w) − f(ŵ)) / (w − ŵ) ]

If f is differentiable at ŵ, the sub-differential has a single member, d/dw f(ŵ).

An equivalent expression for the sub-differential:
∂f(ŵ) = { g ∈ ℝ | f(w) ≥ f(ŵ) + g (w − ŵ)  for all w }
Subgradient

Example: the absolute value f(w) = |w| has subdifferential
∂f(0) = [−1, 1]   and   ∂f(w) = {sign(w)} for w ≠ 0
(image credit: G. Gordon)

Recall that the gradient is the vector of partial derivatives; a subgradient is a vector of sub-derivatives. The subdifferential for functions of multiple variables:
∂f(ŵ) = { g ∈ ℝ^D | f(w) ≥ f(ŵ) + g^T (w − ŵ)  for all w }

We can use a sub-gradient with a diminishing step size for optimization.
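As a short worked check of the absolute-value example, a sketch using the set definition above:

% Subdifferential of f(w) = |w| at \hat{w} = 0, from the set definition:
% g is a subgradient iff |w| \ge |0| + g\,(w - 0) = g\,w for all w.
\partial f(0) = \{\, g \in \mathbb{R} \mid |w| \ge g\,w \;\; \forall w \,\}
% Taking w > 0 gives g \le 1; taking w < 0 gives g \ge -1; hence
\partial f(0) = [-1, 1].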
Adding L1 regularization

L1-regularized linear regression has efficient dedicated solvers. Here we use the subgradient method for L1-regularized logistic regression, with a diminishing learning rate and without penalizing the bias w_0 (a sketch of the full loop follows below).

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * np.sign(w[1:])   # subgradient of the L1 penalty; skip the bias w[0]
    return grad

Note that the optimal w_1 becomes exactly 0 for large enough λ.

(Figure: optimization paths over (w_0, w_1) for λ = 0, λ = .1, and λ = 1.)
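A sketch of the subgradient method with a diminishing step size, calling the L1 gradient function above; the 1/√(t+1) decay, base rate, and iteration budget are illustrative assumptions:

import numpy as np

def SubgradientDescent(X, y, lambdaa=.1, lr0=.1, T=10000):
    N, D = X.shape
    w = np.zeros(D)
    for t in range(T):
        lr = lr0 / np.sqrt(t + 1)          # diminishing step size
        g = gradient(X, y, w, lambdaa)     # full-batch subgradient of the L1-regularized cost
        w = w - lr*g
    return w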
Summary

Learning: optimizing the model parameters (minimizing a cost function).
Use gradient descent to find a local minimum; it is easy to implement (especially using automated differentiation), and for convex functions it gives the global minimum.