Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent
Sham M Kakade © 2018 University of Washington cse446-staff@cs.washington.edu
Announcements
◮ Midterm: Weds, Feb 7th. Policies:
  ◮ You may use a single side of a single sheet of handwritten notes that you prepared.
  ◮ You must turn in your sheet of notes, with your name on it, at the conclusion of the exam, even if you never looked at it.
  ◮ You may not use electronic devices of any sort.
◮ A few comments on the course difficulty
◮ Today (new material): GD and SGD
Course difficulty
Why is it difficult / what should we learn?
◮ homeworks
◮ exams
◮ grading
Review
Gradient Descent: Convergence
◮ Denote: z* = argmin_z F(z), the global minimum; z^(k), our parameter after k updates.
◮ Thm: Suppose F is convex and "L-smooth" (e.g., this holds for the square loss and the logistic loss). Using a fixed step size η ≤ 1/L, we have:

    F(z^(k)) − F(z*) ≤ ‖z^(0) − z*‖² / (η · k)

  That is, the convergence rate is O(1/k).
◮ A constant learning rate means no parameter tuning!
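The theorem above can be exercised numerically. Below is a minimal sketch (names like `gradient_descent` and `eta` are mine, not the lecture's) of fixed-step gradient descent on the average squared loss, with the step size set to 1/L as the theorem suggests:

```python
import numpy as np

def gradient_descent(X, y, eta, num_iters):
    """Fixed-step gradient descent on F(w) = (1/N) * sum_n (y_n - w . x_n)^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = -2.0 / n * X.T @ (y - X @ w)  # gradient of the average squared loss
        w = w - eta * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless labels, so the minimizer is exactly w_true

# For this F, the smoothness constant is L = lambda_max of the Hessian 2 X^T X / N,
# and the theorem allows any fixed step size eta <= 1/L.
L = 2.0 * np.linalg.eigvalsh(X.T @ X / 100).max()
w_hat = gradient_descent(X, y, eta=1.0 / L, num_iters=500)
print(np.round(w_hat, 4))
```

With a fixed step size of 1/L the iterates approach the minimizer; no step-size schedule is needed, matching the "no parameter tuning" remark.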
Probabilistic machine learning:
◮ define a probabilistic model relating random variables x to y
◮ estimate its parameters.
A Probabilistic Model for Binary Classification: Logistic Regression
◮ For Y ∈ {−1, +1}, define p_{w,b}(Y | x) as follows:
  1. Transform the feature vector x via the "activation" function: a = w · x + b
  2. Transform a into a binomial probability by passing it through the logistic function:

    p_{w,b}(Y = +1 | x) = 1 / (1 + exp(−a)) = 1 / (1 + exp(−(w · x + b)))

[Figure: the logistic function, rising from 0 toward 1 over the range −10 to 10]
◮ If we learn p_{w,b}(Y | x), we can (almost) do whatever we like!
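The two steps above can be sketched in a few lines (the function name `sigmoid` is mine; the lecture calls it the logistic function):

```python
import math

def sigmoid(a):
    """The logistic function: maps an activation a to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

# p(Y = +1 | x) for activation a = w . x + b
print(sigmoid(0.0))   # activation 0 gives probability 0.5
print(sigmoid(10.0))  # large positive activation gives probability near 1
```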
Maximum Likelihood Estimation and the Log Loss
The principle of maximum likelihood estimation is to choose our parameters to make our observed data as likely as possible (under our model).
◮ Mathematically: find the ŵ that maximizes the probability of the labels y_1, …, y_N given the inputs x_1, …, x_N.
◮ The maximum likelihood estimator (the "MLE") is:

    ŵ = argmax_w ∏_{n=1}^N p_w(y_n | x_n) = argmin_w ∑_{n=1}^N −log p_w(y_n | x_n)
The MLE for Logistic Regression
◮ The MLE for the logistic regression model:

    argmin_w ∑_{n=1}^N −log p_w(y_n | x_n) = argmin_w ∑_{n=1}^N log(1 + exp(−y_n w · x_n))

◮ This is the logistic loss function that we saw earlier.
◮ How do we find the MLE?
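The objective above is easy to evaluate directly; here is a small sketch (the function name `logistic_loss` is mine) of the average logistic loss:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss (1/N) * sum_n log(1 + exp(-y_n * w . x_n))."""
    margins = y * (X @ w)               # y_n * (w . x_n) for each example
    return np.mean(np.log1p(np.exp(-margins)))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.zeros(2)
print(logistic_loss(w, X, y))  # at w = 0 every margin is 0, so the loss is log(2)
```

Using `log1p` instead of `log(1 + ...)` is slightly more accurate when the margins are large and `exp(-margin)` is tiny.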
Derivation of the log loss for logistic regression: scratch space
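One way the scratch space might be filled in (a sketch, using only the model defined two slides back, with the bias b absorbed into w):

```latex
% The two cases of the model combine into a single formula for y in {-1, +1}:
% p_w(Y=+1 | x) = 1/(1 + e^{-w \cdot x}) and
% p_w(Y=-1 | x) = 1 - p_w(Y=+1 | x) = 1/(1 + e^{w \cdot x}), so
\[
p_w(Y = y \mid x) = \frac{1}{1 + \exp(-y\, w \cdot x)} .
\]
% Taking the negative log gives exactly the logistic loss on the next slide:
\[
-\log p_w(y_n \mid x_n) = \log\bigl(1 + \exp(-y_n\, w \cdot x_n)\bigr) .
\]
```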
Today
Linear Regression as a Probabilistic Model
Linear regression defines p_w(Y | x) as follows:
1. Observe the feature vector x; transform it via the activation function: µ = w · x
2. Let µ be the mean of a normal distribution with variance σ², and define the density:

    p_w(Y | x) = (1 / (σ√(2π))) · exp(−(Y − µ)² / (2σ²))

3. Sample Y from p_w(Y | x).
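The generative story above can be sketched directly (the parameter values here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0])
sigma = 0.1

x = np.array([1.0, 3.0])
mu = w @ x                             # steps 1-2: the activation is the normal's mean
y = rng.normal(loc=mu, scale=sigma)    # step 3: sample Y ~ N(mu, sigma^2)
print(mu)                              # the mean of Y given this x
```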
Linear Regression: the MLE is (Unregularized) Squared Loss Minimization!

    argmin_w ∑_{n=1}^N −log p_w(y_n | x_n) ≡ argmin_w (1/N) ∑_{n=1}^N (y_n − w · x_n)²

(each term (y_n − w · x_n)² is SquaredLoss_n(w))
Where did the variance go? What is GD here?
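The equivalence (and where the variance went) can be checked numerically: the average Gaussian negative log-likelihood equals the average squared loss scaled by 1/(2σ²), plus a constant that does not depend on w, so both objectives share the same argmin. A small sketch (function names are mine):

```python
import numpy as np

def avg_nll(w, X, y, sigma):
    """Average negative log-likelihood under Y ~ N(w . x, sigma^2)."""
    mu = X @ w
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2))

def avg_sq_loss(w, X, y):
    return np.mean((y - X @ w) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)
sigma = 0.7
const = 0.5 * np.log(2 * np.pi * sigma**2)  # independent of w
for _ in range(3):
    w = rng.normal(size=2)
    lhs = avg_nll(w, X, y, sigma)
    rhs = const + avg_sq_loss(w, X, y) / (2 * sigma**2)
    print(np.isclose(lhs, rhs))
```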
Loss Minimization & Gradient Descent

    w* = argmin_w (1/N) ∑_{n=1}^N ℓ(x_n, y_n, w) + R(w),   writing ℓ_n(w) = ℓ(x_n, y_n, w)

What is GD here? What do we do if N is large?
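For the squared loss with a ridge penalty R(w) = λ‖w‖², one GD step might look like the sketch below (function name and parameter values are mine). Note that each step touches all N examples, which is exactly the problem when N is large:

```python
import numpy as np

def gd_step(w, X, y, lam, eta):
    """One gradient step on (1/N) * sum_n (y_n - w . x_n)^2 + lam * ||w||^2."""
    n = X.shape[0]
    grad = -2.0 / n * X.T @ (y - X @ w) + 2.0 * lam * w  # full pass over the data
    return w - eta * grad

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
for _ in range(200):
    w = gd_step(w, X, y, lam=0.0, eta=0.1)
print(np.round(w, 3))  # with lam = 0 this is plain least squares
</imports>```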
Stochastic Gradient Descent (SGD): by example

    argmin_w (1/N) ∑_{n=1}^N (y_n − w · x_n)²

◮ Gradient descent:
◮ Note we are computing an average. What is a crude way to estimate an average?
◮ Stochastic gradient descent: Will it converge?
  If the step size in SGD is a constant, we will not converge.
Stochastic Gradient Descent (SGD) (without regularization)

Data: loss functions ℓ(·), training data, number of iterations K, step sizes ⟨η^(1), …, η^(K)⟩
Result: parameters w ∈ R^d
initialize: w^(0) = 0;
for k ∈ {1, …, K} do
    i ∼ Uniform({1, …, N});
    w^(k) = w^(k−1) − η^(k) · ∇_w ℓ_i(w^(k−1));
end
return w^(K);
Algorithm 1: SGD
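Algorithm 1 translates almost line for line into code. Below is a sketch for the squared loss ℓ_i(w) = (y_i − w · x_i)², using a decreasing step-size schedule (the schedule and function name are my choices, not prescribed by the slide):

```python
import numpy as np

def sgd(X, y, etas, rng):
    """Algorithm 1: SGD without regularization, for the squared loss."""
    n, d = X.shape
    w = np.zeros(d)                                 # initialize w^(0) = 0
    for eta in etas:                                # k = 1, ..., K
        i = rng.integers(n)                         # i ~ Uniform({1, ..., N})
        grad_i = -2.0 * (y[i] - X[i] @ w) * X[i]    # gradient of l_i at w^(k-1)
        w = w - eta * grad_i                        # w^(k) = w^(k-1) - eta^(k) grad
    return w                                        # return w^(K)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -1.0])
y = X @ w_true                                  # noiseless data for illustration
K = 2000
etas = 0.05 / np.sqrt(np.arange(1, K + 1))      # decreasing step sizes
w_hat = sgd(X, y, etas, rng)
print(np.round(w_hat, 2))
```

Each iteration touches a single example rather than all N, which is the whole point when N is large.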
Stochastic Gradient Descent: Convergence

    w* = argmin_w (1/N) ∑_{n=1}^N ℓ_n(w)

◮ w^(k): our parameter after k updates.
◮ Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There exists a way to decrease our step sizes η^(k) over time so that our function value F(w^(k)) converges to the minimal function value F(w*).
◮ This Thm differs from GD in that we need to turn down our step sizes over time!