Machine Learning (CSE 446): Probabilistic View of Logistic Regression and Linear Regression
Sham M Kakade © 2018 University of Washington
cse446-staff@cs.washington.edu
Announcements
◮ Midterm: Weds, Feb 7th. Policies:
  ◮ You may use a single side of a single sheet of handwritten notes that you prepared.
  ◮ You must turn your sheet of notes in, with your name on it, at the conclusion of the exam, even if you never looked at it.
  ◮ You may not use electronic devices of any sort.
◮ Today:
  ◮ Review: Regularization and Optimization
  ◮ New: (wrap up GD) + probabilistic modeling!
Review
Regularization / Ridge Regression
◮ Regularize the optimization problem:

    min_w (1/N) ∑_{n=1}^{N} (y_n − w · x_n)² + λ‖w‖²  =  min_w (1/N) ‖Y − X^⊤ w‖² + λ‖w‖²

◮ This particular case: “Ridge” regression, Tikhonov regularization.
◮ The solution is the (regularized) least squares estimator:

    ŵ_least squares = ( (1/N) X^⊤ X + λ I )^{−1} ( (1/N) X^⊤ Y )

◮ Regularization is often necessary for the “exact” solution method (regardless of whether d is larger or smaller than N).
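As a concrete illustration, here is a minimal NumPy sketch of the closed-form solution above. It assumes the data matrix X stores one example x_n per row (shape N × d); the function and variable names are ours, not the course's.

import numpy as np

def ridge_closed_form(X, Y, lam):
    # Closed form from the slide: ((1/N) X^T X + lam I)^{-1} ((1/N) X^T Y).
    # Assumes X has shape (N, d) with one example x_n per row.
    N, d = X.shape
    A = X.T @ X / N + lam * np.eye(d)
    b = X.T @ Y / N
    return np.linalg.solve(A, b)      # solve A w = b; more stable than forming the inverse

# Tiny usage example on synthetic data (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ w_true + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, Y, lam=0.1))   # roughly recovers w_true, shrunk slightly toward 0 by lambda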
Gradient Descent
◮ Want to solve:

    min_z F(z)

◮ How should we update z?
Gradient Descent
Algorithm 1: GradientDescent
  Data: function F : R^d → R, number of iterations K, step sizes (η^(1), ..., η^(K))
  Result: z ∈ R^d
  initialize: z^(0) = 0
  for k ∈ {1, ..., K} do
      z^(k) = z^(k−1) − η^(k) · ∇_z F(z^(k−1))
  end
  return z^(K)
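A direct Python rendering of Algorithm 1, assuming the caller supplies the gradient of F as a function (the names grad_F and etas are ours):

import numpy as np

def gradient_descent(grad_F, d, etas):
    # Algorithm 1: start at z^(0) = 0 and take one gradient step per step size.
    # grad_F: maps a vector z of shape (d,) to the gradient of F at z
    # etas:   the step sizes (eta^(1), ..., eta^(K))
    z = np.zeros(d)
    for eta in etas:
        z = z - eta * grad_F(z)
    return z

# Usage: minimize F(z) = ||z - c||^2, whose gradient is 2 (z - c).
c = np.array([1.0, -2.0, 3.0])
z_hat = gradient_descent(lambda z: 2 * (z - c), d=3, etas=[0.25] * 100)
print(z_hat)   # approximately equal to c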
Today
Gradient Descent: Convergence
◮ Denote:
    z^* = argmin_z F(z): the global minimum
    z^(k): our parameter after k updates.
◮ Thm: Suppose F is convex and “L-smooth”. Using a fixed step size η ≤ 1/L, we have:

    F(z^(k)) − F(z^*) ≤ ‖z^(0) − z^*‖² / (η · k)

  That is, the convergence rate is O(1/k).
◮ This Thm applies to both the square loss and logistic loss!
Proof intuition: smoothness and GD Convergence
◮ L-smooth functions: “The gradients don't change quickly.” Precisely, for all z, z′:

    ‖∇F(z) − ∇F(z′)‖ ≤ L‖z − z′‖

◮ Proof idea (a numerical sketch follows below):
  1. If our gradient is large, we will make good progress decreasing our function value.
  2. If our gradient is small, we must have value near the optimal value.
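To make the theorem on the previous slide concrete, here is a small numerical sketch for the (unregularized) square loss. The smoothness constant L is taken as the largest eigenvalue of (2/N) X^⊤ X; the synthetic data and names are ours, not the course's.

import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def F(w):                      # average square loss
    return np.mean((Y - X @ w) ** 2)

def grad_F(w):                 # gradient of the average square loss
    return (2.0 / N) * X.T @ (X @ w - Y)

L = np.linalg.eigvalsh((2.0 / N) * X.T @ X).max()   # a valid smoothness constant for F
eta = 1.0 / L

w_star, *_ = np.linalg.lstsq(X, Y, rcond=None)      # the exact minimizer z^*
w = np.zeros(d)                                     # z^(0) = 0
for k in range(1, 201):
    w = w - eta * grad_F(w)
    if k in (10, 50, 100, 200):
        gap = F(w) - F(w_star)
        bound = np.sum(w_star ** 2) / (eta * k)     # ||z^(0) - z^*||^2 / (eta * k)
        print(k, gap, bound)                        # the gap stays below the O(1/k) bound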
A better idea?
◮ Remember the Bayes optimal classifier. D(x, y) is the true probability of (x, y).

    f^(BO)(x) = argmax_y D(x, y) = argmax_y D(y | x)

◮ Of course, we don't have D(y | x). Probabilistic machine learning: define a probabilistic model relating random variables x to y and estimate its parameters.
A Probabilistic Model for Binary Classification: Logistic Regression
◮ For Y ∈ {−1, +1}, define p_{w,b}(Y | x) as follows:
  1. Transform the feature vector x via the “activation” function: a = w · x + b
  2. Transform a into a binomial probability by passing it through the logistic function:

      p_{w,b}(Y = +1 | x) = 1 / (1 + exp(−a)) = 1 / (1 + exp(−(w · x + b)))

  [Figure: the logistic function, rising from 0 toward 1 as its argument goes from −10 to 10.]
◮ If we learn p_{w,b}(Y | x), we can (almost) do whatever we like!
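A minimal NumPy sketch of this model; the helper names are ours, and the identity p_{w,b}(Y = y | x) = logistic(y · (w · x + b)) for y ∈ {−1, +1} is a standard consequence of the definition above.

import numpy as np

def logistic(a):
    # The logistic function 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-a))

def p_y_given_x(y, x, w, b):
    # p_{w,b}(Y = y | x) for y in {-1, +1}.
    # Since p(Y = -1 | x) = 1 - p(Y = +1 | x) = 1 / (1 + exp(w.x + b)),
    # both cases collapse to logistic(y * (w.x + b)).
    return logistic(y * (np.dot(w, x) + b))

# Example: the two class probabilities sum to 1.
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.5, -0.5])
print(p_y_given_x(+1, x, w, b), p_y_given_x(-1, x, w, b))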
Maximum Likelihood Estimation
The principle of maximum likelihood estimation is to choose our parameters to make our observed data as likely as possible (under our model).
◮ Mathematically: find ŵ that maximizes the probability of the labels y_1, ..., y_N given the inputs x_1, ..., x_N.
◮ Note, by the i.i.d. assumption:

    D(y_1, ..., y_N | x_1, ..., x_N) = ∏_{n=1}^{N} D(y_n | x_n)

◮ The Maximum Likelihood Estimator (the 'MLE') is:

    ŵ = argmax_w ∏_{n=1}^{N} p_w(y_n | x_n)
Maximum Likelihood Estimation and the Log Loss
◮ The 'MLE' is:

    ŵ = argmax_w ∏_{n=1}^{N} p_w(y_n | x_n)
      = argmax_w log ∏_{n=1}^{N} p_w(y_n | x_n)
      = argmax_w ∑_{n=1}^{N} log p_w(y_n | x_n)
      = argmin_w ∑_{n=1}^{N} − log p_w(y_n | x_n)

◮ This is referred to as the log loss.
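A quick numerical aside (ours, not from the slides) on why the switch from the product to the sum of logs also matters computationally: the likelihood of even a moderately sized dataset underflows in floating point, while the log loss does not.

import numpy as np

probs = np.full(10_000, 0.9)        # pretend p_w(y_n | x_n) = 0.9 for 10,000 examples
print(np.prod(probs))               # 0.0 -- the product underflows
print(-np.sum(np.log(probs)))       # about 1053.6 -- the summed log loss is well behaved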
The MLE for Logistic Regression
◮ The MLE for the logistic regression model:

    argmin_w ∑_{n=1}^{N} − log p_w(y_n | x_n) = argmin_w ∑_{n=1}^{N} log(1 + exp(−y_n w · x_n))

◮ This is the logistic loss function that we saw earlier.
◮ How do we find the MLE?
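There is no closed form for this minimizer, so one natural answer (consistent with the earlier convergence slide) is to run Algorithm 1 on the logistic loss. A sketch, where the gradient expression −∑_n y_n x_n / (1 + exp(y_n w · x_n)) is derived by us from the loss above and the data is synthetic:

import numpy as np

def logistic_loss(w, X, y):
    # sum_n log(1 + exp(-y_n * w . x_n)); X is (N, d), y has entries in {-1, +1}.
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))    # numerically stable log(1 + exp(-m))

def logistic_loss_grad(w, X, y):
    # Gradient of the loss above: -sum_n y_n x_n / (1 + exp(y_n * w . x_n)).
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))
    return X.T @ coef

# Find the MLE by gradient descent on a small synthetic problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
w, eta = np.zeros(3), 0.01
for _ in range(500):
    w = w - eta * logistic_loss_grad(w, X, y)
print(w, logistic_loss(w, X, y))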
Derivation of the Log Loss for Logistic Regression: scratch space (a worked sketch follows below)
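One way to fill in the scratch space (our sketch; the bias b is folded into w by appending a constant feature, which is an assumption of this sketch). For the model p_w(Y = +1 | x) = 1 / (1 + exp(−w · x)), we also have

    p_w(Y = −1 | x) = 1 − 1 / (1 + exp(−w · x)) = 1 / (1 + exp(+w · x)),

so both cases combine into p_w(y | x) = 1 / (1 + exp(−y w · x)) for y ∈ {−1, +1}, and therefore

    − log p_w(y | x) = log(1 + exp(−y w · x)),

which is exactly the per-example term on the previous slide.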
Linear Regression as a Probabilistic Model
Linear regression defines p_w(Y | x) as follows:
  1. Observe the feature vector x; transform it via the activation function: µ = w · x
  2. Let µ be the mean of a normal distribution and define the density:

      p_w(Y | x) = (1 / (σ √(2π))) exp( −(Y − µ)² / (2σ²) )

  3. Sample Y from p_w(Y | x).
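A minimal sketch of this generative model in NumPy (the function names and example values are ours):

import numpy as np

def sample_y(x, w, sigma, rng):
    # Steps 1-3: compute mu = w . x, then sample Y ~ Normal(mu, sigma^2).
    mu = np.dot(w, x)
    return rng.normal(loc=mu, scale=sigma)

def log_density(y, x, w, sigma):
    # log p_w(y | x) = -(y - w.x)^2 / (2 sigma^2) - log(sigma * sqrt(2 pi)).
    mu = np.dot(w, x)
    return -(y - mu) ** 2 / (2.0 * sigma ** 2) - np.log(sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
w, sigma = np.array([1.0, -2.0]), 0.5
x = np.array([0.3, 0.7])
y = sample_y(x, w, sigma, rng)
print(y, log_density(y, x, w, sigma))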
Linear Regression MLE is (Unregularized) Squared Loss Minimization!

    argmin_w ∑_{n=1}^{N} − log p_w(y_n | x_n)  ≡  argmin_w (1/N) ∑_{n=1}^{N} (y_n − w · x_n)²

  (each term (y_n − w · x_n)² is SquaredLoss_n(w, b))

Where did the variance go?
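One way to answer the question (our sketch, using the density from the previous slide): taking logs,

    − log p_w(y_n | x_n) = (y_n − w · x_n)² / (2σ²) + log(σ √(2π)),

so

    ∑_{n=1}^{N} − log p_w(y_n | x_n) = (1 / (2σ²)) ∑_{n=1}^{N} (y_n − w · x_n)² + N log(σ √(2π)).

For any fixed σ > 0, the additive constant and the positive factor 1/(2σ²) do not change which w attains the minimum, so the MLE for w coincides with the unregularized squared-loss minimizer; the variance would only matter if we also wanted to estimate σ.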