Logistic Regression




  1. 10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Logistic Regression. Matt Gormley, Lecture 9, Feb. 13, 2019.

  2. Q&A
Q: In recitation, we only covered the Perceptron mistake bound for linearly separable data. Isn't that an unrealistic setting?
A: Not at all! Even if your data isn't linearly separable to begin with, we can often add features to make it so.
Exercise: Add another feature to transform this nonlinearly separable data into linearly separable data:
  x1   x2   y
  +1   +1   +
  +1   -1   -
  -1   +1   -
  -1   -1   +

  3. Reminders
• Homework 3: KNN, Perceptron, Linear Regression. Out: Wed, Feb 6; Due: Fri, Feb 15 at 11:59pm
• Homework 4: Logistic Regression. Out: Fri, Feb 15; Due: Fri, Mar 1 at 11:59pm
• Midterm Exam 1: Thu, Feb 21, 6:30pm – 8:00pm
• Today's In-Class Poll: http://p9.mlcourse.org

  4. PROBABILISTIC LEARNING

  5. Probabilistic Learning
Function Approximation: Previously, we assumed that our output was generated using a deterministic target function. Our goal was to learn a hypothesis h(x) that best approximates c*(x).
Probabilistic Learning: Today, we assume that our output is sampled from a conditional probability distribution. Our goal is to learn a probability distribution p(y|x) that best approximates p*(y|x).

  6. Robotic Farming
Classification (binary output). Deterministic: Is this a picture of a wheat kernel? Probabilistic: Is this plant drought resistant?
Regression (continuous output). Deterministic: How many wheat kernels are in this picture? Probabilistic: What will the yield of this plant be?

  7. Bayes Optimal Classifier (Whiteboard)
– Bayes Optimal Classifier
– Reducible / irreducible error
– Ex: Bayes Optimal Classifier for 0/1 Loss
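The whiteboard content is not reproduced in the slides; as a reminder (a standard result, not taken from the slides themselves), under 0/1 loss the Bayes optimal classifier predicts the most probable label under the true conditional distribution p*:

$$ h^*(x) = \operatorname*{argmax}_{y} \; p^*(y \mid x) $$

Its error rate is the irreducible error; any extra error made by a learned hypothesis on top of that is the reducible part.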

  8. Learning from Data (Frequentist) (Whiteboard)
– Principle of Maximum Likelihood Estimation (MLE)
– Strawmen:
  • Example: Bernoulli
  • Example: Gaussian
  • Example: Conditional #1 (Bernoulli conditioned on Gaussian)
  • Example: Conditional #2 (Gaussians conditioned on Bernoulli)
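The worked examples live on the whiteboard; for reference, the MLE principle (standard definition, with the Bernoulli case as the simplest instance) is:

$$ \theta^{\text{MLE}} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} p(x^{(i)} \mid \theta), \qquad \text{Bernoulli: } \hat{\phi} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)} \;\; \text{for } x^{(i)} \in \{0, 1\} $$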

  9. MOTIVATION: LOGISTIC REGRESSION

  10. Example: Image Classification
• ImageNet LSVRC-2010 contest:
  – Dataset: 1.2 million labeled images, 1000 classes
  – Task: Given a new image, label it with the correct class
  – Multiclass classification problem
• Examples from http://image-net.org/

  11.–13. (Example images from http://image-net.org/; figure-only slides with no text content)

  14. Example: Image Classification
CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest
• Five convolutional layers (w/ max-pooling)
• Three fully connected layers
(Architecture diagram: input image (pixels) at one end, a 1000-way softmax at the other.)

  15. Example: Image Classification (continued)
CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest
• Five convolutional layers (w/ max-pooling)
• Three fully connected layers
This "softmax" layer is Logistic Regression! The rest is just some fancy feature extraction (discussed later in the course).

  16. LOGISTIC REGRESSION

  17. Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs are discrete. (We are back to classification, despite the name "logistic regression".)

  18. Recall… Linear Models for Classification
Key idea: Try to learn this hyperplane directly. Directly modeling the hyperplane would use a decision function h(x) = sign(θᵀx) for y ∈ {−1, +1}.
Looking ahead: We'll see a number of commonly used Linear Classifiers. These include: Perceptron, Logistic Regression, Naïve Bayes (under certain conditions), and Support Vector Machines.

  19. Recall… Background: Hyperplanes
Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!
Hyperplane (Definition 1): H = { x : wᵀx = b }
Hyperplane (Definition 2): H = { x : θᵀx = 0 }, using the notation trick above
Half-spaces: the two regions the hyperplane separates, { x : θᵀx > 0 } and { x : θᵀx < 0 }
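A tiny sketch of that notation trick in NumPy (the variable names and the sign convention for the bias are my own choices, made to match Definition 1 above):

```python
import numpy as np

w = np.array([2.0, -1.0])                 # original weights
b = 0.5                                   # original bias, as in w^T x = b
theta = np.concatenate(([-b], w))         # fold the bias in: theta = [-b, w_1, w_2]

x = np.array([1.5, 3.0])
x_aug = np.concatenate(([1.0], x))        # prepend a constant 1 to x

# theta^T x_aug reproduces w^T x - b, so the hyperplane w^T x = b becomes theta^T x_aug = 0
assert np.isclose(theta @ x_aug, w @ x - b)
```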

  20. Using gradient ascent for linear classifiers
Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model

  21. Using gradient ascent for linear classifiers
This decision function isn't differentiable: h(x) = sign(θᵀx)
Use a differentiable function instead: p_θ(y = 1 | x) = 1 / (1 + exp(−θᵀx)), where logistic(u) ≡ 1 / (1 + e^(−u))


  23. Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs are discrete.
Model: Logistic function applied to the dot product of parameters with the input vector: p_θ(y = 1 | x) = 1 / (1 + exp(−θᵀx))
Learning: finds the parameters that minimize some objective function: θ* = argmin_θ J(θ)
Prediction: Output is the most probable class: ŷ = argmax_{y ∈ {0,1}} p_θ(y | x)
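For concreteness, here is a minimal NumPy sketch of the model and prediction rule above (the function names are my own; the bias is assumed to be folded into θ, so x[0] == 1):

```python
import numpy as np

def logistic(u):
    # logistic(u) = 1 / (1 + e^(-u))
    return 1.0 / (1.0 + np.exp(-u))

def predict_proba(theta, x):
    # p_theta(y = 1 | x) for a single example x
    return logistic(theta @ x)

def predict(theta, x):
    # y_hat = argmax over y in {0, 1} of p_theta(y | x),
    # i.e. predict 1 exactly when p_theta(y = 1 | x) >= 0.5
    return int(predict_proba(theta, x) >= 0.5)
```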

  24. Logistic Regression (Whiteboard)
– Bernoulli interpretation
– Logistic Regression Model
– Decision boundary
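One fact from the whiteboard list worth recording here: the decision boundary of logistic regression is linear, because predicting ŷ = 1 exactly when p_θ(y = 1 | x) ≥ 1/2 is equivalent to

$$ \frac{1}{1 + \exp(-\theta^T x)} \ge \frac{1}{2} \iff \theta^T x \ge 0, $$

so the boundary is the hyperplane θᵀx = 0.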

  25. Learning for Logistic Regression (Whiteboard)
– Partial derivative for Logistic Regression
– Gradient for Logistic Regression
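The derivation itself is done on the whiteboard; the standard result, for labels y ∈ {0, 1} and the negative log conditional likelihood J(θ) defined on slide 30, is

$$ \frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{N} \Big( p_{\theta}(y = 1 \mid x^{(i)}) - y^{(i)} \Big)\, x_j^{(i)}, \qquad \nabla_{\theta} J(\theta) = \sum_{i=1}^{N} \Big( p_{\theta}(y = 1 \mid x^{(i)}) - y^{(i)} \Big)\, x^{(i)}. $$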

  26.–28. Logistic Regression (figure-only slides; no text content)

  29. LEARNING LOGISTIC REGRESSION

  30. Maximum Conditional Likelihood Estimation
Learning: finds the parameters that minimize some objective function: θ* = argmin_θ J(θ)
We minimize the negative log conditional likelihood:
J(θ) = −log ∏_{i=1}^{N} p_θ(y^(i) | x^(i))
Why?
1. We can't maximize likelihood (as in Naïve Bayes) because we don't have a joint model p(x, y)
2. It worked well for Linear Regression (least squares is MCLE)
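Expanding the log of the product for binary labels y ∈ {0, 1} gives the familiar cross-entropy form (a standard rewriting, spelled out here for concreteness):

$$ J(\theta) = -\sum_{i=1}^{N} \Big[ y^{(i)} \log p_{\theta}(y = 1 \mid x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - p_{\theta}(y = 1 \mid x^{(i)})\big) \Big]. $$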

  31. Maximum Conditional Likelihood Estimation
Learning: Four approaches to solving θ* = argmin_θ J(θ)
Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton's Method (use second derivatives to better follow curvature)
Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters)

  32. Maximum Conditional Likelihood Estimation (continued; same four approaches as above) Note: Logistic Regression does not have a closed form solution for the MLE parameters.

  33. SGD for Logistic Regression
Question: Which of the following is a correct description of SGD for Logistic Regression?
Answer: At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples (2) update all the parameters using the gradient
B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer
C. (1) compute the gradient of the log-likelihood for all examples (2) randomly pick an example (3) update only the parameters for that example
D. (1) randomly pick a parameter, (2) compute the partial derivative of the log-likelihood with respect to that parameter, (3) update that parameter for all examples
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log-likelihood for that example with respect to that parameter, (3) update that parameter using that gradient

  34. Recall… Gradient Descent
Algorithm 1: Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇_θ J(θ)
5:   return θ
In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives):
∇_θ J(θ) = [ dJ(θ)/dθ_1, dJ(θ)/dθ_2, …, dJ(θ)/dθ_N ]ᵀ
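As an unofficial but concrete instance of Algorithm 1 for logistic regression, here is a NumPy sketch; the fixed iteration budget stands in for the "while not converged" test, and the step size lam is an arbitrary choice of mine:

```python
import numpy as np

def gradient(theta, X, y):
    # Gradient of J(theta) = -sum_i log p_theta(y_i | x_i), for labels y in {0, 1}:
    # sum_i (p_theta(y = 1 | x_i) - y_i) x_i
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (p - y)

def gradient_descent(X, y, lam=0.1, num_iters=1000):
    # X: (N, M+1) design matrix with a leading column of ones; y: (N,) labels in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):       # stand-in for "while not converged"
        theta = theta - lam * gradient(theta, X, y)
    return theta
```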

  35. Recall… Stochastic Gradient Descent (SGD)
We can also apply SGD to solve the MCLE problem for Logistic Regression. We need a per-example objective:
Let J(θ) = Σ_{i=1}^{N} J^(i)(θ), where J^(i)(θ) = −log p_θ(y^(i) | x^(i)).
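A per-example SGD step for J^(i)(θ) might look like this sketch (the step size, sampling scheme, and function names are my choices, not the lecture's):

```python
import numpy as np

def sgd_step(theta, x_i, y_i, lam=0.1):
    # Gradient of J^(i)(theta) = -log p_theta(y_i | x_i), for y_i in {0, 1},
    # is (p_theta(y = 1 | x_i) - y_i) * x_i.
    p_i = 1.0 / (1.0 + np.exp(-theta @ x_i))
    return theta - lam * (p_i - y_i) * x_i

def sgd(X, y, lam=0.1, num_steps=10000):
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        i = rng.integers(len(y))                  # randomly pick one example...
        theta = sgd_step(theta, X[i], y[i], lam)  # ...and update all parameters with its gradient
    return theta
```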

  36. Mini-Batch SGD
• Gradient Descent: Compute true gradient exactly from all N examples
• Mini-Batch SGD: Approximate true gradient by the average gradient of K randomly chosen examples
• Stochastic Gradient Descent (SGD): Approximate true gradient by the gradient of one randomly chosen example
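To make the contrast concrete, here is a small self-contained sketch in which only the index set fed to the gradient estimate changes across the three variants (the synthetic data, the batch size K = 10, and the helper name are all invented for illustration):

```python
import numpy as np

def avg_grad(theta, X, y, idx):
    # Average gradient of -log p_theta(y | x) over the examples indexed by idx
    Xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))
    return Xb.T @ (p - yb) / len(idx)

rng = np.random.default_rng(0)
N, M = 100, 5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])   # leading 1s column for the bias
y = rng.integers(0, 2, size=N).astype(float)
theta = np.zeros(M + 1)

g_gd   = avg_grad(theta, X, y, np.arange(N))                           # Gradient Descent: all N examples
g_mini = avg_grad(theta, X, y, rng.choice(N, size=10, replace=False))  # Mini-Batch SGD: K random examples
g_sgd  = avg_grad(theta, X, y, rng.choice(N, size=1))                  # SGD: one random example
```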

  37. Mini-Batch SGD. Three variants of first-order optimization (figure-only comparison).
