Binary Logistic Regression + Multinomial Logistic Regression


  1. 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Binary Logistic Regression + Multinomial Logistic Regression Matt Gormley Lecture 10 Feb. 17, 2020 1

  2. Reminders • Midterm Exam 1 – Tue, Feb. 18, 7:00pm – 9:00pm • Homework 4: Logistic Regression – Out: Wed, Feb. 19 – Due: Fri, Feb. 28 at 11:59pm • Today’s In-Class Poll – http://p10.mlcourse.org • Reading on Probabilistic Learning is reused later in the course for MLE/MAP 3

  3. MLE: Suppose we have data $\mathcal{D} = \{ \mathbf{x}^{(i)} \}_{i=1}^{N}$. Principle of Maximum Likelihood Estimation: choose the parameters that maximize the likelihood of the data, $\theta_{\text{MLE}} = \operatorname{argmax}_{\theta} \prod_{i=1}^{N} p(\mathbf{x}^{(i)} \mid \theta)$. This is the Maximum Likelihood Estimate (MLE). [Figure: likelihood curves $L(\theta)$ and $L(\theta_1, \theta_2)$ with their maxima at $\theta_{\text{MLE}}$.] 5
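
As a quick worked example of this principle (not from the slides; the Bernoulli coin-flip model is my own choice of illustration), the MLE has a closed form:

```latex
% Hedged worked example: MLE for a Bernoulli model.
% Data: N coin flips x^(i) in {0,1}; parameter theta = P(x = 1).
\begin{align*}
\theta_{\text{MLE}}
  &= \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}
   = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \Big[ x^{(i)} \log\theta + (1 - x^{(i)}) \log(1-\theta) \Big] \\
\text{Setting the derivative to zero:} \quad
  0 &= \frac{\sum_i x^{(i)}}{\theta} - \frac{N - \sum_i x^{(i)}}{1-\theta}
  \;\;\Longrightarrow\;\;
  \theta_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}.
\end{align*}
```

All of the probability mass is allocated according to the observed frequencies, which is exactly the behavior described on the next slide.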

  4. MLE What does maximizing likelihood accomplish? • There is only a finite amount of probability mass (i.e. sum-to-one constraint) • MLE tries to allocate as much probability mass as possible to the things we have observed… … at the expense of the things we have not observed 6

  5. MOTIVATION: LOGISTIC REGRESSION 7

  6. Example: Image Classification • ImageNet LSVRC-2010 contest: – Dataset : 1.2 million labeled images, 1000 classes – Task : Given a new image, label it with the correct class – Multiclass classification problem • Examples from http://image-net.org/ 10

  7.–9. [Image-only slides: example images from ImageNet (pages 11–13).]

  10. Example: Image Classification. CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest. Five convolutional layers (w/ max-pooling) and three fully connected layers. [Figure: input image (pixels) feeding into the network, ending in a 1000-way softmax.] 14

  11. Example: Image Classification. CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest. Five convolutional layers (w/ max-pooling) and three fully connected layers. This "softmax" layer is Logistic Regression! The rest is just some fancy feature extraction (discussed later in the course). [Figure: input image (pixels) feeding into the network, ending in a 1000-way softmax.] 15
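
Since the lecture title also covers multinomial logistic regression, here is a minimal sketch (not from the slides; the NumPy helper and its name are my own) of what a K-way softmax layer computes:

```python
import numpy as np

def softmax(scores):
    """Map a length-K vector of scores to a probability distribution over K classes.

    This is what a "1000-way softmax" layer does for K = 1000: multinomial
    logistic regression on top of whatever features the earlier layers produce.
    """
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Example: three class scores -> probabilities that sum to one.
print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099]
```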

  12. LOGISTIC REGRESSION 16

  13. Logistic Regression. Data: Inputs are continuous vectors of length M. Outputs are discrete. We are back to classification, despite the name logistic regression. 17

  14. Recall… Linear Models for Classification. Key idea: try to learn this hyperplane directly. Directly modeling the hyperplane would use a decision function $h(\mathbf{x}) = \operatorname{sign}(\theta^T \mathbf{x})$ for $y \in \{-1, +1\}$. Looking ahead: we'll see a number of commonly used Linear Classifiers. These include: Perceptron, Logistic Regression, Naïve Bayes (under certain conditions), and Support Vector Machines.

  15. Recall… Background: Hyperplanes. Hyperplane (Definition 1): $\mathcal{H} = \{ \mathbf{x} : \mathbf{w}^T \mathbf{x} = b \}$. Hyperplane (Definition 2) and half-spaces: [equations shown on the slide]. Notation Trick: fold the bias $b$ and the weights $\mathbf{w}$ into a single vector $\theta$ by prepending a constant to $\mathbf{x}$ and increasing the dimensionality by one!
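
A minimal sketch of the notation trick described above (not from the slides; the array shapes and the function name are my own):

```python
import numpy as np

def fold_bias(X):
    """Prepend a constant 1 to each input vector so the bias b becomes theta[0].

    X has shape (N, M); the result has shape (N, M + 1), and a single parameter
    vector theta of length M + 1 then encodes both the weights and the bias.
    """
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

X = np.array([[2.0, 3.0],
              [1.0, -1.0]])
print(fold_bias(X))  # [[ 1.  2.  3.], [ 1.  1. -1.]]
```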

  16. Using gradient ascent for linear classifiers Key idea behind today’s lecture: 1. Define a linear classifier (logistic regression) 2. Define an objective function (likelihood) 3. Optimize it with gradient descent to learn parameters 4. Predict the class with highest probability under the model 20

  17. Using gradient ascent for linear classifiers. This decision function isn't differentiable: $h(\mathbf{x}) = \operatorname{sign}(\theta^T \mathbf{x})$. Use a differentiable function instead: $p_\theta(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\theta^T \mathbf{x})}$, where $\operatorname{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$. [Plots of $\operatorname{sign}(x)$ and $\operatorname{logistic}(u)$.] 21
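
A minimal sketch contrasting the two functions (not from the slides; the NumPy-based helpers are my own):

```python
import numpy as np

def sign(u):
    """Step-style decision function: not differentiable at u = 0."""
    return np.where(u >= 0, 1.0, -1.0)

def logistic(u):
    """Smooth, differentiable replacement: logistic(u) = 1 / (1 + e^{-u})."""
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-4, 4, 9)
print(sign(u))      # jumps from -1 to +1 at u = 0
print(logistic(u))  # rises smoothly from about 0.02 to about 0.98
```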

  19. Logistic Regression. Data: Inputs are continuous vectors of length M. Outputs are discrete. Model: the logistic function applied to the dot product of the parameters with the input vector, $p_\theta(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\theta^T \mathbf{x})}$. Learning: find the parameters that minimize some objective function, $\theta^* = \operatorname{argmin}_\theta J(\theta)$. Prediction: output the most probable class, $\hat{y} = \operatorname{argmax}_{y \in \{0, 1\}} p_\theta(y \mid \mathbf{x})$. 23
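
A minimal sketch of the model and prediction rule (not from the slides; the function names are my own, and theta is assumed to already include the folded-in bias from the earlier hyperplanes slide):

```python
import numpy as np

def predict_proba(theta, x):
    """p_theta(y = 1 | x) for binary logistic regression."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

def predict(theta, x):
    """Most probable class: 1 if p_theta(y = 1 | x) >= 0.5, else 0."""
    return 1 if predict_proba(theta, x) >= 0.5 else 0

theta = np.array([0.5, -1.0, 2.0])   # hypothetical parameters (bias folded in)
x = np.array([1.0, 0.3, 0.8])        # input with the leading constant 1
print(predict_proba(theta, x), predict(theta, x))  # approx. 0.858, class 1
```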

  20. Logistic Regression Whiteboard – Bernoulli interpretation – Logistic Regression Model – Decision boundary 24

  21. Learning for Logistic Regression Whiteboard – Partial derivative for Logistic Regression – Gradient for Logistic Regression 25

  22. LOGISTIC REGRESSION ON GAUSSIAN DATA 26

  23.–25. Logistic Regression. [Figure-only slides: plots of logistic regression applied to Gaussian data (pages 27–29).]
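
To reproduce the kind of data shown in these figures, here is a hedged sketch (not from the slides; the class means, covariance, and sample sizes are made up) that samples two Gaussian classes a logistic regression model could then be fit to:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D Gaussian classes; means and the shared identity covariance are made up.
n_per_class = 100
X_pos = rng.multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2), size=n_per_class)
X_neg = rng.multivariate_normal(mean=[-1.0, -1.0], cov=np.eye(2), size=n_per_class)

X = np.vstack([X_pos, X_neg])                                        # shape (200, 2)
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])    # labels in {0, 1}

# Fold in the bias term as on the earlier "hyperplanes" slide.
X = np.hstack([np.ones((X.shape[0], 1)), X])                         # shape (200, 3)
```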

  26. LEARNING LOGISTIC REGRESSION 30

  27. Maximum Conditional Likelihood Estimation. Learning: find the parameters that minimize some objective function, $\theta^* = \operatorname{argmin}_\theta J(\theta)$. We minimize the negative log conditional likelihood: $J(\theta) = -\log \prod_{i=1}^{N} p_\theta(y^{(i)} \mid \mathbf{x}^{(i)})$. Why? 1. We can't maximize the likelihood (as in Naïve Bayes) because we don't have a joint model p(x, y). 2. It worked well for Linear Regression (least squares is MCLE). 31
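
For reference, expanding the objective and taking the partial derivative (a standard derivation consistent with the whiteboard items on slide 21, with $y^{(i)} \in \{0,1\}$ and the shorthand $\mu^{(i)} = p_\theta(y = 1 \mid \mathbf{x}^{(i)})$, which is my own notation):

```latex
% Negative log conditional likelihood and its gradient.
\begin{align*}
J(\theta) &= -\sum_{i=1}^{N} \Big[ y^{(i)} \log \mu^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \mu^{(i)}\big) \Big],
\qquad \mu^{(i)} = \frac{1}{1 + \exp(-\theta^T \mathbf{x}^{(i)})} \\
\frac{\partial J(\theta)}{\partial \theta_j}
  &= \sum_{i=1}^{N} \big( \mu^{(i)} - y^{(i)} \big)\, x_j^{(i)},
\qquad
\nabla_\theta J(\theta) = \sum_{i=1}^{N} \big( \mu^{(i)} - y^{(i)} \big)\, \mathbf{x}^{(i)}.
\end{align*}
```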

  28. Maximum Conditional Likelihood Estimation θ ∗ = argmin Learning: Four approaches to solving J ( θ ) θ Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient) Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient) Approach 3: Newton’s Method (use second derivatives to better follow curvature) Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters) 32

  29. Maximum Conditional Likelihood Estimation θ ∗ = argmin Learning: Four approaches to solving J ( θ ) θ Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient) Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient) Approach 3: Newton’s Method (use second derivatives to better follow curvature) Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters) Logistic Regression does not have a closed form solution for MLE parameters. 33
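
For Approach 3, the update the slide alludes to is the standard Newton step (not written out on the slide; my reconstruction, using the Hessian of $J$):

```latex
% Newton's method step for minimizing J(theta): follow local curvature via the Hessian.
\theta^{(t+1)} = \theta^{(t)} - \Big[ \nabla^2_\theta J\big(\theta^{(t)}\big) \Big]^{-1} \nabla_\theta J\big(\theta^{(t)}\big)
```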

  30. SGD for Logistic Regression Question: Which of the following is a correct description of SGD for Logistic Regression? Answer: At each step (i.e. iteration) of SGD for Logistic Regression we… A. (1) compute the gradient of the log-likelihood for all examples (2) update all the parameters using the gradient B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer C. (1) compute the gradient of the log-likelihood for all examples (2) randomly pick an example (3) update only the parameters for that example D. (1) randomly pick a parameter, (2) compute the partial derivative of the log- likelihood with respect to that parameter, (3) update that parameter for all examples E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood for that example, (3) update all the parameters using that gradient F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log-likelihood for that example with respect to that parameter, (3) update that parameter using that gradient 34

  31. Recall… Gradient Descent. Algorithm 1 (Gradient Descent): procedure GD($\mathcal{D}$, $\theta^{(0)}$): initialize $\theta \leftarrow \theta^{(0)}$; while not converged, update $\theta \leftarrow \theta - \lambda \nabla_\theta J(\theta)$; return $\theta$. In order to apply GD to Logistic Regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives): $\nabla_\theta J(\theta) = \big[ \tfrac{\partial J(\theta)}{\partial \theta_1}, \tfrac{\partial J(\theta)}{\partial \theta_2}, \ldots, \tfrac{\partial J(\theta)}{\partial \theta_M} \big]^T$. 35
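
A minimal sketch of Approach 1 applied to logistic regression (not from the slides; the step size and the fixed iteration count in place of a convergence test are arbitrary choices), reusing the gradient derived above:

```python
import numpy as np

def nll_gradient(theta, X, y):
    """Gradient of the negative log conditional likelihood over all N examples.

    X has shape (N, M), y has shape (N,) with labels in {0, 1}.
    """
    mu = 1.0 / (1.0 + np.exp(-X @ theta))   # p_theta(y = 1 | x) for every example
    return X.T @ (mu - y)

def gradient_descent(X, y, step_size=0.1, num_steps=1000):
    """Approach 1: repeatedly step opposite the full-batch gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        theta = theta - step_size * nll_gradient(theta, X, y)
    return theta
```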

  32. Recall… Stochastic Gradient Descent (SGD). We can also apply SGD to solve the MCLE problem for Logistic Regression. We need a per-example objective: let $J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta)$, where $J^{(i)}(\theta) = -\log p_\theta(y^{(i)} \mid \mathbf{x}^{(i)})$. 36
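
A minimal sketch of the corresponding SGD loop (not from the slides; the step size, epoch count, and shuffling scheme are my own choices). Each step uses the gradient of $J^{(i)}$ for a single randomly chosen example and updates all the parameters:

```python
import numpy as np

def sgd(X, y, step_size=0.1, num_epochs=10, seed=0):
    """Approach 2: per-example updates using the gradient of J^(i)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(X.shape[0]):        # visit examples in random order
            mu_i = 1.0 / (1.0 + np.exp(-X[i] @ theta))
            grad_i = (mu_i - y[i]) * X[i]            # gradient of J^(i)(theta)
            theta = theta - step_size * grad_i       # update all the parameters
    return theta
```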

  33. Logistic Regression vs. Perceptron Question: True or False: Just like Perceptron, one step (i.e. iteration) of SGD for Logistic Regression will result in a change to the parameters only if the current example is incorrectly classified. Answer: 37

  34. Matching Game. Goal: match the algorithm to its update rule. Algorithms: 1. SGD for Logistic Regression, $h_\theta(\mathbf{x}) = p(y \mid \mathbf{x})$; 2. Least Mean Squares, $h_\theta(\mathbf{x}) = \theta^T \mathbf{x}$; 3. Perceptron, $h_\theta(\mathbf{x}) = \operatorname{sign}(\theta^T \mathbf{x})$. Update rules: 4. $\theta_k \leftarrow \theta_k + (h_\theta(\mathbf{x}^{(i)}) - y^{(i)})$; 5. $\theta_k \leftarrow \theta_k + \frac{1}{1 + \exp \lambda (h_\theta(\mathbf{x}^{(i)}) - y^{(i)})}$; 6. $\theta_k \leftarrow \theta_k + \lambda (h_\theta(\mathbf{x}^{(i)}) - y^{(i)}) x_k^{(i)}$. Options: A. 1=5, 2=4, 3=6; B. 1=5, 2=6, 3=4; C. 1=6, 2=4, 3=4; D. 1=5, 2=6, 3=6; E. 1=6, 2=6, 3=6; F. 1=6, 2=5, 3=5; G. 1=5, 2=5, 3=5; H. 1=4, 2=5, 3=6. 38

  35. OPTIMIZATION METHOD #4: MINI-BATCH SGD 39

  36. Mini-Batch SGD • Gradient Descent : Compute true gradient exactly from all N examples • Stochastic Gradient Descent (SGD) : Approximate true gradient by the gradient of one randomly chosen example • Mini-Batch SGD : Approximate true gradient by the average gradient of K randomly chosen examples 40
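
A minimal sketch of the mini-batch variant (not from the slides; the batch size, step size, and step count are arbitrary): average the per-example gradients of K randomly chosen examples at each step.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=16, step_size=0.1, num_steps=500, seed=0):
    """Approximate the true gradient by the average gradient of K random examples."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        batch = rng.choice(N, size=batch_size, replace=False)    # K random examples
        mu = 1.0 / (1.0 + np.exp(-X[batch] @ theta))
        grad = X[batch].T @ (mu - y[batch]) / batch_size         # average gradient
        theta = theta - step_size * grad
    return theta
```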

  37. Mini-Batch SGD Three variants of first-order optimization: 41
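
The slide's comparison is not captured in this transcript; the three update rules being contrasted are presumably of the following form (my reconstruction, with step size $\lambda$ and per-example objectives $J^{(i)}$ as defined on slide 32):

```latex
% Three variants of first-order optimization for J(theta) = sum_i J^(i)(theta).
\begin{align*}
\text{Gradient Descent:} \quad
  & \theta \leftarrow \theta - \lambda \, \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta J^{(i)}(\theta) \\
\text{SGD:} \quad
  & \theta \leftarrow \theta - \lambda \, \nabla_\theta J^{(i)}(\theta),
    \quad i \sim \mathrm{Uniform}(\{1, \ldots, N\}) \\
\text{Mini-Batch SGD:} \quad
  & \theta \leftarrow \theta - \lambda \, \frac{1}{K} \sum_{i \in S} \nabla_\theta J^{(i)}(\theta),
    \quad S \subseteq \{1, \ldots, N\}, \; |S| = K
\end{align*}
```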
