Lecture 9: Logistic Regression, Discriminative vs. Generative Classification, Linear Discriminant Functions


  1. Lecture 9: Logistic Regression − Discriminative vs. Generative Classification − Linear Discriminant Functions. Aykut Erdem, October 2017, Hacettepe University

  2. Administrative
     • Midterm exam will be held on November 6.
     • Project proposals are due today! No lecture next Thursday, but we will talk about your proposals.
     • Assignment 3 will be out next week.
     • Make-up lecture will be on November 11; we will check the availability of the classrooms.

  3. Last time… Naïve Bayes Classifier
     • Given: the class prior P(Y), and d conditionally independent features X_1, …, X_d given the class label Y.
     • For each feature X_i, we have the conditional likelihood P(X_i | Y).
     • Naïve Bayes decision rule: predict y = arg max_y P(Y = y) ∏_{i=1..d} P(X_i | Y = y).
     (slide by Barnabás Póczos & Aarti Singh)

  4. Last time… Naïve Bayes algorithm for discrete features
     • We need to estimate these probabilities! Maximum-likelihood estimators:
       - Class prior: P(Y = y) estimated as (# training examples with Y = y) / (total # of training examples).
       - Likelihood: P(X_i = x | Y = y) estimated as (# examples with X_i = x and Y = y) / (# examples with Y = y).
     • NB prediction for test data: y = arg max_y P(Y = y) ∏_i P(X_i = x_i | Y = y) using the estimated probabilities.
     (slide by Barnabás Póczos & Aarti Singh)

  5. Last time… Text Classification
     • Example: assign a MEDLINE article to a category in the MeSH subject category hierarchy (Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, …).
     • How to represent a text document?
     (slide by Dan Jurafsky)

  6. Last time… Bag-of-words model
     • Typical additional assumption: position in the document doesn't matter, i.e. P(X_i = x | Y = y) = P(X_k = x | Y = y).
     • "Bag of words" model: the order of words on the page is ignored; the document is just a bag of i.i.d. words. Sounds really silly, but often works very well!
     • With a 50,000-word vocabulary this leaves K(50,000 − 1) parameters to estimate.
     • The probability of a document with words x_1, x_2, … is P(x_1, …, x_n | Y = y) = ∏_i P(X = x_i | Y = y).
     (slide by Barnabás Póczos & Aarti Singh)

  7. Last time… Bag-of-words model: worked example
     Training set: doc 1 "Chinese Beijing Chinese" (class c), doc 2 "Chinese Chinese Shanghai" (class c), doc 3 "Chinese Macao" (class c), doc 4 "Tokyo Japan Chinese" (class j). Test: doc 5 "Chinese Chinese Chinese Tokyo Japan" (class ?).
     Estimators: class prior P(c) = N_c / N; likelihood with add-one smoothing P(w | c) = (count(w, c) + 1) / (count(c) + |V|).
     Priors: P(c) = 3/4, P(j) = 1/4.
     Conditional probabilities:
       P(Chinese | c) = (5+1)/(8+6) = 6/14 = 3/7     P(Chinese | j) = (1+1)/(3+6) = 2/9
       P(Tokyo | c)   = (0+1)/(8+6) = 1/14           P(Tokyo | j)   = (1+1)/(3+6) = 2/9
       P(Japan | c)   = (0+1)/(8+6) = 1/14           P(Japan | j)   = (1+1)/(3+6) = 2/9
     Choosing a class: P(c | d5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003 and P(j | d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001, so d5 is labeled c. (A code sketch follows below.)
     (slide by Dan Jurafsky)
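
A minimal sketch (my code, not the lecture's) of the multinomial Naïve Bayes classifier with add-one smoothing described above; the tiny training set is the one from the slide, and the helper names are made up for the example.

```python
# Multinomial Naive Bayes with add-one (Laplace) smoothing,
# reproducing the worked example from the slide.
from collections import Counter, defaultdict
from math import prod

train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

# Class prior P(c) = N_c / N and per-class word counts for the likelihood
# P(w | c) = (count(w, c) + 1) / (count(c) + |V|).
doc_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for doc, label in train:
    word_counts[label].update(doc.split())
vocab = {w for doc, _ in train for w in doc.split()}

def score(label):
    """Unnormalized posterior P(c) * prod_w P(w | c), proportional to P(c | d)."""
    prior = doc_counts[label] / len(train)
    total = sum(word_counts[label].values())
    return prior * prod((word_counts[label][w] + 1) / (total + len(vocab))
                        for w in test_doc.split())

scores = {label: score(label) for label in doc_counts}
print(scores)                        # roughly {'c': 0.0003, 'j': 0.0001}
print(max(scores, key=scores.get))   # -> 'c'
```

Running it recovers the numbers on the slide: about 0.0003 for c and 0.0001 for j, so the test document is labeled c.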

  8. Last time… What if features are continuous?
     • E.g. character recognition: X_i is the intensity at the i-th pixel.
     • Gaussian Naïve Bayes (GNB): P(X_i = x | Y = y_k) = (1 / (√(2π) σ_ik)) exp(−(x − μ_ik)² / (2σ_ik²)), i.e. a different mean and variance for each class k and each pixel i.
     • Sometimes we assume the variance is independent of Y (i.e. σ_i), or independent of X_i (i.e. σ_k), or both (i.e. σ).
     (slide by Barnabás Póczos & Aarti Singh; a code sketch follows below)
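
Below is a small sketch (again mine, not the lecture's) of the GNB recipe just described: fit a per-class, per-feature mean and variance, then classify with arg max_y P(y) ∏_i N(x_i; μ_iy, σ_iy). The synthetic data and function names are only for illustration.

```python
# Gaussian Naive Bayes: per-class, per-feature mean and variance.
import numpy as np

def fit_gnb(X, y):
    """X: (n_samples, d) features, y: (n_samples,) integer class labels."""
    classes = np.unique(y)
    priors = np.array([(y == k).mean() for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0) + 1e-9 for k in classes])
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    # log P(y) + sum_i log N(x_i; mu_iy, sigma_iy), computed in log space.
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + (X[:, None, :] - means[None, :, :]) ** 2
                      / variances[None, :, :]).sum(axis=2)
    return classes[np.argmax(np.log(priors)[None, :] + log_lik, axis=1)]

# Tiny synthetic example (made up) just to exercise the code.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gnb(X, y)
print(predict_gnb(np.array([[0.2, -0.1], [2.8, 3.1]]), *params))  # -> [0 1]
```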

  9. Logistic Regression

  10. Recap: Naïve Bayes
      • NB assumption: P(X_1, …, X_d | Y) = ∏_{i=1..d} P(X_i | Y).
      • NB classifier: f_NB(x) = arg max_y P(x_1, …, x_d | y) P(y).
      • Assume a parametric form for P(X_i | Y) and P(Y); estimate the parameters using MLE/MAP and plug them in.
      (slide by Aarti Singh & Barnabás Póczos)

  11. Gaussian Naïve Bayes (GNB)
      • There are several distributions that can lead to a linear decision boundary. As an example, consider Gaussian Naïve Bayes with Gaussian class-conditional densities P(X_i | Y = y_k) = N(μ_ik, σ_ik²) and class prior P(Y = 1) = π.
      • What if we assume the variance is independent of the class, i.e. σ_i1 = σ_i0 = σ_i?
      (slide by Aarti Singh & Barnabás Póczos)
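
For reference, here is the assumed model written out explicitly, a standard way to state GNB with the shared-variance assumption just mentioned (this block is added for clarity; it is not on the slide):

```latex
P(Y = 1) = \pi, \qquad
P(X_i = x \mid Y = y_k)
  = \frac{1}{\sqrt{2\pi}\,\sigma_i}
    \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_i^2}\right),
\qquad i = 1,\dots,d, \;\; k \in \{0, 1\}.
```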

  12. GNB with equal variance is a linear classifier!
      Decision boundary: P(Y = 0) ∏_{i=1..d} P(X_i | Y = 0) = P(Y = 1) ∏_{i=1..d} P(X_i | Y = 1).
      (slide by Aarti Singh & Barnabás Póczos)

  13. GNB with equal variance is a linear classifier!
      Decision boundary: P(Y = 0) ∏_{i=1..d} P(X_i | Y = 0) = P(Y = 1) ∏_{i=1..d} P(X_i | Y = 1).
      Taking logs: log [ P(Y = 0) ∏_i P(X_i | Y = 0) / ( P(Y = 1) ∏_i P(X_i | Y = 1) ) ] = log((1 − π)/π) + Σ_{i=1..d} log [ P(X_i | Y = 0) / P(X_i | Y = 1) ].
      (slide by Aarti Singh & Barnabás Póczos)

  14. GNB with equal variance is a linear classifier!
      Decision boundary (as above): log((1 − π)/π) + Σ_{i=1..d} log [ P(X_i | Y = 0) / P(X_i | Y = 1) ] = 0.
      With equal variances, each log-ratio term splits into a constant term plus a first-order term (linear in X_i), so the boundary is linear in X.
      (slide by Aarti Singh & Barnabás Póczos)
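
Filling in the algebra behind the "constant term / first-order term" labels (a standard derivation, stated here under the shared-variance assumption above): the quadratic terms cancel, so each log-ratio is affine in X_i,

```latex
\log\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  = \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}
  = \underbrace{\frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}}_{\text{constant term}}
    \;+\;
    \underbrace{\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i}_{\text{first-order term}} .
```

Collecting terms, the boundary has the form w_0 + Σ_i w_i X_i = 0 with w_i = (μ_i0 − μ_i1)/σ_i² and w_0 = log((1 − π)/π) + Σ_i (μ_i1² − μ_i0²)/(2σ_i²), i.e. a hyperplane.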

  15. Gaussian Naïve Bayes (GNB) decision boundary
      [Figure: decision boundary in the plane X = (x_1, x_2) for two Gaussian class-conditional densities.]
      Notation: priors P_1 = P(Y = 0), P_2 = P(Y = 1); class-conditional densities p_1(X) = p(X | Y = 0) ∼ N(M_1, Σ_1) and p_2(X) = p(X | Y = 1) ∼ N(M_2, Σ_2).
      (slide by Aarti Singh & Barnabás Póczos)

  16. Generative vs. Discriminative Classifiers
      • Generative classifiers (e.g. Naïve Bayes):
        - Assume some functional form for P(X, Y) (or for P(X | Y) and P(Y)).
        - Estimate the parameters of P(X | Y) and P(Y) directly from training data.
      • But arg max_Y P(X | Y) P(Y) = arg max_Y P(Y | X). Why not learn P(Y | X) directly? Or better yet, why not learn the decision boundary directly?
      • Discriminative classifiers (e.g. Logistic Regression):
        - Assume some functional form for P(Y | X) or for the decision boundary.
        - Estimate the parameters of P(Y | X) directly from training data.
      (slide by Aarti Singh & Barnabás Póczos)

  17. Logistic Regression
      • Assumes the following functional form for P(Y | X):
        P(Y = 1 | X) = 1 / (1 + exp(−(w_0 + Σ_i w_i X_i))),
        i.e. the logistic function (or sigmoid) 1 / (1 + e^(−z)) applied to a linear function of the data.
      • Features can be discrete or continuous! (A code sketch follows below.)
      (slide by Aarti Singh & Barnabás Póczos)
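
A tiny sketch of that functional form (the weights below are made-up numbers, just to show the call):

```python
# P(Y=1 | X) for logistic regression: sigmoid of a linear function of X.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w0, w):
    """P(Y=1 | X=x) = sigmoid(w0 + sum_i w_i * x_i);
    the features x_i may be discrete or continuous."""
    return sigmoid(w0 + np.dot(w, x))

print(p_y1_given_x(np.array([1.0, -2.0]), w0=0.5, w=np.array([2.0, 1.0])))  # ~0.62
```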

  18. Logistic Regression is a linear classifier!
      • Functional form for P(Y | X): P(Y = 1 | X) = exp(w_0 + Σ_i w_i X_i) / (1 + exp(w_0 + Σ_i w_i X_i)) and P(Y = 0 | X) = 1 / (1 + exp(w_0 + Σ_i w_i X_i)).
      • Decision boundary: predict the more probable class; the boundary P(Y = 0 | X) = P(Y = 1 | X) is where w_0 + Σ_i w_i X_i = 0, a linear decision boundary.
      (slide by Aarti Singh & Barnabás Póczos)
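
Spelling out why the boundary is linear (standard algebra, consistent with the form of P(Y | X) above):

```latex
P(Y=0 \mid X) > P(Y=1 \mid X)
\;\Longleftrightarrow\;
1 > \exp\!\Big(w_0 + \sum_{i=1}^{d} w_i X_i\Big)
\;\Longleftrightarrow\;
w_0 + \sum_{i=1}^{d} w_i X_i < 0 ,
```

so the set where the two posteriors are equal is the hyperplane w_0 + Σ_i w_i X_i = 0.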

  19. Logistic Regression is a linear classifier!
      Assumes the same functional form for P(Y | X) as on the previous slide. [Figure not reproduced.]
      (slide by Aarti Singh & Barnabás Póczos)

  20. Logistic Regression for more than 2 classes
      • Logistic regression in the more general case, where Y ∈ {y_1, …, y_K}:
        for k < K:  P(Y = y_k | X) = exp(w_k0 + Σ_i w_ki X_i) / (1 + Σ_{j=1..K−1} exp(w_j0 + Σ_i w_ji X_i))
        for k = K:  P(Y = y_K | X) = 1 / (1 + Σ_{j=1..K−1} exp(w_j0 + Σ_i w_ji X_i))   (normalization, so no weights for this class)
      (slide by Aarti Singh & Barnabás Póczos; a code sketch follows below)
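
A small sketch of these K-class probabilities (my illustration; the weight matrix below is hypothetical), with one weight vector per class for k < K and none for the last class:

```python
# Multiclass logistic regression probabilities.
# W has one row (w_k0, w_k1, ..., w_kd) per class k = 1..K-1;
# class K has no weights and acts as the normalizer.
import numpy as np

def class_probs(x, W):
    """Return [P(Y=y_1 | x), ..., P(Y=y_K | x)] for a feature vector x of length d."""
    scores = np.exp(W[:, 0] + W[:, 1:] @ x)   # exp(w_k0 + sum_i w_ki x_i), k < K
    z = 1.0 + scores.sum()                    # normalization constant
    return np.append(scores / z, 1.0 / z)     # last entry is class K

W = np.array([[0.1, 1.0, -0.5],               # hypothetical weights: K = 3, d = 2
              [-0.2, 0.3, 0.8]])
print(class_probs(np.array([1.0, 2.0]), W))   # three probabilities summing to 1
```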

  21. Training Logistic Regression
      • We'll focus on binary classification: P(Y = 0 | X) = 1 / (1 + exp(w_0 + Σ_i w_i X_i)), P(Y = 1 | X) = exp(w_0 + Σ_i w_i X_i) / (1 + exp(w_0 + Σ_i w_i X_i)).
      • How to learn the parameters w_0, w_1, …, w_d?
      • Training data: {(X^l, Y^l)}, l = 1, …, N.
      • Maximum likelihood estimates. But there is a problem: we don't have a model for P(X) or P(X | Y), only for P(Y | X).
      (slide by Aarti Singh & Barnabás Póczos)

  22. Training Logistic Regression (continued)
      • Same setup as the previous slide: how to learn w_0, w_1, …, w_d from training data by maximum likelihood?
      • But there is a problem: we don't have a model for P(X) or P(X | Y), only for P(Y | X).
      (slide by Aarti Singh & Barnabás Póczos)

  23. Training Logistic Regression
      • How to learn the parameters w_0, w_1, …, w_d?
      • Training data: {(X^l, Y^l)}, l = 1, …, N.
      • Maximum (conditional) likelihood estimates: W_MCLE = arg max_W ∏_l P(Y^l | X^l, W).
      • Discriminative philosophy: don't waste effort learning P(X), focus on P(Y | X); that's all that matters for classification!
      (slide by Aarti Singh & Barnabás Póczos)

  24. Expressing the Conditional Log-Likelihood
      l(W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]
      where
      P(Y = 0 | X) = 1 / (1 + exp(w_0 + Σ_{i=1..n} w_i X_i))
      P(Y = 1 | X) = exp(w_0 + Σ_{i=1..n} w_i X_i) / (1 + exp(w_0 + Σ_{i=1..n} w_i X_i)).
      We can re-express the log of the conditional likelihood this way because Y can take only the values 0 or 1, so only one of the two terms in the expression is non-zero for any given Y^l.
      (slide by Aarti Singh & Barnabás Póczos)
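
Substituting the two expressions for P(Y | X, W) into l(W) and simplifying (a standard step; since each Y^l is 0 or 1, the algebra collapses) gives

```latex
l(W) = \sum_{l} \left[ Y^l \Big(w_0 + \sum_{i=1}^{n} w_i X_i^l\Big)
       - \ln\!\Big(1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i X_i^l\big)\Big) \right].
```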

  25. Expressing the Conditional Log-Likelihood
      l(W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]
      (slide by Aarti Singh & Barnabás Póczos; a training sketch follows below)
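
There is no closed-form maximizer of l(W), but l(W) is concave in W, so gradient ascent finds the global maximum. A minimal training sketch (my code, with made-up synthetic data), using the gradient ∂l/∂w_i = Σ_l X_i^l (Y^l − P(Y^l = 1 | X^l, W)):

```python
# Training binary logistic regression by gradient ascent on the
# conditional log-likelihood l(W).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.01, n_iters=2000):
    """X: (N, d) features, y: (N,) labels in {0, 1}. Returns (w0, w)."""
    N, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(w0 + X @ w)    # P(Y=1 | X^l, W) for every training example
        error = y - p              # Y^l - P(Y^l = 1 | X^l, W)
        w0 += lr * error.sum()     # ascent step for the bias term
        w += lr * (X.T @ error)    # dl/dw_i = sum_l X_i^l (Y^l - p^l)
    return w0, w

# Tiny synthetic example (made up) just to exercise the code.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w0, w = train_logreg(X, y)
print(w0, w)
print(((sigmoid(w0 + X @ w) > 0.5) == y).mean())   # training accuracy
```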
