

  1. Lecture 9: − Logistic Regression − Discriminative vs. Generative Classification Aykut Erdem March 2016 Hacettepe University

  2. Administrative • Assignment 2 is out! − It is due March 18 (i.e. in 2 weeks) − You will implement a Naive Bayes classifier for sentiment analysis on Twitter data • Project proposal due March 10! − a half-page description − problem to be investigated, why it is interesting, what data you will use, etc. − http://goo.gl/forms/S5sRXJhKUl

  3. This week • Logistic Regression • Discriminative vs. Generative Classification • Linear Discriminant Functions − Two Classes − Multiple Classes − Fisher's Linear Discriminant • Perceptron

  4. Logistic Regression

  5. Last time… Naïve Bayes • NB Assumption: P(X_1, …, X_d | Y) = ∏_{i=1}^{d} P(X_i | Y) • NB Classifier: ŷ = arg max_y P(Y = y) ∏_{i=1}^{d} P(X_i | Y = y) • Assume a parametric form for P(X_i | Y) and P(Y) − Estimate parameters using MLE/MAP and plug in (slide by Aarti Singh & Barnabás Póczos)
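
A minimal sketch of the plug-in NB decision rule above, with made-up Bernoulli parameter estimates standing in for the MLE/MAP values (binary features are assumed here purely for illustration):

    import numpy as np

    # Made-up plug-in estimates for d = 3 binary features (placeholders for MLE/MAP values)
    prior = {0: 0.6, 1: 0.4}                     # P(Y = y)
    theta = {0: np.array([0.2, 0.7, 0.5]),       # P(X_i = 1 | Y = 0)
             1: np.array([0.8, 0.3, 0.5])}       # P(X_i = 1 | Y = 1)

    def nb_predict(x):
        """arg max_y P(Y = y) * prod_i P(X_i = x_i | Y = y), using the NB assumption."""
        scores = {}
        for y in (0, 1):
            p_xi = np.where(x == 1, theta[y], 1 - theta[y])   # P(X_i = x_i | Y = y)
            scores[y] = prior[y] * np.prod(p_xi)
        return max(scores, key=scores.get)

    print(nb_predict(np.array([1, 0, 1])))       # predicted class label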

  6. Gaussian Naïve Bayes (GNB) • There are several distributions that can lead to a linear boundary. • As an example, consider Gaussian Naïve Bayes: P(X_i | Y = y) = N(μ_{iy}, σ_{iy}²) (Gaussian class-conditional densities) • What if we assume the variance is independent of class, i.e. σ_{iy} = σ_i? (slide by Aarti Singh & Barnabás Póczos)

  7. GNB with equal variance is a Linear Classifier! Decision boundary: ∏_{i=1}^{d} P(X_i | Y = 0) P(Y = 0) = ∏_{i=1}^{d} P(X_i | Y = 1) P(Y = 1) Taking logs (with π = P(Y = 1)): log [ P(Y = 0) ∏_{i=1}^{d} P(X_i | Y = 0) / ( P(Y = 1) ∏_{i=1}^{d} P(X_i | Y = 1) ) ] = log( (1 − π) / π ) + ∑_{i=1}^{d} log( P(X_i | Y = 0) / P(X_i | Y = 1) ) The first part is a constant term; with equal variances, the sum reduces to a first-order (linear) term in X. (slide by Aarti Singh & Barnabás Póczos)
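
To make the linearity concrete, here is a minimal sketch (with made-up GNB parameters mu0, mu1, sigma2, pi1; not from the slides) showing that the GNB posterior with class-independent variances equals a logistic function of a linear score:

    import numpy as np

    # Made-up GNB parameters: class-conditional means, shared (class-independent)
    # per-feature variances, and the class prior pi1 = P(Y = 1).
    mu0    = np.array([0.0, 1.0])    # means for class Y = 0
    mu1    = np.array([2.0, -1.0])   # means for class Y = 1
    sigma2 = np.array([1.0, 0.5])    # variances, independent of the class label
    pi1    = 0.4                     # P(Y = 1)

    # Weights of the equivalent linear (logistic) form P(Y=1|X) = 1 / (1 + exp(-(w0 + w.x)))
    w  = (mu1 - mu0) / sigma2
    w0 = np.log(pi1 / (1 - pi1)) + np.sum((mu0**2 - mu1**2) / (2 * sigma2))

    def p_y1_gnb(x):
        """P(Y=1|X) computed directly from the Gaussian class-conditionals.
        The shared Gaussian normalizers cancel in the log-ratio, so they are omitted."""
        log_p0 = np.log(1 - pi1) - np.sum((x - mu0) ** 2 / (2 * sigma2))
        log_p1 = np.log(pi1)     - np.sum((x - mu1) ** 2 / (2 * sigma2))
        return 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    def p_y1_linear(x):
        """The same posterior via the constant + first-order (linear) term."""
        return 1.0 / (1.0 + np.exp(-(w0 + w @ x)))

    x = np.array([0.3, 0.7])
    print(p_y1_gnb(x), p_y1_linear(x))   # the two values agree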

  8. Gaussian Naive Bayes (GNB) Decision Boundary [figure: decision boundary in the plane X = (x_1, x_2), with priors P_1 = P(Y = 0), P_2 = P(Y = 1) and class-conditional densities p_1(X) = p(X | Y = 0) ∼ N(M_1, Σ_1), p_2(X) = p(X | Y = 1) ∼ N(M_2, Σ_2)] (slide by Aarti Singh & Barnabás Póczos)

  9. Generative vs. Discriminative Classifiers • Generative classifiers (e.g. Naïve Bayes) − Assume some functional form for P(X,Y) (or P(X|Y) and P(Y)) − Estimate parameters of P(X|Y), P(Y) directly from training data • But arg max_Y P(X|Y) P(Y) = arg max_Y P(Y|X) • Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly? • Discriminative classifiers (e.g. Logistic Regression) − Assume some functional form for P(Y|X) or for the decision boundary − Estimate parameters of P(Y|X) directly from training data (slide by Aarti Singh & Barnabás Póczos)

  10. Logistic Regression Assumes the following functional form for P(Y|X): P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) i.e. the logistic function (or sigmoid) applied to a linear function of the data. [figure: plot of the logistic function against z] Features can be discrete or continuous! (slide by Aarti Singh & Barnabás Póczos)
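
A small sketch of this functional form (the weights and the feature vector are made-up values, just to show the shape of the computation):

    import numpy as np

    def sigmoid(z):
        """Logistic (sigmoid) function: maps any real z into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def p_y1_given_x(x, w0, w):
        """P(Y=1|X) = exp(w0 + w.x) / (1 + exp(w0 + w.x)) = sigmoid(w0 + w.x)."""
        return sigmoid(w0 + np.dot(w, x))

    w0, w = -1.0, np.array([0.8, -0.5])   # made-up weights
    x = np.array([2.0, 1.0])              # works for discrete or continuous features
    print(p_y1_given_x(x, w0, w))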

  11. Logistic Regression is a Linear Classifier! Assumes the following functional form for P(Y|X): P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) Decision boundary: predict Y = 1 when P(Y = 1 | X) / P(Y = 0 | X) > 1, i.e. when w_0 + ∑_{i=1}^{d} w_i X_i > 0 (a linear decision boundary). (slide by Aarti Singh & Barnabás Póczos)

  12. Logistic Regression is a Linear Classifier! Assumes the following functional form for P(Y|X): P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) Setting the two posteriors equal, P(Y = 0 | X) = P(Y = 1 | X), gives exp(w_0 + ∑_{i=1}^{d} w_i X_i) = 1, i.e. w_0 + ∑_{i=1}^{d} w_i X_i = 0, a hyperplane in feature space. (slide by Aarti Singh & Barnabás Póczos)
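
A quick check of the linear-boundary claim (a sketch reusing the made-up weights from the snippet above, not part of the slides): thresholding the posterior at 1/2 is exactly thresholding the linear score at 0.

    import numpy as np

    w0, w = -1.0, np.array([0.8, -0.5])              # made-up weights

    def predict(x):
        score = w0 + w @ x                           # linear function of the data
        p1 = np.exp(score) / (1 + np.exp(score))     # P(Y=1|X) in the slide's form
        assert (p1 > 0.5) == (score > 0)             # same decision either way
        return int(score > 0)

    print(predict(np.array([2.0, 1.0])), predict(np.array([-2.0, 1.0])))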

  13. Logistic Regression for more than 2 classes • Logistic regression in the more general case, where Y ∈ {y_1, …, y_K}: for k < K: P(Y = y_k | X) = exp(w_{k0} + ∑_{i=1}^{d} w_{ki} X_i) / (1 + ∑_{j=1}^{K−1} exp(w_{j0} + ∑_{i=1}^{d} w_{ji} X_i)) for k = K: P(Y = y_K | X) = 1 / (1 + ∑_{j=1}^{K−1} exp(w_{j0} + ∑_{i=1}^{d} w_{ji} X_i)) (normalization, so no weights for this class) (slide by Aarti Singh & Barnabás Póczos)
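
A minimal sketch of the K-class form (the weights are made-up random values; class y_K is the reference class that carries no weights of its own):

    import numpy as np

    d, K = 3, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(K - 1, d))    # made-up weights w_ki for k < K
    b = rng.normal(size=K - 1)         # made-up intercepts w_k0 for k < K

    def p_y_given_x(x):
        """Return [P(Y=y_1|X), ..., P(Y=y_K|X)] under the K-class logistic model."""
        scores = np.exp(b + W @ x)         # exp(w_k0 + sum_i w_ki x_i) for k < K
        Z = 1.0 + scores.sum()             # normalizer; class K contributes 1
        return np.append(scores, 1.0) / Z  # probabilities sum to 1

    x = rng.normal(size=d)
    p = p_y_given_x(x)
    print(p, p.sum())                      # the probabilities sum to 1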

  14. Training Logistic Regression We'll focus on binary classification: P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) How to learn the parameters w_0, w_1, …, w_d? Training data: {(X^l, Y^l)}_{l=1}^{N} Maximum Likelihood Estimates: w_MLE = arg max_w ∏_{l=1}^{N} P(X^l, Y^l | w) But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X). (slide by Aarti Singh & Barnabás Póczos)

  15. Training Logistic Regression We'll focus on binary classification: how to learn the parameters w_0, w_1, …, w_d? Training data and Maximum Likelihood Estimates as before. But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X). (slide by Aarti Singh & Barnabás Póczos)

  16. Training Logistic Regression How to learn the parameters w_0, w_1, …, w_d? Training data: {(X^l, Y^l)}_{l=1}^{N} Maximum (Conditional) Likelihood Estimates: w_MCLE = arg max_w ∏_{l=1}^{N} P(Y^l | X^l, w) Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), which is all that matters for classification! (slide by Aarti Singh & Barnabás Póczos)

  17. Expressing Conditional log Likelihood l(W) = ∑_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ] where P(Y = 0 | X) = 1 / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) and P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) This re-expresses the log of the conditional likelihood: Y can take only the values 0 or 1, so only one of the two terms in the expression is non-zero for any given Y^l. (slide by Aarti Singh & Barnabás Póczos)

  18. Expressing Conditional log Likelihood l(W) = ∑_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ] = ∑_l [ Y^l ln ( P(Y^l = 1 | X^l, W) / P(Y^l = 0 | X^l, W) ) + ln P(Y^l = 0 | X^l, W) ] = ∑_l [ Y^l (w_0 + ∑_{i=1}^{d} w_i X_i^l) − ln ( 1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i^l) ) ] (slide by Aarti Singh & Barnabás Póczos)
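
The simplified last line lends itself to a direct implementation; a small sketch with made-up training data (the rows of X play the role of X^l, the entries of y the role of Y^l):

    import numpy as np

    def cond_log_likelihood(w0, w, X, y):
        """l(W) = sum_l [ y_l * (w0 + w.x_l) - ln(1 + exp(w0 + w.x_l)) ]."""
        z = w0 + X @ w                         # one linear score per example
        return np.sum(y * z - np.log1p(np.exp(z)))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))                # made-up features
    y = rng.integers(0, 2, size=5)             # made-up binary labels
    print(cond_log_likelihood(0.0, np.zeros(2), X, y))   # equals 5 * ln(1/2) at w = 0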

  19. Maximizing Conditional log Likelihood Bad news: no closed-form solution to maximize l(w). Good news: l(w) is a concave function of w, and concave functions are easy to optimize (unique maximum). (slide by Aarti Singh & Barnabás Póczos)
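
The concavity claim can be checked from the expression on slide 18 (this derivation is a supplement, not on the slide itself); writing z_l = w_0 + ∑_i w_i X_i^l and using the convention X_0^l = 1 for the intercept:

    \frac{\partial^2 l(w)}{\partial w_i \, \partial w_j}
      = -\sum_l X_i^l \, X_j^l \, \sigma(z_l)\bigl(1 - \sigma(z_l)\bigr),
    \qquad \sigma(z) = \frac{e^z}{1 + e^z}

so the Hessian is −XᵀDX with D diagonal and non-negative, hence negative semi-definite, which is exactly the concavity being used here.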

  20. Optimizing concave/convex functions • Conditional likelihood for Logistic Regression is concave • Maximum of a concave function = minimum of a convex function Gradient Ascent (concave) / Gradient Descent (convex) Gradient: ∇_w l(w) = [ ∂l(w)/∂w_0, …, ∂l(w)/∂w_d ] Update rule: w ← w + η ∇_w l(w), with learning rate η > 0 (slide by Aarti Singh & Barnabás Póczos)

  21. Gradient Ascent for Logistic Regression Gradient ascent algorithm: iterate until change < ε Repeat: w_0 ← w_0 + η ∑_l [ Y^l − P(Y^l = 1 | X^l, w) ] for i = 1, …, d: w_i ← w_i + η ∑_l X_i^l [ Y^l − P(Y^l = 1 | X^l, w) ] Here P(Y^l = 1 | X^l, w) is what the current weights predict label Y should be. • Gradient ascent is the simplest of the optimization approaches − e.g. Newton's method, conjugate gradient ascent, IRLS (see Bishop 4.3.3) (slide by Aarti Singh & Barnabás Póczos)
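
Putting the update rule together, a minimal batch gradient-ascent sketch (made-up data and step size; a sketch, not the assignment's reference implementation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, eta=0.01, eps=1e-6, max_iters=10000):
        """Gradient ascent on l(W); eta is the learning rate, eps the stopping tolerance."""
        n, d = X.shape
        w0, w = 0.0, np.zeros(d)
        for _ in range(max_iters):
            p = sigmoid(w0 + X @ w)              # P(Y=1 | X^l, w) for every example
            err = y - p                          # Y^l - P(Y^l = 1 | X^l, w)
            dw0, dw = err.sum(), X.T @ err       # gradient of l(W)
            w0, w = w0 + eta * dw0, w + eta * dw
            if eta * np.sqrt(dw0 ** 2 + dw @ dw) < eps:   # iterate until change < eps
                break
        return w0, w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                                      # made-up features
    y = (X @ np.array([1.5, -2.0]) + rng.normal(size=200) > 0) * 1.0   # noisy labels
    print(fit_logistic_regression(X, y))

The step size η matters: too large and the updates overshoot or oscillate, too small and convergence is slow, which is the trade-off discussed on the next slide.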

  22. Effect of step-size η Large η → fast convergence, but larger residual error; also possible oscillations Small η → slow convergence, but small residual error (slide by Aarti Singh & Barnabás Póczos)

  23. Naïve Bayes vs. Logistic Regression Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ Set of Logistic Regression parameters • Representation equivalence − But only in a special case!!! (GNB with class-independent variances) • But what's the difference??? (slide by Aarti Singh & Barnabás Póczos)

  24. Naïve Bayes vs. Logistic Regression Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ Set of Logistic Regression parameters • Representation equivalence − But only in a special case!!! (GNB with class-independent variances) • But what's the difference??? • LR makes no assumption about P(X|Y) in learning!!! • Loss function!!! − They optimize different functions and obtain different solutions. (slide by Aarti Singh & Barnabás Póczos)

  25. Naïve Bayes vs. Logistic Regression Consider Y Boolean, X_i continuous, X = <X_1, …, X_d> Number of parameters: • NB: 4d + 1: the prior π plus, for each class y = 0, 1, the means (μ_{1,y}, …, μ_{d,y}) and variances (σ²_{1,y}, …, σ²_{d,y}); e.g. for d = 10 that is 41 parameters. • LR: d + 1: w_0, w_1, …, w_d; e.g. for d = 10 that is 11 parameters. Estimation method: • NB parameter estimates are uncoupled • LR parameter estimates are coupled (slide by Aarti Singh & Barnabás Póczos)
