

  1. Lecture 9: − Logistic Regression − Discriminative vs. Generative Classification Aykut Erdem March 2016 Hacettepe University

  2. Administrative • Assignment 2 is out! − It is due March 18 (i.e. in 2 weeks) − You will implement a Naive Bayes classifier for sentiment analysis on Twitter data • Project proposal due March 10! − a half-page description − problem to be investigated, why it is interesting, what data you will use, etc. − http://goo.gl/forms/S5sRXJhKUl

  3. This week • Logistic Regression • Discriminative vs. Generative Classification • Linear Discriminant Functions − Two Classes − Multiple Classes − Fisher's Linear Discriminant • Perceptron

  4. Logistic Regression

  5. Last time… Naïve Bayes • NB Assumption: P(X_1, …, X_d | Y) = ∏_{i=1}^{d} P(X_i | Y) • NB Classifier: ŷ = arg max_y P(Y = y) ∏_{i=1}^{d} P(X_i | Y = y) • Assume a parametric form for P(X_i | Y) and P(Y) − Estimate parameters using MLE/MAP and plug in (slide by Aarti Singh & Barnabás Póczos)
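
A minimal sketch of the plug-in NB decision rule above, with made-up Bernoulli parameter estimates standing in for the MLE/MAP values (binary features are assumed here purely for illustration):

    import numpy as np

    # Made-up plug-in estimates for d = 3 binary features (placeholders for MLE/MAP values)
    prior = {0: 0.6, 1: 0.4}                     # P(Y = y)
    theta = {0: np.array([0.2, 0.7, 0.5]),       # P(X_i = 1 | Y = 0)
             1: np.array([0.8, 0.3, 0.5])}       # P(X_i = 1 | Y = 1)

    def nb_predict(x):
        """arg max_y P(Y = y) * prod_i P(X_i = x_i | Y = y), using the NB assumption."""
        scores = {}
        for y in (0, 1):
            p_xi = np.where(x == 1, theta[y], 1 - theta[y])   # P(X_i = x_i | Y = y)
            scores[y] = prior[y] * np.prod(p_xi)
        return max(scores, key=scores.get)

    print(nb_predict(np.array([1, 0, 1])))       # predicted class label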

  6. Gaussian Naïve Bayes (GNB) • There are several distributions that can lead to a linear boundary. • As an example, consider Gaussian Naïve Bayes: P(X_i | Y = y) = N(μ_{iy}, σ_{iy}²) (Gaussian class-conditional densities) • What if we assume the variance is independent of class, i.e. σ_{iy} = σ_i? (slide by Aarti Singh & Barnabás Póczos)

  7. GNB with equal variance is a Linear Classifier! Decision boundary: ∏_{i=1}^{d} P(X_i | Y = 0) P(Y = 0) = ∏_{i=1}^{d} P(X_i | Y = 1) P(Y = 1) Taking logs (with π = P(Y = 1)): log [ P(Y = 0) ∏_{i=1}^{d} P(X_i | Y = 0) / ( P(Y = 1) ∏_{i=1}^{d} P(X_i | Y = 1) ) ] = log( (1 − π) / π ) + ∑_{i=1}^{d} log( P(X_i | Y = 0) / P(X_i | Y = 1) ) The first part is a constant term; with equal variances, the sum reduces to a first-order (linear) term in X. (slide by Aarti Singh & Barnabás Póczos)
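
To make the linearity concrete, here is a minimal sketch (with made-up GNB parameters mu0, mu1, sigma2, pi1; not from the slides) showing that the GNB posterior with class-independent variances equals a logistic function of a linear score:

    import numpy as np

    # Made-up GNB parameters: class-conditional means, shared (class-independent)
    # per-feature variances, and the class prior pi1 = P(Y = 1).
    mu0    = np.array([0.0, 1.0])    # means for class Y = 0
    mu1    = np.array([2.0, -1.0])   # means for class Y = 1
    sigma2 = np.array([1.0, 0.5])    # variances, independent of the class label
    pi1    = 0.4                     # P(Y = 1)

    # Weights of the equivalent linear (logistic) form P(Y=1|X) = 1 / (1 + exp(-(w0 + w.x)))
    w  = (mu1 - mu0) / sigma2
    w0 = np.log(pi1 / (1 - pi1)) + np.sum((mu0**2 - mu1**2) / (2 * sigma2))

    def p_y1_gnb(x):
        """P(Y=1|X) computed directly from the Gaussian class-conditionals.
        The shared Gaussian normalizers cancel in the log-ratio, so they are omitted."""
        log_p0 = np.log(1 - pi1) - np.sum((x - mu0) ** 2 / (2 * sigma2))
        log_p1 = np.log(pi1)     - np.sum((x - mu1) ** 2 / (2 * sigma2))
        return 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    def p_y1_linear(x):
        """The same posterior via the constant + first-order (linear) term."""
        return 1.0 / (1.0 + np.exp(-(w0 + w @ x)))

    x = np.array([0.3, 0.7])
    print(p_y1_gnb(x), p_y1_linear(x))   # the two values agree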

  8. Gaussian Naive Bayes (GNB) Decision Boundary [figure: decision boundary in the plane X = (x_1, x_2), with priors P_1 = P(Y = 0), P_2 = P(Y = 1) and class-conditional densities p_1(X) = p(X | Y = 0) ∼ N(M_1, Σ_1), p_2(X) = p(X | Y = 1) ∼ N(M_2, Σ_2)] (slide by Aarti Singh & Barnabás Póczos)

  9. Generative vs. Discriminative Classifiers • Generative classifiers (e.g. Naïve Bayes) − Assume some functional form for P(X,Y) (or P(X|Y) and P(Y)) − Estimate parameters of P(X|Y), P(Y) directly from training data • But arg max_Y P(X|Y) P(Y) = arg max_Y P(Y|X) • Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly? • Discriminative classifiers (e.g. Logistic Regression) − Assume some functional form for P(Y|X) or for the decision boundary − Estimate parameters of P(Y|X) directly from training data (slide by Aarti Singh & Barnabás Póczos)

  10. Logistic Regression Assumes the following functional form for P(Y|X): P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) i.e. the logistic function (or sigmoid) applied to a linear function of the data. [figure: plot of the logistic function against z] Features can be discrete or continuous! (slide by Aarti Singh & Barnabás Póczos)
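
A small sketch of this functional form (the weights and the feature vector are made-up values, just to show the shape of the computation):

    import numpy as np

    def sigmoid(z):
        """Logistic (sigmoid) function: maps any real z into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def p_y1_given_x(x, w0, w):
        """P(Y=1|X) = exp(w0 + w.x) / (1 + exp(w0 + w.x)) = sigmoid(w0 + w.x)."""
        return sigmoid(w0 + np.dot(w, x))

    w0, w = -1.0, np.array([0.8, -0.5])   # made-up weights
    x = np.array([2.0, 1.0])              # works for discrete or continuous features
    print(p_y1_given_x(x, w0, w))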

  11. Logistic Regression is a Linear Classifier! Assumes the following functional form for P(Y|X): P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) Decision boundary: predict Y = 1 when P(Y = 1 | X) / P(Y = 0 | X) > 1, i.e. when w_0 + ∑_{i=1}^{d} w_i X_i > 0 (a linear decision boundary). (slide by Aarti Singh & Barnabás Póczos)

  12. Logistic Regression is a Linear Classifier! Assumes the following functional form for P(Y|X): P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) Setting the two posteriors equal, P(Y = 0 | X) = P(Y = 1 | X), gives exp(w_0 + ∑_{i=1}^{d} w_i X_i) = 1, i.e. w_0 + ∑_{i=1}^{d} w_i X_i = 0, a hyperplane in feature space. (slide by Aarti Singh & Barnabás Póczos)
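
A quick check of the linear-boundary claim (a sketch reusing the made-up weights from the snippet above, not part of the slides): thresholding the posterior at 1/2 is exactly thresholding the linear score at 0.

    import numpy as np

    w0, w = -1.0, np.array([0.8, -0.5])              # made-up weights

    def predict(x):
        score = w0 + w @ x                           # linear function of the data
        p1 = np.exp(score) / (1 + np.exp(score))     # P(Y=1|X) in the slide's form
        assert (p1 > 0.5) == (score > 0)             # same decision either way
        return int(score > 0)

    print(predict(np.array([2.0, 1.0])), predict(np.array([-2.0, 1.0])))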

  13. Logistic Regression for more than 2 classes • Logistic regression in the more general case, where Y ∈ {y_1, …, y_K}: for k < K: P(Y = y_k | X) = exp(w_{k0} + ∑_{i=1}^{d} w_{ki} X_i) / (1 + ∑_{j=1}^{K−1} exp(w_{j0} + ∑_{i=1}^{d} w_{ji} X_i)) for k = K: P(Y = y_K | X) = 1 / (1 + ∑_{j=1}^{K−1} exp(w_{j0} + ∑_{i=1}^{d} w_{ji} X_i)) (normalization, so no weights for this class) (slide by Aarti Singh & Barnabás Póczos)
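
A minimal sketch of the K-class form (the weights are made-up random values; class y_K is the reference class that carries no weights of its own):

    import numpy as np

    d, K = 3, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(K - 1, d))    # made-up weights w_ki for k < K
    b = rng.normal(size=K - 1)         # made-up intercepts w_k0 for k < K

    def p_y_given_x(x):
        """Return [P(Y=y_1|X), ..., P(Y=y_K|X)] under the K-class logistic model."""
        scores = np.exp(b + W @ x)         # exp(w_k0 + sum_i w_ki x_i) for k < K
        Z = 1.0 + scores.sum()             # normalizer; class K contributes 1
        return np.append(scores, 1.0) / Z  # probabilities sum to 1

    x = rng.normal(size=d)
    p = p_y_given_x(x)
    print(p, p.sum())                      # the probabilities sum to 1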

  14. Training Logistic Regression We'll focus on binary classification: P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) How to learn the parameters w_0, w_1, …, w_d? Training data: {(X^l, Y^l)}_{l=1}^{N} Maximum Likelihood Estimates: w_MLE = arg max_w ∏_{l=1}^{N} P(X^l, Y^l | w) But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X). (slide by Aarti Singh & Barnabás Póczos)

  15. Training Logistic Regression We'll focus on binary classification: how to learn the parameters w_0, w_1, …, w_d? Training data and Maximum Likelihood Estimates as before. But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X). (slide by Aarti Singh & Barnabás Póczos)

  16. Training Logistic Regression How to learn the parameters w_0, w_1, …, w_d? Training data: {(X^l, Y^l)}_{l=1}^{N} Maximum (Conditional) Likelihood Estimates: w_MCLE = arg max_w ∏_{l=1}^{N} P(Y^l | X^l, w) Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), which is all that matters for classification! (slide by Aarti Singh & Barnabás Póczos)

  17. Expressing Conditional log Likelihood l(W) = ∑_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ] where P(Y = 0 | X) = 1 / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) and P(Y = 1 | X) = exp(w_0 + ∑_{i=1}^{d} w_i X_i) / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i)) This re-expresses the log of the conditional likelihood: Y can take only the values 0 or 1, so only one of the two terms in the expression is non-zero for any given Y^l. (slide by Aarti Singh & Barnabás Póczos)

  18. Expressing Conditional log Likelihood l(W) = ∑_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ] = ∑_l [ Y^l ln ( P(Y^l = 1 | X^l, W) / P(Y^l = 0 | X^l, W) ) + ln P(Y^l = 0 | X^l, W) ] = ∑_l [ Y^l (w_0 + ∑_{i=1}^{d} w_i X_i^l) − ln ( 1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i^l) ) ] (slide by Aarti Singh & Barnabás Póczos)
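
The simplified last line lends itself to a direct implementation; a small sketch with made-up training data (the rows of X play the role of X^l, the entries of y the role of Y^l):

    import numpy as np

    def cond_log_likelihood(w0, w, X, y):
        """l(W) = sum_l [ y_l * (w0 + w.x_l) - ln(1 + exp(w0 + w.x_l)) ]."""
        z = w0 + X @ w                         # one linear score per example
        return np.sum(y * z - np.log1p(np.exp(z)))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))                # made-up features
    y = rng.integers(0, 2, size=5)             # made-up binary labels
    print(cond_log_likelihood(0.0, np.zeros(2), X, y))   # equals 5 * ln(1/2) at w = 0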

  19. Maximizing Conditional log Likelihood Bad news: no closed-form solution to maximize l(w). Good news: l(w) is a concave function of w, and concave functions are easy to optimize (unique maximum). (slide by Aarti Singh & Barnabás Póczos)
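
The concavity claim can be checked from the expression on slide 18 (this derivation is a supplement, not on the slide itself); writing z_l = w_0 + ∑_i w_i X_i^l and using the convention X_0^l = 1 for the intercept:

    \frac{\partial^2 l(w)}{\partial w_i \, \partial w_j}
      = -\sum_l X_i^l \, X_j^l \, \sigma(z_l)\bigl(1 - \sigma(z_l)\bigr),
    \qquad \sigma(z) = \frac{e^z}{1 + e^z}

so the Hessian is −XᵀDX with D diagonal and non-negative, hence negative semi-definite, which is exactly the concavity being used here.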

  20. Optimizing concave/convex functions • Conditional likelihood for Logistic Regression is concave • Maximum of a concave function = minimum of a convex function Gradient Ascent (concave) / Gradient Descent (convex) Gradient: ∇_w l(w) = [ ∂l(w)/∂w_0, …, ∂l(w)/∂w_d ] Update rule: w ← w + η ∇_w l(w), with learning rate η > 0 (slide by Aarti Singh & Barnabás Póczos)

  21. Gradient Ascent for Logistic Regression Gradient ascent algorithm: iterate until change < ε Repeat: w_0 ← w_0 + η ∑_l [ Y^l − P(Y^l = 1 | X^l, w) ] for i = 1, …, d: w_i ← w_i + η ∑_l X_i^l [ Y^l − P(Y^l = 1 | X^l, w) ] Here P(Y^l = 1 | X^l, w) is what the current weights predict label Y should be. • Gradient ascent is the simplest of the optimization approaches − e.g. Newton's method, conjugate gradient ascent, IRLS (see Bishop 4.3.3) (slide by Aarti Singh & Barnabás Póczos)
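
Putting the update rule together, a minimal batch gradient-ascent sketch (made-up data and step size; a sketch, not the assignment's reference implementation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, eta=0.01, eps=1e-6, max_iters=10000):
        """Gradient ascent on l(W); eta is the learning rate, eps the stopping tolerance."""
        n, d = X.shape
        w0, w = 0.0, np.zeros(d)
        for _ in range(max_iters):
            p = sigmoid(w0 + X @ w)              # P(Y=1 | X^l, w) for every example
            err = y - p                          # Y^l - P(Y^l = 1 | X^l, w)
            dw0, dw = err.sum(), X.T @ err       # gradient of l(W)
            w0, w = w0 + eta * dw0, w + eta * dw
            if eta * np.sqrt(dw0 ** 2 + dw @ dw) < eps:   # iterate until change < eps
                break
        return w0, w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                                      # made-up features
    y = (X @ np.array([1.5, -2.0]) + rng.normal(size=200) > 0) * 1.0   # noisy labels
    print(fit_logistic_regression(X, y))

The step size η matters: too large and the updates overshoot or oscillate, too small and convergence is slow, which is the trade-off discussed on the next slide.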

  22. Effect of step-size η Large η → fast convergence, but larger residual error; also possible oscillations Small η → slow convergence, but small residual error (slide by Aarti Singh & Barnabás Póczos)

  23. Naïve Bayes vs. Logistic Regression Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ Set of Logistic Regression parameters • Representation equivalence − But only in a special case!!! (GNB with class-independent variances) • But what's the difference??? (slide by Aarti Singh & Barnabás Póczos)

  24. Naïve Bayes vs. Logistic Regression Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ Set of Logistic Regression parameters • Representation equivalence − But only in a special case!!! (GNB with class-independent variances) • But what's the difference??? • LR makes no assumption about P(X|Y) in learning!!! • Loss function!!! − They optimize different functions and obtain different solutions. (slide by Aarti Singh & Barnabás Póczos)

  25. Naïve Bayes vs. Logistic Regression Consider Y Boolean, X_i continuous, X = <X_1, …, X_d> Number of parameters: • NB: 4d + 1: the prior π plus, for each class y = 0, 1, the means (μ_{1,y}, …, μ_{d,y}) and variances (σ²_{1,y}, …, σ²_{d,y}); e.g. for d = 10 that is 41 parameters. • LR: d + 1: w_0, w_1, …, w_d; e.g. for d = 10 that is 11 parameters. Estimation method: • NB parameter estimates are uncoupled • LR parameter estimates are coupled (slide by Aarti Singh & Barnabás Póczos)
