Pattern Recognition 2019
Linear Models for Classification
Ad Feelders (Universiteit Utrecht)
Classification Problems

We are concerned with the problems of
1. predicting the class of an object, on the basis of a number of variables that describe the object;
2. estimating the class probabilities of an object.
These problems are interconnected, since prediction is usually based on the estimated probabilities.
Examples of Classification Problems

- Churn: is the customer going to leave for a competitor?
- SPAM filter: is an e-mail message SPAM or not?
- Medical diagnosis: does the patient have breast cancer?
- Handwritten digit recognition.
Classification Problems

In this kind of classification problem there is a target variable $t$ that assumes values in an unordered discrete set. An important special case is when there are only two classes, in which case we usually choose $t \in \{0, 1\}$.

The goal of a classification procedure is to predict the target value (class label) given a set of input values $x = \{x_1, \ldots, x_D\}$ measured on the same object.
Classification Problems

At a particular point $x$ the value of $t$ is not uniquely determined: it can assume both of its values, with respective probabilities that depend on the location of the point $x$ in the input space. We write

$$y(x) = p(C_1 \mid x) = 1 - p(C_2 \mid x).$$

The goal of a classification procedure is to produce an estimate of $y(x)$ at every input point.
Two types of approaches to classification

- Discriminative models ("regression"; section 4.3).
- Generative models ("density estimation"; section 4.2).
Discriminative Models

Discriminative methods model only the conditional distribution of $t$ given $x$; the probability distribution of $x$ itself is not modeled. For the binary classification problem:

$$y(x) = p(C_1 \mid x) = p(t = 1 \mid x) = f(x, w)$$

where $f(x, w)$ is some deterministic function of $x$.
Discriminative Models

Examples of discriminative classification methods:
- Linear probability model
- Logistic regression
- Feed-forward neural networks
- ...
Generative Models

An alternative paradigm for estimating $y(x)$ is based on density estimation. Here Bayes' theorem

$$y(x) = p(C_1 \mid x) = \frac{p(C_1)\, p(x \mid C_1)}{p(C_1)\, p(x \mid C_1) + p(C_2)\, p(x \mid C_2)}$$

is applied, where the $p(x \mid C_k)$ are the class-conditional probability density functions and the $p(C_k)$ are the unconditional ("prior") probabilities of each class.
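To make the recipe concrete, here is a minimal sketch of a generative classifier with univariate Gaussian class-conditional densities; the priors and Gaussian parameters below are made-up illustrative values, not taken from the slides:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density, used here as the class-conditional p(x|C_k)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Made-up priors p(C_k) and class-conditional parameters (mean, std).
prior = {1: 0.3, 2: 0.7}
params = {1: (5.0, 1.0), 2: (8.0, 2.0)}

def posterior_c1(x):
    """Bayes' theorem: p(C1|x) = p(C1)p(x|C1) / [p(C1)p(x|C1) + p(C2)p(x|C2)]."""
    num = prior[1] * gauss_pdf(x, *params[1])
    den = num + prior[2] * gauss_pdf(x, *params[2])
    return num / den

print(posterior_c1(6.0))  # y(x) for a point between the two class means
```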
Generative Models

Examples of generative classification methods:
- Linear/Quadratic Discriminant Analysis
- Naive Bayes classifier
- ...
Discriminative Models: linear probability model

In the linear probability model, we assume that:

$$p(t = 1 \mid x) = E[t \mid x] = w^\top x$$

Problem: the linear function $w^\top x$ is not guaranteed to produce values between 0 and 1. Negative probabilities and probabilities bigger than 1 go against the axioms of probability.
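A two-line numerical illustration of the problem (the weights are arbitrary values of my own choosing):

```python
import numpy as np

w = np.array([-3.0, 0.2])                 # arbitrary illustrative weights
X = np.array([[1.0, 0.0], [1.0, 30.0]])   # bias column plus one predictor
print(X @ w)                              # [-3.  3.]: "probabilities" outside [0, 1]
```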
Linear response function

[Figure: plot of the linear response function, with fitted values falling below 0 and above 1.]
Logistic regression

Logistic response function:

$$E[t \mid x] = p(t = 1 \mid x) = \frac{e^{w^\top x}}{1 + e^{w^\top x}}$$

or (divide numerator and denominator by $e^{w^\top x}$):

$$p(t = 1 \mid x) = \frac{1}{1 + e^{-w^\top x}} = (1 + e^{-w^\top x})^{-1} \qquad \text{(4.59 and 4.87)}$$
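A minimal numerical check that the two forms of the response function agree (the helper name sigmoid and the example numbers are mine):

```python
import numpy as np

def sigmoid(a):
    """Logistic response function: maps any real activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([-3.0, 0.2])             # illustrative weights
x = np.array([1.0, 14.0])             # input with a leading 1 for the bias term
a = w @ x
print(np.exp(a) / (1.0 + np.exp(a)))  # e^{w'x} / (1 + e^{w'x})
print(sigmoid(a))                     # (1 + e^{-w'x})^{-1}, the same value
```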
Logistic Response Function

[Figure: the S-shaped logistic curve, rising from 0 to 1 and crossing 0.5 at $w^\top x = 0$.]
Linearization: the logit transformation

Since $p(t = 1 \mid x)$ and $p(t = 0 \mid x)$ have to add up to one, it follows that:

$$p(t = 0 \mid x) = \frac{1}{1 + e^{w^\top x}}$$

Hence,

$$\frac{p(t = 1 \mid x)}{p(t = 0 \mid x)} = e^{w^\top x}$$

Therefore

$$\ln\left(\frac{p(t = 1 \mid x)}{p(t = 0 \mid x)}\right) = w^\top x$$

The ratio $p(t = 1 \mid x)\,/\,p(t = 0 \mid x)$ is called the odds.
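A quick check (with arbitrary weights) that the log-odds are indeed linear in $x$:

```python
import numpy as np

w0, w1 = -3.0, 0.2                         # arbitrary illustrative weights
x = np.linspace(0.0, 30.0, 4)
p1 = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))  # p(t = 1 | x)
print(np.log(p1 / (1.0 - p1)))             # logit: equals w0 + w1*x ...
print(w0 + w1 * x)                         # ... up to rounding
```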
Linear Separation

Assign to class $t = 1$ if $p(t = 1 \mid x) > p(t = 0 \mid x)$, i.e. if

$$\frac{p(t = 1 \mid x)}{p(t = 0 \mid x)} > 1$$

This is true if

$$\ln\left(\frac{p(t = 1 \mid x)}{p(t = 0 \mid x)}\right) > 0$$

So:

$$\text{Assign to class } t = \begin{cases} 1 & \text{if } w^\top x > 0 \\ 0 & \text{otherwise} \end{cases}$$
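As a sketch, the decision rule in code (the function name and data are mine):

```python
import numpy as np

def classify(X, w):
    """Linear decision rule: predict class 1 where w'x > 0, class 0 otherwise."""
    return (X @ w > 0).astype(int)

w = np.array([-3.0, 0.2])                 # illustrative weights
X = np.array([[1.0, 10.0], [1.0, 20.0]])  # bias column plus one predictor
print(classify(X, w))                     # [0 1]: linear scores -1.0 and 1.0
```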
Maximum Likelihood Estimation

Let $t = 1$ if heads and $t = 0$ if tails, with $\mu = p(t = 1)$.

One coin flip:

$$p(t) = \mu^t (1 - \mu)^{1 - t}$$

Note that $p(1) = \mu$ and $p(0) = 1 - \mu$, as required.

Sequence of $N$ independent coin flips:

$$p(\mathbf{t}) = p(t_1, t_2, \ldots, t_N) = \prod_{n=1}^{N} \mu^{t_n} (1 - \mu)^{1 - t_n}$$

which defines the likelihood function when viewed as a function of $\mu$.
Maximum Likelihood Estimation

In a sequence of 10 coin flips we observe $\mathbf{t} = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0)$. The corresponding likelihood function is

$$p(\mathbf{t} \mid \mu) = \mu \cdot (1 - \mu) \cdot \mu \cdot \mu \cdot (1 - \mu) \cdot \mu \cdot \mu \cdot \mu \cdot \mu \cdot (1 - \mu) = \mu^7 (1 - \mu)^3$$

The corresponding log-likelihood function is

$$\ln p(\mathbf{t} \mid \mu) = \ln\left(\mu^7 (1 - \mu)^3\right) = 7 \ln \mu + 3 \ln(1 - \mu)$$
Computing the maximum

To determine the maximum we take the derivative and equate it to zero:

$$\frac{d \ln p(\mathbf{t} \mid \mu)}{d\mu} = \frac{7}{\mu} - \frac{3}{1 - \mu} = 0$$

which yields the maximum likelihood estimate $\mu_{ML} = 0.7$. This is just the relative frequency of heads in the sample.
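The same maximum can be found numerically; a minimal sketch (the grid search is my choice, purely for illustration):

```python
import numpy as np

t = np.array([1, 0, 1, 1, 0, 1, 1, 1, 1, 0])

def log_lik(mu):
    """Bernoulli log-likelihood 7 ln(mu) + 3 ln(1 - mu) for this sample."""
    return t.sum() * np.log(mu) + (len(t) - t.sum()) * np.log(1.0 - mu)

mus = np.linspace(0.01, 0.99, 99)    # grid of candidate values for mu
print(mus[np.argmax(log_lik(mus))])  # ~0.7, the maximum likelihood estimate
print(t.mean())                      # 0.7, the relative frequency of heads
```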
Loglikelihood function for $\mathbf{t} = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0)$

[Figure: plot of the log-likelihood $7 \ln \mu + 3 \ln(1 - \mu)$ over $\mu \in [0, 1]$, peaking at $\mu = 0.7$.]
ML estimation for logistic regression

Now the probability of success $p(t_n = 1)$ depends on the value of $x_n$:

$$p(t_n = 1 \mid x_n) = (1 + e^{-w^\top x_n})^{-1} = y_n$$
$$p(t_n = 0 \mid x_n) = (1 + e^{w^\top x_n})^{-1} = 1 - y_n$$

We can represent its probability distribution as follows:

$$p(t_n) = y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad t_n \in \{0, 1\}; \quad n = 1, \ldots, N$$
ML estimation for logistic regression

Example:

 n | x_n | t_n | p(t_n)
---|-----|-----|------------------------------
 1 |  8  |  0  | (1 + e^{w_0 + 8 w_1})^{-1}
 2 | 12  |  0  | (1 + e^{w_0 + 12 w_1})^{-1}
 3 | 15  |  1  | (1 + e^{-w_0 - 15 w_1})^{-1}
 4 | 10  |  1  | (1 + e^{-w_0 - 10 w_1})^{-1}
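Evaluating the four table entries for a given weight vector (the values of $w_0$ and $w_1$ below are arbitrary, chosen only to make the sketch run):

```python
import numpy as np

w0, w1 = -2.0, 0.15                       # arbitrary weights, not fitted
x = np.array([8.0, 12.0, 15.0, 10.0])
t = np.array([0, 0, 1, 1])
y = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))  # y_n = p(t_n = 1 | x_n)
print(np.where(t == 1, y, 1.0 - y))       # p(t_n): probability of the observed class
```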
LR: likelihood function

Since the $t_n$ observations are independent:

$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} p(t_n) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} \qquad (4.89)$$

Or, taking minus the natural log:

$$-\ln p(\mathbf{t} \mid w) = -\ln \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} \qquad (4.90)$$

This is called the cross-entropy error function.
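The cross-entropy error (4.90), written directly in code (a sketch; the function name and toy data are mine):

```python
import numpy as np

def cross_entropy(w, X, t):
    """Cross-entropy error (4.90): -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    y = 1.0 / (1.0 + np.exp(-(X @ w)))  # y_n = p(t_n = 1 | x_n)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1.0 - y))

# Toy data from the four-row example above, with a bias column of ones.
X = np.array([[1.0, 8.0], [1.0, 12.0], [1.0, 15.0], [1.0, 10.0]])
t = np.array([0, 0, 1, 1])
print(cross_entropy(np.array([-2.0, 0.15]), X, t))
```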
LR: error function

Since for the logistic regression model

$$y_n = (1 + e^{-w^\top x_n})^{-1}, \qquad 1 - y_n = (1 + e^{w^\top x_n})^{-1}$$

we get

$$E(w) = \sum_{n=1}^{N} \left\{ t_n \ln(1 + e^{-w^\top x_n}) + (1 - t_n) \ln(1 + e^{w^\top x_n}) \right\}$$

- Non-linear function of the parameters.
- No closed-form solution.
- Error function is globally convex.
- Estimate with e.g. gradient descent, as sketched below.
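A minimal gradient-descent sketch. It relies on the standard gradient of the cross-entropy error, $\nabla E(w) = \sum_n (y_n - t_n) x_n$ (Bishop, eq. 4.91); the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def fit_logistic(X, t, lr=0.005, n_iter=10000):
    """Fit logistic regression by batch gradient descent on the cross-entropy error."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(X @ w)))  # y_n = p(t_n = 1 | x_n)
        w -= lr * X.T @ (y - t)             # step along -grad E(w) = -sum (y_n - t_n) x_n
    return w

# Toy data from the four-row example above, with a bias column of ones.
X = np.array([[1.0, 8.0], [1.0, 12.0], [1.0, 15.0], [1.0, 10.0]])
t = np.array([0, 0, 1, 1])
print(fit_logistic(X, t))                   # estimated (w_0, w_1)
```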
Fitted Response Function

Substitute the maximum likelihood estimates into the response function to obtain the fitted response function:

$$\hat{p}(t = 1 \mid x) = \frac{e^{w_{ML}^\top x}}{1 + e^{w_{ML}^\top x}}$$
Example: Programming Assignment

Model the probability of successfully completing a programming assignment. Explanatory variable: "programming experience" (in months). We find $w_0 = -3.0597$ and $w_1 = 0.1615$, so

$$\hat{p}(t = 1 \mid x_n) = \frac{e^{-3.0597 + 0.1615 x_n}}{1 + e^{-3.0597 + 0.1615 x_n}}$$

For 14 months of programming experience:

$$\hat{p}(t = 1 \mid x = 14) = \frac{e^{-3.0597 + 0.1615 (14)}}{1 + e^{-3.0597 + 0.1615 (14)}} \approx 0.31$$
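Reproducing the 0.31 with the fitted weights from the slide:

```python
import numpy as np

w0, w1 = -3.0597, 0.1615  # fitted weights from the example
x = 14                    # months of programming experience
p_hat = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))
print(round(p_hat, 2))    # 0.31
```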
Interpretation of weights

In the case of a single predictor variable, the odds of $t = 1$ are given by:

$$\frac{p(t = 1 \mid x)}{p(t = 0 \mid x)} = e^{w_0 + w_1 x}$$

If we increase $x$ by 1 unit, the odds become

$$e^{w_0 + w_1 (x + 1)} = e^{w_0 + w_1 x + w_1} = e^{w_0 + w_1 x} \, e^{w_1},$$

since $e^{a+b} = e^a \times e^b$. We have $e^{w_1} = e^{0.1615} \approx 1.175$. Hence, every extra month of programming experience increases the odds of success by $17.5\%$.
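Checking the odds-ratio interpretation numerically (the helper name odds is mine):

```python
import numpy as np

w0, w1 = -3.0597, 0.1615  # fitted weights from the example

def odds(x):
    """Odds of success: p(t = 1 | x) / p(t = 0 | x) = e^{w0 + w1 x}."""
    return np.exp(w0 + w1 * x)

print(odds(15) / odds(14))  # ~1.175: one extra month multiplies the odds by e^{w1}
print(np.exp(w1))           # the same factor, independent of x
```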