Linear Discrimination

Steven J Zeil
Old Dominion Univ.
Fall 2010
Outline

1. Discriminant-Based Classification
   - Linearly Separable Systems
   - Pairwise Separation
2. Posteriors
3. Logistic Discrimination
Discriminant-Based Classification

Likelihood-based: Assume a model for p(x | C_i). Use Bayes' rule to calculate P(C_i | x):

    g_i(x) = log P(C_i | x)

Discriminant-based: Assume a model directly for the discriminant g_i(x | Φ_i).

Vapnik: Estimating the class densities is a harder problem than estimating the class discriminants. It does not make sense to solve a hard problem in order to solve an easier one.
Linear Discrimination

Linear discriminant:

    g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0} = Σ_{j=1}^d w_{ij} x_j + w_{i0}

Advantages:
- Simple: O(d) space/computation.
- Knowledge extraction: weight sizes give an indication of the significance of each attribute's contribution.
- Optimal when the p(x | C_i) are Gaussian with a shared covariance matrix.
- Useful when classes are (almost) linearly separable.
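To make the arithmetic concrete, here is a minimal sketch of evaluating a linear discriminant in Python with NumPy; the input and weight values are hypothetical, chosen only for illustration.

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Evaluate g_i(x | w_i, w_i0) = w_i^T x + w_i0."""
    return np.dot(w, x) + w0

# Hypothetical 2-D input and weights for one class.
x = np.array([1.0, 2.0])
w1, w10 = np.array([0.5, -0.3]), 0.1
print(linear_discriminant(x, w1, w10))  # 0.5*1.0 - 0.3*2.0 + 0.1 = 0.0
```

The O(d) cost is visible here: one dot product and one addition per class.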
More General Linear Models

    g_i(x | w_i, w_{i0}) = Σ_{j=1}^d w_{ij} x_j + w_{i0}

We can replace the x_j on the right by any linearly independent set of basis functions. For two classes:

    g(x) = g_1(x) − g_2(x) = w^T x + w_0

    Choose C_1 if g(x) > 0, C_2 otherwise.
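As a sketch of the basis-function idea, the snippet below maps a 2-D input through a hypothetical quadratic basis; the discriminant stays linear in the weights even though it is now nonlinear in x.

```python
import numpy as np

def quadratic_basis(x):
    """A hypothetical linearly independent basis for 2-D x:
    phi(x) = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

def g(x, w, w0):
    # Still a linear discriminant -- just over phi(x) instead of x.
    return np.dot(w, quadratic_basis(x)) + w0
```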
Geometric Interpretation

Rewrite x as

    x = x_p + r (w / ||w||)

where
- x_p is the projection of x onto the hyperplane g(x) = 0,
- w is normal to the hyperplane,
- r = g(x) / ||w|| is the (signed) distance from x to the hyperplane.

For a concrete check, see the sketch below.
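A quick numeric check of the decomposition, using a hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0:

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0   # hypothetical g(x) = 3*x1 + 4*x2 - 5
x = np.array([2.0, 1.0])

g = np.dot(w, x) + w0                # g(x) = 5.0
r = g / np.linalg.norm(w)            # signed distance: 5 / 5 = 1.0
x_p = x - r * w / np.linalg.norm(w)  # projection onto the hyperplane: [1.4, 0.2]

print(np.dot(w, x_p) + w0)           # 0.0 -- x_p lies on g(x) = 0
```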
Linearly Separable Systems

For multiple classes, with

    g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}

and the w_i normalized:

    Choose C_i if g_i(x) = max_{j=1..K} g_j(x)
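In code, the decision rule is a single argmax; a minimal sketch with the class weights stacked row-wise (names hypothetical):

```python
import numpy as np

def classify(x, W, w0):
    """W: K x d matrix of (normalized) weight vectors; w0: K offsets.
    Choose the class whose discriminant is largest."""
    g = W @ x + w0
    return int(np.argmax(g))  # index i of the chosen class C_i
```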
Pairwise Separation

If the classes are not linearly separable, compute discriminants between each pair of classes:

    g_{ij}(x | w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}

    Choose C_i if g_{ij}(x) > 0 for all j ≠ i
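A sketch of the pairwise rule. The antisymmetric storage (g_ji = -g_ij) is my own convention, not from the slide; when no class wins all of its pairwise contests, the input falls in a reject region:

```python
import numpy as np

def pairwise_classify(x, W, w0):
    """W[i, j] and w0[i, j] parameterize g_ij(x); assumes g_ji = -g_ij.
    Returns i with g_ij(x) > 0 for all j != i, or None (reject)."""
    K = W.shape[0]
    g = np.einsum('ijd,d->ij', W, x) + w0  # all K*K pairwise values
    for i in range(K):
        if all(g[i, j] > 0 for j in range(K) if j != i):
            return i
    return None
```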
Revisiting Parametric Methods

When p(x | C_i) ~ N(μ_i, Σ) with a shared covariance matrix Σ:

    g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}
    w_i = Σ^{-1} μ_i
    w_{i0} = −(1/2) μ_i^T Σ^{-1} μ_i + log P(C_i)

Let y ≡ P(C_1 | x). Then P(C_2 | x) = 1 − y.

We choose C_1 if y > 0.5, or equivalently if y / (1 − y) > 1, or equivalently if

    log [y / (1 − y)] > 0

The quantity log [y / (1 − y)] is called the log odds of y, or the logit.
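A minimal sketch of these formulas, assuming the Gaussian parameters (class mean, shared covariance, prior) have already been estimated:

```python
import numpy as np

def gaussian_linear_discriminant(mu_i, Sigma, prior_i):
    """w_i = Sigma^-1 mu_i,  w_i0 = -1/2 mu_i^T Sigma^-1 mu_i + log P(C_i)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ mu_i
    w0 = -0.5 * mu_i @ Sigma_inv @ mu_i + np.log(prior_i)
    return w, w0
```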
Log Odds

For two normal classes with a shared covariance matrix, the log odds is linear:

    logit(P(C_1 | x)) = log [P(C_1 | x) / P(C_2 | x)]
                      = log [p(x | C_1) / p(x | C_2)] + log [P(C_1) / P(C_2)]

The p(x | C) terms are exponential in x (Gaussian pdfs), so their log is linear:

    logit(P(C_1 | x)) = w^T x + w_0

with

    w = Σ^{-1}(μ_1 − μ_2)
    w_0 = −(1/2)(μ_1 + μ_2)^T Σ^{-1}(μ_1 − μ_2) + log [P(C_1) / P(C_2)]
Logistic

The inverse of the logit function

    logit(P(C_1 | x)) = w^T x + w_0

is called the logistic, a.k.a. the sigmoid:

    P(C_1 | x) = sigmoid(w^T x + w_0) = 1 / (1 + exp[−(w^T x + w_0)])
Using the Sigmoid

During training, estimate m_1, m_2, and S (the sample means and shared covariance), then compute w and w_0.

During testing, either
- calculate g(x | w, w_0) = w^T x + w_0 and choose C_1 if g(x) > 0, or
- calculate y = sigmoid(w^T x + w_0) and choose C_1 if y > 0.5.

A minimal end-to-end sketch follows.
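This sketch follows the slide's assumptions (two Gaussian classes, shared covariance); the pooled-covariance estimator and the prior argument are my own fill-ins:

```python
import numpy as np

def fit_shared_cov(X1, X2, p1=0.5):
    """Estimate m1, m2 and pooled S from samples, then compute w, w0."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = (np.cov(X1.T, bias=True) * len(X1) +
         np.cov(X2.T, bias=True) * len(X2)) / (len(X1) + len(X2))
    S_inv = np.linalg.inv(S)
    w = S_inv @ (m1 - m2)
    w0 = -0.5 * (m1 + m2) @ S_inv @ (m1 - m2) + np.log(p1 / (1 - p1))
    return w, w0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, w, w0):
    y = sigmoid(w @ x + w0)      # estimated P(C1 | x)
    return 1 if y > 0.5 else 2   # the two tests on the slide are equivalent
```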
Logistic Discrimination

For two classes, assume the log likelihood ratio is linear (the log-prior term log [P(C_1)/P(C_2)] is absorbed into w_0):

    log [p(x | C_1) / p(x | C_2)] = w^T x + w_0
    logit(P(C_1 | x)) = w^T x + w_0
    y = P̂(C_1 | x) = 1 / (1 + exp[−(w^T x + w_0)])

Likelihood, where r^t = 1 if x^t ∈ C_1 and 0 otherwise:

    l(w, w_0 | X) = Π_t (y^t)^{r^t} (1 − y^t)^{1 − r^t}

Error ("cross-entropy"):

    E(w, w_0 | X) = − Σ_t [r^t log y^t + (1 − r^t) log (1 − y^t)]

Train by numerical optimization to minimize E.
Estimating w

[Figure from the original slide; its content is not recoverable from the extracted text.]
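Since the original figure is lost, here is a hedged sketch of one standard way to estimate w: batch gradient descent on the cross-entropy E, using dE/dw = −Σ_t (r^t − y^t) x^t. The learning rate and epoch count are arbitrary illustration values.

```python
import numpy as np

def train_logistic(X, r, eta=0.1, epochs=1000):
    """X: N x d inputs; r: N labels in {0, 1} (1 means C1)."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))  # y^t = sigmoid(w^T x^t + w0)
        w += eta * X.T @ (r - y)                 # dE/dw  = -sum (r^t - y^t) x^t
        w0 += eta * np.sum(r - y)                # dE/dw0 = -sum (r^t - y^t)
    return w, w0
```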
Multiple Classes

For K classes, take C_K as a reference class:

    log [p(x | C_i) / p(x | C_K)] = w_i^T x + w_{i0}
    p(C_i | x) / p(C_K | x) = exp[w_i^T x + w_{i0}]
    y_i = P̂(C_i | x) = exp[w_i^T x + w_{i0}] / (1 + Σ_{j=1}^{K−1} exp[w_j^T x + w_{j0}])

This is called the softmax function, because exponentiation combined with normalization tends to exaggerate the weight of the maximum term.

Likelihood:

    l({w_i, w_{i0}} | X) = Π_t Π_i (y_i^t)^{r_i^t}
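A sketch of the posterior computation. Writing the reference class C_K as one more row of zero weights turns the slide's formula into the uniform softmax over all K terms; subtracting the max before exponentiating is a standard overflow guard:

```python
import numpy as np

def softmax_posteriors(x, W, w0):
    """W: K x d, w0: K (with the reference class C_K as a zero row).
    y_i = exp(w_i^T x + w_i0) / sum_j exp(w_j^T x + w_j0)."""
    a = W @ x + w0
    e = np.exp(a - a.max())   # shift for numerical stability
    return e / e.sum()        # the y_i sum to 1
```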
Multiple Classes (cont.)

Error ("cross-entropy"):

    E({w_i, w_{i0}} | X) = − Σ_t Σ_i r_i^t log y_i^t

Train by numerical optimization to minimize E. A K-class gradient-descent sketch follows.
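A hedged K-class training sketch, again by batch gradient descent; with softmax outputs, the cross-entropy gradient is dE/dw_i = −Σ_t (r_i^t − y_i^t) x^t:

```python
import numpy as np

def train_softmax(X, R, eta=0.1, epochs=1000):
    """X: N x d inputs; R: N x K one-hot labels."""
    N, d = X.shape
    K = R.shape[1]
    W, w0 = np.zeros((K, d)), np.zeros(K)
    for _ in range(epochs):
        A = X @ W.T + w0                      # N x K discriminant values
        E = np.exp(A - A.max(axis=1, keepdims=True))
        Y = E / E.sum(axis=1, keepdims=True)  # softmax posteriors y_i^t
        W += eta * (R - Y).T @ X              # -dE/dW, scaled by eta
        w0 += eta * (R - Y).sum(axis=0)
    return W, w0
```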
Softmax Classification

[Figure from the original slide; its content is not recoverable from the extracted text.]
Softmax Discriminants

[Figure from the original slide; its content is not recoverable from the extracted text.]