Linear Discrimination



  1. Linear Discrimination
     Linearly Separable Systems, Pairwise Separation
     Steven J Zeil, Old Dominion Univ., Fall 2010

     Outline: 1. Discriminant-Based Classification; 2. Posteriors; 3. Logistic Discrimination

     Discriminant-Based Classification
     Likelihood-based approach: assume a model for p(x | C_i) and use Bayes' rule to calculate P(C_i | x); the discriminant is
        g_i(x) = \log P(C_i | x)
     Discriminant-based approach: assume a model for the discriminant g_i(x | \phi_i) directly. Vapnik: estimating the class densities is a harder problem than estimating the class discriminants, and it does not make sense to solve a hard problem in order to solve an easier one.

     The linear discriminant is
        g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
     Advantages:
       - Simple: O(d) space and computation.
       - Knowledge extraction: the weight magnitudes indicate how much each attribute contributes.
       - Optimal when the p(x | C_i) are Gaussian with a shared covariance matrix.
       - Useful when the classes are (almost) linearly separable.
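     As a minimal NumPy sketch (not from the slides), evaluating all K linear discriminants is a single matrix-vector product; the weight matrix W and offsets w0 below are made-up illustrative values and are assumed to have been estimated already.

        import numpy as np

        def linear_discriminants(x, W, w0):
            """g_i(x) = w_i^T x + w_i0 for every class i; row i of W is w_i."""
            return W @ x + w0

        # Hypothetical 3-class, 2-feature example with made-up weights:
        W = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [-1.0, -1.0]])
        w0 = np.array([0.0, 0.0, 0.5])
        print(linear_discriminants(np.array([2.0, -1.0]), W, w0))  # g values: [2.0, -1.0, -0.5]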

  2. More General Linear Models
     In the linear discriminant
        g_i(x | w_i, w_{i0}) = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
     we can replace the x_j on the right by any linearly independent set of basis functions.

     Geometric Interpretation
     For two classes, let g(x) = g_1(x) - g_2(x) = w^T x + w_0 and choose C_1 if g(x) > 0, C_2 otherwise. Rewrite x as
        x = x_p + r (w / ||w||)
     where x_p is the projection of x onto the hyperplane g(x) = 0, w is normal to the hyperplane, and r = g(x) / ||w|| is the (signed) distance of x from the hyperplane.

     Linearly Separable Systems
     For multiple classes, use one discriminant per class,
        g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0},
     with the w_i normalized. Choose C_i if g_i(x) = \max_{j=1}^{K} g_j(x).

     Pairwise Separation
     If the classes are not linearly separable, compute a discriminant between each pair of classes,
        g_{ij}(x | w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0},
     and choose C_i if g_{ij}(x) > 0 for all j \neq i.
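     A sketch (my own illustration, not the slides' code) of the geometric quantity r and the two multiclass decision rules above; all parameter names are assumptions of this example and the weights are taken as given rather than learned.

        import numpy as np

        def signed_distance(x, w, w0):
            """r = g(x) / ||w||: signed distance of x from the hyperplane w^T x + w0 = 0."""
            return (w @ x + w0) / np.linalg.norm(w)

        def choose_max(x, W, w0):
            """Linearly separable case: choose the class with the largest g_i(x)."""
            return int(np.argmax(W @ x + w0))

        def choose_pairwise(x, W_pair, w0_pair):
            """Pairwise separation: choose C_i if g_ij(x) > 0 for all j != i.
            W_pair[i, j] holds w_ij and w0_pair[i, j] holds w_ij0.
            Returns None (reject) if no class wins all of its pairwise tests."""
            K = W_pair.shape[0]
            for i in range(K):
                if all(W_pair[i, j] @ x + w0_pair[i, j] > 0
                       for j in range(K) if j != i):
                    return i
            return None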

  3. Revisiting Parametric Methods
     When p(x | C_i) ~ N(\mu_i, \Sigma) with a shared covariance matrix, the discriminant is linear,
        g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0},
     with
        w_i = \Sigma^{-1} \mu_i
        w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)

     Log Odds
     Let y \equiv P(C_1 | x), so P(C_2 | x) = 1 - y. We choose C_1 if y > 0.5, or equivalently if y / (1 - y) > 1, or equivalently if \log( y / (1 - y) ) > 0. The latter quantity is called the log odds of y, or logit.
     For two normal classes with a shared covariance matrix, the log odds is linear:
        logit(P(C_1 | x)) = \log [ P(C_1 | x) / P(C_2 | x) ] = \log [ p(x | C_1) / p(x | C_2) ] + \log [ P(C_1) / P(C_2) ]
     The p(x | C_i) terms are exponential in x (Gaussian pdfs), so the log is linear:
        logit(P(C_1 | x)) = w^T x + w_0
     with
        w = \Sigma^{-1} (\mu_1 - \mu_2)
        w_0 = -\frac{1}{2} (\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)

     The Logistic Function
     The inverse of the logit function logit(P(C_1 | x)) = w^T x + w_0 is called the logistic, a.k.a. the sigmoid:
        P(C_1 | x) = sigmoid(w^T x + w_0) = 1 / (1 + \exp[ -(w^T x + w_0) ])

     Using the Sigmoid
     During training, estimate m_1, m_2, and S, then compute w and w_0 from them. During testing, either
       - calculate g(x | w, w_0) = w^T x + w_0 and choose C_1 if g(x) > 0, or
       - calculate y = sigmoid(w^T x + w_0) and choose C_1 if y > 0.5.
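     As a concrete, simplified illustration of this parametric route (my own sketch, not the slides' code): estimate m_1, m_2, and a pooled covariance S from data, form w and w_0 as above, then read off the sigmoid posterior. Equal priors are assumed so the log-prior term drops out, and the pooled-covariance estimator is one reasonable choice among several.

        import numpy as np

        def fit_shared_cov_discriminant(X1, X2):
            """X1, X2: (N1, d) and (N2, d) samples of classes 1 and 2."""
            m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
            centered = np.vstack([X1 - m1, X2 - m2])
            S = centered.T @ centered / len(centered)    # pooled (shared) covariance
            S_inv = np.linalg.inv(S)
            w = S_inv @ (m1 - m2)
            w0 = -0.5 * (m1 + m2) @ S_inv @ (m1 - m2)    # equal priors assumed
            return w, w0

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def posterior_c1(x, w, w0):
            """P(C1 | x) = sigmoid(w^T x + w0); choose C1 when this exceeds 0.5."""
            return sigmoid(w @ x + w0)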

  4. Logistic Discrimination
     For two classes, assume the log likelihood ratio is linear:
        \log [ p(x | C_1) / p(x | C_2) ] = w^T x + w_0^0
     so that (absorbing the log prior ratio into w_0)
        logit(P(C_1 | x)) = w^T x + w_0
        y = \hat{P}(C_1 | x) = 1 / (1 + \exp[ -(w^T x + w_0) ])

     Estimating w
     Given a sample X = {x^t, r^t} with r^t \in {0, 1}, the likelihood is
        l(w, w_0 | X) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}
     and the error ("cross-entropy") is
        E(w, w_0 | X) = -\sum_t [ r^t \log y^t + (1 - r^t) \log(1 - y^t) ]
     Train by numerical optimization to minimize E.

     Multiple Classes
     For K classes, take C_K as a reference class and assume
        \log [ p(x | C_i) / p(x | C_K) ] = w_i^T x + w_{i0}
     so that
        P(C_i | x) / P(C_K | x) = \exp(w_i^T x + w_{i0})
     and
        y_i = \hat{P}(C_i | x) = \exp(w_i^T x + w_{i0}) / ( 1 + \sum_{j=1}^{K-1} \exp(w_j^T x + w_{j0}) )
     This is called the softmax function because exponentiation combined with normalization tends to exaggerate the weight of the maximum term.

     Multiple Classes (cont.)
     The likelihood is
        l({w_i, w_{i0}} | X) = \prod_t \prod_i (y_i^t)^{r_i^t}
     and the error ("cross-entropy") is
        E({w_i, w_{i0}} | X) = -\sum_t \sum_i r_i^t \log y_i^t
     Train by numerical optimization to minimize E.
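     For the two-class case, a minimal batch gradient-descent sketch that minimizes the cross-entropy E above; the learning rate and epoch count are made-up hyperparameters, since the slides only call for "numerical optimization" without fixing a method.

        import numpy as np

        def train_logistic(X, r, eta=0.1, epochs=1000):
            """X: (N, d) inputs; r: (N,) labels in {0, 1}. Returns w, w0."""
            N, d = X.shape
            w, w0 = np.zeros(d), 0.0
            for _ in range(epochs):
                y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))   # y^t = P_hat(C1 | x^t)
                # dE/dw = sum_t (y^t - r^t) x^t,  dE/dw0 = sum_t (y^t - r^t)
                grad = y - r
                w -= eta * (X.T @ grad) / N
                w0 -= eta * grad.sum() / N
            return w, w0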

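     For the multiclass case, a sketch of the softmax posteriors and the matching cross-entropy error. W stacking the w_i as rows and R being a one-hot label matrix are assumptions of this illustration; the plain softmax over all K classes is used here, and fixing the reference class's exponent to 1 recovers the reference-class form on the slide.

        import numpy as np

        def softmax_posteriors(x, W, w0):
            """y_i = exp(w_i^T x + w_i0) / sum_j exp(w_j^T x + w_j0)."""
            a = W @ x + w0
            a = a - a.max()                 # subtract the max for numerical stability
            e = np.exp(a)
            return e / e.sum()

        def cross_entropy(X, R, W, w0):
            """E = -sum_t sum_i r_i^t log y_i^t, with R one-hot of shape (N, K)."""
            Y = np.array([softmax_posteriors(x, W, w0) for x in X])
            return -np.sum(R * np.log(Y))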
  5. Softmax Classification; Softmax Discriminants
     [Figure-only slides; no recoverable text.]
