  1. Maximum Likelihood Estimation (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A – Winter 2012 – UCSD

  2. Statistical Learning
  • Goal: Given a relationship between a feature vector x and a vector y, and i.i.d. data samples (x_i, y_i), find an approximating function $f(x) \approx y$, i.e. a prediction $\hat{y} = f(x)$.
  • This is called training or learning.
  • Two major types of learning:
    – Unsupervised Classification (aka Clustering) or Regression ("blind" curve fitting): only X is known.
    – Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

  3. Optimal Classifiers
  • Performance depends on the data/feature space metric.
  • Some metrics are better than others.
  • The meaning of "better" is connected to how well adapted the metric is to the properties of the data.
  • But can we be more rigorous? What do we mean by "optimal"?
  • To talk about optimality we need to talk about a cost or loss $L(y, \hat{y})$ between the true y and the prediction $\hat{y} = f(x)$:
    – The average loss (the risk) is the function that we want to minimize.
    – The risk depends on the true y and the prediction $\hat{y}$.
    – It tells us how good our predictor/estimator is.
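
To make the notion of risk concrete, here is a minimal Python sketch of estimating the risk of a classifier as its average loss over labeled samples. The function names, the 0/1 loss choice, and the toy predictor are illustrative assumptions, not part of the slides.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """L(y, y_hat) = 0 if the prediction is correct, 1 otherwise."""
    return (y_true != y_pred).astype(float)

def empirical_risk(f, X, y, loss=zero_one_loss):
    """Estimate the risk E[L(Y, f(X))] by the average loss over the samples (X, y)."""
    y_pred = np.array([f(x) for x in X])
    return loss(y, y_pred).mean()

# Toy example: a 1-D threshold classifier on made-up data.
X = np.array([[-2.0], [-0.5], [0.3], [1.7]])
y = np.array([0, 0, 1, 1])
f = lambda x: int(x[0] > 0)
print(empirical_risk(f, X, y))   # 0.0 on this toy data
```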

  4. Data-Conditional Risk, R(x,i), for 0/1 Loss
  • An important special case of interest: zero loss for no error and equal loss for the two error types.
    (table: example loss matrix over the classes "regular" and "dart" — zero loss for correct decisions, loss 1 for either error)
  • This is equivalent to the regular "zero/one" loss:
    $$L[i,j] = \begin{cases} 0, & i = j \\ 1, & i \ne j. \end{cases}$$
  • Under this loss,
    $$i^*(x) = \arg\min_i \sum_j L[j,i]\, P_{Y|X}(j\,|\,x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j\,|\,x).$$

  5. Data-Conditional Risk, R(x,i), for 0/1 Loss
  • Note, then, that in the 0/1 loss case,
    $$R(x,i) = E_{Y|X}\left[ L(Y,i) \,|\, x \right] = \sum_{j \ne i} P_{Y|X}(j\,|\,x).$$
  • I.e., the data-conditional risk under the 0/1 loss is equal to the data-conditional probability of error,
    $$R(x,i) = P(Y \ne i \,|\, x) = 1 - P_{Y|X}(i\,|\,x).$$
  • Thus the optimal Bayes decision rule (BDR) under 0/1 loss minimizes the conditional probability of error. This is given by the MAP BDR:
    $$i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x).$$

  6. Data-Conditional Risk, R(x,i), for 0/1 Loss
  • Summarizing:
    $$i^*(x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j\,|\,x) = \arg\min_i \left[ 1 - P_{Y|X}(i\,|\,x) \right] = \arg\max_i P_{Y|X}(i\,|\,x).$$
  • The optimal decision rule is the MAP rule:
    – Pick the class with the largest probability given the observation x.
  • This is the Bayes Decision Rule (BDR) for the 0/1 loss.
    – We will often simplify our discussion by assuming this loss.
    – But you should always be aware that other losses may be used.
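
As a concrete illustration, a tiny Python sketch of the MAP rule: given the posterior probabilities $P_{Y|X}(i\,|\,x)$ for one observation, pick the class with the largest posterior. The posterior values below are made up for illustration.

```python
import numpy as np

# Hypothetical posteriors P_{Y|X}(i | x) for one observation x, over 3 classes.
posteriors = np.array([0.2, 0.5, 0.3])

# MAP / 0-1 loss BDR: i*(x) = argmax_i P_{Y|X}(i | x)
i_star = int(np.argmax(posteriors))

# Equivalently, argmin of the conditional probability of error 1 - P(i | x).
assert i_star == int(np.argmin(1.0 - posteriors))
print(i_star)  # 1
```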

  7. BDR (under 0/1 Loss)
  • For the zero/one loss, the following three decision rules are optimal and equivalent:
    – 1) $i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x)$
    – 2) $i^*(x) = \arg\max_i P_{X|Y}(x\,|\,i)\, P_Y(i)$
    – 3) $i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x\,|\,i) + \log P_Y(i) \right]$
  • Form 1) is usually hard to use; 3) is frequently easier than 2).
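
A small sketch of form 3), assuming two made-up 1-D Gaussian class conditionals (the densities and priors are illustrative). Working with log densities plus log priors avoids multiplying very small probabilities, and the normalizer $P_X(x)$ from form 1) is never needed.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities and priors (illustrative values).
class_conditionals = [norm(loc=0.0, scale=1.0), norm(loc=3.0, scale=2.0)]
priors = np.array([0.7, 0.3])

def bdr_log_form(x):
    """Form 3): i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]."""
    scores = [p.logpdf(x) + np.log(pi) for p, pi in zip(class_conditionals, priors)]
    return int(np.argmax(scores))

print(bdr_log_form(0.5), bdr_log_form(4.0))
```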

  8. Gaussian BDR Classifier (0/1 Loss)
  • A very important case is that of Gaussian classes.
    – The pdf of each class i is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$:
    $$P_{X|Y}(x\,|\,i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left[ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \right].$$
  • The Gaussian BDR under 0/1 loss is
    $$i^*(x) = \arg\max_i \left[ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\left( d\log(2\pi) + \log|\Sigma_i| \right) + \log P_Y(i) \right].$$
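
A sketch of this discriminant in numpy; the means, covariances, and priors below are placeholder values you would supply for your own problem.

```python
import numpy as np

def gaussian_bdr(x, means, covs, priors):
    """0/1-loss BDR for Gaussian classes:
    i*(x) = argmax_i [ -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i)
                       - 1/2 (d log(2 pi) + log|Sigma_i|) + log P_Y(i) ]."""
    d = x.shape[0]
    scores = []
    for mu, cov, prior in zip(means, covs, priors):
        diff = x - mu
        maha = diff @ np.linalg.solve(cov, diff)      # (x-mu)^T Sigma^{-1} (x-mu)
        log_det = np.linalg.slogdet(cov)[1]           # log |Sigma|
        scores.append(-0.5 * maha
                      - 0.5 * (d * np.log(2 * np.pi) + log_det)
                      + np.log(prior))
    return int(np.argmax(scores))

# Toy 2-D example with two classes (illustrative parameters).
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
print(gaussian_bdr(np.array([0.5, 0.2]), means, covs, priors))  # expected: class 0
```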

  9. Gaussian Classifier (0/1 Loss)
  • This can be written as
    $$i^*(x) = \arg\min_i \left[ d_i^2(x, \mu_i) + \alpha_i \right]$$
    with
    $$d_i^2(x, y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2\log P_Y(i),$$
    and can be interpreted as a nearest class-neighbor classifier which uses a "funny metric".
    (figure: two Gaussian classes; the decision boundary is where the posterior equals 0.5)
  • Note that each class has its own distance function: the square of the Mahalanobis distance for that class plus the α term for that class.
  • We effectively use a different metric in each region of the space.

  10. Gaussian Classifier (0/1 Loss)
  • A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:
    $$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$
    with
    $$d^2(x, y) = (x-y)^T \Sigma^{-1} (x-y), \qquad \alpha_i = -2\log P_Y(i).$$
  • Note:
    – α_i can be dropped when all classes have equal probability (the case shown in the figure). In this case the classifier is close in form to a NN classifier with Mahalanobis distance, but instead of finding the nearest training data point it looks for the nearest class prototype μ_i using the Mahalanobis distance.
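
A sketch of this shared-covariance case using scipy's Mahalanobis distance: with equal priors, classification reduces to finding the nearest class mean μ_i under that metric. The prototypes, covariance, and query points below are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical class prototypes (means) and a shared covariance Sigma.
means = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
VI = np.linalg.inv(Sigma)                        # cdist expects the inverse covariance

def classify_shared_cov(X, means, VI, priors=None):
    """argmin_i [ d^2(x, mu_i) + alpha_i ] with alpha_i = -2 log P_Y(i) (0 for equal priors)."""
    d2 = cdist(X, means, metric="mahalanobis", VI=VI) ** 2
    alpha = 0.0 if priors is None else -2.0 * np.log(np.asarray(priors))
    return np.argmin(d2 + alpha, axis=1)

X = np.array([[0.5, 0.4], [2.8, 2.1]])
print(classify_shared_cov(X, means, VI))         # expected: [0, 1]
```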

  11. Gaussian Classifier (0/1 Loss)
  • Binary classification with $\Sigma_i = \Sigma$:
    – One important property of this case is that the decision boundary is a hyperplane (Homework).
    – This can be shown by computing the set of points x such that
      $$d^2(x, \mu_0) + \alpha_0 = d^2(x, \mu_1) + \alpha_1$$
      and showing that they satisfy
      $$w^T (x - x_0) = 0.$$
      This is the equation of a hyperplane with normal w. The point x_0 can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case w and x_0 are parallel.
    (figure: the discriminant hyperplane separating the two classes, its normal w, and the minimum-norm point x_0)
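
One way to see this, sketched here only as a hint using the shared-covariance definitions above (not a full homework solution): expanding $d^2(x,\mu_i) = x^T\Sigma^{-1}x - 2\mu_i^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}\mu_i$, the quadratic terms cancel from both sides of the boundary condition, which becomes

$$2(\mu_1 - \mu_0)^T \Sigma^{-1} x = \mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0 + \alpha_1 - \alpha_0,$$

i.e. a linear equation $w^T x = b$ with

$$w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad b = \tfrac{1}{2}\left( \mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0 + \alpha_1 - \alpha_0 \right).$$

This is $w^T(x - x_0) = 0$ for any point $x_0$ satisfying $w^T x_0 = b$; in particular, with equal priors ($\alpha_0 = \alpha_1$) one may take $x_0 = \tfrac{1}{2}(\mu_0 + \mu_1)$.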

  12. Gaussian Classifier (0/1 Loss)
  • Furthermore, if all the covariances are the identity, $\Sigma_i = I$:
    $$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$
    with
    $$d^2(x, y) = \|x - y\|^2, \qquad \alpha_i = -2\log P_Y(i).$$
  • This is just Euclidean-distance template matching with the class means as templates.
    – E.g., for digit classification. (figure: per-class mean digit images used as templates)
    – Compare the complexity to nearest neighbors!
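
A minimal sketch of Euclidean-distance template matching for something like digit classification. Here `templates` stands for the per-class mean images (flattened); the data is synthetic and the names are illustrative.

```python
import numpy as np

def template_match(x, templates, priors=None):
    """Identity-covariance Gaussian classifier:
    i*(x) = argmin_i [ ||x - mu_i||^2 - 2 log P_Y(i) ]."""
    d2 = np.sum((templates - x) ** 2, axis=1)            # squared Euclidean distances
    alpha = 0.0 if priors is None else -2.0 * np.log(np.asarray(priors))
    return int(np.argmin(d2 + alpha))

# Synthetic "templates": 10 class means of 28x28 images, flattened to 784-vectors.
rng = np.random.default_rng(0)
templates = rng.random((10, 784))
query = templates[3] + 0.05 * rng.standard_normal(784)   # noisy copy of class 3
print(template_match(query, templates))                   # expected: 3
```

Note the complexity point from the slide: each query costs one distance computation per class (10 here), rather than one per training example as in nearest-neighbor classification.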

  13. The Sigmoid in 0/1 Loss Detection
  • We have derived all of this from the log-based 0/1 BDR:
    $$i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x\,|\,i) + \log P_Y(i) \right].$$
  • When there are only two classes, it is also interesting to look at the original definition in an alternative form:
    $$i^*(x) = \arg\max_i g_i(x)$$
    with
    $$g_i(x) = P_{Y|X}(i\,|\,x) = \frac{P_{X|Y}(x\,|\,i)\, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x\,|\,i)\, P_Y(i)}{P_{X|Y}(x\,|\,0)\, P_Y(0) + P_{X|Y}(x\,|\,1)\, P_Y(1)}.$$

  14. The Sigmoid in MAP Detection
  • Note that this can be written as
    $$i^*(x) = \arg\max_i g_i(x), \qquad g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x\,|\,1)\, P_Y(1)}{P_{X|Y}(x\,|\,0)\, P_Y(0)}}, \qquad g_1(x) = 1 - g_0(x).$$
  • For Gaussian classes, the posterior probability for "0" is
    $$g_0(x) = \frac{1}{1 + \exp\left\{ \tfrac{1}{2}\left[ d_0^2(x,\mu_0) + \alpha_0 - d_1^2(x,\mu_1) - \alpha_1 \right] \right\}}$$
    where, as before,
    $$d_i^2(x, y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2\log P_Y(i).$$
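
A sketch that evaluates this sigmoid form of the class-0 posterior for two hypothetical Gaussian classes and checks it against the direct Bayes-rule computation; all parameter values are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class Gaussian problem.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.6, 0.4]
d = 2

def d2_plus_alpha(x, i):
    """Mahalanobis term d_i^2(x, mu_i) plus the offset alpha_i defined on the slides."""
    diff = x - mus[i]
    d2 = diff @ np.linalg.solve(covs[i], diff)
    alpha = np.log((2 * np.pi) ** d * np.linalg.det(covs[i])) - 2 * np.log(priors[i])
    return d2 + alpha

x = np.array([0.7, 0.3])
g0_sigmoid = 1.0 / (1.0 + np.exp(0.5 * (d2_plus_alpha(x, 0) - d2_plus_alpha(x, 1))))

# Direct Bayes rule for comparison.
joint = [multivariate_normal(mus[i], covs[i]).pdf(x) * priors[i] for i in (0, 1)]
g0_bayes = joint[0] / sum(joint)
print(np.isclose(g0_sigmoid, g0_bayes))   # True
```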

  15. The Sigmoid in MAP Detection
  • The posterior probability for class "0",
    $$g_0(x) = \frac{1}{1 + \exp\left\{ \tfrac{1}{2}\left[ d_0^2(x,\mu_0) + \alpha_0 - d_1^2(x,\mu_1) - \alpha_1 \right] \right\}},$$
    is a sigmoid and looks like this:
    (figure: sigmoid-shaped posterior; the decision boundary is where $P_{Y|X}(1\,|\,x) = 0.5$)

  16. The Sigmoid in Neural Networks
  • The sigmoid function also appears in neural networks.
    – In neural networks, it can be interpreted as the posterior probability for a Gaussian problem where the covariances are the same.

  17. The Sigmoid in Neural Networks
  • But not necessarily when the covariances are different.

  18. Implementation
  • All of this is appealing, but in practice one doesn't know the values of the parameters $\mu_i$, $\Sigma_i$, $P_Y(i)$.
  • In the homework we use an "intuitive solution" to design a Gaussian classifier:
    – Start from a collection of datasets: $D^{(i)} = \{ x_1^{(i)}, \ldots, x_{n_i}^{(i)} \}$ = the set of examples from class i.
    – For each class, estimate the Gaussian BDR parameters using
      $$\hat{\mu}_i = \frac{1}{n_i}\sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{T},$$
      where T is the total number of examples (over all classes).
    – E.g., below are sample means computed for digit classification: (figure: per-class mean digit images)
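
A sketch of these "intuitive" estimates in numpy, given a data matrix X and integer labels y; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def estimate_gaussian_params(X, y):
    """Per-class sample mean, sample covariance, and class prior n_i / T."""
    classes = np.unique(y)
    T = len(y)
    params = {}
    for i in classes:
        Xi = X[y == i]                        # examples from class i
        mu = Xi.mean(axis=0)
        diff = Xi - mu
        Sigma = diff.T @ diff / len(Xi)       # divide by n_i (not n_i - 1)
        params[int(i)] = (mu, Sigma, len(Xi) / T)
    return params

# Toy usage on synthetic 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (60, 2))])
y = np.array([0] * 50 + [1] * 60)
params = estimate_gaussian_params(X, y)
```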

  19. A Practical Gaussian MAP Classifier
  • Instead of the ideal BDR
    $$i^*(x) = \arg\max_i \left[ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\left( d\log(2\pi) + \log|\Sigma_i| \right) + \log P_Y(i) \right],$$
    use the estimate of the BDR found from
    $$\hat{i}^*(x) = \arg\max_i \left[ -\frac{1}{2}(x-\hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x-\hat{\mu}_i) - \frac{1}{2}\left( d\log(2\pi) + \log|\hat{\Sigma}_i| \right) + \log \hat{P}_Y(i) \right].$$
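
Putting the two pieces together, a sketch of the plug-in classifier: apply the BDR formula with the estimated parameters (here via scipy's Gaussian log-density for brevity). The `params` layout matches the estimation sketch on the previous slide; the example values are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(x, params):
    """Plug-in Gaussian MAP rule:
    argmax_i [ log N(x; mu_hat_i, Sigma_hat_i) + log P_hat_Y(i) ]."""
    best_class, best_score = None, -np.inf
    for i, (mu, Sigma, prior) in params.items():
        score = multivariate_normal(mean=mu, cov=Sigma).logpdf(x) + np.log(prior)
        if score > best_score:
            best_class, best_score = i, score
    return best_class

# Toy illustration: parameters as they might come out of the estimation sketch above.
params = {
    0: (np.zeros(2), np.eye(2), 0.45),
    1: (np.array([3.0, 3.0]), 0.5 * np.eye(2), 0.55),
}
print(predict(np.array([0.4, -0.1]), params))   # expected: 0
```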

  20. Important
  • Warning: at this point all optimality claims for the BDR cease to be valid!
  • The BDR is guaranteed to achieve the minimum loss only when we use the true probabilities.
  • When we "plug in" probability estimates, we could be implementing a classifier that is quite distant from the optimal one.
    – E.g., if the true $P_{X|Y}(x\,|\,i)$ looks like the example above, it could never be approximated well by a simple parametric model (e.g., a single Gaussian).
