  1. Maximum Likelihood Estimation (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A – Winter 2012 – UCSD

  2. Statistical Learning
  • Goal: Given a relationship between a feature vector x and a vector y, and i.i.d. data samples (x_i, y_i), find an approximating function $f(x) \approx y$, i.e. a prediction $\hat{y} = f(x)$.
  • This is called training or learning.
  • Two major types of learning:
    – Unsupervised Classification (aka Clustering) or Regression ("blind" curve fitting): only X is known.
    – Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

  3. Optimal Classifiers
  • Performance depends on the data/feature space metric.
  • Some metrics are better than others.
  • The meaning of "better" is connected to how well adapted the metric is to the properties of the data.
  • But can we be more rigorous? What do we mean by "optimal"?
  • To talk about optimality we need to talk about a cost or loss $L(y, \hat{y})$ between the true y and the prediction $\hat{y} = f(x)$:
    – The average loss (the risk) is the function that we want to minimize.
    – The risk depends on the true y and the prediction $\hat{y}$.
    – It tells us how good our predictor/estimator is.
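
To make the notion of risk concrete, here is a minimal Python sketch of estimating the risk of a classifier as its average loss over labeled samples. The function names, the 0/1 loss choice, and the toy predictor are illustrative assumptions, not part of the slides.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """L(y, y_hat) = 0 if the prediction is correct, 1 otherwise."""
    return (y_true != y_pred).astype(float)

def empirical_risk(f, X, y, loss=zero_one_loss):
    """Estimate the risk E[L(Y, f(X))] by the average loss over the samples (X, y)."""
    y_pred = np.array([f(x) for x in X])
    return loss(y, y_pred).mean()

# Toy example: a 1-D threshold classifier on made-up data.
X = np.array([[-2.0], [-0.5], [0.3], [1.7]])
y = np.array([0, 0, 1, 1])
f = lambda x: int(x[0] > 0)
print(empirical_risk(f, X, y))   # 0.0 on this toy data
```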

  4. Data-Conditional Risk, R(x,i), for 0/1 Loss
  • An important special case of interest: zero loss for no error and equal loss for the two error types.
    (table: example loss matrix over the classes "regular" and "dart" — zero loss for correct decisions, loss 1 for either error)
  • This is equivalent to the regular "zero/one" loss:
    $$L[i,j] = \begin{cases} 0, & i = j \\ 1, & i \ne j. \end{cases}$$
  • Under this loss,
    $$i^*(x) = \arg\min_i \sum_j L[j,i]\, P_{Y|X}(j\,|\,x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j\,|\,x).$$

  5. Data-Conditional Risk, R(x,i), for 0/1 Loss
  • Note, then, that in the 0/1 loss case,
    $$R(x,i) = E_{Y|X}\left[ L(Y,i) \,|\, x \right] = \sum_{j \ne i} P_{Y|X}(j\,|\,x).$$
  • I.e., the data-conditional risk under the 0/1 loss is equal to the data-conditional probability of error,
    $$R(x,i) = P(Y \ne i \,|\, x) = 1 - P_{Y|X}(i\,|\,x).$$
  • Thus the optimal Bayes decision rule (BDR) under 0/1 loss minimizes the conditional probability of error. This is given by the MAP BDR:
    $$i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x).$$

  6. Data-Conditional Risk, R(x,i), for 0/1 Loss
  • Summarizing:
    $$i^*(x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j\,|\,x) = \arg\min_i \left[ 1 - P_{Y|X}(i\,|\,x) \right] = \arg\max_i P_{Y|X}(i\,|\,x).$$
  • The optimal decision rule is the MAP rule:
    – Pick the class with the largest probability given the observation x.
  • This is the Bayes Decision Rule (BDR) for the 0/1 loss.
    – We will often simplify our discussion by assuming this loss.
    – But you should always be aware that other losses may be used.
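
As a concrete illustration, a tiny Python sketch of the MAP rule: given the posterior probabilities $P_{Y|X}(i\,|\,x)$ for one observation, pick the class with the largest posterior. The posterior values below are made up for illustration.

```python
import numpy as np

# Hypothetical posteriors P_{Y|X}(i | x) for one observation x, over 3 classes.
posteriors = np.array([0.2, 0.5, 0.3])

# MAP / 0-1 loss BDR: i*(x) = argmax_i P_{Y|X}(i | x)
i_star = int(np.argmax(posteriors))

# Equivalently, argmin of the conditional probability of error 1 - P(i | x).
assert i_star == int(np.argmin(1.0 - posteriors))
print(i_star)  # 1
```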

  7. BDR (under 0/1 Loss)
  • For the zero/one loss, the following three decision rules are optimal and equivalent:
    – 1) $i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x)$
    – 2) $i^*(x) = \arg\max_i P_{X|Y}(x\,|\,i)\, P_Y(i)$
    – 3) $i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x\,|\,i) + \log P_Y(i) \right]$
  • Form 1) is usually hard to use; 3) is frequently easier than 2).
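
A small sketch of form 3), assuming two made-up 1-D Gaussian class conditionals (the densities and priors are illustrative). Working with log densities plus log priors avoids multiplying very small probabilities, and the normalizer $P_X(x)$ from form 1) is never needed.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities and priors (illustrative values).
class_conditionals = [norm(loc=0.0, scale=1.0), norm(loc=3.0, scale=2.0)]
priors = np.array([0.7, 0.3])

def bdr_log_form(x):
    """Form 3): i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]."""
    scores = [p.logpdf(x) + np.log(pi) for p, pi in zip(class_conditionals, priors)]
    return int(np.argmax(scores))

print(bdr_log_form(0.5), bdr_log_form(4.0))
```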

  8. Gaussian BDR Classifier (0/1 Loss)
  • A very important case is that of Gaussian classes.
    – The pdf of each class i is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$:
    $$P_{X|Y}(x\,|\,i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left[ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \right].$$
  • The Gaussian BDR under 0/1 loss is
    $$i^*(x) = \arg\max_i \left[ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\left( d\log(2\pi) + \log|\Sigma_i| \right) + \log P_Y(i) \right].$$
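
A sketch of this discriminant in numpy; the means, covariances, and priors below are placeholder values you would supply for your own problem.

```python
import numpy as np

def gaussian_bdr(x, means, covs, priors):
    """0/1-loss BDR for Gaussian classes:
    i*(x) = argmax_i [ -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i)
                       - 1/2 (d log(2 pi) + log|Sigma_i|) + log P_Y(i) ]."""
    d = x.shape[0]
    scores = []
    for mu, cov, prior in zip(means, covs, priors):
        diff = x - mu
        maha = diff @ np.linalg.solve(cov, diff)      # (x-mu)^T Sigma^{-1} (x-mu)
        log_det = np.linalg.slogdet(cov)[1]           # log |Sigma|
        scores.append(-0.5 * maha
                      - 0.5 * (d * np.log(2 * np.pi) + log_det)
                      + np.log(prior))
    return int(np.argmax(scores))

# Toy 2-D example with two classes (illustrative parameters).
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
print(gaussian_bdr(np.array([0.5, 0.2]), means, covs, priors))  # expected: class 0
```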

  9. Gaussian Classifier (0/1 Loss)
  • This can be written as
    $$i^*(x) = \arg\min_i \left[ d_i^2(x, \mu_i) + \alpha_i \right]$$
    with
    $$d_i^2(x, y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2\log P_Y(i),$$
    and can be interpreted as a nearest class-neighbor classifier which uses a "funny metric".
    (figure: two Gaussian classes; the decision boundary is where the posterior equals 0.5)
  • Note that each class has its own distance function: the square of the Mahalanobis distance for that class plus the α term for that class.
  • We effectively use a different metric in each region of the space.

  10. Gaussian Classifier (0/1 Loss)
  • A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:
    $$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$
    with
    $$d^2(x, y) = (x-y)^T \Sigma^{-1} (x-y), \qquad \alpha_i = -2\log P_Y(i).$$
  • Note:
    – α_i can be dropped when all classes have equal probability (the case shown in the figure). In this case the classifier is close in form to a NN classifier with Mahalanobis distance, but instead of finding the nearest training data point it looks for the nearest class prototype μ_i using the Mahalanobis distance.
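
A sketch of this shared-covariance case using scipy's Mahalanobis distance: with equal priors, classification reduces to finding the nearest class mean μ_i under that metric. The prototypes, covariance, and query points below are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical class prototypes (means) and a shared covariance Sigma.
means = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
VI = np.linalg.inv(Sigma)                        # cdist expects the inverse covariance

def classify_shared_cov(X, means, VI, priors=None):
    """argmin_i [ d^2(x, mu_i) + alpha_i ] with alpha_i = -2 log P_Y(i) (0 for equal priors)."""
    d2 = cdist(X, means, metric="mahalanobis", VI=VI) ** 2
    alpha = 0.0 if priors is None else -2.0 * np.log(np.asarray(priors))
    return np.argmin(d2 + alpha, axis=1)

X = np.array([[0.5, 0.4], [2.8, 2.1]])
print(classify_shared_cov(X, means, VI))         # expected: [0, 1]
```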

  11. Gaussian Classifier (0/1 Loss)
  • Binary classification with $\Sigma_i = \Sigma$:
    – One important property of this case is that the decision boundary is a hyperplane (Homework).
    – This can be shown by computing the set of points x such that
      $$d^2(x, \mu_0) + \alpha_0 = d^2(x, \mu_1) + \alpha_1$$
      and showing that they satisfy
      $$w^T (x - x_0) = 0.$$
      This is the equation of a hyperplane with normal w. The point x_0 can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case w and x_0 are parallel.
    (figure: the discriminant hyperplane separating the two classes, its normal w, and the minimum-norm point x_0)
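
One way to see this, sketched here only as a hint using the shared-covariance definitions above (not a full homework solution): expanding $d^2(x,\mu_i) = x^T\Sigma^{-1}x - 2\mu_i^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}\mu_i$, the quadratic terms cancel from both sides of the boundary condition, which becomes

$$2(\mu_1 - \mu_0)^T \Sigma^{-1} x = \mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0 + \alpha_1 - \alpha_0,$$

i.e. a linear equation $w^T x = b$ with

$$w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad b = \tfrac{1}{2}\left( \mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0 + \alpha_1 - \alpha_0 \right).$$

This is $w^T(x - x_0) = 0$ for any point $x_0$ satisfying $w^T x_0 = b$; in particular, with equal priors ($\alpha_0 = \alpha_1$) one may take $x_0 = \tfrac{1}{2}(\mu_0 + \mu_1)$.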

  12. Gaussian Classifier (0/1 Loss)
  • Furthermore, if all the covariances are the identity, $\Sigma_i = I$:
    $$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$
    with
    $$d^2(x, y) = \|x - y\|^2, \qquad \alpha_i = -2\log P_Y(i).$$
  • This is just Euclidean-distance template matching with the class means as templates.
    – E.g., for digit classification. (figure: per-class mean digit images used as templates)
    – Compare the complexity to nearest neighbors!
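
A minimal sketch of Euclidean-distance template matching for something like digit classification. Here `templates` stands for the per-class mean images (flattened); the data is synthetic and the names are illustrative.

```python
import numpy as np

def template_match(x, templates, priors=None):
    """Identity-covariance Gaussian classifier:
    i*(x) = argmin_i [ ||x - mu_i||^2 - 2 log P_Y(i) ]."""
    d2 = np.sum((templates - x) ** 2, axis=1)            # squared Euclidean distances
    alpha = 0.0 if priors is None else -2.0 * np.log(np.asarray(priors))
    return int(np.argmin(d2 + alpha))

# Synthetic "templates": 10 class means of 28x28 images, flattened to 784-vectors.
rng = np.random.default_rng(0)
templates = rng.random((10, 784))
query = templates[3] + 0.05 * rng.standard_normal(784)   # noisy copy of class 3
print(template_match(query, templates))                   # expected: 3
```

Note the complexity point from the slide: each query costs one distance computation per class (10 here), rather than one per training example as in nearest-neighbor classification.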

  13. The Sigmoid in 0/1 Loss Detection
  • We have derived all of this from the log-based 0/1 BDR:
    $$i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x\,|\,i) + \log P_Y(i) \right].$$
  • When there are only two classes, it is also interesting to look at the original definition in an alternative form:
    $$i^*(x) = \arg\max_i g_i(x)$$
    with
    $$g_i(x) = P_{Y|X}(i\,|\,x) = \frac{P_{X|Y}(x\,|\,i)\, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x\,|\,i)\, P_Y(i)}{P_{X|Y}(x\,|\,0)\, P_Y(0) + P_{X|Y}(x\,|\,1)\, P_Y(1)}.$$

  14. The Sigmoid in MAP Detection
  • Note that this can be written as
    $$i^*(x) = \arg\max_i g_i(x), \qquad g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x\,|\,1)\, P_Y(1)}{P_{X|Y}(x\,|\,0)\, P_Y(0)}}, \qquad g_1(x) = 1 - g_0(x).$$
  • For Gaussian classes, the posterior probability for "0" is
    $$g_0(x) = \frac{1}{1 + \exp\left\{ \tfrac{1}{2}\left[ d_0^2(x,\mu_0) + \alpha_0 - d_1^2(x,\mu_1) - \alpha_1 \right] \right\}}$$
    where, as before,
    $$d_i^2(x, y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2\log P_Y(i).$$
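
A sketch that evaluates this sigmoid form of the class-0 posterior for two hypothetical Gaussian classes and checks it against the direct Bayes-rule computation; all parameter values are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class Gaussian problem.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.6, 0.4]
d = 2

def d2_plus_alpha(x, i):
    """Mahalanobis term d_i^2(x, mu_i) plus the offset alpha_i defined on the slides."""
    diff = x - mus[i]
    d2 = diff @ np.linalg.solve(covs[i], diff)
    alpha = np.log((2 * np.pi) ** d * np.linalg.det(covs[i])) - 2 * np.log(priors[i])
    return d2 + alpha

x = np.array([0.7, 0.3])
g0_sigmoid = 1.0 / (1.0 + np.exp(0.5 * (d2_plus_alpha(x, 0) - d2_plus_alpha(x, 1))))

# Direct Bayes rule for comparison.
joint = [multivariate_normal(mus[i], covs[i]).pdf(x) * priors[i] for i in (0, 1)]
g0_bayes = joint[0] / sum(joint)
print(np.isclose(g0_sigmoid, g0_bayes))   # True
```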

  15. The Sigmoid in MAP Detection
  • The posterior probability for class "0",
    $$g_0(x) = \frac{1}{1 + \exp\left\{ \tfrac{1}{2}\left[ d_0^2(x,\mu_0) + \alpha_0 - d_1^2(x,\mu_1) - \alpha_1 \right] \right\}},$$
    is a sigmoid and looks like this:
    (figure: sigmoid-shaped posterior; the decision boundary is where $P_{Y|X}(1\,|\,x) = 0.5$)

  16. The Sigmoid in Neural Networks
  • The sigmoid function also appears in neural networks.
    – In neural networks, it can be interpreted as the posterior probability for a Gaussian problem where the covariances are the same.

  17. The Sigmoid in Neural Networks
  • But not necessarily when the covariances are different.

  18. Implementation
  • All of this is appealing, but in practice one doesn't know the values of the parameters $\mu_i$, $\Sigma_i$, $P_Y(i)$.
  • In the homework we use an "intuitive solution" to design a Gaussian classifier:
    – Start from a collection of datasets: $D^{(i)} = \{ x_1^{(i)}, \ldots, x_{n_i}^{(i)} \}$ = the set of examples from class i.
    – For each class, estimate the Gaussian BDR parameters using
      $$\hat{\mu}_i = \frac{1}{n_i}\sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{T},$$
      where T is the total number of examples (over all classes).
    – E.g., below are sample means computed for digit classification: (figure: per-class mean digit images)
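
A sketch of these "intuitive" estimates in numpy, given a data matrix X and integer labels y; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def estimate_gaussian_params(X, y):
    """Per-class sample mean, sample covariance, and class prior n_i / T."""
    classes = np.unique(y)
    T = len(y)
    params = {}
    for i in classes:
        Xi = X[y == i]                        # examples from class i
        mu = Xi.mean(axis=0)
        diff = Xi - mu
        Sigma = diff.T @ diff / len(Xi)       # divide by n_i (not n_i - 1)
        params[int(i)] = (mu, Sigma, len(Xi) / T)
    return params

# Toy usage on synthetic 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (60, 2))])
y = np.array([0] * 50 + [1] * 60)
params = estimate_gaussian_params(X, y)
```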

  19. A Practical Gaussian MAP Classifier
  • Instead of the ideal BDR
    $$i^*(x) = \arg\max_i \left[ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\left( d\log(2\pi) + \log|\Sigma_i| \right) + \log P_Y(i) \right],$$
    use the estimate of the BDR found from
    $$\hat{i}^*(x) = \arg\max_i \left[ -\frac{1}{2}(x-\hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x-\hat{\mu}_i) - \frac{1}{2}\left( d\log(2\pi) + \log|\hat{\Sigma}_i| \right) + \log \hat{P}_Y(i) \right].$$
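
Putting the two pieces together, a sketch of the plug-in classifier: apply the BDR formula with the estimated parameters (here via scipy's Gaussian log-density for brevity). The `params` layout matches the estimation sketch on the previous slide; the example values are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(x, params):
    """Plug-in Gaussian MAP rule:
    argmax_i [ log N(x; mu_hat_i, Sigma_hat_i) + log P_hat_Y(i) ]."""
    best_class, best_score = None, -np.inf
    for i, (mu, Sigma, prior) in params.items():
        score = multivariate_normal(mean=mu, cov=Sigma).logpdf(x) + np.log(prior)
        if score > best_score:
            best_class, best_score = i, score
    return best_class

# Toy illustration: parameters as they might come out of the estimation sketch above.
params = {
    0: (np.zeros(2), np.eye(2), 0.45),
    1: (np.array([3.0, 3.0]), 0.5 * np.eye(2), 0.55),
}
print(predict(np.array([0.4, -0.1]), params))   # expected: 0
```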

  20. Important
  • Warning: at this point all optimality claims for the BDR cease to be valid!
  • The BDR is guaranteed to achieve the minimum loss only when we use the true probabilities.
  • When we "plug in" probability estimates, we could be implementing a classifier that is quite distant from the optimal one.
    – E.g., if the true $P_{X|Y}(x\,|\,i)$ looks like the example above, it could never be approximated well by a simple parametric model (e.g., a single Gaussian).
