Lecture 4: Bayesian Decision Theory and Max Likelihood Estimation
Dr. Chengjiang Long
Computer Vision Researcher at Kitware Inc.; Adjunct Professor at RPI
Email: longc3@rpi.edu
Recap: Previous Lecture
• Conditional risk: R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x)
• From a medical image, we want to classify (determine) whether it contains cancer tissue or not.
• Likelihood-ratio thresholds: θ_a = P(ω_2)/P(ω_1),  θ_b = P(ω_2)(λ_12 − λ_22) / [P(ω_1)(λ_21 − λ_11)]
• Ground truth is always unknown for classifiers.
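A minimal sketch of the likelihood-ratio decision rule recapped above. The 1-D Gaussian class-conditional densities, priors, and loss values are illustrative assumptions, not values from the lecture.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities (1-D Gaussians) -- assumed values.
p_x_given_w1 = norm(loc=0.0, scale=1.0)   # p(x | omega_1), e.g. "no cancer"
p_x_given_w2 = norm(loc=2.0, scale=1.0)   # p(x | omega_2), e.g. "cancer"

P_w1, P_w2 = 0.7, 0.3                                  # priors P(omega_1), P(omega_2)
lam_11, lam_12, lam_21, lam_22 = 0.0, 5.0, 1.0, 0.0    # losses lambda(alpha_i | omega_j)

# Threshold theta_b = P(w2)(lam_12 - lam_22) / [P(w1)(lam_21 - lam_11)]
theta_b = (P_w2 * (lam_12 - lam_22)) / (P_w1 * (lam_21 - lam_11))

def decide(x):
    """Decide omega_1 if the likelihood ratio p(x|w1)/p(x|w2) exceeds theta_b."""
    ratio = p_x_given_w1.pdf(x) / p_x_given_w2.pdf(x)
    return "omega_1" if ratio > theta_b else "omega_2"

print(decide(0.5), decide(1.8))
```

With the large loss lam_12 assigned to missing class omega_2, the threshold shifts so that the classifier only decides omega_1 when the evidence strongly favors it.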
Outline
• Bayesian Decision Theory
• Error Bound
• ROC
• Missing Features
• Compound Bayesian Decision Theory
• Max Likelihood Estimation
• Example with Real World Data
Error Bounds
• Exact error calculations could be difficult; it is often easier to estimate error bounds.
• P(error | x) = min[P(ω_1 | x), P(ω_2 | x)], so P(error) = ∫ min[P(ω_1 | x), P(ω_2 | x)] p(x) dx
Error Bounds
• If the class-conditional distributions are Gaussian, then for 0 ≤ β ≤ 1
  P(error) ≤ P^β(ω_1) P^(1−β)(ω_2) e^(−k(β)),
  where e^(−k(β)) = ∫ p^β(x | ω_1) p^(1−β)(x | ω_2) dx and
  k(β) = [β(1−β)/2] (μ_1 − μ_2)^T [βΣ_2 + (1−β)Σ_1]^(−1) (μ_1 − μ_2) + ½ ln( |βΣ_2 + (1−β)Σ_1| / (|Σ_1|^(1−β) |Σ_2|^β) )
Error Bounds
• The Chernoff bound is obtained by minimizing e^(−k(β)) over β ∈ [0, 1].
• This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.
Error Bounds
• The Bhattacharyya bound is obtained by setting β = 0.5:
  P(error) ≤ sqrt(P(ω_1) P(ω_2)) e^(−k(1/2)),
  k(1/2) = (1/8)(μ_2 − μ_1)^T [(Σ_1 + Σ_2)/2]^(−1) (μ_2 − μ_1) + ½ ln( |(Σ_1 + Σ_2)/2| / sqrt(|Σ_1| |Σ_2|) )
• Easier to compute than the Chernoff bound, but looser.
• Note: the Chernoff and Bhattacharyya bounds may not be good (tight) bounds if the densities are not Gaussian.
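A minimal sketch of these two bounds, assuming two Gaussian class-conditional densities with illustrative parameters: the Chernoff bound is found by a 1-D grid search over β, and the Bhattacharyya bound is the β = 0.5 special case.

```python
import numpy as np

def k_beta(beta, mu1, mu2, S1, S2):
    """k(beta) for Gaussian class-conditional densities (see the slide above)."""
    d = mu1 - mu2
    S = beta * S2 + (1 - beta) * S1            # combined covariance
    quad = 0.5 * beta * (1 - beta) * d @ np.linalg.solve(S, d)
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          (np.linalg.det(S1) ** (1 - beta) *
                           np.linalg.det(S2) ** beta))
    return quad + logdet

# Illustrative parameters (assumed, not from the lecture).
mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([2.0, 1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
P1, P2 = 0.6, 0.4

# Chernoff bound: minimize the bound over beta by a 1-D grid search.
betas = np.linspace(0.01, 0.99, 99)
bounds = [P1 ** b * P2 ** (1 - b) * np.exp(-k_beta(b, mu1, mu2, S1, S2))
          for b in betas]
b_star = betas[int(np.argmin(bounds))]
print("Chernoff bound: %.4f at beta=%.2f" % (min(bounds), b_star))

# Bhattacharyya bound: beta = 0.5 (no search needed, but looser).
bhatta = np.sqrt(P1 * P2) * np.exp(-k_beta(0.5, mu1, mu2, S1, S2))
print("Bhattacharyya bound: %.4f" % bhatta)
```

Note that the search over β stays one-dimensional even though the densities here are two-dimensional, which is the point made on the Chernoff slide.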
Receiver Operating Characteristic (ROC) Curve
• Every classifier typically employs some kind of threshold, e.g.
  θ_a = P(ω_2)/P(ω_1),  θ_b = P(ω_2)(λ_12 − λ_22) / [P(ω_1)(λ_21 − λ_11)]
• Changing the threshold will affect the performance of the classifier.
• ROC curves allow us to evaluate the performance of a classifier under different thresholds.
Example: Person Authentication
• Authenticate a person using biometrics (e.g., fingerprints).
• There are two possible distributions (i.e., classes): authentic persons (A) and impostors (I).
Example: Person Authentication
• Possible decisions:
  (1) correct acceptance (true positive): X belongs to A, and we decide A
  (2) incorrect acceptance (false positive): X belongs to I, and we decide A
  (3) correct rejection (true negative): X belongs to I, and we decide I
  (4) incorrect rejection (false negative): X belongs to A, and we decide I
ROC Curve
• FPR: False Positive Rate (x-axis)
• TPR: True Positive Rate (y-axis)
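A minimal sketch of tracing an ROC curve by sweeping a decision threshold and recording (FPR, TPR) pairs. The Gaussian matching-score distributions stand in for the authentication example and are assumed, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic matching scores: authentic (A) scores tend to be higher than impostor (I) scores.
scores_A = rng.normal(2.0, 1.0, 500)   # class A (positive)
scores_I = rng.normal(0.0, 1.0, 500)   # class I (negative)

scores = np.concatenate([scores_A, scores_I])
labels = np.concatenate([np.ones_like(scores_A), np.zeros_like(scores_I)])

# Sweep the threshold; "accept" (decide A) when score >= threshold.
thresholds = np.sort(scores)[::-1]
tpr, fpr = [], []
for t in thresholds:
    accept = scores >= t
    tp = np.sum(accept & (labels == 1))   # correct acceptances
    fp = np.sum(accept & (labels == 0))   # incorrect acceptances
    tpr.append(tp / np.sum(labels == 1))
    fpr.append(fp / np.sum(labels == 0))

# Each threshold yields one (FPR, TPR) point; together they trace the ROC curve.
fpr, tpr = np.array(fpr), np.array(tpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)  # trapezoidal area under the curve
print("AUC ~ %.3f" % auc)
```

Lowering the threshold accepts more of both classes, so FPR and TPR rise together; the curve summarizes this trade-off without committing to a single operating point.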
Missing Features
• Suppose x = (x_1, x_2) is a test vector where x_1 is missing and the observed value of x_2 is x̂_2. How can we classify it?
• If we set x_1 equal to its average value, we will classify x as ω_3.
• But p(x̂_2 | ω_2) is larger; should we classify x as ω_2?
Missing Features
• Suppose x = [x_g, x_b] (x_g: good features, x_b: bad/missing features).
• Derive the Bayes rule using the good features: marginalize the posterior probability over the bad features,
  P(ω_i | x_g) = ∫ p(ω_i, x_g, x_b) dx_b / p(x_g) = ∫ P(ω_i | x_g, x_b) p(x_g, x_b) dx_b / ∫ p(x_g, x_b) dx_b
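A minimal sketch of classifying with a missing feature, assuming 2-D Gaussian class-conditional densities with illustrative parameters. Marginalizing a Gaussian over the missing coordinate simply reduces it to its 1-D marginal in the observed coordinate.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 2-D Gaussian class-conditional densities p(x | omega_i), x = (x1, x2).
classes = {
    "omega_1": {"prior": 0.4, "mu": np.array([0.0, 0.0]), "cov": np.eye(2)},
    "omega_2": {"prior": 0.6, "mu": np.array([2.0, 3.0]),
                "cov": np.array([[1.0, 0.2], [0.2, 2.0]])},
}

x2_observed = 2.5   # x1 is missing, only x2 (the "good" feature x_g) is observed

# Marginalizing a Gaussian over x1 leaves a 1-D Gaussian in x2
# with mean mu[1] and variance cov[1, 1].
posteriors = {}
for name, c in classes.items():
    p_x2_given_w = norm(loc=c["mu"][1], scale=np.sqrt(c["cov"][1, 1])).pdf(x2_observed)
    posteriors[name] = c["prior"] * p_x2_given_w   # proportional to P(omega_i | x_g)

Z = sum(posteriors.values())                       # p(x_g), the normalizer
posteriors = {k: v / Z for k, v in posteriors.items()}
print(posteriors, "-> decide", max(posteriors, key=posteriors.get))
```

This is the marginalization rule from the slide rather than mean imputation: the decision uses p(x̂_2 | ω_i) directly, which avoids the pitfall illustrated on the previous slide.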
Compound Bayesian Decision Theory
• Sequential decision: decide as each pattern (e.g., fish) emerges.
• Compound decision: wait for n patterns (e.g., fish) to emerge, then make all n decisions jointly.
• Could improve performance when consecutive states of nature are not statistically independent.
Compound Bayesian Decision Theory
• Suppose ω = (ω(1), ..., ω(n)) denotes the n states of nature, where each ω(i) can take one of c values ω_1, ω_2, ..., ω_c (i.e., c categories).
• Suppose P(ω) is the prior probability of the n states of nature.
• Suppose X = (x_1, ..., x_n) are the n observed vectors.
• It is unacceptable, in general, to simplify the problem of calculating P(ω) by assuming that the states of nature are independent.
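A minimal sketch of a compound decision for small n: score every joint assignment ω = (ω(1), ..., ω(n)) by P(ω) · p(X | ω) and pick the best one. The two-state Markov prior and 1-D Gaussian observation models are illustrative assumptions, not from the lecture.

```python
import itertools
import numpy as np
from scipy.stats import norm

states = ["omega_1", "omega_2"]                  # c = 2 categories
obs_model = {"omega_1": norm(0.0, 1.0),          # p(x | omega_1)
             "omega_2": norm(2.0, 1.0)}          # p(x | omega_2)

# Assumed Markov prior over consecutive states: states tend to repeat,
# so the states of nature are NOT independent.
p_init = {"omega_1": 0.5, "omega_2": 0.5}
p_trans = {("omega_1", "omega_1"): 0.9, ("omega_1", "omega_2"): 0.1,
           ("omega_2", "omega_1"): 0.1, ("omega_2", "omega_2"): 0.9}

X = [0.3, 1.4, 1.9]                              # n = 3 observed patterns

def joint_score(omega):
    """P(omega) * p(X | omega), assuming observations independent given the states."""
    prior = p_init[omega[0]]
    for a, b in zip(omega, omega[1:]):
        prior *= p_trans[(a, b)]
    lik = np.prod([obs_model[w].pdf(x) for w, x in zip(omega, X)])
    return prior * lik

# Compound decision: search all c^n joint assignments (fine for small n).
best = max(itertools.product(states, repeat=len(X)), key=joint_score)
print("joint decision:", best)
```

Because the prior couples consecutive states, the joint decision can differ from deciding each pattern on its own, which is exactly when the compound approach pays off.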
Intuition
• We could design an optimal classifier if we knew:
  – P(ω_i) (priors)
  – p(x | ω_i) (class-conditional densities)
  – Unfortunately, we rarely have this complete information!
• Design a classifier from training data.
• Samples are often too few for class-conditional estimation (large dimension of feature space).
Supervised Learning in a Nutshell
Statistical Estimation View
• Probabilities to the rescue:
• x and y are random variables
• IID: Independent and Identically Distributed
• Both training and testing data are sampled IID from P(X, Y)
• Learn on the training set
• Have some hope of generalizing to the test set
Parameter Estimation
• Use a priori information about the problem, e.g., normality of p(x | ω_i).
• Simplify the problem:
  – from estimating an unknown distribution function
  – to estimating parameters
Why Gaussians?
• Why does the entire world seem to always be harping on about Gaussians?
  – Central Limit Theorem!
  – They're easy (and we like easy)
  – Closely related to squared loss (for regression)
  – Mixture of Gaussians is sufficient to approximate many distributions
Parameter Estimation
• Maximum likelihood: values of parameters are fixed but unknown.
• Bayesian estimation: parameters are random variables having some known a priori distribution.
Parameter Estimation
• Parameters in ML estimation are fixed but unknown!
• The best parameters are obtained by maximizing the probability of obtaining the samples observed.
• Bayesian methods view the parameters as random variables having some known prior distribution.
• In either approach, we use P(ω_i | x) for our classification rule.
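A minimal sketch contrasting the two views for a 1-D Gaussian with known variance and a conjugate Gaussian prior on the mean (all numbers illustrative): ML returns a single best parameter value, while the Bayesian treatment returns a posterior distribution over it.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                                   # known data standard deviation
data = rng.normal(1.5, sigma, size=20)        # samples drawn with true mean 1.5
n, xbar = len(data), data.mean()

# Maximum likelihood: the mean is a fixed unknown; its ML estimate is the sample mean.
mu_ml = xbar

# Bayesian estimation: the mean is a random variable with prior N(mu0, sigma0^2);
# with a Gaussian likelihood the posterior is also Gaussian (conjugate prior).
mu0, sigma0 = 0.0, 2.0
post_var = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)
post_mean = post_var * (n * xbar / sigma**2 + mu0 / sigma0**2)

print("ML estimate:        mu = %.3f" % mu_ml)
print("Bayesian posterior: mu ~ N(%.3f, %.4f)" % (post_mean, post_var))
```

Either estimate can then be plugged into p(x | ω_i) to obtain the posterior P(ω_i | x) used by the classification rule.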
Maximum Likelihood Estimation: Independence Across Classes
• For each class ω_i we have a proposed density p(x | ω_i, θ_i) with unknown parameters θ_i which we need to estimate.
• Since we assumed independence of data across the classes, estimation is an identical procedure for all classes.
• To simplify notation, we drop sub-indexes and say that we need to estimate parameters θ for the density p(x | θ).
Maximum-Likelihood Estimation
• Has good convergence properties as the sample size increases.
• Simpler than alternative techniques.
• General principle:
  – Assume c datasets (classes) D_1, D_2, ..., D_c drawn independently according to p(x | ω_j).
  – Assume that p(x | ω_j) has a known parametric form determined by the parameter vector θ_j.
  – Further assume that D_i gives no information about θ_j if i ≠ j.
Maximum-Likelihood Estimation
• Use a set D of independent samples to estimate θ.
• Our goal is to determine θ̂ (the value of θ that best agrees with the observed training data).
• Note: if D is fixed, p(D | θ) is a function of θ and is not a density.
Example: Gaussian Case
• Assume we have c classes and p(x | ω_j) ~ N(μ_j, Σ_j).
• Use the information provided by the training samples to estimate θ = (θ_1, ..., θ_c); each θ_j is associated with one category.
• Suppose that D contains n samples, x_1, x_2, ..., x_n.
Maximum-Likelihood Estimation
• p(D | θ) = ∏_{k=1}^{n} p(x_k | θ) is called the likelihood of θ w.r.t. the set of samples.
• The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D | θ):
  "It is the value of θ that best agrees with the actually observed training samples."
Optimal Estimation
• Let θ = (θ_1, ..., θ_p)^T and let ∇_θ = (∂/∂θ_1, ..., ∂/∂θ_p)^T be the gradient operator.
• We define l(θ) = ln p(D | θ) as the log-likelihood function.
• New problem statement: determine θ̂ = arg max_θ l(θ), the θ that maximizes the log-likelihood.
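A minimal sketch of maximum-likelihood estimation for a 1-D Gaussian density with illustrative synthetic data: the log-likelihood l(θ) is maximized where its gradient vanishes, which for a Gaussian gives the sample mean and the (biased) sample variance in closed form; the code checks this against a numerical grid search.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(loc=3.0, scale=1.5, size=200)   # training samples x_1, ..., x_n

def log_likelihood(mu, sigma2, x):
    """l(theta) = ln p(D | theta) = sum_k ln p(x_k | theta) for a 1-D Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Closed-form ML estimates (obtained by setting the gradient of l(theta) to zero):
mu_hat = D.mean()
sigma2_hat = np.mean((D - mu_hat) ** 2)        # note: divides by n, not n - 1

# Numerical check: a grid search over (mu, sigma^2) should agree with the closed form.
mus = np.linspace(2.0, 4.0, 201)
sig2s = np.linspace(0.5, 5.0, 201)
grid = [(log_likelihood(m, s2, D), m, s2) for m in mus for s2 in sig2s]
_, mu_grid, sig2_grid = max(grid)

print("closed form: mu=%.3f sigma^2=%.3f" % (mu_hat, sigma2_hat))
print("grid search: mu=%.3f sigma^2=%.3f" % (mu_grid, sig2_grid))
```

Maximizing ln p(D | θ) rather than p(D | θ) itself is what turns the product over samples into a sum, which is why the log-likelihood is the quantity differentiated in the slides that follow.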