Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Selim Aksoy
Department of Computer Engineering, Bilkent University
saksoy@cs.bilkent.edu.tr

CS 551, Spring 2019
Introduction

◮ Bayesian decision theory shows us how to design an optimal classifier if we know the prior probabilities $P(\omega_i)$ and the class-conditional densities $p(x \mid \omega_i)$.
◮ Unfortunately, we rarely have such complete knowledge of the probabilistic structure.
◮ However, we can often find design samples or training data that include particular representatives of the patterns we want to classify.
Introduction

◮ To simplify the problem, we can assume some parametric form for the conditional densities and estimate these parameters using training data.
◮ Then, we can use the resulting estimates as if they were the true values and perform classification using the Bayesian decision rule.
◮ We will consider only the supervised learning case, where the true class label for each sample is known.
Introduction

◮ We will study two estimation procedures:
  ◮ Maximum likelihood estimation
    ◮ Views the parameters as quantities whose values are fixed but unknown.
    ◮ Estimates these values by maximizing the probability of obtaining the samples observed.
  ◮ Bayesian estimation
    ◮ Views the parameters as random variables having some known prior distribution.
    ◮ Observing new samples converts the prior into a posterior density.
Maximum Likelihood Estimation

◮ Suppose we have a set $\mathcal{D} = \{x_1, \ldots, x_n\}$ of independent and identically distributed (i.i.d.) samples drawn from the density $p(x \mid \theta)$.
◮ We would like to use the training samples in $\mathcal{D}$ to estimate the unknown parameter vector $\theta$.
◮ Define $L(\theta \mid \mathcal{D})$, the likelihood function of $\theta$ with respect to $\mathcal{D}$, as
$$L(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta) = p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
Maximum Likelihood Estimation

◮ The maximum likelihood estimate (MLE) of $\theta$ is, by definition, the value $\hat{\theta}$ that maximizes $L(\theta \mid \mathcal{D})$; it can be computed as
$$\hat{\theta} = \arg\max_{\theta} L(\theta \mid \mathcal{D}).$$
◮ It is often easier to work with the logarithm of the likelihood function (the log-likelihood function), which gives
$$\hat{\theta} = \arg\max_{\theta} \log L(\theta \mid \mathcal{D}) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
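To make this concrete, here is a minimal sketch (not from the original slides) of maximizing the log-likelihood numerically with SciPy for a univariate Gaussian; the synthetic data and starting point are assumptions for illustration only.

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical i.i.d. sample; in practice this is the training set D.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=500)

def neg_log_likelihood(params, x):
    """Negative log-likelihood of a univariate Gaussian N(mu, sigma^2)."""
    mu, log_sigma = params              # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

# The MLE maximizes the log-likelihood, i.e., minimizes its negative.
result = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0],
                           args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the closed-form sample mean and std
```

For distributions with closed-form MLEs (such as the Gaussian on the following slides) this numerical search is unnecessary, but the same pattern applies to models without analytical solutions.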
Maximum Likelihood Estimation

◮ If the number of parameters is $p$, i.e., $\theta = (\theta_1, \ldots, \theta_p)^T$, define the gradient operator
$$\nabla_{\theta} \equiv \left( \frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_p} \right)^T.$$
◮ Then, the MLE of $\theta$ must satisfy the necessary conditions
$$\nabla_{\theta} \log L(\theta \mid \mathcal{D}) = \sum_{i=1}^{n} \nabla_{\theta} \log p(x_i \mid \theta) = 0.$$
Maximum Likelihood Estimation

◮ Properties of MLEs:
  ◮ The MLE is the parameter point for which the observed sample is the most likely.
  ◮ The procedure based on partial derivatives may yield several local extrema; each solution must be checked individually to identify the global maximum.
  ◮ Extrema at the boundary of the parameter space must also be checked separately.
  ◮ Invariance property: if $\hat{\theta}$ is the MLE of $\theta$, then for any function $f(\theta)$, the MLE of $f(\theta)$ is $f(\hat{\theta})$ (see the example below).
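As a small worked example of the invariance property (added here, not part of the original slides): if $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$ is the MLE of a Gaussian variance, then for $f(\sigma^2) = \sqrt{\sigma^2}$ the MLE of the standard deviation is simply $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$, with no separate maximization needed.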
The Gaussian Case

◮ Suppose that $p(x \mid \theta) = N(\mu, \Sigma)$.
◮ When $\Sigma$ is known but $\mu$ is unknown:
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
◮ When both $\mu$ and $\Sigma$ are unknown:
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$
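A minimal NumPy sketch of these closed-form estimates (an added illustration; the two-dimensional synthetic data are assumptions):

```python
import numpy as np

# Hypothetical i.i.d. multivariate Gaussian sample (d = 2).
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=1000)   # shape (n, d)

n = X.shape[0]
mu_hat = X.mean(axis=0)                  # MLE of the mean
centered = X - mu_hat
cov_hat = (centered.T @ centered) / n    # MLE of the covariance (divides by n)

print(mu_hat)
print(cov_hat)
```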
The Bernoulli Case

◮ Suppose that $P(x \mid \theta) = \mathrm{Bernoulli}(\theta) = \theta^x (1 - \theta)^{1-x}$, where $x \in \{0, 1\}$ and $0 \leq \theta \leq 1$.
◮ The MLE of $\theta$ can be computed as
$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$
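The derivation behind this result (spelled out here for completeness) follows directly from setting the derivative of the log-likelihood to zero:
$$\log L(\theta \mid \mathcal{D}) = \sum_{i=1}^{n}\left[x_i \log\theta + (1 - x_i)\log(1-\theta)\right],
\qquad
\frac{\partial}{\partial\theta}\log L(\theta \mid \mathcal{D}) = \frac{\sum_{i} x_i}{\theta} - \frac{n - \sum_{i} x_i}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$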
Bias of Estimators

◮ The bias of an estimator $\hat{\theta}$ is the difference between the expected value of $\hat{\theta}$ and the true value $\theta$.
◮ The MLE of $\mu$ is an unbiased estimator of $\mu$ because $E[\hat{\mu}] = \mu$.
◮ The MLE of $\Sigma$ is not an unbiased estimator of $\Sigma$ because $E[\hat{\Sigma}] = \frac{n-1}{n} \Sigma \neq \Sigma$.
◮ The sample covariance
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$
is an unbiased estimator of $\Sigma$.
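A quick simulation (an added sketch, univariate for simplicity) makes the bias visible: averaging the MLE of the variance over many datasets gives roughly $(n-1)/n$ times the true value, whereas the estimator with the $n-1$ divisor does not show this shrinkage.

```python
import numpy as np

rng = np.random.default_rng(2)
n, true_var, trials = 10, 4.0, 100_000

mle_vars, unbiased_vars = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    mle_vars.append(np.var(x))               # divides by n   (MLE, biased)
    unbiased_vars.append(np.var(x, ddof=1))  # divides by n-1 (unbiased)

print(np.mean(mle_vars))        # ~ (n-1)/n * true_var = 3.6
print(np.mean(unbiased_vars))   # ~ true_var = 4.0
```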
Goodness-of-fit

◮ To measure how well a fitted distribution resembles the sample data (goodness-of-fit), we can use the Kolmogorov-Smirnov test statistic.
◮ It is defined as the maximum absolute difference between the cumulative distribution function estimated from the sample and the one calculated from the fitted distribution.
◮ After estimating the parameters of several candidate distributions, we can compute the Kolmogorov-Smirnov statistic for each and choose the one with the smallest value as the best fit to our sample, as sketched below.
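A hedged SciPy sketch of this procedure (mirroring the Gamma example in Figure 1; the synthetic sample is an assumption): fit candidate families by maximum likelihood, then compare their Kolmogorov-Smirnov statistics.

```python
import numpy as np
from scipy import stats

# Hypothetical sample drawn from a Gamma distribution.
rng = np.random.default_rng(3)
data = rng.gamma(shape=4.0, scale=4.0, size=1000)

# Maximum likelihood fits for two candidate families.
norm_params = stats.norm.fit(data)            # (loc, scale)
gamma_params = stats.gamma.fit(data, floc=0)  # (shape, loc=0, scale)

# Kolmogorov-Smirnov statistic: max |empirical CDF - fitted CDF|.
ks_norm = stats.kstest(data, "norm", args=norm_params).statistic
ks_gamma = stats.kstest(data, "gamma", args=gamma_params).statistic

# The smaller statistic indicates the better fit (here, the Gamma family).
print(ks_norm, ks_gamma)
```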
Maximum Likelihood Estimation Examples

[Figure 1: Histograms of samples and estimated densities for different distributions.
(a) True pdf is $N(10, 4)$; estimated pdf is $N(9.98, 4.05)$.
(b) True pdf is $0.5\,N(10, 0.16) + 0.5\,N(11, 0.25)$; estimated pdf is $N(10.50, 0.47)$.
(c) True pdf is $\mathrm{Gamma}(4, 4)$; estimated pdfs are $N(16.1, 67.4)$ and $\mathrm{Gamma}(3.8, 4.2)$.
(d) Cumulative distribution functions for the example in (c).]
Bayesian Estimation

◮ Suppose the set $\mathcal{D} = \{x_1, \ldots, x_n\}$ contains samples drawn independently from the density $p(x \mid \theta)$, whose form is assumed to be known but whose parameter $\theta$ is not known exactly.
◮ Assume that $\theta$ is a quantity whose variation can be described by the prior probability distribution $p(\theta)$.
Bayesian Estimation

◮ Given $\mathcal{D}$, the prior distribution can be updated to form the posterior distribution using Bayes' rule:
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
where
$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta \quad \text{and} \quad p(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
Bayesian Estimation

◮ The posterior distribution $p(\theta \mid \mathcal{D})$ can be used to find estimates for $\theta$ (e.g., the expected value of $p(\theta \mid \mathcal{D})$ can be used as an estimate for $\theta$).
◮ Then, the conditional density $p(x \mid \mathcal{D})$ can be computed as
$$p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
and can be used in the Bayesian classifier.
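A minimal sketch of these two steps (an added illustration, not the slides' derivation) for a Bernoulli model with a discretized prior over θ: compute the posterior $p(\theta \mid \mathcal{D})$ on a grid, then the predictive $P(x = 1 \mid \mathcal{D})$ by summing over the grid. The observed data and the uniform prior are assumptions for illustration.

```python
import numpy as np

# Hypothetical observed coin flips (1 = success).
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Discretize theta and assume a uniform prior over the grid.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / theta.size

# Likelihood p(D | theta) for i.i.d. Bernoulli samples.
k, n = data.sum(), data.size
likelihood = theta**k * (1.0 - theta)**(n - k)

# Posterior via Bayes' rule, normalized so it sums to 1 over the grid.
posterior = likelihood * prior
posterior /= posterior.sum()

# Posterior mean of theta (a Bayes point estimate).
theta_bayes = np.sum(theta * posterior)

# Predictive density: p(x | D) = sum over theta of p(x | theta) p(theta | D).
# For Bernoulli, p(x=1 | theta) = theta, so P(x=1 | D) equals the posterior mean.
p_x1 = np.sum(theta * posterior)
print(theta_bayes, p_x1)
```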
MLEs vs. Bayes Estimates

◮ Maximum likelihood estimation finds a single estimate of $\theta$ based on the samples in $\mathcal{D}$, but a different sample set would give rise to a different estimate.
◮ The Bayes estimate takes this sampling variability into account.
◮ We assume that we do not know the true value of $\theta$; instead of committing to a single estimate, we take a weighted average of the densities $p(x \mid \theta)$, weighted by the posterior distribution $p(\theta \mid \mathcal{D})$.
The Gaussian Case

◮ Consider the univariate case $p(x \mid \mu) = N(\mu, \sigma^2)$, where $\mu$ is the only unknown parameter and has the prior distribution $p(\mu) = N(\mu_0, \sigma_0^2)$ ($\sigma^2$, $\mu_0$, and $\sigma_0^2$ are all known).
◮ This corresponds to drawing a value for $\mu$ from the population with density $p(\mu)$, treating it as the true value in the density $p(x \mid \mu)$, and drawing samples for $x$ from this density.
The Gaussian Case

◮ Given $\mathcal{D} = \{x_1, \ldots, x_n\}$, we obtain
$$p(\mu \mid \mathcal{D}) \propto \prod_{i=1}^{n} p(x_i \mid \mu)\, p(\mu)
\propto \exp\left[ -\frac{1}{2} \left( \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i + \frac{\mu_0}{\sigma_0^2} \right) \mu \right) \right]
= N(\mu_n, \sigma_n^2)$$
where
$$\mu_n = \left( \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2} \right) \hat{\mu}_n + \left( \frac{\sigma^2}{n \sigma_0^2 + \sigma^2} \right) \mu_0, \qquad
\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n \sigma_0^2 + \sigma^2}.$$
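A short numerical check of these formulas (an added sketch; the prior hyperparameters and the synthetic data are assumptions):

```python
import numpy as np

# Known noise variance and prior hyperparameters (assumed for illustration).
sigma2 = 4.0               # sigma^2, variance of p(x | mu)
mu0, sigma0_2 = 0.0, 1.0   # prior mean and variance of p(mu)

# Hypothetical sample drawn with true mu = 10.
rng = np.random.default_rng(4)
x = rng.normal(10.0, np.sqrt(sigma2), size=50)
n, mu_hat_n = x.size, x.mean()

# Posterior parameters from the formulas above.
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat_n \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n_2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)

print(mu_n, sigma_n_2)   # mu_n is pulled slightly toward the prior mean mu0
```

Note how the posterior mean is a weighted combination of the sample mean and the prior mean, with the weight on the sample mean growing as n increases.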