ML, MAP Estimation and Bayesian Inference
CE-717: Machine Learning, Sharif University of Technology, Fall 2019
Soleymani
Outline
- Introduction
- Maximum-Likelihood (ML) estimation
- Maximum A Posteriori (MAP) estimation
- Bayesian inference
Relation of learning & statistics
- The target model in a learning problem can be viewed as a statistical model.
- For a fixed set of data and an underlying target (statistical model), estimation methods try to recover the target from the available data.
Density estimation
- Estimating the probability density function $p(x)$, given a set of data points $\{x^{(i)}\}_{i=1}^{N}$ drawn from it.
- Main approaches of density estimation:
  - Parametric: assume a parameterized model for the density function; a number of parameters are optimized by fitting the model to the data set.
  - Nonparametric (instance-based): no specific parametric model is assumed; the form of the density function is determined entirely by the data.
Parametric density estimation
- Estimating the probability density function $p(x)$, given a set of data points $\{x^{(i)}\}_{i=1}^{N}$ drawn from it.
- Assume that $p(x)$ has a specific functional form with a number of adjustable parameters.
- Methods for parameter estimation:
  - Maximum Likelihood (ML) estimation
  - Maximum A Posteriori (MAP) estimation
Parametric density estimation
- Goal: estimate the parameters of a distribution from a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$.
- $\mathcal{D}$ contains $N$ independent, identically distributed (i.i.d.) training samples.
- We need to determine $\theta$ given $\{x^{(1)}, \ldots, x^{(N)}\}$.
- How to represent $\theta$: a point estimate $\hat{\theta}$, or a distribution $p(\theta)$?
Example
- $p(x|\theta) = \mathcal{N}(x|\theta, 1)$ [figure]
Maximum Likelihood Estimation (MLE)
- Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
- The likelihood is the conditional probability of the observations $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ given the value of the parameters $\theta$.
- Assuming i.i.d. observations: $p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta)$ (the likelihood of $\theta$ w.r.t. the samples).
- Maximum likelihood estimate: $\hat{\theta}_{ML} = \operatorname*{argmax}_{\theta}\, p(\mathcal{D}|\theta)$
Maximum Likelihood Estimation (MLE)
- $\hat{\theta}_{ML}$ is the value of $\theta$ that best agrees with the observed samples $\mathcal{D}$. [figures]
Maximum Likelihood Estimation (MLE)
- Log-likelihood: $\ell(\theta) = \ln p(\mathcal{D}|\theta) = \ln \prod_{i=1}^{N} p(x^{(i)}|\theta) = \sum_{i=1}^{N} \ln p(x^{(i)}|\theta)$
- $\hat{\theta}_{ML} = \operatorname*{argmax}_{\theta}\, \ell(\theta) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \ln p(x^{(i)}|\theta)$
- Thus, we solve $\nabla_{\theta}\, \ell(\theta) = 0$; this is a necessary condition for an interior optimum, and for the distributions considered below it yields the global maximum.
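A minimal numerical sketch of this idea (not from the slides): instead of solving $\nabla_{\theta}\,\ell(\theta) = 0$ analytically, the negative log-likelihood can be minimized numerically. The toy data, the unit-variance Gaussian model, and the use of scipy.optimize are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data, assumed drawn i.i.d. from a Gaussian with unknown mean and sigma = 1
x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])

def neg_log_likelihood(theta):
    # -l(theta) = -sum_i ln N(x_i | theta, 1)
    return 0.5 * np.sum((x - theta) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)

res = minimize_scalar(neg_log_likelihood)  # numerical alternative to solving the gradient equation
print(res.x, x.mean())                     # both give the sample mean
```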
MLE Bernoulli
- Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$: $m$ heads (1), $N - m$ tails (0)
- $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
- $p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N} \theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$
- $\ln p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \ln p(x^{(i)}|\theta) = \sum_{i=1}^{N}\left\{x^{(i)} \ln\theta + (1-x^{(i)})\ln(1-\theta)\right\}$
- $\dfrac{\partial}{\partial\theta} \ln p(\mathcal{D}|\theta) = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \dfrac{1}{N}\sum_{i=1}^{N} x^{(i)} = \dfrac{m}{N}$
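A small sketch of this closed-form Bernoulli MLE (the coin-toss data here are hypothetical):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])   # hypothetical coin tosses (1 = head)
m, N = x.sum(), len(x)
theta_ml = m / N                    # closed-form MLE: fraction of heads
print(theta_ml)                     # 0.666...
```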
MLE Bernoulli: example
- Example: $\mathcal{D} = \{1,1,1\}$, so $\hat{\theta}_{ML} = 1$
- Prediction: all future tosses will land heads up
- Overfitting to $\mathcal{D}$
MLE: Multinomial distribution
- Multinomial distribution (over a variable with $K$ states):
- Parameter space: $\theta = [\theta_1, \ldots, \theta_K]$, with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$
- Observation (one-hot): $x = [x_1, \ldots, x_K]$, with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$
- $p(x|\theta) = \prod_{k=1}^{K} \theta_k^{x_k}$, so $p(x_k = 1) = \theta_k$ [simplex figure]
MLE: Multinomial distribution
- $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$
- $p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N}\prod_{k=1}^{K} \theta_k^{x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{N_k}$, where $N_k = \sum_{i=1}^{N} x_k^{(i)}$
- Maximize subject to $\sum_{k=1}^{K}\theta_k = 1$ using a Lagrange multiplier: $\ell(\theta, \lambda) = \ln p(\mathcal{D}|\theta) + \lambda\left(1 - \sum_{k=1}^{K}\theta_k\right)$
- $\hat{\theta}_k = \dfrac{N_k}{N}$
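A short sketch of this multinomial MLE, assuming one-hot encoded observations (the data below are made up for illustration):

```python
import numpy as np

# One-hot encoded observations of a K = 3 state variable (hypothetical data)
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
N_k = X.sum(axis=0)          # counts per state
theta_ml = N_k / N_k.sum()   # theta_k = N_k / N
print(theta_ml)              # [0.5, 0.25, 0.25]
```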
MLE Gaussian: unknown $\mu$
- $p(x|\mu) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$
- $\ln p(x^{(i)}|\mu) = -\dfrac{1}{2}\ln(2\pi\sigma^2) - \dfrac{1}{2\sigma^2}\left(x^{(i)}-\mu\right)^2$
- $\dfrac{\partial \ell(\mu)}{\partial\mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\dfrac{1}{\sigma^2}\left(x^{(i)}-\mu\right) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{i=1}^{N} x^{(i)}$
- MLE corresponds to many well-known estimation methods.
MLE Gaussian: unknown $\mu$ and $\sigma$
- $\theta = [\mu, \sigma]$; solve $\nabla_{\theta}\,\ell(\theta) = 0$:
- $\dfrac{\partial \ell(\mu,\sigma)}{\partial\mu} = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{i=1}^{N} x^{(i)}$
- $\dfrac{\partial \ell(\mu,\sigma)}{\partial\sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2_{ML} = \dfrac{1}{N}\sum_{i=1}^{N} \left(x^{(i)} - \hat{\mu}_{ML}\right)^2$
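A brief sketch of these Gaussian MLE formulas (the synthetic data and its generating parameters are illustrative assumptions):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)  # synthetic data

mu_ml = x.mean()                     # (1/N) * sum_i x_i
var_ml = ((x - mu_ml) ** 2).mean()   # (1/N) * sum_i (x_i - mu_ml)^2  (MLE divides by N, not N-1)
print(mu_ml, var_ml)                 # close to 2.0 and 1.5**2 = 2.25
```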
Maximum A Posteriori (MAP) estimation
- MAP estimation: $\hat{\theta}_{MAP} = \operatorname*{argmax}_{\theta}\, p(\theta|\mathcal{D})$
- Since $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\, p(\theta)$: $\hat{\theta}_{MAP} = \operatorname*{argmax}_{\theta}\, p(\mathcal{D}|\theta)\, p(\theta)$
- Example of a prior distribution: $p(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)$
MAP estimation Gaussian: unknown $\mu$
- $p(x|\mu) \sim \mathcal{N}(\mu, \sigma^2)$; $\mu$ is the only unknown parameter
- Prior $p(\mu) \sim \mathcal{N}(\mu_0, \sigma_0^2)$; $\mu_0$ and $\sigma_0$ are known
- $\dfrac{\partial}{\partial\mu}\ln\left[p(\mu)\prod_{i=1}^{N} p(x^{(i)}|\mu)\right] = 0$
- $\Rightarrow \sum_{i=1}^{N}\dfrac{1}{\sigma^2}\left(x^{(i)}-\mu\right) - \dfrac{1}{\sigma_0^2}\left(\mu - \mu_0\right) = 0$
- $\Rightarrow \hat{\mu}_{MAP} = \dfrac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N} x^{(i)}}{1 + N\frac{\sigma_0^2}{\sigma^2}}$
- If $\dfrac{\sigma_0^2}{\sigma^2} \gg 1$ or $N \to \infty$: $\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{i=1}^{N} x^{(i)}$
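A minimal sketch of this MAP formula for the Gaussian mean, using hypothetical observations and assumed values for $\sigma$, $\mu_0$, $\sigma_0$:

```python
import numpy as np

x = np.array([2.3, 1.8, 2.6, 2.1])   # hypothetical observations
sigma, mu0, sigma0 = 1.0, 0.0, 2.0    # assumed known noise std and prior N(mu0, sigma0^2)

r = sigma0**2 / sigma**2
mu_map = (mu0 + r * x.sum()) / (1 + len(x) * r)
mu_ml = x.mean()
print(mu_map, mu_ml)                  # MAP is pulled from the ML estimate toward mu0
```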
Maximum A Posteriori (MAP) estimation
- Given a set of observations $\mathcal{D}$ and a prior distribution $p(\theta)$ on the parameters, the parameter vector that maximizes $p(\mathcal{D}|\theta)\,p(\theta)$ is found. [figure: prior $p(\theta)$, likelihood $p(\mathcal{D}|\theta)$, and the resulting $\hat{\theta}_{MAP}$ vs. $\hat{\theta}_{ML}$]
- For the Gaussian-mean example, the posterior mean is a weighted combination of the prior mean and the ML estimate: $\mu_N = \dfrac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \dfrac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\hat{\mu}_{ML}$
MAP estimation Gaussian: unknown $\mu$ (known $\sigma$)
- $p(\mu|\mathcal{D}) \propto p(\mu)\, p(\mathcal{D}|\mu)$
- $p(\mu|\mathcal{D}) = \mathcal{N}(\mu\,|\,\mu_N, \sigma_N^2)$
- $\mu_N = \dfrac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N} x^{(i)}}{1 + N\frac{\sigma_0^2}{\sigma^2}}$, $\qquad \dfrac{1}{\sigma_N^2} = \dfrac{1}{\sigma_0^2} + \dfrac{N}{\sigma^2}$  [Bishop]
- More samples $\Rightarrow$ sharper $p(\mu|\mathcal{D})$ $\Rightarrow$ higher confidence in the estimate
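A short sketch showing how the posterior over $\mu$ sharpens as $N$ grows (the true mean, noise level, and prior used here are assumptions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, mu0, sigma0 = 1.0, 1.0, 0.0, 1.0   # assumed setting

for N in [1, 10, 100, 1000]:
    x = rng.normal(mu_true, sigma, size=N)
    var_N = 1.0 / (1.0 / sigma0**2 + N / sigma**2)          # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_N = var_N * (mu0 / sigma0**2 + x.sum() / sigma**2)   # posterior mean (same as the slide formula)
    print(N, mu_N, np.sqrt(var_N))                          # sigma_N shrinks as N grows
```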
Conjugate Priors
- We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
- Choose a prior such that the posterior distribution, which is proportional to $p(\mathcal{D}|\theta)\,p(\theta)$, has the same functional form as the prior:
- $\forall \alpha, \mathcal{D} \;\; \exists \alpha'\!: \quad p(\theta|\alpha') \propto p(\mathcal{D}|\theta)\, p(\theta|\alpha)$ (prior and posterior have the same functional form)
Prior for the Bernoulli likelihood
- Beta distribution over $\theta \in [0,1]$:
  $\text{Beta}(\theta|\alpha_1, \alpha_0) = \dfrac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)}\,\theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1} \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$
- Mean: $E[\theta] = \dfrac{\alpha_1}{\alpha_0 + \alpha_1}$; most probable $\theta$ (mode): $\dfrac{\alpha_1 - 1}{(\alpha_1 - 1) + (\alpha_0 - 1)}$
- The Beta distribution is the conjugate prior of the Bernoulli: $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
Beta distribution [figure]
Bernoulli likelihood: posterior
- Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$: $m$ heads (1), $N - m$ tails (0), where $m = \sum_{i=1}^{N} x^{(i)}$
- $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta) = \left[\prod_{i=1}^{N}\theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}\right]\text{Beta}(\theta|\alpha_1,\alpha_0)$
  $\propto \theta^{m+\alpha_1-1}(1-\theta)^{N-m+\alpha_0-1} \propto \theta^{\alpha_1'-1}(1-\theta)^{\alpha_0'-1}$
- $\Rightarrow p(\theta|\mathcal{D}) = \text{Beta}(\theta|\alpha_1', \alpha_0')$ with $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + (N - m)$
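A minimal sketch of this conjugate update (the prior pseudo-counts and tosses are chosen to match the example on the next slide, and are otherwise hypothetical):

```python
# Conjugate update for the Beta-Bernoulli model
a1, a0 = 2, 2                      # prior Beta(alpha_1, alpha_0) pseudo-counts
x = [1, 1, 1]                      # observed tosses
m, N = sum(x), len(x)

a1_post, a0_post = a1 + m, a0 + (N - m)               # posterior is Beta(a1', a0')
theta_map = (a1_post - 1) / (a1_post + a0_post - 2)   # posterior mode
print(a1_post, a0_post, theta_map)                    # 5, 2, 0.8
```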
Example
- Bernoulli: $p(x=1|\theta) = \theta$, i.e. $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
- Prior Beta: $\alpha_0 = \alpha_1 = 2$
- Given $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0); here $\mathcal{D} = \{1,1,1\}$, so $m = 3$, $N = 3$
- Posterior Beta: $\alpha_1' = 5$, $\alpha_0' = 2$
- $\hat{\theta}_{MAP} = \operatorname*{argmax}_{\theta}\, p(\theta|\mathcal{D}) = \dfrac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \dfrac{4}{5}$
Toss example
- MAP estimation can avoid overfitting:
- $\mathcal{D} = \{1,1,1\}$: $\hat{\theta}_{ML} = 1$, while $\hat{\theta}_{MAP} = 0.8$ (with prior $p(\theta) = \text{Beta}(\theta|2,2)$)
Bayesian inference
- Parameters $\theta$ are treated as random variables with a prior distribution.
- Bayesian estimation utilizes the available prior information about the unknown parameters.
- As opposed to ML and MAP estimation, it does not seek a specific point estimate of the unknown parameter vector $\theta$.
- The observed samples $\mathcal{D}$ convert the prior density $p(\theta)$ into a posterior density $p(\theta|\mathcal{D})$.
- It keeps track of beliefs about the values of $\theta$ and uses these beliefs for reaching conclusions.
- In the Bayesian approach, we first compute $p(\theta|\mathcal{D})$ and then obtain the predictive distribution $p(x|\mathcal{D})$ (see the sketch below).
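A minimal sketch of this last step for the Beta-Bernoulli model, where the predictive distribution has a closed form, $p(x=1|\mathcal{D}) = E[\theta|\mathcal{D}] = \alpha_1'/(\alpha_1' + \alpha_0')$; the posterior counts reuse the earlier toss example:

```python
# Posterior predictive for the Beta-Bernoulli model:
# p(x = 1 | D) = E[theta | D] = alpha_1' / (alpha_1' + alpha_0')
a1_post, a0_post = 5, 2            # posterior from the earlier toss example
p_heads = a1_post / (a1_post + a0_post)
print(p_heads)                     # ~0.714: averages over theta instead of plugging in one estimate
```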