

  1. ML, MAP Estimation and Bayesian (CE-717: Machine Learning, Sharif University of Technology, Fall 2019, Soleymani)

  2. Outline
  - Introduction
  - Maximum-Likelihood (ML) estimation
  - Maximum A Posteriori (MAP) estimation
  - Bayesian inference

  3. Relation of learning & statistics
  - The target model in a learning problem can be considered a statistical model.
  - For a fixed set of data and an underlying target (statistical model), estimation methods try to estimate the target from the available data.

  4. Density estimation
  - Estimating the probability density function $p(\mathbf{x})$, given a set of data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
  - Main approaches to density estimation:
    - Parametric: assume a parameterized model for the density function; a number of parameters are optimized by fitting the model to the data set.
    - Nonparametric (instance-based): no specific parametric model is assumed; the form of the density function is determined entirely by the data.

  5. Parametric density estimation
  - Estimating the probability density function $p(\mathbf{x})$, given a set of data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
  - Assume that $p(\mathbf{x})$ has a specific functional form with a number of adjustable parameters.
  - Methods for parameter estimation:
    - Maximum Likelihood (ML) estimation
    - Maximum A Posteriori (MAP) estimation

  6. Parametric density estimation
  - Goal: estimate the parameters of a distribution from a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$.
  - $\mathcal{D}$ contains $N$ independent, identically distributed (i.i.d.) training samples.
  - We need to determine $\boldsymbol{\theta}$ given $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$.
  - How to represent $\boldsymbol{\theta}$: a point estimate $\boldsymbol{\theta}^{*}$ or a distribution $p(\boldsymbol{\theta})$?

  7. Example: $p(x \mid \mu) = \mathcal{N}(x \mid \mu, 1)$ (figure)

  8. Example (figure)

  9. Maximum Likelihood Estimation (MLE)
  - Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
  - The likelihood is the conditional probability of the observations $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$.
  - Assuming i.i.d. observations:
  $$p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)} \mid \boldsymbol{\theta})$$
  (the likelihood of $\boldsymbol{\theta}$ w.r.t. the samples)
  - Maximum Likelihood estimation:
  $$\hat{\boldsymbol{\theta}}_{ML} = \operatorname*{argmax}_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})$$

  10. Maximum Likelihood Estimation (MLE): $\hat{\boldsymbol{\theta}}$ best agrees with the observed samples (figure)

  11. Maximum Likelihood Estimation (MLE): $\hat{\boldsymbol{\theta}}$ best agrees with the observed samples (figure)

  12. Maximum Likelihood Estimation (MLE): $\hat{\boldsymbol{\theta}}$ best agrees with the observed samples (figure)

  13. Maximum Likelihood Estimation (MLE)
  $$\mathcal{L}(\boldsymbol{\theta}) = \ln p(\mathcal{D} \mid \boldsymbol{\theta}) = \ln \prod_{i=1}^{N} p(\mathbf{x}^{(i)} \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p(\mathbf{x}^{(i)} \mid \boldsymbol{\theta})$$
  $$\hat{\boldsymbol{\theta}}_{ML} = \operatorname*{argmax}_{\boldsymbol{\theta}} \; \mathcal{L}(\boldsymbol{\theta}) = \operatorname*{argmax}_{\boldsymbol{\theta}} \; \sum_{i=1}^{N} \ln p(\mathbf{x}^{(i)} \mid \boldsymbol{\theta})$$
  - Thus, we solve $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ to find the optimum.
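
As an illustration (not from the original slides), a minimal numerical sketch of this recipe, assuming numpy and scipy are available: form the i.i.d. log-likelihood of a Gaussian with unknown mean and known unit variance, then maximize it; the numerical optimum can be checked against the sample mean obtained in closed form on slide 18. All variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical example: MLE for the mean of N(mu, 1) by direct optimization.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)   # i.i.d. samples x^(i)

def neg_log_likelihood(mu):
    # L(mu) = sum_i ln p(x^(i) | mu); we minimize its negative.
    return -np.sum(norm.logpdf(data, loc=mu, scale=1.0))

result = minimize_scalar(neg_log_likelihood)
print(result.x, data.mean())  # the numerical optimum matches the sample mean
```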

  14. MLE Bernoulli
  - Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$, $m$ heads (1), $N-m$ tails (0).
  $$p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$$
  $$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} p(x^{(i)} \mid \theta) = \prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}$$
  $$\ln p(\mathcal{D} \mid \theta) = \sum_{i=1}^{N} \ln p(x^{(i)} \mid \theta) = \sum_{i=1}^{N} \left\{ x^{(i)} \ln\theta + \left(1-x^{(i)}\right) \ln\left(1-\theta\right) \right\}$$
  $$\frac{\partial}{\partial\theta} \ln p(\mathcal{D} \mid \theta) = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N} = \frac{m}{N}$$
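
A minimal sketch of the closed-form Bernoulli MLE above, on hypothetical coin-toss data (numpy assumed available):

```python
import numpy as np

# Hypothetical coin-toss data: 1 = heads, 0 = tails.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# theta_ML = m / N, the fraction of heads (closed form derived above).
theta_ml = data.mean()
print(theta_ml)  # 0.7
```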

  15. MLE Bernoulli: example
  - Example: $\mathcal{D} = \{1, 1, 1\}$, so $\hat{\theta}_{ML} = \frac{3}{3} = 1$.
  - Prediction: all future tosses will land heads up.
  - Overfitting to $\mathcal{D}$.

  16. MLE: Multinomial distribution
  - Multinomial distribution (on a variable with $K$ states):
  $$P(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$$
  - Parameter space: $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)$, $\theta_k \in [0,1]$, $\sum_{k=1}^{K} \theta_k = 1$, $P(x_k = 1) = \theta_k$.
  - $\mathbf{x} = (x_1, \dots, x_K)$, $x_k \in \{0,1\}$, $\sum_{k=1}^{K} x_k = 1$ (one-hot encoding).
  (figure over $\theta_1, \theta_2, \theta_3$)

  17. MLE: Multinomial distribution
  - $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\}$
  $$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} P(\mathbf{x}^{(i)} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{\sum_{i=1}^{N} x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{N_k}$$
  - where $N_k = \sum_{i=1}^{N} x_k^{(i)}$ and $\sum_{k=1}^{K} N_k = N$.
  - Maximizing subject to $\sum_{k=1}^{K} \theta_k = 1$ with a Lagrange multiplier:
  $$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D} \mid \boldsymbol{\theta}) + \lambda\Big(1 - \sum_{k=1}^{K} \theta_k\Big) \;\Rightarrow\; \hat{\theta}_k = \frac{N_k}{N}$$
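
A minimal sketch of the multinomial MLE above, on hypothetical categorical data; the normalized counts $N_k / N$ are the closed-form solution (numpy assumed available):

```python
import numpy as np

# Hypothetical categorical data: each sample is one of K = 3 states (0, 1, 2).
data = np.array([0, 2, 1, 1, 2, 2, 0, 2, 1, 2])
K = 3

# theta_k = N_k / N: normalized category counts (closed form derived above).
counts = np.bincount(data, minlength=K)   # N_k for each state
theta_ml = counts / counts.sum()
print(theta_ml)  # [0.2 0.3 0.5]
```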

  18. MLE Gaussian: unknown $\mu$
  $$p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}
  \qquad
  \ln p(x^{(i)} \mid \mu) = -\ln\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}\left(x^{(i)}-\mu\right)^2$$
  $$\frac{\partial \mathcal{L}(\mu)}{\partial \mu} = 0
  \;\Rightarrow\;
  \frac{\partial}{\partial \mu}\sum_{i=1}^{N} \ln p(x^{(i)} \mid \mu) = 0
  \;\Rightarrow\;
  \sum_{i=1}^{N} \frac{1}{\sigma^2}\left(x^{(i)}-\mu\right) = 0
  \;\Rightarrow\;
  \hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$$
  - MLE corresponds to many well-known estimation methods.

  19. MLE Gaussian: unknown $\mu$ and $\sigma$
  - $\boldsymbol{\theta} = (\mu, \sigma)$, $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$:
  $$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$$
  $$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial \sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)} - \hat{\mu}_{ML}\right)^2$$
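
A minimal sketch of the Gaussian MLE above on hypothetical data; note that the MLE of the variance divides by $N$ (the biased estimator), matching the derivation, not by $N-1$ (numpy assumed available):

```python
import numpy as np

# Hypothetical 1-D data assumed to be i.i.d. Gaussian.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)

# Closed-form MLE derived above: sample mean and biased sample variance (divide by N).
mu_ml = data.mean()
sigma2_ml = np.mean((data - mu_ml) ** 2)   # equivalently data.var(ddof=0)
print(mu_ml, sigma2_ml)
```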

  20. Maximum A Posteriori (MAP) estimation
  - MAP estimation:
  $$\hat{\boldsymbol{\theta}}_{MAP} = \operatorname*{argmax}_{\boldsymbol{\theta}} \; p(\boldsymbol{\theta} \mid \mathcal{D})$$
  - Since $p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$:
  $$\hat{\boldsymbol{\theta}}_{MAP} = \operatorname*{argmax}_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$$
  - Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$

  21. MAP estimation, Gaussian with unknown $\mu$
  - $p(x \mid \mu) \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the only unknown parameter.
  - Prior: $p(\mu) \sim \mathcal{N}(\mu_0, \sigma_0^2)$, where $\mu_0$ and $\sigma_0$ are known.
  $$\frac{d}{d\mu}\left[\ln p(\mu)\prod_{i=1}^{N} p(x^{(i)} \mid \mu)\right] = 0
  \;\Rightarrow\;
  \sum_{i=1}^{N}\frac{1}{\sigma^2}\left(x^{(i)}-\mu\right) - \frac{1}{\sigma_0^2}\left(\mu-\mu_0\right) = 0$$
  $$\Rightarrow\;
  \hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}$$
  - If $\frac{\sigma_0^2}{\sigma^2} \gg 1$ or $N \to \infty$, then $\hat{\mu}_{MAP} = \hat{\mu}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N}$.
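
A minimal sketch of the closed-form MAP estimate above, on hypothetical data with an assumed known noise level and an assumed Gaussian prior; it also shows the pull toward the prior mean relative to the ML estimate:

```python
import numpy as np

# Hypothetical data assumed i.i.d. N(mu, sigma^2) with known sigma, and a N(mu0, sigma0^2) prior on mu.
rng = np.random.default_rng(0)
sigma = 1.0                       # known observation noise
mu0, sigma0 = 0.0, 0.5            # prior mean and std over mu
data = rng.normal(loc=2.0, scale=sigma, size=20)

N = data.size
ratio = sigma0**2 / sigma**2
mu_map = (mu0 + ratio * data.sum()) / (1.0 + ratio * N)   # closed form from the slide
mu_ml = data.mean()
print(mu_map, mu_ml)   # MAP is pulled toward mu0; the two converge as N grows
```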

  22. Maximum A Posteriori (MAP) estimation
  - Given a set of observations $\mathcal{D}$ and a prior distribution $p(\boldsymbol{\theta})$ on the parameters, the parameter vector that maximizes $p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$ is found.
  (figures: likelihood $p(\mathcal{D} \mid \theta)$ with a prior; in one case $\hat{\theta}_{MAP} \cong \hat{\theta}_{ML}$, in the other $\hat{\theta}_{MAP} > \hat{\theta}_{ML}$)
  $$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\hat{\mu}_{ML}$$

  23. MAP estimation, Gaussian with unknown $\mu$ (known $\sigma$)
  $$p(\mu \mid \mathcal{D}) \propto p(\mu)\, p(\mathcal{D} \mid \mu)
  \qquad
  p(\mu \mid \mathcal{D}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$$
  $$\mu_N = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}
  \qquad
  \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
  (figure: prior $p(\mu)$ and posteriors for increasing $N$ [Bishop])
  - More samples $\Rightarrow$ sharper $p(\mu \mid \mathcal{D})$, i.e. higher confidence in the estimate.
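
A minimal sketch of the posterior update above on hypothetical data, showing how the posterior variance shrinks as more samples arrive (numpy assumed available):

```python
import numpy as np

# Hypothetical sequential posterior over mu for the known-sigma Gaussian model above.
rng = np.random.default_rng(0)
sigma = 1.0
mu0, sigma0 = 0.0, 1.0
data = rng.normal(loc=2.0, scale=sigma, size=50)

for N in (1, 5, 50):
    x = data[:N]
    ratio = sigma0**2 / sigma**2
    mu_N = (mu0 + ratio * x.sum()) / (1.0 + ratio * N)
    var_N = 1.0 / (1.0 / sigma0**2 + N / sigma**2)
    print(N, mu_N, var_N)   # posterior variance shrinks as N grows
```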

  24. Conjugate priors
  - We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
  - Choose a prior such that the posterior distribution, which is proportional to $p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$, has the same functional form as the prior:
  $$\forall \boldsymbol{\beta}, \mathcal{D} \;\; \exists \boldsymbol{\beta}' \quad P(\boldsymbol{\theta} \mid \boldsymbol{\beta}') \propto P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\beta})$$
  (prior and posterior have the same functional form)

  25. Prior for the Bernoulli likelihood
  - Beta distribution over $\theta \in [0,1]$:
  $$\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)}\, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \;\propto\; \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$$
  - Mean: $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$; most probable $\theta$ (mode): $\hat{\theta} = \frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)}$.
  - The Beta distribution is the conjugate prior of the Bernoulli: $P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$.
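
A minimal sketch checking the Beta mean and mode above with scipy.stats.beta; the hyperparameter values here are an arbitrary illustrative choice:

```python
from scipy.stats import beta

# Hypothetical prior Beta(theta | alpha1=2, alpha0=2); scipy's beta(a, b) takes a in the role
# of alpha1 (successes) and b in the role of alpha0 (failures).
alpha1, alpha0 = 2.0, 2.0
prior = beta(alpha1, alpha0)

print(prior.mean())                                   # alpha1 / (alpha0 + alpha1) = 0.5
print((alpha1 - 1) / ((alpha0 - 1) + (alpha1 - 1)))   # mode = 0.5
print(prior.pdf(0.5))                                 # density at theta = 0.5
```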

  26. Beta distribution (figure)

  27. Bernoulli likelihood: posterior
  - Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$, $m$ heads (1), $N-m$ tails (0).
  $$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)
  = \left[\prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}\right] \mathrm{Beta}(\theta \mid \alpha_1, \alpha_0)
  \propto \theta^{m+\alpha_1-1} (1-\theta)^{N-m+\alpha_0-1}$$
  - With $m = \sum_{i=1}^{N} x^{(i)}$:
  $$p(\theta \mid \mathcal{D}) \propto \mathrm{Beta}(\theta \mid \alpha_1', \alpha_0'), \qquad \alpha_1' = \alpha_1 + m, \quad \alpha_0' = \alpha_0 + N - m$$

  28. Example
  - Bernoulli likelihood $p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$, prior $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0)$ with $\alpha_0 = \alpha_1 = 2$.
  - Given $\mathcal{D} = \{1, 1, 1\}$: $N = 3$, $m = 3$, so the posterior is $\mathrm{Beta}(\theta \mid \alpha_1', \alpha_0')$ with $\alpha_1' = 5$, $\alpha_0' = 2$.
  $$\hat{\theta}_{MAP} = \operatorname*{argmax}_{\theta} \; P(\theta \mid \mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$$
  (figures: prior, likelihood $p(x = 1 \mid \theta)$, and posterior as functions of $\theta$)

  29. Toss example
  - MAP estimation can avoid overfitting:
    - $\mathcal{D} = \{1, 1, 1\}$, $\hat{\theta}_{ML} = 1$
    - $\hat{\theta}_{MAP} = 0.8$ (with prior $p(\theta) = \mathrm{Beta}(\theta \mid 2, 2)$)
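
A minimal sketch contrasting the ML and MAP estimates for this toss example (numpy assumed available); the MAP formula is the Beta-posterior mode from slide 28:

```python
import numpy as np

# Hypothetical comparison of ML and MAP for the three-heads example above.
data = np.array([1, 1, 1])
N, m = data.size, int(data.sum())
alpha1, alpha0 = 2.0, 2.0          # Beta(2, 2) prior

theta_ml = m / N                                             # 1.0
theta_map = (m + alpha1 - 1) / (N + alpha1 + alpha0 - 2)     # 4 / 5 = 0.8
print(theta_ml, theta_map)   # the prior pulls the MAP estimate away from the extreme 1.0
```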

  30. Bayesian inference
  - Parameters $\boldsymbol{\theta}$ are treated as random variables with an a priori distribution.
  - Bayesian estimation utilizes the available prior information about the unknown parameter.
  - As opposed to ML and MAP estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$.
  - The observed samples $\mathcal{D}$ convert the prior density $p(\boldsymbol{\theta})$ into a posterior density $p(\boldsymbol{\theta} \mid \mathcal{D})$.
  - We keep track of beliefs about the values of $\boldsymbol{\theta}$ and use these beliefs to reach conclusions.
  - In the Bayesian approach, we first specify $p(\boldsymbol{\theta} \mid \mathcal{D})$ and then compute the predictive distribution $p(\mathbf{x} \mid \mathcal{D})$, as sketched below.
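
As an illustration beyond the slides shown here, a minimal sketch of the Bayesian predictive distribution for the Beta-Bernoulli model used earlier: $p(x = 1 \mid \mathcal{D}) = \int \theta\, p(\theta \mid \mathcal{D})\, d\theta$, which for a Beta posterior is simply its mean $\alpha_1' / (\alpha_1' + \alpha_0')$ (scipy assumed available):

```python
import numpy as np
from scipy.stats import beta

# Hypothetical Beta-Bernoulli example: the predictive probability of heads is the posterior mean,
# p(x = 1 | D) = alpha1' / (alpha1' + alpha0').
alpha1, alpha0 = 2.0, 2.0
data = np.array([1, 1, 1])
N, m = data.size, int(data.sum())

posterior = beta(alpha1 + m, alpha0 + N - m)        # Beta(5, 2)
p_heads = posterior.mean()                          # 5 / 7, about 0.714

# Compare: MLE predicts 1.0 and MAP predicts 0.8; the Bayesian prediction averages over theta.
print(p_heads)
```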
