


  1. Lecture 8: − Maximum Likelihood Estimation (MLE) (cont’d.) − Maximum a posteriori (MAP) estimation − Naïve Bayes Classifier Aykut Erdem March 2016 Hacettepe University

  2. Last time… Flipping a Coin: I have a coin; if I flip it, what's the probability that it will land heads up? Let us flip it a few times to estimate the probability. "Frequency of heads": the estimated probability is 3/5. (slide by Barnabás Póczos & Alex Smola)

  3. Last time… Flipping a Coin: the estimated probability is the "frequency of heads", 3/5. Questions: (1) Why frequency of heads? (2) How good is this estimation? (3) Why is this a machine learning problem? We are going to answer these questions. (slide by Barnabás Póczos & Alex Smola)

  4. Question (1): Why frequency of heads?
 • Frequency of heads is exactly the maximum likelihood estimator for this problem
 • MLE has nice properties (interpretation, statistical guarantees, simple)
 (slide by Barnabás Póczos & Alex Smola)

  5. MLE for Bernoulli distribution. Data: D = the observed sequence of flips. P(Heads) = θ, P(Tails) = 1 - θ. Flips are i.i.d.: independent events, identically distributed according to a Bernoulli distribution. MLE: choose θ that maximizes the probability of the observed data. (slide by Barnabás Póczos & Alex Smola)
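
A minimal sketch of the standard Bernoulli likelihood being maximized, assuming D contains α_H heads and α_T tails:

    P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T},
    \qquad
    \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D \mid \theta)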

  9. Maximum Likelihood Estimation. MLE: choose θ that maximizes the probability of observed data; independent draws, identically distributed. (slide by Barnabás Póczos & Alex Smola)

  14. Maximum Likelihood Estimation. MLE: choose θ that maximizes the probability of observed data. That's exactly the "frequency of heads". (slide by Barnabás Póczos & Alex Smola)
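
A minimal sketch of the derivation behind this claim, using the Bernoulli likelihood above (standard result; the slide's own derivation is not reproduced here):

    \log P(D \mid \theta) = \alpha_H \log \theta + \alpha_T \log (1 - \theta)
    \frac{d}{d\theta} \log P(D \mid \theta) = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1 - \theta} = 0
    \quad\Rightarrow\quad
    \hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_H}{\alpha_H + \alpha_T}

With 3 heads and 2 tails this gives 3/5, exactly the frequency of heads from the earlier slides.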

  17. Question (2) • How good is this MLE estimation? (slide by Barnabás Póczos & Alex Smola)

  18. How many flips do I need? I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped 30 heads and 20 tails? • Which estimator should we trust more? • The more the merrier? (slide by Barnabás Póczos & Alex Smola)

  19. Simple bound. Let θ* be the true parameter. For n = α_H + α_T and for any ε > 0, Hoeffding's inequality bounds the probability that the estimate deviates from θ* by more than ε. (slide by Barnabás Póczos & Alex Smola)
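
The standard two-sided Hoeffding bound for this setting (a sketch stated from the standard result, not transcribed from the slide):

    P\left( |\hat{\theta} - \theta^{*}| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^{2}}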

  20. Probably Approximately Correct (PAC) Learning. I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1 - δ = 0.95. How many flips do I need? Sample complexity: (slide by Barnabás Póczos & Alex Smola)
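
Setting the Hoeffding bound 2 e^{-2nε²} ≤ δ and solving for n gives the sample complexity n ≥ ln(2/δ) / (2ε²). A small Python sketch of that calculation (the function name is illustrative, not from the slides):

    import math

    def flips_needed(eps, delta):
        # Smallest n such that 2 * exp(-2 * n * eps**2) <= delta (Hoeffding bound)
        return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

    # epsilon = 0.1 error, probability at least 1 - delta = 0.95
    print(flips_needed(0.1, 0.05))  # -> 185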

  21. Question (3): Why is this a machine learning problem? It is learning: • improving performance (accuracy of the predicted probability) • at some task (predicting the probability of heads) • with experience (the more flips we make, the better we are). (slide by Barnabás Póczos & Alex Smola)

  22. What about continuous features? Let us try Gaussians… [figure: Gaussian densities with mean µ = 0 and different variances σ²] (slide by Barnabás Póczos & Alex Smola)

  23. MLE for Gaussian mean and variance. Choose θ = (µ, σ²) that maximizes the probability of the observed data: independent draws, identically distributed. (slide by Barnabás Póczos & Alex Smola)
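
The closed-form estimates this maximization yields (standard results, stated here as a sketch):

    \hat{\mu}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i,
    \qquad
    \hat{\sigma}^{2}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^{2}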

  24. MLE for Gaussian mean and variance. Note: the MLE for the variance of a Gaussian is biased [the expected result of the estimation is not the true parameter!]. Unbiased variance estimator: divide by n - 1 rather than n. (slide by Barnabás Póczos & Alex Smola)
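
A small Python illustration of the bias, assuming NumPy is available; ddof=0 gives the MLE (divide by n), ddof=1 the unbiased estimator (divide by n - 1):

    import numpy as np

    rng = np.random.default_rng(0)
    true_var, n, trials = 4.0, 5, 200_000

    # Many small samples from a Gaussian with known variance
    samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
    mle_var = samples.var(axis=1, ddof=0).mean()       # divides by n (biased low)
    unbiased_var = samples.var(axis=1, ddof=1).mean()  # divides by n - 1

    print(mle_var)       # close to (n - 1) / n * true_var = 3.2
    print(unbiased_var)  # close to true_var = 4.0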

  25. What about prior knowledge? (MAP Estimation) (slide by Barnabás Póczos & Aarti Singh)

  26. What about prior knowledge? We know the coin is "close" to 50-50. What can we do now? The Bayesian way… Rather than estimating a single θ, we obtain a distribution over possible values of θ. [figure: distribution over θ before data (peaked near 50-50) and after data] (slide by Barnabás Póczos & Aarti Singh)

  27. Prior distribution. What prior? What distribution do we want for a prior?
 • Represents expert knowledge (philosophical approach)
 • Simple posterior form (engineer's approach)
 Uninformative priors: • Uniform distribution
 Conjugate priors: • Closed-form representation of posterior • P(θ) and P(θ|D) have the same form
 (slide by Barnabás Póczos & Aarti Singh)

  28. In order to proceed we will need: Bayes Rule. (slide by Barnabás Póczos & Aarti Singh)

  29. Chain Rule & Bayes Rule. Chain rule and Bayes rule (stated in the sketch below); Bayes rule is important for reverse conditioning. (slide by Barnabás Póczos & Aarti Singh)
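
The two rules written out (a sketch of the standard forms):

    Chain rule:   P(A, B) = P(A \mid B)\, P(B)
    Bayes rule:   P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}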

  30. Bayesian Learning. • Use Bayes rule • Or equivalently: posterior ∝ likelihood × prior. (slide by Barnabás Póczos & Aarti Singh)

  31. MAP estimation for Binomial distribution. Coin flip problem: the likelihood is Binomial. If the prior is a Beta distribution, then the posterior is a Beta distribution. P(θ) and P(θ|D) have the same form! [Conjugate prior] (slide by Barnabás Póczos & Aarti Singh)
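
A sketch of the conjugacy being described, with a Beta(β_H, β_T) prior and α_H heads, α_T tails observed (the hyperparameter symbols are illustrative, not necessarily the slide's):

    P(\theta) \propto \theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}
    P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1} (1 - \theta)^{\alpha_T + \beta_T - 1}
        = \mathrm{Beta}(\alpha_H + \beta_H,\; \alpha_T + \beta_T)
    \hat{\theta}_{\mathrm{MAP}} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \alpha_T + \beta_H + \beta_T - 2}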

  32. Beta distribution. More concentrated as the values of α, β increase. (slide by Barnabás Póczos & Aarti Singh)

  33. Beta conjugate prior. As n = α_H + α_T increases, i.e., as we get more samples, the effect of the prior is "washed out". (slide by Barnabás Póczos & Aarti Singh)
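
A small Python sketch of this washing out: with a fixed Beta prior, the posterior mean approaches the empirical frequency as the number of flips grows (the prior hyperparameters below are illustrative):

    # Beta(prior_h, prior_t) prior pulled toward 50-50; observe 60% heads.
    prior_h, prior_t = 10.0, 10.0

    for n in (5, 50, 500, 5000):
        heads = 0.6 * n
        posterior_mean = (heads + prior_h) / (n + prior_h + prior_t)
        print(n, round(posterior_mean, 3))
    # 5 -> 0.52, 50 -> 0.571, 500 -> 0.596, 5000 -> 0.6 (prior effect fades)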

  34. Han Solo and Bayesian Priors. C3PO: "Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!" Han: "Never tell me the odds!" https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors

  35. MLE vs. MAP. Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of the observed data. Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and prior belief. When is MAP the same as MLE? (slide by Barnabás Póczos & Aarti Singh)
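
In symbols (a sketch; this answers the slide's closing question under the usual convention):

    \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D \mid \theta),
    \qquad
    \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)

If the prior P(θ) is uniform (e.g., Beta(1, 1) for the coin), the two estimates coincide; they also agree in the limit as the data overwhelm the prior.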

  36. From Binomial to Multinomial. Example: dice roll problem (6 outcomes instead of 2). The likelihood is Multinomial(θ = {θ_1, θ_2, ..., θ_k}). If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution. For the Multinomial, the conjugate prior is the Dirichlet distribution. http://en.wikipedia.org/wiki/Dirichlet_distribution (slide by Barnabás Póczos & Aarti Singh)
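
A sketch of the corresponding formulas, with outcome counts α_1, ..., α_k and Dirichlet hyperparameters β_1, ..., β_k (the symbols are chosen here to mirror the coin case, not taken from the slide):

    P(D \mid \theta) \propto \prod_{j=1}^{k} \theta_j^{\alpha_j},
    \qquad
    P(\theta) = \mathrm{Dir}(\beta_1, \ldots, \beta_k) \propto \prod_{j=1}^{k} \theta_j^{\beta_j - 1}
    P(\theta \mid D) = \mathrm{Dir}(\alpha_1 + \beta_1, \ldots, \alpha_k + \beta_k)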

  37. Bayesians vs. Frequentists. Bayesian to frequentist: "You are no good when the sample is small." Frequentist to Bayesian: "You give a different answer for different priors." (slide by Barnabás Póczos & Aarti Singh)

  38. Recap: What about prior knowledge? (MAP Estimation) (slide by Barnabás Póczos & Aarti Singh)

  39. Recap: What about prior knowledge? We know the coin is "close" to 50-50. What can we do now? The Bayesian way… Rather than estimating a single θ, we obtain a distribution over possible values of θ. [figure: distribution over θ before data and after data] (slide by Barnabás Póczos & Aarti Singh)

  40. Recap: Chain Rule & Bayes Rule. Chain rule and Bayes rule as stated earlier. (slide by Barnabás Póczos & Aarti Singh)

  41. Recap: Bayesian Learning. D is the measured data; our goal is to estimate the parameter θ. • Use Bayes rule • Or equivalently: posterior ∝ likelihood × prior. (slide by Barnabás Póczos & Aarti Singh)

  42. Recap: MAP estimation for Binomial distribution. In the coin flip problem, the likelihood is Binomial; if the prior is Beta, then the posterior is a Beta distribution. (slide by Barnabás Póczos & Aarti Singh)
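
A short Python sketch pulling the recap together: MLE and MAP estimates for the running coin example with a Beta prior (the prior values below are illustrative, not taken from the slides):

    def mle(heads, tails):
        # Maximum likelihood estimate: the frequency of heads
        return heads / (heads + tails)

    def map_estimate(heads, tails, prior_h=5.0, prior_t=5.0):
        # Mode of the Beta(heads + prior_h, tails + prior_t) posterior
        return (heads + prior_h - 1.0) / (heads + tails + prior_h + prior_t - 2.0)

    print(mle(3, 2))           # 0.6, the "frequency of heads"
    print(map_estimate(3, 2))  # about 0.538, pulled toward 0.5 by the prior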
