Lecture 8: − Maximum Likelihood Estimation (MLE) (cont'd) − Maximum a Posteriori (MAP) Estimation − Naïve Bayes Classifier. Aykut Erdem, March 2016, Hacettepe University
Last time… Flipping a Coin. I have a coin; if I flip it, what is the probability that it will land heads up? Let us flip it a few times to estimate the probability. The estimated probability is the "frequency of heads": 3/5. (slide by Barnabás Póczos & Alex Smola)
Last time… Flipping a Coin. The estimated probability, the "frequency of heads", is 3/5. Questions: (1) Why frequency of heads? (2) How good is this estimate? (3) Why is this a machine learning problem? We are going to answer these questions. (slide by Barnabás Póczos & Alex Smola)
Question (1): Why frequency of heads? • The frequency of heads is exactly the maximum likelihood estimator for this problem. • MLE has nice properties (clear interpretation, statistical guarantees, simplicity). (slide by Barnabás Póczos & Alex Smola)
MLE for Bernoulli distribution. Data: D = {x_1, …, x_n}, the observed sequence of flips, with P(Heads) = θ and P(Tails) = 1 − θ. Flips are i.i.d.: independent events, identically distributed according to a Bernoulli distribution. MLE: choose the θ that maximizes the probability of the observed data. (slide by Barnabás Póczos & Alex Smola)
Maximum Likelihood Estimation. MLE: choose θ that maximizes the probability of the observed data: θ̂_MLE = arg max_θ P(D | θ). Since the flips are independent draws, P(D | θ) = ∏_i P(x_i | θ); since they are identically distributed, this product collapses to θ^α_H (1 − θ)^α_T, where α_H and α_T count the heads and tails. (slide by Barnabás Póczos & Alex Smola)
Maximum Likelihood Estimation. Maximizing the log-likelihood α_H log θ + α_T log(1 − θ) and setting its derivative with respect to θ to zero gives θ̂_MLE = α_H / (α_H + α_T). That's exactly the "frequency of heads". (slide by Barnabás Póczos & Alex Smola)
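A minimal sketch of this estimator in Python; the flip sequence below is illustrative, chosen to match the 3-heads, 2-tails example:

```python
# MLE for a Bernoulli coin: the frequency of heads.
flips = ["H", "H", "T", "H", "T"]  # illustrative data: 3 heads, 2 tails

alpha_H = flips.count("H")
alpha_T = flips.count("T")

theta_mle = alpha_H / (alpha_H + alpha_T)
print(theta_mle)  # 0.6, i.e., 3/5
```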
Question (2): How good is this MLE estimate? (slide by Barnabás Póczos & Alex Smola)
How many flips do I need? I flipped the coin 5 times: 3 heads, 2 tails, so θ̂_MLE = 3/5. What if instead I had flipped 30 heads and 20 tails? Then θ̂_MLE = 30/50 = 3/5, the same value. Which estimate should we trust more? The more the merrier? (slide by Barnabás Póczos & Alex Smola)
Simple bound. Let θ* be the true parameter. For n = α_H + α_T flips and any ε > 0, Hoeffding's inequality gives P(|θ̂_MLE − θ*| ≥ ε) ≤ 2e^(−2nε²). (slide by Barnabás Póczos & Alex Smola)
Probably Approximately Correct (PAC) Learning. I want to know the coin parameter θ within ε = 0.1 error with probability at least 1 − δ = 0.95. How many flips do I need? Sample complexity: setting the Hoeffding bound 2e^(−2nε²) ≤ δ and solving for n gives n ≥ ln(2/δ) / (2ε²). (slide by Barnabás Póczos & Alex Smola)
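A quick check of this sample-complexity formula in Python, using the ε and δ from the slide:

```python
import math

# Hoeffding: P(|theta_hat - theta*| >= eps) <= 2 * exp(-2 * n * eps**2).
# Setting the right-hand side equal to delta and solving for n:
eps, delta = 0.1, 0.05

n = math.log(2 / delta) / (2 * eps ** 2)
print(math.ceil(n))  # 185 flips suffice for eps = 0.1, delta = 0.05
```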
Question (3): Why is this a machine learning problem? We improve performance (the accuracy of the estimated probability) at some task (predicting the probability of heads) with experience (the more coins we flip, the better we get). (slide by Barnabás Póczos & Alex Smola)
What about continuous features? [Figure: Gaussian densities for various means µ and variances σ².] Let us try Gaussians… (slide by Barnabás Póczos & Alex Smola)
MLE for Gaussian mean and variance. Choose θ = (µ, σ²) that maximizes the probability of the observed data: θ̂_MLE = arg max_(µ,σ²) ∏_i p(x_i | µ, σ²), using independent, identically distributed draws. (slide by Barnabás Póczos & Alex Smola)
MLE for Gaussian mean and variance. The solutions are µ̂_MLE = (1/n) Σ_i x_i and σ̂²_MLE = (1/n) Σ_i (x_i − µ̂_MLE)². Note: the MLE for the variance of a Gaussian is biased [the expected result of the estimation is not the true parameter!]. Unbiased variance estimator: σ̂²_unbiased = (1/(n−1)) Σ_i (x_i − µ̂_MLE)². (slide by Barnabás Póczos & Alex Smola)
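A small simulation makes the bias visible; this sketch (with illustrative parameters) compares the divide-by-n and divide-by-(n−1) estimators:

```python
import random

# Average the two variance estimators over many small samples drawn from
# a Gaussian with known variance; the MLE (divide by n) underestimates it.
random.seed(0)
true_mu, true_sigma2, n, trials = 0.0, 4.0, 5, 100_000

mle_avg = unbiased_avg = 0.0
for _ in range(trials):
    xs = [random.gauss(true_mu, true_sigma2 ** 0.5) for _ in range(n)]
    mu_hat = sum(xs) / n
    ss = sum((x - mu_hat) ** 2 for x in xs)
    mle_avg += ss / n             # biased MLE
    unbiased_avg += ss / (n - 1)  # unbiased estimator

print(mle_avg / trials)       # about 3.2 = (n-1)/n * 4.0
print(unbiased_avg / trials)  # about 4.0
```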
What about prior knowledge? (MAP Estimation) (slide by Barnabás Póczos & Aarti Singh)
What about prior knowledge? We know the coin is "close" to 50-50. What can we do now? The Bayesian way: rather than estimating a single θ, we obtain a distribution over possible values of θ. [Figure: a prior peaked near 50-50 before seeing data, and a sharper, shifted posterior after data.] (slide by Barnabás Póczos & Aarti Singh)
Prior distribution. What prior? What distribution do we want for a prior? • Represents expert knowledge (philosophical approach) • Simple posterior form (engineer's approach). Uninformative priors: • Uniform distribution. Conjugate priors: • Closed-form representation of the posterior • P(θ) and P(θ|D) have the same form. (slide by Barnabás Póczos & Aarti Singh)
In order to proceed we will need: Bayes Rule. (slide by Barnabás Póczos & Aarti Singh)
Chain Rule & Bayes Rule. Chain rule: P(X, Y) = P(X) P(Y|X) = P(Y) P(X|Y). Bayes rule: P(X|Y) = P(Y|X) P(X) / P(Y). Bayes rule is important for reverse conditioning. (slide by Barnabás Póczos & Aarti Singh)
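A tiny numeric illustration of reverse conditioning, with made-up numbers: we know P(positive test | disease) but want P(disease | positive test):

```python
# Bayes rule turns P(evidence | hypothesis) into P(hypothesis | evidence).
p_disease = 0.01             # prior P(disease); illustrative
p_pos_given_disease = 0.95   # illustrative likelihood
p_pos_given_healthy = 0.05

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161: far below 0.95
```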
Bayesian Learning. Use Bayes rule: P(θ|D) = P(D|θ) P(θ) / P(D). Or equivalently: P(θ|D) ∝ P(D|θ) P(θ), i.e., posterior ∝ likelihood × prior. (slide by Barnabás Póczos & Aarti Singh)
MAP estimation for Binomial distribution. Coin flip problem: the likelihood is Binomial, P(D|θ) ∝ θ^α_H (1 − θ)^α_T. If the prior is a Beta distribution, P(θ) ∝ θ^(β_H − 1) (1 − θ)^(β_T − 1), then the posterior is also a Beta distribution, Beta(β_H + α_H, β_T + α_T). P(θ) and P(θ|D) have the same form! [Conjugate prior] (slide by Barnabás Póczos & Aarti Singh)
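A sketch of the resulting MAP estimate (the posterior mode), with an illustrative prior; the prior acts like extra "hallucinated" flips added to the observed counts:

```python
# MAP for the coin: mode of the Beta(beta_H + alpha_H, beta_T + alpha_T)
# posterior. Prior parameters below are illustrative.
alpha_H, alpha_T = 3, 2   # observed: 3 heads, 2 tails
beta_H, beta_T = 5, 5     # prior belief: coin is close to fair

theta_map = (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)
theta_mle = alpha_H / (alpha_H + alpha_T)
print(theta_mle, round(theta_map, 3))  # 0.6 vs 0.538: MAP pulled toward 0.5
```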
Beta distribution. [Figure: Beta(α, β) densities for several parameter settings.] The distribution becomes more concentrated as the values of α and β increase. (slide by Barnabás Póczos & Aarti Singh)
Beta conjugate prior. P(θ|D) = Beta(β_H + α_H, β_T + α_T). As n = α_H + α_T increases, the effect of the prior is "washed out": as we get more samples, the data dominates the posterior. (slide by Barnabás Póczos & Aarti Singh)
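This washing-out is easy to see numerically; a sketch with an illustrative prior and a fixed heads fraction of 0.6:

```python
# Posterior mean of Beta(alpha_H + beta_H, alpha_T + beta_T) as n grows.
beta_H, beta_T = 5, 5  # illustrative prior: coin close to fair

for alpha_H, alpha_T in [(3, 2), (30, 20), (300, 200)]:
    post_mean = (alpha_H + beta_H) / (alpha_H + beta_H + alpha_T + beta_T)
    print(alpha_H + alpha_T, round(post_mean, 3))
# n=5: 0.533, n=50: 0.583, n=500: 0.598 -> converging to the MLE, 0.6
```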
Han Solo and Bayesian Priors. C-3PO: "Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!" Han: "Never tell me the odds!" https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors
MLE vs. MAP. Maximum likelihood estimation (MLE): choose the value that maximizes the probability of the observed data, θ̂_MLE = arg max_θ P(D|θ). Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and prior belief, θ̂_MAP = arg max_θ P(θ|D) = arg max_θ P(D|θ) P(θ). When is MAP the same as MLE? (When the prior is uniform: the P(θ) factor is constant and drops out.) (slide by Barnabás Póczos & Aarti Singh)
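A one-line check of that equivalence under a flat prior (Beta(1, 1) is the uniform distribution on [0, 1]):

```python
# With a uniform prior, the MAP estimate reduces to the MLE.
alpha_H, alpha_T = 3, 2
beta_H, beta_T = 1, 1  # Beta(1, 1) == Uniform[0, 1]

theta_map = (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)
print(theta_map)  # 0.6 == alpha_H / (alpha_H + alpha_T), the MLE
```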
From Binomial to Multinomial. Example: dice roll problem (6 outcomes instead of 2). The likelihood is Multinomial(θ = {θ_1, θ_2, …, θ_k}): P(D|θ) ∝ θ_1^α_1 θ_2^α_2 ⋯ θ_k^α_k. If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution: for the Multinomial, the conjugate prior is the Dirichlet. http://en.wikipedia.org/wiki/Dirichlet_distribution (slide by Barnabás Póczos & Aarti Singh)
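The same pseudo-count picture carries over; a sketch of the Dirichlet-Multinomial MAP estimate for a die, with illustrative counts and prior:

```python
# MAP under a Dirichlet(a_1, ..., a_6) prior: each face gets a_i - 1
# pseudo-counts added to its observed count.
counts = [3, 5, 2, 4, 1, 5]   # illustrative observed rolls per face
prior = [2, 2, 2, 2, 2, 2]    # Dirichlet(2,...,2): mild pull toward fair

denom = sum(c + a - 1 for c, a in zip(counts, prior))
theta_map = [(c + a - 1) / denom for c, a in zip(counts, prior)]
print([round(t, 3) for t in theta_map])  # probabilities sum to 1
```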
Bayesians vs. Frequentists. The frequentist's complaint to the Bayesian: "You give a different answer for different priors." The Bayesian's reply: "You are no good when the sample is small." (slide by Barnabás Póczos & Aarti Singh)
Recap: What about prior knowledge? (MAP Estimation) (slide by Barnabás Póczos & Aarti Singh)
Recap: What about prior knowledge? We know the coin is "close" to 50-50. What can we do now? The Bayesian way: rather than estimating a single θ, we obtain a distribution over possible values of θ. [Figure: prior before data, sharper posterior after data.] (slide by Barnabás Póczos & Aarti Singh)
Recap: Chain Rule & Bayes Rule. Chain rule: P(X, Y) = P(X) P(Y|X) = P(Y) P(X|Y). Bayes rule: P(X|Y) = P(Y|X) P(X) / P(Y). (slide by Barnabás Póczos & Aarti Singh)
Recap: Bayesian Learning. D is the measured data; our goal is to estimate the parameter θ. Use Bayes rule: P(θ|D) = P(D|θ) P(θ) / P(D). Or equivalently: P(θ|D) ∝ P(D|θ) P(θ), posterior ∝ likelihood × prior. (slide by Barnabás Póczos & Aarti Singh)
Recap: MAP estimation for Binomial distribution. In the coin flip problem: the likelihood is Binomial, P(D|θ) ∝ θ^α_H (1 − θ)^α_T. If the prior is Beta(β_H, β_T), then the posterior is the Beta distribution Beta(β_H + α_H, β_T + α_T). (slide by Barnabás Póczos & Aarti Singh)