

  1. Applied Machine Learning Maximum Likelihood and Bayesian Reasoning Siamak Ravanbakhsh COMP 551 (fall 2020)

  2. Objectives: understand what it means to learn a probabilistic model of the data, using the maximum likelihood principle and using Bayesian inference (prior, posterior, posterior predictive); MAP inference; Beta-Bernoulli conjugate pairs.

  3. Parameter estimation: a thumbtack's head/tail outcome has a Bernoulli distribution, $\text{Bernoulli}(x \mid \theta) = \theta^x (1-\theta)^{1-x}$, where $x = 1$ is heads and $x = 0$ is tails. This is our probabilistic model of some head/tail IID data $D = \{0, 0, 1, 1, 0, 0, 1, 0, 0, 1\}$. Objective: learn the model parameter $\theta$. Since we are only interested in the counts, we can also use the Binomial distribution $\text{Binomial}(N_h \mid N, \theta) = \binom{N}{N_h} \theta^{N_h} (1-\theta)^{N - N_h}$, where $N_h = \sum_{x \in D} x$ is the number of heads, $N_t = N - N_h$ the number of tails, and $N = |D|$.
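A minimal sketch of this setup (assuming Python with numpy and scipy, which the slides do not prescribe): the counts and the two equivalent likelihoods for the dataset above.

```python
import numpy as np
from scipy.stats import binom

D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])  # IID head/tail data from the slide
N = len(D)                                    # N = |D|
N_h = D.sum()                                 # number of heads
N_t = N - N_h                                 # number of tails

theta = 0.3                                   # an arbitrary candidate parameter
bern_lik = theta**N_h * (1 - theta)**N_t      # product of Bernoulli terms
binom_lik = binom.pmf(N_h, N, theta)          # Binomial(N_h | N, theta)

# The two differ only by the binomial coefficient C(N, N_h), a constant in theta,
# so both lead to the same parameter estimate.
print(N_h, N_t, bern_lik, binom_lik)
```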

  4. Maximum likelihood: with the same Bernoulli model $\text{Bernoulli}(x \mid \theta) = \theta^x (1-\theta)^{1-x}$ and IID data $D = \{0, 0, 1, 1, 0, 0, 1, 0, 0, 1\}$, the objective is to learn the model parameter $\theta$. Idea (max-likelihood assignment): find the parameter $\theta$ that maximizes the probability of observing $D$. The likelihood is a function of $\theta$: $L(\theta; D) = \prod_{x \in D} \text{Bernoulli}(x \mid \theta) = \theta^4 (1-\theta)^6$. Note that this is not a probability density!

  5. Maximizing log-likelihood: the likelihood is $L(\theta; D) = \prod_{x \in D} p(x; \theta)$. Using a product here creates extreme values: for 100 samples in our example, the likelihood shrinks below 1e-30. The log-likelihood has the same maximum but is well-behaved: $\ell(\theta; D) = \log L(\theta; D) = \sum_{x \in D} \log p(x; \theta)$. How do we find the max-likelihood parameter $\theta^* = \arg\max_\theta \ell(\theta; D)$? For some simple models we can get a closed-form solution; for complex models we need numerical optimization, as in the sketch below.
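A hedged sketch (assuming numpy/scipy; the 100 synthetic flips below stand in for the slide's 100-sample example) of the underflow problem and of finding $\theta^*$ by numerical optimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
D = rng.binomial(1, 0.4, size=100)          # 100 synthetic coin flips (illustrative)

def neg_log_lik(theta):
    # negative Bernoulli log-likelihood; minimizing it maximizes l(theta; D)
    return -np.sum(D * np.log(theta) + (1 - D) * np.log(1 - theta))

raw_lik = np.prod(np.where(D == 1, 0.4, 0.6))   # raw likelihood at theta = 0.4: shrinks toward underflow
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(raw_lik, res.x, D.mean())                 # numerical optimum agrees with the closed form
```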

  6. Maximizing log-likelihood: $\ell(\theta; D) = \log L(\theta; D) = \sum_{x \in D} \log \text{Bernoulli}(x; \theta)$. Observation: at the maximum, the derivative of $\ell(\theta; D)$ is zero. Idea: set the derivative to zero and solve for $\theta$. Example, max-likelihood $\theta^*$ for the Bernoulli: $\frac{\partial}{\partial \theta} \ell(\theta; D) = \frac{\partial}{\partial \theta} \sum_{x \in D} \log\!\left(\theta^x (1-\theta)^{1-x}\right) = \frac{\partial}{\partial \theta} \sum_x \big(x \log\theta + (1-x)\log(1-\theta)\big) = \sum_x \left(\frac{x}{\theta} - \frac{1-x}{1-\theta}\right) = 0$, which gives $\theta^{\text{MLE}} = \frac{\sum_{x \in D} x}{|D|}$: simply the proportion of heads in our dataset.
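A quick numerical check (a sketch assuming numpy) that the derivative really vanishes at the closed-form estimate:

```python
import numpy as np

D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])
theta_mle = D.mean()                         # sum_x x / |D| = 4/10

def dll(theta):
    # derivative of the Bernoulli log-likelihood from the slide
    return np.sum(D / theta - (1 - D) / (1 - theta))

print(theta_mle, dll(theta_mle))             # 0.4, derivative is 0 at the MLE
```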

  7. Bayesian approach: the max-likelihood estimate does not reflect our uncertainty: e.g., $\theta^* = 0.2$ for both 1/5 heads and 1000/5000 heads. In the Bayesian approach we maintain a distribution over parameters, the prior $p(\theta)$; after observing $D$ we update this distribution to the posterior $p(\theta \mid D)$ using Bayes rule. How to do this update? $p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{p(D)}$, where $p(\theta)$ is the prior, $p(D \mid \theta)$ is the likelihood of the data (previously denoted $L(\theta; D)$), and the evidence $p(D) = \int p(\theta)\, p(D \mid \theta)\, d\theta$ is a normalization.
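When no closed form is available, this update can be carried out numerically; a small sketch (assuming numpy, discretizing $\theta$ on a grid) of Bayes rule with the evidence as the normalizer:

```python
import numpy as np

D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])
N_h = D.sum()
N_t = len(D) - N_h

theta = np.linspace(0.001, 0.999, 999)            # grid over the parameter
prior = np.ones_like(theta)                       # a flat prior p(theta), up to a constant
lik = theta**N_h * (1 - theta)**N_t               # p(D | theta)

unnorm = prior * lik                              # p(theta) p(D | theta)
evidence = unnorm.sum() * (theta[1] - theta[0])   # approximates p(D) = integral of p(theta) p(D | theta)
posterior = unnorm / evidence                     # p(theta | D)
print(theta[np.argmax(posterior)])                # peaks near 0.4 under the flat prior
```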

  8. Conjugate priors: in our running example we know the form of the likelihood, $p(D \mid \theta) = \prod_{x \in D} \text{Bernoulli}(x; \theta) = \theta^{N_h} (1-\theta)^{N_t}$. What should the prior $p(\theta)$ and posterior $p(\theta \mid D)$ be? We want the prior and posterior to have the same form (so that we can easily update our belief with new observations). This gives the form $p(\theta \mid a, b) \propto \theta^a (1-\theta)^b$, where $\propto$ means there is a normalization constant that does not depend on $\theta$. A distribution of this form has a name, the Beta distribution; we say the Beta distribution is a conjugate prior to the Bernoulli likelihood.

  9. Beta distribution: the Beta distribution has the density $\text{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}$, with $\alpha, \beta > 0$; the factor $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ is the normalization. $\Gamma$ is the generalization of the factorial to real numbers: $\Gamma(a+1) = a\,\Gamma(a)$. $\text{Beta}(\theta \mid \alpha = \beta = 1)$ is the uniform distribution. The mean of the distribution is $\mathbb{E}[\theta] = \frac{\alpha}{\alpha+\beta}$; for $\alpha, \beta > 1$ the distribution is unimodal and its mode is $\frac{\alpha-1}{\alpha+\beta-2}$.
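A quick check of the stated mean and mode against scipy.stats.beta (a sketch; the values α = 3, β = 5 are arbitrary):

```python
import numpy as np
from scipy.stats import beta

a, b = 3.0, 5.0
print(beta.mean(a, b), a / (a + b))              # both 0.375: mean is alpha / (alpha + beta)

theta = np.linspace(0.001, 0.999, 999)
numerical_mode = theta[np.argmax(beta.pdf(theta, a, b))]
print(numerical_mode, (a - 1) / (a + b - 2))     # both ~ 1/3: mode is (alpha - 1) / (alpha + beta - 2)

print(beta.pdf(0.7, 1, 1))                       # alpha = beta = 1 is uniform: density 1.0 everywhere
```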

  10. Beta-Bernoulli conjugate pair: prior $p(\theta) = \text{Beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$; likelihood $p(D \mid \theta) = \theta^{N_h}(1-\theta)^{N_t}$ (the product of Bernoulli likelihoods, equivalent to a Binomial likelihood); posterior $p(\theta \mid D) = \text{Beta}(\theta \mid \alpha + N_h, \beta + N_t) \propto \theta^{\alpha+N_h-1}(1-\theta)^{\beta+N_t-1}$. $\alpha, \beta$ are called pseudo-counts: their effect is similar to imaginary observations of $\alpha$ heads and $\beta$ tails.
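Because the pair is conjugate, the posterior update is just adding the observed counts to the pseudo-counts; a minimal sketch (assuming numpy and an arbitrary Beta(2, 2) prior):

```python
import numpy as np

alpha, beta_prior = 2.0, 2.0                 # pseudo-counts of the Beta prior
D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])
N_h = D.sum()
N_t = len(D) - N_h

alpha_post = alpha + N_h                     # posterior is Beta(alpha + N_h, beta + N_t)
beta_post = beta_prior + N_t
print(alpha_post, beta_post)                 # 6.0, 8.0
```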

  11. Effect of more data: with few observations, the prior has a high influence; as we increase the number of observations $N = |D|$, the effect of the prior diminishes and the likelihood term dominates the posterior. Example: with prior $p(\theta) = \text{Beta}(\theta \mid 5, 5)$, the posterior is $p(\theta \mid D) = \text{Beta}(\theta \mid 5 + N_h, 5 + N_t) \propto \theta^{5+N_h-1} (1-\theta)^{5+N_t-1}$. [Figure: plot of the posterior density for increasing numbers of observations.]

  12. Posterior predictive: our goal was to estimate the parameters $\theta$ so that we can make predictions $p(x \mid \theta)$, but now we have a (posterior) distribution over parameters $p(\theta \mid D)$. Rather than using a single parameter in $p(x \mid \theta)$, we calculate the average prediction, the posterior predictive $p(x \mid D) = \int_\theta p(\theta \mid D)\, p(x \mid \theta)\, d\theta$: for each possible $\theta$, weight the prediction by the posterior probability of that parameter being true.
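The averaging integral can be approximated on a grid; a sketch (assuming numpy/scipy, reusing the Beta(6, 8) posterior from the running example). The closed form for this case appears on the next slide.

```python
import numpy as np
from scipy.stats import beta

alpha_post, beta_post = 6.0, 8.0                 # posterior from the running example
theta = np.linspace(0.001, 0.999, 999)
weight = beta.pdf(theta, alpha_post, beta_post)  # posterior density p(theta | D)
p_head = theta                                   # p(x = 1 | theta) for a Bernoulli

# posterior predictive: integral of p(theta | D) p(x = 1 | theta) d theta
pred = np.sum(weight * p_head) * (theta[1] - theta[0])
print(pred)                                      # ~ 6 / 14 = 0.4286
```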

  13. Posterior predictive for Beta-Bernoulli: start from a Beta prior $p(\theta) = \text{Beta}(\theta \mid \alpha, \beta)$; after observing $N_h$ heads and $N_t$ tails, the posterior is $p(\theta \mid D) = \text{Beta}(\theta \mid \alpha + N_h, \beta + N_t)$. What is the probability that the next coin flip is heads? $p(x = 1 \mid D) = \int_\theta \text{Bernoulli}(x = 1 \mid \theta)\, \text{Beta}(\theta \mid \alpha + N_h, \beta + N_t)\, d\theta = \int_\theta \theta\, \text{Beta}(\theta \mid \alpha + N_h, \beta + N_t)\, d\theta = \frac{\alpha + N_h}{\alpha + \beta + N}$, the mean of the Beta distribution. Compare with the prediction of maximum likelihood, $p(x = 1 \mid D) = \frac{N_h}{N}$. If we assume a uniform prior, the posterior predictive is $p(x = 1 \mid D) = \frac{N_h + 1}{N + 2}$ (Laplace smoothing).
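The three predictions compared on the running dataset (a sketch assuming numpy; the Beta(2, 2) prior is an arbitrary choice):

```python
import numpy as np

D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])
N = len(D)
N_h = D.sum()
alpha, beta_prior = 2.0, 2.0

mle_pred = N_h / N                                     # maximum likelihood: 0.4
bayes_pred = (alpha + N_h) / (alpha + beta_prior + N)  # posterior predictive: 6/14
laplace_pred = (N_h + 1) / (N + 2)                     # uniform prior, Laplace smoothing: 5/12
print(mle_pred, bayes_pred, laplace_pred)
```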

  14. Strength of the prior: with a strong prior we need many samples to really change the posterior. For the Beta distribution, $\alpha + \beta$ decides how strong the prior is; different priors mean different prior strengths $\alpha + \beta$. Example: as our dataset grows, our estimate becomes more accurate. [Figure: posterior predictive $p(x = 1 \mid D)$ as a function of $N$ for priors of different strength, converging to the true value; example from Koller & Friedman.]

  15. Maximum a Posteriori (MAP): sometimes it is difficult to work with the full posterior distribution over parameters $p(\theta \mid D)$. Alternative: use the parameter with the highest posterior probability, the MAP estimate $\theta^{\text{MAP}} = \arg\max_\theta p(\theta \mid D) = \arg\max_\theta p(\theta)\, p(D \mid \theta)$. Compare with the max-likelihood estimate $\theta^{\text{MLE}} = \arg\max_\theta p(D \mid \theta)$; the only difference is the prior term. Example: for the posterior $p(\theta \mid D) = \text{Beta}(\theta \mid \alpha + N_h, \beta + N_t)$, the MAP estimate is the mode of the posterior, $\theta^{\text{MAP}} = \frac{\alpha + N_h - 1}{\alpha + \beta + N_h + N_t - 2}$. Compare with $\theta^{\text{MLE}} = \frac{N_h}{N_h + N_t}$; they are equal for the uniform prior $\alpha = \beta = 1$.
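MAP and MLE side by side for the Beta-Bernoulli case (a sketch assuming numpy); with the uniform prior α = β = 1 the two coincide:

```python
import numpy as np

D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])
N_h = D.sum()
N_t = len(D) - N_h

def theta_map(alpha, beta_prior):
    # mode of the posterior Beta(alpha + N_h, beta + N_t)
    return (alpha + N_h - 1) / (alpha + beta_prior + N_h + N_t - 2)

theta_mle = N_h / (N_h + N_t)
print(theta_mle, theta_map(1, 1), theta_map(5, 5))   # 0.4, 0.4, ~0.44
```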

  16. Categorical distribution: what if we have more than two categories (e.g., a loaded die instead of a coin)? Instead of the Bernoulli we have the multinoulli or categorical distribution with $K$ categories, $\text{Cat}(x \mid \theta) = \prod_{k=1}^K \theta_k^{\mathbb{I}(x = k)}$, where $\theta$ belongs to the probability simplex, $\sum_k \theta_k = 1$. For $K = 3$: $\theta_1 + \theta_2 + \theta_3 = 1$.

  17. Maximum likelihood for the categorical distribution: likelihood $p(D \mid \theta) = \prod_{x \in D} \text{Cat}(x \mid \theta)$; log-likelihood $\ell(\theta; D) = \sum_{x \in D} \sum_k \mathbb{I}(x = k) \log \theta_k$. We need to solve $\frac{\partial}{\partial \theta_k} \ell(\theta; D) = 0$ subject to $\sum_k \theta_k = 1$. Similar to the binary case, the max-likelihood estimate is given by the data frequencies, $\theta_k^{\text{MLE}} = \frac{N_k}{N}$. Example: for a categorical distribution with $K = 8$, the observed frequencies are the max-likelihood parameter estimates, e.g. $\theta_5^{\text{MLE}} = .149$.
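The categorical MLE is again a vector of empirical frequencies; a sketch (assuming numpy, with synthetic K = 8 data rather than the slide's dataset):

```python
import numpy as np

K = 8
rng = np.random.default_rng(0)
D = rng.integers(0, K, size=1000)            # synthetic categorical observations
counts = np.bincount(D, minlength=K)         # N_k for each category
theta_mle = counts / counts.sum()            # theta_k^MLE = N_k / N
print(theta_mle, theta_mle.sum())            # frequencies, which sum to 1
```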

  18. Dirichlet distribution (optional): the Dirichlet is a distribution over the parameters $\theta$ of a categorical distribution; it is a generalization of the Beta distribution to $K$ categories, so it is a distribution over the probability simplex $\sum_k \theta_k = 1$. Its density is $\text{Dir}(\theta \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$, where $\frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}$ is the normalization constant and $\alpha$ is a vector of pseudo-counts for the $K$ categories (aka concentration parameters), with $\alpha_k > 0\ \forall k$. For $\alpha = [1, \ldots, 1]$ we get the uniform distribution; for $K = 2$ it reduces to the Beta distribution. [Figure: $\text{Dir}(\theta \mid [.2, .2, .2])$ on the $K = 3$ simplex.]
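A quick check with scipy.stats.dirichlet (a sketch) that α = [1, 1, 1] gives a constant density over the simplex and that the K = 2 case matches the Beta density:

```python
from scipy.stats import dirichlet, beta

# alpha = [1, 1, 1]: the density is the same constant everywhere on the simplex
print(dirichlet.pdf([0.2, 0.3, 0.5], [1, 1, 1]),
      dirichlet.pdf([0.6, 0.3, 0.1], [1, 1, 1]))   # both 2.0

# for K = 2 the Dirichlet reduces to a Beta distribution
a, b = 3.0, 5.0
theta = 0.7
print(dirichlet.pdf([theta, 1 - theta], [a, b]), beta.pdf(theta, a, b))   # matching densities
```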

  19. Dirichlet-Categorical conjugate pair (optional): the Dirichlet distribution $\text{Dir}(\theta \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$ is a conjugate prior for the categorical distribution $\text{Cat}(x \mid \theta) = \prod_k \theta_k^{\mathbb{I}(x = k)}$. Prior: $p(\theta) = \text{Dir}(\theta \mid \alpha) \propto \prod_k \theta_k^{\alpha_k - 1}$. Likelihood: observing $N_1, \ldots, N_K$ values from each category, $p(D \mid \theta) = \prod_k \theta_k^{N_k}$. Posterior: $p(\theta \mid D) = \text{Dir}(\theta \mid \alpha + N) \propto \prod_k \theta_k^{\alpha_k + N_k - 1}$; again, we add the real counts to the pseudo-counts. Posterior predictive: $p(x = k \mid D) = \frac{\alpha_k + N_k}{\sum_{k'} (\alpha_{k'} + N_{k'})}$. MAP: $\theta_k^{\text{MAP}} = \frac{\alpha_k + N_k - 1}{\sum_{k'} (\alpha_{k'} + N_{k'}) - K}$.
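The same add-the-counts update, posterior predictive, and MAP estimate in code (a sketch assuming numpy; the K = 3 counts are hypothetical):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet pseudo-counts (uniform prior)
N = np.array([12, 7, 1])                 # hypothetical observed counts N_k, K = 3

alpha_post = alpha + N                   # posterior is Dir(theta | alpha + N)
pred = alpha_post / alpha_post.sum()     # posterior predictive p(x = k | D)
theta_map = (alpha_post - 1) / (alpha_post.sum() - len(alpha))   # mode of the posterior
print(pred, theta_map)                   # with a uniform prior, theta_map equals the MLE N_k / N
```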
