Applied Machine Learning
Maximum Likelihood and Bayesian Reasoning
Siamak Ravanbakhsh
COMP 551 (fall 2020)
Objectives
understand what it means to learn a probabilistic model of the data:
using the maximum likelihood principle
using Bayesian inference: prior, posterior, posterior predictive
MAP inference
Beta-Bernoulli conjugate pairs
Parameter estimation
a thumbtack's head/tail outcome has a Bernoulli distribution ($x = 1$ for heads, $x = 0$ for tails):
$\mathrm{Bernoulli}(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$
this is our probabilistic model of some head/tail IID data $D = \{0, 0, 1, 1, 0, 0, 1, 0, 0, 1\}$
Objective: learn the model parameter $\theta$
since we are only interested in the counts, we can also use the Binomial distribution:
$\mathrm{Binomial}(N_h \mid N, \theta) = \binom{N}{N_h} \theta^{N_h} (1 - \theta)^{N - N_h}$
where $N_h = \sum_{x \in D} x$ is the number of heads, $N_t = N - N_h$ the number of tails, and $N = |D|$
Maximum likelihood
a thumbtack's head/tail outcome has a Bernoulli distribution: $\mathrm{Bernoulli}(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$
this is our probabilistic model of some head/tail IID data $D = \{0, 0, 1, 1, 0, 0, 1, 0, 0, 1\}$
Objective: learn the model parameter $\theta$
Idea (max-likelihood assignment): find the parameter $\theta$ that maximizes the probability of observing $D$
the likelihood is a function of $\theta$: $L(\theta; D) = \prod_{x \in D} \mathrm{Bernoulli}(x \mid \theta) = \theta^4 (1 - \theta)^6$
note that this is not a probability density!
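As a minimal sketch (not part of the slides; it assumes Python with numpy), we can evaluate this likelihood on the ten-flip dataset for a few candidate values of $\theta$:

```python
import numpy as np

# the head/tail dataset from the slide: 4 heads, 6 tails
D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])

def likelihood(theta, data):
    """Bernoulli likelihood L(theta; D) = prod_x theta^x (1 - theta)^(1 - x)."""
    return np.prod(theta ** data * (1 - theta) ** (1 - data))

# evaluate a few candidate parameters; theta = 0.4 gives the largest value,
# matching theta^4 (1 - theta)^6 for 4 heads and 6 tails
for theta in [0.2, 0.4, 0.6]:
    print(f"theta = {theta:.1f}   L = {likelihood(theta, D):.6f}")
```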
Maximizing log-likelihood
likelihood: $L(\theta; D) = \prod_{x \in D} p(x; \theta)$
using a product here creates extreme values: for 100 samples in our example, the likelihood shrinks below 1e-30
the log-likelihood has the same maximum but it is well-behaved:
$\ell(\theta; D) = \log L(\theta; D) = \sum_{x \in D} \log p(x; \theta)$
how do we find the max-likelihood parameter? $\theta^* = \arg\max_\theta \ell(\theta; D)$
for some simple models we can get the closed-form solution; for complex models we need to use numerical optimization
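A small numerical illustration of this point (assuming numpy; the 100 synthetic flips below are made-up data, not the course example): the raw likelihood underflows toward zero while the log-likelihood stays in a comfortable range.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.4, size=100)   # 100 synthetic coin flips

theta = 0.4
L = np.prod(theta ** data * (1 - theta) ** (1 - data))               # raw likelihood
ll = np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))   # log-likelihood

print(L)    # tiny value, roughly on the order of 1e-30
print(ll)   # a moderate negative number, numerically well-behaved
```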
Maximizing log-likelihood
log-likelihood: $\ell(\theta; D) = \log L(\theta; D) = \sum_{x \in D} \log \mathrm{Bernoulli}(x; \theta)$
observation: at the maximum, the derivative of $\ell(\theta; D)$ is zero
idea: set the derivative to zero and solve for $\theta$
example: max-likelihood $\theta^*$ for Bernoulli
$\frac{\partial}{\partial \theta} \ell(\theta; D) = \frac{\partial}{\partial \theta} \sum_{x \in D} \log\!\big(\theta^x (1 - \theta)^{1 - x}\big) = \frac{\partial}{\partial \theta} \sum_{x \in D} \big(x \log\theta + (1 - x)\log(1 - \theta)\big) = \sum_{x \in D} \frac{x}{\theta} - \frac{1 - x}{1 - \theta} = 0$
which gives $\theta^{\mathrm{MLE}} = \frac{\sum_{x \in D} x}{|D|}$, simply the portion of heads in our dataset
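A quick check of this closed form (assuming numpy): compute the fraction of heads and compare it against a brute-force grid search over the log-likelihood.

```python
import numpy as np

D = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])

# closed-form MLE: the fraction of heads
theta_mle = D.mean()
print(theta_mle)   # 0.4

# sanity check: the log-likelihood evaluated on a grid peaks at the same value
grid = np.linspace(0.01, 0.99, 99)
ll = np.array([np.sum(D * np.log(t) + (1 - D) * np.log(1 - t)) for t in grid])
print(grid[ll.argmax()])   # ~0.4
```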
Bayesian approach
the max-likelihood estimate does not reflect our uncertainty: e.g., $\theta^* = .2$ for both 1/5 heads and 1000/5000 heads
in the Bayesian approach we maintain a distribution over parameters: the prior $p(\theta)$
after observing $D$ we update this distribution to the posterior $p(\theta \mid D)$ using Bayes rule
how to do this update?
$p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{p(D)}$
where $p(\theta)$ is the prior, $p(D \mid \theta)$ is the likelihood of the data (previously denoted by $L(\theta; D)$),
and the evidence $p(D) = \int p(\theta)\, p(D \mid \theta)\, \mathrm{d}\theta$ is a normalization
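Bayes rule can also be carried out numerically, without any conjugacy. Below is a rough sketch (assuming numpy, a uniform prior, and the 4-heads/6-tails counts from the running example) that discretizes $\theta$ and approximates the evidence integral with a Riemann sum; the next slides give the closed form instead.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)           # discretized parameter values
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                      # uniform prior p(theta)
N_h, N_t = 4, 6                                  # counts from the example data
likelihood = theta ** N_h * (1 - theta) ** N_t   # p(D | theta)

unnormalized = prior * likelihood
evidence = unnormalized.sum() * dtheta           # p(D), approximated by a Riemann sum
posterior = unnormalized / evidence              # p(theta | D)

print(posterior.sum() * dtheta)                  # ~1.0: it integrates to one
print(theta[posterior.argmax()])                 # ~0.4: the posterior mode
```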
Conjugate Priors
in our running example, we know the form of the likelihood:
$p(D \mid \theta) = \prod_{x \in D} \mathrm{Bernoulli}(x; \theta) = \theta^{N_h} (1 - \theta)^{N_t}$
what should we use for the prior $p(\theta)$ and the posterior $p(\theta \mid D)$?
we want the prior and posterior to have the same form (so that we can easily update our belief with new observations)
this gives us the following form: $p(\theta \mid a, b) \propto \theta^a (1 - \theta)^b$
this means there is a normalization constant that does not depend on $\theta$
a distribution of this form has a name, the Beta distribution
we say the Beta distribution is a conjugate prior to the Bernoulli likelihood
Beta distribution
the Beta distribution has the following density:
$\mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$, with $\alpha, \beta > 0$
the factor $\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}$ is the normalization; $\Gamma$ is the generalization of the factorial to real numbers, with $\Gamma(a + 1) = a\,\Gamma(a)$
$\mathrm{Beta}(\theta \mid \alpha = \beta = 1)$ is uniform
the mean of the distribution is $\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}$
for $\alpha, \beta > 1$ the distribution is unimodal; its mode is $\frac{\alpha - 1}{\alpha + \beta - 2}$
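These properties are easy to verify with scipy's Beta implementation; a small sketch (the shape parameters 3 and 7 are arbitrary example values, not from the slides):

```python
import numpy as np
from scipy.stats import beta

a, b = 3.0, 7.0                       # arbitrary example shape parameters
dist = beta(a, b)

print(dist.mean())                    # alpha / (alpha + beta) = 0.3
print((a - 1) / (a + b - 2))          # mode = 0.25, valid since alpha, beta > 1
print(dist.pdf(0.25))                 # density is highest around the mode

# Beta(1, 1) is the uniform distribution on [0, 1]
print(beta(1, 1).pdf(np.array([0.1, 0.5, 0.9])))   # all ones
```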
Beta-Bernoulli conjugate pair
prior: $p(\theta) = \mathrm{Beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$
likelihood: $p(D \mid \theta) = \theta^{N_h} (1 - \theta)^{N_t}$ (the product of Bernoulli likelihoods, equivalent to a Binomial likelihood)
posterior: $p(\theta \mid D) = \mathrm{Beta}(\theta \mid \alpha + N_h, \beta + N_t) \propto \theta^{\alpha + N_h - 1} (1 - \theta)^{\beta + N_t - 1}$
$\alpha, \beta$ are called pseudo-counts: their effect is similar to imaginary observations of heads ($\alpha$) and tails ($\beta$)
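The conjugate update is a one-liner: add the observed counts to the pseudo-counts. A sketch assuming scipy (the prior pseudo-counts 2 and 2 are chosen for illustration):

```python
from scipy.stats import beta

alpha, beta_prior = 2.0, 2.0     # assumed prior pseudo-counts
N_h, N_t = 4, 6                  # observed heads and tails from the example data

# conjugacy: the posterior is again a Beta distribution
posterior = beta(alpha + N_h, beta_prior + N_t)
print(posterior.mean())          # (alpha + N_h) / (alpha + beta + N_h + N_t) = 6/14
```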
Effect of more data
with few observations, the prior has a high influence
as we increase the number of observations $N = |D|$, the effect of the prior diminishes and the likelihood term dominates the posterior
example: prior $p(\theta; 5, 5)$; the posterior density with $H$ heads in $N$ observations is $p(\theta \mid D) \propto \theta^{5 + H} (1 - \theta)^{5 + N - H}$
[figure: plots of this posterior density as the number of observations grows]
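A sketch of this effect (assuming scipy, a Beta prior with pseudo-counts 5 and 5, and a made-up "true" head probability of 0.3): the posterior mean starts near the prior mean of 0.5 and moves toward 0.3 as the dataset grows.

```python
from scipy.stats import beta

alpha, beta_prior = 5.0, 5.0     # prior pseudo-counts, matching the slide's example
true_theta = 0.3                 # assumed true head probability, for illustration only

for N in [10, 100, 1000, 10000]:
    H = int(true_theta * N)                         # heads observed so far
    post = beta(alpha + H, beta_prior + (N - H))    # posterior after N flips
    print(N, round(post.mean(), 4))                 # drifts from ~0.5 toward 0.3
```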
Posterior predictive
our goal was to estimate the parameters $\theta$ so that we can make predictions $p(x \mid \theta)$
but now we have a (posterior) distribution over parameters $p(\theta \mid D)$
rather than using a single parameter in $p(x \mid \theta)$, we need to calculate the average prediction, the posterior predictive:
$p(x \mid D) = \int_\theta p(\theta \mid D)\, p(x \mid \theta)\, \mathrm{d}\theta$
for each possible $\theta$, weight the prediction by the posterior probability of that parameter being true
Posterior predictive for Beta-Bernoulli
start from a Beta prior $p(\theta) = \mathrm{Beta}(\theta \mid \alpha, \beta)$
observe $N_h$ heads and $N_t$ tails; the posterior is $p(\theta \mid D) = \mathrm{Beta}(\theta \mid \alpha + N_h, \beta + N_t)$
what is the probability that the next coin flip is heads?
$p(x = 1 \mid D) = \int_\theta \mathrm{Bernoulli}(x = 1 \mid \theta)\, \mathrm{Beta}(\theta \mid \alpha + N_h, \beta + N_t)\, \mathrm{d}\theta = \int_\theta \theta\, \mathrm{Beta}(\theta \mid \alpha + N_h, \beta + N_t)\, \mathrm{d}\theta = \frac{\alpha + N_h}{\alpha + \beta + N}$
(this is the mean of the Beta posterior)
compare with the prediction of maximum likelihood: $p(x = 1 \mid D) = \frac{N_h}{N}$
if we assume a uniform prior, the posterior predictive is $p(x = 1 \mid D) = \frac{N_h + 1}{N + 2}$ (Laplace smoothing)
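A sketch of these predictions side by side (plain Python, no external dependencies):

```python
# posterior predictive for the Beta-Bernoulli pair, in closed form
def posterior_predictive(N_h, N_t, alpha=1.0, beta=1.0):
    """P(next flip is heads | D) = (alpha + N_h) / (alpha + beta + N_h + N_t)."""
    return (alpha + N_h) / (alpha + beta + N_h + N_t)

# uniform prior (alpha = beta = 1) gives Laplace smoothing: (N_h + 1) / (N + 2)
print(posterior_predictive(4, 6))   # 5/12 ~ 0.417
print(4 / 10)                       # compare: the max-likelihood prediction N_h / N
print(posterior_predictive(0, 5))   # 1/7 ~ 0.143, never exactly 0 as the MLE would be
```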
Strength of the prior
with a strong prior we need many samples to really change the posterior
for the Beta distribution, $\alpha + \beta$ decides how strong the prior is
example: as our dataset grows our estimate becomes more accurate; priors with different $\frac{\alpha}{\alpha + \beta}$ and $\alpha + \beta$ mean different prior strengths
[figure: posterior predictive $p(x = 1 \mid D)$ as a function of $N$ for different priors, converging to the true value; example from Koller & Friedman]
Maximum a Posteriori (MAP)
sometimes it is difficult to work with the posterior distribution over parameters $p(\theta \mid D)$
alternative: use the parameter with the highest posterior probability, the MAP estimate
$\theta^{\mathrm{MAP}} = \arg\max_\theta p(\theta \mid D) = \arg\max_\theta p(\theta)\, p(D \mid \theta)$
compare with the max-likelihood estimate $\theta^{\mathrm{MLE}} = \arg\max_\theta p(D \mid \theta)$ (the only difference is the prior term)
example: for the posterior $p(\theta \mid D) = \mathrm{Beta}(\theta \mid \alpha + N_h, \beta + N_t)$, the MAP estimate is the mode of the posterior:
$\theta^{\mathrm{MAP}} = \frac{\alpha + N_h - 1}{\alpha + \beta + N_h + N_t - 2}$
compare with the MLE $\theta^{\mathrm{MLE}} = \frac{N_h}{N_h + N_t}$; they are equal for the uniform prior $\alpha = \beta = 1$
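A sketch comparing the two point estimates on the running example (plain Python; the pseudo-counts of 5 and 5 in the last line are an assumed illustration):

```python
# MAP vs MLE point estimates for the Beta-Bernoulli model
def theta_mle(N_h, N_t):
    return N_h / (N_h + N_t)

def theta_map(N_h, N_t, alpha, beta):
    """Mode of the Beta(alpha + N_h, beta + N_t) posterior (assumes alpha + N_h > 1 and beta + N_t > 1)."""
    return (alpha + N_h - 1) / (alpha + beta + N_h + N_t - 2)

print(theta_mle(4, 6))          # 0.4
print(theta_map(4, 6, 1, 1))    # 0.4: with a uniform prior, MAP equals the MLE
print(theta_map(4, 6, 5, 5))    # ~0.444: pseudo-counts of 5 and 5 pull the estimate toward 0.5
```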
Categorical distribution
what if we have more than two categories (e.g., a loaded die instead of a coin)?
instead of Bernoulli we have the multinoulli or categorical distribution with $K$ categories:
$\mathrm{Cat}(x \mid \theta) = \prod_{k=1}^K \theta_k^{\mathbb{I}(x = k)}$
where $\theta$ belongs to the probability simplex: $\sum_k \theta_k = 1$
e.g., for $K = 3$: $\theta_1 + \theta_2 + \theta_3 = 1$
Maximum likelihood for categorical dist.
likelihood: $p(D \mid \theta) = \prod_{x \in D} \mathrm{Cat}(x \mid \theta)$
log-likelihood: $\ell(\theta; D) = \sum_{x \in D} \sum_k \mathbb{I}(x = k) \log \theta_k$
we need to solve $\frac{\partial}{\partial \theta_k} \ell(\theta; D) = 0$ subject to $\sum_k \theta_k = 1$
similar to the binary case, the max-likelihood estimate is given by the data frequencies: $\theta_k^{\mathrm{MLE}} = \frac{N_k}{N}$
example: for a categorical distribution with $K = 8$, the observed frequencies are the max-likelihood parameter estimates, e.g. $\theta_5^{\mathrm{MLE}} = .149$
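A sketch of this frequency estimate on synthetic data (assuming numpy; the loaded-die probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])   # an assumed loaded die, K = 6
K = len(true_theta)
data = rng.choice(K, size=1000, p=true_theta)            # samples take values in {0, ..., K-1}

# max-likelihood estimate: the per-category frequencies N_k / N
counts = np.bincount(data, minlength=K)
theta_mle = counts / counts.sum()
print(theta_mle)         # close to true_theta
print(theta_mle.sum())   # 1.0: a point on the probability simplex
```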
Dirichlet distribution (optional)
the Dirichlet is a distribution over the parameters $\theta$ of a categorical distribution; it is a generalization of the Beta distribution to $K$ categories
it should be a distribution over the probability simplex $\sum_k \theta_k = 1$:
$\mathrm{Dir}(\theta \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
where $\frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}$ is the normalization constant and $\alpha$ is a vector of pseudo-counts for the $K$ categories (aka concentration parameters), with $\alpha_k > 0 \;\; \forall k$
for $\alpha = [1, \ldots, 1]$ we get the uniform distribution; for $K = 2$ it reduces to the Beta distribution
example: $\mathrm{Dir}(\theta \mid [.2, .2, .2])$ for $K = 3$
Dirichlet-Categorical conjugate pair (optional)
the Dirichlet distribution $\mathrm{Dir}(\theta \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$ is a conjugate prior for the categorical distribution $\mathrm{Cat}(x \mid \theta) = \prod_k \theta_k^{\mathbb{I}(x = k)}$
prior: $p(\theta) = \mathrm{Dir}(\theta \mid \alpha) \propto \prod_k \theta_k^{\alpha_k - 1}$
likelihood: we observe $N_1, \ldots, N_K$ values from each category, so $p(D \mid \theta) = \prod_k \theta_k^{N_k}$
posterior: $p(\theta \mid D) = \mathrm{Dir}(\theta \mid \alpha + N) \propto \prod_k \theta_k^{\alpha_k + N_k - 1}$; again, we add the real counts to the pseudo-counts
posterior predictive: $p(x = k \mid D) = \frac{\alpha_k + N_k}{\sum_{k'} (\alpha_{k'} + N_{k'})}$
MAP: $\theta_k^{\mathrm{MAP}} = \frac{\alpha_k + N_k - 1}{\sum_{k'} (\alpha_{k'} + N_{k'}) - K}$
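A sketch of the full Dirichlet-categorical update (assuming numpy; the counts and the uniform prior are example values, not from the slides):

```python
import numpy as np

alpha = np.ones(6)                      # uniform Dirichlet prior: pseudo-counts of 1 for a 6-sided die
counts = np.array([3, 1, 0, 2, 4, 10])  # observed N_k for each category (example numbers)

posterior_alpha = alpha + counts        # posterior is Dir(theta | alpha + N)

# posterior predictive: (alpha_k + N_k) / sum_k' (alpha_k' + N_k')
pred = posterior_alpha / posterior_alpha.sum()
print(pred)          # the category with zero observations still gets nonzero probability

# MAP estimate: the mode of the Dirichlet posterior
K = len(alpha)
theta_map = (posterior_alpha - 1) / (posterior_alpha.sum() - K)
print(theta_map)     # here it equals the raw frequencies N_k / N, since the prior is uniform
```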