15-388/688 - Practical Data Science: Maximum likelihood estimation, naïve Bayes
J. Zico Kolter
Carnegie Mellon University
Spring 2018
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Estimating the parameters of distributions
- We're moving now from probability to statistics.
- The basic question: given some data $x^{(1)}, \ldots, x^{(m)}$, how do I find a distribution that captures this data "well"?
- In general (if we can pick from the space of all distributions) this is a hard question, but if we pick from a particular parameterized family of distributions $p(X; \theta)$, the question is (at least a little bit) easier.
- The question becomes: how do I find parameters $\theta$ of this distribution that fit the data?
Maximum likelihood estimation
- Given a distribution $p(X; \theta)$ and a collection of observed (independent) data points $x^{(1)}, \ldots, x^{(m)}$, the probability of observing this data is simply
  $p(x^{(1)}, \ldots, x^{(m)}; \theta) = \prod_{i=1}^m p(x^{(i)}; \theta)$
- Basic idea of maximum likelihood estimation (MLE): find the parameters that maximize the probability of the observed data
  $\max_\theta \prod_{i=1}^m p(x^{(i)}; \theta) \;\equiv\; \max_\theta \ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta)$
  where $\ell(\theta)$ is called the log likelihood of the data.
- Seems "obvious", but there are many other ways of fitting parameters.
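As a quick sanity check (my own illustration, not from the slides), the sketch below evaluates the log likelihood of some made-up coin-flip data over a grid of parameter values for the Bernoulli family introduced on the next slide, and picks the maximizer; the brute-force grid search stands in for the closed-form solution derived shortly.

```python
import numpy as np

# Made-up binary observations x^(1), ..., x^(m) (hypothetical data, for illustration only)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def log_likelihood(phi, x):
    # l(phi) = sum_i log p(x^(i); phi) for a Bernoulli(phi) model
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# Maximize the log likelihood by brute force over a grid of candidate parameters
grid = np.linspace(0.01, 0.99, 99)
phi_best = max(grid, key=lambda phi: log_likelihood(phi, x))
print(phi_best, x.mean())   # the grid maximizer lands (approximately) on the sample mean
```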
Parameter estimation for Bernoulli
- Simple example: Bernoulli distribution, $p(X = 1; \phi) = \phi$, $p(X = 0; \phi) = 1 - \phi$.
- Given observed data $x^{(1)}, \ldots, x^{(m)}$, the "obvious" answer is
  $\hat\phi = \frac{\sum_{i=1}^m x^{(i)}}{m} = \frac{\#\text{1s}}{\#\text{Total}}$
- But why is this the case? Maybe there are other estimates that are just as good, e.g.
  $\hat\phi = \frac{\sum_{i=1}^m x^{(i)} + 1}{m + 2}$
MLE for Bernoulli
- Maximum likelihood solution for Bernoulli given by
  $\max_\phi \prod_{i=1}^m p(x^{(i)}; \phi) = \max_\phi \prod_{i=1}^m \phi^{x^{(i)}} (1 - \phi)^{1 - x^{(i)}}$
- Taking the log of the optimization objective (equivalently, we could minimize the negative log likelihood, to be consistent with our usual notation of optimization as minimization):
  $\max_\phi \ell(\phi) = \sum_{i=1}^m \left( x^{(i)} \log \phi + (1 - x^{(i)}) \log(1 - \phi) \right)$
- Derivative with respect to $\phi$ is given by
  $\frac{\partial}{\partial \phi} \ell(\phi) = \sum_{i=1}^m \left( \frac{x^{(i)}}{\phi} - \frac{1 - x^{(i)}}{1 - \phi} \right) = \frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi}$
MLE for Bernoulli, continued
- Setting the derivative to zero, and writing $a = \sum_{i=1}^m x^{(i)}$ (the number of 1s) and $b = \sum_{i=1}^m (1 - x^{(i)}) = m - a$ (the number of 0s), gives:
  $\frac{a}{\phi} - \frac{b}{1 - \phi} = 0 \;\Longrightarrow\; (1 - \phi)\, a = \phi\, b \;\Longrightarrow\; \phi = \frac{a}{a + b} = \frac{\sum_{i=1}^m x^{(i)}}{m}$
- So, we have shown that the "natural" estimate of $\phi$ actually corresponds to the maximum likelihood estimate.
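To see the result numerically, here is a small sketch (my own check, not from the slides) that compares the data log likelihood under the closed-form MLE and under the alternative "add-one" estimate mentioned a few slides back; the data values are made up.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # hypothetical binary observations
m = len(x)

phi_mle = x.sum() / m              # closed-form MLE: fraction of 1s
phi_alt = (x.sum() + 1) / (m + 2)  # the alternative "add-one" estimate from earlier

def log_likelihood(phi):
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# By construction the MLE attains at least as high a log likelihood on the observed data
print(log_likelihood(phi_mle), log_likelihood(phi_alt))
print(log_likelihood(phi_mle) >= log_likelihood(phi_alt))   # True
```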
Poll: Bernoulli maximum likelihood
Suppose we observe binary data $x^{(1)}, \ldots, x^{(m)}$ with $x^{(i)} \in \{0, 1\}$, with some $x^{(i)} = 0$ and some $x^{(i)} = 1$, and we compute the Bernoulli MLE
  $\hat\phi = \frac{\sum_{i=1}^m x^{(i)}}{m}$
Which of the following statements is necessarily true? (may be more than one)
1. For any $\phi' \neq \hat\phi$, $p(x^{(i)}; \phi') \leq p(x^{(i)}; \hat\phi)$ for all $i = 1, \ldots, m$
2. For any $\phi' \neq \hat\phi$, $\prod_{i=1}^m p(x^{(i)}; \phi') \leq \prod_{i=1}^m p(x^{(i)}; \hat\phi)$
3. We always have $p(x^{(i)}; \phi') \geq p(x^{(i)}; \hat\phi)$ for at least one $i$
MLE for Gaussian, briefly
- For the Gaussian distribution $p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
- Log likelihood given by:
  $\ell(\mu, \sigma^2) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{\sigma^2}$
- Derivatives (see if you can derive these fully):
  $\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^m (x^{(i)} - \mu) = 0 \;\Longrightarrow\; \mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$
  $\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{m}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^m (x^{(i)} - \mu)^2 = 0 \;\Longrightarrow\; \sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2$
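A minimal sketch (assuming NumPy, with synthetic data) that applies these closed-form Gaussian MLE formulas; note that the MLE variance divides by $m$ rather than the $m - 1$ used by the unbiased sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data with true mu = 2.0, sigma = 1.5

mu_hat = x.mean()                         # MLE: (1/m) * sum_i x^(i)
sigma2_hat = np.mean((x - mu_hat) ** 2)   # MLE: (1/m) * sum_i (x^(i) - mu_hat)^2  (divides by m, not m-1)
print(mu_hat, np.sqrt(sigma2_hat))        # should land near 2.0 and 1.5
```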
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Naive Bayes modeling
- Naive Bayes is a machine learning algorithm that relies heavily on probabilistic modeling.
- But it is also interpretable according to the three ingredients of a machine learning algorithm (hypothesis function, loss, optimization); more on this later.
- The basic idea is that we model the input and output as random variables $X = (X_1, X_2, \ldots, X_n)$ (several Bernoulli, categorical, or Gaussian random variables) and $Y$ (one Bernoulli or categorical random variable); the goal is to find $p(Y \mid X)$.
Naive Bayes assumptions
- We're going to find $p(Y \mid X)$ via Bayes' rule
  $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\sum_y p(X \mid y)\, p(y)}$
- The denominator is just the sum over all values of $Y$ of the distribution specified by the numerator, so we're just going to focus on the $p(X \mid Y)\, p(Y)$ term.
- Modeling the full distribution $p(X \mid Y)$ for high-dimensional $X$ is not practical, so we're going to make the naive Bayes assumption: the elements $X_i$ are conditionally independent given $Y$,
  $p(X \mid Y) = \prod_{i=1}^n p(X_i \mid Y)$
Modeling individual distributions
- We're going to explicitly model the distribution of each $p(X_i \mid Y)$ as well as $p(Y)$.
- We do this by specifying a distribution for $p(Y)$ and a separate distribution for each $p(X_i \mid Y = y)$.
- So assuming, for instance, that the $X_i$ and $Y$ are binary (Bernoulli random variables), we would represent the distributions
  $p(Y; \phi_0), \quad p(X_i \mid Y = 0; \phi_i^0), \quad p(X_i \mid Y = 1; \phi_i^1)$
- We then estimate the parameters of these distributions using MLE, i.e., given training examples $(x^{(j)}, y^{(j)})$, $j = 1, \ldots, m$,
  $\phi_0 = \frac{\sum_{j=1}^m y^{(j)}}{m}, \quad \phi_i^y = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, x_i^{(j)}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$
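As an illustration (a sketch with made-up data, not the slide's example), the MLE formulas above amount to per-class counting: `phi0` estimates $p(Y = 1)$ and `phi[c][i]` estimates $p(X_i = 1 \mid Y = c)$.

```python
import numpy as np

# Hypothetical binary training data: rows are examples x^(j), columns are features X_1..X_n
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([1, 0, 1, 0, 1])

phi0 = y.mean()                      # MLE of p(Y = 1): fraction of examples with y = 1
phi = {c: X[y == c].mean(axis=0)     # MLE of p(X_i = 1 | Y = c): per-feature fraction of 1s in class c
       for c in (0, 1)}
print(phi0, phi[0], phi[1])
```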
Making predictions
- Given some new data point $x$, we can now compute the probability of each class; for the binary Bernoulli model above,
  $p(Y = y \mid x) \propto p(Y = y) \prod_{i=1}^n p(x_i \mid Y = y) = \phi_0^{\,y} (1 - \phi_0)^{1 - y} \prod_{i=1}^n (\phi_i^y)^{x_i} (1 - \phi_i^y)^{1 - x_i}$
- After you have computed the right hand side for each class, just normalize (divide by the sum over all $y$) to get the desired probability.
- Alternatively, if you just want to know the most likely $y$, just compute each right hand side and take the maximum.
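Continuing the sketch above (still hypothetical code, not the course's reference implementation), prediction multiplies the class prior by the per-feature Bernoulli terms and then normalizes over the two classes. Note that with such a tiny toy dataset some fitted probabilities come out exactly 0 or 1, which is precisely the issue Laplace smoothing addresses on a later slide.

```python
import numpy as np

def predict_proba(x_new, phi0, phi):
    # Unnormalized p(Y = c) * prod_i p(x_i | Y = c) for a binary Bernoulli naive Bayes model
    prior = {1: phi0, 0: 1 - phi0}
    unnorm = {c: prior[c] * np.prod(phi[c] ** x_new * (1 - phi[c]) ** (1 - x_new))
              for c in (0, 1)}
    total = unnorm[0] + unnorm[1]
    return {c: unnorm[c] / total for c in (0, 1)}   # normalize: divide by the sum over classes

# Using phi0 and phi fit in the previous sketch:
# print(predict_proba(np.array([1, 0, 1]), phi0, phi))
```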
Example
- A small training set of binary features $X_1, X_2$ and labels $Y$ (table of observations shown on the slide), from which we estimate by MLE:
  $p(Y = 1) = \phi_0, \quad p(X_1 = 1 \mid Y = 0) = \phi_1^0, \quad p(X_1 = 1 \mid Y = 1) = \phi_1^1, \quad p(X_2 = 1 \mid Y = 0) = \phi_2^0, \quad p(X_2 = 1 \mid Y = 1) = \phi_2^1$
- Then, for a new example with $X_1 = 1$, $X_2 = 0$, and unknown $Y$, compute $p(Y \mid X_1 = 1, X_2 = 0)$.
Potential issues
- Problem #1: when computing the probability, the product $p(y) \prod_{i=1}^n p(x_i \mid y)$ quickly underflows to zero within numerical precision.
  Solution: compute the log of the probabilities instead,
  $\log p(y) + \sum_{i=1}^n \log p(x_i \mid y)$
- Problem #2: if we have never seen either $X_i = 1$ or $X_i = 0$ for a given class $y$, then the corresponding probabilities computed by MLE will be zero.
  Solution: Laplace smoothing, i.e., "hallucinate" one $X_i = 0$ and one $X_i = 1$ observation for each class,
  $\phi_i^y = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, x_i^{(j)} + 1}{\sum_{j=1}^m 1\{y^{(j)} = y\} + 2}$
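A small sketch of both fixes (hypothetical helper names, assuming NumPy): Laplace-smoothed parameter estimates and log-space scoring.

```python
import numpy as np

def fit_smoothed(X, y, c):
    # Laplace-smoothed estimate of p(X_i = 1 | Y = c): one extra hallucinated 0 and 1 per class
    mask = (y == c)
    return (X[mask].sum(axis=0) + 1) / (mask.sum() + 2)

def log_score(x_new, log_prior_c, phi_c):
    # log p(y = c) + sum_i log p(x_i | y = c): sums of logs do not underflow the way long products do
    return log_prior_c + np.sum(x_new * np.log(phi_c) + (1 - x_new) * np.log(1 - phi_c))
```

To recover normalized class probabilities from the log scores, subtract the maximum score before exponentiating (the usual log-sum-exp trick).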
Other distributions
- Though naive Bayes is often presented as "just" counting, the value of the maximum likelihood interpretation is that it's clear how to model $p(X_i \mid Y)$ for non-categorical random variables.
- Example: if $x_i$ is real-valued, we can model $p(X_i \mid Y = y)$ as a Gaussian
  $p(x_i \mid y; \mu_{iy}, \sigma_{iy}^2) = \mathcal{N}(x_i; \mu_{iy}, \sigma_{iy}^2)$
  with maximum likelihood estimates
  $\mu_{iy} = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, x_i^{(j)}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}, \quad \sigma_{iy}^2 = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, (x_i^{(j)} - \mu_{iy})^2}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$
- All probability computations are exactly the same as before (it doesn't matter that some of the terms are probability densities).
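A sketch of the Gaussian per-feature model (again hypothetical helper names, assuming NumPy): fit $\mu_{iy}, \sigma_{iy}^2$ by the formulas above, then use the log density in place of $\log p(x_i \mid y)$ in the same scoring computation as before.

```python
import numpy as np

def fit_gaussian_feature(x_col, y, c):
    # MLE of a Gaussian p(X_i | Y = c) for one real-valued feature column
    vals = x_col[y == c]
    mu = vals.mean()
    sigma2 = np.mean((vals - mu) ** 2)
    return mu, sigma2

def log_gaussian_density(x, mu, sigma2):
    # log N(x; mu, sigma2), substituted for log p(x_i | y) when scoring a new example
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (x - mu) ** 2 / sigma2
```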
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Machine learning via maximum likelihood
- Many machine learning algorithms (specifically the loss function component) can be interpreted probabilistically, as maximum likelihood estimation.
- Recall logistic regression:
  $\min_\theta \sum_{i=1}^m \ell_{\mathrm{logistic}}(h_\theta(x^{(i)}), y^{(i)}), \quad \ell_{\mathrm{logistic}}(h_\theta(x), y) = \log\left(1 + \exp(-y \cdot h_\theta(x))\right)$
Logistic probability model
- Consider the model (where $y$ is binary, taking on $-1/+1$ values)
  $p(y \mid x; \theta) = \mathrm{logistic}(y \cdot h_\theta(x)) = \frac{1}{1 + \exp(-y \cdot h_\theta(x))}$
- Under this model, the maximum likelihood estimate is
  $\max_\theta \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \;\equiv\; \min_\theta \sum_{i=1}^m \ell_{\mathrm{logistic}}(h_\theta(x^{(i)}), y^{(i)})$
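To verify the equivalence numerically, this sketch (my own check, not from the slides) evaluates the logistic loss and the negative log likelihood of the model above at the same point; the two coincide, so minimizing one minimizes the other.

```python
import numpy as np

def logistic_loss(h, y):
    # l_logistic(h_theta(x), y) = log(1 + exp(-y * h_theta(x)))
    return np.log1p(np.exp(-y * h))

def neg_log_likelihood(h, y):
    # -log p(y | x; theta) under p(y | x; theta) = 1 / (1 + exp(-y * h_theta(x)))
    return -np.log(1.0 / (1.0 + np.exp(-y * h)))

h, y = 0.7, -1
print(logistic_loss(h, y), neg_log_likelihood(h, y))              # identical values
print(np.isclose(logistic_loss(h, y), neg_log_likelihood(h, y)))  # True: same objective
```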