15-388/688 - Practical Data Science: Maximum likelihood estimation, naïve Bayes
J. Zico Kolter
Carnegie Mellon University
Spring 2018
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Estimating the parameters of distributions
- We're moving now from probability to statistics.
- The basic question: given some data $x^{(1)}, \ldots, x^{(m)}$, how do I find a distribution that captures this data "well"?
- In general (if we can pick from the space of all distributions) this is a hard question, but if we pick from a particular parameterized family of distributions $p(X; \theta)$, the question is (at least a little bit) easier.
- The question becomes: how do I find parameters $\theta$ of this distribution that fit the data?
Maximum likelihood estimation
- Given a distribution $p(X; \theta)$ and a collection of observed (independent) data points $x^{(1)}, \ldots, x^{(m)}$, the probability of observing this data is simply
  $p(x^{(1)}, \ldots, x^{(m)}; \theta) = \prod_{i=1}^m p(x^{(i)}; \theta)$
- Basic idea of maximum likelihood estimation (MLE): find the parameters that maximize the probability of the observed data
  $\max_\theta \prod_{i=1}^m p(x^{(i)}; \theta) \;\equiv\; \max_\theta \ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta)$
  where $\ell(\theta)$ is called the log likelihood of the data.
- Seems "obvious", but there are many other ways of fitting parameters.
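As a quick sanity check (my own illustration, not from the slides), the sketch below evaluates the log likelihood of some made-up coin-flip data over a grid of parameter values for the Bernoulli family introduced on the next slide, and picks the maximizer; the brute-force grid search stands in for the closed-form solution derived shortly.

```python
import numpy as np

# Made-up binary observations x^(1), ..., x^(m) (hypothetical data, for illustration only)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def log_likelihood(phi, x):
    # l(phi) = sum_i log p(x^(i); phi) for a Bernoulli(phi) model
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# Maximize the log likelihood by brute force over a grid of candidate parameters
grid = np.linspace(0.01, 0.99, 99)
phi_best = max(grid, key=lambda phi: log_likelihood(phi, x))
print(phi_best, x.mean())   # the grid maximizer lands (approximately) on the sample mean
```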
Parameter estimation for Bernoulli
- Simple example: Bernoulli distribution, $p(X = 1; \phi) = \phi$, $p(X = 0; \phi) = 1 - \phi$.
- Given observed data $x^{(1)}, \ldots, x^{(m)}$, the "obvious" answer is
  $\hat\phi = \frac{\sum_{i=1}^m x^{(i)}}{m} = \frac{\#\text{1s}}{\#\text{Total}}$
- But why is this the case? Maybe there are other estimates that are just as good, e.g.
  $\hat\phi = \frac{\sum_{i=1}^m x^{(i)} + 1}{m + 2}$
MLE for Bernoulli
- Maximum likelihood solution for Bernoulli given by
  $\max_\phi \prod_{i=1}^m p(x^{(i)}; \phi) = \max_\phi \prod_{i=1}^m \phi^{x^{(i)}} (1 - \phi)^{1 - x^{(i)}}$
- Taking the log of the optimization objective (equivalently, we could minimize the negative log likelihood, to be consistent with our usual notation of optimization as minimization):
  $\max_\phi \ell(\phi) = \sum_{i=1}^m \left( x^{(i)} \log \phi + (1 - x^{(i)}) \log(1 - \phi) \right)$
- Derivative with respect to $\phi$ is given by
  $\frac{\partial}{\partial \phi} \ell(\phi) = \sum_{i=1}^m \left( \frac{x^{(i)}}{\phi} - \frac{1 - x^{(i)}}{1 - \phi} \right) = \frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi}$
MLE for Bernoulli, continued
- Setting the derivative to zero, and writing $a = \sum_{i=1}^m x^{(i)}$ (the number of 1s) and $b = \sum_{i=1}^m (1 - x^{(i)}) = m - a$ (the number of 0s), gives:
  $\frac{a}{\phi} - \frac{b}{1 - \phi} = 0 \;\Longrightarrow\; (1 - \phi)\, a = \phi\, b \;\Longrightarrow\; \phi = \frac{a}{a + b} = \frac{\sum_{i=1}^m x^{(i)}}{m}$
- So, we have shown that the "natural" estimate of $\phi$ actually corresponds to the maximum likelihood estimate.
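To see the result numerically, here is a small sketch (my own check, not from the slides) that compares the data log likelihood under the closed-form MLE and under the alternative "add-one" estimate mentioned a few slides back; the data values are made up.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # hypothetical binary observations
m = len(x)

phi_mle = x.sum() / m              # closed-form MLE: fraction of 1s
phi_alt = (x.sum() + 1) / (m + 2)  # the alternative "add-one" estimate from earlier

def log_likelihood(phi):
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# By construction the MLE attains at least as high a log likelihood on the observed data
print(log_likelihood(phi_mle), log_likelihood(phi_alt))
print(log_likelihood(phi_mle) >= log_likelihood(phi_alt))   # True
```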
Poll: Bernoulli maximum likelihood
Suppose we observe binary data $x^{(1)}, \ldots, x^{(m)}$ with $x^{(i)} \in \{0, 1\}$, with some $x^{(i)} = 0$ and some $x^{(i)} = 1$, and we compute the Bernoulli MLE
  $\hat\phi = \frac{\sum_{i=1}^m x^{(i)}}{m}$
Which of the following statements is necessarily true? (may be more than one)
1. For any $\phi' \neq \hat\phi$, $p(x^{(i)}; \phi') \leq p(x^{(i)}; \hat\phi)$ for all $i = 1, \ldots, m$
2. For any $\phi' \neq \hat\phi$, $\prod_{i=1}^m p(x^{(i)}; \phi') \leq \prod_{i=1}^m p(x^{(i)}; \hat\phi)$
3. We always have $p(x^{(i)}; \phi') \geq p(x^{(i)}; \hat\phi)$ for at least one $i$
MLE for Gaussian, briefly
- For the Gaussian distribution $p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
- Log likelihood given by:
  $\ell(\mu, \sigma^2) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{\sigma^2}$
- Derivatives (see if you can derive these fully):
  $\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^m (x^{(i)} - \mu) = 0 \;\Longrightarrow\; \mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$
  $\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{m}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^m (x^{(i)} - \mu)^2 = 0 \;\Longrightarrow\; \sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2$
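A minimal sketch (assuming NumPy, with synthetic data) that applies these closed-form Gaussian MLE formulas; note that the MLE variance divides by $m$ rather than the $m - 1$ used by the unbiased sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data with true mu = 2.0, sigma = 1.5

mu_hat = x.mean()                         # MLE: (1/m) * sum_i x^(i)
sigma2_hat = np.mean((x - mu_hat) ** 2)   # MLE: (1/m) * sum_i (x^(i) - mu_hat)^2  (divides by m, not m-1)
print(mu_hat, np.sqrt(sigma2_hat))        # should land near 2.0 and 1.5
```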
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Naive Bayes modeling
- Naive Bayes is a machine learning algorithm that relies heavily on probabilistic modeling.
- But it is also interpretable according to the three ingredients of a machine learning algorithm (hypothesis function, loss, optimization); more on this later.
- The basic idea is that we model the input and output as random variables $X = (X_1, X_2, \ldots, X_n)$ (several Bernoulli, categorical, or Gaussian random variables) and $Y$ (one Bernoulli or categorical random variable); the goal is to find $p(Y \mid X)$.
Naive Bayes assumptions
- We're going to find $p(Y \mid X)$ via Bayes' rule
  $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\sum_y p(X \mid y)\, p(y)}$
- The denominator is just the sum over all values of $Y$ of the distribution specified by the numerator, so we're just going to focus on the $p(X \mid Y)\, p(Y)$ term.
- Modeling the full distribution $p(X \mid Y)$ for high-dimensional $X$ is not practical, so we're going to make the naive Bayes assumption: the elements $X_i$ are conditionally independent given $Y$,
  $p(X \mid Y) = \prod_{i=1}^n p(X_i \mid Y)$
Modeling individual distributions
- We're going to explicitly model the distribution of each $p(X_i \mid Y)$ as well as $p(Y)$.
- We do this by specifying a distribution for $p(Y)$ and a separate distribution for each $p(X_i \mid Y = y)$.
- So assuming, for instance, that the $X_i$ and $Y$ are binary (Bernoulli random variables), we would represent the distributions
  $p(Y; \phi_0), \quad p(X_i \mid Y = 0; \phi_i^0), \quad p(X_i \mid Y = 1; \phi_i^1)$
- We then estimate the parameters of these distributions using MLE, i.e., given training examples $(x^{(j)}, y^{(j)})$, $j = 1, \ldots, m$,
  $\phi_0 = \frac{\sum_{j=1}^m y^{(j)}}{m}, \quad \phi_i^y = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, x_i^{(j)}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$
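As an illustration (a sketch with made-up data, not the slide's example), the MLE formulas above amount to per-class counting: `phi0` estimates $p(Y = 1)$ and `phi[c][i]` estimates $p(X_i = 1 \mid Y = c)$.

```python
import numpy as np

# Hypothetical binary training data: rows are examples x^(j), columns are features X_1..X_n
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([1, 0, 1, 0, 1])

phi0 = y.mean()                      # MLE of p(Y = 1): fraction of examples with y = 1
phi = {c: X[y == c].mean(axis=0)     # MLE of p(X_i = 1 | Y = c): per-feature fraction of 1s in class c
       for c in (0, 1)}
print(phi0, phi[0], phi[1])
```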
Making predictions
- Given some new data point $x$, we can now compute the probability of each class; for the binary Bernoulli model above,
  $p(Y = y \mid x) \propto p(Y = y) \prod_{i=1}^n p(x_i \mid Y = y) = \phi_0^{\,y} (1 - \phi_0)^{1 - y} \prod_{i=1}^n (\phi_i^y)^{x_i} (1 - \phi_i^y)^{1 - x_i}$
- After you have computed the right hand side for each class, just normalize (divide by the sum over all $y$) to get the desired probability.
- Alternatively, if you just want to know the most likely $y$, just compute each right hand side and take the maximum.
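Continuing the sketch above (still hypothetical code, not the course's reference implementation), prediction multiplies the class prior by the per-feature Bernoulli terms and then normalizes over the two classes. Note that with such a tiny toy dataset some fitted probabilities come out exactly 0 or 1, which is precisely the issue Laplace smoothing addresses on a later slide.

```python
import numpy as np

def predict_proba(x_new, phi0, phi):
    # Unnormalized p(Y = c) * prod_i p(x_i | Y = c) for a binary Bernoulli naive Bayes model
    prior = {1: phi0, 0: 1 - phi0}
    unnorm = {c: prior[c] * np.prod(phi[c] ** x_new * (1 - phi[c]) ** (1 - x_new))
              for c in (0, 1)}
    total = unnorm[0] + unnorm[1]
    return {c: unnorm[c] / total for c in (0, 1)}   # normalize: divide by the sum over classes

# Using phi0 and phi fit in the previous sketch:
# print(predict_proba(np.array([1, 0, 1]), phi0, phi))
```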
Example
- A small training set of binary features $X_1, X_2$ and labels $Y$ (table of observations shown on the slide), from which we estimate by MLE:
  $p(Y = 1) = \phi_0, \quad p(X_1 = 1 \mid Y = 0) = \phi_1^0, \quad p(X_1 = 1 \mid Y = 1) = \phi_1^1, \quad p(X_2 = 1 \mid Y = 0) = \phi_2^0, \quad p(X_2 = 1 \mid Y = 1) = \phi_2^1$
- Then, for a new example with $X_1 = 1$, $X_2 = 0$, and unknown $Y$, compute $p(Y \mid X_1 = 1, X_2 = 0)$.
Potential issues
- Problem #1: when computing the probability, the product $p(y) \prod_{i=1}^n p(x_i \mid y)$ quickly underflows to zero within numerical precision.
  Solution: compute the log of the probabilities instead,
  $\log p(y) + \sum_{i=1}^n \log p(x_i \mid y)$
- Problem #2: if we have never seen either $X_i = 1$ or $X_i = 0$ for a given class $y$, then the corresponding probabilities computed by MLE will be zero.
  Solution: Laplace smoothing, i.e., "hallucinate" one $X_i = 0$ and one $X_i = 1$ observation for each class,
  $\phi_i^y = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, x_i^{(j)} + 1}{\sum_{j=1}^m 1\{y^{(j)} = y\} + 2}$
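A small sketch of both fixes (hypothetical helper names, assuming NumPy): Laplace-smoothed parameter estimates and log-space scoring.

```python
import numpy as np

def fit_smoothed(X, y, c):
    # Laplace-smoothed estimate of p(X_i = 1 | Y = c): one extra hallucinated 0 and 1 per class
    mask = (y == c)
    return (X[mask].sum(axis=0) + 1) / (mask.sum() + 2)

def log_score(x_new, log_prior_c, phi_c):
    # log p(y = c) + sum_i log p(x_i | y = c): sums of logs do not underflow the way long products do
    return log_prior_c + np.sum(x_new * np.log(phi_c) + (1 - x_new) * np.log(1 - phi_c))
```

To recover normalized class probabilities from the log scores, subtract the maximum score before exponentiating (the usual log-sum-exp trick).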
Other distributions
- Though naive Bayes is often presented as "just" counting, the value of the maximum likelihood interpretation is that it's clear how to model $p(X_i \mid Y)$ for non-categorical random variables.
- Example: if $x_i$ is real-valued, we can model $p(X_i \mid Y = y)$ as a Gaussian
  $p(x_i \mid y; \mu_{iy}, \sigma_{iy}^2) = \mathcal{N}(x_i; \mu_{iy}, \sigma_{iy}^2)$
  with maximum likelihood estimates
  $\mu_{iy} = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, x_i^{(j)}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}, \quad \sigma_{iy}^2 = \frac{\sum_{j=1}^m 1\{y^{(j)} = y\}\, (x_i^{(j)} - \mu_{iy})^2}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$
- All probability computations are exactly the same as before (it doesn't matter that some of the terms are probability densities).
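A sketch of the Gaussian per-feature model (again hypothetical helper names, assuming NumPy): fit $\mu_{iy}, \sigma_{iy}^2$ by the formulas above, then use the log density in place of $\log p(x_i \mid y)$ in the same scoring computation as before.

```python
import numpy as np

def fit_gaussian_feature(x_col, y, c):
    # MLE of a Gaussian p(X_i | Y = c) for one real-valued feature column
    vals = x_col[y == c]
    mu = vals.mean()
    sigma2 = np.mean((vals - mu) ** 2)
    return mu, sigma2

def log_gaussian_density(x, mu, sigma2):
    # log N(x; mu, sigma2), substituted for log p(x_i | y) when scoring a new example
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (x - mu) ** 2 / sigma2
```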
Outline
- Maximum likelihood estimation
- Naive Bayes
- Machine learning and maximum likelihood
Machine learning via maximum likelihood
- Many machine learning algorithms (specifically the loss function component) can be interpreted probabilistically, as maximum likelihood estimation.
- Recall logistic regression:
  $\min_\theta \sum_{i=1}^m \ell_{\mathrm{logistic}}(h_\theta(x^{(i)}), y^{(i)}), \quad \ell_{\mathrm{logistic}}(h_\theta(x), y) = \log\left(1 + \exp(-y \cdot h_\theta(x))\right)$
Logistic probability model
- Consider the model (where $y$ is binary, taking on $-1/+1$ values)
  $p(y \mid x; \theta) = \mathrm{logistic}(y \cdot h_\theta(x)) = \frac{1}{1 + \exp(-y \cdot h_\theta(x))}$
- Under this model, the maximum likelihood estimate is
  $\max_\theta \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \;\equiv\; \min_\theta \sum_{i=1}^m \ell_{\mathrm{logistic}}(h_\theta(x^{(i)}), y^{(i)})$
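To verify the equivalence numerically, this sketch (my own check, not from the slides) evaluates the logistic loss and the negative log likelihood of the model above at the same point; the two coincide, so minimizing one minimizes the other.

```python
import numpy as np

def logistic_loss(h, y):
    # l_logistic(h_theta(x), y) = log(1 + exp(-y * h_theta(x)))
    return np.log1p(np.exp(-y * h))

def neg_log_likelihood(h, y):
    # -log p(y | x; theta) under p(y | x; theta) = 1 / (1 + exp(-y * h_theta(x)))
    return -np.log(1.0 / (1.0 + np.exp(-y * h)))

h, y = 0.7, -1
print(logistic_loss(h, y), neg_log_likelihood(h, y))              # identical values
print(np.isclose(logistic_loss(h, y), neg_log_likelihood(h, y)))  # True: same objective
```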