A Probabilistic View of Machine Learning (2/2)
CMSC 422
Marine Carpuat (marine@cs.umd.edu)
Some slides based on material by Tom Mitchell
What we know so far…
• Bayes rule
• A probabilistic view of machine learning
  – If we know the data generating distribution, we can define the Bayes optimal classifier
  – Under the iid assumption
• How to estimate a probability distribution from data
  – Maximum likelihood estimation
Today
• How to compute Maximum Likelihood Estimates
  – For Bernoulli and Categorical Distributions
• Naïve Bayes classifier
Maximum Likelihood Estimates
Given a data set D of iid flips, which contains α_1 ones and α_0 zeros:
  P_θ(D) = θ^{α_1} (1 − θ)^{α_0}
  θ_MLE = argmax_θ P_θ(D) = α_1 / (α_1 + α_0)
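A minimal numeric sketch of this estimate (the data and variable names are illustrative, not from the slides):

# Bernoulli MLE: theta_hat = (# ones) / (# ones + # zeros)
flips = [1, 0, 1, 1, 0, 1]            # hypothetical iid flips
alpha_1 = sum(flips)                  # number of ones
alpha_0 = len(flips) - alpha_1        # number of zeros
theta_mle = alpha_1 / (alpha_1 + alpha_0)
print(theta_mle)                      # 0.666...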
Maximum Likelihood Estimates
Given a data set D of iid rolls of a K-sided die, which contains x_k outcomes of side k for each k:
  ∀k, P(X = k) = θ_k   (Categorical Distribution)
  P_θ(D) = ∏_{k=1}^{K} θ_k^{x_k}
  θ_MLE = argmax_θ P_θ(D)
        = argmax_θ log P_θ(D)
        = argmax_θ Σ_{k=1}^{K} x_k log(θ_k)
Problem: this objective lacks constraints!
Maximum Likelihood Estimates
A constrained optimization problem (K-sided die: ∀k, P(X = k) = θ_k):
  θ_MLE = argmax_θ Σ_{k=1}^{K} x_k log(θ_k)   with   Σ_{k=1}^{K} θ_k = 1
How to solve it? Use Lagrange multipliers to turn it into an unconstrained objective (on board)
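A sketch of the standard Lagrange-multiplier argument, filling in the step done on the board:
  Lagrangian: L(θ, λ) = Σ_k x_k log(θ_k) + λ (1 − Σ_k θ_k)
  Setting ∂L/∂θ_k = x_k / θ_k − λ = 0 gives θ_k = x_k / λ
  The constraint Σ_k θ_k = 1 then forces λ = Σ_k x_k, hence θ_k = x_k / Σ_{k'} x_{k'}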
Maximum Likelihood Estimates
The parameters that maximize the likelihood of the data are given by:
  θ_k = x_k / Σ_{k'} x_{k'}
  (K-sided die: ∀k, P(X = k) = θ_k)
This is the relative frequency of rolls where side k comes up!
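A minimal sketch of this relative-frequency estimate for die rolls (illustrative data, not from the slides):

from collections import Counter

rolls = [1, 3, 3, 6, 2, 3, 1, 6]      # hypothetical iid rolls of a 6-sided die
counts = Counter(rolls)               # counts[k] = x_k, the number of rolls showing side k
total = sum(counts.values())
theta_mle = {k: x_k / total for k, x_k in counts.items()}
print(theta_mle)                      # e.g. theta_3 = 3/8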
Today
• How to compute Maximum Likelihood Estimates
  – For Bernoulli and Categorical Distributions
• Naïve Bayes classifier
Let’s learn a classifier by learning P(Y|X)
• Goal: learn a classifier P(Y|X)
• Prediction:
  – Given an example x
  – Predict ŷ = argmax_y P(Y = y | X = x)
Parameters for P(X,Y) vs. P(Y|X)
  Y = Wealth
  X = <Gender, Hours_worked>
• Joint probability distribution P(X,Y)
• Conditional probability distribution P(Y|X)
Parameters for P(X,Y) and P(Y|X)
• P(Y|X) requires estimating fewer parameters than P(X,Y)
• But that is still too many parameters in practice!
• So we need simplifying assumptions to make estimation more practical
Naïve Bayes Assumption
Naïve Bayes assumes
  P(X_1, X_2, …, X_d | Y) = ∏_{i=1}^{d} P(X_i | Y)
i.e., that X_i and X_j are conditionally independent given Y, for all i ≠ j
Conditional Independence
• Definition: X is conditionally independent of Y given Z if P(X|Y,Z) = P(X|Z)
• Recall that X is independent of Y if P(X|Y) = P(X)
Naïve Bayes classifier
  ŷ = argmax_y P(Y = y | X = x)
     = argmax_y P(Y = y) P(X = x | Y = y)
     = argmax_y P(Y = y) ∏_{i=1}^{d} P(X_i = x_i | Y = y)
Bayes rule + conditional independence assumption
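A sketch of this decision rule in code, assuming the parameters P(Y = y) and P(X_i = v | Y = y) have already been estimated (the data structures and names are illustrative):

import math

def predict(x, prior, likelihood):
    # prior[y] = P(Y = y); likelihood[y][i][v] = P(X_i = v | Y = y)
    # Returns argmax_y P(Y = y) * prod_i P(X_i = x_i | Y = y), computed in log space
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, v in enumerate(x):
            score += math.log(likelihood[y][i][v])
        if score > best_score:
            best_y, best_score = y, score
    return best_y

Working in log space only avoids numerical underflow when d is large; it does not change the argmax.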
How many parameters do we need to learn?
• To describe P(Y)?
• To describe P(X = <X_1, X_2, …, X_d> | Y)?
  – Without conditional independence assumption?
  – With conditional independence assumption?
(Suppose all random variables are Boolean)
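For a quick sanity check with Boolean Y and X_i: P(Y) takes 1 parameter; P(X_1, …, X_d | Y) takes 2(2^d − 1) parameters without the conditional independence assumption, but only 2d with it.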
Training a Naïve Bayes classifier
Let’s assume discrete X_i and Y

TrainNaïveBayes(Data)
  for each value y_k of Y
    estimate π_k = P(Y = y_k) = (# examples for which Y = y_k) / (# examples)
  for each value x_ij of X_i
    estimate θ_ijk = P(X_i = x_ij | Y = y_k) = (# examples for which X_i = x_ij and Y = y_k) / (# examples for which Y = y_k)
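A minimal Python sketch of this counting procedure (plain maximum-likelihood estimates; the function and variable names are illustrative, and the output pairs with the predict sketch above):

from collections import Counter, defaultdict

def train_naive_bayes(examples):
    # examples: list of (x, y) pairs, where x is a tuple of discrete feature values
    y_counts = Counter(y for _, y in examples)
    n = len(examples)
    prior = {y: c / n for y, c in y_counts.items()}        # pi_k = P(Y = y_k)

    feature_counts = defaultdict(Counter)                   # (y, i) -> counts of values of X_i
    for x, y in examples:
        for i, v in enumerate(x):
            feature_counts[(y, i)][v] += 1

    likelihood = {y: {} for y in y_counts}                  # theta_ijk = P(X_i = x_ij | Y = y_k)
    for (y, i), counter in feature_counts.items():
        likelihood[y][i] = {v: c / y_counts[y] for v, c in counter.items()}
    return prior, likelihood

Note that a feature value never seen with class y_k gets no probability mass here, which is exactly the zero-estimate issue raised on the wrap-up slide.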
Naïve Bayes Wrap-up
• A simple classifier that performs well in practice
• Subtleties
  – Often the X_i are not really conditionally independent
  – What if the Maximum Likelihood estimate for P(X_i|Y) is zero?
What you should know
• The Naïve Bayes classifier
  – Conditional independence assumption
  – How to train it?
  – How to make predictions?
  – How does it relate to other classifiers we know? [HW]
• Fundamental Machine Learning concepts
  – iid assumption
  – Bayes optimal classifier