  1. 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1

  2. Reminders • Homework 6: PAC Learning / Generative Models – Out: Wed, Oct 31 – Due: Wed, Nov 7 at 11:59pm (1 week) TIP: Do the readings! • Exam Viewing – Thu, Nov 1 – Fri, Nov 2 2

  3. NAÏVE BAYES 4

  4. Naïve Bayes Outline • Real-world Dataset – Economist vs. Onion articles – Document → bag-of-words → binary feature vector • Naive Bayes: Model – Generating synthetic "labeled documents" – Definition of model – Naive Bayes assumption – Counting # of parameters with / without NB assumption • Naïve Bayes: Learning from Data – Data likelihood – MLE for Naive Bayes – MAP for Naive Bayes • Visualizing Gaussian Naive Bayes 5

  5. Fake News Detector Today’s Goal: To define a generative model of documents of two different classes (e.g. real vs. fake news articles) The Economist The Onion 6

  6. Naive Bayes: Model Whiteboard – Document → bag-of-words → binary feature vector – Generating synthetic "labeled documents" – Definition of model – Naive Bayes assumption – Counting # of parameters with / without NB assumption 7

  7. Model 1: Bernoulli Naïve Bayes [Figure: flip a weighted coin to draw y; if HEADS, flip each red coin; if TAILS, flip each blue coin. Each red coin corresponds to one feature x_m.] Example of data generated this way (columns y, x_1, x_2, x_3, …, x_M):
     0 1 0 1 … 1
     1 0 0 1 … 1
     1 1 1 1 … 1
     0 0 1 0 … 1
     0 1 0 1 … 0
     1 1 0 1 … 0
     We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one). 8

  8. What’s wrong with the Naïve Bayes Assumption? The features might not be independent!! • Example 1: – If a document contains the word “Donald”, it’s extremely likely to contain the word “Trump” – These are not independent! • Example 2: – If the petal width is very high, the petal length is also likely to be very high 9

  9. Naïve Bayes: Learning from Data Whiteboard – Data likelihood – MLE for Naive Bayes – Example: MLE for Naïve Bayes with Two Features – MAP for Naive Bayes 10

  10. NAÏVE BAYES: MODEL DETAILS 11

  11. Model 1: Bernoulli Naïve Bayes Support: binary vectors of length K, $\mathbf{x} \in \{0,1\}^K$. Generative Story: $Y \sim \mathrm{Bernoulli}(\phi)$; $X_k \sim \mathrm{Bernoulli}(\theta_{k,Y})\ \forall k \in \{1,\dots,K\}$. Model: $p_{\phi,\theta}(\mathbf{x}, y) = p_{\phi,\theta}(x_1,\dots,x_K,y) = p_\phi(y) \prod_{k=1}^{K} p_{\theta_k}(x_k \mid y) = (\phi)^y (1-\phi)^{(1-y)} \prod_{k=1}^{K} (\theta_{k,y})^{x_k} (1-\theta_{k,y})^{(1-x_k)}$ 12

  12. Model 1: Bernoulli Naïve Bayes Support: binary vectors of length K, $\mathbf{x} \in \{0,1\}^K$. Generative Story: $Y \sim \mathrm{Bernoulli}(\phi)$; $X_k \sim \mathrm{Bernoulli}(\theta_{k,Y})\ \forall k \in \{1,\dots,K\}$. Model (same as Generic Naïve Bayes): $p_{\phi,\theta}(\mathbf{x}, y) = (\phi)^y (1-\phi)^{(1-y)} \prod_{k=1}^{K} (\theta_{k,y})^{x_k} (1-\theta_{k,y})^{(1-x_k)}$. Classification: find the class that maximizes the posterior, $\hat{y} = \operatorname*{argmax}_y\, p(y \mid \mathbf{x})$ 13
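Not in the slides: a minimal NumPy sketch of the Bernoulli Naïve Bayes joint probability and the argmax classification rule above, assuming the parameters $\phi$ and $\theta$ (here a K×2 array, one column per class) are already known; all parameter values below are made up for illustration.

```python
import numpy as np

def bernoulli_nb_joint(x, y, phi, theta):
    """p(x, y) = p(y) * prod_k p(x_k | y) for Bernoulli Naive Bayes.
    x: binary vector of length K; y: 0 or 1;
    phi = P(Y = 1); theta[k, y] = P(X_k = 1 | Y = y)."""
    p_y = phi if y == 1 else 1.0 - phi
    p_x_given_y = np.prod(theta[:, y] ** x * (1.0 - theta[:, y]) ** (1 - x))
    return p_y * p_x_given_y

def classify(x, phi, theta):
    """argmax_y p(y | x), which equals argmax_y p(x, y)."""
    return max((0, 1), key=lambda c: bernoulli_nb_joint(x, c, phi, theta))

# Toy usage with made-up parameters (K = 3 features)
phi = 0.5
theta = np.array([[0.1, 0.9],   # theta[k, y]
                  [0.8, 0.3],
                  [0.5, 0.5]])
print(classify(np.array([1, 0, 1]), phi, theta))
```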

  13. Model 1: Bernoulli Naïve Bayes Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class. $\phi = \frac{\sum_{i=1}^{N} I(y^{(i)}=1)}{N}$, $\theta_{k,0} = \frac{\sum_{i=1}^{N} I(y^{(i)}=0 \wedge x_k^{(i)}=1)}{\sum_{i=1}^{N} I(y^{(i)}=0)}$, $\theta_{k,1} = \frac{\sum_{i=1}^{N} I(y^{(i)}=1 \wedge x_k^{(i)}=1)}{\sum_{i=1}^{N} I(y^{(i)}=1)}$, $\forall k \in \{1,\dots,K\}$ 14

  14. Model 1: Bernoulli Naïve Bayes Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class: $\phi = \frac{\sum_{i=1}^{N} I(y^{(i)}=1)}{N}$, $\theta_{k,0} = \frac{\sum_{i=1}^{N} I(y^{(i)}=0 \wedge x_k^{(i)}=1)}{\sum_{i=1}^{N} I(y^{(i)}=0)}$, $\theta_{k,1} = \frac{\sum_{i=1}^{N} I(y^{(i)}=1 \wedge x_k^{(i)}=1)}{\sum_{i=1}^{N} I(y^{(i)}=1)}$, $\forall k \in \{1,\dots,K\}$. Data (columns y, x_1, x_2, x_3, …, x_K):
     0 1 0 1 … 1
     1 0 0 1 … 1
     1 1 1 1 … 1
     0 0 1 0 … 1
     0 1 0 1 … 0
     1 1 0 1 … 0   15
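A sketch (not in the slides) of the MLE counts above applied to the first three feature columns of the toy table; the data rows come from the slide, but the restriction to three columns and the variable names are my own.

```python
import numpy as np

# Rows are (y, x1, x2, x3): the first three feature columns of the table above
data = np.array([[0, 1, 0, 1],
                 [1, 0, 0, 1],
                 [1, 1, 1, 1],
                 [0, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 1, 0, 1]])
y, X = data[:, 0], data[:, 1:]
N, K = X.shape

# MLE of the prior: fraction of examples with y = 1
phi = np.mean(y == 1)

# MLE of the class-conditionals: theta[k, c] = count(y = c and x_k = 1) / count(y = c)
theta = np.zeros((K, 2))
for c in (0, 1):
    theta[:, c] = X[y == c].sum(axis=0) / np.sum(y == c)

print(phi)    # 0.5 on this toy table
print(theta)  # column c holds the estimates of P(X_k = 1 | Y = c)
```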

  15. Other NB Models 1. Bernoulli Naïve Bayes: – for binary features 2. Gaussian Naïve Bayes: – for continuous features 3. Multinomial Naïve Bayes: – for integer features 4. Multi-class Naïve Bayes: – for classification problems with > 2 classes – event model could be any of Bernoulli, Gaussian, Multinomial, depending on features 16

  16. Model 2: Gaussian Naïve Bayes Support: $\mathbf{x} \in \mathbb{R}^K$. Model: product of prior and the event model, $p(\mathbf{x}, y) = p(x_1,\dots,x_K,y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$. Gaussian Naïve Bayes assumes that $p(x_k \mid y)$ is given by a Normal distribution. 17
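Not in the slides: a minimal sketch of Gaussian Naïve Bayes scoring, assuming the per-class means and variances have already been estimated (e.g. as class-conditional sample means and variances); every number below is made up.

```python
import numpy as np

def log_gaussian(x, mu, var):
    """Elementwise log density of N(mu, var) evaluated at x."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

def gnb_predict(x, log_prior, mu, var):
    """argmax_y of log p(y) + sum_k log N(x_k; mu[y, k], var[y, k])."""
    scores = log_prior + log_gaussian(x, mu, var).sum(axis=1)
    return int(np.argmax(scores))

# Toy usage: 2 classes, K = 2 continuous features, made-up parameters
log_prior = np.log(np.array([0.5, 0.5]))
mu = np.array([[5.0, 3.4], [6.3, 2.9]])   # mu[y, k]
var = np.array([[0.1, 0.1], [0.3, 0.1]])  # var[y, k]
print(gnb_predict(np.array([6.1, 3.0]), log_prior, mu, var))
```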

  17. Model 3: Multinomial Naïve Bayes Support: Option 1: integer vector of word IDs, $\mathbf{x} = [x_1, x_2, \dots, x_M]$ where $x_m \in \{1,\dots,K\}$ is a word id. Generative Story: for $i \in \{1,\dots,N\}$: $y^{(i)} \sim \mathrm{Bernoulli}(\phi)$; for $j \in \{1,\dots,M_i\}$: $x_j^{(i)} \sim \mathrm{Multinomial}(\boldsymbol{\theta}_{y^{(i)}}, 1)$. Model: $p_{\phi,\theta}(\mathbf{x}, y) = p_\phi(y) \prod_{j=1}^{M_i} p_{\theta}(x_j \mid y) = (\phi)^y (1-\phi)^{(1-y)} \prod_{j=1}^{M_i} \theta_{y, x_j}$ 18
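Not in the slides: a sketch of the word-ID event model above, where a document is a list of word IDs and theta[y] holds the per-class word probabilities; the parameters and the toy document are made up, and word IDs are 0-indexed here rather than 1-indexed as on the slide.

```python
import numpy as np

def multinomial_nb_log_joint(word_ids, y, phi, theta):
    """log p(x, y) = log p(y) + sum_j log theta[y, x_j] for a word-ID document."""
    log_prior = np.log(phi if y == 1 else 1.0 - phi)
    return log_prior + np.sum(np.log(theta[y, word_ids]))

# Toy usage: vocabulary of K = 4 words, made-up parameters
phi = 0.4                                  # P(Y = 1)
theta = np.array([[0.1, 0.2, 0.3, 0.4],    # theta[y, k] = P(word k | class y)
                  [0.7, 0.1, 0.1, 0.1]])
doc = [0, 0, 2, 3]                         # word IDs x_1, ..., x_M
print(max((0, 1), key=lambda c: multinomial_nb_log_joint(doc, c, phi, theta)))
```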

  18. Model 5: Multiclass Naïve Bayes Model: the only change is that we permit y to range over C classes. $p(\mathbf{x}, y) = p(x_1,\dots,x_K,y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$. Now $y \sim \mathrm{Multinomial}(\boldsymbol{\phi}, 1)$ and we have a separate conditional distribution $p(x_k \mid y)$ for each of the C classes. 19
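Not in the slides: a sketch of the multiclass decision rule, assuming the per-class log event-model terms log p(x_k | y) have already been computed by whichever event model is in use; all numbers are made up.

```python
import numpy as np

def multiclass_nb_predict(log_conditionals, prior):
    """argmax over C classes of log p(y) + sum_k log p(x_k | y).
    log_conditionals[c] holds the log p(x_k | y = c) terms for one class,
    produced by any event model (Bernoulli, Gaussian, multinomial)."""
    scores = np.log(prior) + np.array([lc.sum() for lc in log_conditionals])
    return int(np.argmax(scores))

# Toy usage: C = 3 classes, made-up per-class log-likelihood terms
prior = np.array([0.2, 0.5, 0.3])        # y ~ Multinomial(prior, 1)
log_conditionals = [np.log([0.1, 0.4]),  # class 0
                    np.log([0.3, 0.3]),  # class 1
                    np.log([0.2, 0.5])]  # class 2
print(multiclass_nb_predict(log_conditionals, prior))
```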

  19. Generic Naïve Bayes Model Support: depends on the choice of event model, P(X_k | Y). Model: product of prior and the event model, $P(\mathbf{X}, Y) = P(Y) \prod_{k=1}^{K} P(X_k \mid Y)$. Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class. Classification: find the class that maximizes the posterior, $\hat{y} = \operatorname*{argmax}_y\, p(y \mid \mathbf{x})$ 20

  20. Generic Naïve Bayes Model Classification: $\hat{y} = \operatorname*{argmax}_y\, p(y \mid \mathbf{x})$ (posterior) $= \operatorname*{argmax}_y\, \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})}$ (by Bayes' rule) $= \operatorname*{argmax}_y\, p(\mathbf{x} \mid y)\, p(y)$ 21
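Not in the slides: a tiny numeric sketch of why the last step is safe: p(x) does not depend on y, so normalizing the joint into a posterior cannot change which class attains the maximum. The joint values below are made up.

```python
import numpy as np

joint = np.array([0.012, 0.003])   # made-up values of p(x | y) p(y) for y = 0, 1
posterior = joint / joint.sum()    # p(y | x) = p(x | y) p(y) / p(x)
print(posterior)                   # [0.8, 0.2]
print(np.argmax(joint) == np.argmax(posterior))  # True: same argmax either way
```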

  21. Smoothing 1. Add-1 Smoothing 2. Add-λ Smoothing 3. MAP Estimation (Beta Prior) 22

  22. MLE What does maximizing likelihood accomplish? • There is only a finite amount of probability mass (i.e. sum-to-one constraint) • MLE tries to allocate as much probability mass as possible to the things we have observed… … at the expense of the things we have not observed 23

  23. MLE For Naïve Bayes, suppose we never observe the word "serious" in an Onion article. In this case, what is the MLE of $p(x_k \mid y)$? $\theta_{k,0} = \frac{\sum_{i=1}^{N} I(y^{(i)}=0 \wedge x_k^{(i)}=1)}{\sum_{i=1}^{N} I(y^{(i)}=0)}$. Now suppose we observe the word "serious" at test time. What is the posterior probability that the article was an Onion article? $p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})}$ 24

  24. 1. Add-1 Smoothing The simplest setting for smoothing simply adds a single pseudo-observation to the data. This converts the true observations $\mathcal{D}$ into a new dataset $\mathcal{D}'$ from which we derive the MLEs. $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ (1), $\mathcal{D}' = \mathcal{D} \cup \{(\mathbf{0}, 0), (\mathbf{0}, 1), (\mathbf{1}, 0), (\mathbf{1}, 1)\}$ (2), where $\mathbf{0}$ is the vector of all zeros and $\mathbf{1}$ is the vector of all ones. This has the effect of pretending that we observed each feature $x_k$ with each class y. 25

  25. 1. Add-1 Smoothing What if we write the MLEs in terms of the original dataset $\mathcal{D}$? $\phi = \frac{\sum_{i=1}^{N} I(y^{(i)}=1)}{N}$, $\theta_{k,0} = \frac{1 + \sum_{i=1}^{N} I(y^{(i)}=0 \wedge x_k^{(i)}=1)}{2 + \sum_{i=1}^{N} I(y^{(i)}=0)}$, $\theta_{k,1} = \frac{1 + \sum_{i=1}^{N} I(y^{(i)}=1 \wedge x_k^{(i)}=1)}{2 + \sum_{i=1}^{N} I(y^{(i)}=1)}$, $\forall k \in \{1,\dots,K\}$ 26
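Not in the slides: a sketch of the smoothed estimate above on a made-up three-example dataset, showing that a feature never observed with a class no longer gets probability zero.

```python
import numpy as np

def add_one_theta(X, y, c):
    """Add-1 smoothed estimate of P(X_k = 1 | Y = c):
    (1 + count(y = c and x_k = 1)) / (2 + count(y = c))."""
    Xc = X[y == c]
    return (1.0 + Xc.sum(axis=0)) / (2.0 + Xc.shape[0])

# Toy usage: feature 0 never occurs with class 0, yet its estimate is not 0
X = np.array([[0, 1], [0, 0], [1, 1]])
y = np.array([0, 0, 1])
print(add_one_theta(X, y, c=0))   # [0.25, 0.5] instead of the MLE [0.0, 0.5]
```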

  26. 2. Add-λ Smoothing For the Categorical Distribution: suppose we have a dataset obtained by repeatedly rolling a K-sided (weighted) die. Given data $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$ where $x^{(i)} \in \{1,\dots,K\}$, we have the following MLE: $\phi_k = \frac{\sum_{i=1}^{N} I(x^{(i)}=k)}{N}$. With add-λ smoothing, we add pseudo-observations as before to obtain a smoothed estimate: $\phi_k = \frac{\lambda + \sum_{i=1}^{N} I(x^{(i)}=k)}{K\lambda + N}$ 27
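Not in the slides: a sketch of the add-λ estimate above for a K-sided die; the rolls and the value of λ are made up.

```python
import numpy as np

def add_lambda_categorical(rolls, K, lam):
    """Smoothed estimate phi_k = (lambda + count(x = k)) / (K * lambda + N)
    for a K-sided die, given 1-indexed outcomes in `rolls`."""
    counts = np.bincount(np.asarray(rolls) - 1, minlength=K)
    return (lam + counts) / (K * lam + len(rolls))

# Toy usage: K = 4 sides, side 4 never observed, lambda = 0.5
print(add_lambda_categorical([1, 1, 2, 3], K=4, lam=0.5))  # sums to 1, no zeros
```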

  27. 3. MAP Estimation (Beta Prior) Generative Story: the parameters are drawn once for the entire dataset. for $k \in \{1,\dots,K\}$: for $y \in \{0,1\}$: $\theta_{k,y} \sim \mathrm{Beta}(\alpha, \beta)$; for $i \in \{1,\dots,N\}$: $y^{(i)} \sim \mathrm{Bernoulli}(\phi)$; for $k \in \{1,\dots,K\}$: $x_k^{(i)} \sim \mathrm{Bernoulli}(\theta_{k,y^{(i)}})$. Training: find the class-conditional MAP parameters. $\phi = \frac{\sum_{i=1}^{N} I(y^{(i)}=1)}{N}$, $\theta_{k,0} = \frac{(\alpha-1) + \sum_{i=1}^{N} I(y^{(i)}=0 \wedge x_k^{(i)}=1)}{(\alpha-1) + (\beta-1) + \sum_{i=1}^{N} I(y^{(i)}=0)}$, $\theta_{k,1} = \frac{(\alpha-1) + \sum_{i=1}^{N} I(y^{(i)}=1 \wedge x_k^{(i)}=1)}{(\alpha-1) + (\beta-1) + \sum_{i=1}^{N} I(y^{(i)}=1)}$, $\forall k \in \{1,\dots,K\}$ 28
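Not in the slides: a sketch of the MAP estimate above under a Beta(α, β) prior; note that α = β = 2 reproduces add-1 smoothing, since (α − 1) = (β − 1) = 1.

```python
import numpy as np

def map_theta(X, y, c, alpha, beta):
    """MAP estimate of P(X_k = 1 | Y = c) under a Beta(alpha, beta) prior:
    ((alpha-1) + count(y = c, x_k = 1)) / ((alpha-1) + (beta-1) + count(y = c))."""
    Xc = X[y == c]
    return ((alpha - 1.0) + Xc.sum(axis=0)) / ((alpha - 1.0) + (beta - 1.0) + Xc.shape[0])

# Toy usage: Beta(2, 2) acts like one pseudo-count of each outcome per class
X = np.array([[0, 1], [0, 0], [1, 1]])
y = np.array([0, 0, 1])
print(map_theta(X, y, c=0, alpha=2.0, beta=2.0))   # [0.25, 0.5], same as add-1
```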

  28. VISUALIZING NAÏVE BAYES 29

  29. Fisher Iris Dataset Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris virginica (1), Iris versicolor (2), collected by Anderson (1936).
     Species | Sepal Length | Sepal Width | Petal Length | Petal Width
     0       | 4.3          | 3.0         | 1.1          | 0.1
     0       | 4.9          | 3.6         | 1.4          | 0.1
     0       | 5.3          | 3.7         | 1.5          | 0.2
     1       | 4.9          | 2.4         | 3.3          | 1.0
     1       | 5.7          | 2.8         | 4.1          | 1.3
     1       | 6.3          | 3.3         | 4.7          | 1.6
     1       | 6.7          | 3.0         | 5.0          | 1.7
     Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set   31
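Not part of the lecture: a minimal scikit-learn sketch that fits Gaussian Naïve Bayes to the Fisher Iris data, just to tie this dataset back to Model 2; scikit-learn is my addition, not something the slides use, and its species indices differ from the slide's numbering.

```python
# Minimal sketch: Gaussian Naive Bayes on the Fisher Iris dataset via scikit-learn
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data, iris.target                 # 150 flowers, 4 features, 3 species
model = GaussianNB().fit(X, y)
print(model.theta_[:, 2:])                    # per-class means of petal length / width
print(model.predict([[5.7, 2.8, 4.1, 1.3]]))  # one row from the table above
```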

  30. Slide from William Cohen
