

  1. Naïve Bayes Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

  2. Goals for the lecture • understand the concepts • generative/discriminative models • examples of the two approaches • MLE (Maximum Likelihood Estimation) • Naïve Bayes • Naïve Bayes assumption • model 1: Bernoulli Naïve Bayes • model 2: Multinomial Naïve Bayes • model 3: Gaussian Naïve Bayes • model 4: Multiclass Naïve Bayes

  3. Review: supervised learning problem setting • set of possible instances: $X$ • unknown target function (concept): $f: X \rightarrow Y$ • set of hypotheses (hypothesis class): $H = \{ h \mid h: X \rightarrow Y \}$ • given: training set of instances of the unknown target function $f$: $\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)}) \}$ • output: hypothesis $h \in H$ that best approximates the target function

  4. Parametric hypothesis class • hypothesis $h_\theta \in H$ is indexed by a parameter $\theta$ • learning: find the $\theta$ such that $h_\theta \in H$ best approximates the target • different from nonparametric approaches like decision trees and nearest neighbor • advantages: flexible choice of hypothesis class; easier to use math/optimization

  5. Discriminative approaches • hypothesis $h \in H$ directly predicts the label given the features: $y = h(x)$, or more generally, $p(y \mid x)$ • then define a loss function $L(h)$ and find the hypothesis with minimum loss • example: linear regression $h_\theta(x) = \langle \theta, x \rangle$, with $L(h) = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$

  6. Generative approaches • hypothesis $h \in H$ specifies a generative story for how the data was created: $h(x, y) = p(x, y)$ • then pick a hypothesis by maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation • example: roll a weighted die • weights $\theta$ for each side define how the data are generated • use MLE on the training data to learn $\theta$
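  As a worked version of the weighted-die example (not spelled out in the slide text): if side $j$ of a $K$-sided die has probability $\theta_j$ and appears $n_j$ times in $m$ independent rolls, the MLE is the empirical frequency:
  $$\hat{\theta} = \arg\max_{\theta:\, \sum_j \theta_j = 1} \; \sum_{j=1}^{K} n_j \log \theta_j \quad \Longrightarrow \quad \hat{\theta}_j = \frac{n_j}{m}.$$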

  7. Comments on discriminative/generative • usually used for supervised learning with a parametric hypothesis class • can also be used for unsupervised learning • k-means clustering (discriminative flavor) vs. mixture of Gaussians (generative) • can also be used with nonparametric models • nonparametric Bayesian methods: a large subfield of ML • when is discriminative or generative likely to be better? discussed in a later lecture • typical discriminative: linear regression, logistic regression, SVM, many neural networks (not all!), … • typical generative: Naïve Bayes, Bayesian networks, …

  8. MLE vs. MAP Maximum Likelihood Estimate (MLE)

  9. Background: MLE Example: MLE of Exponential Distribution

  10. Background: MLE Example: MLE of Exponential Distribution

  11. Background: MLE Example: MLE of Exponential Distribution
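  The derivation on these three slides is in the images rather than the text; a standard worked version, assuming i.i.d. samples $x^{(1)}, \ldots, x^{(m)}$ from the exponential density $p(x \mid \lambda) = \lambda e^{-\lambda x}$, is:
  $$\ell(\lambda) = \sum_{i=1}^{m} \log\left(\lambda e^{-\lambda x^{(i)}}\right) = m \log \lambda - \lambda \sum_{i=1}^{m} x^{(i)}, \qquad \frac{d\ell}{d\lambda} = \frac{m}{\lambda} - \sum_{i=1}^{m} x^{(i)} = 0 \;\Longrightarrow\; \hat{\lambda}_{\text{MLE}} = \frac{m}{\sum_{i=1}^{m} x^{(i)}} = \frac{1}{\bar{x}}.$$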

  12. MLE vs. MAP Maximum Likelihood Estimate (MLE) Maximum a posteriori (MAP) estimate Prior
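  The formulas contrasted on this slide are images in the transcript; the standard definitions being contrasted are, for data $D$, parameter $\theta$, and prior $p(\theta)$:
  $$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \; p(D \mid \theta), \qquad \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \; p(\theta \mid D) = \arg\max_{\theta} \; p(D \mid \theta)\, p(\theta).$$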

  13. Motivating example documents (shown as images on the slide): Spam, News, The Economist, The Onion

  14. Model 0: Not-so-naïve Model? Generative Story: 1. Flip a weighted coin (Y) 2. If heads, roll the red many-sided die to sample a document vector (X) from the Spam distribution 3. If tails, roll the blue many-sided die to sample a document vector (X) from the Not-Spam distribution This model is computationally naïve!

  15. Model 0: Not-so-naïve Model? Generative Story: 1. Flip a weighted coin (Y) 2. If heads, sample a document ID (X) from the Spam distribution 3. If tails, sample a document ID (X) from the Not-Spam distribution This model is computationally naïve!

  16. Model 0: Not-so-naïve Model? Flip weighted coin: if HEADS, roll the red die; if TAILS, roll the blue die. Each side of the die is labeled with a document vector (e.g. [1,0,1,…,1]).
      y   x1  x2  x3  …  xK
      0   1   0   1   …  1
      1   0   1   0   …  1
      1   1   1   1   …  1
      0   0   0   1   …  1
      0   1   0   1   …  0
      1   1   0   1   …  0

  17. Naïve Bayes Assumption Conditional independence of features:

  18. Assuming conditional independence, the conditional probabilities encode the same information as the joint table. They are very convenient for estimating $P(X_1, \ldots, X_n \mid Y) = P(X_1 \mid Y) \cdots P(X_n \mid Y)$, and they are almost as good for computing
  $$P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, \ldots, X_n)}, \qquad \forall y: \; P(Y = y \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y)\, P(Y = y)}{P(X_1 = x_1, \ldots, X_n = x_n)}.$$
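  A quick parameter count (not on the slide) of why the factored form is convenient: with $n$ binary features, the full class-conditional joint table needs $2^n - 1$ free parameters per class, while the Naïve Bayes factorization needs only $n$ per class:
  $$\underbrace{2^n - 1}_{\text{full joint, per class}} \quad \text{vs.} \quad \underbrace{n}_{\text{Naïve Bayes, per class}}, \qquad n = 30: \; 2^{30} - 1 \approx 1.07 \times 10^9 \; \text{vs.} \; 30.$$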

  19. Generic Naïve Bayes Model Support: depends on the choice of event model, $P(X_k \mid Y)$ Model: product of the prior and the event model Training: find the class-conditional MLE parameters. For $P(Y)$, we find the MLE using all the data. For each $P(X_k \mid Y)$ we condition on the data with the corresponding class. Classification: find the class that maximizes the posterior
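  Putting the slide's description into symbols (the slide's own equations are images): with prior $P(Y)$ and event model $P(X_k \mid Y)$, the model and the classification rule are
  $$P(X_1, \ldots, X_K, Y) = P(Y) \prod_{k=1}^{K} P(X_k \mid Y), \qquad \hat{y} = \arg\max_{y} \; P(Y = y) \prod_{k=1}^{K} P(X_k = x_k \mid Y = y).$$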

  20. Generic Naïve Bayes Model Classification:

  21. Model 1: Bernoulli Naïve Bayes Support: Binary vectors of length K Generative Story: Model:
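  The generative story and model on this slide are images; the standard Bernoulli Naïve Bayes they presumably show is, with illustrative parameter names $\phi$ (coin weight) and $\theta_{k,y}$ (per-feature, per-class coin weights):
  $$Y \sim \text{Bernoulli}(\phi), \qquad X_k \mid Y = y \sim \text{Bernoulli}(\theta_{k,y}), \qquad p(\mathbf{x}, y) = p(y) \prod_{k=1}^{K} \theta_{k,y}^{\,x_k} (1 - \theta_{k,y})^{\,1 - x_k}.$$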

  22. Model 1: Bernoulli Naïve Bayes Flip weighted coin: if HEADS, flip each red coin; if TAILS, flip each blue coin. Each red coin corresponds to an x_k. We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one).
      y   x1  x2  x3  …  xK
      0   1   0   1   …  1
      1   0   1   0   …  1
      1   1   1   1   …  1
      0   0   0   1   …  1
      0   1   0   1   …  0
      1   1   0   1   …  0

  23. Model 1: Bernoulli Naïve Bayes Support: Binary vectors of length K Generative Story: Same as Generic Naïve Bayes Model: Classification: Find the class that maximizes the posterior

  24. Generic Naïve Bayes Model Classification:

  25. Model 1: Bernoulli Naïve Bayes Training: Find the class-conditional MLE parameters For P(Y), we find the MLE using all the data. For each P(X k |Y) we condition on the data with the corresponding class.
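  A minimal runnable sketch of this training and classification procedure, assuming binary labels and features stored in NumPy arrays; the function and variable names are illustrative (not from the course materials), and the optional add-alpha smoothing goes slightly beyond the plain MLE on the slide (set alpha=0 to recover it):

      import numpy as np

      def train_bernoulli_nb(X, y, alpha=1.0):
          # Illustrative sketch. X: (m, K) binary matrix, y: (m,) labels.
          # Prior P(Y=c) by MLE; theta[c, k] = P(X_k = 1 | Y = c), with
          # add-alpha smoothing to avoid zero probabilities (alpha=0 -> plain MLE).
          X = np.asarray(X, dtype=float)
          y = np.asarray(y)
          classes = np.unique(y)
          prior = np.array([(y == c).mean() for c in classes])
          theta = np.array([(X[y == c].sum(axis=0) + alpha) /
                            ((y == c).sum() + 2 * alpha) for c in classes])
          return classes, prior, theta

      def predict_bernoulli_nb(x, classes, prior, theta):
          # Return the class maximizing the log posterior for one binary vector x.
          x = np.asarray(x, dtype=float)
          log_post = np.log(prior) + (x * np.log(theta) +
                                      (1 - x) * np.log(1 - theta)).sum(axis=1)
          return classes[np.argmax(log_post)]

      # toy usage: 4 documents, 3 binary features
      X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
      y = np.array([1, 1, 0, 0])
      model = train_bernoulli_nb(X, y)
      print(predict_bernoulli_nb([1, 0, 1], *model))

  Working in log space avoids underflow from multiplying many small probabilities, which is the usual way this classification rule is implemented.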

  26. Model 2: Multinomial Naïve Bayes Support: Integer vector (word IDs) Generative Story: Model:
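  A sketch of the corresponding multinomial event model, assuming each document is summarized as a length-V vector of word counts rather than a raw list of word IDs; the names and the add-alpha smoothing are again illustrative assumptions:

      import numpy as np

      def train_multinomial_nb(X_counts, y, alpha=1.0):
          # X_counts: (m, V) word-count matrix, y: (m,) labels.
          # theta[c, v] = P(word v | Y = c), estimated from pooled counts per class.
          X_counts = np.asarray(X_counts, dtype=float)
          y = np.asarray(y)
          classes = np.unique(y)
          prior = np.array([(y == c).mean() for c in classes])
          counts = np.array([X_counts[y == c].sum(axis=0) + alpha for c in classes])
          theta = counts / counts.sum(axis=1, keepdims=True)
          return classes, prior, theta

      def predict_multinomial_nb(x_counts, classes, prior, theta):
          # Score each class with log P(Y=c) + sum_v x_v * log theta[c, v].
          scores = np.log(prior) + np.asarray(x_counts, dtype=float) @ np.log(theta).T
          return classes[np.argmax(scores)]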

  27. Model 3: Gaussian Naïve Bayes Support: Model: Product of prior and the event model
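  A sketch of the Gaussian variant, assuming real-valued feature vectors and an independent Gaussian per feature and class; the small variance floor eps is an implementation detail added here, not something stated on the slide:

      import numpy as np

      def train_gaussian_nb(X, y, eps=1e-9):
          # MLE of the prior and of per-class, per-feature means and variances.
          X, y = np.asarray(X, dtype=float), np.asarray(y)
          classes = np.unique(y)
          prior = np.array([(y == c).mean() for c in classes])
          mu = np.array([X[y == c].mean(axis=0) for c in classes])
          var = np.array([X[y == c].var(axis=0) + eps for c in classes])
          return classes, prior, mu, var

      def predict_gaussian_nb(x, classes, prior, mu, var):
          # Log posterior up to a constant: log P(Y=c) + sum_k log N(x_k; mu, var).
          x = np.asarray(x, dtype=float)
          log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
          return classes[np.argmax(np.log(prior) + log_lik)]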

  28. Model 4: Multiclass Naïve Bayes Model:
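  The model equation on this slide is an image; the standard generalization it presumably shows is that the label prior becomes a categorical distribution over $C$ classes while the factorization is unchanged:
  $$Y \sim \text{Categorical}(\pi_1, \ldots, \pi_C), \qquad p(\mathbf{x}, y) = \pi_y \prod_{k=1}^{K} p(x_k \mid y), \qquad \hat{y} = \arg\max_{c \in \{1, \ldots, C\}} \; \pi_c \prod_{k=1}^{K} p(x_k \mid c).$$
  Note that the code sketches above already handle more than two classes, since they estimate one set of parameters per value in np.unique(y).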
