CS 331: Artificial Intelligence
Naïve Bayes
(Thanks to Andrew Moore for some course material.)

Naïve Bayes
• A special type of Bayesian network
• Makes a conditional independence assumption
• Typically used for classification

Classification
Suppose you are trying to classify situations that determine whether or not Canvas will be down. You've come up with the following list of variables (which are all Boolean):
• Monday: It is a Monday
• Assn: A CS331 assignment is due
• Grades: The CS331 instructor needs to enter grades
• Win: The Beavers won the football game
We also have a Boolean variable called CD, which stands for "Canvas down".

CD is called the "class" variable (because it is what we are trying to classify). Monday, Assn, Grades, and Win are called features or attributes. The entries in the CD column below are called "class labels".

You create a dataset out of your past experience. This is called "training data":

  Monday   Assn    Grades   Win     CD
  true     true    true     false   true
  false    true    true     true    false
  true     false   false    false   false
  false    true    false    false   true
  true     true    true     false   true
  false    false   true     false   true
  true     true    false    true    false

You now have 2 new situations and you would like to predict if Canvas will go down. This is called "test data":

  Monday   Assn    Grades   Win
  true     true    true     true
  false    true    true     false

Naïve Bayes Structure

          CD
        / | | \
       v  v v  v
       M  A G  W

Notice the conditional independence assumption: the features are conditionally independent given the class variable.
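To make the running example concrete, here is a minimal Python sketch of the training and test data above (the dictionary-of-records representation and all names are my own choice for illustration, not part of the course material; the table values are as transcribed from the slides):

```python
# The Canvas-down training data from the slides, one dict per record.
TRAINING_DATA = [
    {"Monday": True,  "Assn": True,  "Grades": True,  "Win": False, "CD": True},
    {"Monday": False, "Assn": True,  "Grades": True,  "Win": True,  "CD": False},
    {"Monday": True,  "Assn": False, "Grades": False, "Win": False, "CD": False},
    {"Monday": False, "Assn": True,  "Grades": False, "Win": False, "CD": True},
    {"Monday": True,  "Assn": True,  "Grades": True,  "Win": False, "CD": True},
    {"Monday": False, "Assn": False, "Grades": True,  "Win": False, "CD": True},
    {"Monday": True,  "Assn": True,  "Grades": False, "Win": True,  "CD": False},
]

# The two new situations for which we want to predict CD (the test data).
TEST_DATA = [
    {"Monday": True,  "Assn": True, "Grades": True, "Win": True},
    {"Monday": False, "Assn": True, "Grades": True, "Win": False},
]
```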
Naïve Bayes Parameters

The network has parameters P(CD), P(M | CD), P(A | CD), P(G | CD), and P(W | CD). How do you get these parameters from the training data?

P(CD) is estimated from the class counts:

  CD      P(CD)
  false   (# of records in training data with CD = false) / (# of records in training data)
  true    (# of records in training data with CD = true)  / (# of records in training data)

Each conditional probability table is estimated the same way, e.g. P(M | CD):

  M       CD      P(M | CD)
  false   false   (# of records with M = false and CD = false) / (# of records with CD = false)
  false   true    (# of records with M = false and CD = true)  / (# of records with CD = true)
  true    false   (# of records with M = true and CD = false)  / (# of records with CD = false)
  true    true    (# of records with M = true and CD = true)   / (# of records with CD = true)

Inference in Naïve Bayes

P(CD | M, A, G, W)
  = P(M, A, G, W | CD) P(CD) / P(M, A, G, W)         (by Bayes Rule)
  ∝ P(M, A, G, W | CD) P(CD)                         (treat the denominator as a constant)
  = P(CD) P(M | CD) P(A | CD) P(G | CD) P(W | CD)    (from conditional independence)

Prediction
• Suppose you are now in a day when M = true, A = true, G = true, W = true.
• You need to predict if CD = true or CD = false.
• We will use the notation that CD = true is written cd and CD = false is written ¬cd.
• You need to compare:
  – P(cd | m, a, g, w)  = α P(cd)  P(m | cd)  P(a | cd)  P(g | cd)  P(w | cd)
  – P(¬cd | m, a, g, w) = α P(¬cd) P(m | ¬cd) P(a | ¬cd) P(g | ¬cd) P(w | ¬cd)
• Whichever probability is the bigger of the two above, that is your prediction for CD.
• Because you take the max of the two probabilities above, you can ignore α (it is the same in both).
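As an illustration of how these counts turn into a prediction, here is a hedged Python sketch, assuming the TRAINING_DATA list from the earlier sketch (the function names are invented, and no smoothing is applied yet):

```python
FEATURES = ["Monday", "Assn", "Grades", "Win"]

def prior(cd_value):
    """P(CD = cd_value): fraction of training records with that class label."""
    matching = [r for r in TRAINING_DATA if r["CD"] == cd_value]
    return len(matching) / len(TRAINING_DATA)

def likelihood(feature, f_value, cd_value):
    """P(feature = f_value | CD = cd_value): fraction of the CD = cd_value
    records that also have feature = f_value (raw counting, no smoothing)."""
    given = [r for r in TRAINING_DATA if r["CD"] == cd_value]
    matching = [r for r in given if r[feature] == f_value]
    return len(matching) / len(given)

def unnormalized_posterior(record, cd_value):
    """alpha * P(CD) * prod of P(feature | CD); alpha is dropped because it
    is the same constant for both classes."""
    p = prior(cd_value)
    for f in FEATURES:
        p *= likelihood(f, record[f], cd_value)
    return p

test = {"Monday": True, "Assn": True, "Grades": True, "Win": True}
prediction = unnormalized_posterior(test, True) > unnormalized_posterior(test, False)
print("Predict CD =", prediction)
```

Note that with this data, P(W = true | CD = true) is 0, so the CD = true score collapses to 0 for this test record; that is exactly the problem Technical Point #2 below addresses.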
The General Case

          Y
        / | \
       v  v  v
      X1 X2 … Xm

1. Estimate P(Y = v) as the fraction of records with Y = v.
2. Estimate P(Xj = uj | Y = v) as the fraction of "Y = v" records that also have Xj = uj.
3. To predict the Y value given observations of all the Xj values, compute

   Y_predict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

Naïve Bayes Classifier

Expanding the prediction rule step by step:

   Y_predict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

             = argmax_v P(Y = v, X1 = u1, …, Xm = um) / P(X1 = u1, …, Xm = um)      (definition of conditional probability)

             = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v) / P(X1 = u1, …, Xm = um)

             = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)      (the denominator does not depend on v)

             = argmax_v P(Y = v) ∏_{j=1..m} P(Xj = uj | Y = v)       (because of the structure of the Bayes net)
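Here is a sketch of the general-case rule as one possible Python implementation (again assuming dictionary records as in the earlier sketches; this is illustrative, not the course's reference code):

```python
# Y_predict = argmax_v P(Y=v) * prod_j P(X_j=u_j | Y=v), estimated by raw
# counting (no smoothing).

def nb_predict(records, class_field, feature_fields, observation):
    class_values = {r[class_field] for r in records}
    best_value, best_score = None, -1.0
    for v in class_values:
        rows_v = [r for r in records if r[class_field] == v]
        score = len(rows_v) / len(records)               # P(Y = v)
        for f in feature_fields:
            matches = sum(1 for r in rows_v if r[f] == observation[f])
            score *= matches / len(rows_v)               # P(X_j = u_j | Y = v)
        if score > best_score:
            best_value, best_score = v, score
    return best_value

# Usage, e.g.:
# nb_predict(TRAINING_DATA, "CD", ["Monday", "Assn", "Grades", "Win"],
#            {"Monday": True, "Assn": True, "Grades": True, "Win": True})
```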
Technical Point #1
• The probabilities P(Xj = uj | Y = v) can sometimes be really small.
• This can result in numerical instability, since floating point numbers are not represented exactly on any computer architecture.
• To get around this, take the log of the last line on the previous slide, i.e.

   Y_predict = argmax_v [ log P(Y = v) + Σ_{j=1..m} log P(Xj = uj | Y = v) ]

Technical Point #2
• When estimating parameters, what happens if you don't have any records that match a certain combination of features?
• For example, in our training data, we didn't have M = false, A = false, G = false, W = false.
• This means that some P(Xj = uj | Y = v) in the expression

   P(Y = v) ∏_{j=1..m} P(Xj = uj | Y = v)

  will be 0, so the entire expression will be 0. Even more horrible things happen if you had this expression in log space (log 0 is undefined).

Uniform Dirichlet Priors

Let Nj be the number of values that Xj can take on. Estimate

   P(Xj = uj | Y = v) = ((# records with Xj = uj and Y = v) + 1) / ((# records with Y = v) + Nj)

What happens when you have no records with Y = v?

   P(Xj = uj | Y = v) = 1 / Nj

This means that each value of Xj is equally likely in the absence of data. If you have a lot of data, it dominates the 1/Nj value. We call this trick a "uniform Dirichlet prior".

Example
Using the training data above, compute P(M | CD) using uniform Dirichlet priors.

CW: Practice
Using the same training data, compute P(W = true | CD = true) using uniform Dirichlet priors.

Programming Assignment #3
You will classify text into two classes. There are two files:
1. Training data: trainingSet.txt
2. Testing data: testSet.txt
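Putting the two technical points together, here is a sketch combining log-space scoring with the uniform Dirichlet prior (assuming the earlier TRAINING_DATA; n_values is Nj, which is 2 here because every feature is Boolean):

```python
import math

def smoothed_likelihood(records, feature, f_value, class_field, v, n_values=2):
    """P(X_j = u_j | Y = v) with a uniform Dirichlet prior:
    (count + 1) / (total + N_j), where N_j is the number of values X_j can take."""
    rows_v = [r for r in records if r[class_field] == v]
    matches = sum(1 for r in rows_v if r[feature] == f_value)
    return (matches + 1) / (len(rows_v) + n_values)

def log_score(records, class_field, features, observation, v):
    """log P(Y = v) + sum_j log P(X_j = u_j | Y = v).
    Smoothing guarantees no likelihood is zero, so every log is defined."""
    rows_v = [r for r in records if r[class_field] == v]
    score = math.log(len(rows_v) / len(records))
    for f in features:
        score += math.log(
            smoothed_likelihood(records, f, observation[f], class_field, v))
    return score
```

With the training table as transcribed above, smoothed_likelihood(TRAINING_DATA, "Monday", True, "CD", True) gives (2 + 1)/(4 + 2) = 1/2, and the CW exercise P(W = true | CD = true) comes out to (0 + 1)/(4 + 2) = 1/6: nonzero despite there being no matching records, which is the whole point of the prior.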