Applied Machine Learning
Naive Bayes
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
  generative vs. discriminative classifiers
  the Naive Bayes classifier assumption
  different design choices
Discriminative vs. generative classification

discriminative: so far we modeled the conditional distribution p(y | x) directly
generative: learn the joint distribution p(y, x) = p(y) p(x | y)
  p(y): prior class probability, the frequency of observing each label
  p(x | y): likelihood of the input features given the class label (the input features for each label come from a different distribution)

how to classify a new input x? Bayes rule:
  p(y = c | x) = p(c) p(x | c) / p(x)
  p(y = c | x): posterior probability of class c
  p(x) = ∑_{c'=1}^C p(x, c'): marginal probability of the input (the evidence)

[figure: class-conditional densities p(x | y = 0) and p(x | y = 1); image: https://rpsychologist.com]
Example: Bayes rule for classification

does the patient have cancer? y ∈ {yes, no}
x ∈ {−, +}: the test result, a single binary feature

prior: 1% of the population has cancer, p(yes) = .01
likelihood: p(+ | yes) = .9, the true-positive rate of the test (90%)
            p(+ | no) = .05, the false-positive rate of the test (5%)

Bayes rule: p(c | x) = p(c) p(x | c) / p(x)
evidence: p(+) = p(yes) p(+ | yes) + p(no) p(+ | no) = .01 × .9 + .99 × .05 = .0585
posterior: p(yes | +) = .01 × .9 / .0585 ≈ .15

in a generative classifier the likelihood and the prior class probabilities are learned from data
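As a sanity check, the computation above can be reproduced in a few lines of Python (a minimal sketch, not part of the slides):

# Bayes rule for the cancer-test example
prior_yes = 0.01                 # p(yes)
tp_rate = 0.9                    # p(+ | yes), true-positive rate
fp_rate = 0.05                   # p(+ | no), false-positive rate

evidence = prior_yes * tp_rate + (1 - prior_yes) * fp_rate   # p(+) = 0.0585
posterior_yes = prior_yes * tp_rate / evidence               # p(yes | +) ≈ 0.154
print(evidence, posterior_yes)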
Generative classification

p(y = c | x) = p(c) p(x | c) / p(x)
  p(c): prior class probability, the frequency of observing each label
  p(x | c): likelihood of the input features given the class label (the input features for each label come from a different distribution)
  p(y = c | x): posterior probability of class c
  p(x) = ∑_{c'=1}^C p(x, c'): marginal probability of the input (the evidence)

some generative classifiers:
  Gaussian Discriminant Analysis: the likelihood is a multivariate Gaussian
  Naive Bayes: decomposed likelihood

[image: https://rpsychologist.com]
Naive Bayes: model

assumption about the likelihood:  p(x | y) = ∏_{d=1}^D p(x_d | y),  where D is the number of input features

when is this assumption correct? when the features are conditionally independent given the label, x_i ⊥ x_j | y:
knowing the label, the value of one input feature gives us no information about the other input features

chain rule of probability (true for any distribution):
  p(x | y) = p(x_1 | y) p(x_2 | y, x_1) p(x_3 | y, x_1, x_2) ⋯ p(x_D | y, x_1, …, x_{D−1})
under the conditional independence assumption, x_1 and x_2 give no extra information about x_3, so p(x_3 | y, x_1, x_2) = p(x_3 | y), and likewise for every other factor
Naive Bayes: objective

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}
maximize the joint likelihood (contrast with logistic regression, which maximizes the conditional likelihood):

  ℓ(w, u) = ∑_n log p_{u,w}(x^(n), y^(n))
          = ∑_n [ log p_u(y^(n)) + log p_w(x^(n) | y^(n)) ]
          = ∑_n log p_u(y^(n)) + ∑_n log p_w(x^(n) | y^(n))
          = ∑_n log p_u(y^(n)) + ∑_d ∑_n log p_{w[d]}(x_d^(n) | y^(n))    (using the Naive Bayes assumption)

the terms decouple, so we get separate MLE estimates for each part
Naive Bayes: train and test

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}

training time:
  learn the prior class probabilities p_u(y)
  learn the likelihood components p_{w[d]}(x_d | y) for all d

test time: find the posterior class probabilities and pick the largest
  argmax_c p(c | x) = argmax_c  p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')
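A schematic numpy version of the test-time rule, assuming the log-prior and per-class log-likelihood functions have already been learned; the names here are illustrative, not from the slides:

import numpy as np

def predict(x, log_prior, log_likelihood_fns):
    # log_prior: array of shape (C,) with log p_u(c)
    # log_likelihood_fns: list of C functions, each mapping x to sum_d log p_{w[d]}(x_d | c)
    log_joint = np.array([log_prior[c] + log_likelihood_fns[c](x)
                          for c in range(len(log_prior))])
    return np.argmax(log_joint)   # the evidence is the same for every class, so it can be dropped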
Class prior

p(c | x) = p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')

binary classification: Bernoulli distribution  p_u(y) = u^y (1 − u)^{1−y}

maximizing the log-likelihood:
  ℓ(u) = ∑_{n=1}^N y^(n) log(u) + (1 − y^(n)) log(1 − u)
       = N_1 log(u) + (N − N_1) log(1 − u)
  where N_1 is the frequency of class 1 in the dataset and N − N_1 the frequency of class 0

setting its derivative to zero:
  d/du ℓ(u) = N_1 / u − (N − N_1) / (1 − u) = 0   ⇒   u* = N_1 / N
the maximum-likelihood estimate (MLE) is the frequency of the class labels
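In code, the binary-class prior MLE is just the label average (a minimal sketch with made-up labels, not from the slides):

import numpy as np

y = np.array([0, 1, 1, 0, 1])   # example binary labels
u = y.mean()                    # u* = N_1 / N, here 3/5 = 0.6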
Class prior

p(c | x) = p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')

multiclass classification: categorical distribution  p_u(y) = ∏_{c=1}^C u_c^{y_c},  assuming one-hot coding for the labels
u = [u_1, …, u_C] is now a parameter vector

maximizing the log-likelihood  ℓ(u) = ∑_n ∑_c y_c^(n) log(u_c)  subject to  ∑_c u_c = 1
closed form for the optimal parameters:  u* = [N_1/N, …, N_C/N]
  where N_c is the number of instances in class c and N is the number of all instances in the dataset
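For integer-coded labels, the categorical-prior MLE is a normalized class count; a minimal numpy sketch with made-up labels (not from the slides):

import numpy as np

y = np.array([0, 2, 1, 1, 2, 2])            # example labels in {0, ..., C-1}
u = np.bincount(y, minlength=3) / len(y)    # u* = [N_1/N, ..., N_C/N] = [1/6, 2/6, 3/6]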
Likelihood terms (class-conditionals)

p(c | x) = p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')

the choice of likelihood distribution depends on the type of features (the likelihood encodes our assumption about the "generative process"):
  Bernoulli: binary features
  Categorical: categorical features
  Gaussian: continuous features
  ...
note that these are different from the choice of distribution for the class prior
each feature x_d may use a different likelihood; see the sketch below

separate maximum-likelihood estimates for each feature:
  w[d]* = argmax_{w[d]} ∑_{n=1}^N log p_{w[d]}(x_d^(n) | y^(n))
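A small sketch of mixing likelihood families across features, as suggested above: feature 0 is binary (Bernoulli) and feature 1 is continuous (Gaussian). The parameter names are illustrative, not from the slides:

import numpy as np

def log_likelihood(x, c, w_bern, mu, sigma):
    # feature 0: binary, Bernoulli with parameter w_bern[c]
    log_p0 = x[0] * np.log(w_bern[c]) + (1 - x[0]) * np.log(1 - w_bern[c])
    # feature 1: continuous, Gaussian with mean mu[c] and std sigma[c]
    log_p1 = -0.5 * np.log(2 * np.pi * sigma[c] ** 2) - (x[1] - mu[c]) ** 2 / (2 * sigma[c] ** 2)
    return log_p0 + log_p1      # log p(x | y = c) under the naive Bayes factorization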
Bernoulli Naive Bayes

binary features: the likelihood is Bernoulli, with one parameter per label for each feature
  p_{w[d]}(x_d | y = 0) = Bernoulli(x_d; w_{[d],0})
  p_{w[d]}(x_d | y = 1) = Bernoulli(x_d; w_{[d],1})
  short form: p_{w[d]}(x_d | y) = Bernoulli(x_d; w_{[d],y})

maximum-likelihood estimation is similar to what we saw for the prior; the closed-form solution of the MLE is
  w*_{[d],c} = N(y = c, x_d = 1) / N(y = c)
where N(·) is the number of training instances satisfying the condition
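A numpy sketch of this counting estimate (illustrative, not from the slides; in practice a small smoothing count is often added to avoid zero probabilities):

import numpy as np

def fit_bernoulli_likelihood(X, y, n_classes):
    # X: (N, D) binary feature matrix, y: (N,) integer labels
    # returns W of shape (D, n_classes) with W[d, c] = N(y = c, x_d = 1) / N(y = c)
    D = X.shape[1]
    W = np.zeros((D, n_classes))
    for c in range(n_classes):
        X_c = X[y == c]                           # training instances of class c
        W[:, c] = X_c.sum(axis=0) / X_c.shape[0]
    return W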
Example: Bernoulli Naive Bayes

using naive Bayes for document classification:
  2 classes (document types)
  600 binary features (a vocabulary of 600 words): x_d^(n) = 1 if word d is present in document n
  w*_{[d],0} and w*_{[d],1} give the likelihood of each word in the two document types

import numpy as np

def BernoulliNaiveBayes(prior,      # vector of size 2 for the class prior
                        likelihood, # 600 x 2: likelihood of each word under each class
                        x,          # vector of size 600: binary features for a new document
                        ):
    # log p(c) + sum_d [ x_d log w_{d,c} + (1 - x_d) log(1 - w_{d,c}) ]
    log_p = np.log(prior) \
            + np.sum(x[:, None] * np.log(likelihood), 0) \
            + np.sum((1 - x[:, None]) * np.log(1 - likelihood), 0)
    log_p -= np.max(log_p)            # numerical stability
    posterior = np.exp(log_p)         # vector of size 2
    posterior /= np.sum(posterior)    # normalize
    return posterior                  # posterior class probabilities
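A hypothetical call with random parameters, just to show the expected shapes (the numbers are made up, not from the slides):

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.5])                          # uniform class prior
likelihood = rng.uniform(0.05, 0.95, size=(600, 2))   # w[d, c] for each word and class
x = rng.integers(0, 2, size=600).astype(float)        # binary features of a new document
print(BernoulliNaiveBayes(prior, likelihood, x))      # two probabilities summing to 1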
Multinomial Naive Bayes

what if we wanted to use word frequencies in document classification?
x_d^(n) is the number of times word d appears in document n

multinomial likelihood:
  p_w(x | c) = ( (∑_d x_d)! / ∏_{d=1}^D x_d! )  ∏_{d=1}^D w_{d,c}^{x_d}
we have a parameter vector of size D for each class (C × D parameters)

MLE estimates:
  w*_{d,c} = ∑_n x_d^(n) y_c^(n)  /  ∑_n ∑_{d'} x_{d'}^(n) y_c^(n)
  numerator: count of word d in all documents labelled with class c
  denominator: total word count in all documents labelled with class c
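A numpy sketch of these count ratios, assuming an integer label vector rather than one-hot coding (illustrative, not from the slides):

import numpy as np

def fit_multinomial_likelihood(X, y, n_classes):
    # X: (N, D) word-count matrix, y: (N,) integer labels
    # W[d, c] = count of word d in class-c documents / total word count in class-c documents
    D = X.shape[1]
    W = np.zeros((D, n_classes))
    for c in range(n_classes):
        counts = X[y == c].sum(axis=0)   # per-word counts over documents of class c
        W[:, c] = counts / counts.sum()  # normalize by the total word count of class c
    return W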
Gaussian Naive Bayes

Gaussian likelihood terms:
  p_{w[d]}(x_d | y) = N(x_d; μ_{d,y}, σ_{d,y}^2) = (1 / √(2π σ_{d,y}^2)) exp( −(x_d − μ_{d,y})^2 / (2 σ_{d,y}^2) )
  w[d] = (μ_{d,1}, σ_{d,1}, …, μ_{d,C}, σ_{d,C}): one mean and one standard-deviation parameter for each class-feature pair

writing the log-likelihood and setting its derivative to zero, we get the maximum-likelihood estimates
  μ_{d,c} = (1/N_c) ∑_{n=1}^N x_d^(n) y_c^(n)
  σ_{d,c}^2 = (1/N_c) ∑_{n=1}^N (x_d^(n) − μ_{d,c})^2 y_c^(n)
i.e. the empirical mean and variance of feature d across the instances with label c
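A numpy sketch of the per-class, per-feature mean and variance estimates (illustrative, not from the slides):

import numpy as np

def fit_gaussian_likelihood(X, y, n_classes):
    # X: (N, D) real-valued features, y: (N,) integer labels
    # returns mu and var, each of shape (n_classes, D)
    D = X.shape[1]
    mu = np.zeros((n_classes, D))
    var = np.zeros((n_classes, D))
    for c in range(n_classes):
        X_c = X[y == c]
        mu[c] = X_c.mean(axis=0)   # empirical mean of each feature for class c
        var[c] = X_c.var(axis=0)   # empirical (maximum-likelihood) variance for class c
    return mu, var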
Example: Gaussian Naive Bayes

classification on the Iris flowers dataset (a classic dataset originally used by Fisher):
  N_c = 50 samples, with D = 4 features, for each of C = 3 species of Iris flower
our setting: 3 classes, 2 features (sepal width, petal length)
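For reference, the same setting can be reproduced with scikit-learn's GaussianNB (scikit-learn is not used in the slides; columns 1 and 2 of the sklearn iris data are sepal width and petal length):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X = iris.data[:, [1, 2]]         # sepal width, petal length
y = iris.target                  # 3 classes, 50 samples each

model = GaussianNB().fit(X, y)   # estimates per-class, per-feature means and variances
print(model.theta_)              # class-conditional means, shape (3, 2)
print(model.score(X, y))         # training accuracy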