Applied Machine Learning
Naive Bayes
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
  generative vs. discriminative classifiers
  the Naive Bayes classifier assumption
  different design choices
Discriminative vs. generative classification

discriminative: so far we modeled the conditional distribution p(y | x) directly
generative: learn the joint distribution p(y, x) = p(y) p(x | y)
  p(y): prior class probability, the frequency of observing each label
  p(x | y): likelihood of the input features given the class label (the input features for each label come from a different distribution)

how to classify a new input x? Bayes rule:
  p(y = c | x) = p(c) p(x | c) / p(x)
  p(y = c | x): posterior probability of class c
  p(x) = ∑_{c'=1}^C p(x, c'): marginal probability of the input (the evidence)

[figure: class-conditional densities p(x | y = 0) and p(x | y = 1); image: https://rpsychologist.com]
Example: Bayes rule for classification

does the patient have cancer? y ∈ {yes, no}
x ∈ {−, +}: the test result, a single binary feature

prior: 1% of the population has cancer, p(yes) = .01
likelihood: p(+ | yes) = .9, the true-positive rate of the test (90%)
            p(+ | no) = .05, the false-positive rate of the test (5%)

Bayes rule: p(c | x) = p(c) p(x | c) / p(x)
evidence: p(+) = p(yes) p(+ | yes) + p(no) p(+ | no) = .01 × .9 + .99 × .05 = .0585
posterior: p(yes | +) = .01 × .9 / .0585 ≈ .15

in a generative classifier the likelihood and the prior class probabilities are learned from data
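As a sanity check, the computation above can be reproduced in a few lines of Python (a minimal sketch, not part of the slides):

# Bayes rule for the cancer-test example
prior_yes = 0.01                 # p(yes)
tp_rate = 0.9                    # p(+ | yes), true-positive rate
fp_rate = 0.05                   # p(+ | no), false-positive rate

evidence = prior_yes * tp_rate + (1 - prior_yes) * fp_rate   # p(+) = 0.0585
posterior_yes = prior_yes * tp_rate / evidence               # p(yes | +) ≈ 0.154
print(evidence, posterior_yes)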
Generative classification

p(y = c | x) = p(c) p(x | c) / p(x)
  p(c): prior class probability, the frequency of observing each label
  p(x | c): likelihood of the input features given the class label (the input features for each label come from a different distribution)
  p(y = c | x): posterior probability of class c
  p(x) = ∑_{c'=1}^C p(x, c'): marginal probability of the input (the evidence)

some generative classifiers:
  Gaussian Discriminant Analysis: the likelihood is a multivariate Gaussian
  Naive Bayes: decomposed likelihood

[image: https://rpsychologist.com]
Naive Bayes: model

assumption about the likelihood:  p(x | y) = ∏_{d=1}^D p(x_d | y),  where D is the number of input features

when is this assumption correct? when the features are conditionally independent given the label, x_i ⊥ x_j | y:
knowing the label, the value of one input feature gives us no information about the other input features

chain rule of probability (true for any distribution):
  p(x | y) = p(x_1 | y) p(x_2 | y, x_1) p(x_3 | y, x_1, x_2) ⋯ p(x_D | y, x_1, …, x_{D−1})
under the conditional independence assumption, x_1 and x_2 give no extra information about x_3, so p(x_3 | y, x_1, x_2) = p(x_3 | y), and likewise for every other factor
Naive Bayes: objective

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}
maximize the joint likelihood (contrast with logistic regression, which maximizes the conditional likelihood):

  ℓ(w, u) = ∑_n log p_{u,w}(x^(n), y^(n))
          = ∑_n [ log p_u(y^(n)) + log p_w(x^(n) | y^(n)) ]
          = ∑_n log p_u(y^(n)) + ∑_n log p_w(x^(n) | y^(n))
          = ∑_n log p_u(y^(n)) + ∑_d ∑_n log p_{w[d]}(x_d^(n) | y^(n))    (using the Naive Bayes assumption)

the terms decouple, so we get separate MLE estimates for each part
Naive Bayes: train and test

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}

training time:
  learn the prior class probabilities p_u(y)
  learn the likelihood components p_{w[d]}(x_d | y) for all d

test time: find the posterior class probabilities and pick the largest
  argmax_c p(c | x) = argmax_c  p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')
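A schematic numpy version of the test-time rule, assuming the log-prior and per-class log-likelihood functions have already been learned; the names here are illustrative, not from the slides:

import numpy as np

def predict(x, log_prior, log_likelihood_fns):
    # log_prior: array of shape (C,) with log p_u(c)
    # log_likelihood_fns: list of C functions, each mapping x to sum_d log p_{w[d]}(x_d | c)
    log_joint = np.array([log_prior[c] + log_likelihood_fns[c](x)
                          for c in range(len(log_prior))])
    return np.argmax(log_joint)   # the evidence is the same for every class, so it can be dropped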
Class prior

p(c | x) = p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')

binary classification: Bernoulli distribution  p_u(y) = u^y (1 − u)^{1−y}

maximizing the log-likelihood:
  ℓ(u) = ∑_{n=1}^N y^(n) log(u) + (1 − y^(n)) log(1 − u)
       = N_1 log(u) + (N − N_1) log(1 − u)
  where N_1 is the frequency of class 1 in the dataset and N − N_1 the frequency of class 0

setting its derivative to zero:
  d/du ℓ(u) = N_1 / u − (N − N_1) / (1 − u) = 0   ⇒   u* = N_1 / N
the maximum-likelihood estimate (MLE) is the frequency of the class labels
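In code, the binary-class prior MLE is just the label average (a minimal sketch with made-up labels, not from the slides):

import numpy as np

y = np.array([0, 1, 1, 0, 1])   # example binary labels
u = y.mean()                    # u* = N_1 / N, here 3/5 = 0.6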
Class prior

p(c | x) = p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')

multiclass classification: categorical distribution  p_u(y) = ∏_{c=1}^C u_c^{y_c},  assuming one-hot coding for the labels
u = [u_1, …, u_C] is now a parameter vector

maximizing the log-likelihood  ℓ(u) = ∑_n ∑_c y_c^(n) log(u_c)  subject to  ∑_c u_c = 1
closed form for the optimal parameters:  u* = [N_1/N, …, N_C/N]
  where N_c is the number of instances in class c and N is the number of all instances in the dataset
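For integer-coded labels, the categorical-prior MLE is a normalized class count; a minimal numpy sketch with made-up labels (not from the slides):

import numpy as np

y = np.array([0, 2, 1, 1, 2, 2])            # example labels in {0, ..., C-1}
u = np.bincount(y, minlength=3) / len(y)    # u* = [N_1/N, ..., N_C/N] = [1/6, 2/6, 3/6]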
Likelihood terms (class-conditionals)

p(c | x) = p_u(c) ∏_{d=1}^D p_{w[d]}(x_d | c)  /  ∑_{c'=1}^C p_u(c') ∏_{d=1}^D p_{w[d]}(x_d | c')

the choice of likelihood distribution depends on the type of features (the likelihood encodes our assumption about the "generative process"):
  Bernoulli: binary features
  Categorical: categorical features
  Gaussian: continuous features
  ...
note that these are different from the choice of distribution for the class prior
each feature x_d may use a different likelihood; see the sketch below

separate maximum-likelihood estimates for each feature:
  w[d]* = argmax_{w[d]} ∑_{n=1}^N log p_{w[d]}(x_d^(n) | y^(n))
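A small sketch of mixing likelihood families across features, as suggested above: feature 0 is binary (Bernoulli) and feature 1 is continuous (Gaussian). The parameter names are illustrative, not from the slides:

import numpy as np

def log_likelihood(x, c, w_bern, mu, sigma):
    # feature 0: binary, Bernoulli with parameter w_bern[c]
    log_p0 = x[0] * np.log(w_bern[c]) + (1 - x[0]) * np.log(1 - w_bern[c])
    # feature 1: continuous, Gaussian with mean mu[c] and std sigma[c]
    log_p1 = -0.5 * np.log(2 * np.pi * sigma[c] ** 2) - (x[1] - mu[c]) ** 2 / (2 * sigma[c] ** 2)
    return log_p0 + log_p1      # log p(x | y = c) under the naive Bayes factorization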
Bernoulli Naive Bayes

binary features: the likelihood is Bernoulli, with one parameter per label for each feature
  p_{w[d]}(x_d | y = 0) = Bernoulli(x_d; w_{[d],0})
  p_{w[d]}(x_d | y = 1) = Bernoulli(x_d; w_{[d],1})
  short form: p_{w[d]}(x_d | y) = Bernoulli(x_d; w_{[d],y})

maximum-likelihood estimation is similar to what we saw for the prior; the closed-form solution of the MLE is
  w*_{[d],c} = N(y = c, x_d = 1) / N(y = c)
where N(·) is the number of training instances satisfying the condition
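A numpy sketch of this counting estimate (illustrative, not from the slides; in practice a small smoothing count is often added to avoid zero probabilities):

import numpy as np

def fit_bernoulli_likelihood(X, y, n_classes):
    # X: (N, D) binary feature matrix, y: (N,) integer labels
    # returns W of shape (D, n_classes) with W[d, c] = N(y = c, x_d = 1) / N(y = c)
    D = X.shape[1]
    W = np.zeros((D, n_classes))
    for c in range(n_classes):
        X_c = X[y == c]                           # training instances of class c
        W[:, c] = X_c.sum(axis=0) / X_c.shape[0]
    return W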
Example: Bernoulli Naive Bayes

using naive Bayes for document classification:
  2 classes (document types)
  600 binary features (a vocabulary of 600 words): x_d^(n) = 1 if word d is present in document n
  w*_{[d],0} and w*_{[d],1} give the likelihood of each word in the two document types

import numpy as np

def BernoulliNaiveBayes(prior,      # vector of size 2 for the class prior
                        likelihood, # 600 x 2: likelihood of each word under each class
                        x,          # vector of size 600: binary features for a new document
                        ):
    # log p(c) + sum_d [ x_d log w_{d,c} + (1 - x_d) log(1 - w_{d,c}) ]
    log_p = np.log(prior) \
            + np.sum(x[:, None] * np.log(likelihood), 0) \
            + np.sum((1 - x[:, None]) * np.log(1 - likelihood), 0)
    log_p -= np.max(log_p)            # numerical stability
    posterior = np.exp(log_p)         # vector of size 2
    posterior /= np.sum(posterior)    # normalize
    return posterior                  # posterior class probabilities
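A hypothetical call with random parameters, just to show the expected shapes (the numbers are made up, not from the slides):

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.5])                          # uniform class prior
likelihood = rng.uniform(0.05, 0.95, size=(600, 2))   # w[d, c] for each word and class
x = rng.integers(0, 2, size=600).astype(float)        # binary features of a new document
print(BernoulliNaiveBayes(prior, likelihood, x))      # two probabilities summing to 1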
Multinomial Naive Bayes

what if we wanted to use word frequencies in document classification?
x_d^(n) is the number of times word d appears in document n

multinomial likelihood:
  p_w(x | c) = ( (∑_d x_d)! / ∏_{d=1}^D x_d! )  ∏_{d=1}^D w_{d,c}^{x_d}
we have a parameter vector of size D for each class (C × D parameters)

MLE estimates:
  w*_{d,c} = ∑_n x_d^(n) y_c^(n)  /  ∑_n ∑_{d'} x_{d'}^(n) y_c^(n)
  numerator: count of word d in all documents labelled with class c
  denominator: total word count in all documents labelled with class c
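A numpy sketch of these count ratios, assuming an integer label vector rather than one-hot coding (illustrative, not from the slides):

import numpy as np

def fit_multinomial_likelihood(X, y, n_classes):
    # X: (N, D) word-count matrix, y: (N,) integer labels
    # W[d, c] = count of word d in class-c documents / total word count in class-c documents
    D = X.shape[1]
    W = np.zeros((D, n_classes))
    for c in range(n_classes):
        counts = X[y == c].sum(axis=0)   # per-word counts over documents of class c
        W[:, c] = counts / counts.sum()  # normalize by the total word count of class c
    return W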
Gaussian Naive Bayes

Gaussian likelihood terms:
  p_{w[d]}(x_d | y) = N(x_d; μ_{d,y}, σ_{d,y}^2) = (1 / √(2π σ_{d,y}^2)) exp( −(x_d − μ_{d,y})^2 / (2 σ_{d,y}^2) )
  w[d] = (μ_{d,1}, σ_{d,1}, …, μ_{d,C}, σ_{d,C}): one mean and one standard-deviation parameter for each class-feature pair

writing the log-likelihood and setting its derivative to zero, we get the maximum-likelihood estimates
  μ_{d,c} = (1/N_c) ∑_{n=1}^N x_d^(n) y_c^(n)
  σ_{d,c}^2 = (1/N_c) ∑_{n=1}^N (x_d^(n) − μ_{d,c})^2 y_c^(n)
i.e. the empirical mean and variance of feature d across the instances with label c
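A numpy sketch of the per-class, per-feature mean and variance estimates (illustrative, not from the slides):

import numpy as np

def fit_gaussian_likelihood(X, y, n_classes):
    # X: (N, D) real-valued features, y: (N,) integer labels
    # returns mu and var, each of shape (n_classes, D)
    D = X.shape[1]
    mu = np.zeros((n_classes, D))
    var = np.zeros((n_classes, D))
    for c in range(n_classes):
        X_c = X[y == c]
        mu[c] = X_c.mean(axis=0)   # empirical mean of each feature for class c
        var[c] = X_c.var(axis=0)   # empirical (maximum-likelihood) variance for class c
    return mu, var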
Example: Gaussian Naive Bayes

classification on the Iris flowers dataset (a classic dataset originally used by Fisher):
  N_c = 50 samples, with D = 4 features, for each of C = 3 species of Iris flower
our setting: 3 classes, 2 features (sepal width, petal length)
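For reference, the same setting can be reproduced with scikit-learn's GaussianNB (scikit-learn is not used in the slides; columns 1 and 2 of the sklearn iris data are sepal width and petal length):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X = iris.data[:, [1, 2]]         # sepal width, petal length
y = iris.target                  # 3 classes, 50 samples each

model = GaussianNB().fit(X, y)   # estimates per-class, per-feature means and variances
print(model.theta_)              # class-conditional means, shape (3, 2)
print(model.score(X, y))         # training accuracy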