Applied Machine Learning
Naive Bayes
Siamak Ravanbakhsh
COMP 551 (winter 2020)

Learning objectives
- generative vs. discriminative classifier
- Naive Bayes classifier
  - assumption
  - different design choices

Discriminative vs generative classification

discriminative: so far we modeled the conditional distribution $p(y \mid x)$
generative: learn the joint distribution $p(y, x) = p(y)\, p(x \mid y)$

how to classify new input x? Bayes rule:

$$p(y = c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}$$

- $p(c)$: prior class probability, the frequency of observing this label
- $p(x \mid c)$: likelihood of input features given the class label (input features for each label come from a different distribution)
- $p(x) = \sum_{c'=1}^{C} p(x, c')$: marginal probability of the input (evidence)
- $p(y = c \mid x)$: posterior probability of a given class

[figure: class-conditional densities $p(x \mid y = 0)$ and $p(x \mid y = 1)$ plotted over $x$; image: https://rpsychologist.com]
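
To make the classification step concrete, here is a minimal sketch of Bayes rule with two classes and a single real-valued feature. It assumes 1-D Gaussian class-conditional densities, which the slide does not specify; the priors, means, and standard deviations are hypothetical values chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, priors, params):
    """Return p(y=c | x) for every class c using Bayes rule."""
    # joint p(x, c) = p(c) * p(x | c) for each class
    joint = [p * gaussian_pdf(x, m, s) for p, (m, s) in zip(priors, params)]
    # evidence p(x) = sum over classes of p(x, c')
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical values (assumptions, not from the slides)
priors = [0.5, 0.5]
params = [(0.0, 1.0), (2.0, 1.0)]  # (mean, std) of p(x|y=0) and p(x|y=1)

post = posterior(3.0, priors, params)
# predict the class with the largest posterior probability
predicted = max(range(len(post)), key=lambda c: post[c])
```

Because the evidence is the same for every class, the predicted label can also be found by comparing the joint values directly; dividing by the evidence only matters when calibrated probabilities are needed.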

Example: Bayes rule for classification

patient having cancer? $y \in \{\text{yes}, \text{no}\}$
test results, a single binary feature: $x \in \{-, +\}$

$$p(c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}$$

- prior: 1% of population has cancer, $p(\text{yes}) = .01$
- likelihood: TP rate of the test (90%), $p(+ \mid \text{yes}) = .9$; FP rate of the test (5%), $p(+ \mid \text{no}) = .05$
- evidence: $p(+) = p(\text{yes})\, p(+ \mid \text{yes}) + p(\text{no})\, p(+ \mid \text{no}) = .01 \times .9 + .99 \times .05 = .0585$
- posterior: $p(\text{yes} \mid +) = \frac{.01 \times .9}{.0585} \approx .15$

in a generative classifier, likelihood & prior class probabilities are learned from data
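
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch using the prior (1%), TP rate (90%), and FP rate (5%) stated on the slide; the variable names are illustrative.

```python
# quantities stated in the example
prior_yes = 0.01   # p(yes)
tp_rate = 0.9      # p(+ | yes)
fp_rate = 0.05     # p(+ | no)

# evidence: p(+) = p(yes) p(+|yes) + p(no) p(+|no)
evidence = prior_yes * tp_rate + (1.0 - prior_yes) * fp_rate

# posterior: p(yes | +) = p(yes) p(+|yes) / p(+)
posterior_yes = prior_yes * tp_rate / evidence

print(round(evidence, 4), round(posterior_yes, 3))  # prints: 0.0585 0.154
```

Even with a positive result from a 90% sensitive test, the posterior stays well below one half because the prior is so small.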

Generative classification

$$p(y = c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}$$

- $p(c)$: prior class probability, the frequency of observing this label
- $p(x \mid c)$: likelihood of input features given the class label (input features for each label come from a different distribution)
- $p(x) = \sum_{c'=1}^{C} p(x, c')$: marginal probability of the input (evidence)
- $p(y = c \mid x)$: posterior probability of a given class

Some generative classifiers:
- Gaussian Discriminant Analysis: the likelihood is multivariate Gaussian
- Naive Bayes: decomposed likelihood

image: https://rpsychologist.com

Naive Bayes: model

assumption about the likelihood, where $D$ is the number of input features:

$$p(x \mid y) = \prod_{d=1}^{D} p(x_d \mid y)$$

when is this assumption correct?
when features are conditionally independent given the label: $x_i \perp x_j \mid y$
knowing the label, the value of one input feature gives us no information about the other input features

chain rule of probability (true for any distribution):

$$p(x \mid y) = p(x_1 \mid y)\, p(x_2 \mid y, x_1)\, p(x_3 \mid y, x_1, x_2) \cdots p(x_D \mid y, x_1, \ldots, x_{D-1})$$

conditional independence assumption: $x_1, x_2$ give no extra information, so

$$p(x_3 \mid y, x_1, x_2) = p(x_3 \mid y)$$
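
As an illustration of the factorized likelihood, the sketch below evaluates $\log p(x \mid y = c) = \sum_d \log p(x_d \mid y = c)$ for binary features. The Bernoulli form of each per-feature term is an assumption made here for concreteness (one of several possible design choices), and the parameter values in `theta` are hypothetical.

```python
import math

def log_likelihood(x, theta_c):
    """log p(x | y=c) under the Naive Bayes assumption.

    x       : list of binary features (0 or 1)
    theta_c : list where theta_c[d] = p(x_d = 1 | y = c)
    """
    total = 0.0
    for x_d, t in zip(x, theta_c):
        # Bernoulli term: p(x_d | y=c) is t if x_d == 1, else 1 - t
        total += math.log(t if x_d == 1 else 1.0 - t)
    return total

# Hypothetical per-class parameters for D = 3 binary features
theta = {0: [0.2, 0.7, 0.5], 1: [0.9, 0.4, 0.5]}

x = [1, 0, 1]
ll0 = log_likelihood(x, theta[0])  # log(0.2 * 0.3 * 0.5)
ll1 = log_likelihood(x, theta[1])  # log(0.9 * 0.6 * 0.5)
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when $D$ is large.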

Naive Bayes: objective

given the training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$

maximize the joint likelihood (contrast with logistic regression):

$$\ell(w, u) = \sum_n \log p_{u,w}(x^{(n)}, y^{(n)})$$

$$= \sum_n \left[ \log p_u(y^{(n)}) + \log p_w(x^{(n)} \mid y^{(n)}) \right]$$

$$= \sum_n \log p_u(y^{(n)}) + \sum_n \log p_w(x^{(n)} \mid y^{(n)})$$
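
The decomposition of the objective into a prior term and a likelihood term can be sketched as follows. The code assumes binary features with Bernoulli per-feature likelihoods and estimates the class prior by label frequency; the toy dataset, the `theta` values, and the function names are hypothetical.

```python
import math
from collections import Counter

def class_priors(y):
    """Estimate p(y=c) as the frequency of label c in the training set."""
    counts = Counter(y)
    return {c: counts[c] / len(y) for c in counts}

def objective_terms(X, y, prior, theta):
    """Return (sum_n log p_u(y_n), sum_n log p_w(x_n | y_n))."""
    prior_term = sum(math.log(prior[c]) for c in y)
    lik_term = 0.0
    for x, c in zip(X, y):
        for x_d, t in zip(x, theta[c]):
            # Bernoulli per-feature likelihood (an assumption of this sketch)
            lik_term += math.log(t if x_d == 1 else 1.0 - t)
    return prior_term, lik_term

# Toy dataset and hypothetical likelihood parameters
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
theta = {0: [0.25, 0.5], 1: [0.75, 0.5]}

prior = class_priors(y)
prior_term, lik_term = objective_terms(X, y, prior, theta)
joint_log_likelihood = prior_term + lik_term
```

Since the two sums depend on disjoint parameter sets ($u$ for the prior, $w$ for the likelihood), each can be maximized separately.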