Generative Models and Naïve Bayes
Ke Chen
Reading: [14.3, EA], [3.5, KPM], [1.5.4, CMB]
COMP24111 Machine Learning
Outline
• Background and Probability Basics
• Probabilistic Classification Principle
  – Probabilistic discriminative models
  – Generative models and their application to classification
  – MAP and converting generative into discriminative
• Naïve Bayes – a generative model
  – Principle and Algorithms (discrete vs. continuous)
  – Example: Play Tennis
• Zero Conditional Probability and Treatment
• Summary
Background
• There are three methodologies:
  a) Model a classification rule directly
     Examples: k-NN, linear classifier, SVM, neural nets, …
  b) Model the probability of class memberships given input data
     Examples: logistic regression, probabilistic neural nets (softmax), …
  c) Make a probabilistic model of data within each class
     Examples: naïve Bayes, model-based classifiers, …
• An important ML taxonomy for learning models:
  – probabilistic models vs. non-probabilistic models
  – discriminative models vs. generative models
Background
• Based on this taxonomy, we can see the essence of different learning models (classifiers) more clearly:
  – Discriminative, probabilistic: logistic regression, probabilistic neural nets, …
  – Discriminative, non-probabilistic: k-NN, linear classifier, SVM, neural networks, …
  – Generative, probabilistic: naïve Bayes, model-based classifiers (e.g., GMM), …
  – Generative, non-probabilistic: N.A. (?)
Probability Basics
• Prior, conditional and joint probability for random variables
  – Prior probability: P(x)
  – Conditional probability: P(x_1 | x_2), P(x_2 | x_1)
  – Joint probability: x = (x_1, x_2), P(x) = P(x_1, x_2)
  – Relationship: P(x_1, x_2) = P(x_2 | x_1) P(x_1) = P(x_1 | x_2) P(x_2)
  – Independence: P(x_2 | x_1) = P(x_2), P(x_1 | x_2) = P(x_1), P(x_1, x_2) = P(x_1) P(x_2)
• Bayesian Rule
    P(c | x) = P(x | c) P(c) / P(x),  i.e.  Posterior = (Likelihood × Prior) / Evidence
  – The posterior P(c | x) is the discriminative quantity; the likelihood P(x | c) is the generative quantity.
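A quick numeric illustration of the Bayesian rule; the priors and likelihoods below are made-up values chosen only for this example:

```python
# Bayes rule with two classes c1, c2 and a single observation x (illustrative numbers).
prior = {"c1": 0.6, "c2": 0.4}        # P(c)
likelihood = {"c1": 0.2, "c2": 0.5}   # P(x | c)

evidence = sum(likelihood[c] * prior[c] for c in prior)               # P(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}   # P(c | x)
print(posterior)   # {'c1': 0.375, 'c2': 0.625}
```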
Probabilistic Classification Principle
• Establishing a probabilistic model for classification
  – Discriminative model: P(c | x), c = c_1, …, c_L, x = (x_1, …, x_n)
  [Figure: a single discriminative probabilistic classifier takes x = (x_1, x_2, …, x_n) as input and outputs P(c_1 | x), P(c_2 | x), …, P(c_L | x)]
• To train a discriminative classifier, regardless of its probabilistic or non-probabilistic nature, all training examples of the different classes must be used jointly to build up a single discriminative classifier.
• A probabilistic classifier outputs L probabilities, one for each of the L class labels, whereas a non-probabilistic classifier outputs a single label.
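As a concrete illustration of a discriminative probabilistic classifier, the sketch below trains logistic regression on all classes jointly and returns one probability per class label; it assumes scikit-learn is available and uses toy data invented for this example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: examples of all L = 3 classes are used jointly to train ONE classifier.
X = np.array([[0.1, 1.0], [0.3, 0.8], [2.0, 0.2], [1.8, 0.4], [1.0, 2.0], [1.2, 1.9]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = LogisticRegression().fit(X, y)

# One probability per class label: P(c_1 | x), P(c_2 | x), P(c_3 | x).
print(clf.predict_proba([[1.0, 1.0]]))   # three numbers summing to 1 (exact values depend on the fit)
```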
Probabilistic Classification Principle
• Establishing a probabilistic model for classification (cont.)
  – Generative model (must be probabilistic): P(x | c), c = c_1, …, c_L, x = (x_1, …, x_n)
  [Figure: L separate generative probabilistic models, one for each class (Class 1, …, Class L); each takes x = (x_1, x_2, …, x_n) as input, giving the outputs P(x | c_1), …, P(x | c_L)]
• L probabilistic models have to be trained independently
• Each is trained on only the examples of the same label
• Output L probabilities for a given input with the L models
• "Generative" means that such a model can produce data subject to the learned distribution via sampling.
Probabilistic Classification Principle
• Maximum A Posteriori (MAP) classification rule
  – For an input x, find the largest of the L probabilities output by a discriminative probabilistic classifier, P(c_1 | x), …, P(c_L | x)
  – Assign x to the label c* if P(c* | x) is the largest
• Generative classification with the MAP rule
  – Apply the Bayesian rule to convert the L likelihoods into posterior probabilities:
      P(c_i | x) = P(x | c_i) P(c_i) / P(x) ∝ P(x | c_i) P(c_i),  for i = 1, 2, …, L
    (P(x) is a common factor for all L probabilities, so it can be dropped)
  – Then apply the MAP rule to assign a label
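A minimal sketch of the MAP rule for a generative classifier, assuming the class-conditional likelihoods and priors are already available; the function and argument names are placeholders, not part of the lecture:

```python
# MAP decision with a generative model: pick the class maximising P(x | c_i) * P(c_i).
def map_classify(x, classes, likelihood, prior):
    """likelihood(x, c) plays the role of P(x | c_i) and prior[c] of P(c_i);
    the evidence P(x) is a common factor over classes, so it can be dropped."""
    scores = {c: likelihood(x, c) * prior[c] for c in classes}
    return max(scores, key=scores.get)   # the label c* with the largest posterior score
```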
Naïve Bayes
• Bayes classification
    P(c | x) ∝ P(x | c) P(c) = P(x_1, …, x_n | c) P(c),  for c = c_1, …, c_L
  Difficulty: learning the joint probability P(x_1, …, x_n | c) is infeasible!
• Naïve Bayes classification
  – Assume all input features are class-conditionally independent! Applying this independence assumption:
      P(x_1, x_2, …, x_n | c) = P(x_1 | x_2, …, x_n, c) P(x_2, …, x_n | c)
                              = P(x_1 | c) P(x_2, …, x_n | c)
                              = P(x_1 | c) P(x_2 | c) … P(x_n | c)
  – Apply the MAP classification rule: assign x' = (a_1, a_2, …, a_n) to c* if
      [P(a_1 | c*) … P(a_n | c*)] P(c*) > [P(a_1 | c) … P(a_n | c)] P(c),  for all c ≠ c*, c = c_1, …, c_L
    where the two bracketed products are the estimates of P(a_1, …, a_n | c*) and P(a_1, …, a_n | c) respectively.
Naïve Bayes
• Algorithm: Discrete-valued Features
  – Learning Phase: for each target value c_i (c_i = c_1, …, c_L)
      P̂(c_i) ← estimate P(c_i) with examples in S;
      for every feature value x_jk of each feature x_j (j = 1, …, F; k = 1, …, N_j)
        P̂(x_j = x_jk | c_i) ← estimate P(x_jk | c_i) with examples in S;
  – Test Phase: given an unknown instance x' = (a'_1, …, a'_n), assign the label c* if
      [P̂(a'_1 | c*) … P̂(a'_n | c*)] P̂(c*) > [P̂(a'_1 | c_i) … P̂(a'_n | c_i)] P̂(c_i),  for all c_i ≠ c*, c_i = c_1, …, c_L
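The two phases above can be written compactly; the following is a minimal sketch for discrete features using plain frequency counting (no smoothing), not the exact course implementation:

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_tuple, label) pairs from the training set S."""
    n = len(examples)
    class_counts = Counter(label for _, label in examples)
    prior = {c: class_counts[c] / n for c in class_counts}            # estimates of P(c_i)

    value_counts = defaultdict(Counter)          # (feature index j, class c) -> counts of values
    for x, c in examples:
        for j, value in enumerate(x):
            value_counts[(j, c)][value] += 1
    cond = {key: {v: cnt / sum(counts.values()) for v, cnt in counts.items()}
            for key, counts in value_counts.items()}                  # estimates of P(x_j = x_jk | c_i)
    return prior, cond

def predict_nb(x, prior, cond):
    # MAP rule: argmax over c of P^(c) * prod_j P^(a_j | c).
    # A feature value never seen with class c gets probability 0 here -- the
    # zero conditional probability problem listed in the outline.
    scores = {}
    for c in prior:
        score = prior[c]
        for j, value in enumerate(x):
            score *= cond.get((j, c), {}).get(value, 0.0)
        scores[c] = score
    return max(scores, key=scores.get)
```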
Example
• Example: Play Tennis
  [Figure: the Play Tennis training data — 14 examples described by Outlook, Temperature, Humidity and Wind, each labelled Play=Yes or Play=No]
Example
• Learning Phase

  Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
  Sunny      2/9       3/5          Hot          2/9       2/5
  Overcast   4/9       0/5          Mild         4/9       2/5
  Rain       3/9       2/5          Cool         3/9       1/5

  Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
  High       3/9       4/5          Strong       3/9       3/5
  Normal     6/9       1/5          Weak         6/9       2/5

  P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example
• Test Phase
  – Given a new instance, predict its label:
      x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  – Look up the tables obtained in the learning phase:
      P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
      P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
      P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
      P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
      P(Play=Yes) = 9/14                       P(Play=No) = 5/14
  – Decision making with the MAP rule:
      P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
      P(No | x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
    Since P(Yes | x') < P(No | x'), we label x' as "No".
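The decision above can be reproduced directly from the looked-up table entries; a small check that just multiplies the probabilities shown:

```python
# Unnormalised posterior scores for x' = (Sunny, Cool, High, Strong).
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
print("Yes" if p_yes > p_no else "No")           # prints "No"
```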
Naïve Bayes
• Algorithm: Continuous-valued Features
  – A continuous-valued feature can take innumerable values, so look-up tables of value counts no longer apply
  – The conditional probability is often modelled with the normal (Gaussian) distribution:
      P̂(x_j | c_i) = 1 / (√(2π) σ_ji) · exp( −(x_j − μ_ji)² / (2 σ_ji²) )
      μ_ji : mean (average) of the values of feature x_j over the examples for which c = c_i
      σ_ji : standard deviation of the values of feature x_j over the examples for which c = c_i
  – Learning Phase: for X = (X_1, …, X_F), C = c_1, …, c_L
    Output: F × L normal distributions and P(C = c_i), i = 1, …, L
  – Test Phase: given an unknown instance X' = (a'_1, …, a'_n)
    • Instead of looking up tables, calculate the conditional probabilities using the normal distributions obtained in the learning phase
    • Apply the MAP rule to assign a label (the same as in the discrete case)
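A minimal sketch of the continuous-valued case, fitting one normal distribution per (feature, class) pair and applying the MAP rule; it assumes NumPy, non-zero standard deviations, and few features (a real implementation would work with log-probabilities to avoid underflow):

```python
import numpy as np

def train_gaussian_nb(X, y):
    """X: (N, F) array of continuous features, y: (N,) array of class labels."""
    classes = np.unique(y)
    prior = {c: float(np.mean(y == c)) for c in classes}              # P(C = c_i)
    params = {c: (X[y == c].mean(axis=0), X[y == c].std(axis=0))      # (mu_ji, sigma_ji) per feature
              for c in classes}
    return prior, params

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def predict_gaussian_nb(x, prior, params):
    # MAP rule: argmax over c of P(c) * prod_j N(x_j; mu_jc, sigma_jc)
    scores = {c: prior[c] * float(np.prod(gaussian_pdf(x, mu, sigma)))
              for c, (mu, sigma) in params.items()}
    return max(scores, key=scores.get)
```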