Data Mining 2013
Bayesian Network Classifiers

Ad Feelders
Universiteit Utrecht
October 24, 2013
Literature

N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning, 29, pp. 131-163 (1997) (except Section 6).
Bayesian Network Classifiers

Bayesian Networks are models of the joint distribution of a collection of random variables. The joint distribution is simplified by introducing independence assumptions. In many applications we are in fact interested in the conditional distribution of one variable (the class variable) given the other variables (the attributes). Can we use Bayesian Networks as classifiers?
The Naive Bayes Classifier

[Figure: DAG with the class node C as the common parent of the attribute nodes A_1, A_2, ..., A_k.]

This Bayesian Network is equivalent to its undirected version (why?):

[Figure: the same graph with undirected edges between C and each of A_1, A_2, ..., A_k.]

Attributes are independent given the class label.
The Naive Bayes Classifier

BN factorisation:

P(\mathbf{X}) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})

So the factorisation corresponding to the NB classifier is:

P(C, A_1, \ldots, A_k) = P(C)\, P(A_1 \mid C) \cdots P(A_k \mid C)
Naive Bayes assumption

Via Bayes rule we have

P(C = i \mid \mathbf{A}) = \frac{P(A_1, A_2, \ldots, A_k, C = i)}{P(A_1, A_2, \ldots, A_k)}   (product rule)

= \frac{P(A_1, A_2, \ldots, A_k \mid C = i)\, P(C = i)}{\sum_{j=1}^{c} P(A_1, A_2, \ldots, A_k \mid C = j)\, P(C = j)}   (product rule and sum rule)

= \frac{P(A_1 \mid C = i)\, P(A_2 \mid C = i) \cdots P(A_k \mid C = i)\, P(C = i)}{\sum_{j=1}^{c} P(A_1 \mid C = j)\, P(A_2 \mid C = j) \cdots P(A_k \mid C = j)\, P(C = j)}   (NB factorisation)
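A minimal sketch of how this posterior computation could be coded; the function name, the parameter layout and the example numbers are illustrative assumptions, not part of the slides:

```python
import numpy as np

def naive_bayes_posterior(prior, cond, a):
    """Posterior P(C = i | A = a) under the naive Bayes factorisation.

    prior : array of shape (c,), class priors P(C = i)
    cond  : list of k arrays, cond[i][c, v] = P(A_i = v | C = c)
    a     : tuple of observed attribute values (a_1, ..., a_k)
    """
    # Numerator of Bayes rule: P(C = i) * prod_i P(a_i | C = i)
    joint = prior.copy()
    for i, value in enumerate(a):
        joint *= cond[i][:, value]
    # Normalise by the sum over classes (the sum rule in the denominator)
    return joint / joint.sum()

# Illustrative numbers: two classes, two binary attributes
prior = np.array([0.4, 0.6])
cond = [np.array([[0.7, 0.3],    # P(A_1 = v | C = 0) for v = 0, 1
                  [0.2, 0.8]]),  # P(A_1 = v | C = 1)
        np.array([[0.6, 0.4],
                  [0.5, 0.5]])]
print(naive_bayes_posterior(prior, cond, a=(1, 0)))  # ≈ [0.231 0.769]
```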
Why Naive Bayes is competitive

The conditional independence assumption is often clearly inappropriate, yet the predictive accuracy of Naive Bayes is competitive with more complex classifiers. How come?

- The probability estimates of Naive Bayes may be way off, but this does not necessarily result in wrong classifications.
- Naive Bayes has only a few parameters compared to more complex models, so it can estimate them more reliably.
Naive Bayes: Example

P(C = 0) = 0.4, P(C = 1) = 0.6

Class-conditional joint distributions P(A_1, A_2 | C):

C = 0:
            A_2 = 0   A_2 = 1   P(A_1)
  A_1 = 0     0.2       0.1      0.3
  A_1 = 1     0.1       0.6      0.7
  P(A_2)      0.3       0.7      1

C = 1:
            A_2 = 0   A_2 = 1   P(A_1)
  A_1 = 0     0.5       0.2      0.7
  A_1 = 1     0.1       0.2      0.3
  P(A_2)      0.6       0.4      1

We have that

P(C = 1 | A_1 = 0, A_2 = 0) = (0.5 × 0.6) / (0.5 × 0.6 + 0.2 × 0.4) = 0.79

According to naive Bayes

P(C = 1 | A_1 = 0, A_2 = 0) = (0.7 × 0.6 × 0.6) / (0.7 × 0.6 × 0.6 + 0.3 × 0.3 × 0.4) = 0.88

Naive Bayes assigns to the right class.
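A quick check of these numbers in code (a sketch; the dictionary layout is just one possible encoding of the two tables above):

```python
# Class-conditional joint tables: joint[c][a1][a2] = P(A1 = a1, A2 = a2 | C = c)
joint = {
    0: [[0.2, 0.1],
        [0.1, 0.6]],
    1: [[0.5, 0.2],
        [0.1, 0.2]],
}
prior = {0: 0.4, 1: 0.6}

# True posterior uses the full within-class joint P(A1 = 0, A2 = 0 | C)
num_true = joint[1][0][0] * prior[1]              # 0.5 * 0.6
den_true = num_true + joint[0][0][0] * prior[0]   # + 0.2 * 0.4
print(num_true / den_true)                        # 0.789...

# Naive Bayes uses the within-class marginals P(A1 = 0 | C) and P(A2 = 0 | C)
pA1_0 = {c: sum(joint[c][0]) for c in (0, 1)}                 # 0.3, 0.7
pA2_0 = {c: joint[c][0][0] + joint[c][1][0] for c in (0, 1)}  # 0.3, 0.6
num_nb = pA1_0[1] * pA2_0[1] * prior[1]           # 0.7 * 0.6 * 0.6
den_nb = num_nb + pA1_0[0] * pA2_0[0] * prior[0]  # + 0.3 * 0.3 * 0.4
print(num_nb / den_nb)                            # 0.875
```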
What about this model?

[Figure: DAG with the attributes A_1, A_2, ..., A_k as parents of the class node C.]

BN factorisation:

P(\mathbf{X}) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})

So the factorisation is:

P(C, A_1, \ldots, A_k) = P(C \mid A_1, \ldots, A_k)\, P(A_1) \cdots P(A_k)
Bayesian Networks as Classifiers

[Figure: a DAG over the class node C and attributes A_1, ..., A_8; the Markov blanket of C is highlighted.]

Markov Blanket: Parents, Children and Parents of Children.
Markov Blanket of C: Moral Graph

[Figure: the moral graph of the same DAG, with the Markov blanket of C marked.]

Markov Blanket: Parents, Children and Parents of Children.

Local Markov property: C ⊥⊥ rest | boundary(C)
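A small sketch of how a Markov blanket could be computed from a DAG stored as a child-to-parents mapping; the example graph below is hypothetical and not necessarily the graph drawn on the slide:

```python
def markov_blanket(node, parents):
    """Markov blanket of `node` in a DAG given as a dict: child -> set of parents.

    The blanket consists of the node's parents, its children, and the
    other parents of its children (its "spouses").
    """
    blanket = set(parents.get(node, set()))        # parents
    for child, child_parents in parents.items():
        if node in child_parents:                  # child of `node`
            blanket.add(child)
            blanket |= child_parents - {node}      # co-parents of that child
    return blanket

# Hypothetical DAG loosely in the spirit of the slide's picture
dag = {
    "C":  {"A1", "A2"},
    "A4": {"C", "A3"},
    "A5": {"C"},
    "A6": {"A4"},
    "A7": {"A5"},
    "A8": {"A5"},
}
print(sorted(markov_blanket("C", dag)))  # ['A1', 'A2', 'A3', 'A4', 'A5']
```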
Bayesian Networks as Classifiers

Loglikelihood under model M is

L(M \mid D) = \sum_{j=1}^{n} \log P_M(\mathbf{X}^{(j)})

where \mathbf{X}^{(j)} = (A_1^{(j)}, A_2^{(j)}, \ldots, A_k^{(j)}, C^{(j)}). We can rewrite this as

L(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid \mathbf{A}^{(j)}) + \sum_{j=1}^{n} \log P_M(\mathbf{A}^{(j)})

If there are many attributes, the second term will dominate the loglikelihood score. But we are not interested in modeling the distribution of the attributes!
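A toy illustration of this decomposition under a naive Bayes model; the parameters and the record are randomly generated, purely to show how the attribute term grows with the number of attributes:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 20                                      # number of binary attributes
prior = np.array([0.4, 0.6])                # P(C = 0), P(C = 1)
cond = rng.uniform(0.1, 0.9, size=(2, k))   # cond[c, i] = P(A_i = 1 | C = c)

def log_joint(a, c):
    """log P(C = c, A = a) under the naive Bayes factorisation."""
    p = np.where(a == 1, cond[c], 1 - cond[c])
    return np.log(prior[c]) + np.log(p).sum()

# Decompose log P(c, a) = log P(c | a) + log P(a) for one sampled record
a = rng.integers(0, 2, size=k)
c = 1
log_pa = np.logaddexp(log_joint(a, 0), log_joint(a, 1))  # log P(a), summed over classes
log_pc_given_a = log_joint(a, c) - log_pa                # log P(c | a)
print(log_pc_given_a, log_pa)  # with many attributes, |log P(a)| dwarfs |log P(c | a)|
```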
Bayesian Networks as Classifiers

[Figure: plot of −log P(x) against P(x); the log-loss grows without bound as P(x) approaches 0.]
Datasets used in the experiments (CV-5 = 5-fold cross-validation):

 #  Dataset        Attributes  Classes  Train  Test
 1  australian         14         2       690  CV-5
 2  breast             10         2       683  CV-5
 3  chess              36         2      2130  1066
 4  cleve              13         2       296  CV-5
 5  corral              6         2       128  CV-5
 6  crx                15         2       653  CV-5
 7  diabetes            8         2       768  CV-5
 8  flare              10         2      1066  CV-5
 9  german             20         2      1000  CV-5
10  glass               9         7       214  CV-5
11  glass2              9         2       163  CV-5
12  heart              13         2       270  CV-5
13  hepatitis          19         2        80  CV-5
14  iris                4         3       150  CV-5
15  letter             16        26     15000  5000
16  lymphography       18         4       148  CV-5
17  mofn-3-7-10        10         2       300  1024
18  pima                8         2       768  CV-5
19  satimage           36         6      4435  2000
20  segment            19         7      1540   770
21  shuttle-small       9         7      3866  1934
22  soybean-large      35        19       562  CV-5
23  vehicle            18         4       846  CV-5
24  vote               16         2       435  CV-5
25  waveform-21        21         3       300  4700
Naive Bayes vs. Unrestricted BN

[Figure: percentage classification error (roughly 0 to 45 percent) of the unrestricted Bayesian Network and of Naive Bayes on each of the 25 data sets.]
Use Conditional Log-likelihood?

Discriminative vs. Generative learning. Conditional loglikelihood function:

CL(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A_1^{(j)}, \ldots, A_k^{(j)})

No closed form solution for the ML estimates.

Remark: can be done via Logistic Regression for models with perfect graphs (Naive Bayes, TANs).
NB and Logistic Regression

The logistic regression assumption is

\log \frac{P(C = 1 \mid \mathbf{A})}{P(C = 0 \mid \mathbf{A})} = \alpha + \sum_{i=1}^{k} \beta_i A_i,

that is, the log odds is a linear function of the attributes. Under the naive Bayes assumption, this is exactly true.

Assign to class 1 if \alpha + \sum_{i=1}^{k} \beta_i A_i > 0 and to class 0 otherwise.

Logistic regression maximizes conditional likelihood under this assumption (it is a so-called discriminative model). There is no closed form solution for the maximum likelihood estimates of \alpha and \beta_i, but the loglikelihood function is globally concave (unique global optimum).
Proof (for binary attributes A_i)

Under the naive Bayes assumption we have:

\frac{P(C = 1 \mid \mathbf{a})}{P(C = 0 \mid \mathbf{a})} = \frac{P(a_1 \mid C = 1) \cdots P(a_k \mid C = 1)\, P(C = 1)}{P(a_1 \mid C = 0) \cdots P(a_k \mid C = 0)\, P(C = 0)}

= \left[ \prod_{i=1}^{k} \left( \frac{P(a_i = 1 \mid C = 1)}{P(a_i = 1 \mid C = 0)} \right)^{a_i} \left( \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} \right)^{1 - a_i} \right] \times \frac{P(C = 1)}{P(C = 0)}

Taking the log we get

\log \frac{P(C = 1 \mid \mathbf{a})}{P(C = 0 \mid \mathbf{a})} = \sum_{i=1}^{k} \left[ a_i \log \frac{P(a_i = 1 \mid C = 1)}{P(a_i = 1 \mid C = 0)} + (1 - a_i) \log \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} \right] + \log \frac{P(C = 1)}{P(C = 0)}
Proof (continued)

Expand and collect terms:

\log \frac{P(C = 1 \mid \mathbf{a})}{P(C = 0 \mid \mathbf{a})} = \sum_{i=1}^{k} a_i \underbrace{\log \frac{P(a_i = 1 \mid C = 1)\, P(a_i = 0 \mid C = 0)}{P(a_i = 1 \mid C = 0)\, P(a_i = 0 \mid C = 1)}}_{\beta_i} + \underbrace{\sum_{i=1}^{k} \log \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} + \log \frac{P(C = 1)}{P(C = 0)}}_{\alpha}

which is a linear function of \mathbf{a}.
Example

Suppose P(C = 1) = 0.6, P(a_1 = 1 | C = 1) = 0.8, P(a_1 = 1 | C = 0) = 0.5, P(a_2 = 1 | C = 1) = 0.6, P(a_2 = 1 | C = 0) = 0.3. Then

\log \frac{P(C = 1 \mid a_1, a_2)}{P(C = 0 \mid a_1, a_2)} = 1.386\, a_1 + 1.253\, a_2 - 1.476 + 0.405 = -1.071 + 1.386\, a_1 + 1.253\, a_2

Classify a point with a_1 = 1 and a_2 = 0:

\log \frac{P(C = 1 \mid 1, 0)}{P(C = 0 \mid 1, 0)} = -1.071 + 1.386 \times 1 + 1.253 \times 0 = 0.315

Decision rule: assign to class 1 if

\alpha + \sum_{i=1}^{k} \beta_i A_i > 0

and to class 0 otherwise. Linear decision boundary.
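The coefficients and the classification above can be checked with a few lines of code (a sketch; the helper functions are ad hoc names for the \beta_i and \alpha formulas derived in the proof):

```python
import math

p_c1 = 0.6
p_a1 = {1: 0.8, 0: 0.5}   # P(a_1 = 1 | C = 1) and P(a_1 = 1 | C = 0)
p_a2 = {1: 0.6, 0: 0.3}   # same for a_2

def beta(p):
    # log [ P(a=1|C=1) P(a=0|C=0) / ( P(a=1|C=0) P(a=0|C=1) ) ]
    return math.log(p[1] * (1 - p[0]) / (p[0] * (1 - p[1])))

def alpha_term(p):
    # log [ P(a=0|C=1) / P(a=0|C=0) ]
    return math.log((1 - p[1]) / (1 - p[0]))

b1, b2 = beta(p_a1), beta(p_a2)
alpha = alpha_term(p_a1) + alpha_term(p_a2) + math.log(p_c1 / (1 - p_c1))
print(f"{b1:.3f} {b2:.3f} {alpha:.3f}")   # 1.386 1.253 -1.070

# Log odds for the point a_1 = 1, a_2 = 0, and the corresponding posterior
log_odds = alpha + b1 * 1 + b2 * 0
print(f"{log_odds:.3f}")                  # 0.316 (0.315 on the slide, from rounded coefficients)
print(1 / (1 + math.exp(-log_odds)))      # P(C = 1 | a) ≈ 0.578, so assign to class 1
```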
Linear Decision Boundary

[Figure: the (A_1, A_2) unit square with the decision boundary a_2 = 0.855 − 1.106 a_1; points above the line are assigned to class 1, points below to class 0.]
Relax strong assumptions of NB

The conditional independence assumption of NB is often incorrect, and can lead to suboptimal classification performance. Relax this assumption by allowing (restricted) dependencies between attributes. This may produce more accurate probability estimates, possibly leading to better classification performance. This is not guaranteed, however, because the more complex model may overfit.