Data Warehousing and Machine Learning
Probabilistic Classifiers

Thomas D. Nielsen
Aalborg University, Department of Computer Science
Spring 2008
Probabilistic Classifiers
Conditional class probabilities

  Id.  Savings  Assets  Income  Credit risk
  1    Medium   High     75     Good
  2    Low      Low      50     Bad
  3    High     Medium   25     Bad
  4    Medium   High     75     Good
  5    Low      Medium  100     Good
  6    High     High     25     Good
  7    Medium   High     75     Bad
  8    Medium   Medium   75     Good
  ...  ...      ...      ...    ...

From the counts in the table:

  P(Risk = Good | Savings = Medium, Assets = High, Income = 75) = 2/3
  P(Risk = Bad  | Savings = Medium, Assets = High, Income = 75) = 1/3
Probabilistic Classifiers
Empirical Distribution

The training data defines the empirical distribution, which can be represented in a table.

Empirical distribution obtained from 1000 data instances:

  Gender  Blood Pressure  Weight  Smoker  Stroke      P
  m       low             under   no      no      32/1000
  m       low             under   no      yes      1/1000
  m       low             under   yes     no      27/1000
  ...     ...             ...     ...     ...         ...
  f       normal          normal  no      yes      0/1000
  ...     ...             ...     ...     ...         ...
  f       high            over    yes     yes     54/1000

Such a table is not a suitable probabilistic model, because
• the size of the representation grows exponentially in the number of attributes
• it overfits the data (e.g. the combination (f, normal, normal, no, yes), unseen in the sample, is assigned probability 0)
Probabilistic Classifiers
Model

View the data as being produced by a random process that is described by a joint probability distribution P on States(A_1, ..., A_n, C), i.e. P assigns a probability

  P(a_1, ..., a_n, c) ∈ [0, 1]

to every tuple (a_1, ..., a_n, c) of values for the attribute and class variables, such that

  Σ_{(a_1, ..., a_n, c) ∈ States(A_1, ..., A_n, C)} P(a_1, ..., a_n, c) = 1

(for discrete attributes; integration instead of summation for continuous attributes).

Conditional Probability
The joint distribution P also defines the conditional probability distribution of C given A_1, ..., A_n, i.e. the values

  P(c | a_1, ..., a_n) := P(a_1, ..., a_n, c) / P(a_1, ..., a_n)
                        = P(a_1, ..., a_n, c) / Σ_{c'} P(a_1, ..., a_n, c')

that represent the probability that C = c given that it is known that A_1 = a_1, ..., A_n = a_n.
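A minimal sketch (not from the slides) of how such conditional class probabilities can be read off an empirical joint distribution; the toy rows below are illustrative only.

from collections import Counter

# Toy training data: (attribute tuple, class label). Values are made up.
data = [
    (("Medium", "High", 75), "Good"),
    (("Low", "Low", 50), "Bad"),
    (("High", "Medium", 25), "Bad"),
    (("Medium", "High", 75), "Good"),
    (("Medium", "High", 75), "Bad"),
]

# Empirical joint distribution P(a_1, ..., a_n, c) from counts.
joint = Counter(data)
classes = {c for _, c in data}

def conditional(c, attrs):
    """P(C = c | a_1, ..., a_n) = P(a_1, ..., a_n, c) / sum_c' P(a_1, ..., a_n, c')."""
    denominator = sum(joint[(attrs, c2)] for c2 in classes)
    return joint[(attrs, c)] / denominator if denominator else 0.0

print(conditional("Good", ("Medium", "High", 75)))  # 2/3 on this toy sample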
Probabilistic Classifiers
Classification Rule

For a loss function L(c, c') an instance is classified according to

  C(a_1, ..., a_n) := argmin_{c' ∈ States(C)} Σ_{c ∈ States(C)} L(c, c') P(c | a_1, ..., a_n)

Examples of loss functions L(c, c') (rows: true class, columns: predicted class):

  Cancer diagnosis:
                Predicted Cancer   Predicted Normal
  true Cancer          1                1000
  true Normal          1                   0

  0/1 loss:
          c    c'
  c       0    1
  c'      1    0
Probabilistic Classifiers
Classification Rule (cont.)

Under 0/1 loss the rule reduces to

  C(a_1, ..., a_n) := argmax_{c ∈ States(C)} P(c | a_1, ..., a_n)

In the binary case, e.g. States(C) = {notinfected, infected}, one can also classify with a variable threshold t:

  C(a_1, ..., a_n) = notinfected  :⟺  P(notinfected | a_1, ..., a_n) ≥ t

(this can also be generalized to non-binary class variables).
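A minimal sketch of the expected-loss decision rule and its 0/1-loss special case, assuming the conditional class probabilities are already available; the class names and loss values mirror the cancer example above and are otherwise illustrative.

def classify(posterior, loss):
    """Return argmin_{c'} sum_c L(c, c') * P(c | a_1, ..., a_n)."""
    classes = list(posterior)
    return min(classes,
               key=lambda pred: sum(loss[(c, pred)] * posterior[c] for c in classes))

posterior = {"Cancer": 0.01, "Normal": 0.99}
loss = {("Cancer", "Cancer"): 1, ("Cancer", "Normal"): 1000,   # L(true, predicted)
        ("Normal", "Cancer"): 1, ("Normal", "Normal"): 0}
print(classify(posterior, loss))          # "Cancer": a 1% risk already outweighs the false-alarm cost

# Under 0/1 loss the same rule reduces to the argmax of the posterior:
print(max(posterior, key=posterior.get))  # "Normal"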
Naive Bayes
The Naive Bayes Model

Structural assumption:

  P(a_1, ..., a_n, c) = P(a_1 | c) · P(a_2 | c) · · · P(a_n | c) · P(c)

Graphical representation as a Bayesian network: the class node C is the parent of every attribute node A_1, ..., A_n (figure).

Interpretation: given the true class label, the different attributes take their values independently.
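A minimal sketch of how the factorization is used to compute the class posterior, assuming the parameters P(c) and P(a_i | c) are already given; the numbers below are made up for illustration.

import math

prior = {"Good": 0.6, "Bad": 0.4}
cpt = {  # cpt[i][c][value] = P(A_i = value | C = c); illustrative numbers only
    0: {"Good": {"Low": 0.2, "Medium": 0.5, "High": 0.3},
        "Bad":  {"Low": 0.5, "Medium": 0.3, "High": 0.2}},
    1: {"Good": {"Low": 0.1, "Medium": 0.4, "High": 0.5},
        "Bad":  {"Low": 0.6, "Medium": 0.3, "High": 0.1}},
}

def posterior(attrs):
    """P(c | a_1, ..., a_n) is proportional to P(c) * prod_i P(a_i | c)."""
    scores = {c: prior[c] * math.prod(cpt[i][c][a] for i, a in enumerate(attrs))
              for c in prior}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior(("Medium", "High")))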
Naive Bayes
The naive Bayes assumption I

(Figure: a symbol drawn on a grid of cells numbered 1-9; each cell is an attribute, the symbol is the class.)

For example:

  P(Cell-2 = b | Cell-5 = b, Symbol = 1) > P(Cell-2 = b | Symbol = 1)

Attributes are not independent given Symbol = 1!
Naive Bayes
The naive Bayes assumption II

For the spam example, e.g.:

  P(Body'nigeria' = y | Body'confidential' = y, Spam = y) ≫ P(Body'nigeria' = y | Spam = y)

Attributes are not independent given Spam = yes!

⇒ The naive Bayes assumption is often not realistic. Nevertheless, naive Bayes is often successful.
Naive Bayes
Learning a Naive Bayes Classifier

• Determine the parameters P(a_i | c) (a_i ∈ States(A_i), c ∈ States(C)) from empirical counts in the data.
• Missing values are easily handled: instances for which A_i is missing are simply ignored when estimating P(a_i | c).
• Discrete and continuous attributes can be mixed.
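A minimal sketch of the counting step for discrete attributes, with None marking a missing value; the rows are illustrative, and a real implementation would usually add smoothing for unseen values.

from collections import Counter, defaultdict

# Toy data: (attribute values, class label); None marks a missing value.
rows = [
    (("Medium", "High"), "Good"),
    (("Low", None), "Bad"),
    (("Medium", "High"), "Good"),
    (("High", "Low"), "Bad"),
]

class_counts = Counter(label for _, label in rows)
prior = {c: n / len(rows) for c, n in class_counts.items()}

counts = defaultdict(lambda: defaultdict(Counter))   # counts[i][c][a_i]
seen = defaultdict(Counter)                          # seen[i][c] = number of non-missing A_i values
for attrs, label in rows:
    for i, a in enumerate(attrs):
        if a is None:
            continue          # this instance is ignored for P(a_i | c) only
        counts[i][label][a] += 1
        seen[i][label] += 1

p_medium_given_good = counts[0]["Good"]["Medium"] / seen[0]["Good"]
print(prior, p_medium_given_good)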
Naive Bayes
The paradoxical success of Naive Bayes

One explanation for the surprisingly good performance of naive Bayes in many domains: classification does not require the exact distribution, only the right decision boundaries [Domingos, Pazzani 97].

(Figure: the true posterior P(C = ⊕ | a_1, ..., a_n) and the naive Bayes estimate of it, plotted over States(A_1, ..., A_n) against the 0.5 threshold; the two curves can differ considerably while still inducing the same decision boundaries.)
Naive Bayes
When Naive Bayes must fail

No naive Bayes classifier can produce the following classification:

  A    B    Class
  yes  yes   ⊕
  yes  no    ⊖
  no   yes   ⊖
  no   no    ⊕

because, assuming it did, the naive Bayes decision rule would require:

  1.  P(A = y | ⊕) P(B = y | ⊕) P(⊕)  >  P(A = y | ⊖) P(B = y | ⊖) P(⊖)
  2.  P(A = y | ⊖) P(B = n | ⊖) P(⊖)  >  P(A = y | ⊕) P(B = n | ⊕) P(⊕)
  3.  P(A = n | ⊖) P(B = y | ⊖) P(⊖)  >  P(A = n | ⊕) P(B = y | ⊕) P(⊕)
  4.  P(A = n | ⊕) P(B = n | ⊕) P(⊕)  >  P(A = n | ⊖) P(B = n | ⊖) P(⊖)
Naive Bayes
When Naive Bayes must fail (cont.)

Multiplying the four left sides and the four right sides of inequalities 1.-4. gives

  Π_{i=1}^{4} (left side of i.)  >  Π_{i=1}^{4} (right side of i.)

But this is false, because both products are actually equal: each conditional probability factor appears exactly once in each product, and P(⊕) and P(⊖) each appear twice on both sides.
Naive Bayes
Tree Augmented Naive Bayes

Model: all Bayesian network structures where
- the class node is a parent of each attribute node
- the substructure on the attribute nodes is a tree

(Figure: an example TAN structure with class node C and attribute nodes A_1, ..., A_7.)

Learning a TAN classifier means learning the tree structure and the parameters. The optimal tree structure can be found efficiently (Chow, Liu 1968; Friedman et al. 1997).
Naive Bayes

TAN classifier for

  A    B    Class
  yes  yes   ⊕
  yes  no    ⊖
  no   yes   ⊖
  no   no    ⊕

(Structure: C → A, C → B, A → B.)

  P(C):
      ⊕     ⊖
     0.5   0.5

  P(A | C):
    C     yes   no
    ⊕     0.5   0.5
    ⊖     0.5   0.5

  P(B | C, A):
    C    A     yes   no
    ⊕    yes   1.0   0.0
    ⊕    no    0.0   1.0
    ⊖    yes   0.0   1.0
    ⊖    no    1.0   0.0
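A quick check (not spelled out on the slide) that these tables reproduce the XOR classification, using the factorization P(c, a, b) = P(c) P(a | c) P(b | c, a):

  P(⊕, A = yes, B = yes) = 0.5 · 0.5 · 1.0 = 0.25
  P(⊖, A = yes, B = yes) = 0.5 · 0.5 · 0.0 = 0

so (yes, yes) is classified ⊕; the other three attribute combinations follow analogously.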
Tree Augmented Naive Bayes
Learning a TAN Classifier: a rough overview

• Learn a (class conditional) maximum likelihood tree structure of the attributes.
• Insert the class variable as a parent of all the attributes.

Learning a Chow-Liu tree
A Chow-Liu tree of maximal likelihood can be constructed as follows:
1. Calculate MI(A_i, A_j) for each pair (A_i, A_j).
2. Build a maximum-weight spanning tree over the attributes.
3. Direct the resulting tree.
4. Learn the parameters.

Here MI is the mutual information under the empirical distribution P#:

  MI(A_i, A_j) = Σ_{A_i, A_j} P#(A_i, A_j) log_2 ( P#(A_i, A_j) / (P#(A_i) P#(A_j)) )
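A minimal sketch of steps 1 and 2 (pairwise mutual information and a maximum-weight spanning tree) for discrete attributes; the toy data and the Kruskal-style tree construction are illustrative choices, not prescribed by the slides.

import math
from collections import Counter
from itertools import combinations

data = [  # toy rows over three attributes; values are made up
    ("y", "y", "n"), ("y", "y", "y"), ("n", "n", "n"), ("n", "n", "y"),
]
N, n_attrs = len(data), len(data[0])

def mutual_information(i, j):
    """MI(A_i, A_j) = sum_{a_i, a_j} P#(a_i, a_j) log2( P#(a_i, a_j) / (P#(a_i) P#(a_j)) )."""
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((n / N) * math.log2((n / N) / ((pi[a] / N) * (pj[b] / N)))
               for (a, b), n in pij.items())

# Maximum-weight spanning tree over the attributes (greedy, Kruskal-style).
edges = sorted(((mutual_information(i, j), i, j)
                for i, j in combinations(range(n_attrs), 2)), reverse=True)
component = list(range(n_attrs))
tree = []
for w, i, j in edges:
    if component[i] != component[j]:
        tree.append((i, j))
        old, new = component[j], component[i]
        component = [new if c == old else c for c in component]

print(tree)  # undirected tree edges; directing them and adding C as a parent come next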