Ricco RAKOTOMALALA
Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/
Maximum a posteriori rule

Calculating the posterior probability (Bayes' theorem):

$$P(Y = y_k \mid X) = \frac{P(Y = y_k)\, P(X \mid Y = y_k)}{\sum_{l=1}^{K} P(Y = y_l)\, P(X \mid Y = y_l)}$$

MAP - Maximum a posteriori rule:

$$y_{k^*} = \arg\max_{k} P(Y = y_k \mid X) = \arg\max_{k} P(Y = y_k)\, P(X \mid Y = y_k)$$

Prior probability of class k, P(Y = y_k): estimated by the empirical frequency n_k / n.
How to estimate P(X | Y = y_k)? Assumptions are introduced in order to obtain a convenient calculation of this likelihood.
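To make the rule concrete, here is a minimal Python sketch (not part of the original slides) of the posterior calculation and the MAP decision; the priors and likelihoods are made-up numbers, purely for illustration.

```python
# Minimal sketch of Bayes' theorem and the MAP rule.
# The priors and likelihoods are made-up numbers, purely for illustration.
priors = {"y1": 0.6, "y2": 0.4}          # P(Y = y_k), estimated by n_k / n
likelihoods = {"y1": 0.05, "y2": 0.20}   # P(X | Y = y_k) for one observed X

# Posterior via Bayes' theorem.
evidence = sum(priors[y] * likelihoods[y] for y in priors)
posteriors = {y: priors[y] * likelihoods[y] / evidence for y in priors}

# MAP rule: the denominator is identical for every class,
# so the argmax can be taken over the unnormalized products.
y_star = max(priors, key=lambda y: priors[y] * likelihoods[y])

print(posteriors)  # {'y1': 0.2727..., 'y2': 0.7272...}
print(y_star)      # y2
```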
Conditional independence assumption

Conditional independence for the calculation of the likelihood:

$$P(X \mid Y = y_k) = \prod_{j=1}^{J} P(X_j \mid Y = y_k)$$

The attributes are all conditionally independent of one another, given the value of Y.

For a categorical attribute X, the conditional probability of the value x_l is computed as follows:

$$P(X = x_l \mid Y = y_k) = \frac{P(X = x_l, Y = y_k)}{P(Y = y_k)}$$

The probability is estimated using the conditional relative frequency:

$$\hat{P}(X = x_l \mid Y = y_k) = \frac{\#(X = x_l,\, Y = y_k)}{\#(Y = y_k)} = \frac{n_{kl}}{n_k}$$

The Laplace rule of succession is often used to estimate the conditional probability (L is the number of levels of X):

$$\hat{P}(X = x_l \mid Y = y_k) = p_{l/k} = \frac{n_{kl} + 1}{n_k + L}$$

This is a kind of smoothing; it also enables us to overcome the (n_kl = 0) problem.
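The estimation formulas above translate directly into code. A minimal sketch, assuming plain Python lists of attribute and class values (all names are illustrative):

```python
def laplace_conditional(xs, ys, x_l, y_k, n_levels):
    """Estimate P(X = x_l | Y = y_k) with the Laplace rule of succession:
    (n_kl + 1) / (n_k + L), where L = n_levels is the number of levels of X."""
    n_k = sum(1 for y in ys if y == y_k)
    n_kl = sum(1 for x, y in zip(xs, ys) if x == x_l and y == y_k)
    return (n_kl + 1) / (n_k + n_levels)

# Tiny illustration: X has 2 levels {"a", "b"}.
xs = ["a", "a", "b", "a", "b"]
ys = ["pos", "pos", "pos", "neg", "neg"]
print(laplace_conditional(xs, ys, "a", "pos", 2))  # (2 + 1) / (3 + 2) = 0.6
```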
An example using a toy dataset

Toy dataset (10 instances):

Maladie   Marié   Etud.Sup
Présent   Non     Oui
Présent   Non     Oui
Absent    Non     Non
Absent    Oui     Oui
Présent   Non     Oui
Absent    Non     Non
Absent    Oui     Non
Présent   Non     Oui
Absent    Oui     Non
Présent   Oui     Non

Direct estimation of the posterior probability:

$$\hat{P}(\text{Maladie} = \text{Absent} \mid \text{Marié} = \text{oui},\, \text{Etu} = \text{oui}) = \frac{1}{1} = 1$$
$$\hat{P}(\text{Maladie} = \text{Présent} \mid \text{Marié} = \text{oui},\, \text{Etu} = \text{oui}) = \frac{0}{1} = 0$$

If Etu = oui and Marié = oui Then Maladie = Absent
(+) No assumptions, (-) small number of covered examples.

Conditional independence assumption, using the contingency tables:

Maladie:  Absent 5, Présent 5, Total 10

Marié × Maladie:
           Non  Oui  Total
Absent      2    3     5
Présent     4    1     5
Total       6    4    10

Etud.Sup × Maladie:
           Non  Oui  Total
Absent      4    1     5
Présent     1    4     5
Total       5    5    10

$$\hat{P}(\text{Absent} \mid \text{Marié} = \text{oui},\, \text{Etu} = \text{oui}) \propto \hat{P}(\text{Absent}) \cdot \hat{P}(\text{Marié} = \text{oui} \mid \text{Abs.}) \cdot \hat{P}(\text{Etu} = \text{oui} \mid \text{Abs.}) = \frac{5+1}{10+2} \times \frac{3+1}{5+2} \times \frac{1+1}{5+2} = 0.082$$

$$\hat{P}(\text{Présent} \mid \text{Marié} = \text{oui},\, \text{Etu} = \text{oui}) \propto \hat{P}(\text{Présent}) \cdot \hat{P}(\text{Marié} = \text{oui} \mid \text{Prés.}) \cdot \hat{P}(\text{Etu} = \text{oui} \mid \text{Prés.}) = \frac{5+1}{10+2} \times \frac{1+1}{5+2} \times \frac{4+1}{5+2} = 0.102$$

If Etu = oui and Marié = oui Then Maladie = Présent
(-) Questionable assumption, (+) more reliable estimation of the probabilities.
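As a check, a short Python sketch can reproduce the two Laplace-smoothed scores above; the dataset literal is transcribed from the slide.

```python
# Toy dataset transcribed from the slide: (Maladie, Marié, Etud.Sup)
data = [
    ("Présent", "Non", "Oui"), ("Présent", "Non", "Oui"),
    ("Absent",  "Non", "Non"), ("Absent",  "Oui", "Oui"),
    ("Présent", "Non", "Oui"), ("Absent",  "Non", "Non"),
    ("Absent",  "Oui", "Non"), ("Présent", "Non", "Oui"),
    ("Absent",  "Oui", "Non"), ("Présent", "Oui", "Non"),
]

def laplace(count, total, levels):
    # Laplace rule of succession: (count + 1) / (total + levels)
    return (count + 1) / (total + levels)

def nb_score(y_k, marie, etu):
    """Unnormalized naive Bayes score P(Y) * P(Marié | Y) * P(Etu | Y)."""
    rows_k = [r for r in data if r[0] == y_k]
    prior   = laplace(len(rows_k), len(data), 2)                     # 2 classes
    p_marie = laplace(sum(r[1] == marie for r in rows_k), len(rows_k), 2)
    p_etu   = laplace(sum(r[2] == etu for r in rows_k), len(rows_k), 2)
    return prior * p_marie * p_etu

print(round(nb_score("Absent",  "Oui", "Oui"), 3))  # 0.082
print(round(nb_score("Présent", "Oui", "Oui"), 3))  # 0.102
```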
Advantages and shortcomings (end of the course?)

>> Simplicity, speed, ability to handle very large datasets, no possible crash during the calculations.
>> Incrementality (we store only the contingency tables).
>> Statistically robust (even if the conditional independence assumption is very questionable).
>> This is a linear classifier, with similar classification performance (see the numerous experiments described in scientific papers).

>> No indication about the relevance of the attributes (really?).
>> Very high number of rules (in practice, the logical rules are not computed; the contingency tables used for the calculation of the conditional frequencies are deployed, e.g. in PMML format).
>> Not an explicit model (really?), hence not used in the marketing domain, etc.

We often see these conclusions in the literature… Is it possible to go beyond that?
Logarithmic transformation

$$y_{k^*} = \arg\max_{k} P(Y = y_k) \prod_{j=1}^{J} P(X_j \mid Y = y_k)$$

Since the logarithm is monotonically increasing, this is equivalent to:

$$y_{k^*} = \arg\max_{k} \left[ \ln P(Y = y_k) + \sum_{j=1}^{J} \ln P(X_j \mid Y = y_k) \right]$$
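A small numerical illustration (made-up probabilities): the log form ranks the classes exactly as the product form does, and the sum of logs also avoids floating-point underflow when J is large (a standard implementation remark, not from the slide).

```python
from math import log, prod

prior = 0.5
cond = [0.3, 0.7, 0.2]   # P(X_j | Y = y_k) for one class, made-up values

product_form = prior * prod(cond)                    # 0.021
log_form = log(prior) + sum(log(p) for p in cond)    # log(0.021)

# The log is monotonically increasing, so ranking classes by log_form
# gives the same argmax as ranking them by product_form.
print(product_form)        # 0.021
print(log(product_form))   # -3.863... equals log_form
```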
Model using one predictive attribute

A discrete attribute X with L levels:

$$d(y_k, X) = \ln P(Y = y_k) + \ln P(X \mid Y = y_k)$$

From X, we can create L dummy variables I_l:

$$d(y_k, X) = \ln P(Y = y_k) + \sum_{l=1}^{L} \ln P(X = x_l \mid Y = y_k) \cdot I_l = a_{0,k} + \sum_{l=1}^{L} a_{l,k} I_l$$

We obtain a linear combination of the dummy variables, i.e. an explicit model which is easy to deploy: K linear classification functions (as in linear discriminant analysis).
An example (Y: Maladie; X: Etud.Sup)

Contingency tables (see the toy dataset above): Maladie: Absent 5, Présent 5, Total 10; Etud.Sup × Maladie: Absent (Non 4, Oui 1), Présent (Non 1, Oui 4).

With the Laplace-smoothed estimates:

$$d(\text{absent}, X) = \ln\frac{5+1}{10+2} + \ln\frac{4+1}{5+2} \cdot (X = \text{non}) + \ln\frac{1+1}{5+2} \cdot (X = \text{oui}) = -0.6931 - 0.3365\,(X = \text{non}) - 1.2528\,(X = \text{oui})$$

$$d(\text{présent}, X) = -0.6931 - 1.2528\,(X = \text{non}) - 0.3365\,(X = \text{oui})$$

For an instance with (Etu.Sup = Non):

$$d(\text{absent}, X) = -0.6931 - 0.3365 = -1.0296$$
$$d(\text{présent}, X) = -0.6931 - 1.2528 = -1.9459$$

Prediction: Maladie = Absent.
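The same numbers can be reproduced in a few lines of Python; the counts below come from the contingency tables of the toy example:

```python
from math import log

# Counts from the contingency tables (Y = Maladie, X = Etud.Sup)
n = 10
n_class = {"Absent": 5, "Présent": 5}
n_oui   = {"Absent": 1, "Présent": 4}   # instances with X = Oui in each class

def d(y, x):
    """Linear classification function d(y_k, X), Laplace-smoothed (2 classes, 2 levels)."""
    prior = log((n_class[y] + 1) / (n + 2))
    p_oui = (n_oui[y] + 1) / (n_class[y] + 2)
    p_non = (n_class[y] - n_oui[y] + 1) / (n_class[y] + 2)
    return prior + log(p_oui if x == "Oui" else p_non)

print(round(d("Absent",  "Non"), 4))   # -1.0296
print(round(d("Présent", "Non"), 4))   # -1.9459 -> predict Maladie = Absent
```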
Implemented solution in TANAGRA (using [L-1] dummy variables for an attribute X with L levels)

Since $I_1 = 1 - I_2 - \cdots - I_L$:

$$d(y_k, X) = \ln P(Y = y_k) + \sum_{l=1}^{L} \ln P(X = x_l \mid Y = y_k) \cdot I_l$$

$$= \Big[\ln P(Y = y_k) + \ln P(X = x_L \mid Y = y_k)\Big] + \sum_{l=1}^{L-1} \ln\frac{P(X = x_l \mid Y = y_k)}{P(X = x_L \mid Y = y_k)} \cdot I_l = b_{0,k} + \sum_{l=1}^{L-1} b_{l,k} I_l$$

One level (x_L) becomes the reference level. This dummy coding is the most commonly used coding scheme.
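A sketch of this reparameterization on the one-attribute example, with "Oui" taken as the reference level (the probabilities are the Laplace-smoothed values computed earlier):

```python
from math import log

# Laplace-smoothed probabilities from the one-attribute example
# (Y = Maladie, X = Etud.Sup); "Oui" is chosen as the reference level x_L.
prior = {"Absent": 6 / 12, "Présent": 6 / 12}
p = {"Absent":  {"Non": 5 / 7, "Oui": 2 / 7},
     "Présent": {"Non": 2 / 7, "Oui": 5 / 7}}

for y in p:
    b0 = log(prior[y]) + log(p[y]["Oui"])     # intercept absorbs the reference level
    b_non = log(p[y]["Non"] / p[y]["Oui"])    # coefficient of the single dummy (X = Non)
    # Same scores as before: b0 + b_non equals d(y, X = Non)
    print(y, round(b0, 4), round(b_non, 4), round(b0 + b_non, 4))
# Absent  -1.9459  0.9163 -1.0296
# Présent -1.0296 -0.9163 -1.9459
```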
Extension to J predictive attributes

Dummy coding scheme: each attribute X_j with L_j levels is recoded into (L_j - 1) dummy variables (the same toy dataset as above serves as illustration).

We obtain linear classification functions using the indicator variables.
The particular case of binary classification (K = 2): construction of the SCORE function

The class attribute has 2 levels: Y = {+, −}.

$$d(+, X) = a_{+,0} + a_{+,1} X_1 + a_{+,2} X_2 + \cdots + a_{+,J} X_J$$
$$d(-, X) = a_{-,0} + a_{-,1} X_1 + a_{-,2} X_2 + \cdots + a_{-,J} X_J$$
$$D(X) = d(+, X) - d(-, X) = c_0 + c_1 X_1 + c_2 X_2 + \cdots + c_J X_J$$

Decision rule: D(X) > 0 ⟹ Y = +

Interpretation:
>> D(X) is the SCORE function. It assigns to each instance a score that is monotonically related to the estimated probability of the positive class.
>> The sign of the coefficients allows us to interpret the influence of the descriptors.

Our example, with "Présent" as the positive class:

Descriptors        Présent      Absent       D(X)
Marié = Non        0.916291    -0.287682     1.203973
Etud.Sup = Oui     0.916291    -0.916291     1.832582
constant          -3.198673    -1.589235    -1.609438

Not being married makes one sick… Studying makes one sick…
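The coefficient table can be reproduced from the Laplace-smoothed probabilities of the toy example; a sketch, with "Marié = Oui" and "Etud.Sup = Non" as reference levels (matching the rows of the table):

```python
from math import log

# Laplace-smoothed probabilities from the toy example (2 classes, 2 levels each);
# "Marié = Oui" and "Etud.Sup = Non" serve as reference levels.
prior       = {"Présent": 6 / 12, "Absent": 6 / 12}
p_marie_non = {"Présent": 5 / 7,  "Absent": 3 / 7}
p_etu_oui   = {"Présent": 5 / 7,  "Absent": 2 / 7}

def coeffs(y):
    """Coefficients (constant, Marié=Non, Etud.Sup=Oui) of d(y, X)."""
    const = log(prior[y]) + log(1 - p_marie_non[y]) + log(1 - p_etu_oui[y])
    a_marie = log(p_marie_non[y] / (1 - p_marie_non[y]))
    a_etu   = log(p_etu_oui[y] / (1 - p_etu_oui[y]))
    return const, a_marie, a_etu

c_pres, c_abs = coeffs("Présent"), coeffs("Absent")
score = [cp - ca for cp, ca in zip(c_pres, c_abs)]  # D(X) = d(+, X) - d(-, X)

print([round(v, 6) for v in c_pres])  # [-3.198673, 0.916291, 0.916291]
print([round(v, 6) for v in c_abs])   # [-1.589235, -0.287682, -0.916291]
print([round(v, 6) for v in score])   # [-1.609438, 1.203973, 1.832582]
```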