
Data Mining and Machine Learning: Fundamental Concepts and Algorithms (PowerPoint PPT Presentation)



  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms (dataminingbook.info)
     Mohammed J. Zaki, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
     Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
     Chap. 12: Pattern and Rule Assessment
     Zaki & Meira Jr. (RPI and UFMG), Data Mining and Machine Learning, Chap. 12: Pattern and Rule Assessment

  2. Rule Assessment Measures: Support and Confidence

     Support: the support of a rule is the number of transactions that contain both X and Y, that is,

         sup(X → Y) = sup(XY) = |t(XY)|

     The relative support is the fraction of transactions that contain both X and Y, that is, the empirical joint probability of the items comprising the rule:

         rsup(X → Y) = P(XY) = rsup(XY) = sup(XY) / |D|

     Confidence: the confidence of a rule is the conditional probability that a transaction contains the consequent Y given that it contains the antecedent X:

         conf(X → Y) = P(Y|X) = P(XY) / P(X) = rsup(XY) / rsup(X) = sup(XY) / sup(X)
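The two measures above can be sketched directly from tidsets. This is a minimal illustration, not the book's code; the dataset is the six-transaction example used in the following slides, and the helper names (`t`, `sup`, `conf`) are my own.

```python
# Transaction database D: tid -> set of items (the example dataset
# from the next slide; purely illustrative).
D = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
     4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def t(itemset):
    """Tidset of an itemset: transactions containing every one of its items."""
    return {tid for tid, items in D.items() if set(itemset) <= items}

def sup(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return len(t(itemset))

def rsup(itemset):
    """Relative support: empirical probability of the itemset."""
    return sup(itemset) / len(D)

def conf(X, Y):
    """conf(X -> Y) = sup(XY) / sup(X)."""
    return sup(set(X) | set(Y)) / sup(X)

print(conf("A", "E"))  # 1.0
print(conf("E", "A"))  # 0.8
```

These two confidence values match the A → E and E → A rows in the example slide that follows.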

  3. Example Dataset: Support and Confidence

     Transaction database:

         Tid  Items
         1    ABDE
         2    BCE
         3    ABDE
         4    ABCE
         5    ABCDE
         6    BCD

     Frequent itemsets (minsup = 3):

         sup  rsup  Itemsets
         3    0.5   ABD, ABDE, AD, ADE, BCE, BDE, CE, DE
         4    0.67  A, C, D, AB, ABE, AE, BC, BD
         5    0.83  E, BE
         6    1.0   B

     Rule confidences:

         Rule     conf
         A → E    1.00
         E → A    0.80
         B → E    0.83
         E → B    1.00
         E → BC   0.60
         BC → E   0.75
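At this tiny scale, the frequent-itemset table above can be recovered by brute-force enumeration over all candidate itemsets. This is only a sketch for checking the numbers; a real miner would use Apriori or Eclat rather than enumerating every subset.

```python
from itertools import combinations

# The six example transactions (illustrative encoding as strings).
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
items = sorted(set("".join(D.values())))

def sup(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for txn in D.values() if set(itemset) <= set(txn))

# Enumerate every nonempty itemset and keep those with sup >= minsup.
minsup = 3
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = sup(cand)
        if s >= minsup:
            frequent["".join(cand)] = s

print(frequent["B"])   # 6
print(frequent["BE"])  # 5
print(sorted(f for f, s in frequent.items() if s == 3))
```

The last line reproduces the sup = 3 row of the table: ABD, ABDE, AD, ADE, BCE, BDE, CE, DE.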

  4. Rule Assessment Measures: Lift, Leverage and Jaccard

     Lift: lift is defined as the ratio of the observed joint probability of X and Y to the expected joint probability if they were statistically independent, that is,

         lift(X → Y) = P(XY) / (P(X) · P(Y)) = rsup(XY) / (rsup(X) · rsup(Y)) = conf(X → Y) / rsup(Y)

     Leverage: leverage measures the difference between the observed and expected joint probability of XY, assuming that X and Y are independent:

         leverage(X → Y) = P(XY) − P(X) · P(Y) = rsup(XY) − rsup(X) · rsup(Y)

     Jaccard: the Jaccard coefficient measures the similarity between two sets. When applied as a rule assessment measure it computes the similarity between the tidsets of X and Y:

         jaccard(X → Y) = |t(X) ∩ t(Y)| / |t(X) ∪ t(Y)| = P(XY) / (P(X) + P(Y) − P(XY))
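The three definitions above translate one-for-one into code on the example dataset. A hedged sketch, with illustrative helper names:

```python
# Six-transaction example dataset from the earlier slide.
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    """Tidset of itemset X."""
    return {tid for tid, txn in D.items() if set(X) <= set(txn)}

def rsup(X):
    """Relative support of itemset X."""
    return len(t(X)) / len(D)

def lift(X, Y):
    """Observed over expected-under-independence joint probability."""
    return rsup(set(X) | set(Y)) / (rsup(X) * rsup(Y))

def leverage(X, Y):
    """Observed minus expected-under-independence joint probability."""
    return rsup(set(X) | set(Y)) - rsup(X) * rsup(Y)

def jaccard(X, Y):
    """Similarity of the tidsets of X and Y."""
    return len(t(X) & t(Y)) / len(t(X) | t(Y))

print(round(lift("AE", "BC"), 2))   # 0.75
print(round(jaccard("A", "C"), 2))  # 0.33
```

Both printed values match the AE → BC and A → C rows of the next slide's tables.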

  5. Lift, Leverage, Jaccard, Support and Confidence

         Rule       lift
         AE → BC    0.75
         CE → AB    1.00
         BE → AC    1.20

         Rule       rsup  lift  leverage
         ACD → E    0.17  1.20  0.03
         AC → E     0.33  1.20  0.06
         AB → D     0.50  1.12  0.06
         A → E      0.67  1.20  0.11

         Rule       rsup  conf  lift
         E → AC     0.33  0.40  1.20
         E → AB     0.67  0.80  1.20
         B → E      0.83  0.83  1.00

         Rule       rsup  lift  jaccard
         A → C      0.33  0.75  0.33

  6. Contingency Table for X and Y

                Y           ¬Y
         X      sup(XY)     sup(X¬Y)     sup(X)
         ¬X     sup(¬XY)    sup(¬X¬Y)    sup(¬X)
                sup(Y)      sup(¬Y)      |D|
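The four cells of this table can be built from tidsets by set difference. A small sketch (helper names are my own) that the conviction and odds-ratio slides below both rely on:

```python
# Six-transaction example dataset from the earlier slide.
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    """Tidset of itemset X."""
    return {tid for tid, txn in D.items() if set(X) <= set(txn)}

def contingency(X, Y):
    """The four support counts of the 2x2 contingency table for X and Y."""
    tids, tX, tY = set(D), t(X), t(Y)
    return {"XY": len(tX & tY),      # both present
            "X!Y": len(tX - tY),     # X present, Y absent
            "!XY": len(tY - tX),     # X absent, Y present
            "!X!Y": len(tids - tX - tY)}  # both absent

print(contingency("A", "C"))  # {'XY': 2, 'X!Y': 2, '!XY': 2, '!X!Y': 0}
```

The printed counts are exactly the A/C contingency table used in the odds-ratio example slide; the four cells always sum to |D|.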

  7. Rule Assessment Measures: Conviction

     Define ¬X to be the event that X is not contained in a transaction, that is, X ⊄ t for t ∈ T, and likewise for ¬Y. There are, in general, four possible events depending on the occurrence or non-occurrence of the itemsets X and Y, as depicted in the contingency table.

     Conviction measures the expected error of the rule, that is, how often X occurs in a transaction where Y does not. It is thus a measure of the strength of a rule with respect to the complement of the consequent, defined as

         conv(X → Y) = P(X) · P(¬Y) / P(X¬Y) = 1 / lift(X → ¬Y)

     If the joint probability of X¬Y is less than that expected under independence of X and ¬Y, then conviction is high, and vice versa.
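A sketch of the conviction formula above on the example dataset. Conviction is infinite when the rule is never violated (no transaction has X without Y), which the code handles explicitly; helper names are illustrative.

```python
import math

# Six-transaction example dataset from the earlier slide.
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    """Tidset of itemset X."""
    return {tid for tid, txn in D.items() if set(X) <= set(txn)}

def conv(X, Y):
    """conv(X -> Y) = P(X) * P(not Y) / P(X, not Y)."""
    p_x = len(t(X)) / len(D)
    p_not_y = 1 - len(t(Y)) / len(D)
    p_x_not_y = len(t(X) - t(Y)) / len(D)  # X present, Y absent
    if p_x_not_y == 0:
        return math.inf  # rule is never violated
    return p_x * p_not_y / p_x_not_y

print(round(conv("A", "DE"), 2))  # 2.0
print(conv("DE", "A"))            # inf
```

These match the A → DE and DE → A rows of the conviction table on the next slide.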

  8. Rule Conviction

         Rule      rsup  conf  lift  conv
         A → DE    0.50  0.75  1.50  2.00
         DE → A    0.50  1.00  1.50  ∞
         E → C     0.50  0.60  0.90  0.83
         C → E     0.50  0.75  0.90  0.68

  9. Rule Assessment Measures: Odds Ratio

     The odds ratio utilizes all four entries from the contingency table. Divide the dataset into two groups of transactions: those that contain X and those that do not contain X. Define the odds of Y in these two groups as follows:

         odds(Y|X) = [P(XY) / P(X)] / [P(X¬Y) / P(X)] = P(XY) / P(X¬Y)
         odds(Y|¬X) = [P(¬XY) / P(¬X)] / [P(¬X¬Y) / P(¬X)] = P(¬XY) / P(¬X¬Y)

     The odds ratio is then defined as the ratio of these two odds:

         oddsratio(X → Y) = odds(Y|X) / odds(Y|¬X)
                          = [P(XY) · P(¬X¬Y)] / [P(X¬Y) · P(¬XY)]
                          = [sup(XY) · sup(¬X¬Y)] / [sup(X¬Y) · sup(¬XY)]

     If X and Y are independent, then the odds ratio has value 1.
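Since the final form of the odds ratio uses only the four contingency counts, it is a one-liner once those counts are in hand. A sketch with illustrative names (it assumes neither denominator count is zero):

```python
# Six-transaction example dataset from the earlier slide.
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    """Tidset of itemset X."""
    return {tid for tid, txn in D.items() if set(X) <= set(txn)}

def odds_ratio(X, Y):
    """sup(XY) * sup(!X!Y) / (sup(X!Y) * sup(!XY))."""
    tids, tX, tY = set(D), t(X), t(Y)
    n_xy = len(tX & tY)
    n_x_noty = len(tX - tY)
    n_notx_y = len(tY - tX)
    n_notx_noty = len(tids - tX - tY)
    return (n_xy * n_notx_noty) / (n_x_noty * n_notx_y)

print(odds_ratio("C", "A"))  # 0.0
print(odds_ratio("D", "A"))  # 3.0
```

These two values reproduce the C → A versus D → A comparison worked out on the next slide.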

  10. Odds Ratio

      Let us compare the odds ratio for two rules, C → A and D → A. The contingency tables for A and C, and for A and D, are given below:

                C   ¬C                 D   ¬D
          A     2   2            A     3   1
          ¬A    2   0            ¬A    1   1

      The odds ratio values for the two rules are given as

          oddsratio(C → A) = [sup(AC) · sup(¬A¬C)] / [sup(A¬C) · sup(¬AC)] = (2 × 0) / (2 × 2) = 0
          oddsratio(D → A) = [sup(AD) · sup(¬A¬D)] / [sup(A¬D) · sup(¬AD)] = (3 × 1) / (1 × 1) = 3

  11. Iris Data: Discretization

          Attribute     Range or value   Label
          Sepal length  4.30–5.55        sl1
                        5.55–6.15        sl2
                        6.15–7.90        sl3
          Sepal width   2.00–2.95        sw1
                        2.95–3.35        sw2
                        3.35–4.40        sw3
          Petal length  1.00–2.45        pl1
                        2.45–4.75        pl2
                        4.75–6.90        pl3
          Petal width   0.10–0.80        pw1
                        0.80–1.75        pw2
                        1.75–2.50        pw3
          Class         Iris-setosa      c1
                        Iris-versicolor  c2
                        Iris-virginica   c3
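A discretization like the table above maps each raw attribute value to the label of the bin it falls in. A minimal sketch for sepal length, assuming bins are half-open on the right (the slide does not state which side is closed); the edge values are copied from the table, the helper name is my own:

```python
import bisect

# Interior bin boundaries for sepal length from the table above:
# sl1 = [4.30, 5.55), sl2 = [5.55, 6.15), sl3 = [6.15, 7.90].
SEPAL_LENGTH_EDGES = [5.55, 6.15]
SEPAL_LENGTH_LABELS = ["sl1", "sl2", "sl3"]

def discretize(value, edges, labels):
    """Return the label of the bin that `value` falls into."""
    return labels[bisect.bisect_right(edges, value)]

print(discretize(5.0, SEPAL_LENGTH_EDGES, SEPAL_LENGTH_LABELS))  # sl1
print(discretize(6.0, SEPAL_LENGTH_EDGES, SEPAL_LENGTH_LABELS))  # sl2
print(discretize(7.2, SEPAL_LENGTH_EDGES, SEPAL_LENGTH_LABELS))  # sl3
```

The same pattern, with the edges from the other rows of the table, covers sepal width, petal length, and petal width.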

  12. Iris: Support vs. Confidence, and Conviction vs. Lift

      [Figure: two scatter plots of the class-specific rules, one point per rule, keyed to the classes Iris-setosa (c1), Iris-versicolor (c2), and Iris-virginica (c3): (a) support (rsup) vs. confidence (conf); (b) lift vs. conviction (conv).]

  13. Iris Data: Best Class-specific Rules

      Best rules by support and confidence:

          Rule                  rsup   conf  lift  conv
          {pl1, pw1} → c1       0.333  1.00  3.00  ∞
          pw2 → c2              0.327  0.91  2.72  6.00
          pl3 → c3              0.327  0.89  2.67  5.24

      Best rules by lift and conviction:

          Rule                  rsup  conf  lift  conv
          {pl1, pw1} → c1       0.33  1.00  3.00  ∞
          {pl2, pw2} → c2       0.29  0.98  2.93  15.00
          {sl3, pl3, pw3} → c3  0.25  1.00  3.00  ∞

  14. Pattern Assessment Measures: Support and Lift

      Support: the most basic measures are support and relative support, giving the number and fraction of transactions in D that contain the itemset X:

          sup(X) = |t(X)|        rsup(X) = sup(X) / |D|

      Lift: the lift of a k-itemset X = {x1, x2, ..., xk} is defined as

          lift(X, D) = P(X) / ∏_{i=1}^{k} P(xi) = rsup(X) / ∏_{i=1}^{k} rsup(xi)

      Generalized Lift: assume that {X1, X2, ..., Xq} is a q-partition of X, i.e., a partitioning of X into q nonempty and disjoint itemsets Xi. Define the generalized lift of X over partitions of size q as follows:

          lift_q(X) = min over {X1, ..., Xq} of  P(X) / ∏_{i=1}^{q} P(Xi)

      That is, the least value of lift over all q-partitions of X.
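Generalized lift requires enumerating every q-partition of the itemset. A brute-force sketch on the example dataset, fine for small itemsets but exponential in general; the recursive partition generator is my own helper, not from the book:

```python
def partitions_q(items, q):
    """Yield every way to split `items` into q nonempty disjoint blocks."""
    if q == 1:
        yield [items]
        return
    if len(items) < q:
        return
    first, rest = items[0], items[1:]
    # Either place `first` into one block of a q-partition of the rest...
    for part in partitions_q(rest, q):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
    # ...or make `first` its own block beside a (q-1)-partition of the rest.
    for part in partitions_q(rest, q - 1):
        yield [[first]] + part

# Six-transaction example dataset from the earlier slides.
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def rsup(X):
    """Relative support of itemset X."""
    return sum(1 for txn in D.values() if set(X) <= set(txn)) / len(D)

def lift_q(X, q):
    """Minimum of rsup(X) / prod(rsup(Xi)) over all q-partitions of X."""
    best = float("inf")
    for part in partitions_q(list(X), q):
        denom = 1.0
        for block in part:
            denom *= rsup(block)
        best = min(best, rsup(X) / denom)
    return best

print(round(lift_q("AE", 2), 2))   # 1.2
print(round(lift_q("ADE", 2), 3))  # 1.125
```

For ADE the three 2-partitions give lifts 1.5 (A | DE), 1.2 (AD | E), and 1.125 (AE | D), so the generalized lift picks the most pessimistic value, 1.125.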
