Chapter 4: Frequent Itemsets and Association Rules
Jilles Vreeken
IRDM ‘15/16, 5 Nov 2015
Recall the question of the week: how can we mine interesting patterns and useful rules from data?
IRDM Chapter 4, today
1. Definitions
2. Algorithms for Frequent Itemset Mining
3. Association Rules and Interestingness
4. Summarising Collections of Itemsets

You’ll find this covered in Aggarwal, Chapter 4, 5.2 and Zaki & Meira, Ch. 10, 11.
Chapter 4.3: Association Rules
IRDM Chapter 4.3
1. Generating Association Rules
2. Measures of Interestingness
3. Properties of Measures
4. Simpson’s Paradox

You’ll find this covered in Aggarwal, Chapter 4 and Zaki & Meira, Ch. 10.
Generating Association Rules

We can generate association rules from frequent itemsets:
- if Z is a frequent itemset and X ⊂ Z is a proper subset, we get the rule X → Y, where Y = Z ∖ X
- these rules are frequent, because supp(X → Y) = supp(X ∪ Y) = supp(Z)
- we still need to compute the confidence, as conf(X → Y) = supp(Z) / supp(X)

This means that if the rule X → Z ∖ X is not confident, then no rule W → Z ∖ W with W ⊆ X is confident: we can use this to prune the search space.
Pseudo-code: see Algorithm 8.6 in Zaki & Meira. [algorithm listing not reproduced here]
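As the listing itself is not reproduced here, the following is a minimal Python sketch of the idea, not Zaki & Meira’s exact Algorithm 8.6; the function name, the data layout, and the `freq` dictionary are assumptions for illustration:

```python
from itertools import combinations

def generate_rules(freq, minconf):
    """Generate confident rules X -> Z \\ X from frequent itemsets.

    freq    -- dict mapping frozenset itemsets to their supports;
               assumed to contain every frequent itemset together
               with all of its non-empty subsets
    minconf -- minimum confidence threshold in [0, 1]
    """
    rules = []
    for Z in (s for s in freq if len(s) >= 2):
        # start from the maximal proper antecedents and shrink them
        queue = {frozenset(c) for c in combinations(Z, len(Z) - 1)}
        seen = set()
        while queue:
            X = queue.pop()
            seen.add(X)
            conf = freq[Z] / freq[X]
            if conf >= minconf:
                rules.append((set(X), set(Z - X), conf))
                # only confident antecedents are shrunk further:
                # if X -> Z\X is not confident, no W -> Z\W with
                # W a subset of X can be, so that branch is pruned
                if len(X) > 1:
                    queue |= {frozenset(c)
                              for c in combinations(X, len(X) - 1)} - seen
    return rules

freq = {frozenset({'tea'}): 200, frozenset({'coffee'}): 800,
        frozenset({'tea', 'coffee'}): 150}
print(generate_rules(freq, 0.5))
# [({'tea'}, {'coffee'}, 0.75)]; coffee -> tea has confidence
# 150/800 = 0.1875 and is rejected
```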
Measures of interestingness

Consider the following example:

          Coffee   ¬Coffee      ∑
Tea          150        50    200
¬Tea         650       150    800
∑            800       200   1000

The rule {Tea} → {Coffee} has 15% support and 75% confidence, reasonably good numbers. Is this a good rule? The overall fraction of coffee drinkers is 80%: drinking tea reduces the probability of drinking coffee!
Problems with confidence

The support-confidence framework does not take the support of the consequent into account: rules with relatively small support for the antecedent and high support for the consequent often have high confidence.

To fix this, many other measures have been proposed. Most measures are easy to express using contingency tables:

          B      ¬B       ∑
A       s11     s10     s1+
¬A      s01     s00     s0+
∑       s+1     s+0       N

We’ll use s_ij as shorthand for support, s11 = supp(AB), s01 = supp(¬A B), and so on. Analogously, we’ll write f_ij for frequency: f11 = fr(AB), f01 = fr(¬A B), …
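As a quick, self-contained illustration (the function name and the toy baskets are made up for this sketch), the four cells of such a table can be counted directly from a list of transactions:

```python
def contingency(transactions, a, b):
    """2x2 contingency counts (s11, s10, s01, s00) of items a and b
    over a list of transactions, each given as a set of items."""
    s11 = s10 = s01 = s00 = 0
    for t in transactions:
        if a in t and b in t:
            s11 += 1      # both a and b
        elif a in t:
            s10 += 1      # a without b
        elif b in t:
            s01 += 1      # b without a
        else:
            s00 += 1      # neither
    return s11, s10, s01, s00

baskets = [{'tea', 'coffee'}, {'coffee'}, {'coffee', 'milk'}, {'tea'}]
print(contingency(baskets, 'tea', 'coffee'))  # (1, 1, 2, 0)
```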
Statistical Coefficient of Correlation

A natural statistical measure of association between a pair of items is the Pearson correlation coefficient

$$\rho_{XY} = \frac{E[XY] - E[X]E[Y]}{\sigma_X \sigma_Y} = \frac{E[XY] - E[X]E[Y]}{\sqrt{(E[X^2] - E[X]^2)(E[Y^2] - E[Y]^2)}}$$
Pearson Correlation of Items

For items A and B it reduces to

$$\rho_{AB} = \frac{f_{11} - f_{1+} f_{+1}}{\sqrt{f_{1+} f_{+1} (1 - f_{1+})(1 - f_{+1})}}$$

since for a binary item E[A] = E[A²] = f_{1+}, so that σ_A² = f_{1+}(1 − f_{1+}), and E[AB] = f_{11}.

It is +1 when the data is perfectly positively correlated, −1 when perfectly negatively correlated, and 0 when uncorrelated.
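A minimal sketch of this computation from the 2×2 counts (the function name is my own; for two binary items this coefficient is also known as the phi coefficient):

```python
from math import sqrt

def pearson(s11, s10, s01, s00):
    """Pearson correlation of two items from their 2x2 contingency
    counts; assumes neither item occurs in all or in no transactions."""
    n = s11 + s10 + s01 + s00
    f11 = s11 / n             # fr(AB)
    f1p = (s11 + s10) / n     # fr(A) = f_{1+}
    fp1 = (s11 + s01) / n     # fr(B) = f_{+1}
    return (f11 - f1p * fp1) / sqrt(f1p * fp1 * (1 - f1p) * (1 - fp1))

# Tea/Coffee table from before:
print(pearson(150, 50, 650, 150))  # -0.0625: slight negative correlation
```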
Chi-square

χ² is another natural statistical measure of significance for itemsets. For a set of k items, it compares the observed frequencies against the expected frequencies of all 2^k possible states:

$$\chi^2(X) = \sum_{Y \in \mathcal{P}(X)} \frac{(fr(Y) - E_X[fr(Y)])^2}{E_X[fr(Y)]}$$

where 𝒫(X) is the powerset of X and E_X[fr(Y)] is the expected frequency of state Y over itemset X; here fr(Y) denotes the frequency of the state in which the items of Y are present and those of X ∖ Y are absent.

For example, for X = {beer, diapers}, it considers the states (beer, diapers), (¬beer, diapers), (beer, ¬diapers), and (¬beer, ¬diapers). (Brin et al. 1998)
Chi-square (2)

To compute χ²(X) we need to define E_X[fr(Y)]. The standard way is to assume independence between the items of X. That is, the expected probability of a state Y is the product of its individual item frequencies:

$$E_X[fr(Y)] = \prod_{A \in Y} fr(A) \times \prod_{A \in X \setminus Y} (1 - fr(A))$$

The first product is over the items that are present in the state (the 1s); for these the empirical probability is simply fr(·). The second product considers the 0s of the state, in other words the items of X not in Y; the empirical probability of not seeing an item A is (1 − fr(A)).

Note! Independence between items is a very strong assumption, and hence we will find that many itemsets are ‘significantly’ correlated.
Chi-square (3)

$$\chi^2(X) = \sum_{Y \in \mathcal{P}(X)} \frac{(fr(Y) - E_X[fr(Y)])^2}{E_X[fr(Y)]}$$

Chi-square scores close to 0 indicate statistical independence, while larger values indicate stronger dependencies
- there is no differentiation between positive and negative correlation
- it is computationally costly, at O(2^|X|)
- but as it is upward closed, we can mine interesting sets efficiently

Always be thoughtful of how you define your expected frequency!
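Putting the last three slides together, a small self-contained sketch (a helper of my own, not from the slides) that computes χ²(X) over all 2^k states under the independence assumption:

```python
from itertools import combinations

def chi_square(transactions, itemset):
    """Chi-square of an itemset under independence, using relative
    frequencies as in the slide's formula (multiply by the number of
    transactions N to get the classical count-based statistic).
    Assumes each item occurs in some but not all transactions, so
    no expected frequency is zero."""
    n = len(transactions)
    items = list(itemset)
    fr = {a: sum(a in t for t in transactions) / n for a in items}
    chi2 = 0.0
    # enumerate all 2^k states: items in `present` are the 1s
    for k in range(len(items) + 1):
        for present in combinations(items, k):
            absent = [a for a in items if a not in present]
            observed = sum(all(a in t for a in present) and
                           all(a not in t for a in absent)
                           for t in transactions) / n
            expected = 1.0
            for a in present:
                expected *= fr[a]
            for a in absent:
                expected *= 1.0 - fr[a]
            chi2 += (observed - expected) ** 2 / expected
    return chi2

baskets = [{'tea', 'coffee'}, {'coffee'}, {'coffee', 'milk'}, {'tea'}]
print(chi_square(baskets, {'tea', 'coffee'}))  # 0.333...
```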
Interest Ratio

The interest ratio I of rule A → B is

$$I(A, B) = \frac{N \times supp(AB)}{supp(A) \times supp(B)} = \frac{N \, s_{11}}{s_{1+} s_{+1}}$$

It is equivalent to lift = conf(A → B) / fr(B).

Interest ratio compares the frequencies against the assumption that A and B are independent: if A and B are independent, s11 = s1+ s+1 / N.

Interpreting interest ratios:
- I(A, B) = 1 if A and B are independent
- I(A, B) > 1 if A and B are positively correlated
- I(A, B) < 1 if A and B are negatively correlated
The cosine measure

The cosine, or IS, measure of rule A → B is defined as

$$cosine(A, B) = \sqrt{I(A, B) \times supp(AB) / N} = \frac{s_{11}}{\sqrt{s_{1+} \times s_{+1}}}$$

which is the regular cosine if we think of A and B as binary vectors.

It is also the geometric mean of the confidences of A → B and B → A, as

$$cosine(A, B) = \sqrt{\frac{supp(AB)}{supp(A)} \times \frac{supp(AB)}{supp(B)}} = \sqrt{conf(A \to B) \times conf(B \to A)}$$
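Both measures are one-liners over the contingency counts; a sketch (the function names are mine), shown on the Tea/Coffee table in anticipation of the next slide:

```python
from math import sqrt

def interest(s11, s1p, sp1, n):
    """Interest ratio (lift): N * s11 / (s1+ * s+1)."""
    return n * s11 / (s1p * sp1)

def cosine(s11, s1p, sp1):
    """Cosine (IS) measure: s11 / sqrt(s1+ * s+1)."""
    return s11 / sqrt(s1p * sp1)

# Tea/Coffee: s11 = 150, s1+ = 200, s+1 = 800, N = 1000
print(interest(150, 200, 800, 1000))  # 0.9375
print(cosine(150, 200, 800))          # 0.375
```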
Examples (1)

          Coffee   ¬Coffee      ∑
Tea          150        50    200
¬Tea         650       150    800
∑            800       200   1000

The interest ratio of {Tea} → {Coffee} is (1000 × 150) / (200 × 800) = 0.9375: almost 1, so not very interesting; below 1, so a (slight) negative correlation.

The cosine of this rule, however, is 0.375: quite far from 0, so it is interesting.
Examples (2)

         p     ¬p      ∑              r     ¬r      ∑
q      880     50    930     t       20     50     70
¬q      50     20     70     ¬t      50    880    930
∑      930     70   1000     ∑       70    930   1000

I(p, q) = 1.02 and I(r, t) = 4.08
- p and q are close to independent, yet p and q appear together in 88% of cases
- r and t have the highest interest factor, yet r and t appear together only seldom

Now conf(p → q) = 0.946 and conf(r → t) = 0.286.
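For reference, the slide’s numbers check out with plain arithmetic:

```python
# left table: I(p, q) = N * s11 / (s1+ * s+1), conf(p -> q) = s11 / supp(p)
print(1000 * 880 / (930 * 930))  # 1.0175 ~ 1.02
print(880 / 930)                 # 0.946
# right table: the same measures for r and t
print(1000 * 20 / (70 * 70))     # 4.08
print(20 / 70)                   # 0.286
```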
Bottom line: lunch is not free. There is no single measure that works well all the time.