
Data Mining and Exploration: Association Rules

Lecture slides by Amos Storkey, School of Informatics, February 7, 2006. Topics: itemsets and association rules; frequency and accuracy; the APRIORI algorithm; comments on association rules.


  1. Data Mining and Exploration: Association Rules
     Amos Storkey, School of Informatics
     February 7, 2006
     http://www.inf.ed.ac.uk/teaching/courses/dme/
     These lecture slides are based extensively on previous versions of the course written by Chris Williams.

     Association Rules
     ◮ Itemsets, association rules
     ◮ Frequency, accuracy
     ◮ APRIORI algorithm
     ◮ Comments on Association Rules
     Reading: HMS chapter 13
     Additional reading: Witten and Frank § 4.5, Han and Kamber § 6.1, 6.2

     About Association Rules
     ◮ We are looking for patterns, i.e. local regularities in the data
     ◮ Examples of frequent itemsets, association rules
       ◮ 10% of supermarket customers buy wine and cheese
       ◮ If a person visits the CNN website, there is a 60% chance that they will visit the ABC website in the same month
     ◮ Association rules are like classification rules, except that they can predict any attribute, not just the class
     ◮ Association rules are not intended to be used together as a set (cf. classification rules)
     ◮ Example of association rules: market basket analysis, the process of analyzing customer buying habits by finding associations between items that customers place in their "shopping baskets"
     ◮ Each row of the data matrix has a 1 if the corresponding product was in the basket. Data is often sparse
     ◮ Can recode k-valued categorical variables (e.g. outlook = {sunny, overcast, rainy}) as k binary variables
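The recoding of a k-valued categorical variable as k binary variables can be sketched as follows. This is a minimal illustration only: the function name `recode_binary` and the `variable=value` column naming are assumptions made for the example, not something the slides prescribe.

```python
def recode_binary(records, variable):
    """Recode one k-valued categorical variable as k binary (0/1) variables."""
    # The k distinct values observed for this variable, in a stable order.
    values = sorted({r[variable] for r in records})
    recoded = []
    for r in records:
        # Keep every other variable unchanged.
        row = {name: value for name, value in r.items() if name != variable}
        # One indicator column per value; exactly one of them is 1 in each row.
        for v in values:
            row[f"{variable}={v}"] = 1 if r[variable] == v else 0
        recoded.append(row)
    return recoded


# Hypothetical records using the outlook example from the slides.
records = [
    {"outlook": "sunny"},
    {"outlook": "overcast"},
    {"outlook": "rainy"},
    {"outlook": "sunny"},
]
binary = recode_binary(records, "outlook")
for row in binary:
    print(row)
```

After recoding, every row is a set of 0/1 indicators, which is the data-matrix form assumed by the market basket description above.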

  2. Itemsets, Frequency, Accuracy
     ◮ An itemset is a pattern defined by
       $(A_{i_1} = a_{j_1}) \wedge (A_{i_2} = a_{j_2}) \wedge \dots \wedge (A_{i_k} = a_{j_k})$
     ◮ The frequency (or support) of an itemset X is simply P(X)
     ◮ Example: in the "Play Tennis" data
       P(Humidity = Normal ∧ Play = Yes ∧ Windy = False) = 4/14
     ◮ The accuracy (or confidence) of an association rule "if Y = y then Z = z" is
       P(Z = z | Y = y)
     ◮ Example
       P(Windy = False ∧ Play = Yes | Humidity = Normal) = 4/7

     Play Tennis Example

     Day  Outlook   Temperature  Humidity  Wind   PlayTennis
     D1   Sunny     Hot          High      False  No
     D2   Sunny     Hot          High      True   No
     D3   Overcast  Hot          High      False  Yes
     D4   Rain      Mild         High      False  Yes
     D5   Rain      Cool         Normal    False  Yes
     D6   Rain      Cool         Normal    True   No
     D7   Overcast  Cool         Normal    True   Yes
     D8   Sunny     Mild         High      False  No
     D9   Sunny     Cool         Normal    False  Yes
     D10  Rain      Mild         Normal    False  Yes
     D11  Sunny     Mild         Normal    True   Yes
     D12  Overcast  Mild         High      True   Yes
     D13  Overcast  Hot          Normal    False  Yes
     D14  Rain      Mild         High      True   No

     Generating rules from itemsets
     ◮ An itemset of size k can give rise to $2^k - 1$ rules
     ◮ Example. Itemset
       Windy=False, Play=Yes, Humidity=Normal
       gives rise to 7 rules including
       IF Windy=False and Humidity=Normal THEN Play=Yes (4/4)
       IF Play=Yes THEN Humidity=Normal and Windy=False (4/9)
       IF True THEN Windy=False and Play=Yes and Humidity=Normal (4/14)
     ◮ Select association rules that have accuracy greater than some threshold a

     Finding Frequent Itemsets
     ◮ Task: find all itemsets with frequency ≥ s
     ◮ Key observation: a set X of variables can be frequent only if all subsets of variables are frequent (monotonicity property), i.e. P(A, B) ≤ P(A) and P(A, B) ≤ P(B)
     ◮ So find frequent singleton sets, then sets of size 2, and so on ...
     ◮ An efficient algorithm using this idea for finding frequent itemsets is the APRIORI algorithm (Agrawal and Srikant (1994), Mannila et al (1994))
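The frequency and accuracy figures quoted for the Play Tennis data can be checked with a short sketch. The helper names (`count`, `support`, `confidence`, `rules`) are illustrative assumptions, and the Wind column is named `Windy` here to match the formulas; the 14 rows are copied from the Play Tennis table.

```python
from itertools import combinations

COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("Sunny", "Hot", "High", False, "No"),          # D1
    ("Sunny", "Hot", "High", True, "No"),           # D2
    ("Overcast", "Hot", "High", False, "Yes"),      # D3
    ("Rain", "Mild", "High", False, "Yes"),         # D4
    ("Rain", "Cool", "Normal", False, "Yes"),       # D5
    ("Rain", "Cool", "Normal", True, "No"),         # D6
    ("Overcast", "Cool", "Normal", True, "Yes"),    # D7
    ("Sunny", "Mild", "High", False, "No"),         # D8
    ("Sunny", "Cool", "Normal", False, "Yes"),      # D9
    ("Rain", "Mild", "Normal", False, "Yes"),       # D10
    ("Sunny", "Mild", "Normal", True, "Yes"),       # D11
    ("Overcast", "Mild", "High", True, "Yes"),      # D12
    ("Overcast", "Hot", "Normal", False, "Yes"),    # D13
    ("Rain", "Mild", "High", True, "No"),           # D14
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def count(itemset):
    """Number of records matching every attribute=value pair in the itemset."""
    return sum(all(r[a] == v for a, v in itemset.items()) for r in DATA)


def support(itemset):
    """Frequency (support) of an itemset: P(X)."""
    return count(itemset) / len(DATA)


def confidence(antecedent, consequent):
    """Accuracy (confidence) of 'if antecedent then consequent': P(Z | Y)."""
    return count({**antecedent, **consequent}) / count(antecedent)


def rules(itemset):
    """Yield all 2^k - 1 rules from a size-k itemset as
    (antecedent, consequent, matches of whole itemset, matches of antecedent)."""
    items = list(itemset.items())
    # The antecedent may be any proper subset, including the empty one (IF True).
    for size in range(len(items)):
        for chosen in combinations(items, size):
            antecedent = dict(chosen)
            consequent = {a: v for a, v in items if a not in antecedent}
            yield antecedent, consequent, count(itemset), count(antecedent)


itemset = {"Windy": False, "Play": "Yes", "Humidity": "Normal"}
print(count(itemset), "of", len(DATA))  # 4 of 14
for antecedent, consequent, hits, base in rules(itemset):
    print(f"IF {antecedent or True} THEN {consequent} ({hits}/{base})")
```

Selecting rules then amounts to keeping those whose ratio `hits / base` exceeds the chosen accuracy threshold a.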

  3. APRIORI algorithm (for binary variables)

       i = 1
       C_i = {{A} | A is a variable}
       while C_i is not empty
         database pass:
           for each set in C_i test if it is frequent
           let L_i be collection of frequent sets from C_i
         candidate formation:
           let C_{i+1} be those sets of size i + 1 all of whose subsets are frequent
         i = i + 1
       end while

     ◮ Single database pass is linear in $|C_i|\,n$; make a pass for each i until $C_i$ is empty
     ◮ Candidate formation
       ◮ Find all pairs of sets {U, V} from $L_i$ such that U ∪ V has size i + 1 and test if this union is really a potential candidate. $O(|L_i|^3)$
     ◮ Example: 5 three-item sets
       (ABC), (ABD), (ACD), (ACE), (BCD)
       Candidate four-item sets
       (ABCD) ok
       (ACDE) not ok because (CDE) is not present above
     ◮ Data structure techniques can be used for speedups
     ◮ Other algorithms possible for finding frequent itemsets, e.g. Han's FP-growth

     APRIORI and Algorithm Components
     ◮ Finding association rules as Exploratory Data Analysis
     ◮ Task: Rule Pattern Discovery
     ◮ Structure: Association Rules
     ◮ Score Function: Support
     ◮ Search: Breadth First with Pruning
     ◮ Data Management Technique: Linear Scans

     Comments on Association Rules
     ◮ Finding association rules is just the beginning in a data mining effort. Some will be trivial, others interesting. The challenge is to select potentially interesting rules
     ◮ Trivial rule example:
       pregnant ⇒ female
       with accuracy 1!
     ◮ For rule A ⇒ B, it can be useful to compare P(B | A) to P(B)
     ◮ APRIORI algorithm can be generalized to frequent structure mining, e.g. finding episodes from sequences or frequently-occurring trees
     ◮ Example application: Health Insurance Commission (HIC) in Australia detected patterns of ordering of medical tests that suggested that some of the tests ordered were unnecessary (Cabeña et al, 1998)
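The level-wise search and the candidate formation step can be sketched in a few lines. This is an unoptimised illustration under stated assumptions: transactions are Python sets of items (binary variables that are "on"), the names `form_candidates`, `apriori` and `min_support` are made up for the example, and none of the data structure speedups mentioned above are attempted.

```python
from itertools import combinations


def form_candidates(frequent, size):
    """Candidate formation: unions of pairs of frequent sets that have `size`
    items and all of whose (size - 1)-item subsets are frequent."""
    frequent = set(frequent)
    candidates = set()
    for u, v in combinations(frequent, 2):
        union = u | v
        if len(union) == size and all(
            frozenset(subset) in frequent
            for subset in combinations(union, size - 1)
        ):
            candidates.add(union)
    return candidates


def apriori(transactions, min_support):
    """Return {itemset: frequency} for all itemsets with frequency >= min_support."""
    n = len(transactions)
    # C_1: every single variable is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    i = 1
    while candidates:
        # Database pass: count how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c for c, k in counts.items() if k / n >= min_support}  # L_i
        frequent.update({c: counts[c] / n for c in level})
        # Candidate formation for the next level, C_{i+1}.
        candidates = form_candidates(level, i + 1)
        i += 1
    return frequent


# The five three-item sets from the example above.
L3 = [frozenset(s) for s in ("ABC", "ABD", "ACD", "ACE", "BCD")]
print(form_candidates(L3, 4))  # only ABCD survives; ACDE is pruned

# Hypothetical baskets, not taken from the slides.
baskets = [{"wine", "cheese"}, {"wine", "cheese", "bread"}, {"bread"}, {"wine"}]
result = apriori(baskets, min_support=0.5)
```

Checking every pair from the current level is the simple approach the slide describes; practical implementations typically restrict which pairs are joined and use hash trees or similar structures for the counting pass.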

  4. Summary
     ◮ Finding frequent itemsets
       ◮ Done with APRIORI algorithm
     ◮ Given frequent itemsets, construct association rules with accuracy > a
     ◮ Select interesting rules
     ◮ Generalize to frequent structure mining
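One simple aid for selecting interesting rules is the comparison of P(B | A) with P(B) suggested earlier for a rule A ⇒ B. The sketch below uses hypothetical baskets (not from the slides) and an assumed helper name `compare`; the ratio of the two quantities is commonly called lift.

```python
def compare(transactions, a, b):
    """Return (P(B | A), P(B)) for the rule a => b over a list of item sets."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_b = sum(1 for t in transactions if b in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    return n_ab / n_a, n_b / n


# Hypothetical data: bread is in 9 of 10 baskets, milk in 5, both in 4.
baskets = [{"milk", "bread"}] * 4 + [{"milk"}] + [{"bread"}] * 5

conf, base = compare(baskets, "milk", "bread")
print(conf, base)   # 0.8 0.9
print(conf / base)  # below 1: buying milk does not make bread more likely
```

Here the rule milk ⇒ bread has accuracy 0.8, which looks strong in isolation, yet bread is bought even more often overall, so the rule says little.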
