  1. Association Rule Mining with R ∗ Yanchang Zhao http://www.RDataMining.com Short Course on R and Data Mining University of Canberra 7 October 2016 ∗ Chapter 9 - Association Rules, in R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf 1 / 58

  2. Outline Basics of Association Rules Algorithms: Apriori, ECLAT and FP-growth Interestingness Measures Applications Association Rule Mining with R Removing Redundancy Interpreting Rules Visualizing Association Rules Further Readings and Online Resources 2 / 58

  3. Association Rules ◮ To discover association rules showing itemsets that occur together frequently [Agrawal et al., 1993]. ◮ Widely used to analyze retail basket or transaction data. ◮ An association rule is of the form A ⇒ B, where A and B are items or attribute-value pairs. ◮ The rule means that database tuples containing the items on the left-hand side of the rule are also likely to contain the items on the right-hand side. ◮ Examples of association rules: ◮ bread ⇒ butter ◮ computer ⇒ software ◮ age in [20,29] & income in [60K,100K] ⇒ buying up-to-date mobile handsets 3 / 58

  4. Association Rules Association rules are rules presenting association or correlation between itemsets.
     support(A ⇒ B)    = P(A ∪ B)
     confidence(A ⇒ B) = P(B | A) = P(A ∪ B) / P(A)
     lift(A ⇒ B)       = confidence(A ⇒ B) / P(B) = P(A ∪ B) / (P(A) P(B))
  where P(A) is the percentage (or probability) of cases containing A. 4 / 58
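These measures are easy to compute directly from transaction data. The following is a minimal base-R sketch (the helper functions contains, support, confidence and lift are made up here for illustration and are not part of the slides' code) that evaluates the three measures on a toy list of market baskets.

contains <- function(transactions, items)    # TRUE for transactions containing all 'items'
  vapply(transactions, function(t) all(items %in% t), logical(1))

support <- function(transactions, A, B)      # P(A ∪ B): fraction of transactions with both A and B
  mean(contains(transactions, c(A, B)))

confidence <- function(transactions, A, B)   # P(B | A) = P(A ∪ B) / P(A)
  support(transactions, A, B) / mean(contains(transactions, A))

lift <- function(transactions, A, B)         # confidence(A ⇒ B) / P(B)
  confidence(transactions, A, B) / mean(contains(transactions, B))

## Toy data: four market baskets
tx <- list(c("bread", "butter"), c("bread", "butter", "milk"),
           c("bread", "milk"), c("butter", "milk"))
support(tx, "bread", "butter")     # 2/4 = 0.5
confidence(tx, "bread", "butter")  # 0.5 / 0.75  ≈ 0.67
lift(tx, "bread", "butter")        # 0.67 / 0.75 ≈ 0.89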

  11. An Example ◮ Assume there are 100 students. ◮ 10 of them know data mining techniques, 8 know the R language and 6 know both. ◮ knows R ⇒ knows data mining ◮ support = P(R & data mining) = 6/100 = 0.06 ◮ confidence = support / P(R) = 0.06/0.08 = 0.75 ◮ lift = confidence / P(data mining) = 0.75/0.10 = 7.5 5 / 58

  12. Association Rule Mining ◮ Association Rule Mining is normally composed of two steps: ◮ Finding all frequent itemsets whose supports are no less than a minimum support threshold; ◮ From the above frequent itemsets, generating association rules with confidence above a minimum confidence threshold. ◮ The second step is straightforward, but the first one, frequent itemset generation, is computationally intensive. ◮ The number of possible itemsets is 2^n − 1, where n is the number of unique items. ◮ Algorithms: Apriori, ECLAT, FP-Growth 6 / 58
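In R, one common option for both steps is the arules package, whose apriori() function takes the two thresholds as parameters. The sketch below assumes arules is installed and uses its bundled Groceries dataset.

## Sketch: mining association rules with arules (assumes the package is installed)
library(arules)
data("Groceries")                               # built-in retail transaction data
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01,  # minimum support threshold
                                  conf = 0.5))  # minimum confidence threshold
inspect(head(sort(rules, by = "lift"), 5))      # show the top 5 rules by lift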

  13. Downward-Closure Property ◮ Downward-closure property of support, a.k.a. anti-monotonicity ◮ For a frequent itemset, all its subsets are also frequent. E.g., if {A, B} is frequent, then both {A} and {B} are frequent. ◮ For an infrequent itemset, all its supersets are also infrequent. E.g., if {A} is infrequent, then {A, B}, {A, C} and {A, B, C} are infrequent. ◮ Useful for pruning candidate itemsets 7 / 58

  14. Itemset Lattice 8 / 58

  15. Outline Basics of Association Rules Algorithms: Apriori, ECLAT and FP-growth Interestingness Measures Applications Association Rule Mining with R Removing Redundancy Interpreting Rules Visualizing Association Rules Further Readings and Online Resources 9 / 58

  16. Apriori ◮ Apriori [Agrawal and Srikant, 1994]: a classic algorithm for association rule mining ◮ A level-wise, breadth-first algorithm ◮ Counts transactions to find frequent itemsets ◮ Generates candidate itemsets by exploiting the downward-closure property of support 10 / 58

  17. Apriori Process 1. Find all frequent 1-itemsets L1 2. Join step: generate candidate k-itemsets by joining Lk−1 with itself 3. Prune step: prune candidate k-itemsets using the downward-closure property 4. Scan the dataset to count the frequency of candidate k-itemsets and select the frequent k-itemsets Lk 5. Repeat the above process until no more frequent itemsets can be found. 11 / 58
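To make the level-wise process concrete, here is a toy base-R sketch of the loop above (apriori_sketch and its helpers are invented for illustration and are far less efficient than real implementations); the join, prune and count blocks correspond to steps 2-4.

## Toy Apriori sketch in base R; minsup is an absolute support count
apriori_sketch <- function(transactions, minsup = 2) {
  count <- function(itemset)                       # support count of an itemset
    sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))

  ## Step 1: frequent 1-itemsets L1
  items <- sort(unique(unlist(transactions)))
  Lk <- Filter(function(s) count(s) >= minsup, as.list(items))

  frequent <- Lk
  k <- 2
  while (length(Lk) > 0) {
    ## Step 2 (join): combine frequent (k-1)-itemsets into candidate k-itemsets
    cand <- list()
    for (i in seq_along(Lk)) for (j in seq_along(Lk)) if (i < j) {
      u <- sort(union(Lk[[i]], Lk[[j]]))
      if (length(u) == k) cand <- c(cand, list(u))
    }
    cand <- unique(cand)
    ## Step 3 (prune): drop candidates having an infrequent (k-1)-subset
    all_subsets_frequent <- function(cs)
      all(vapply(seq_along(cs), function(d)
        any(vapply(Lk, function(l) setequal(l, cs[-d]), logical(1))), logical(1)))
    cand <- Filter(all_subsets_frequent, cand)
    ## Step 4 (count): scan the data and keep the frequent k-itemsets Lk
    Lk <- Filter(function(s) count(s) >= minsup, cand)
    frequent <- c(frequent, Lk)
    k <- k + 1                                     # Step 5: repeat for the next level
  }
  frequent
}

## Toy usage: minimum support count of 2 transactions
tx <- list(c("bread", "butter"), c("bread", "butter", "milk"),
           c("bread", "milk"), c("butter", "milk"))
apriori_sketch(tx, minsup = 2)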

  18. From [Zaki and Meira, 2014] 12 / 58

  19. FP-growth ◮ FP-growth: frequent-pattern growth, which mines frequent itemsets without candidate generation [Han et al., 2004] ◮ Compresses the input database into an FP-tree that represents the frequent items. ◮ Divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. ◮ Each such database is mined separately. ◮ It reduces search costs by looking for short patterns recursively and then concatenating them into long frequent patterns. † † https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm 13 / 58

  20. FP-tree ◮ The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a dataset. It has two components: ◮ A root labeled “null” with a set of item-prefix subtrees as children ◮ A frequent-item header table ◮ Each node has three attributes: ◮ Item name ◮ Count: the number of transactions represented by the path from the root to the node ◮ Node link: a link to the next node having the same item name ◮ Each entry in the frequent-item header table also has three attributes: ◮ Item name ◮ Head of node link: points to the first node in the FP-tree having the same item name ◮ Count: frequency of the item 14 / 58
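The construction of such a tree can be sketched in a few lines of base R (build_fptree is an invented name; the header table and node links described above are omitted to keep the sketch short): each transaction keeps only its frequent items, orders them by global frequency, and inserts the resulting path into the tree, incrementing counts along shared prefixes.

## Toy FP-tree construction sketch (nested lists; no header table / node links)
build_fptree <- function(transactions, minsup = 2) {
  ## Count item frequencies and keep only frequent items, most frequent first
  freq <- sort(table(unlist(transactions)), decreasing = TRUE)
  freq <- freq[freq >= minsup]

  root <- list(item = "null", count = 0, children = list())

  ## Insert one frequency-ordered path, incrementing counts along the way
  insert <- function(node, items) {
    if (length(items) == 0) return(node)
    first <- items[1]
    child <- node$children[[first]]
    if (is.null(child)) child <- list(item = first, count = 0, children = list())
    child$count <- child$count + 1
    node$children[[first]] <- insert(child, items[-1])
    node
  }

  for (t in transactions) {
    path <- names(freq)[names(freq) %in% t]   # frequent items of t, in global frequency order
    root <- insert(root, path)
  }
  root
}

## Toy usage
tx <- list(c("f", "a", "c", "d", "g", "m", "p"),
           c("a", "b", "c", "f", "l", "m", "o"),
           c("b", "f", "h", "o"))
str(build_fptree(tx, minsup = 2), max.level = 4)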

  21. FP-tree From [Han, 2005] 15 / 58

  22. ECLAT ◮ ECLAT: equivalence class transformation [Zaki et al., 1997] ◮ A depth-first search algorithm using set intersection ◮ Idea: use tidset intersection to compute the support of a candidate itemset, avoiding the generation of subsets that do not exist in the prefix tree. ◮ t(AB) = t(A) ∩ t(B) ◮ support(AB) = |t(AB)| ◮ ECLAT intersects the tidsets only if the frequent itemsets share a common prefix. ◮ It traverses the prefix search tree in a DFS-like manner, processing a group of itemsets that have the same prefix, also called a prefix equivalence class. 16 / 58
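The tidset idea is easy to demonstrate with a toy example in base R: represent each item by the set of ids of the transactions containing it, and intersect tidsets to obtain the support of a larger itemset.

## Toy tidset illustration
tx <- list(c("A", "B", "C"), c("A", "B"), c("B", "C"), c("A", "C"), c("A", "B", "C"))
tid <- function(item)                         # ids of transactions containing 'item'
  which(vapply(tx, function(t) item %in% t, logical(1)))
tA  <- tid("A")                               # t(A)  = 1 2 4 5
tB  <- tid("B")                               # t(B)  = 1 2 3 5
tAB <- intersect(tA, tB)                      # t(AB) = 1 2 5
length(tAB)                                   # support(AB) = 3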

  23. ECLAT ◮ It works recursively. ◮ The initial call uses all single items with their tidsets. ◮ In each recursive call, it combines each itemset–tidset pair (X, t(X)) with all the other pairs to generate new candidates. If a new candidate is frequent, it is added to the set Px. ◮ It then recursively finds all frequent itemsets in the X branch. 17 / 58
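In R, an ECLAT implementation is available as eclat() in the arules package; the sketch below (assuming arules is installed, and again using its Groceries data) mines frequent itemsets and then derives rules from them with ruleInduction().

## Sketch: ECLAT with arules (assumes the package is installed)
library(arules)
data("Groceries")
itemsets <- eclat(Groceries,
                  parameter = list(supp = 0.02, maxlen = 5))   # frequent itemsets
rules <- ruleInduction(itemsets, Groceries, confidence = 0.5)  # rules from the itemsets
inspect(head(sort(rules, by = "lift")))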

  24. ECLAT From [Zaki and Meira, 2014] 18 / 58

  25. Outline Basics of Association Rules Algorithms: Apriori, ECLAT and FP-growth Interestingness Measures Applications Association Rule Mining with R Removing Redundancy Interpreting Rules Visualizing Association Rules Further Readings and Online Resources 19 / 58

  26. Interestingness Measures ◮ Which rules or patterns are the most interesting ones? One way is to rank the discovered rules or patterns with interestingness measures. ◮ The measures of rule interestingness fall into two categories, subjective and objective [Freitas, 1998, Silberschatz and Tuzhilin, 1996]. ◮ Objective measures, such as lift, odds ratio and conviction, are often data-driven and give the interestingness in terms of statistics or information theory. ◮ Subjective (user-driven) measures, e.g., unexpectedness and actionability, focus on finding interesting patterns by matching against a given set of user beliefs. 20 / 58

  27. Objective Interestingness Measures ◮ Support, confidence and lift are the most widely used objective measures to select interesting rules. ◮ Many other objective measures were introduced by Tan et al. [Tan et al., 2002], such as the φ-coefficient, odds ratio, kappa, mutual information, J-measure, Gini index, Laplace, conviction, interest and cosine. ◮ Their study shows that different measures have different intrinsic properties and no single measure is better than the others in all application domains. ◮ In addition, any-confidence, all-confidence and bond were proposed by Omiecinski [Omiecinski, 2003]. ◮ Utility was used by Chan et al. [Chan et al., 2003] to find top-k objective-directed rules. ◮ Unexpected Confidence Interestingness and Isolated Interestingness were proposed by Dong and Li [Dong and Li, 1998], which consider a rule's unexpectedness in terms of the other association rules in its neighbourhood. 21 / 58
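Many of these objective measures can be computed in R with the interestMeasure() function of the arules package; a brief sketch (assuming arules is installed and rules have already been mined, e.g. from the Groceries data) is shown below.

## Sketch: extra objective measures with arules (assumes the package is installed)
library(arules)
data("Groceries")
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
extra <- interestMeasure(rules,
                         measure = c("oddsRatio", "conviction", "cosine"),
                         transactions = Groceries)
head(cbind(quality(rules), extra))   # default quality measures plus the extra ones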
