Data Mining
Chapter 5: Association Analysis: Basic Concepts
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

  TID | Items
  ----|---------------------------
  1   | Bread, Milk
  2   | Bread, Diaper, Beer, Eggs
  3   | Milk, Diaper, Beer, Coke
  4   | Bread, Milk, Diaper, Beer
  5   | Bread, Milk, Diaper, Coke

Example of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
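To make these definitions concrete, here is a minimal Python sketch (the function names are ours, not from the text) that computes support count, support, and confidence over the five market-basket transactions above:

```python
# Minimal sketch of the support and confidence calculations.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```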
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules (choose k items for the antecedent, then j of the remaining d−k items for the consequent):

  R = Σ_{k=1}^{d−1} [ C(d,k) × Σ_{j=1}^{d−k} C(d−k,j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.
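As a sanity check on the formula, a short Python sketch (ours, using only the standard library) enumerates the rule count directly and compares it against the closed form:

```python
from math import comb

def rule_count(d):
    """Count all association rules over d items: pick k antecedent items,
    then j consequent items from the remaining d - k."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(rule_count(d))          # 602
print(3**d - 2**(d+1) + 1)    # 602, the closed form
```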
Mining Association Rules

Example of rules from the market-basket data:
  {Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch below)

Frequent itemset generation is still computationally expensive.
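The rule-generation step follows directly from the observation above: enumerate every binary partition of a frequent itemset and keep the high-confidence rules. A hypothetical helper, reusing support_count and transactions from the earlier sketch, might look like:

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """Enumerate every binary partition X -> Y of a frequent itemset and
    keep the rules whose confidence meets minconf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):  # antecedent sizes 1 .. |itemset| - 1
        for X in map(frozenset, combinations(itemset, r)):
            Y = itemset - X
            c = support_count(itemset, transactions) / support_count(X, transactions)
            if c >= minconf:
                rules.append((set(X), set(Y), c))
    return rules

# All six rules from {Milk, Diaper, Beer} share support 0.4; confidence varies:
for X, Y, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, 0.0):
    print(X, "->", Y, f"c={c:.2f}")
```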
Frequent Itemset Generation

[Figure: the itemset lattice over items {A, B, C, D, E}, from the null set at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw), for N transactions, M = 2^d candidates, and maximum transaction width w ⇒ expensive since M = 2^d!
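A brute-force sketch in Python makes the O(NMw) cost visible: every one of the M = 2^d − 1 nonempty itemsets in the lattice is matched against all N transactions (function names are illustrative, not from the book):

```python
from itertools import chain, combinations

def brute_force_frequent(transactions, minsup):
    """Enumerate every nonempty candidate itemset in the lattice and count
    its support with a full database scan: O(N * M * w) work overall."""
    items = sorted(set().union(*transactions))
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    frequent = {}
    for cand in map(frozenset, candidates):                  # M = 2^d - 1 candidates
        count = sum(1 for t in transactions if cand <= t)    # scan N transactions
        if count / len(transactions) >= minsup:
            frequent[cand] = count
    return frequent

print(brute_force_frequent(transactions, minsup=0.6))
```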
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
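The anti-monotone property is what makes level-wise pruning sound: a (k+1)-itemset can only be frequent if all of its k-subsets are. Below is a minimal Apriori-style sketch (ours, not the book's pseudocode) that builds each new level only from frequent itemsets and prunes any candidate with an infrequent subset:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset generation with the Apriori prune."""
    items = sorted(set().union(*transactions))
    frequent = {}
    level, k = [frozenset([i]) for i in items], 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(current)
        # Join step: union two frequent k-itemsets into a (k+1)-candidate,
        # then prune it unless every k-subset is frequent (anti-monotonicity).
        next_level = set()
        for a, b in combinations(current, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                    frozenset(s) in current for s in combinations(cand, k)):
                next_level.add(cand)
        level, k = sorted(next_level, key=sorted), k + 1
    return frequent

print(apriori(transactions, minsup_count=3))
```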
Illustrating Apriori Principle

[Figure: itemset lattice in which one node is found to be infrequent, so all of its supersets are pruned.]

Illustrating Apriori Principle

  TID | Items
  ----|---------------------------
  1   | Bread, Milk
  2   | Beer, Bread, Diaper, Eggs
  3   | Beer, Coke, Diaper, Milk
  4   | Beer, Bread, Diaper, Milk
  5   | Bread, Coke, Diaper, Milk

Minimum support count = 3

Items (1-itemsets):

  Item   | Count
  -------|------
  Bread  | 4
  Coke   | 2
  Milk   | 4
  Beer   | 3
  Diaper | 4
  Eggs   | 1

Coke and Eggs are infrequent, so there is no need to generate candidates involving Coke or Eggs.

Pairs (2-itemsets):

  Itemset         | Count
  ----------------|------
  {Bread, Milk}   | 3
  {Beer, Bread}   | 2
  {Bread, Diaper} | 3
  {Beer, Milk}    | 2
  {Diaper, Milk}  | 3
  {Beer, Diaper}  | 3

Triplets (3-itemsets), generated from the four frequent items:
  {Beer, Bread, Diaper}
  {Beer, Bread, Milk}
  {Beer, Diaper, Milk}
  {Bread, Diaper, Milk}

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 4 = 16 candidates.
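The candidate counts above can be checked with a couple of lines of Python (math.comb is the standard-library binomial coefficient):

```python
from math import comb

# Candidates up to size 3 if every subset of the 6 items is considered:
print(comb(6, 1) + comb(6, 2) + comb(6, 3))   # 6 + 15 + 20 = 41

# After dropping the infrequent items Coke and Eggs, only the 4 remaining
# items generate pair and triplet candidates:
print(6 + comb(4, 2) + comb(4, 3))            # 6 + 6 + 4 = 16
```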