  1. Data Mining Techniques: Frequent Patterns in Sets and Sequences
     Mirek Riedewald
     Some slides based on presentations by Han/Kamber and Tan/Steinbach/Kumar

     Frequent Pattern Mining Overview
     • Basic Concepts and Challenges
     • Efficient and Scalable Methods for Frequent Itemsets and Association Rules
     • Pattern Interestingness Measures
     • Sequence Mining

  2. What Is Frequent Pattern Analysis?
     • Find patterns (itemsets, sequences, structures, etc.) that occur frequently in a data set
     • First proposed for frequent itemset and association rule mining
     • Motivation: find inherent regularities in data
       – What products were often purchased together?
       – What are the subsequent purchases after buying a PC?
       – What kinds of DNA are sensitive to a new drug?
     • Applications: market basket analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis

     Association Rule Mining
     • Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction

     Market-basket transactions:
       TID | Items
        1  | Bread, Milk
        2  | Bread, Diaper, Beer, Eggs
        3  | Milk, Diaper, Beer, Coke
        4  | Bread, Milk, Diaper, Beer
        5  | Bread, Milk, Diaper, Coke

     Example association rules:
       {Diaper} → {Beer}
       {Milk, Bread} → {Eggs, Coke}
       {Beer, Bread} → {Milk}
     • Implication means co-occurrence, not causality!

  3. Definition: Frequent Itemset
     • Itemset: a collection of one or more items
       – Example: {Milk, Bread, Diaper}
       – k-itemset: an itemset that contains k items
     • Support count (σ): frequency of occurrence of an itemset
       – E.g., σ({Milk, Bread, Diaper}) = 2
     • Support (s): fraction of transactions that contain an itemset
       – E.g., s({Milk, Bread, Diaper}) = 2/5
     • Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold

     Definition: Association Rule
     • An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets
       – Example: {Milk, Diaper} → {Beer}
     • Rule evaluation metrics
       – Support: s(X → Y) = P(X ∪ Y), estimated by the fraction of transactions that contain both X and Y
       – Confidence: c(X → Y) = P(Y | X), estimated by the fraction of transactions containing X that also contain Y
     • Example for {Milk, Diaper} → {Beer}, using the transaction table above:
         s = σ({Milk, Diaper, Beer}) / |D| = 2/5
         c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3
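To make these definitions concrete, here is a minimal Python sketch (not part of the original slides; the helper names support_count, support, and confidence are my own) that reproduces the numbers above for the five example transactions:

```python
# The five market-basket transactions from the slides, as sets of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """c(X -> Y) = sigma(X u Y) / sigma(X), an estimate of P(Y | X)."""
    return support_count(lhs | rhs, db) / support_count(lhs, db)

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4 (= 2/5)
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666... (= 2/3)
```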

  4. Association Rule Mining Task
     • Given a transaction database DB, find all rules having support ≥ minsup and confidence ≥ minconf
     • Brute-force approach:
       – List all possible association rules
       – Compute support and confidence for each rule
       – Remove rules that fail the minsup or minconf thresholds
       – Computationally prohibitive!

     Mining Association Rules
     Example rules from the transactions above:
       {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
       {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
       {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
       {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
       {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
       {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
     Observations:
     • All of the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
     • Rules originating from the same itemset have identical support but can have different confidence
     • Thus we may decouple the support and confidence requirements (see the sketch below)
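The observation that all binary partitions of one itemset share the same support can be checked directly. The following self-contained sketch (my own, not from the slides) enumerates all six rules derived from {Milk, Diaper, Beer} and prints their support and confidence:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

itemset = {"Milk", "Diaper", "Beer"}

# Each non-empty proper subset X yields one rule X -> (itemset - X).
for r in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), r):
        lhs = set(lhs)
        rhs = itemset - lhs
        s = sigma(itemset, transactions) / len(transactions)  # same for every rule
        c = sigma(itemset, transactions) / sigma(lhs, transactions)
        print(f"{sorted(lhs)} -> {sorted(rhs)}: s={s:.2f}, c={c:.2f}")
```

All six rules print s=0.40, while confidence ranges from 0.50 to 1.00, which is exactly why support can be handled once per itemset and confidence per rule.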

  5. Mining Association Rules
     • Two-step approach:
       1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
       2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of the frequent itemset
     • Frequent itemset generation is still computationally expensive

     Frequent Itemset Generation
     [Figure: the itemset lattice over items A–E, from the null set at the top down to ABCDE at the bottom.]
     Given d items, there are 2^d possible candidate itemsets.
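A quick way to see the 2^d blow-up is to enumerate the lattice explicitly. This small snippet (illustrative only) uses itertools to generate every subset of d = 5 items, matching the lattice in the figure:

```python
from itertools import chain, combinations

items = ["A", "B", "C", "D", "E"]

# All subsets of the d items, from the empty set down to the full itemset.
lattice = list(chain.from_iterable(
    combinations(items, k) for k in range(len(items) + 1)))

print(len(lattice))      # 32
print(2 ** len(items))   # 32 = 2^d
```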

  6. Frequent Itemset Generation
     • Brute-force approach:
       – Each itemset in the lattice is a candidate frequent itemset
       – Count the support of each candidate by scanning the database: match each of the N transactions (of maximum width w) against each of the M candidates
       – Complexity ~ O(N·M·w) ⇒ expensive, since M = 2^d

     Computational Complexity
     • Given d unique items, the total number of itemsets is 2^d
     • Total number of possible association rules:
         R = Σ_{k=1}^{d−1} [ C(d,k) × Σ_{j=1}^{d−k} C(d−k, j) ]
           = 3^d − 2^{d+1} + 1
     • If d = 6, R = 602 possible rules (checked in the sketch below)
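As a sanity check on the rule-count formula, the following sketch (my own, not from the slides) evaluates both the double sum and the closed form for d = 6 and confirms they agree:

```python
from math import comb

def rules_closed_form(d):
    """Closed form: R = 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1

def rules_by_sum(d):
    """Double sum: choose k items for the left-hand side, then j of the
    remaining d-k items for the right-hand side (both non-empty)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(rules_closed_form(6), rules_by_sum(6))  # 602 602
```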

  7. Frequent Pattern Mining Overview
     • Basic Concepts and Challenges
     • Efficient and Scalable Methods for Frequent Itemsets and Association Rules
     • Pattern Interestingness Measures
     • Sequence Mining

     Frequent Itemset Generation Strategies
     • Reduce the number of candidates (M)
       – Complete search: M = 2^d
       – Use pruning techniques to reduce M
     • Reduce the number of transactions (N)
       – Skip short transactions as the size of the itemsets increases
     • Reduce the number of comparisons (N·M)
       – Use efficient data structures to store the candidates or transactions
       – No need to match every candidate against every transaction

  8. Reducing the Number of Candidates
     • Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
     • The Apriori principle holds due to the following property of the support measure:
         ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
       – The support of an itemset never exceeds the support of its subsets
       – This is known as the anti-monotone property of support (see the check below)

     Illustrating the Apriori Principle
     [Figure: the itemset lattice again; once an itemset such as {A,B} is found to be infrequent, all of its supersets are pruned from the search space.]
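The anti-monotone property is easy to verify on the market-basket example; this small self-contained check (mine, not from the slides) compares a subset and its superset:

```python
# Growing an itemset can only keep or lower its support, never raise it.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

X = {"Milk", "Diaper"}          # subset
Y = {"Milk", "Diaper", "Beer"}  # superset of X
assert support(X, transactions) >= support(Y, transactions)
print(support(X, transactions), support(Y, transactions))  # 0.6 0.4
```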

  9. Illustrating the Apriori Principle (minimum support count = 3)

     Items (1-itemsets):
       Item   | Count
       Bread  | 4
       Coke   | 2
       Milk   | 4
       Beer   | 3
       Diaper | 4
       Eggs   | 1

     Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
       Itemset        | Count
       {Bread,Milk}   | 3
       {Bread,Beer}   | 2
       {Bread,Diaper} | 3
       {Milk,Beer}    | 2
       {Milk,Diaper}  | 3
       {Beer,Diaper}  | 3

     Triplets (3-itemsets):
       Itemset             | Count
       {Bread,Milk,Diaper} | 3

     If every subset were considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
     With support-based pruning: 6 + 6 + 1 = 13.

     Apriori Algorithm
     • Generate L_1 = frequent itemsets of length k = 1
     • Repeat until no new frequent itemsets are found:
       – Generate C_{k+1}, the length-(k+1) candidate itemsets, from L_k
       – Prune candidate itemsets in C_{k+1} containing subsets of length k that are not in L_k (and hence infrequent)
       – Count the support of each remaining candidate by scanning the DB; eliminate infrequent ones from C_{k+1}
       – L_{k+1} = C_{k+1}; k = k + 1
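Putting the pieces together, here is a compact, self-contained Apriori sketch in Python. It is my own illustrative implementation of the level-wise loop above, not code from the slides; with minsup_count=3 on the example transactions it recovers exactly the 13 frequent itemsets counted on this slide.

```python
from itertools import combinations

def apriori(db, minsup_count):
    """Level-wise Apriori: returns {frequent itemset (frozenset): support count}."""
    # L1: frequent 1-itemsets.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    result = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: union pairs of frequent k-itemsets into
        # (k+1)-itemsets, pruning any candidate with an infrequent k-subset.
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in prev
                                           for s in combinations(u, k)):
                    candidates.add(u)
        # One database scan counts support for all surviving candidates.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        result.update(frequent)
        k += 1
    return result

transactions = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
freq = apriori(transactions, minsup_count=3)
print(len(freq))                                      # 13
print(freq[frozenset({"Bread", "Milk", "Diaper"})])   # 3
```

For simplicity this sketch unions all pairs of frequent k-itemsets; the prefix-based self-join on the next slide produces the same candidates with less redundant work.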

  10. Important Details of Apriori
      • How to generate candidates?
        – Step 1: self-joining L_k
        – Step 2: pruning
      • How to count the support of candidates?
      • Example of candidate generation for L_3 = { {a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d} }:
        – Self-joining L_3:
          • {a,b,c,d} from {a,b,c} and {a,b,d}
          • {a,c,d,e} from {a,c,d} and {a,c,e}
        – Pruning: {a,c,d,e} is removed because {a,d,e} is not in L_3
        – C_4 = { {a,b,c,d} }

      How to Generate Candidates?
      • Step 1: self-joining L_{k-1}
          insert into C_k
          select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
          from L_{k-1} p, L_{k-1} q
          where p.item_1 = q.item_1 AND … AND p.item_{k-2} = q.item_{k-2}
            AND p.item_{k-1} < q.item_{k-1}
      • Step 2: pruning
          forall itemsets c in C_k do
            forall (k-1)-subsets s of c do
              if (s is not in L_{k-1}) then delete c from C_k
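The SQL-style self-join and prune above translate directly into a few lines of Python. In this sketch (names are mine), itemsets are kept as sorted tuples so the "first k−1 items equal, last item smaller" join condition can be checked on tuple slices; it reproduces the C_4 example above:

```python
from itertools import combinations

def generate_candidates(Lk, k):
    """Self-join Lk (frequent k-itemsets as sorted tuples), then prune."""
    Lk = sorted(Lk)
    Lset = set(Lk)
    Ck1 = []
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            # Join condition: first k-1 items equal, last item of p < last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every k-subset of the candidate must itself be frequent.
                if all(s in Lset for s in combinations(c, k)):
                    Ck1.append(c)
    return Ck1

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 3))  # [('a', 'b', 'c', 'd')]
```

The join forms {a,b,c,d} and {a,c,d,e}; the prune step then drops {a,c,d,e} because its subset {a,d,e} is not in L_3, leaving exactly C_4 = { {a,b,c,d} }.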

  11. How to Count Supports of Candidates?
      • Why is counting the supports of candidates a problem?
        – The total number of candidates can be very large
        – One transaction may contain many candidates
      • Method:
        – Candidate itemsets are stored in a hash tree
        – A leaf node contains a list of itemsets
        – An interior node contains a hash table
        – A subset function finds all candidates contained in a transaction

      Generate Hash Tree
      • Suppose we have 15 candidate itemsets of length 3:
        {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
      • We need:
        – A hash function
        – A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node)
      [Figure: a hash tree built with a hash function mapping items 1,4,7 / 2,5,8 / 3,6,9 to three branches; the 15 candidates are distributed over its leaves.]
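Below is a simplified sketch of the hash-tree idea (class and method names are my own; a production implementation would handle splitting and depth limits more carefully). Interior nodes hash on one item per level using the slide's buckets {1,4,7}, {2,5,8}, {3,6,9}; leaves hold up to max_leaf_size candidates; candidates_in finds the candidates contained in a transaction without matching it against every candidate:

```python
class Node:
    def __init__(self):
        self.children = {}   # bucket -> child Node (interior nodes)
        self.itemsets = []   # candidate itemsets (leaf nodes)
        self.is_leaf = True

class HashTree:
    """Simplified hash tree over candidate k-itemsets (sorted tuples)."""

    def __init__(self, k=3, max_leaf_size=3):
        self.root = Node()
        self.k = k
        self.max_leaf_size = max_leaf_size

    def _bucket(self, item):
        return item % 3  # the slide's hash function: {1,4,7}, {2,5,8}, {3,6,9}

    def insert(self, itemset, node=None, depth=0):
        node = node if node is not None else self.root
        if not node.is_leaf:
            child = node.children.setdefault(self._bucket(itemset[depth]), Node())
            self.insert(itemset, child, depth + 1)
            return
        node.itemsets.append(itemset)
        # Split an overfull leaf by hashing on the item at this depth.
        if len(node.itemsets) > self.max_leaf_size and depth < self.k:
            node.is_leaf = False
            pending, node.itemsets = node.itemsets, []
            for s in pending:
                child = node.children.setdefault(self._bucket(s[depth]), Node())
                self.insert(s, child, depth + 1)

    def candidates_in(self, transaction):
        """All stored candidates contained in the given transaction."""
        t = tuple(sorted(transaction))
        found = set()
        self._search(self.root, t, 0, found)
        return found

    def _search(self, node, t, start, found):
        if node.is_leaf:
            # Final containment check guards against hash collisions.
            found.update(c for c in node.itemsets if set(c) <= set(t))
            return
        for i in range(start, len(t)):  # try each remaining item at this level
            child = node.children.get(self._bucket(t[i]))
            if child is not None:
                self._search(child, t, i + 1, found)

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
tree = HashTree()
for c in candidates:
    tree.insert(c)
print(sorted(tree.candidates_in((1, 2, 3, 5, 6))))
# [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

Only the branches whose buckets match items of the transaction are visited, so each transaction is compared against a small fraction of the 15 candidates.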
