Frequent Pattern Mining
How Many Words Is a Picture Worth?
E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)
Burnt or Burned?
E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013
Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Transaction Data
• Alphabet: a set of items
  – Example: all products sold in a store
• A transaction: a set of items involved in an activity
  – Example: the items purchased by a customer in a visit
• Other information is often associated
  – Timestamp, price, salesperson, customer-id, store-id, …
Examples of Transaction Data
How to Store Transaction Data?
• Transaction-id
  – Example: (t123, a, b, c), (t236, b, d)
• Relational storage
    Tid   Item
    t123  a
    t123  b
    t123  c
    …     …
    t236  b
    t236  d
• Transaction-based storage
• Item-based (vertical) storage
  – Item a: …, t123, …
  – Item b: …, t123, …, t236, …
  – …
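The three storage layouts hold the same information and can be converted into one another. The following is a minimal sketch using only the two transactions shown on the slide; the variable names are illustrative and not part of the course material.

```python
from collections import defaultdict

# Transaction-based storage: one record per transaction.
transactions = {"t123": ["a", "b", "c"], "t236": ["b", "d"]}

# Relational storage: one (Tid, Item) row per purchased item.
relational = [(tid, item) for tid, items in transactions.items() for item in items]

# Item-based (vertical) storage: for each item, the ids of the
# transactions that contain it.
vertical = defaultdict(list)
for tid, item in relational:
    vertical[item].append(tid)

print(relational)
print(dict(vertical))
```

The vertical layout makes it cheap to find the transactions containing one item, while the transaction-based layout is convenient when each transaction is scanned once.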
Transaction Data Analysis
• Transactions: customers' purchases of commodities
  – {bread, milk, cheese} if they are bought together
• Frequent patterns: product combinations that are frequently purchased together by customers
• Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
Why Frequent Patterns?
• What products were often purchased together?
• What are the frequent subsequent purchases after buying an iPod?
• What kinds of genes are sensitive to this new drug?
• What keyword combinations are frequently associated with web pages about game evaluation?
Why Frequent Pattern Mining?
• Foundation for many data mining tasks
  – Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
• Broad applications
  – Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …
Frequent Itemsets
• Itemset: a set of items
  – E.g., acm = {a, c, m}
• Support of itemsets: the number of transactions that contain the itemset
  – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: finding all frequent patterns in a database
Transaction database TDB
    TID  Items bought
    100  f, a, c, d, g, i, m, p
    200  a, b, c, f, l, m, o
    300  b, f, h, j, o
    400  b, c, k, s, p
    500  a, f, c, e, l, p, m, n
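As an illustration, the support of an itemset can be computed by a single pass over the transaction database. This is a minimal sketch; the function name `support` and the dictionary layout are assumptions made for the example.

```python
def support(itemset, tdb):
    """Number of transactions in tdb that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in tdb.values() if items <= t)

# The transaction database TDB from the slide, one set of items per TID.
tdb = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

min_sup = 3
print(support("acm", tdb), support("acm", tdb) >= min_sup)
```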
A Naïve Attempt
• Generate all possible itemsets, test their supports against the database
• How to hold a large number of itemsets in main memory?
  – 100 items → 2^100 – 1 possible itemsets
• How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
  – A transaction of length 20 needs to update the support of 2^20 – 1 = 1,048,575 itemsets
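The counts above follow from the fact that a set of n items has 2^n – 1 non-empty subsets. A short sketch of the arithmetic, with a brute-force enumeration on a tiny transaction as a sanity check (the helper name `nonempty_subsets` is made up for this example):

```python
from itertools import combinations

def nonempty_subsets(transaction):
    """Yield every non-empty itemset contained in the transaction."""
    items = sorted(transaction)
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)

# Closed-form counts used on the slide.
count_100 = 2**100 - 1   # itemsets over an alphabet of 100 items
count_20 = 2**20 - 1     # itemsets contained in one transaction of length 20

# Brute-force check of the formula on a transaction of length 4.
small = sum(1 for _ in nonempty_subsets("abcd"))

print(count_100)
print(count_20)
print(small == 2**4 - 1)
```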
Transactions in Real Applications
• A large department store often carries more than 100 thousand different kinds of items
  – Amazon.com carries more than 17,000 books relevant to data mining
• Walmart has more than 20 million transactions per day, AT&T produces more than 275 million calls per day
• Mining large transaction databases of many items is a real demand
How to Get an Efficient Method?
• Reducing the number of itemsets that need to be checked
• Checking the supports of selected itemsets efficiently
Candidate Generation & Test
• Any subset of a frequent itemset must also be frequent: an anti-monotonic property
  – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  – {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
• In other words, any superset of an infrequent itemset must also be infrequent
  – No superset of any infrequent itemset should be generated or tested
  – Many item combinations can be pruned!
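The anti-monotonic property can be checked directly: the support of a subset is never smaller than the support of the itemset itself. The sketch below uses a small made-up set of transactions (not from the course material) purely to illustrate this.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

# Hypothetical toy data, chosen only for illustration.
transactions = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "nuts", "milk"},
    {"milk", "nuts"},
]

big = {"beer", "diaper", "nuts"}
# Every proper, non-empty subset is at least as frequent as the itemset.
for k in range(1, len(big)):
    for sub in combinations(sorted(big), k):
        assert support(sub, transactions) >= support(big, transactions)

print(support({"beer", "diaper"}, transactions), support(big, transactions))
```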
Apriori-Based Mining
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB
The Apriori Algorithm [AgSr94]
Example with Min_sup = 2
Database D
    TID  Items
    10   a, c, d
    20   b, c, e
    30   a, b, c, e
    40   b, e
• Scan D → 1-candidates (itemset: sup): a: 2, b: 3, c: 3, d: 1, e: 3
• Freq 1-itemsets: a: 2, b: 3, c: 3, e: 3
• 2-candidates: ab, ac, ae, bc, be, ce
• Scan D → counting: ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
• Freq 2-itemsets: ac: 2, bc: 2, be: 3, ce: 2
• 3-candidates: bce
• Scan D → Freq 3-itemsets: bce: 2
The Apriori Algorithm
Level-wise, candidate generation and test
• C_k: candidate itemsets of size k
• L_k: frequent itemsets of size k
• L_1 = {frequent items};
• for (k = 1; L_k != ∅; k++) do
  – Candidate generation: C_{k+1} = candidates generated from L_k;
  – Test: for each transaction t in database do increment the count of all candidates in C_{k+1} that are contained in t
  – L_{k+1} = candidates in C_{k+1} with min_support
• return ∪_k L_k;
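The level-wise loop can be sketched in a few lines of Python. This is an illustrative, unoptimized version and not the reference implementation of [AgSr94]: candidates are formed by taking unions of pairs of frequent k-itemsets rather than by a prefix-based self-join, and supports are counted by a plain scan without a hash-tree. The function name `apriori` and the data layout are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support} for all frequent itemsets (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]

    # L_1: count single items with one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 1
    while level:
        # Candidate generation: unions of two frequent k-itemsets that have
        # size k+1 and whose k-subsets are all frequent (pruning).
        prev = list(level)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(s) in level for s in combinations(union, k)
                ):
                    candidates.add(union)

        # Test: one scan of the database to count every candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# The four-transaction database D from the example slide, Min_sup = 2.
db = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(db, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(itemset)), sup)
```

On the example database this yields the same frequent itemsets as the worked example: four frequent items, four frequent 2-itemsets, and bce.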
Important Steps in Apriori
• How to find frequent 1- and 2-itemsets?
• How to generate candidates?
  – Step 1: self-joining L_k
  – Step 2: pruning
• How to count supports of candidates?
Finding Frequent 1- & 2-itemsets
• Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
  – Initialize c[item] = 0 for each item
  – For each transaction T, for each item in T, c[item]++;
  – If c[item] >= min_sup, item is frequent
• Finding frequent 2-itemsets using a 2-dimensional triangle matrix
  – For items i, j (i < j), c[i, j] is the count of itemset ij
Counting Array
• A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
• There are n items; the entry in row i, column j (i < j) is the array position of pair ij (n = 5 shown)
         1   2   3   4   5
    1        1   2   3   4
    2            5   6   7
    3                8   9
    4                   10
    5
• For items i, j (i < j), c[i, j] = c[(i-1)(2n-i)/2 + j - i]
• Example: c[3, 5] = c[(3-1)*(2*5-3)/2 + 5 - 3] = c[9]
• 1-dimensional array positions: 1 2 3 4 5 6 7 8 9 10
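The index formula can be exercised with a short sketch. Items are assumed to be numbered 1 to n, and slot 0 of the array is left unused so that the 1-based formula applies unchanged; the function names are illustrative. The sample data maps items a to e of a small four-transaction database to 1 to 5.

```python
def tri_index(i, j, n):
    """1-based position of the pair (i, j), 1 <= i < j <= n, in the flat array."""
    return (i - 1) * (2 * n - i) // 2 + j - i

def count_pairs(transactions, n):
    """Count every 2-itemset with a flat array of n(n-1)/2 counters."""
    c = [0] * (n * (n - 1) // 2 + 1)  # slot 0 unused
    for t in transactions:
        items = sorted(t)
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                c[tri_index(items[a], items[b], n)] += 1
    return c

# Items a..e numbered 1..5 (an assumption made for this example).
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
c = count_pairs(db, n=5)

print(tri_index(3, 5, 5))
print(c[tri_index(2, 5, 5)])  # count of the pair (2, 5)
```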
Example of Candidate-generation
• L_3 = {abc, abd, acd, ace, bcd}
• Self-joining: L_3 * L_3
  – abcd ← abc * abd
  – acde ← acd * ace
• Pruning:
  – acde is removed because ade is not in L_3
• C_4 = {abcd}
How to Generate Candidates?
• Suppose the items in each itemset of L_{k-1} are listed in a fixed order
• Step 1: self-join L_{k-1}
    INSERT INTO C_k
    SELECT p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    FROM L_{k-1} p, L_{k-1} q
    WHERE p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
• Step 2: pruning
  – For each itemset c in C_k do
    • For each (k-1)-subset s of c do
      if (s is not in L_{k-1}) then delete c from C_k
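The two steps translate fairly directly into Python. The sketch below assumes each itemset is stored as a sorted tuple; the function name `generate_candidates` is made up for the example, and the quadratic join loop is chosen for clarity rather than speed.

```python
from itertools import combinations

def generate_candidates(prev_frequent):
    """Build k-candidates from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for p in prev:
        for q in prev:
            # Step 1 (self-join): p and q agree on the first k-2 items and
            # the last item of p is smaller than the last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2 (pruning): every (k-1)-subset of c must be frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

# L_3 = {abc, abd, acd, ace, bcd}
L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
C4 = generate_candidates(L3)
print(C4)
```

With this input the join produces abcd and acde, and pruning discards acde because ade is not in L_3.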
How to Count Supports?
• Why is counting supports of candidates a problem?
  – The total number of candidates can be huge
  – One transaction may contain many candidates
• Method
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
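Example: Counting Supports
• Subset function applied to transaction 1 2 3 5 6
• Hash function: items 1, 4, 7 share a branch; so do items 2, 5, 8 and items 3, 6, 9
• The transaction is expanded by prefix: 1 + 2 3 5 6, then 1 2 + 3 5 6 and 1 3 + 5 6
• Candidates stored in the leaves: 2 3 4, 5 6 7, 3 6 7, 1 4 5, 3 5 6, 3 4 5, 1 3 6, 3 6 8, 3 5 7, 6 8 9, 1 2 4, 1 2 5, 1 5 9, 4 5 7, 4 5 8
[Figure: hash-tree holding the candidate 3-itemsets]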
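A simplified hash-tree can be sketched as follows. It assumes the hash function h(x) = x mod 3, which groups the items as on the slide, and a leaf capacity of 3; the class and function names are illustrative, and the tree this code builds is not guaranteed to have exactly the shape drawn in the figure. Matches are collected in a set so that a leaf reached along several paths does not report a candidate twice.

```python
class Node:
    def __init__(self):
        self.is_leaf = True
        self.itemsets = []   # candidates stored in a leaf
        self.children = {}   # hash value -> child node (interior node)

def h(item):
    return item % 3  # 1,4,7 / 2,5,8 / 3,6,9 each share a branch

def insert(node, itemset, depth=0, max_leaf=3):
    """Insert a sorted candidate tuple, splitting overfull leaves."""
    if node.is_leaf:
        node.itemsets.append(itemset)
        # Split only while there is still an item left to hash on.
        if len(node.itemsets) > max_leaf and depth < len(itemset):
            node.is_leaf = False
            pending, node.itemsets = node.itemsets, []
            for s in pending:
                insert(node, s, depth, max_leaf)
    else:
        child = node.children.setdefault(h(itemset[depth]), Node())
        insert(child, itemset, depth + 1, max_leaf)

def contained(root, transaction):
    """Subset function: all stored candidates contained in the transaction."""
    t = tuple(sorted(transaction))
    tset = set(t)
    found = set()

    def visit(node, start):
        if node.is_leaf:
            found.update(s for s in node.itemsets if tset.issuperset(s))
            return
        # Hash on each remaining item of the transaction in turn.
        for i in range(start, len(t)):
            child = node.children.get(h(t[i]))
            if child is not None:
                visit(child, i + 1)

    visit(root, 0)
    return found

candidates = [
    (2, 3, 4), (5, 6, 7), (3, 6, 7), (1, 4, 5), (3, 5, 6),
    (3, 4, 5), (1, 3, 6), (3, 6, 8), (3, 5, 7), (6, 8, 9),
    (1, 2, 4), (1, 2, 5), (1, 5, 9), (4, 5, 7), (4, 5, 8),
]
root = Node()
for cand in candidates:
    insert(root, cand)

# Counting step for one transaction: increment each matched candidate.
counts = {cand: 0 for cand in candidates}
for cand in contained(root, [1, 2, 3, 5, 6]):
    counts[cand] += 1

print(sorted(contained(root, [1, 2, 3, 5, 6])))
```

Only the leaves reachable by hashing items of the transaction are inspected, which is what saves work compared with testing every candidate against every transaction.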
Association Rules
• Rule c → am
• Support: 3 (i.e., the support of acm)
• Confidence: 75% (i.e., sup(acm) / sup(c))
• Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds
Transaction database TDB
    TID  Items bought
    100  f, a, c, d, g, i, m, p
    200  a, b, c, f, l, m, o
    300  b, f, h, j, o
    400  b, c, k, s, p
    500  a, f, c, e, l, p, m, n
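Support and confidence of the rules derived from one frequent itemset can be computed as in the sketch below. The function names and the choice to return rules as tuples are assumptions made for illustration; a full miner would apply this to every frequent itemset.

```python
from itertools import combinations

def support(itemset, tdb):
    """Number of transactions in tdb that contain every item of itemset."""
    s = set(itemset)
    return sum(1 for t in tdb if s <= t)

def rules_from(itemset, tdb, min_conf):
    """Yield (antecedent, consequent, support, confidence) for one itemset."""
    items = frozenset(itemset)
    sup = support(items, tdb)
    for k in range(1, len(items)):
        for lhs in combinations(sorted(items), k):
            # confidence(lhs -> rest) = sup(itemset) / sup(lhs)
            conf = sup / support(lhs, tdb)
            if conf >= min_conf:
                rhs = "".join(sorted(items - set(lhs)))
                yield "".join(lhs), rhs, sup, conf

# The transaction database TDB from the slide.
tdb = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]

rules = list(rules_from("acm", tdb, min_conf=0.7))
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}  support={sup}  confidence={conf:.0%}")
```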
To-Do List
• Read Sections 6.1, 6.2.1 and 6.2.2 in the textbook
• Understand the concept of frequent itemsets and association rules
• Understand algorithm Apriori
• Figure out how to use Weka to mine frequent itemsets