Frequent Itemsets • Itemset: a set of items – E.g., acm = {a, c, m} Transaction database TDB • Support of itemsets TID Items bought – Sup(acm) = 3 100 f, a, c, d, g, I, m, p • Given min_sup = 3, acm 200 a, b, c, f, l, m, o is a frequent pattern 300 b, f, h, j, o • Frequent pattern mining: 400 b, c, k, s, p finding all frequent 500 a, f, c, e, l, p, m, n patterns in a database Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 1
Candidate Generation & Test • Any subset of a frequent itemset must also be frequent – an anti-monotonic property – A transaction containing {beer, diaper, nuts} also contains {beer, diaper} – {beer, diaper, nuts} is frequent à {beer, diaper} must also be frequent • In other words, any superset of an infrequent itemset must also be infrequent – No superset of any infrequent itemset should be generated or tested – Many item combinations can be pruned! Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 2
Apriori-Based Mining • Generate length (k+1) candidate itemsets from length k frequent itemsets, and • Test the candidates against DB Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 3
The Apriori Algorithm [AgSr94] Data base D 1-candidates Freq 1-itemsets 2-candidates TID Items Itemset Sup Itemset Sup Itemset 10 a, c, d a 2 a 2 ab Scan D 20 b, c, e b 3 b 3 ac 30 a, b, c, e c 3 c 3 ae 40 b, e d 1 bc e 3 Min_sup=2 e 3 be ce Counting 3-candidates Freq 2-itemsets Scan D Itemset Sup Itemset Itemset Sup ab 1 bce ac 2 Scan D ac 2 bc 2 ae 1 be 3 Freq 3-itemsets bc 2 ce 2 Itemset Sup be 3 bce 2 ce 2 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 4
Challenges of Freq Pat Mining • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 5
Improving Apriori: Ideas • Reducing the number of transaction database scans • Shrinking the number of candidates • Facilitating support counting of candidates Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 6
DIC: Reducing Number of Scans • Once both A and D are determined ABCD frequent, the counting of AD can begin ABC ABD ACD BCD • Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin AB AC BC AD BD CD Transactions 1-itemsets B C D A 2-itemsets Apriori … {} 1-itemsets Itemset lattice 2-items S. Brin R. Motwani, J. Ullman, DIC 3-items and S. Tsur, SIGMOD ’ 97. DIC: Dynamic Itemset Counting Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 7
DHP: Reducing # of Candidates • A hashing bucket count < min_sup à every candidate in the bucket is infrequent – Candidates: a, b, c, d, e – Hash entries: {ab, ad, ae} {bd, be, de} … – Large 1-itemset: a, b, d, e – The sum of counts of {ab, ad, ae} < min_sup à ab should not be a candidate 2-itemset • J. Park, M. Chen, and P. Yu, SIGMOD ’ 95 – DHP: Direct Hashing and Pruning Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 8
A 2-Scan Method by Partitioning • Partition the database into n partitions, such that each partition can be held into main memory • Itemset X is frequent à X must be frequent in at least one partition – Scan 1: partition database and find local frequent patterns – Scan 2: consolidate global frequent patterns • All local frequent itemsets can be held in main memory? A sometimes too strong assumption • A. Savasere, E. Omiecinski, and S. Navathe, VLDB ’ 95 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 9
Sampling for Frequent Patterns • Select a sample of the original database, mine frequent patterns in the sample using Apriori • Scan database once more to verify frequent itemsets found in the sample, only borders of closure of frequent patterns are checked – Example: check abcd instead of ab, ac, … , etc. • Scan database again to find missed frequent patterns • H. Toivonen, VLDB ’ 96 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 10
Eclat/MaxEclat and VIPER • Tid-list: the list of transaction-ids containing an itemset – Vertical Data Format • Major operation: intersections of tid-lists • Compression of tid-lists – Itemset A: t1, t2 t3, sup(A)=3 – Itemset B: t2, t3, t4, sup(B)=3 – Itemset AB: t2, t3, sup(AB)=2 • M. Zaki et al., 1997 • P. Shenoy et al., 2000 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 11
Bottleneck of Freq Pattern Mining • Multiple database scans are costly • Mining long patterns needs many scans and generates many candidates – To find frequent itemset i 1 i 2 … i 100 • # of scans: 100 100 100 100 ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ • # of Candidates: 100 30 � 2 1 1 . 27 10 ⎜ ⎟ + ⎜ ⎟ + + ⎜ ⎟ = − ≈ × ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ 1 2 100 ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ – Bottleneck: candidate-generation-and-test • Can we avoid candidate generation? Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 12
Search Space of Freq. Pat. Mining • Itemsets form a lattice ABCD ABC ABD ACD BCD AB AC BC AD BD CD A B C D {} Itemset lattice Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 13
Set Enumeration Tree • Use an order on items, enumerate itemsets in lexicographic order – a, ab, abc, abcd, ac, acd, ad, b, bc, bcd, bd, c, dc, d ∅ • Reduce a lattice to a tree a b c d Set enumeration tree ab ac ad bc bd cd abc abd acd bcd abcd Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 14
Borders of Frequent Itemsets • Frequent itemsets are connected – ∅ is trivially frequent – X on the border à every subset of X is frequent ∅ a b c d ab ac ad bc bd cd abc abd acd bcd abcd Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 15
Projected Databases • To test whether Xy is frequent, we can use the X-projected database – The sub-database of transactions containing X – Check whether item y is frequent in X-projected database ∅ a b c d ab ac ad bc bd cd abc abd acd bcd abcd Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 16
Compress Database by FP-tree root Header table item • The 1st scan: find f:4 c:1 f frequent items c:3 b:1 c b:1 a – Only record frequent a:3 p:1 b items in FP-tree m b:1 m:2 p – F-list: f-c-a-b-m-p p:2 m:1 • The 2nd scan: construct tree (ordered) TID Items bought freq items – Order frequent items in 100 f, a, c, d, g, I, m, p f, c, a, m, p each transaction w.r.t. f- 200 a, b, c, f, l,m, o f, c, a, b, m list 300 b, f, h, j, o f, b – Explore sharing among 400 b, c, k, s, p c, b, p transactions 500 a, f, c, e, l, p, m, n f, c, a, m, p Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 17
Benefits of FP-tree • Completeness – Never break a long pattern in any transaction – Preserve complete information for freq pattern mining • Not scan database anymore • Compactness – Reduce irrelevant info — infrequent items are removed – Items in frequency descending order (f-list): the more frequently occurring, the more likely to be shared – Never be larger than the original database (not counting node-links and the count fields) Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 18
Partitioning Frequent Patterns • Frequent patterns can be partitioned into subsets according to f-list: f-c-a-b-m-p – Patterns containing p – Patterns having m but no p – … – Patterns having c but no a nor b, m, or p – Pattern f • Depth-first search of a set enumeration tree – The partitioning is complete and does not have any overlap Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 19
Find Patterns Having Item “ p ” • Only transactions containing p are needed • Form p -projected database – Starting at entry p of the header table – Follow the side-link of frequent item p – Accumulate all transformed prefix paths of p root Header p -projected database TDB| p table item f:4 fcam : 2 c:1 f cb : 1 c:3 b:1 c b:1 a Local frequent item: c :3 a:3 p:1 b Frequent patterns containing p m b:1 m:2 p : 3, pc : 3 p p:2 m:1 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 20
Find Pat Having Item m But No p • Form m-projected database TDB|m – Item p is excluded (why?) – Contain fca:2, fcab:1 – Local frequent items: f, c, a Header table • Build FP-tree for TDB|m root item Header root f:4 f c:1 table c c:3 item b:1 f:3 b:1 a f a:3 c:3 b p:1 c m a b:1 m:2 a:3 p m-projected FP-tree p:2 m:1 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 21
Recursive Mining • Patterns having m but no p can be mined recursively • Optimization: enumerate patterns from a single-branch FP-tree Header – Enumerate all combination table root item – Support = that of the last item f f:3 • m, fm, cm, am c c:3 • fcm, fam, cam a a:3 • fcam m -projected FP-tree Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 22
Enumerate Patterns From Single Prefix of FP-tree • A (projected) FP-tree has a single prefix – Reduce the single prefix into one node – Join the mining results of the two parts root root r 1 a 1 :n 1 a 1 :n 1 a 2 :n 2 b 1 :m 1 c 1 :k 1 r = + Ú Ú a 2 :n 2 a 3 :n 3 a 3 :n 3 b 1 :m 1 c 1 :k 1 c 2 :k 2 c 3 :k 3 c 2 :k 2 c 3 :k 3 Jian Pei: CMPT 741/459 Frequent Pattern Mining (2) 23
Recommend
More recommend