H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases
J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang
Int. Conf. on Data Mining (ICDM'01), San Jose, CA
Presented by Leonid Mocofan

Paper's goals
■ Introduce a new data structure: H-struct
■ Introduce a new mining algorithm: H-mine
■ Introduce a new data mining methodology: space-preserving mining

Why a new algorithm?
■ Two current algorithm categories:
  – Candidate generation-and-test approach:
    • E.g., Apriori algorithm
  – Pattern growth methods:
    • E.g., FP-growth, TreeProjection
■ They have performance bottlenecks:
  – Huge space required for mining
  – Real databases contain all the cases
  – Large applications need more scalability

H-mine characteristics
■ It has limited and precisely predictable space overhead
■ It can scale up to very large databases by using database partitioning
■ When the data sets are dense, it can switch to use FP-trees to continue the mining process
Frequent pattern mining: introduction
■ set of items: I = {x_1, …, x_n}
■ itemset X: subset of items (X ⊆ I)
■ transaction: T = (tid, X)
■ transaction database: TDB
■ support(X), written sup(X): number of transactions in TDB containing X

Frequent pattern mining: definitions
■ Frequent pattern: for a transaction database TDB and a support threshold min_sup, X is a frequent pattern if and only if sup(X) ≥ min_sup
■ Frequent pattern mining: finding the complete set of frequent patterns in a given transaction database with respect to a given support threshold

H-mine algorithm
1. H-mine(Mem): a memory-based, efficient pattern-growth algorithm
2. H-mine: based on H-mine(Mem), handles large databases by first partitioning the database
3. For dense data sets, H-mine is integrated with FP-growth dynamically

H-mine(Mem) – Example
Minimum support threshold is 2.

Trans ID   Items          Frequent-item projection
100        c,d,e,f,g,i    c,d,e,g
200        a,c,d,e,m      a,c,d,e
300        a,b,d,e,g,k    a,d,e,g
400        a,c,d,h        a,c,d

F-list: a-c-d-e-g

Header table H:   a   c   d   e   g
                  3   3   4   3   2

[Figure: H-struct. The header table H is linked to the frequent projections of transactions 100, 200, 300 and 400.]
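The definitions and the first step of the example can be checked with a small Python sketch. It is only an illustration, not the paper's implementation: the names `support` and `frequent_item_projection` are hypothetical, and sorting the F-list alphabetically is an assumption that happens to reproduce the order a-c-d-e-g used on the slide.

```python
from collections import Counter

# Transaction database from the example (Trans ID -> items).
TDB = {
    100: ["c", "d", "e", "f", "g", "i"],
    200: ["a", "c", "d", "e", "m"],
    300: ["a", "b", "d", "e", "g", "k"],
    400: ["a", "c", "d", "h"],
}
MIN_SUP = 2


def support(itemset, tdb):
    """sup(X): number of transactions in tdb that contain every item of X."""
    wanted = set(itemset)
    return sum(1 for items in tdb.values() if wanted <= set(items))


def frequent_item_projection(tdb, min_sup):
    """Return (F-list, header counts, frequent-item projection per transaction)."""
    counts = Counter(item for items in tdb.values() for item in set(items))
    # Assumption: the F-list is kept in alphabetical order, as on the slide.
    f_list = sorted(item for item, count in counts.items() if count >= min_sup)
    header = {item: counts[item] for item in f_list}
    projections = {
        tid: [item for item in f_list if item in items]
        for tid, items in tdb.items()
    }
    return f_list, header, projections


f_list, header, projections = frequent_item_projection(TDB, MIN_SUP)
print(f_list)            # ['a', 'c', 'd', 'e', 'g']
print(header)            # {'a': 3, 'c': 3, 'd': 4, 'e': 3, 'g': 2}
print(projections[100])  # ['c', 'd', 'e', 'g']
```

Infrequent items such as b, f and i are dropped from every transaction, so the projections kept in memory are no larger than the original data.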
H-mine(Mem) – Example: header table H_a and ac-queue

Header table H:     a   c   d   e   g
                    3   3   4   3   2

Header table H_a:   c   d   e   g
                    2   3   2   1

Frequent projections:
100   c d e g
200   a c d e
300   a d e g
400   a c d

[Figure: hyper-links from H and H_a into the frequent projections, showing the ac-queue.]

H-mine(Mem) – Example: header table H_ac

Header table H_ac:  d   e
                    2   1

[Figure: header tables H, H_a and H_ac linked to the frequent projections.]

H-mine(Mem) – Example: header table H_a and ad-queue

[Figure: header tables H and H_a linked to the frequent projections, showing the ad-queue.]

H-mine(Mem) – Example: adjusted hyper-links after mining the a-projected database

[Figure: header table H with its hyper-links adjusted once the a-projected database has been mined.]
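The header tables in this example can be reproduced with a short pattern-growth sketch in Python. It is a simplification under stated assumptions, not the paper's H-struct: each queue is modelled as a list of hypothetical (row, position) pointers into the frequent projections, standing in for hyper-links, and fresh pointer lists are built at each level instead of adjusting links in place. The function names are illustrative only.

```python
from collections import defaultdict

# Frequent projections of transactions 100, 200, 300, 400 (F-list order a-c-d-e-g).
PROJECTIONS = [
    ["c", "d", "e", "g"],  # 100
    ["a", "c", "d", "e"],  # 200
    ["a", "d", "e", "g"],  # 300
    ["a", "c", "d"],       # 400
]
MIN_SUP = 2


def header_table(queue, projections):
    """Group pointers by the items that follow each (row, pos) in the queue.

    The number of pointers collected for an item is its support in the
    projected database of the current prefix.
    """
    links = defaultdict(list)
    for row, pos in queue:
        for nxt in range(pos + 1, len(projections[row])):
            links[projections[row][nxt]].append((row, nxt))
    return links


def mine(prefix, queue, projections, min_sup, out):
    """Grow patterns from prefix using only pointers, never copying rows."""
    links = header_table(queue, projections)
    for item in sorted(links):
        if len(links[item]) >= min_sup:
            pattern = prefix + (item,)
            out[pattern] = len(links[item])
            mine(pattern, links[item], projections, min_sup, out)


def h_mine_mem_sketch(projections, min_sup):
    out = {}
    # Virtual pointers placed just before the first item of every row.
    start = [(row, -1) for row in range(len(projections))]
    mine((), start, projections, min_sup, out)
    return out


patterns = h_mine_mem_sketch(PROJECTIONS, MIN_SUP)

# Counts in the a-projected database, matching header table H_a.
a_queue = [(1, 0), (2, 0), (3, 0)]
h_a = {item: len(ptrs) for item, ptrs in header_table(a_queue, PROJECTIONS).items()}
print(h_a)                   # {'c': 2, 'd': 3, 'e': 2, 'g': 1}
print(patterns[("a", "c")])  # 2
```

Because only pointers are stored, the extra space stays proportional to the size of the projections, which is the idea behind the space-preserving claim. The in-place link adjustment shown in the figures is not modelled here.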
H-mine: Mining large databases
■ TDB: transaction database (size n)
■ Minimum support threshold min_sup
■ Find L, the set of frequent items
■ TDB is partitioned into k parts (TDB_i, 1 ≤ i ≤ k)

H-mine: Mining large databases
■ Apply H-mine(Mem) to TDB_i with minimum support threshold min_sup ∗ n_i / n, where n_i is the size of TDB_i
■ Combine F_i, the set of locally frequent patterns in TDB_i, to get the globally frequent patterns

H-mine – Example
■ TDB split into P1, P2, P3, P4
■ Minimum support threshold 100

Local freq. pat.   Partitions        Accumulated sup. cnt
ab                 P1, P2, P3, P4    280
ac                 P1, P2, P3, P4    320
ad                 P1, P2, P3, P4    260
abc                P1, P3, P4        120
abcd               P1, P4            40
…                  …                 …

■ Frequent patterns: ab, ac, ad, abc

Performance
■ H-mine has better runtime performance on both sparse and dense data than FP-growth and Apriori
■ H-mine has better space usage on both sparse and dense data than FP-growth and Apriori
■ H-mine performs well with very large databases too
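The partition step can be sketched in Python using the numbers from the example table. This is an illustrative sketch: `local_min_sup` and `combine` are hypothetical names, and the partition sizes passed to `local_min_sup` are made up, since the slides give none. The sketch relies on one fact: a count accumulated over only some partitions is a lower bound on the global support. How the remaining patterns are resolved, presumably by counting them in the other partitions, is an assumption and is not shown.

```python
MIN_SUP = 100
ALL_PARTITIONS = frozenset({"P1", "P2", "P3", "P4"})

# (pattern, partitions where it is locally frequent, accumulated support count)
LOCAL_RESULTS = [
    ("ab", {"P1", "P2", "P3", "P4"}, 280),
    ("ac", {"P1", "P2", "P3", "P4"}, 320),
    ("ad", {"P1", "P2", "P3", "P4"}, 260),
    ("abc", {"P1", "P3", "P4"}, 120),
    ("abcd", {"P1", "P4"}, 40),
]


def local_min_sup(min_sup, n_i, n):
    """Threshold used inside partition TDB_i: min_sup * n_i / n."""
    return min_sup * n_i / n


def combine(local_results, min_sup, all_partitions):
    """Split local patterns into confirmed frequent and still undecided.

    The accumulated count is a lower bound on the global support, so any
    pattern that already reaches min_sup is globally frequent. A pattern
    below min_sup that was missing from some partition stays undecided;
    one seen in every partition has an exact count and is dropped.
    """
    confirmed, undecided = [], []
    for pattern, partitions, accumulated in local_results:
        if accumulated >= min_sup:
            confirmed.append(pattern)
        elif set(partitions) != set(all_partitions):
            undecided.append(pattern)
    return confirmed, undecided


confirmed, undecided = combine(LOCAL_RESULTS, MIN_SUP, ALL_PARTITIONS)
print(confirmed)  # ['ab', 'ac', 'ad', 'abc']
print(undecided)  # ['abcd']

# Hypothetical sizes: a partition holding 250 of 1000 transactions.
print(local_min_sup(MIN_SUP, 250, 1000))  # 25.0
```

Scaling the threshold by n_i / n ensures that every globally frequent pattern is locally frequent in at least one partition, so no frequent pattern is lost by mining the partitions separately.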
Conclusions
H-mine:
■ has high performance
■ is scalable in all kinds of data
■ has very small space overhead
■ can dynamically adapt to input data
■ introduces structure- and space-preserving mining methodology

Bibliography
■ "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases", J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001.
■ "Mining Frequent Patterns without Candidate Generation", J. Han, J. Pei, and Y. Yin, ACM-SIGMOD 2000, Dallas, TX, May 2000.
■ "Data Mining: Concepts and Techniques", Jiawei Han and Micheline Kamber, The Morgan Kaufmann Pub., 2001.