CHARM: An Efficient Algorithm for Closed Itemset Mining
Authors: Mohammed J. Zaki and Ching-Jui Hsiao
Presenter: Junfeng Wu
28/10/2004

Outline
• Introductions
• Itemset-Tidset tree
• CHARM algorithm
• Performance study
• Conclusion
• Comments

Introductions
Mining association rules in a database generates a huge number of frequent patterns (itemsets).
• Database: {(1,2,3,4), (1,2,3,4,5,6)}
• Minimum support = 50%
• 63 frequent itemsets: {(1), (2), (3), (4), (5), (6), (1,2), (1,3), …, (1,2,3,4,5,6)}

Closed frequent itemsets are a non-redundant representation of all frequent itemsets. Mining association rules on closed frequent itemsets is a much easier task. In the database above, there are only 2 closed frequent itemsets: (1,2,3,4) and (1,2,3,4,5,6).
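The counts on the introduction slides can be checked by brute force. The sketch below is not part of CHARM; it simply enumerates every itemset, which is only feasible for a toy database. The function names are made up for this illustration.

```python
from itertools import combinations

def support_counts(db):
    """Map every itemset that occurs in some transaction to its support count."""
    counts = {}
    for transaction in db:
        items = sorted(transaction)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts

def frequent_and_closed(db, min_support):
    """Return (frequent itemsets with counts, list of closed frequent itemsets)."""
    min_count = min_support * len(db)
    counts = support_counts(db)
    frequent = {x: n for x, n in counts.items() if n >= min_count}
    # X is closed when no proper superset has the same support.
    closed = [x for x, n in frequent.items()
              if not any(x < y and n == m for y, m in frequent.items())]
    return frequent, closed

db = [{1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}]
frequent, closed = frequent_and_closed(db, 0.5)
print(len(frequent))                       # 63
print(sorted(sorted(x) for x in closed))   # [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]
```

The enumeration is exponential in the transaction length, which is exactly the blow-up that mining only closed itemsets is meant to avoid.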
Closed frequent itemsets
A frequent itemset X is closed if and only if there is no itemset Y such that
• Y subsumes X (Y is a proper superset of X), and
• every transaction that contains X also contains Y.

Database: {(1,2,3,4), (1,2,3,4,5,6)}
Itemset (1,2) is not a closed itemset: every transaction that contains it also contains (1,2,3,4).
Itemset (1,2,3,4) is a closed itemset.

Example Database

DISTINCT DATABASE ITEMS
  A   Jane Austen
  C   Agatha Christie
  D   Sir Arthur Conan Doyle
  T   Mark Twain
  W   P.G. Wodehouse

DATABASE
  Transaction   Items
  1             A,C,T,W
  2             C,D,W
  3             A,C,T,W
  4             A,C,D,W
  5             A,C,D,T,W
  6             C,D,T

ALL FREQUENT ITEMSETS (MINIMUM SUPPORT = 50%)
  Support    Itemsets
  100% (6)   C
  83% (5)    W, CW
  67% (4)    A, D, T, AC, AW, CD, CT, ACW
  50% (3)    AT, DW, TW, ACT, ATW, CDW, CTW, ACTW

Horizontal/Vertical format database
• Horizontal format database
  • Each record is a set of items.
  • Each record is assigned a distinct number called the transaction id (tid).
• Vertical format database
  • Each record is the set of transaction ids for one item.
  • The item occurs in exactly these transactions.

Vertical format database
  Item   Transaction ids
  A      1 3 4 5
  C      1 2 3 4 5 6
  D      2 4 5 6
  T      1 3 5 6
  W      1 2 3 4 5
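The conversion from horizontal to vertical format can be sketched in a few lines. This is an illustration on the example database only; the function name to_vertical is an assumption, not something taken from the paper.

```python
from collections import defaultdict

def to_vertical(horizontal):
    """Turn {tid: items} into {item: set of tids}."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

# The example database in horizontal format (tid -> items).
horizontal = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}

vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, sorted(vertical[item]))
# A [1, 3, 4, 5]
# C [1, 2, 3, 4, 5, 6]
# D [2, 4, 5, 6]
# T [1, 3, 5, 6]
# W [1, 2, 3, 4, 5]
```

A single pass over the transactions is enough, so the conversion cost is linear in the total number of item occurrences.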
Notations
• Given an itemset X, t(X) is the set of all tids that contain X.
  For example: t(ACW) = 1345
• Given a tidset Y, i(Y) is the set of items common to all the tids in Y.
  For example: i(12) = CW
• Given an itemset X, c(X) is the smallest closed set that contains X.
  For example: c(A) = c(AC) = c(AW) = ACW

Itemset-Tidset Search Tree (IT-tree)
• Each node in the IT-tree is an itemset-tidset pair, X×t(X).
  For example: AT×135
• All the children of node X share the same prefix X and belong to one equivalence class.

Example of IT-tree
  Root:     {}×123456
  Level 1:  A×1345, C×123456, D×2456, T×1356, W×12345
  Level 2:  AC×1345, AD×45, AT×135, AW×1345, CD×2456, CT×1356, CW×12345, DT×56, DW×245, TW×135
  Level 3:  ACD×45, ACT×135, ACW×1345, ADW×45, ADT×5, ATW×135, CDT×56, CDW×245, CTW×135, DTW×5
  Level 4:  ACDT×5, ACDW×45, ACTW×135, ADTW×5, CDTW×5
  Level 5:  ACDTW×5

Theorem 1
Let X_i×t(X_i) and X_j×t(X_j) be any two members of a class [P], with X_i ≤_f X_j, where f is a total order. The following four properties hold:
1. If t(X_i) = t(X_j), then c(X_i) = c(X_j) = c(X_i ∪ X_j).
2. If t(X_i) ⊂ t(X_j), then c(X_i) ≠ c(X_j), but c(X_i) = c(X_i ∪ X_j).
3. If t(X_i) ⊃ t(X_j), then c(X_i) ≠ c(X_j), but c(X_j) = c(X_i ∪ X_j).
4. If t(X_i) ≠ t(X_j) (and neither tidset contains the other), then c(X_i) ≠ c(X_j) ≠ c(X_i ∪ X_j).
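The three operators from the notation slide can be tried out directly on the example database. The sketch below assumes c(X) = i(t(X)), which matches the example c(A) = ACW; the dictionary DB and the single-letter function names mirror the slides and are otherwise illustrative.

```python
# Example database from the slides: tid -> set of items.
DB = {1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
      4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}

def t(itemset):
    """Tidset of an itemset: all tids whose transaction contains every item."""
    return frozenset(tid for tid, items in DB.items() if set(itemset) <= items)

def i(tidset):
    """Items common to all transactions in the tidset."""
    if not tidset:
        # Convention for the empty tidset: every item qualifies vacuously.
        return frozenset().union(*DB.values())
    return frozenset(set.intersection(*(DB[tid] for tid in tidset)))

def c(itemset):
    """Closure of an itemset, computed as i(t(X))."""
    return i(t(itemset))

print(sorted(t("ACW")))   # [1, 3, 4, 5]
print(sorted(i({1, 2})))  # ['C', 'W']
print(sorted(c("A")))     # ['A', 'C', 'W']

# Theorem 1, property 2: t(A) is a proper subset of t(C).
assert t("A") < t("C") and c("A") != c("C") and c("A") == c("AC")
# Theorem 1, property 4: t(D) and t(T) are incomparable.
assert c("D") != c("T") and c("T") != c("DT") and c("D") != c("DT")
```

Scanning the whole database for every call is wasteful; the IT-tree avoids this by intersecting the tidsets already stored at sibling nodes.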
CHARM algorithm

How does CHARM work?
Search tree for the example database. An arrow marks a node whose itemset is extended in place because its tidset is contained in the tidset of a later node (Theorem 1, properties 1 and 2).

  {}
    D×2456 → DC×2456
      DT×56
      DA×45
      DW×245 → DWC×245
    T×1356 → TC×1356
      TA×135 → TAC×135 → TAWC×135
      TW×135 → TWC×135
    A×1345 → AW×1345 → AWC×1345
    W×12345 → WC×12345
    C×123456

DT×56 and DA×45 occur in only 2 of the 6 transactions, which is below the 50% minimum support.

Subsumption Checking
Before adding a set X to the current set of closed sets, we need to check whether X is subsumed by some existing closed set.
• Comparing X with every closed set is expensive.
Solution: use a hash function to retrieve only the relevant closed sets.

Hash function
  h(X) = Σ_{T ∈ t(X)} T
The hash key is the sum of the tids in the tidset of the itemset.
• Assumption: itemsets with the same hash key have different supports.
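The search and the hash-based subsumption check can be put together in a compact sketch. This is a simplified reading of the slides and of Theorem 1, using plain tidsets, and is not the authors' implementation; the names charm, extend and is_subsumed are chosen here, and sorting each class by increasing support is an assumption about the ordering f.

```python
def charm(vertical, min_count):
    """Return {closed itemset: support} for a vertical database {item: tids}."""
    closed_by_hash = {}  # sum of tids -> list of (itemset, support)

    def is_subsumed(itemset, tidset):
        # Only closed sets with the same hash key (sum of tids) are compared.
        for other, support in closed_by_hash.get(sum(tidset), []):
            if support == len(tidset) and itemset <= other:
                return True
        return False

    def extend(nodes):
        nodes.sort(key=lambda node: len(node[1]))  # increasing support
        i = 0
        while i < len(nodes):
            xi, ti = nodes[i]
            children = []
            j = i + 1
            while j < len(nodes):
                xj, tj = nodes[j]
                y = ti & tj
                if len(y) < min_count:
                    j += 1
                    continue
                if ti <= tj:
                    # Properties 1 and 2: c(Xi) = c(Xi u Xj), so absorb Xj.
                    xi = xi | xj
                    children = [(cx | xj, ct) for cx, ct in children]
                    if ti == tj:
                        del nodes[j]  # property 1: Xj needs no own branch
                        continue
                else:
                    children.append((xi | xj, y))
                    if ti > tj:
                        del nodes[j]  # property 3: Xj is covered by the child
                        continue
                j += 1
            if children:
                extend(children)
            if not is_subsumed(xi, ti):
                closed_by_hash.setdefault(sum(ti), []).append((xi, len(ti)))
            i += 1

    roots = [(frozenset([item]), frozenset(tids))
             for item, tids in sorted(vertical.items())
             if len(tids) >= min_count]
    extend(roots)
    return {x: s for bucket in closed_by_hash.values() for x, s in bucket}

vertical = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
            "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
result = charm(vertical, min_count=3)
for itemset, support in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print("".join(sorted(itemset)), support)
```

On the example database, CT×1356 and CW×12345 both hash to 15 but have supports 4 and 5, so the support comparison separates them without a subset test, which is the situation the hash-function slide's assumption describes.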
Complexity issues
• Comparing the tidsets of two itemsets becomes time-consuming when the tidsets get very large.
• Keeping all the tids of every itemset in memory needs a lot of space.
Solution: use diffsets.

Diffsets
[Figure: the tidsets t(P), t(X), t(Y), t(PX) and t(PXY), and the diffsets d(PX), d(PY) and d(PXY)]

Diffset and Tidset
Let m(X_i) and m(X_j) denote the number of mismatches in the diffsets d(X_i) and d(X_j): m(X_i) counts the tids in d(X_i) that are not in d(X_j), and m(X_j) counts the tids in d(X_j) that are not in d(X_i).
For example: X_i = D, X_j = T, then d(X_i) = 2456, d(X_j) = 1356, m(X_i) = |(24)| = 2, m(X_j) = |(13)| = 2.
• If m(X_i) = 0 and m(X_j) = 0, then d(X_i) = d(X_j), or equivalently t(X_i) = t(X_j).
• If m(X_i) > 0 and m(X_j) = 0, then d(X_i) ⊃ d(X_j), or equivalently t(X_i) ⊂ t(X_j).
• If m(X_i) = 0 and m(X_j) > 0, then d(X_i) ⊂ d(X_j), or equivalently t(X_i) ⊃ t(X_j).
• If m(X_i) > 0 and m(X_j) > 0, then d(X_i) ≠ d(X_j), or equivalently t(X_i) ≠ t(X_j).

CHARM using diffsets
The first-level nodes carry tidsets; the nodes below them carry diffsets relative to their parent (for example, DT×24 because t(D) minus t(T) is 24).

  {}
    D×2456 → DC×2456
      DT×24
      DA×26
      DW×6 → DWC×6
    T×1356 → TC×1356
      TA×6 → TAC×6 → TAWC×6
      TW×6 → TWC×6
    A×1345 → AW×1345 → AWC×1345
    W×12345 → WC×12345
    C×123456
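The diffsets in the tree above can be reproduced with set differences. The slides do not spell out the formulas, so this sketch assumes the usual diffset definitions: d(PX) = t(P) − t(X) at the first step, d(PXY) = d(PY) − d(PX) below that, and support(PXY) = support(PX) − |d(PXY)|. The helper name mismatches is made up for this example.

```python
# Tidsets of the single items in the example database.
t = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}

def mismatches(d_i, d_j):
    """Return (m_i, m_j): tids found only in d_i, and tids found only in d_j."""
    return len(d_i - d_j), len(d_j - d_i)

# Children of D: diffsets relative to the parent tidset t(D).
d_DT = t["D"] - t["T"]            # {2, 4}
d_DA = t["D"] - t["A"]            # {2, 6}
d_DW = t["D"] - t["W"]            # {6}

# Support falls out of the parent's support and the diffset size.
sup_DT = len(t["D"]) - len(d_DT)  # 2
sup_DW = len(t["D"]) - len(d_DW)  # 3

# One level deeper, only diffsets are needed: d(DTW) = d(DW) - d(DT).
d_DTW = d_DW - d_DT               # {6}
sup_DTW = sup_DT - len(d_DTW)     # 1

# Children of T: TA and TW have identical diffsets, so m = (0, 0)
# and their tidsets are equal.
d_TA = t["T"] - t["A"]            # {6}
d_TW = t["T"] - t["W"]            # {6}
print(mismatches(d_TA, d_TW))     # (0, 0)
print(mismatches(d_DT, d_DW))     # (2, 1)
```

The diffsets here hold one or two tids each, while the corresponding tidsets hold up to three, which hints at the memory saving on dense data.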
Performance study
• Datasets
[Figures: dataset table and performance charts (four slides); the content did not survive text extraction]
Scalability
Running time increases linearly with the number of transactions at a given support.

Memory usage
Using diffsets reduces memory usage by a factor of 50 compared with using tidsets.
[Figure: Memory usage (using diffsets)]

Conclusion
• Advantage of CHARM
  • Faster than other algorithms at low support thresholds
  • Faster than other algorithms on databases with very long closed patterns
• Disadvantage of CHARM
  • Slower than Closet when most of the closed sets are 2-itemsets

Comments
• Strength
  • The ideas in the paper are intuitive.
  • The authors were the first to introduce an efficient data structure (the IT-tree) for closed itemset mining.
  • The authors demonstrated the algorithm on various datasets.
  • The experimental studies are convincing.
• Weakness
  • The algorithm requires converting the database from horizontal format to vertical format.
• Follow-up
  • Closet+ (Wang et al, 2003) beat CHARM one year later.
THANK YOU!
Questions or comments?