  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms
  dataminingbook.info
  Mohammed J. Zaki (1), Wagner Meira Jr. (2)
  (1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
  (2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  Chapter 9: Summarizing Itemsets
  Zaki & Meira Jr. (RPI and UFMG), Data Mining and Machine Learning, Chapter 9: Summarizing Itemsets, 1 / 23

  2. Maximal Frequent Itemsets

  Given a binary database D ⊆ T × I, over the tids T and items I, let F denote the set of all frequent itemsets, that is,

      F = { X | X ⊆ I and sup(X) ≥ minsup }

  A frequent itemset X ∈ F is called maximal if it has no frequent supersets. Let M be the set of all maximal frequent itemsets, given as

      M = { X | X ∈ F and ∄ Y ⊃ X such that Y ∈ F }

  The set M is a condensed representation of the set of all frequent itemsets F, because we can determine whether any itemset X is frequent using M alone: if there exists a maximal itemset Z such that X ⊆ Z, then X must be frequent; otherwise X cannot be frequent.
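The frequency check via M can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names (`M`, `is_frequent`) and example maximal itemsets for a small hypothetical database; it is not code from the book.

```python
# Condensed representation: M is a (hypothetical) set of maximal frequent
# itemsets, each stored as a Python set of items.
M = [set("ABDE"), set("BCE")]

def is_frequent(X):
    """X is frequent iff it is contained in some maximal itemset Z in M."""
    return any(set(X) <= Z for Z in M)

print(is_frequent("ADE"))  # True: ADE is a subset of ABDE
print(is_frequent("ACD"))  # False: contained in no maximal itemset
```

Note that M tells us only whether X is frequent, not its support; recovering supports requires the closed itemsets discussed later in the chapter.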

  3. An Example Database

  Transaction database:

      Tid  Itemset
      1    ABDE
      2    BCE
      3    ABDE
      4    ABCE
      5    ABCDE
      6    BCD

  Frequent itemsets (minsup = 3):

      sup  Itemsets
      6    B
      5    E, BE
      4    A, C, D, AB, AE, BC, BD, ABE
      3    AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
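The frequent-itemset table above can be reproduced by brute-force enumeration, which is fine for a six-transaction example. The following sketch uses variable names of my own choosing (`db`, `sup`, `frequent`); it is not the book's mining algorithm, just a direct check of the definition.

```python
from itertools import combinations

# The example transaction database, as tid -> string of items.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
minsup = 3
items = sorted(set("".join(db.values())))  # ['A', 'B', 'C', 'D', 'E']

def sup(X):
    """Support of itemset X: number of transactions containing every item of X."""
    return sum(1 for t in db.values() if set(X) <= set(t))

# Enumerate every candidate itemset and keep the frequent ones.
frequent = {}
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        s = sup(X)
        if s >= minsup:
            frequent["".join(X)] = s

print(sorted(frequent.items(), key=lambda kv: -kv[1]))
```

Running this yields the 19 frequent itemsets of the table, from B with support 6 down to ABDE with support 3.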

  4. Closed Frequent Itemsets

  Given a tidset T ⊆ T and an itemset X ⊆ I, define

      t(X) = { t ∈ T | t contains X }
      i(T) = { x ∈ I | ∀ t ∈ T, t contains x }
      c(X) = i ∘ t(X) = i(t(X))

  The function c is a closure operator, and an itemset X is called closed if c(X) = X. It follows that t(c(X)) = t(X). The set of all closed frequent itemsets is thus defined as

      C = { X | X ∈ F and ∄ Y ⊃ X such that sup(X) = sup(Y) }

  That is, X is closed if all supersets of X have strictly less support: sup(X) > sup(Y) for all Y ⊃ X. The set of all closed frequent itemsets C is a condensed representation, as we can determine both whether an itemset X is frequent and the exact support of X using C alone.
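The operators t, i, and c are easy to compute directly on the example database. The sketch below uses my own function names mirroring the slide's notation; it only illustrates the definitions, not an efficient algorithm.

```python
# Example database from the previous slide.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    """Tidset of X: the transactions that contain every item of X."""
    return {tid for tid, its in db.items() if set(X) <= set(its)}

def i(T):
    """Itemset common to all transactions in tidset T."""
    tsets = [set(db[tid]) for tid in T]
    return set.intersection(*tsets) if tsets else set()

def c(X):
    """Closure operator c = i composed with t."""
    return i(t(X))

print(sorted(c("AD")))  # ['A', 'B', 'D', 'E']: AD is not closed, c(AD) = ABDE
print(sorted(c("AB")))  # ['A', 'B', 'E']: c(AB) = ABE
```

For instance, t(AD) = {1, 3, 5}, and intersecting those three transactions yields ABDE, so c(AD) = ABDE and AD is not closed.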

  5. Minimal Generators

  A frequent itemset X is a minimal generator if it has no subsets with the same support:

      G = { X | X ∈ F and ∄ Y ⊂ X such that sup(X) = sup(Y) }

  In other words, all subsets of X have strictly higher support, that is, sup(X) < sup(Y) for all Y ⊂ X. Given an equivalence class of itemsets that have the same tidset, the closed itemset is the unique maximum element of the class, whereas the minimal generators are the minimal elements of the class.
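Both definitions can be checked by brute force on the example database: closed itemsets have no strict superset with equal support, minimal generators no strict subset with equal support. The names below (`F`, `closed`, `generators`) are my own; this is a definitional sketch, not the chapter's mining algorithms.

```python
from itertools import combinations

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
minsup = 3
items = sorted(set("".join(db.values())))

def sup(X):
    return sum(1 for t in db.values() if set(X) <= set(t))

# All frequent itemsets with their supports.
F = {"".join(X): sup(X)
     for k in range(1, len(items) + 1)
     for X in combinations(items, k)
     if sup(X) >= minsup}

# Closed: no strict frequent superset with the same support.
closed = {X for X, s in F.items()
          if not any(set(X) < set(Y) and F[Y] == s for Y in F)}

# Minimal generators: no strict subset with the same support.
generators = {X for X, s in F.items()
              if not any(set(Y) < set(X) and F[Y] == s for Y in F)}

print(sorted(closed), sorted(generators))
```

On this database the closed sets are B, BE, ABE, BC, BD, BCE, ABDE, one per tidset equivalence class, while the generators are the five items plus AD, CE, DE.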

  6. Frequent Itemsets: Closed, Minimal Generators and Maximal

  [Figure: the lattice of frequent itemsets for the example database, from ∅ down to ABDE, with each itemset annotated by its tidset (e.g., t(B) = 123456, t(ABDE) = 135). Closed itemsets are boxed and shaded, maximal itemsets are double boxed, and minimal generators are boxed.]

  7. Mining Maximal Frequent Itemsets: GenMax Algorithm

  Mining maximal itemsets requires additional steps beyond simply determining the frequent itemsets. Assuming that the set of maximal frequent itemsets is initially empty, that is, M = ∅, each time we generate a new frequent itemset X, we have to perform the following maximality checks:

  Subset check: ∄ Y ∈ M such that X ⊂ Y. If such a Y exists, then clearly X is not maximal. Otherwise, we add X to M as a potentially maximal itemset.

  Superset check: ∄ Y ∈ M such that Y ⊂ X. If such a Y exists, then Y cannot be maximal, and we have to remove it from M.
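The two maximality checks can be sketched as a single update function. The name `update_maximal` and the driver loop are my own; the logic is just the subset and superset checks stated above.

```python
def update_maximal(M, X):
    """Apply the maximality checks to a newly generated frequent itemset X.
    M is a list of candidate maximal itemsets, each a Python set."""
    X = set(X)
    # Subset check: if X is contained in some Y in M, X is not maximal.
    if any(X <= Y for Y in M):
        return M
    # Superset check: any Y strictly contained in X cannot be maximal.
    M = [Y for Y in M if not Y < X]
    M.append(X)
    return M

# Feed in frequent itemsets in some generation order (hypothetical example).
M = []
for X in ["AB", "ABE", "BCE", "ABDE"]:
    M = update_maximal(M, X)
print([sorted(Y) for Y in M])  # [['B', 'C', 'E'], ['A', 'B', 'D', 'E']]
```

Note that the subset check scans all of M for each new itemset, which is why GenMax works hard to prune non-maximal branches before they generate candidates at all.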

  8. GenMax Algorithm: Maximal Itemsets

  GenMax is based on dEclat, i.e., it uses diffset intersections for support computation. The initial call takes as input the set of frequent items along with their tidsets, ⟨i, t(i)⟩, and the initially empty set of maximal itemsets, M. Given a set of itemset–tidset pairs, called IT-pairs, of the form ⟨X, t(X)⟩, the recursive GenMax method works as follows. If the union of all the itemsets, Y = ⋃ X_i, is already subsumed by (contained in) some maximal pattern Z ∈ M, then no maximal itemset can be generated from the current branch, and it is pruned. Otherwise, we intersect each IT-pair ⟨X_i, t(X_i)⟩ with all the other IT-pairs ⟨X_j, t(X_j)⟩, with j > i, to generate new candidates X_ij, which are added to the IT-pair set P_i. If P_i is not empty, a recursive call to GenMax is made to find other potentially frequent extensions of X_i. On the other hand, if P_i is empty, then X_i cannot be extended, and it is potentially maximal. In this case, we add X_i to the set M, provided that X_i is not contained in any previously added maximal set Z ∈ M.

  9. GenMax Algorithm

  // Initial call: M ← ∅, P ← { ⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup }
  GenMax (P, minsup, M):
      Y ← ⋃ X_i
      if ∃ Z ∈ M such that Y ⊆ Z then
          return  // prune entire branch
      foreach ⟨X_i, t(X_i)⟩ ∈ P do
          P_i ← ∅
          foreach ⟨X_j, t(X_j)⟩ ∈ P, with j > i do
              X_ij ← X_i ∪ X_j
              t(X_ij) ← t(X_i) ∩ t(X_j)
              if sup(X_ij) ≥ minsup then
                  P_i ← P_i ∪ { ⟨X_ij, t(X_ij)⟩ }
          if P_i ≠ ∅ then
              GenMax (P_i, minsup, M)
          else if ∄ Z ∈ M such that X_i ⊆ Z then
              M ← M ∪ { X_i }  // add X_i to maximal set
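The pseudocode above translates almost line for line into Python. This is a sketch using plain tidset intersections (the slide notes that the actual GenMax uses diffsets for efficiency), with variable names of my own choosing.

```python
def genmax(P, minsup, M):
    """Recursive GenMax. P is a list of IT-pairs (itemset, tidset),
    both Python sets; M collects the maximal frequent itemsets."""
    Y = set().union(*(X for X, _ in P))
    if any(Y <= Z for Z in M):          # branch subsumed: prune it entirely
        return
    for i, (Xi, Ti) in enumerate(P):
        Pi = []
        for Xj, Tj in P[i + 1:]:
            Tij = Ti & Tj               # t(X_ij) = t(X_i) ∩ t(X_j)
            if len(Tij) >= minsup:
                Pi.append((Xi | Xj, Tij))
        if Pi:
            genmax(Pi, minsup, M)
        elif not any(Xi <= Z for Z in M):
            M.append(Xi)                # X_i is maximal

# Initial call on the example database.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
minsup = 3
items = sorted(set("".join(db.values())))
P = [({i}, {t for t, s in db.items() if i in s}) for i in items]
P = [(X, T) for X, T in P if len(T) >= minsup]
M = []
genmax(P, minsup, M)
print(["".join(sorted(X)) for X in M])  # ['ABDE', 'BCE']
```

On the example database this finds exactly the two maximal itemsets ABDE and BCE, matching the lattice on slide 6.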

  10. Mining Maximal Frequent Itemsets

  [Figure: the GenMax search tree over the example database, starting from the IT-pairs A(1345), B(123456), C(2456), D(1356), E(12345) and recursing through the branches P_A, P_B, P_C, P_D, reaching ABDE(135) in the branch P_ABD.]

  11. Mining Closed Frequent Itemsets: Charm Algorithm

  Mining closed frequent itemsets requires that we perform closure checks, that is, whether X = c(X). Direct closure checking can be very expensive. Given a collection of IT-pairs { ⟨X_i, t(X_i)⟩ }, Charm uses the following three properties:

  Property (1): If t(X_i) = t(X_j), then c(X_i) = c(X_j) = c(X_i ∪ X_j), which implies that we can replace every occurrence of X_i with X_i ∪ X_j and prune the branch under X_j, because its closure is identical to the closure of X_i ∪ X_j.

  Property (2): If t(X_i) ⊂ t(X_j), then c(X_i) ≠ c(X_j) but c(X_i) = c(X_i ∪ X_j), which means that we can replace every occurrence of X_i with X_i ∪ X_j, but we cannot prune X_j because it generates a different closure. Note that if t(X_i) ⊃ t(X_j), we simply interchange the roles of X_i and X_j.

  Property (3): If t(X_i) ≠ t(X_j), then c(X_i) ≠ c(X_j) ≠ c(X_i ∪ X_j). In this case we cannot remove either X_i or X_j, as each of them generates a different closure.

  12. Charm Algorithm: Closed Itemsets

  // Initial call: C ← ∅, P ← { ⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup }
  Charm (P, minsup, C):
      Sort P in increasing order of support (i.e., by increasing |t(X_i)|)
      foreach ⟨X_i, t(X_i)⟩ ∈ P do
          P_i ← ∅
          foreach ⟨X_j, t(X_j)⟩ ∈ P, with j > i do
              X_ij ← X_i ∪ X_j
              t(X_ij) ← t(X_i) ∩ t(X_j)
              if sup(X_ij) ≥ minsup then
                  if t(X_i) = t(X_j) then  // Property 1
                      Replace X_i with X_ij in P and P_i
                      Remove ⟨X_j, t(X_j)⟩ from P
                  else if t(X_i) ⊂ t(X_j) then  // Property 2
                      Replace X_i with X_ij in P and P_i
                  else  // Property 3
                      P_i ← P_i ∪ { ⟨X_ij, t(X_ij)⟩ }
          if P_i ≠ ∅ then
              Charm (P_i, minsup, C)
          if ∄ Z ∈ C such that X_i ⊆ Z and t(X_i) = t(Z) then
              C ← C ∪ { X_i }  // add X_i to closed set
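A Python sketch of Charm follows. It represents IT-pairs as mutable two-element lists so that Property 1 can prune a sibling in place, and it defers the "replace X_i in P_i" step by re-unioning the grown prefix into P_i after the inner loop (equivalent, since X_i only gains items). Note that after sorting by increasing support, t(X_i) ⊃ t(X_j) cannot occur for j > i, so only the t(X_i) ⊂ t(X_j) case of Property 2 is needed. All names here are my own; treat this as an illustration, not the reference implementation.

```python
def charm(P, minsup, C):
    """Recursive Charm. P: list of [itemset, tidset] pairs (sets);
    C: list of (closed itemset, tidset) results."""
    P.sort(key=lambda p: len(p[1]))            # increasing support
    for i in range(len(P)):
        Xi, Ti = P[i]
        if Xi is None:                         # pruned by Property 1
            continue
        Pi = []
        for j in range(i + 1, len(P)):
            if P[j][0] is None:
                continue
            Xj, Tj = P[j]
            Tij = Ti & Tj
            if len(Tij) < minsup:
                continue
            if Ti == Tj:                       # Property 1: same closure
                Xi = Xi | Xj
                P[j][0] = None                 # prune branch under X_j
            elif Ti < Tj:                      # Property 2: c(Xi) = c(Xi ∪ Xj)
                Xi = Xi | Xj
            else:                              # Property 3: keep as candidate
                Pi.append([Xi | Xj, Tij])
        Pi = [[X | Xi, T] for X, T in Pi]      # propagate X_i's replacements
        if Pi:
            charm(Pi, minsup, C)
        if not any(Xi <= Z and Ti == TZ for Z, TZ in C):
            C.append((Xi, Ti))                 # X_i is closed

# Initial call on the example database.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
minsup = 3
items = sorted(set("".join(db.values())))
P = [[{i}, {t for t, s in db.items() if i in s}] for i in items]
P = [p for p in P if len(p[1]) >= minsup]
C = []
charm(P, minsup, C)
print(sorted("".join(sorted(X)) for X, _ in C))
```

On the example database this yields the seven closed itemsets B, BE, BC, BD, ABE, BCE, and ABDE, one per tidset equivalence class.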

  13. Mining Frequent Closed Itemsets: Charm Process

  [Figure: the first step of Charm on the example database. The IT-pairs are sorted by increasing support as A(1345), C(2456), D(1356), E(12345), B(123456); in the branch P_A, A is replaced first by AE and then by AEB via Property 2, and its extension AD(135) grows into ADE and then ADEB.]

  14. Mining Frequent Closed Itemsets: Charm

  [Figure: the completed Charm search tree. Property 2 replaces A with AEB, C with CB, D with DB, and E with EB; the branches P_A, P_C, and P_D then yield ADEB(135), CEB(245), and DEB(135).]
