

  1. Mining Frequent Patterns, Associations and Correlations (Week 3)

  2. Team Homework Assignment #2
     • Read pp. 285–300 of the textbook.
     • Do Example 6.1. Prepare to present the results of the homework assignment.
     • Due date: beginning of the lecture on Friday, February 18th.

  3. Team Homework Assignment #3
     • Prepare a one-page description of your group project topic.
     • Prepare a presentation using slides.
     • Due date: beginning of the lecture on Friday, February 11th.

  4. [Images: http://www.lucyluvs.com/images/fittedXLpooh.JPG, http://www.mondobirra.org/sfondi/BudLight.sized.jpg]

  5. cell_cycle -> [+]Exp1, [+]Exp2, [+]Exp3, [+]Exp4, support = 52.94% (9 genes)
     apoptosis -> [+]Exp6, [+]Exp7, [+]Exp8, support = 76.47% (13 genes)
     http://www.cnb.uam.es/~pcarmona/assocrules/imag4.JPG

  6. Table 8.3 The substitution matrix of amino acids.
     Figure 8.8 Scoring two potential pairwise alignments, (a) and (b), of amino acids.

  7. Figure 9.1 A sample graph data set.
     Figure 9.2 Frequent graph.

  8. Figure 9.14 A chemical database.

  9. What Is Frequent Pattern Analysis?
     • Frequent pattern: a pattern (a set of items, a subsequence, a substructure, etc.) that occurs frequently in a data set.
     • First proposed by Agrawal, Imielinski, and Swami in 1993, in the context of frequent itemsets and association rule mining.

  10. Why Is Frequent Pattern Mining Important?
      • Discloses an intrinsic and important property of data sets.
      • Forms the foundation for many essential data mining tasks and applications:
        – What products were often purchased together? Beer and diapers?
        – What are the subsequent purchases after buying a PC?
        – What kinds of DNA are sensitive to this new drug?
        – Can we automatically classify web documents?

  11. Topics of Frequent Pattern Mining (1)
      • Based on the kinds of patterns to be mined:
        – Frequent itemset mining
        – Sequential pattern mining
        – Structured pattern mining

  12. Topics of Frequent Pattern Mining (2)
      • Based on the levels of abstraction involved in the rule set:
        – Single-level association rules
        – Multi-level association rules

  13. Topics of Frequent Pattern Mining (3)
      • Based on the number of data dimensions involved in the rule:
        – Single-dimensional association rules
        – Multi-dimensional association rules

  14. Association Rule Mining Process
      • Find all frequent itemsets:
        – Join steps
        – Prune steps
      • Generate "strong" association rules from the frequent itemsets.

  15. Basic Concepts of Frequent Itemsets
      • Let I = {I1, I2, …, Im} be a set of items.
      • Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I.
      • Each transaction is associated with an identifier, called TID.
      • Let A be a set of items.
      • A transaction T is said to contain A if and only if A ⊆ T.
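These definitions translate directly into set operations. A minimal sketch in Python, assuming transactions are represented as sets of item labels (the variable names are illustrative, not from the text):

    # Item universe I and a toy transaction database D, keyed by TID.
    I = {"I1", "I2", "I3", "I4", "I5"}
    D = {
        "T100": {"I1", "I2", "I5"},
        "T200": {"I2", "I4"},
    }

    # Each transaction T is a subset of I, and T contains an itemset A iff A is a subset of T.
    A = {"I1", "I2"}
    print({tid: A <= T for tid, T in D.items()})   # {'T100': True, 'T200': False}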

  16. How to Generate Frequent Itemsets?
      • Suppose the items in Lk-1 are listed in an order.
      • The join step: to find Lk, a set of candidate k-itemsets, Ck, is generated by joining Lk-1 with itself. Let l1 and l2 be itemsets in Lk-1; the itemset formed by joining l1 and l2 is (l1[1], l1[2], …, l1[k-2], l1[k-1], l2[k-1]).
      • The prune step: scan data set D and compare the support count of each candidate in Ck with the minimum support count. Remove candidate itemsets whose support count is less than the minimum support count, yielding Lk.
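A compact sketch of these two steps, assuming each itemset in Lk-1 is stored as a sorted tuple of items and each transaction in D as a set (the function names are illustrative, not from the text):

    def join_step(L_prev, k):
        """Join Lk-1 with itself: merge pairs of itemsets whose first k-2 items agree."""
        C_k = set()
        for l1 in L_prev:
            for l2 in L_prev:
                if l1[:k-2] == l2[:k-2] and l1[k-2] < l2[k-2]:
                    C_k.add(l1 + (l2[k-2],))
        return C_k

    def prune_step(C_k, D, min_sup_count):
        """Scan D, count each candidate's support, and keep those meeting the minimum support."""
        counts = {c: sum(1 for T in D if set(c) <= T) for c in C_k}
        return {c: n for c, n in counts.items() if n >= min_sup_count}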

  17. Apriori Algorithm
      • Initially, scan the DB once to get the frequent 1-itemsets.
      • Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
      • Prune length-(k+1) candidate itemsets with the Apriori property.
        – Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
      • Test the candidates against the DB.
      • Terminate when no frequent or candidate set can be generated.
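Putting the pieces together, a sketch of the full Apriori loop under the same assumptions (sorted item tuples, transactions as sets); this is a simplified reading of the steps on this slide, not the textbook's pseudocode. The usage example re-creates the AllElectronics data of Table 5.1 with a minimum support count of 2:

    from itertools import combinations

    def apriori(D, min_sup_count):
        """Return all frequent itemsets of D as {sorted tuple of items: support count}."""
        # Initial scan: frequent 1-itemsets.
        items = sorted({i for T in D for i in T})
        counts = {(i,): sum(1 for T in D if i in T) for i in items}
        L = {c: n for c, n in counts.items() if n >= min_sup_count}
        frequent, k = dict(L), 2
        while L:
            prev = sorted(L)
            C_k = set()
            for a in prev:
                for b in prev:
                    # Join: merge length-(k-1) itemsets that share their first k-2 items.
                    if a[:k-2] == b[:k-2] and a[k-2] < b[k-2]:
                        cand = a + (b[k-2],)
                        # Apriori property: every (k-1)-subset of cand must be frequent.
                        if all(sub in L for sub in combinations(cand, k - 1)):
                            C_k.add(cand)
            # Test the candidates against D and keep those meeting the minimum support.
            counts = {c: sum(1 for T in D if set(c) <= T) for c in C_k}
            L = {c: n for c, n in counts.items() if n >= min_sup_count}
            frequent.update(L)
            k += 1
        return frequent

    # Usage: the AllElectronics transactions of Table 5.1, minimum support count = 2.
    D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
         {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
         {"I1", "I2", "I3"}]
    print(apriori(D, 2))   # includes ('I1', 'I2', 'I5') with support count 2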

  18. Figure 5.4 The Apriori algorithm for discovering frequent itemsets for mining Boolean association rules.

  19. Table 5.1 Transactional data for an AllElectronics branch.
      TID     List of item IDs
      T100    I1, I2, I5
      T200    I2, I4
      T300    I2, I3
      T400    I1, I2, I4
      T500    I1, I3
      T600    I2, I3
      T700    I1, I3
      T800    I1, I2, I3, I5
      T900    I1, I2, I3

  20. Minimum support count = 2.
      Figure 5.2 Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.

  21. Generating Strong Association Rules
      • From the frequent itemsets:
        – For each frequent itemset l, generate all nonempty subsets of l.
        – For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
      • Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong.
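A sketch of this rule-generation step, assuming the frequent itemsets and their support counts come from an Apriori-style pass as in the sketch on slide 17 (so every subset of a frequent itemset is also present in the dictionary); strong_rules is an illustrative name, not from the text:

    from itertools import combinations

    def strong_rules(frequent, min_conf):
        """Generate strong rules s => (l - s) from {itemset tuple: support count}."""
        rules = []
        for l, count_l in frequent.items():
            if len(l) < 2:
                continue
            for r in range(1, len(l)):                  # all nonempty proper subsets s of l
                for s in combinations(l, r):
                    conf = count_l / frequent[s]        # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        rules.append((s, tuple(i for i in l if i not in s), conf))
        return rules

Applied to the frequent itemset {I1, I2, I5} and its subsets' counts from Table 5.1 with min_conf = 0.7, this returns exactly the three 100%-confidence rules listed in Example 5.4 (slide 24).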

  22. Support
      • The rule A ⇒ B holds in the transaction set D with support s.
      • support, s: the probability that a transaction contains both A and B.
      • support(A ⇒ B) = P(A ∪ B)
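As a concrete value from the AllElectronics data of Table 5.1 (slide 19): the itemset {I1, I2} appears in 4 of the 9 transactions, so support(I1 ⇒ I2) = P(I1 ∪ I2) = 4/9 ≈ 44.4%.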

  23. Confidence
      • The rule A ⇒ B has confidence c in the transaction set D.
      • confidence, c: the conditional probability that a transaction containing A also contains B.
      • confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)
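Continuing the same example from Table 5.1: I1 appears in 6 transactions and {I1, I2} in 4, so confidence(I1 ⇒ I2) = support_count(I1 ∪ I2) / support_count(I1) = 4/6 ≈ 66.7%.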

  24. Generating Association Rules from Frequent Itemsets
      • Example 5.4: Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? If the minimum confidence threshold is 70%, which rules are strong?
        – I1 ^ I2 -> I5, confidence = 2/4 = 50%
        – I1 ^ I5 -> I2, confidence = 2/2 = 100%
        – I2 ^ I5 -> I1, confidence = 2/2 = 100%
        – I1 -> I2 ^ I5, confidence = 2/6 = 33%
        – I2 -> I1 ^ I5, confidence = 2/7 = 29%
        – I5 -> I1 ^ I2, confidence = 2/2 = 100%
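These confidences can be checked mechanically against Table 5.1. A small sketch, with support_count as an illustrative helper (not from the text):

    from itertools import combinations

    D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
         {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
         {"I1", "I2", "I3"}]

    def support_count(itemset):
        return sum(1 for T in D if set(itemset) <= T)

    l = {"I1", "I2", "I5"}
    for r in (1, 2):
        for s in combinations(sorted(l), r):
            conf = support_count(l) / support_count(s)
            print(f"{set(s)} -> {l - set(s)}: confidence = {conf:.0%}")
    # Prints 33%, 29%, 100% for the single-item antecedents and 50%, 100%, 100%
    # for the two-item antecedents, matching the list above.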

  25. Exercise
      5.3 A database has five transactions. Let min_sup = 60% and min_conf = 80%.
      TID     Items_bought
      T100    {M, O, N, K, E, Y}
      T200    {D, O, N, K, E, Y}
      T300    {M, A, K, E}
      T400    {M, U, C, K, Y}
      T500    {C, O, O, K, I, E}
      (a) Find all frequent itemsets.
      (b) List all of the strong association rules (with support s and confidence c) matching the following meta-rule, where X is a variable representing customers and item_i denotes variables representing items (e.g., "A", "B", etc.):
          ∀x ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3)  [s, c]

  26. Challenges of Frequent Pattern Mining
      • Challenges:
        – Multiple scans of the transaction database
        – Huge number of candidates
        – Tedious workload of support counting for candidates
      • Improving Apriori:
        – Reduce passes of transaction database scans
        – Shrink the number of candidates
        – Facilitate support counting of candidates

  27. Advanced Methods for Mining Frequent Itemsets
      • Mining frequent itemsets without candidate generation
        – Frequent-pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00)
      • Mining frequent itemsets using the vertical data format
        – Vertical data format approach (ECLAT: Zaki @IEEE-TKDE'00)
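To make the vertical data format concrete: each item is mapped to the set of TIDs of the transactions containing it, and the support count of an itemset is then the size of the intersection of its items' TID sets, with no further scans of the database. A minimal sketch using the Table 5.1 data (the layout is illustrative, not Zaki's actual implementation):

    # Horizontal format (Table 5.1): TID -> set of items.
    horizontal = {
        "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
        "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
        "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"}, "T900": {"I1", "I2", "I3"},
    }

    # Vertical format: item -> set of TIDs that contain it.
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)

    # Support counting becomes TID-set intersection.
    print(len(vertical["I1"] & vertical["I2"]))   # 4: the support count of {I1, I2}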

  28. Mining Various Kinds of Association Rules
      • Mining multilevel association rules
      • Mining multidimensional association rules
