Introduction to Data Mining: Frequent Pattern Mining and Association Analysis


  1. Introduction to Data Mining: Frequent Pattern Mining and Association Analysis. Li Xiong. Slide credits: Jiawei Han and Micheline Kamber; George Kollios.

  2. Mining Frequent Patterns, Association and Correlations
     - Basic concepts
     - Frequent itemset mining methods
     - Mining association rules
     - Association mining to correlation analysis
     - Constraint-based association mining

  3. What Is Frequent Pattern Analysis?
     - Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
       - Frequent sequential pattern
       - Frequent structured pattern
     - Motivation: finding inherent regularities in data
       - What products were often purchased together? Beer and diapers?!
       - What are the subsequent purchases after buying a PC?
       - What kinds of DNA are sensitive to this new drug?
     - Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

  4. Frequent Itemset Mining
     - Frequent itemset mining: finding frequent sets of items in a transaction data set
     - Agrawal, Imielinski, and Swami, SIGMOD 1993
     - SIGMOD Test of Time Award 2003: "This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community ... even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper."
     - R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93.
     - Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.

  5. Basic Concepts: Transaction Dataset

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  6. Record Data
     - Data that consists of a collection of records, each of which consists of a fixed set of attributes
     - Points in a multi-dimensional space, where each dimension represents a distinct attribute
     - Represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

     Tid | Refund | Marital Status | Taxable Income | Cheat
     ----+--------+----------------+----------------+------
     1   | Yes    | Single         | 125K           | No
     2   | No     | Married        | 100K           | No
     3   | No     | Single         | 70K            | No
     4   | Yes    | Married        | 120K           | No
     5   | No     | Divorced       | 95K            | Yes
     6   | No     | Married        | 60K            | No
     7   | Yes    | Divorced       | 220K           | No
     8   | No     | Single         | 85K            | Yes
     9   | No     | Married        | 75K            | No
     10  | No     | Single         | 90K            | Yes
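As a small illustrative sketch (not part of the slides), such record data can be held as an m-by-n table in Python, one row per object and one column per attribute; the rows below copy the first entries of the table above.

    # A minimal sketch: record data as an m-by-n structure.
    attributes = ["Tid", "Refund", "Marital Status", "Taxable Income", "Cheat"]
    records = [
        [1, "Yes", "Single",  125_000, "No"],
        [2, "No",  "Married", 100_000, "No"],
        [3, "No",  "Single",   70_000, "No"],   # ... remaining rows omitted
    ]
    m, n = len(records), len(attributes)         # m objects, n attributes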

  7. Transaction Data
     - A special type of record data, where each record (transaction) involves a set of items.
     - For example, the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

     TID | Items
     ----+--------------------------
     1   | Bread, Coke, Milk
     2   | Beer, Bread
     3   | Beer, Coke, Diaper, Milk
     4   | Beer, Bread, Diaper, Milk
     5   | Coke, Diaper, Milk
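A minimal sketch of the same idea in Python (data copied from the table above): each transaction is a set of items keyed by its TID, and an itemset's support count is the number of transactions that contain it.

    # Transaction data as a mapping from TID to a set of items.
    transactions = {
        1: {"Bread", "Coke", "Milk"},
        2: {"Beer", "Bread"},
        3: {"Beer", "Coke", "Diaper", "Milk"},
        4: {"Beer", "Bread", "Diaper", "Milk"},
        5: {"Coke", "Diaper", "Milk"},
    }
    # Support count of {Diaper, Milk} = number of transactions containing both.
    print(sum(1 for t in transactions.values() if {"Diaper", "Milk"} <= t))   # 3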

  8. Document Data
     - Each document becomes a 'term' vector:
       - each term is a component (attribute) of the vector,
       - the value of each component is the number of times the corresponding term occurs in the document.

     Document   | team | coach | play | ball | score | game | win | lost | timeout | season
     -----------+------+-------+------+------+-------+------+-----+------+---------+-------
     Document 1 |  3   |  0    |  5   |  0   |  2    |  6   |  0  |  2   |  0      |  2
     Document 2 |  0   |  7    |  0   |  2   |  1    |  0   |  0  |  3   |  0      |  0
     Document 3 |  0   |  1    |  0   |  0   |  1    |  2   |  2  |  0   |  3      |  0
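As an illustrative sketch (the sample sentence below is made up, not from the slides), a document can be turned into such a term vector by counting how often each vocabulary term occurs in it.

    from collections import Counter

    # Count occurrences of each vocabulary term in one (made-up) document.
    vocabulary = ["team", "coach", "play", "ball", "score", "game",
                  "win", "lost", "timeout", "season"]
    doc = "the team lost the game after a timeout late in the season"
    counts = Counter(doc.lower().split())
    term_vector = [counts.get(term, 0) for term in vocabulary]
    print(term_vector)   # [1, 0, 0, 0, 0, 1, 0, 1, 1, 1]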

  9. Basic Concepts: Transaction Dataset

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  10. Basic Concepts: Frequent Patterns and Association Rules
     - Itemset: X = {x1, ..., xk} (k-itemset)

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  11. Basic Concepts: Frequent Patterns and Association Rules
     - Itemset: X = {x1, ..., xk} (k-itemset)
     - Frequent itemset: X with minimum support count
     - Support count (absolute support): count of transactions containing X

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  12. Basic Concepts: Frequent Patterns and Association Rules
     - Itemset: X = {x1, ..., xk} (k-itemset)
     - Frequent itemset: X with minimum support count
     - Support count (absolute support): count of transactions containing X
     - Association rule: A => B with minimum support and confidence
       - Support: probability that a transaction contains both A and B, s = P(A ∪ B)
       - Confidence: conditional probability that a transaction having A also contains B, c = P(B | A)

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

     [Figure: two overlapping circles, "Customer buys beer" and "Customer buys diaper"; the intersection "Customer buys both" corresponds to transactions containing A ∪ B.]

  13. Illustration of Frequent Itemsets and Association Rules

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

     - Frequent itemsets (minimum support count = 3)?
     - Association rules (minimum support = 50%, minimum confidence = 50%)?

  14. Illustration of Frequent Itemsets and Association Rules

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

     - Frequent itemsets (minimum support count = 3): {A:3, B:3, D:4, E:3, AD:3}
     - Association rules (minimum support = 50%, minimum confidence = 50%):
       A => D (support 60%, confidence 100%)
       D => A (support 60%, confidence 75%)
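The numbers above can be checked with a small Python sketch (toy data copied from the table; only the rule A => D is shown here).

    # Verify support and confidence of the rule A => D on the toy dataset.
    transactions = [
        {"A", "B", "D"},            # 10
        {"A", "C", "D"},            # 20
        {"A", "D", "E"},            # 30
        {"B", "E", "F"},            # 40
        {"B", "C", "D", "E", "F"},  # 50
    ]
    n = len(transactions)
    count_A  = sum(1 for t in transactions if {"A"} <= t)        # 3
    count_AD = sum(1 for t in transactions if {"A", "D"} <= t)   # 3
    support    = count_AD / n          # 3/5 = 60%
    confidence = count_AD / count_A    # 3/3 = 100%
    print(f"A => D: support={support:.0%}, confidence={confidence:.0%}")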

  15. Mining Frequent Patterns, Association and Correlations
     - Basic concepts
     - Frequent itemset mining methods
     - Mining association rules
     - Association mining to correlation analysis
     - Constraint-based association mining

  16. Scalable Methods for Mining Frequent Patterns
     - Frequent itemset mining methods
       - Apriori (Agrawal & Srikant @VLDB'94) and variations
       - Frequent pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD'00)
     - Closed and maximal patterns and their mining methods
     - FIMI Workshop and implementation repository

  17. Frequent Itemset Mining
     - Brute force approach
     - Itemset: X = {x1, ..., xk} (k-itemset)
     - Frequent itemset: X with minimum support count

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F

  18. Frequent Itemset Mining
     - Brute force approach:
       - Set enumeration tree for all possible itemsets
       - Tree search (a brute-force counting sketch follows below)

     Transaction-id | Items bought
     ---------------+----------------
     10             | A, B, D
     20             | A, C, D
     30             | A, D, E
     40             | B, E, F
     50             | B, C, D, E, F
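A minimal brute-force sketch (illustrative, not the slide's tree-search implementation): enumerate every possible itemset over the observed items and count its support. The cost is exponential in the number of distinct items, which is what motivates Apriori's pruning.

    from itertools import combinations

    transactions = [
        {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
        {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
    ]
    min_count = 3
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):          # all possible k-itemsets
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                frequent[candidate] = count
    print(frequent)
    # {('A',): 3, ('B',): 3, ('D',): 4, ('E',): 3, ('A', 'D'): 3}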

  19. Apriori: the Apriori Property
     - The Apriori property of frequent patterns: any nonempty subset of a frequent itemset must be frequent
       - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
     - Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested! (see the sketch below)
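The pruning test amounts to a one-line check in Python; the frequent 2-itemsets below are assumed purely for illustration.

    from itertools import combinations

    # Prune a candidate without scanning the DB if any (k-1)-subset is not frequent.
    def has_infrequent_subset(candidate, frequent_prev):
        """candidate: sorted tuple of items; frequent_prev: set of frequent (k-1)-itemsets."""
        k = len(candidate)
        return any(sub not in frequent_prev for sub in combinations(candidate, k - 1))

    # Assumed: {beer, diaper} turned out infrequent, so it is absent from L2.
    L2 = {("diaper", "nuts"), ("beer", "nuts")}
    print(has_infrequent_subset(("beer", "diaper", "nuts"), L2))   # True -> prune the 3-itemset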

  20. Apriori: Level-Wise Search Method
     - Level-wise search method (BFS):
       - Initially, scan the DB once to get the frequent 1-itemsets
       - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
       - Test the candidates against the DB
       - Terminate when no frequent or candidate set can be generated

  21. The Apriori Algorithm
     Pseudo-code (Ck: candidate k-itemsets, Lk: frequent k-itemsets):

       L1 = frequent 1-itemsets;
       for (k = 2; Lk-1 != ∅; k++) {
           Ck = generate candidate set from Lk-1;
           for each transaction t in database
               find all candidates in Ck that are subsets of t and increment their count;
           Lk = candidates in Ck with min_support;
       }
       return ∪k Lk;
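A minimal Python sketch of this pseudo-code (candidate generation follows the self-join/prune scheme detailed on slide 24). Run on the earlier A-F toy dataset with a minimum support count of 3, it reproduces the frequent itemsets from slide 14.

    from itertools import combinations
    from collections import defaultdict

    def generate_candidates(prev_frequent, k):
        """Self-join sorted (k-1)-itemsets on their first k-2 items, then prune."""
        prev = set(prev_frequent)
        candidates = set()
        for a in prev_frequent:
            for b in prev_frequent:
                if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                    cand = a + (b[k - 2],)
                    # Apriori pruning: every (k-1)-subset must already be frequent
                    if all(sub in prev for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        return candidates

    def apriori(transactions, min_count):
        counts = defaultdict(int)
        for t in transactions:                     # first scan: frequent 1-itemsets
            for item in t:
                counts[(item,)] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_count}
        frequent = dict(Lk)
        k = 2
        while Lk:
            Ck = generate_candidates(sorted(Lk), k)
            counts = defaultdict(int)
            for t in transactions:                 # one DB scan per level
                for cand in Ck:
                    if set(cand) <= t:
                        counts[cand] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_count}
            frequent.update(Lk)
            k += 1
        return frequent

    transactions = [
        {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
        {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
    ]
    print(sorted(apriori(transactions, min_count=3).items()))
    # [(('A',), 3), (('A', 'D'), 3), (('B',), 3), (('D',), 4), (('E',), 3)]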

  22. The Apriori Algorithm: An Example (Sup_min = 2)

     Transaction DB:
     Tid | Items
     ----+-----------
     10  | A, C, D
     20  | B, C, E
     30  | A, B, C, E
     40  | B, E

     1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
     L1: {A}:2, {B}:3, {C}:3, {E}:3

     C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
     2nd scan -> C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
     L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

     C3 (from L2): {B,C,E}
     3rd scan -> {B,C,E}:2
     L3: {B,C,E}:2

  23. Important Details of Apriori
     - How to generate candidate sets?
     - How to count supports for candidate sets?

  24. Candidate Set Generation
     Ck = generate candidate set from Lk-1:
     - Step 1: self-joining Lk-1: assuming items and itemsets are sorted in order, two (k-1)-itemsets are joinable only if their first k-2 items are in common
     - Step 2: pruning: prune a candidate if it has an infrequent subset

     Example: generate C4 from L3 = {abc, abd, acd, ace, bcd}
     - Step 1: self-joining L3*L3 gives abcd (from abc and abd) and acde (from acd and ace)
     - Step 2: pruning removes acde because ade is not in L3
     - C4 = {abcd}
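The two steps map directly onto a couple of set comprehensions; this minimal, self-contained sketch reproduces the C4 example above.

    from itertools import combinations

    # Generate C4 from L3 = {abc, abd, acd, ace, bcd}.
    L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
    L3_set = set(L3)
    k = 4

    # Step 1: self-join -- two sorted (k-1)-itemsets join if they share their first k-2 items.
    joined = {x + (y[k - 2],)
              for x in L3 for y in L3
              if x[:k - 2] == y[:k - 2] and x[k - 2] < y[k - 2]}
    print(sorted(joined))   # [('a','b','c','d'), ('a','c','d','e')]

    # Step 2: prune -- drop a candidate if any of its (k-1)-subsets is not in L3.
    C4 = {c for c in joined
          if all(sub in L3_set for sub in combinations(c, k - 1))}
    print(sorted(C4))       # [('a','b','c','d')]  -- acde removed: ade is not in L3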

  25. How to Count Supports of Candidates?
       for each transaction t in database
           find all candidates in Ck that are subsets of t and increment their count;
     - Why is counting supports of candidates a problem?
       - The total number of candidates can be very large
       - One transaction may contain many candidates
     - For each subset s of t, check whether s is in Ck

  26. How to Count Supports of Candidates?
       for each transaction t in database
           find all candidates in Ck that are subsets of t and increment their count;
     - For each subset s of t, check whether s is in Ck:
       - Linear search
       - Prefix tree
       - Hash-tree (prefix tree with a hash function at interior nodes)
       - Hash-table
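A minimal sketch of the hash-table option: keep Ck in a hash set, enumerate each transaction's k-subsets, and look each one up. The candidate list C2 below is assumed purely for illustration. Note that for long transactions the number of k-subsets explodes, which is what the prefix-tree and hash-tree variants are designed to avoid.

    from itertools import combinations
    from collections import defaultdict

    def count_supports(transactions, Ck, k):
        counts = defaultdict(int)
        candidates = set(Ck)                       # hash table of candidate k-itemsets
        for t in transactions:
            for s in combinations(sorted(t), k):   # every k-subset of the transaction
                if s in candidates:
                    counts[s] += 1
        return dict(counts)

    transactions = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"},
                    {"B","E","F"}, {"B","C","D","E","F"}]
    C2 = [("A","B"), ("A","D"), ("B","D"), ("D","E")]   # assumed candidate 2-itemsets
    print(count_supports(transactions, C2, 2))
    # {('A', 'B'): 1, ('A', 'D'): 3, ('B', 'D'): 2, ('D', 'E'): 2}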

  27. Example: Hash-Tree
     Hash function on an item: 1, 4, 7 hash to the first branch; 2, 5, 8 to the second; 3, 6, 9 to the third.
     Transaction: 2 3 5 6 7
     [Figure: a hash tree whose leaves store the candidate 3-itemsets; the transaction's items are hashed level by level to locate the leaves whose candidates must be checked against the transaction.]
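A compact Python sketch of the hash-tree idea, using the slide's mod-3 hash function. The leaf-capacity MAX_LEAF, the split policy, and the exact candidate list C3 are assumptions (the slide's figure is only partially recoverable). Candidates are routed into leaves by hashing successive items; to count a transaction, each remaining item is hashed at every interior node, so only a few leaves are checked for subset containment.

    # Hash function from the slide: 1,4,7 -> branch 0; 2,5,8 -> branch 1; 3,6,9 -> branch 2.
    MAX_LEAF = 3   # assumed leaf capacity before splitting

    def bucket(item):
        return (item - 1) % 3

    class Node:
        def __init__(self):
            self.children = {}   # branch -> Node (interior nodes)
            self.itemsets = []   # candidate itemsets (leaf nodes)
            self.leaf = True

    def insert(node, itemset, depth=0):
        if node.leaf:
            node.itemsets.append(itemset)
            if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
                node.leaf = False                  # split: push candidates one level down
                for s in node.itemsets:
                    insert(node.children.setdefault(bucket(s[depth]), Node()), s, depth + 1)
                node.itemsets = []
        else:
            insert(node.children.setdefault(bucket(itemset[depth]), Node()), itemset, depth + 1)

    def matches(node, t, start, found):
        """Collect candidates in the tree that are subsets of the sorted transaction t."""
        if node.leaf:
            tset = set(t)
            found.update(s for s in node.itemsets if set(s) <= tset)
            return
        for i in range(start, len(t)):             # hash each remaining transaction item
            child = node.children.get(bucket(t[i]))
            if child is not None:
                matches(child, t, i + 1, found)

    # Illustrative candidate 3-itemsets (assumed, in the spirit of the slide's figure).
    C3 = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (1, 3, 6),
          (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
    root = Node()
    for c in C3:
        insert(root, c)

    found = set()
    matches(root, (2, 3, 5, 6, 7), 0, found)
    print(sorted(found))   # [(3, 5, 6), (3, 5, 7), (3, 6, 7), (5, 6, 7)]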
