Frequent Pattern Mining Overview


Data Mining Techniques: Frequent Patterns in Sets and Sequences
Mirek Riedewald
Some slides based on presentations by Han/Kamber and Tan/Steinbach/Kumar

Frequent Pattern Mining Overview
• Basic Concepts and Challenges
• Efficient and Scalable Methods for Frequent Itemsets and Association Rules
• Pattern Interestingness Measures
• Sequence Mining

What Is Frequent Pattern Analysis?
• Find patterns (itemsets, sequences, structures, etc.) that occur frequently in a data set
• First proposed for frequent itemset and association rule mining
• Motivation: find inherent regularities in data
– What products were often purchased together?
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to a new drug?
• Applications
– Market basket analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis

Association Rule Mining
• Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
• Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Example association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
• Implication means co-occurrence, not causality!

Definition: Frequent Itemset
• Itemset
– A collection of one or more items, e.g., {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2 in the transactions above
• Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule
• Association rule: an implication expression of the form X → Y, where X and Y are itemsets
– E.g., {Milk, Diaper} → {Beer}
• Rule evaluation metrics
– Support (s) = P(X ∪ Y)
• Estimated by the fraction of transactions that contain both X and Y
– Confidence (c) = P(Y | X)
• Estimated by the fraction of transactions that contain both X and Y among all transactions containing X
• Example: {Milk, Diaper} → {Beer}
– s = σ({Milk, Diaper, Beer}) / |D| = 2/5 = 0.4
– c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
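To make these metrics concrete, here is a minimal Python sketch (not from the slides; the names support_count and db are illustrative, and transactions are modeled as sets) that computes s and c for {Milk, Diaper} → {Beer} over the five transactions above:

    def support_count(itemset, db):
        """sigma(X): number of transactions that contain itemset X."""
        return sum(1 for t in db if itemset <= t)

    db = [{"Bread", "Milk"},
          {"Bread", "Diaper", "Beer", "Eggs"},
          {"Milk", "Diaper", "Beer", "Coke"},
          {"Bread", "Milk", "Diaper", "Beer"},
          {"Bread", "Milk", "Diaper", "Coke"}]

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = support_count(X | Y, db) / len(db)               # 2/5 = 0.4
    c = support_count(X | Y, db) / support_count(X, db)  # 2/3 ~ 0.67
    print(s, c)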

Association Rule Mining Task
• Given a transaction database DB, find all rules having support ≥ minsup and confidence ≥ minconf
• Brute-force approach:
– List all possible association rules
– Compute support and confidence for each rule
– Remove rules that fail the minsup or minconf thresholds
– Computationally prohibitive!

Mining Association Rules
• Example rules over the transactions above:
– {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
– {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
– {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
– {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
– {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
– {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
• Observations:
– All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements

Mining Association Rules
• Two-step approach:
1. Frequent itemset generation: generate all itemsets that have support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of the frequent itemset
• Frequent itemset generation is still computationally expensive

Frequent Itemset Generation
• [Figure: itemset lattice over items A–E, from the empty set (null) at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE at the bottom]
• Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database, matching each of the N transactions (of maximum width w) against each of the M candidates
– Complexity ~ O(N·M·w), which is expensive since M = 2^d

Computational Complexity
• Given d unique items, the total number of itemsets is 2^d
• Total number of possible association rules (choose k items for the antecedent, then a non-empty consequent from the remaining d−k items):

R = Σ_{k=1}^{d-1} [ C(d,k) · Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d − 2^{d+1} + 1

• If d = 6, R = 602 possible rules
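As a quick check of the rule-count formula, the following sketch (function names are illustrative) evaluates both the summation and the closed form and confirms R = 602 for d = 6:

    from math import comb

    def num_rules_sum(d):
        """The double summation: pick k antecedent items, then a
        non-empty consequent from the remaining d-k items."""
        return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                   for k in range(1, d))

    def num_rules_closed(d):
        """Equivalent closed form: 3^d - 2^(d+1) + 1."""
        return 3**d - 2**(d + 1) + 1

    assert num_rules_sum(6) == num_rules_closed(6) == 602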

Frequent Pattern Mining Overview
• Basic Concepts and Challenges
• Efficient and Scalable Methods for Frequent Itemsets and Association Rules
• Pattern Interestingness Measures
• Sequence Mining

Reducing the Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support

Illustrating the Apriori Principle
• [Figure: itemset lattice in which {A, B} is found to be infrequent, so all of its supersets are pruned from the search space]

Illustrating the Apriori Principle
• Minimum support count = 3
• Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

• Pairs (2-itemsets); no need to generate candidates involving the infrequent items Coke or Eggs:

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

• Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  3

• If every subset were considered: 6C1 + 6C2 + 6C3 = 41 candidates; with support-based pruning: 6 + 6 + 1 = 13

Apriori Algorithm
• Generate L1 = frequent itemsets of length k = 1
• Repeat until no new frequent itemsets are found:
– Generate C_{k+1}, the length-(k+1) candidate itemsets, from L_k
– Prune candidate itemsets in C_{k+1} that contain a subset of length k not in L_k (and hence infrequent)
– Count the support of each remaining candidate by scanning the DB; eliminate infrequent ones from C_{k+1}
– L_{k+1} = C_{k+1}; k = k + 1

Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining L_k
– Step 2: pruning
• Example of candidate generation for L3 = { {a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d} }
– Self-joining L3 yields {a,b,c,d} from {a,b,c} and {a,b,d}, and {a,c,d,e} from {a,c,d} and {a,c,e}
– Pruning removes {a,c,d,e} because {a,d,e} is not in L3
– C4 = { {a,b,c,d} }
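The following is a minimal Apriori sketch in Python, not the slides' exact formulation: the set-union join used here is a compact equivalent of the lexicographic self-join once pruning is applied, and all names are illustrative. Run on the five example transactions with minsup = 3, it produces exactly the frequent itemsets tabulated above:

    from collections import defaultdict
    from itertools import combinations

    def apriori(transactions, minsup):
        """transactions: list of item sets; minsup: absolute support-count threshold."""
        # L1: frequent 1-itemsets
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[frozenset([item])] += 1
        Lk = {s for s, n in counts.items() if n >= minsup}
        frequent = set(Lk)
        k = 1
        while Lk:
            # Step 1 (join): unions of two frequent k-itemsets sharing k-1 items
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # Step 2 (prune): drop candidates with an infrequent k-subset
            Ck = {c for c in Ck
                  if all(frozenset(s) in Lk for s in combinations(c, k))}
            # Count supports of the surviving candidates in one DB scan
            counts = defaultdict(int)
            for t in transactions:
                for c in Ck:
                    if c <= t:
                        counts[c] += 1
            Lk = {c for c, n in counts.items() if n >= minsup}
            frequent |= Lk
            k += 1
        return frequent

    db = [{"Bread", "Milk"},
          {"Bread", "Diaper", "Beer", "Eggs"},
          {"Milk", "Diaper", "Beer", "Coke"},
          {"Bread", "Milk", "Diaper", "Beer"},
          {"Bread", "Milk", "Diaper", "Coke"}]
    print(apriori(db, minsup=3))   # includes {Bread, Milk, Diaper}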

How to Generate Candidates?
• Step 1: self-joining L_{k-1} (in SQL-like notation):

insert into C_k
select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1 AND … AND p.item_{k-2} = q.item_{k-2} AND p.item_{k-1} < q.item_{k-1}

• Step 2: pruning:

forall itemsets c in C_k do
  forall (k-1)-subsets s of c do
    if (s is not in L_{k-1}) then delete c from C_k

How to Count Supports of Candidates?
• Why is counting the supports of candidates a problem?
– The total number of candidates can be very large
– One transaction may contain many candidates
• Method:
– Store the candidate itemsets in a hash tree
– A leaf node contains a list of itemsets
– An interior node contains a hash table
– A subset function finds all candidates contained in a given transaction

Generate Hash Tree
• Suppose we have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
• We need:
– A hash function; here items 1, 4, 7 hash to one bucket, 2, 5, 8 to a second, and 3, 6, 9 to a third
– A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node)
• [Figure: the resulting hash tree over the 15 candidates]

Subset Operation Using Hash Tree
• [Figure: to find the candidates contained in transaction {1 2 3 5 6}, the tree is traversed recursively; at the root the transaction is split into 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}, each prefix item is hashed to choose a child (e.g., 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6} at the next level), and the candidates in each reached leaf are matched against the transaction]
• In this example the transaction is matched against only 9 of the 15 candidates
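Below is a small Python sketch of this structure, under stated assumptions: class and function names are illustrative, the hash function is item % 3 to mirror the slide's buckets, and matches are collected in a set, a simplification that sidesteps double counting when several traversal paths reach the same leaf:

    NUM_BUCKETS = 3   # item % 3 mirrors the buckets 1,4,7 / 2,5,8 / 3,6,9
    MAX_LEAF = 3      # split a leaf once it holds more than this many itemsets

    class Node:
        def __init__(self):
            self.children = {}   # bucket -> child node (interior nodes)
            self.itemsets = []   # candidate itemsets (leaf nodes)
            self.is_leaf = True

    def insert(node, itemset, depth=0):
        """Insert a sorted candidate itemset, splitting overfull leaves."""
        # Descend interior nodes by hashing the item at the current depth
        while not node.is_leaf:
            node = node.children.setdefault(itemset[depth] % NUM_BUCKETS, Node())
            depth += 1
        node.itemsets.append(itemset)
        # Split the leaf, as long as there are items left to hash on
        if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
            pending, node.itemsets, node.is_leaf = node.itemsets, [], False
            for s in pending:
                insert(node, s, depth)

    def find_contained(node, tx, start=0, found=None):
        """Collect candidates in the tree that are subsets of the sorted transaction tx."""
        if found is None:
            found = set()
        if node.is_leaf:
            found.update(s for s in node.itemsets if set(s) <= set(tx))
        else:
            # Hash each remaining transaction item and recurse into that bucket
            for i in range(start, len(tx)):
                child = node.children.get(tx[i] % NUM_BUCKETS)
                if child is not None:
                    find_contained(child, tx, i + 1, found)
        return found

    root = Node()
    for c in [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]:
        insert(root, c)
    print(find_contained(root, (1, 2, 3, 5, 6)))  # {(1,2,5), (1,3,6), (3,5,6)}

The final subset check in each leaf keeps the result exact; the hash tree only narrows down which of the 15 candidates need to be checked at all.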
