Frequent Pattern Mining

How Many Words Is a Picture Worth?
E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1)

Burnt or Burned?
E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Transaction Data
• Alphabet: a set of items
  – Example: all products sold in a store
• A transaction: a set of items involved in an activity
  – Example: the items purchased by a customer in a visit
• Other information is often associated
  – Timestamp, price, salesperson, customer-id, store-id, …

Examples of Transaction Data

How to Store Transaction Data?
• Transaction-id
• Relational storage
    Tid    Item
    t123   a
    t123   b
    t123   c
    …      …
    t236   b
    t236   d
• Transaction-based storage
    (t123, a, b, c)
    (t236, b, d)
• Item-based (vertical) storage
  – Item a: …, t123, …
  – Item b: …, t123, …, t236, …
  – …
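The three layouts can be sketched in Python as follows. This is only an illustration built from the two example transactions on the slide; the variable names `relational`, `by_transaction`, and `by_item` are assumptions, not part of the course material.

```python
from collections import defaultdict

# Relational storage: one (tid, item) row per purchased item.
relational = [
    ("t123", "a"), ("t123", "b"), ("t123", "c"),
    ("t236", "b"), ("t236", "d"),
]

# Transaction-based (horizontal) storage: tid -> set of items.
by_transaction = defaultdict(set)
for tid, item in relational:
    by_transaction[tid].add(item)

# Item-based (vertical) storage: item -> set of tids containing it.
by_item = defaultdict(set)
for tid, item in relational:
    by_item[item].add(tid)

print(dict(by_transaction))
print(dict(by_item))
```

The vertical layout makes the support of a single item the size of its tid set, while the horizontal layout is the natural input for a scan-based method.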

Transaction Data Analysis
• Transactions: customers' purchases of commodities
  – {bread, milk, cheese} if they are bought together
• Frequent patterns: product combinations that are frequently purchased together by customers
• Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]

Why Frequent Patterns?
• What products were often purchased together?
• What are the frequent subsequent purchases after buying an iPod?
• What kinds of genes are sensitive to this new drug?
• What keyword combinations are frequently associated with web pages about game evaluation?

Why Frequent Pattern Mining?
• Foundation for many data mining tasks
  – Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
• Broad applications
  – Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

Frequent Itemsets
• Itemset: a set of items
  – E.g., acm = {a, c, m}
• Support of an itemset: the number of transactions that contain it
  – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: finding all frequent patterns in a database
Transaction database TDB
    TID   Items bought
    100   f, a, c, d, g, i, m, p
    200   a, b, c, f, l, m, o
    300   b, f, h, j, o
    400   b, c, k, s, p
    500   a, f, c, e, l, p, m, n
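As a quick check of Sup(acm) = 3, the support count can be sketched as below. The helper name `support` and the dictionary layout are assumptions made for illustration; the data is the TDB from the slide.

```python
# Transaction database TDB from the slide: TID -> set of items.
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for transaction in db.values() if items <= transaction)

min_sup = 3
sup_acm = support("acm", TDB)
print(sup_acm, sup_acm >= min_sup)
```

Transactions 100, 200, and 500 contain a, c, and m, so acm meets the threshold.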

A Naïve Attempt
• Generate all possible itemsets, test their supports against the database
• How to hold a large number of itemsets in main memory?
  – 100 items → 2^100 − 1 possible itemsets
• How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
  – A transaction of length 20 needs to update the support of 2^20 − 1 = 1,048,575 itemsets
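The counts above follow from the fact that a set of n items has 2^n − 1 non-empty subsets. A small sketch, with the hypothetical helper `nonempty_subsets` used only to enumerate a tiny case:

```python
from itertools import combinations

def nonempty_subsets(items):
    """Yield every non-empty subset of items as a sorted tuple."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)

# Enumeration is only feasible for tiny alphabets.
small = list(nonempty_subsets("abc"))
print(len(small))           # 2^3 - 1 subsets of a 3-item set

# For realistic sizes we can only compute the counts, not enumerate.
n_itemsets_100 = 2**100 - 1     # itemsets over 100 items
n_updates_len20 = 2**20 - 1     # itemsets inside one 20-item transaction
print(n_itemsets_100)
print(n_updates_len20)
```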

Transactions in Real Applications
• A large department store often carries more than 100 thousand different kinds of items
  – Amazon.com carries more than 17,000 books relevant to data mining
• Walmart has more than 20 million transactions per day, AT&T produces more than 275 million calls per day
• Mining large transaction databases of many items is a real demand

How to Get an Efficient Method?
• Reducing the number of itemsets that need to be checked
• Checking the supports of selected itemsets efficiently

Candidate Generation & Test
• Any subset of a frequent itemset must also be frequent – an anti-monotonic property
  – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  – {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
• In other words, any superset of an infrequent itemset must also be infrequent
  – No superset of any infrequent itemset should be generated or tested
  – Many item combinations can be pruned!
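The anti-monotonic property can be observed on a toy example. The baskets below are hypothetical, made up purely for illustration; only the item names come from the slide.

```python
# Hypothetical toy baskets (not from the slides).
baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "nuts"},
    {"diaper", "milk"},
]

def support(itemset, db):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in db if itemset <= basket)

subset_sup = support({"beer", "diaper"}, baskets)
superset_sup = support({"beer", "diaper", "nuts"}, baskets)

# Every basket containing the superset also contains the subset,
# so the superset's support can never exceed the subset's.
print(subset_sup, superset_sup)
```

With min_sup = 2, {beer, diaper} is frequent but {beer, diaper, nuts} is not, and no superset of the latter needs to be counted.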

Apriori-Based Mining
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB

The Apriori Algorithm [AgSr94]
Database D (Min_sup = 2)
    TID   Items
    10    a, c, d
    20    b, c, e
    30    a, b, c, e
    40    b, e
Scan D → 1-candidates
    Itemset   Sup
    a         2
    b         3
    c         3
    d         1
    e         3
Freq 1-itemsets
    Itemset   Sup
    a         2
    b         3
    c         3
    e         3
2-candidates: ab, ac, ae, bc, be, ce
Scan D → counting
    Itemset   Sup
    ab        1
    ac        2
    ae        1
    bc        2
    be        3
    ce        2
Freq 2-itemsets
    Itemset   Sup
    ac        2
    bc        2
    be        3
    ce        2
3-candidates: bce
Scan D → Freq 3-itemsets
    Itemset   Sup
    bce       2

The Apriori Algorithm
Level-wise, candidate generation and test
• C_k: candidate itemsets of size k
• L_k: frequent itemsets of size k
• L_1 = {frequent items};
• for (k = 1; L_k != ∅; k++) do
  – Candidate generation: C_{k+1} = candidates generated from L_k;
  – Test: for each transaction t in database do increment the count of all candidates in C_{k+1} that are contained in t
  – L_{k+1} = candidates in C_{k+1} with min_support
• return ∪_k L_k;
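A minimal Python sketch of the level-wise loop, run on the four-transaction database D from the earlier example. It is not the original [AgSr94] implementation: for brevity, candidates are formed by uniting pairs of frequent k-itemsets rather than by the ordered prefix join, and supports are counted by a plain scan instead of a hash-tree. The function name `apriori` is an assumption.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L_1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 1
    while level:
        # Candidate generation: unite pairs of frequent k-itemsets that
        # differ in one item, then prune any candidate that has an
        # infrequent k-subset (anti-monotonic property).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)

        # Test: one scan of the database counts every candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(D, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), sup)
```

On D with min_sup = 2 the sketch yields the same frequent itemsets as the worked example: a, b, c, e, ac, bc, be, ce, and bce.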

Important Steps in Apriori
• How to find frequent 1- and 2-itemsets?
• How to generate candidates?
  – Step 1: self-joining L_k
  – Step 2: pruning
• How to count supports of candidates?

Finding Frequent 1- & 2-itemsets
• Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
  – Initialize c[item] = 0 for each item
  – For each transaction T, for each item in T, c[item]++;
  – If c[item] >= min_sup, item is frequent
• Finding frequent 2-itemsets using a 2-dimensional triangular matrix
  – For items i, j (i < j), c[i, j] is the count of itemset ij
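Both counting passes can be sketched as below, assuming items are encoded as integers 0..n−1 (here a=0, b=1, c=2, d=3, e=4 for the database D of the Apriori example). The function name and the nested-list layout of the triangular matrix are illustrative choices, not taken from the slides.

```python
from itertools import combinations

def count_items_and_pairs(transactions, n_items):
    """Count single items and item pairs in one pass over the data."""
    c1 = [0] * n_items
    # Triangular matrix: row i holds counts for pairs (i, j) with j > i,
    # stored at offset j - i - 1, so no space is wasted on j <= i.
    c2 = [[0] * (n_items - i - 1) for i in range(n_items)]
    for t in transactions:
        items = sorted(set(t))
        for i in items:
            c1[i] += 1
        for i, j in combinations(items, 2):   # yields pairs with i < j
            c2[i][j - i - 1] += 1
    return c1, c2

# D encoded with a=0, b=1, c=2, d=3, e=4.
D = [[0, 2, 3], [1, 2, 4], [0, 1, 2, 4], [1, 4]]
c1, c2 = count_items_and_pairs(D, n_items=5)
min_sup = 2
frequent_items = [i for i, c in enumerate(c1) if c >= min_sup]
print(c1, frequent_items)
print(c2)
```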

Counting Array
• A 2-dimensional triangular matrix can be implemented using a 1-dimensional array
• There are n items; for items i, j (i < j), c[i, j] = c[(i-1)(2n-i)/2 + j - i]
• Example (n = 5): c[3, 5] = c[(3-1)*(2*5-3)/2 + 5 - 3] = c[9]
Triangular matrix (each entry is the position in the 1-dimensional array)
          1    2    3    4    5
    1          1    2    3    4
    2               5    6    7
    3                    8    9
    4                         10
    5
1-dimensional array positions: 1 2 3 4 5 6 7 8 9 10
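The index formula can be checked with a few lines of Python. The function name `tri_index` is hypothetical; positions are 1-based, as on the slide.

```python
def tri_index(i, j, n):
    """1-based position of pair (i, j), with 1 <= i < j <= n, in the flat array."""
    if not 1 <= i < j <= n:
        raise ValueError("requires 1 <= i < j <= n")
    # (i - 1) * (2n - i) is always even, so integer division is exact.
    return (i - 1) * (2 * n - i) // 2 + j - i

n = 5
positions = [tri_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)]
print(positions)
print(tri_index(3, 5, n))
```

Enumerating the pairs row by row visits every position from 1 to n(n−1)/2 exactly once, so the flat array has no gaps and no collisions.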

Example of Candidate-generation
• L_3 = {abc, abd, acd, ace, bcd}
• Self-joining: L_3 * L_3
  – abcd ← abc * abd
  – acde ← acd * ace
• Pruning:
  – acde is removed because ade is not in L_3
• C_4 = {abcd}

How to Generate Candidates?
• Suppose the items in each itemset of L_{k-1} are listed in a fixed order
• Step 1: self-join L_{k-1}
    INSERT INTO C_k
    SELECT p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    FROM L_{k-1} p, L_{k-1} q
    WHERE p.item_1 = q.item_1 AND … AND p.item_{k-2} = q.item_{k-2} AND p.item_{k-1} < q.item_{k-1}
• Step 2: pruning
  – For each itemset c in C_k do
    • For each (k-1)-subset s of c do if (s is not in L_{k-1}) then delete c from C_k
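The join and prune steps can be sketched in Python, representing each itemset as a sorted tuple so that "first k−2 items equal, last item smaller" is a simple prefix comparison. The function name `generate_candidates` is an assumption; the input is L_3 from the candidate-generation example.

```python
from itertools import combinations

def generate_candidates(prev_level):
    """Build C_k from L_{k-1}, given as a set of sorted tuples."""
    prev = sorted(prev_level)
    if not prev:
        return set()
    size = len(prev[0])               # this is k - 1

    # Step 1: self-join on a shared (k-2)-prefix.
    joined = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))

    # Step 2: prune candidates that have a (k-1)-subset outside L_{k-1}.
    return {
        c for c in joined
        if all(s in prev_level for s in combinations(c, size))
    }

L3 = {tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
C4 = generate_candidates(L3)
print(["".join(c) for c in sorted(C4)])
```

The join produces abcd and acde; acde is pruned because its subset ade is not in L_3, leaving C_4 = {abcd}.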

How to Count Supports?
• Why is counting supports of candidates a problem?
  – The total number of candidates can be huge
  – One transaction may contain many candidates
• Method
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction

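Example: Counting Supports
[Figure: hash-tree of candidate 3-itemsets. Interior nodes hash items into three branches (1,4,7 / 2,5,8 / 3,6,9). The subset function is applied to transaction 1 2 3 5 6, expanding it as 1 + 2 3 5 6, then 1 2 + 3 5 6 and 1 3 + 5 6. Candidates stored in the leaves: 2 3 4, 5 6 7, 3 6 7, 1 4 5, 3 5 6, 3 4 5, 1 3 6, 3 6 8, 3 5 7, 6 8 9, 1 2 4, 1 2 5, 1 5 9, 4 5 7, 4 5 8]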
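The effect of the subset function can be illustrated with a deliberately simplified sketch. It does not build a hash-tree: it swaps in a flat Python dictionary and enumerates the k-subsets of each transaction, which finds the same contained candidates but without the tree's pruning of the search. The function name `count_supports` is an assumption; the candidates and the transaction are those of the example.

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the example, as sorted tuples.
candidates = [
    (2, 3, 4), (5, 6, 7), (3, 6, 7), (1, 4, 5), (3, 5, 6),
    (3, 4, 5), (1, 3, 6), (3, 6, 8), (3, 5, 7), (6, 8, 9),
    (1, 2, 4), (1, 2, 5), (1, 5, 9), (4, 5, 7), (4, 5, 8),
]

def count_supports(transactions, candidates, k):
    """Count each candidate k-itemset; a dict stands in for the hash-tree."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Enumerate the k-subsets of the transaction and look each one up.
        for subset in combinations(sorted(set(t)), k):
            if subset in counts:
                counts[subset] += 1
    return counts

counts = count_supports([[1, 2, 3, 5, 6]], candidates, k=3)
matched = sorted(c for c, n in counts.items() if n > 0)
print(matched)
```

For transaction 1 2 3 5 6 the contained candidates are 1 2 5, 1 3 6, and 3 5 6. A hash-tree reaches the same leaves while visiting only the branches that the transaction's items hash to.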

Association Rules
• Rule c → am
• Support: 3 (i.e., the support of acm)
• Confidence: 75% (i.e., sup(acm) / sup(c))
• Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds
Transaction database TDB
    TID   Items bought
    100   f, a, c, d, g, i, m, p
    200   a, b, c, f, l, m, o
    300   b, f, h, j, o
    400   b, c, k, s, p
    500   a, f, c, e, l, p, m, n
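The support and confidence of the rule c → am can be recomputed from TDB with a short sketch. The helper names `support` and `confidence` are assumptions made for illustration.

```python
# Transaction database TDB from the slide.
TDB = [
    set("facdgimp"),
    set("abcflmo"),
    set("bfhjo"),
    set("bcksp"),
    set("afcelpmn"),
]

def support(itemset, db):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in db if itemset <= t)

def confidence(lhs, rhs, db):
    """Confidence of the rule lhs -> rhs: sup(lhs ∪ rhs) / sup(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

rule_support = support({"c"} | {"a", "m"}, TDB)
rule_confidence = confidence({"c"}, {"a", "m"}, TDB)
print(rule_support, rule_confidence)
```

Item c appears in transactions 100, 200, 400, and 500, and three of those also contain a and m, giving 3 / 4 = 75%.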

To-Do List
• Read Sections 6.1, 6.2.1 and 6.2.2 in the textbook
• Understand the concept of frequent itemsets and association rules
• Understand algorithm Apriori
• Figure out how to use Weka to mine frequent itemsets
