Mining Frequent Itemsets in a Stream
Toon Calders, TU/e (joint work with Bart Goethals and Nele Dexters, UAntwerpen)

Outline
• Motivation
• Max-Frequency
• Algorithm
  • for one itemset
  • mining all Frequent Itemsets
• Experiments
• Conclusion

Motivation
• Model:
  • Every timestamp an itemset arrives
• Goal:
  • Find sets of items that frequently occur together
  • Take into account history,
  • Yet, recognize sudden bursts quickly

Motivation
• Most definitions of frequency rely heavily on the correct parameter settings
  • Sliding window length
  • Decay factor
  • …
• Correct parameter setting is hard
  • Can be different for different items (not to mention sets!)


Max-Frequency
Therefore, a new frequency measure:

  $$\mathrm{mfreq}(I, S) := \max_{k = 1, \ldots, |S|} \mathrm{freq}(I, \mathrm{last}(k, S))$$

Frequency is measured in the window where it is maximal.
Itemset gets the benefit of the doubt …

Example: mfreq(a, ac bc ab ac ab bc)

  k   last(k, S)           freq(a, last(k, S))
  1   bc                   0
  2   ab bc                1/2
  3   ac ab bc             2/3
  4   ab ac ab bc          3/4
  5   bc ab ac ab bc       3/5
  6   ac bc ab ac ab bc    4/6

The maximum is reached for k = 4, so mfreq(a, S) = 3/4.
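As a minimal sketch of the definition above (not the summary-based algorithm developed later in the talk), the max-frequency can be computed by brute force over every suffix of the stream. The function names `freq` and `mfreq` mirror the slide's notation; representing itemsets as Python sets and the stream as a list is an assumption made for illustration.

```python
from fractions import Fraction


def freq(target, window):
    """Fraction of the itemsets in `window` that contain every item of `target`."""
    hits = sum(1 for itemset in window if target <= itemset)
    return Fraction(hits, len(window))


def mfreq(target, stream):
    """Max-frequency: the best frequency over all suffixes last(k, stream)."""
    return max(freq(target, stream[-k:]) for k in range(1, len(stream) + 1))


# Example stream from the slide; each itemset is written as a string of items.
stream = [set(s) for s in "ac bc ab ac ab bc".split()]
target = {"a"}

for k in range(1, len(stream) + 1):
    print(k, freq(target, stream[-k:]))  # note: 4/6 is printed in lowest terms, 2/3

print(mfreq(target, stream))  # 3/4
```

This recomputes every window at every timestamp, which is exactly the cost the border summary is designed to avoid.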

Properties of Max-Freq
+ Detects sudden bursts
+ Takes into account the past
- When target itemset arrives: sudden jump to a frequency of 1
  + Solution: minimal window length
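The effect of a minimal window length can be illustrated with a small brute-force sketch. The name `mfreq_min_window` and the parameter `min_len` are hypothetical, and falling back to the whole stream when it is shorter than `min_len` is an assumption; the slides do not say how that case is handled.

```python
from fractions import Fraction


def mfreq_min_window(target, stream, min_len=1):
    """Max-frequency restricted to windows of at least `min_len` itemsets."""
    n = len(stream)
    if n == 0:
        return Fraction(0)
    # Assumption: if the stream is shorter than min_len, use the whole stream.
    shortest = min(min_len, n)
    best = Fraction(0)
    for k in range(shortest, n + 1):
        hits = sum(1 for itemset in stream[-k:] if target <= itemset)
        best = max(best, Fraction(hits, k))
    return best


# A single target after a run of non-targets.
stream = [set(s) for s in "bc bc bc bc a".split()]
print(mfreq_min_window({"a"}, stream, min_len=1))  # 1   (sudden jump)
print(mfreq_min_window({"a"}, stream, min_len=3))  # 1/3 (jump is damped)
```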

Algorithm
1. How to do it for one itemset?
2. How to do it for a frequent itemset?
3. How to do it for all frequent itemsets?

Maintain a summary of the stream that allows the frequencies to be found immediately.

Properties (one itemset)
Checking all possible windows to find the maximal one: infeasible
BUT: not every point needs to be checked
↓
Only some special points = the borders

  stream:     a a a b b b a b b a b a b a b a b b b b | a a b a b b a
  timestamp:  1, 21, 27   (the borders)
  # targets:  8, 3, 1     (targets in the block that starts at each border)

How to find a border?
• Target set a
• Is the marked position a border?

  a b a c bc a c bc a bc a b

• The block just before the marked position has frequency 2/3; the block just after it has frequency 1/3.
• A window starting at the marked position only beats the longer window that also includes the block before it if its frequency is > 2/3.
• Its first block only reaches 1/3, so the rest of the window must have a frequency > 2/3.
• But then the rest of the window, taken on its own, has an even bigger frequency.
• NO: the marked position is not a border.

How to find the borders?
• This is true in general: let a_1 be the number of targets in a block of length l_1 that ends just before position p, and a_2 the number of targets in a block of length l_2 that starts at p.

  If $a_1 / l_1 \ge a_2 / l_2$, position p is never the border again!

Very powerful pruning criterion!
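The pruning criterion can be checked directly, if inefficiently, by testing every block before a position against every block starting at it. The sketch below assumes that a position is a border exactly when no such pair of blocks satisfies a_1 / l_1 ≥ a_2 / l_2; the function names `is_border` and `borders` are hypothetical. It uses the 27-itemset stream from the "Properties (one itemset)" slide, where only the target a and the non-target b matter.

```python
from fractions import Fraction


def is_border(hits, p):
    """hits[i] is 1 if the itemset at timestamp i + 1 contains the target.

    Position p (1-based) is pruned as soon as some block [j, p-1] before it
    has a frequency >= some block [p, k] starting at it.
    """
    n = len(hits)
    for j in range(1, p):
        a1, l1 = sum(hits[j - 1:p - 1]), p - j
        for k in range(p, n + 1):
            a2, l2 = sum(hits[p - 1:k]), k - p + 1
            if Fraction(a1, l1) >= Fraction(a2, l2):
                return False
    return True


def borders(hits):
    """All positions that survive the pruning criterion."""
    return [p for p in range(1, len(hits) + 1) if is_border(hits, p)]


stream = "a a a b b b a b b a b a b a b a b b b b a a b a b b a".split()
hits = [1 if item == "a" else 0 for item in stream]
print(borders(hits))  # [1, 21, 27]
```

The result matches the timestamps shown on the slide. This check is cubic in the stream length and is only meant to make the criterion concrete.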

The summary
• Summary only keeps counts for the borders.

  borders:    1, 6
  stream:     a b a c bc a c bc a bc a b
  # targets:  3, 2

• Frequencies always increasing
• Thus: max-frequency in last cell
• Block with largest frequency before border p_i = always block from p_{i-1}

Updating the Summary
• When a new itemset arrives, the summary is updated.
• Borders need to be checked again

  a b a c bc a c bc a bc a b T

  • no new « before » - blocks
  • only one new « after » - block
  • maximal block before: always previous border

Updating the Summary
• The new position is a border if and only if it contains the target itemset.

  A target arrives:
    borders:    1, 6, 9
    stream:     a b a c bc a c bc a bc a b a b
    # targets:  3, 2, 1

  A non-target (b) arrives:
    borders:    1, 6
    stream:     a b a c bc a c bc a bc a b b
    # targets:  3, 2 → 5

Summary: the Summary
• Only keep entries for borders
• Get Max-frequency = access last cell only
• Update summary:
  • if target: add new entry
  • if non-target: check borders
    • only one check required: still in ascending order?
    • most recent border always drops first
    • no need to check at every timestamp
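The update rules above can be sketched as a small class. This is an illustration under stated assumptions, not the authors' implementation: the class name `BorderSummary` and its fields are hypothetical, the check runs eagerly at every timestamp (the slides note that it can be deferred), and a `while` loop is used so that several borders can drop in one step. Ties are merged, following the ≥ in the pruning criterion, so a run of consecutive targets collapses into one block.

```python
from fractions import Fraction


class BorderSummary:
    """Summary for one target itemset: one [border, #targets] entry per block."""

    def __init__(self, target):
        self.target = frozenset(target)
        self.t = 0         # timestamp of the most recent itemset
        self.entries = []  # [border timestamp, targets in the block starting there]

    def _block_freq(self, i):
        """Frequency of block i; the last block runs up to the current timestamp."""
        start, count = self.entries[i]
        end = self.entries[i + 1][0] if i + 1 < len(self.entries) else self.t + 1
        return Fraction(count, end - start)

    def add(self, itemset):
        self.t += 1
        if self.target <= set(itemset):
            self.entries.append([self.t, 1])  # target: add a new entry
        elif not self.entries:
            return  # assumption: non-targets before the first target are ignored
        # Drop the most recent border while the block before it is at least
        # as frequent as the block after it (block frequencies must ascend).
        while len(self.entries) >= 2:
            k = len(self.entries)
            if self._block_freq(k - 2) < self._block_freq(k - 1):
                break
            _, count = self.entries.pop()
            self.entries[-1][1] += count

    def mfreq(self):
        """Max-frequency: the frequency of the last block only."""
        if not self.entries:
            return Fraction(0)
        return self._block_freq(len(self.entries) - 1)


summary = BorderSummary({"a"})
for item in "a a a b b b a b b a b a b a b a b b b b a a b a b b a".split():
    summary.add({item})

print(summary.entries)  # [[1, 8], [21, 3], [27, 1]]
print(summary.mfreq())  # 1
```

On the 27-itemset stream from the "Properties (one itemset)" slide, the entries agree with the borders 1, 21, 27 and the target counts 8, 3, 1 shown there.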

Mining Frequent Itemsets
• Only interested in itemsets that are frequent.
• We can throw away any border with a frequency lower than the minimal frequency.

  borders:    1, 6, 9
  stream:     a b a b a c bc a c bc a bc a b
  # targets:  3, 2, 1
  minfreq = 2/3

Mining All Frequent Itemsets
• We only need to maintain the summaries for the frequent itemsets
• Can still be a lot, though …
  • every subset of the most recent transaction …
  • minimal window length reduces this problem
• FUTURE WORK: reduce this number; rely, e.g., on approximate counts
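Dropping infrequent borders can be sketched as a filter over the summary entries. The function name `prune_infrequent` and the `(border, #targets)` tuple layout are assumptions made for illustration. The frequency of a border is taken to be the frequency of the window from that border up to the current timestamp `t`; because these frequencies increase towards the most recent border, the scan can stop at the first border that falls below `minfreq`.

```python
from fractions import Fraction


def prune_infrequent(entries, t, minfreq):
    """Drop borders whose window [border, t] has a frequency below minfreq.

    entries: list of (border timestamp, targets in the block starting there),
             oldest first, as kept in the summary.
    """
    kept = []
    targets_in_window = 0
    for border, count in reversed(entries):  # newest border first
        targets_in_window += count
        if Fraction(targets_in_window, t - border + 1) < minfreq:
            break  # every older border has an even lower window frequency
        kept.append((border, count))
    kept.reverse()
    return kept


# Summary of the 27-itemset example stream: windows have frequency 12/27, 4/7, 1.
entries = [(1, 8), (21, 3), (27, 1)]
print(prune_infrequent(entries, 27, Fraction(1, 2)))  # [(21, 3), (27, 1)]
print(prune_infrequent(entries, 27, Fraction(2, 3)))  # [(27, 1)]
```

If every border is dropped, the itemset is currently infrequent and its summary is empty.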

Experiments
• Size of the summaries
  • number of borders for random data
  • average, maximal number of borders in real-life data
• Theoretical worst case

Experiments
[Figures: Uniform Distribution; Twin Peaks distribution]

Conclusions
• New frequency measure
• Summary for one itemset
  • small
  • easy to maintain
  • only few updates
• Mining all frequent itemsets
  • only need summary for frequent itemsets
