Mining Frequent Itemsets in a Stream

Toon Calders, Eindhoven University of Technology
Nele Dexters, University of Antwerp
Bart Goethals, University of Antwerp

Abstract

We study the problem of finding frequent itemsets in a continuous stream of transactions. The current frequency of an itemset in a stream is defined as its maximal frequency over all possible windows in the stream from any point in the past until the current state that satisfy a minimal length constraint. Properties of this new measure are studied, and an incremental algorithm is proposed that can, at any time, immediately produce the current frequencies of all frequent itemsets. Experimental and theoretical analysis shows that the space requirements of the algorithm are extremely small for many realistic data distributions.

1. Introduction

Mining frequent sets over streams of itemsets presents interesting new challenges over traditional mining in static databases. Because new data arrives at high speed, it is assumed that the history of the stream cannot be revisited unless it is stored. Storing large parts of a stream, however, is impossible, as the amount of data is typically huge.

Most previous work on mining frequently occurring itemsets over data streams focuses on either (1) the sliding window model, (2) the time-fading model, or (3) the landmark model. Each of these models requires a fixed window length or decay factor, given by the user. In many applications, however, choosing parameters that are appropriate for every itemset at every timepoint in an evolving stream is almost impossible. For example, consider a large retail chain whose sales can be considered as a stream. To find frequent sets for market basket analysis, it is very difficult to choose which period of the collected data is of interest. For many products, the amount sold depends highly on the period of the year. In summer, e.g., sales of ice cream increase, and during the soccer world cup, sales of beer increase. Such seasonal behavior of a specific item or combination of items can only be discovered when choosing the correct window size for that item(set). This size, however, can hide a similar behavior of other item(set)s in another window.

Therefore, we propose to consider for each itemset the window in which it has the highest frequency. More specifically, we define the current frequency of an itemset as the maximum over all windows from the past until the current state that satisfy a minimal size constraint. Notice that this is an extension of the max-frequency measure defined before for items [1]. Hence, when the stream evolves, the length of the window containing the highest frequency for a given itemset can change continuously. This new stream measure turns out to be very suitable for the early detection of sudden bursts of occurrences of itemsets, while still taking into account the history of the itemset. This behavior might be particularly useful in applications where hot topics, or popular combinations of topics, need to be tracked. Examples of such applications include identifying stocks with strong growth or tracking popular search terms on the internet. In these applications it is of vital importance to identify sudden bursts quickly, while still taking into account the history.

Concretely, our contributions are the following. First, (1) the max-frequency measure [1] is extended to itemsets and minimal window length, and (2) a detailed study of its behavior is performed, taking into account minimal window length and minimal frequency thresholds, resulting in several important properties. (3) An efficient algorithm for computing the exact frequencies of all frequent itemsets at any time is proposed; this is in contrast to the often only approximate algorithms for other methods. Finally, (4) a theoretical and empirical evaluation of our proposed method is given.

The organization of the paper is as follows. In Section 2, the new measure is defined and the central problem statement is formally introduced. Section 3 gives several properties of the max-frequency and states the main theorem, on which the incremental algorithm in Section 4 is based. In Section 5, a theoretical analysis of the worst case is given. Experimental results in Section 6 show that the memory requirements of the algorithm are extremely small for many real-life data distributions. In Section 7, the relation between our measure and existing related work is explored, and Section 8 concludes the paper.

2. Problem Statement

2.1. Streams and Max-Frequency

A stream $\langle I_1 I_2 \ldots I_n \rangle$ is a sequence of itemsets, denoted $S$, where $n = |S|$ is the length of the stream. $I_1$ is considered the first and oldest itemset in the stream, and $I_n$ the latest and most recent. We assume that the items in the stream come from a finite set of items $\mathcal{I}$.

The number of sets in a stream $S$ that contain itemset $I$ is denoted $count(I, S)$. For example, $count(a, \langle ab\ c\ adf \rangle) = 2$ and $count(af, \langle ab\ c\ adf \rangle) = 1$. The frequency of $I$ in $S$ is defined as

\[ freq(I, S) := \frac{count(I, S)}{|S|}. \]

For example, $freq(a, \langle ab\ c\ adf \rangle) = 2/3$ and $freq(af, \langle ab\ c\ adf \rangle) = 1/3$.

Let $S_1$ be $\langle I^1_1 \ldots I^1_{n_1} \rangle$, $S_2$ be $\langle I^2_1 \ldots I^2_{n_2} \rangle$, ..., and $S_m$ be $\langle I^m_1 \ldots I^m_{n_m} \rangle$. The concatenation of the streams $S_1, \ldots, S_m$, denoted $S_1 \cdot S_2 \cdot \ldots \cdot S_m$, is

\[ \langle I^1_1 \ldots I^1_{n_1}\ I^2_1 \ldots I^2_{n_2}\ \ldots\ I^m_1 \ldots I^m_{n_m} \rangle. \]

Let $S = \langle I_1 I_2 \ldots I_n \rangle$. Then $S[s, t]$ denotes the sub-stream or window $\langle I_s I_{s+1} \ldots I_t \rangle$. The sub-stream of $S$ consisting of the last $k$ itemsets of $S$, denoted $last(k, S)$, is

\[ last(k, S) := S\big[\,|S| - k + 1,\ |S|\,\big]. \]

We are now ready to define our new frequency measure:

Definition 1 Given a minimal window size $mwl$, the max-frequency $mfreq_{mwl}(I, S)$ of itemset $I$ in a stream $S$ is defined as the maximum of the frequencies of $I$ over all windows of size at least $mwl$ extending from the end of the stream; that is:

\[ mfreq_{mwl}(I, S) := \max_{k = mwl, \ldots, |S|} \big( freq(I, last(k, S)) \big). \]

If the length of the stream is less than $mwl$, the max-frequency is defined to be 0.

The longest window in which the maximum frequency is reached is called the maximal window for $I$ in $S$, and its starting point is denoted $startmax_{mwl}(I, S)$. That is, $startmax_{mwl}(I, S)$ is the smallest index such that

\[ mfreq_{mwl}(I, S) = freq\big(I,\ S[\,startmax_{mwl}(I, S),\ |S|\,]\big). \]

The subscript $mwl$ will be omitted when clear from the context.

Example 1 Let $mwl = 3$.

\[ mfreq_{mwl}(a, \langle a\ b\ a\ a\ a\ b \rangle) = 3/4. \]
\[ mfreq_{mwl}(a, \langle b\ c\ d\ a\ b\ c\ d\ a \rangle) = 2/5. \]

[Figure 1. Max-frequency for minimal window lengths 1, 3, and 10. Plot legend: mwl=3, mwl=5, mwl=10.]

In the definition of the max-frequency, an explicit lower bound is given on the size of the windows in which the frequencies are considered. This lower bound counters the undesirable effect of obtaining a frequency of 100% in a window of length 1 every time the target item arrives in the stream. The effect of the minimal window length $mwl$ is illustrated in Figure 1. For longer minimal window lengths there are still jumps in the frequency, but they are less pronounced. Hence, setting an appropriate minimal window length effectively resolves the instability of the max-frequency measure.

2.2. Evolving Streams

So far, a stream was defined as a static object. In reality, however, a stream is an evolving object that is essentially unbounded. When processing a stream, it must be assumed that only a small part of it can be kept in memory.

$S_t$ will denote the stream $S$ up to timestamp $t$; that is, the part of the stream that has already passed at time $t$, $S_t = S[1, t]$. For simplicity, we assume that the first itemset arrives at timestamp 1 and that, from then on, a new itemset is inserted into the stream at every timestamp.

The main problem we study in this paper is the following: Given a minimal frequency threshold and a minimal window length, for an evolving stream $S$, maintain a small summary of the stream over time, such that, at any timepoint $t$, all current frequent itemsets can be produced instantly from this summary. More formally, we will introduce a concise summary, $summary(S_t)$, and efficient procedures Update and Get_mfreq, such that Update($summary(S_t)$, $I$) equals $summary(S_t \cdot \langle I \rangle)$, and Get_mfreq($summary(S_{t+1})$) equals $mfreq_{mwl}(A, S_{t+1})$.

Because Update has to be executed every time a new itemset arrives, it has to be extremely efficient in order to be finished before the next itemset arrives. Similarly, because the stream continuously grows, the summary must be independent of the number of items seen so far, or, at least grow
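The definitions of Section 2.1 can be transcribed directly into code to check small examples by hand. The following is only a naive sketch under stated assumptions, not the incremental algorithm of this paper: it keeps the whole stream in memory and rescans every window. The Python function names mirror the notation of the text, but the representation (a stream as a list of strings, each string an itemset of single-character items), the requirement that mwl is at least 1, and the choice to return None from startmax when the stream is shorter than mwl are assumptions made for illustration.

```python
from fractions import Fraction


def count(itemset, stream):
    """Number of itemsets in the stream that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for transaction in stream if target <= frozenset(transaction))


def freq(itemset, stream):
    """count(I, S) / |S| as an exact fraction (the stream must be non-empty)."""
    return Fraction(count(itemset, stream), len(stream))


def last(k, stream):
    """Window consisting of the last k itemsets of the stream."""
    return stream[len(stream) - k:]


def mfreq(itemset, stream, mwl):
    """Max-frequency: best frequency over all suffix windows of size >= mwl."""
    if mwl < 1:
        raise ValueError("mwl must be at least 1 (assumption of this sketch)")
    if len(stream) < mwl:
        return Fraction(0)  # defined to be 0 when the stream is too short
    return max(freq(itemset, last(k, stream))
               for k in range(mwl, len(stream) + 1))


def startmax(itemset, stream, mwl):
    """1-based start index of the longest window reaching the max-frequency.

    Returns None when the stream is shorter than mwl (an assumption; the
    text does not define startmax in that case).
    """
    if len(stream) < mwl:
        return None
    best = mfreq(itemset, stream, mwl)
    # Try the longest window first, so the smallest start index wins.
    for k in range(len(stream), mwl - 1, -1):
        if freq(itemset, last(k, stream)) == best:
            return len(stream) - k + 1


# The examples from the text.
print(freq("a", ["ab", "c", "adf"]))       # 2/3
print(mfreq("a", list("abaaab"), 3))       # 3/4
print(mfreq("a", list("bcdabcda"), 3))     # 2/5
print(startmax("a", list("bcdabcda"), 3))  # 4
```

Exact fractions are used so that results can be compared with the values in Example 1 without rounding. Each call to mfreq rescans every suffix window and requires the full stream, which is exactly the cost that the summary of Section 2.2 is meant to avoid.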
