Mining Frequent Itemsets in a Stream

Toon Calders (Eindhoven University of Technology), Nele Dexters (University of Antwerp), Bart Goethals (University of Antwerp)

Abstract

We study the problem of finding frequent itemsets in a continuous stream of transactions. The current frequency of an itemset in a stream is defined as its maximal frequency over all possible windows in the stream from any point in the past until the current state that satisfy a minimal length constraint. Properties of this new measure are studied and an incremental algorithm that allows, at any time, to immediately produce the current frequencies of all frequent itemsets is proposed. Experimental and theoretical analysis show that the space requirements for the algorithm are extremely small for many realistic data distributions.

1. Introduction

Mining frequent sets over streams of itemsets presents interesting new challenges over traditional mining in static databases. Due to the speed of new arriving data, it is assumed that the history of the stream can not be revisited, unless it is stored. Storing large parts of a stream, however, is impossible as the amount of data is typically huge.

Most previous work on mining frequently occurring itemsets over data streams either focusses on (1) the sliding window model, (2) the time-fading model, or (3) the landmark model. Each of these models requires a fixed window length or decay factor, given by the user. In many applications, however, choosing such parameters that are most appropriate for every itemset at every timepoint in an evolving stream is almost impossible. For example, consider a large retail chain of which sales can be considered as a stream. Then, in order to find frequent sets to do market basket analysis, it is very difficult to choose in which period of the collected data you are interested. For many products, the amount of them sold depends highly on the period of the year. In summer time, e.g., sales of ice cream increase and during the soccer world cup, sales of beer increase. Such seasonal behavior of a specific item or combination of items can only be discovered when choosing the correct window size for that item(set). This size, however, can hide a similar behavior of other item(set)s in another window.

Therefore, we propose to consider for each itemset the window in which it has the highest frequency. More specifically, we define the current frequency of an itemset as the maximum over all windows from the past until the current state that satisfy a minimal size constraint. Notice that this is an extension of the max-frequency measure defined before for items [1]. Hence, when the stream evolves, the length of the window containing the highest frequency for a given itemset can change continuously. This new stream measure turns out to be very suitable to early detect sudden bursts of occurrences of itemsets, while still taking into account the history of the itemset. This behavior might be particularly useful in applications where hot topics, or popular combinations of topics need to be tracked. Examples of such applications include, e.g., identifying stocks with a strong growth or tracking popular search terms on the internet. In these applications it is of vital importance to identify sudden bursts quickly, while still taking into account the history.

Concretely, our contributions are the following. First, (1) the max-frequency measure [1] is extended to itemsets and minimal window length, and (2) a detailed study of its behavior is performed, taking into account minimal window length and minimal frequency thresholds, resulting in several important properties. (3) An efficient algorithm for computing the exact frequencies for all frequent itemsets at any time is proposed; this in contrast to the often only approximate algorithms for other methods. Finally, (4) a theoretical and empirical evaluation of our proposed method is given.

The organization of the paper is as follows. In Section 2, the new measure is defined and the central problem statement is formally introduced. Section 3 gives several properties of the max-frequency and states the main theorem, on which the incremental algorithm in Section 4 is based. In Section 5, a theoretical analysis for the worst case is done. Experimental results in Section 6 show that the memory requirements for the algorithm are extremely small for many real-life data distributions. In Section 7, the relation between our measure and existing related work is explored, and Section 8 concludes the paper.
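As an informal illustration of the measure described in this introduction (the maximum frequency of an itemset over all windows that end at the current position and have at least a minimal length), the following brute-force sketch may help. It is not the incremental algorithm of the paper: it rescans the whole stored stream, which is exactly what the paper sets out to avoid. The function names `count` and `mfreq`, the parameter name `mwl`, and the representation of itemsets as strings of single-character items are assumptions made for this example only.

```python
from fractions import Fraction


def count(itemset, stream):
    """Number of transactions in `stream` containing every item of `itemset`."""
    target = set(itemset)
    return sum(1 for transaction in stream if target <= set(transaction))


def mfreq(itemset, stream, mwl):
    """Brute-force max-frequency over windows ending at the end of `stream`.

    Returns (frequency, start), where `start` is the 1-based index at which
    the longest window reaching the maximum begins. For a stream shorter
    than `mwl` the result is (0, None).
    """
    n = len(stream)
    if n < mwl:
        return Fraction(0), None
    target = set(itemset)
    best, start, hits = Fraction(0), None, 0
    # Grow the window backwards from the most recent transaction.
    for k in range(1, n + 1):
        if target <= set(stream[n - k]):
            hits += 1
        if k >= mwl:
            f = Fraction(hits, k)
            if f >= best:  # ">=" keeps the longest window on ties
                best, start = f, n - k + 1
    return best, start


print(count("af", ["ab", "c", "adf"]))      # 1
print(mfreq("a", list("abaaab"), mwl=3))    # (Fraction(3, 4), 3)
print(mfreq("a", list("bcdabcda"), mwl=3))  # (Fraction(2, 5), 4)
```

Exact rational arithmetic via `Fraction` is used so that ties between windows are detected reliably; a floating-point comparison could misjudge equal frequencies.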
2. Problem Statement

2.1. Streams and Max-Frequency

A stream $\langle I_1 I_2 \ldots I_n \rangle$ is a sequence of itemsets, denoted $S$, where $n = |S|$ is the length of the stream. $I_1$ is considered the first and oldest itemset in the stream, and $I_n$ the latest and most recent. We assume that the items in the stream come from a finite set of items $\mathcal{I}$. The number of sets in a stream $S$ that contain itemset $I$ is denoted $count(I, S)$. For example, $count(a, \langle ab\ c\ adf \rangle) = 2$ and $count(af, \langle ab\ c\ adf \rangle) = 1$. The frequency of $I$ in $S$ is defined as

$$freq(I, S) := \frac{count(I, S)}{|S|}.$$

For example, $freq(a, \langle ab\ c\ adf \rangle) = 2/3$ and $freq(af, \langle ab\ c\ adf \rangle) = 1/3$.

Let $S_1$ be $\langle I^1_1 \ldots I^1_{n_1} \rangle$, $S_2$ be $\langle I^2_1 \ldots I^2_{n_2} \rangle$, ..., and $S_m$ be $\langle I^m_1 \ldots I^m_{n_m} \rangle$. The concatenation of the streams $S_1, \ldots, S_m$, denoted $S_1 \cdot S_2 \cdot \ldots \cdot S_m$, is

$$\langle I^1_1 \ldots I^1_{n_1}\ I^2_1 \ldots I^2_{n_2}\ \ldots\ I^m_1 \ldots I^m_{n_m} \rangle.$$

Let $S = \langle I_1 I_2 \ldots I_n \rangle$. Then, $S[s, t]$ denotes the sub-stream or window $\langle I_s I_{s+1} \ldots I_t \rangle$. The sub-stream of $S$ consisting of the last $k$ items of $S$, denoted $last(k, S)$, is

$$last(k, S) := S\bigl[\,|S| - k + 1,\ |S|\,\bigr].$$

We are now ready to define our new frequency measure:

Definition 1 Given a minimal window size $mwl$, the max-frequency $mfreq_{mwl}(I, S)$ of itemset $I$ in a stream $S$ is defined as the maximum of the frequencies of $I$ over all windows, of size at least $mwl$, extending from the end of the stream; that is:

$$mfreq_{mwl}(I, S) := \max_{k = mwl, \ldots, |S|} \bigl( freq(I, last(k, S)) \bigr).$$

If the length of the stream is less than $mwl$, the max-frequency is defined to be 0.

The longest window in which the maximum frequency is reached is called the maximal window for $I$ in $S$, and its starting point is denoted $startmax_{mwl}(I, S)$. That is, $startmax_{mwl}(I, S)$ is the smallest index such that

$$mfreq_{mwl}(I, S) = freq\bigl(I,\ S[\,startmax_{mwl}(I, S),\ |S|\,]\bigr).$$

The subscript $mwl$ will be omitted when clear from the context.

Example 1 Let $mwl = 3$.

$mfreq_{mwl}(a, \langle a\ b\ a\ a\ a\ b \rangle) = 3/4$.

$mfreq_{mwl}(a, \langle b\ c\ d\ a\ b\ c\ d\ a \rangle) = 2/5$.

[Figure 1. Max-frequency for minimal window lengths 1, 3, and 10. Plot legend: mwl=3, mwl=5, mwl=10; horizontal axis: the items of the stream.]

In the definition of the max-frequency, an explicit lower bound is given on the size of the windows in which the frequencies are considered. This lower bound is given to relieve the undesirable effect of having a frequency of 100% in a window of length 1, every time the target item arrives in the stream. The effect of the minimal window length $mwl$ is illustrated in Figure 1. It is clear that for longer minimal window lengths, there are still jumps in the frequency, but they are less pronounced. Hence, setting an appropriate minimal window length effectively resolves the instability of the max-frequency measure.

2.2. Evolving Streams

So far, a stream was defined as a static object. In reality, however, a stream is an evolving object that is essentially unbounded. When processing a stream, it is to be assumed that only a small part of it can be kept in memory.

$S_t$ will denote the stream $S$ up to timestamp $t$; that is, the part of the stream that already passed at time $t$, $S_t = S[1, t]$. For simplicity, we assume that the first itemset arrives at timestamp 1, and since then, at every timestamp a new itemset is inserted into the stream.

The main problem we study in this paper is the following: Given a minimal frequency threshold and a minimal window length, for an evolving stream $S$, maintain a small summary of the stream in time, such that, at any timepoint $t$, all current frequent itemsets can be produced instantly from this summary. More formally, we will introduce a concise summary, $summary(S_t)$, and efficient procedures $\mathit{Update}$ and $\mathit{Get\_mfreq}$, such that $\mathit{Update}(summary(S_t), I)$ equals $summary(S_t \cdot \langle I \rangle)$, and $\mathit{Get\_mfreq}(summary(S_{t+1}))$ equals $mfreq_{mwl}(A, S_{t+1})$.

Because $\mathit{Update}$ has to be executed every time a new itemset arrives, it has to be extremely efficient in order to be finished before the next itemset arrives. Similarly, because the stream continuously grows, the summary must be independent of the number of items seen so far, or, at least grow