Mining Sequential Patterns
Authors: Rakesh Agrawal and Ramakrishnan Srikant
Presenter: Yunping Wang
November 18, 2004

Outline
• Background
• Introduction
• Problem Decomposition and Solution
• Algorithm
• Performance
• Discussions and Conclusion
• Correlation Literature

Background
• Sequential pattern mining was first introduced in 1995.
• A sequential pattern is an ordered list of itemsets.
• Sequential pattern mining applications: shopping history, weblog mining, DNA sequence modeling, disease treatment, natural disasters, etc.

Introduction
• What is sequential pattern mining?
• Key components of sequential pattern mining
• Sequential pattern example
• Sequence database example

Introduction: What is Sequential Pattern Mining?
• Definition: Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minimum support threshold, sequential pattern mining finds all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than the minimum support.

Introduction: Definition
• Itemset $i = (i_1 i_2 \ldots i_m)$, where each $i_j$ is an item.
• Sequence $s = \langle s_1 s_2 \ldots s_n \rangle$, where each $s_j$ is an itemset.
• A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is contained in $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
• A sequence s is maximal if it is not contained in any other sequence.

Introduction: Definition (continued)
• Support of a sequence: the percentage of customers who support the sequence.
• For mining association rules, support was the percentage of transactions.
• Sequences whose support is at least minsup are large sequences.

Introduction: Key Components
• Key components of sequential pattern mining: frequent time-ordered sequential patterns in the database.
• Two conditions: minimum support and maximal sequence.
• Association rule: intra-transaction.
• Sequential rule: inter-transaction.
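
The containment and support definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `is_contained`, `support`, `seq` and `CUSTOMERS` are made up for this example, itemsets are modelled as frozensets of integers, and the five customer sequences are the ones used in the sequence database example of these slides.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[int]


def is_contained(a: Sequence[Itemset], b: Sequence[Itemset]) -> bool:
    """Return True if sequence a is contained in sequence b."""
    pos = 0
    for element in a:
        # Advance to the earliest remaining element of b that is a superset.
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next element of a must match a strictly later element of b
    return True


def support(pattern: Sequence[Itemset], customers: List[Sequence[Itemset]]) -> float:
    """Fraction of customers whose sequence contains the pattern."""
    hits = sum(1 for customer in customers if is_contained(pattern, customer))
    return hits / len(customers)


def seq(*elements) -> List[Itemset]:
    """Hypothetical helper: build a sequence from plain Python sets."""
    return [frozenset(e) for e in elements]


# Customer sequences from the example database (CID 1 to 5).
CUSTOMERS = [
    seq({30}, {90}),
    seq({10, 20}, {30}, {40, 60, 70}),
    seq({30, 50, 70}),
    seq({30}, {40, 70}, {90}),
    seq({90}),
]
```

With this data, `support(seq({30}, {90}), CUSTOMERS)` counts customers 1 and 4. Note that `seq({30}, {70})` is not contained in customer 3's sequence, because 30 and 70 occur there in the same transaction rather than in two successive ones.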

Sequence Database Examples

Customer ID   TransactionTime   Items
1             1                 30
1             2                 90
2             1                 10, 20
2             2                 30
2             3                 40, 60, 70
3             1                 30, 50, 70
4             1                 30
4             2                 40, 70
4             3                 90
5             1                 90

[Figure: one timeline per customer (CID=1 to CID=5) showing the transactions in the table]

• MinSupport = 40%, i.e. 2 customers
• Answer: <(30) (90)> (CID 1, 4) and <(30) (40 70)> (CID 2, 4)
• Not answer: <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)>, <(40 70)>. Why? Each of them has minimum support, but each is contained in one of the two answer sequences, so none is maximal.

Sequence Pattern Examples
• Example 1: 60% of customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi".
• Note: these rentals need not be consecutive.
• Example 2: 60% of customers buy "fitted sheet and flat sheet and pillow", followed by "comforter", followed by "drapes and ruffles".
• Note: elements of a sequential pattern need not be simple items.

Solution: Sort Phase (1)
• Sort the database with:
  - Customer ID as the major key
  - Transaction time as the minor key
• This converts the original transaction database into a database of customer sequences.

Solution: Sort Phase (2)
• CID: major key, TID: secondary key
• The transaction table from the Sequence Database Examples slide is already in this order.

[Figure: the sorted table next to the per-customer timelines (CID=1 to CID=5)]
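
A minimal sketch of the sort phase, assuming the transactions arrive as `(customer_id, transaction_time, items)` tuples in arbitrary order. The names `TRANSACTIONS` and `sort_phase` are illustrative and not from the slides; the rows are the ten transactions of the example table, deliberately shuffled.

```python
from itertools import groupby
from operator import itemgetter

# (customer_id, transaction_time, items), deliberately out of order.
TRANSACTIONS = [
    (4, 2, {40, 70}),
    (1, 2, {90}),
    (2, 3, {40, 60, 70}),
    (5, 1, {90}),
    (2, 1, {10, 20}),
    (1, 1, {30}),
    (4, 3, {90}),
    (3, 1, {30, 50, 70}),
    (2, 2, {30}),
    (4, 1, {30}),
]


def sort_phase(transactions):
    """Group transactions into one time-ordered sequence per customer."""
    # Customer ID is the major key, transaction time the minor key.
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        cid: [frozenset(items) for _, _, items in group]
        for cid, group in groupby(ordered, key=itemgetter(0))
    }


CUSTOMER_SEQUENCES = sort_phase(TRANSACTIONS)
```

Sorting on the pair `(customer_id, transaction_time)` is what makes the single `groupby` pass sufficient: all rows of a customer are adjacent and already in time order.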

Solution: Litemset Phase (1)
• Litemset phase: find all large itemsets.
• Why? Because each itemset in a large sequence has to be a large itemset.

Solution: Litemset Phase (2)
• To get all large itemsets we can use the Apriori algorithms.
• The support counting needs to be modified.
• For sequential patterns, support is measured by the fraction of customers.

Solution: Litemset Phase (3)
• Example: find the large itemsets in the transaction table from the Sequence Database Examples slide.
• Litemset result: {30} {40} {70} {40 70} {90}
• Difference from Apriori: the support count should be incremented only once per customer.

Solution: Transform Phase (1)
• Each large itemset is then mapped to a set of contiguous integers.
• Why? So that two large itemsets can be compared for equality in constant time.

Litemsets
Itemset    Map
{30}       1
{40}       2
{70}       3
{40 70}    4
{90}       5
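
The once-per-customer support count and the integer mapping described above can be sketched as follows. This is a brute-force illustration, not Apriori: it enumerates every subset of every transaction, which is only reasonable for a toy database. The names `large_itemsets`, `LARGE` and `LITEMSET_MAP` are assumptions for this example, and the sort key is chosen only so that the numbering comes out the same as in the slide's map.

```python
from itertools import combinations

# Customer sequences produced by the sort phase (CID -> list of transactions).
CUSTOMER_SEQUENCES = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def large_itemsets(customers, min_customers):
    """Return itemsets (as sorted tuples) supported by enough customers."""
    counts = {}
    for transactions in customers.values():
        # Collect every itemset this customer supports in any transaction.
        seen = set()
        for transaction in transactions:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        # Increment each itemset only once per customer.
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    large = [s for s, n in counts.items() if n >= min_customers]
    # Ordering picked to reproduce the slide's numbering (an assumption).
    return sorted(large, key=lambda s: (max(s), len(s), s))


LARGE = large_itemsets(CUSTOMER_SEQUENCES, min_customers=2)
LITEMSET_MAP = {itemset: i for i, itemset in enumerate(LARGE, start=1)}
```

The per-customer `seen` set is the whole difference from transaction-based counting: customer 4 buys 30 once and customer 2 buys it once, but a customer who bought 30 in three transactions would still add only one to its count.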

Solution: Transform Phase (2)
• We need to repeatedly determine which of a given set of large sequences are contained in a customer sequence.
• Represent transactions as sets of large itemsets.
• A customer sequence now becomes a list of sets of itemsets.

Solution: Transform Phase (3)
[Figure: customer timelines (CID=1 to CID=5) before and after each transaction is rewritten as the large itemsets it contains, shown with the itemset map]

Solution: Transform Phase (4)
[Figure: the same timelines with each large itemset replaced by its integer from the itemset map]

Solution: Transform Phase (5)
• Itemset map: {30} → 1, {40} → 2, {70} → 3, {40 70} → 4, {90} → 5
• Transformed database:

CID   Transformed customer sequence
1     <{1} {5}>
2     <{1} {2 3 4}>
3     <{1 3}>
4     <{1} {2 3 4} {5}>
5     <{5}>
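
The transformation shown in the table can be sketched as below, assuming the itemset map from the slides is available as a dictionary. The names `LITEMSET_MAP`, `CUSTOMER_SEQUENCES` and `transform` are illustrative. A transaction that contains no large itemset, such as (10, 20) for CID 2, contributes nothing to the transformed sequence, which is consistent with the table.

```python
# Itemset map from the slides: large itemset -> integer.
LITEMSET_MAP = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# Customer sequences after the sort phase (CID -> list of transactions).
CUSTOMER_SEQUENCES = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def transform(customers, litemset_map):
    """Rewrite each transaction as the set of large-itemset ids it contains."""
    transformed = {}
    for cid, transactions in customers.items():
        sequence = []
        for transaction in transactions:
            ids = {
                number
                for itemset, number in litemset_map.items()
                if itemset <= transaction
            }
            if ids:  # skip transactions that contain no large itemset
                sequence.append(ids)
        transformed[cid] = sequence
    return transformed


TRANSFORMED = transform(CUSTOMER_SEQUENCES, LITEMSET_MAP)
```

After this step a containment test only has to check whether an integer is a member of a set, instead of testing one itemset against another.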

Solution: Sequence Phase (1)
• Use the set of large itemsets to find the desired sequences.
• Similar in structure to the Apriori algorithms used to find large itemsets:
  - Use a seed set to generate candidate sequences.
  - Count the support for each candidate.
  - Eliminate candidate sequences that are not large.

Solution: Sequence Phase (2)
Two types of algorithms:
• Count-all: counts all large sequences, including non-maximal sequences.
  - AprioriAll
• Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first.
  - AprioriSome
  - DynamicSome

Solution: Maximal Phase (1)
• Find the maximal sequences among the set of large sequences.
• Delete every sequence that is a subsequence of a longer large sequence. With S the set of all large sequences and n the length of the longest one:

    for (k = n; k > 1; k--) do
        for each k-sequence s_k do
            delete from S all subsequences of s_k

Solution: Maximal Phase (2)
• Maximal phase example: the large sequences are <1 2 3 4>, <1 2 3>, <1 2 4>, <1 3 4>, <1 3 5> and <2 3 4>.
• <1 2 3>, <1 2 4>, <1 3 4> and <2 3 4> are subsequences of <1 2 3 4> and need to be deleted from the final result.
• <1 3 5> is not a subsequence of <1 2 3 4> (it contains 5), so it is not deleted on this account.
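
The maximal phase loop can be sketched as follows, assuming each large sequence is a tuple of litemset ids as in the example above. The names `is_subsequence` and `maximal_phase` are illustrative. One limitation to keep in mind: the sketch compares ids for equality only, so it would not notice that one large itemset is a subset of another; handling that case would require mapping the ids back to itemsets first.

```python
def is_subsequence(small, big):
    """True if the ids of `small` appear in `big` in the same order."""
    remaining = iter(big)
    # `x in remaining` consumes the iterator up to and including the match.
    return all(x in remaining for x in small)


def maximal_phase(large_sequences):
    """Drop every sequence that is a subsequence of a longer large sequence."""
    result = set(large_sequences)
    if not result:
        return result
    longest = max(len(s) for s in result)
    for k in range(longest, 1, -1):
        # Snapshot the surviving k-sequences before deleting shorter ones.
        for s_k in [s for s in result if len(s) == k]:
            result -= {
                t for t in result
                if len(t) < k and is_subsequence(t, s_k)
            }
    return result


LARGE_SEQUENCES = {
    (1, 2, 3, 4),
    (1, 2, 3),
    (1, 2, 4),
    (1, 3, 4),
    (1, 3, 5),
    (2, 3, 4),
}
MAXIMAL = maximal_phase(LARGE_SEQUENCES)
```

Working from the longest sequences down means a sequence is only ever compared against longer survivors, so no k-sequence can delete another k-sequence.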

Algorithm
• AprioriAll
• AprioriSome
• DynamicSome

Algorithm: AprioriAll (1)
• C_k: set of candidate sequences of size k
• L_k: set of frequent (large) sequences of size k

    L_1 = {large 1-sequences};   // result of the litemset phase
    for (k = 2; L_{k-1} != ∅; k++) do begin
        C_k = candidates generated from L_{k-1};
        for each customer-sequence c in the database do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end
    Answer = maximal sequences in ∪_k L_k;

Algorithm: AprioriAll (2)
Highlights:
• Candidate generation is similar to candidate generation when finding large itemsets.
• The order matters!

Algorithm: AprioriAll (3)
• Candidate generation, join step: C_k is generated by joining L_{k-1} with itself.

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2}

• For example: joining <1 2 3> with <1 2 4> yields both <1 2 3 4> and <1 2 4 3>.
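
The join step and the AprioriAll loop above can be sketched in Python on the transformed database, with sequences represented as tuples of litemset ids. Several things here are assumptions rather than part of the slides: the names `join_candidates`, `contains`, `apriori_all` and `DATABASE`; the choice not to pair a sequence with itself (made to match the slide's example, although the join condition as written does not state it); and the absence of any pruning of candidates before counting.

```python
def join_candidates(large_prev):
    """Join L_{k-1} with itself on the first k-2 litemsets."""
    return {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        # Assumption: a sequence is not joined with itself.
        if p != q and p[:-1] == q[:-1]
    }


def contains(customer, candidate):
    """True if the candidate's ids occur in order in distinct transactions."""
    pos = 0
    for litemset_id in candidate:
        while pos < len(customer) and litemset_id not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def apriori_all(database, min_customers):
    """Return [L_1, L_2, ...] as a list of sets of id tuples."""
    def is_large(candidate):
        # Each customer sequence counts at most once.
        return sum(contains(c, candidate) for c in database) >= min_customers

    ids = {i for customer in database for transaction in customer for i in transaction}
    level = {(i,) for i in ids if is_large((i,))}
    levels = []
    while level:
        levels.append(level)
        level = {c for c in join_candidates(level) if is_large(c)}
    return levels


# Transformed database from the Transform Phase slides (CID 1 to 5).
DATABASE = [
    [{1}, {5}],
    [{1}, {2, 3, 4}],
    [{1, 3}],
    [{1}, {2, 3, 4}, {5}],
    [{5}],
]
LEVELS = apriori_all(DATABASE, min_customers=2)
```

Because order matters, each pair of sequences that agree on their first k-2 litemsets produces two candidates, one for each ordering of the two final litemsets. The maximal sequences would then be extracted from the union of the returned levels in a separate maximal phase.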