PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Authors: Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu
Presenter: Wojciech Stach

Outline
- Mining Sequential Patterns
  - Problem statement
  - Definitions & examples
  - Strategies
- PrefixSpan algorithm
  - Motivation
  - Definitions & examples
  - Algorithm
  - Example
  - Performance study
  - Conclusions

Sequential Pattern Mining
- Given
  - a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items
  - a user-specified min_support threshold

  id   Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

  <a(abc)(ac)d(cf)> - 5 elements, 9 items
  <a(abc)(ac)d(cf)> - 9-sequence
  <a(abc)(ac)d(cf)> = <a(cba)(ac)d(cf)>
  <a(abc)(ac)d(cf)> ≠ <a(ac)(abc)d(cf)>

Sequential Pattern Mining (2)
- Find all the frequent subsequences, i.e. the subsequences whose occurrence frequency in the set of sequences is no less than min_support
- min_support = 2
- Solution: 53 frequent subsequences

  <a> <aa> <ab> <a(bc)> <a(bc)a> <aba> <abc> <(ab)> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <ac> <aca> <acb> <acc> <ad> <adc> <af>
  <b> <ba> <bc> <(bc)> <(bc)a> <bd> <bdc> <bf>
  <c> <ca> <cb> <cc>
  <d> <db> <dc> <dcb>
  <e> <ea> <eab> <eac> <eacb> <eb> <ebc> <ec> <ecb> <ef> <efb> <efc> <efcb>
  <f> <fb> <fbc> <fc> <fcb>
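The sequence notation above can be made concrete with a small sketch. It assumes a representation that is not from the slides: a sequence is a Python list of elements, and each element is a frozenset of single-character items. The helper name parse is hypothetical.

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of frozensets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)              # closing bracket of this element
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))      # a bare item is a one-item element
            i += 1
    return seq

s = parse("a(abc)(ac)d(cf)")
n_elements = len(s)                              # number of elements
n_items = sum(len(e) for e in s)                 # number of item occurrences

# Item order inside an element is irrelevant, element order is not.
same = parse("a(abc)(ac)d(cf)") == parse("a(cba)(ac)d(cf)")
different = parse("a(abc)(ac)d(cf)") != parse("a(ac)(abc)d(cf)")
print(n_elements, n_items, same, different)      # 5 9 True True
```

Using sets for elements and a list for the sequence mirrors the two comparisons on the slide directly.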
Subsequence vs. super sequence
- Given two sequences α = <a_1 a_2 … a_n> and β = <b_1 b_2 … b_m>
- α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j_1 < j_2 < … < j_n ≤ m such that a_1 ⊆ b_{j_1}, a_2 ⊆ b_{j_2}, …, a_n ⊆ b_{j_n}
- β is a super sequence of α

  β = <a(abc)(ac)d(cf)>

  Subsequences of β:
  α1 = <aa(ac)d(c)>
  α2 = <(ac)(ac)d(cf)>
  α3 = <ac>

  Not subsequences of β:
  α4 = <df(cf)>
  α5 = <(cf)d>
  α6 = <(abc)dcf>

Sequence Support Count
- A sequence database is a set of tuples <sid, s>
- A tuple <sid, s> is said to contain a sequence α if α is a subsequence of s, i.e., α ⊆ s
- The support of a sequence α is the number of tuples containing α

  id   Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

  α1 = <a>       support(α1) = 4
  α2 = <ac>      support(α2) = 4
  α3 = <(ab)c>   support(α3) = 2

Strategies
- Apriori-property based
  - AprioriSome (1995)
  - AprioriAll (1995)
  - DynamicSome (1995)
  - GSP (1996)
- Regular expression constraints
  - SPIRIT (1999)
- Data projection based
  - FreeSpan (2000)
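The subsequence and support definitions can be checked with a short sketch. It assumes sequences are lists of frozensets. The names parse, is_subsequence, support and DB are illustrative and not part of the slides. Matching each element at the earliest possible position is enough to decide containment.

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of frozensets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq

def is_subsequence(alpha, beta):
    """True if every element of alpha is a subset of a distinct, later element of beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1                      # skip elements of beta that cannot host a
        if j == len(beta):
            return False                # ran out of elements in beta
        j += 1                          # next element of alpha must match strictly later
    return True

def support(alpha, db):
    """Number of tuples <sid, s> in db whose sequence s contains alpha."""
    return sum(is_subsequence(alpha, s) for s in db.values())

DB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(support(parse("a"), DB), support(parse("ac"), DB), support(parse("(ab)c"), DB))  # 4 4 2
print(is_subsequence(parse("aa(ac)d(c)"), DB[10]))   # True
print(is_subsequence(parse("(cf)d"), DB[10]))        # False
```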
Motivation and Background
- Shortcomings of Apriori-like approaches
  - Potentially huge set of candidate sequences
  - Multiple scans of databases
  - Difficulties at mining long sequential patterns
- FreeSpan (Frequent pattern-projected Sequential pattern mining): a pattern growth method
  - The general idea is to use frequent items to recursively project sequence databases into a set of smaller projected databases and grow subsequence fragments in each projected database
- PrefixSpan (Prefix-projected Sequential pattern mining)
  - Fewer projections and quickly shrinking sequences

Prefix
- Given two sequences α = <a_1 a_2 … a_n> and β = <b_1 b_2 … b_m>, m ≤ n
- Sequence β is called a prefix of α if and only if:
  - b_i = a_i for i ≤ m-1;
  - b_m ⊆ a_m;
  - all the items in (a_m - b_m) are alphabetically after those in b_m

  α = <a(abc)(ac)d(cf)>
  β = <a(abc)a> is a prefix of α
  β = <a(abc)c> is not a prefix of α (the remaining item a comes before c)

Projection
- Given sequences α and β, such that β is a subsequence of α
- A subsequence α' of sequence α is called a projection of α w.r.t. prefix β if and only if
  - α' has prefix β;
  - there exists no proper super-sequence α'' of α' such that α'' is a subsequence of α and also has prefix β

  α  = <a(abc)(ac)d(cf)>
  β  = <(bc)a>
  α' = <(bc)(ac)d(cf)>

Postfix
- Let α' = <a_1 a_2 … a_n> be the projection of α w.r.t. prefix β = <a_1 a_2 … a_{m-1} a'_m> (m ≤ n)
- Sequence γ = <a''_m a_{m+1} … a_n> is called the postfix of α w.r.t. prefix β, denoted as γ = α / β, where a''_m = (a_m - a'_m)
- We also denote α = β⋅γ

  α' = <a(abc)(ac)d(cf)>
  β  = <a(abc)a>
  γ  = <(_c)d(cf)>
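The prefix and postfix definitions can be exercised with a minimal sketch. It assumes sequences are lists of frozensets of single-character items. The names parse, is_prefix, postfix and show are hypothetical helpers. The sketch covers only the prefix test and the postfix of a sequence that already has the given prefix, and does not compute projections for an arbitrary subsequence β.

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of frozensets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq

def is_prefix(beta, alpha):
    """Check the three prefix conditions from the Prefix slide."""
    m = len(beta)
    if not 1 <= m <= len(alpha):
        return False
    if beta[:m - 1] != alpha[:m - 1]:            # b_i = a_i for i <= m-1
        return False
    b_m, a_m = beta[-1], alpha[m - 1]
    if not b_m <= a_m:                           # b_m is a subset of a_m
        return False
    return all(x > max(b_m) for x in a_m - b_m)  # leftover items come after b_m

def postfix(alpha, beta):
    """Postfix of alpha w.r.t. prefix beta as (leftover of a_m, remaining elements)."""
    m = len(beta)
    return alpha[m - 1] - beta[-1], alpha[m:]

def show(leftover, tail):
    """Render a postfix in slide notation, e.g. <(_c)d(cf)>."""
    def elem(e):
        s = "".join(sorted(e))
        return s if len(s) == 1 else "(" + s + ")"
    head = "(_" + "".join(sorted(leftover)) + ")" if leftover else ""
    return "<" + head + "".join(elem(e) for e in tail) + ">"

alpha = parse("a(abc)(ac)d(cf)")
print(is_prefix(parse("a(abc)a"), alpha))        # True
print(is_prefix(parse("a(abc)c"), alpha))        # False
print(show(*postfix(alpha, parse("a(abc)a"))))   # <(_c)d(cf)>
```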
PrefixSpan - Algorithm
- Input: a sequence database S, and the minimum support threshold min_sup
- Output: the complete set of sequential patterns
- Method: call PrefixSpan(<>, 0, S)
- Subroutine PrefixSpan(α, l, S|α)
- Parameters:
  - α: a sequential pattern;
  - l: the length of α;
  - S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S.

PrefixSpan - Algorithm (2)
- Method
  1. Scan S|α once, find the set of frequent items b such that:
     a) b can be assembled to the last element of α to form a sequential pattern; or
     b) <b> can be appended to α to form a sequential pattern.
  2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
  3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').

PrefixSpan - Example

  id   Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

  min_support = 2

1. Find length-1 sequential patterns

   <a>  <b>  <c>  <d>  <e>  <f>  <g>
    4    4    4    3    3    3    1

2. Divide search space

   Prefix   Projected database
   <a>      <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
   <b>      <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
   <c>      <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
   <d>      <(cf)>, <c(bc)(ae)>, <(_f)cb>
   <e>      <(_f)(ab)(df)cb>, <(af)cbc>
   <f>      <(ab)(df)cb>, <cbc>

PrefixSpan - Example (2)
3. Find subsets of sequential patterns (shown for prefix <d>)

   <d>-projected database: <(cf)>, <c(bc)(ae)>, <(_f)cb>

   <a>  <b>  <c>  <d>  <e>  <(_e)>  <f>  <(_f)>
    1    2    3    0    1     0      1     1

   Frequent: <db>, <dc>

   <db>-projected database: <(_c)>
   <dc>-projected database: <(bc)>, <b>

   <b>  <c>
    2    1

   Frequent: <dcb>
   <dcb>-projected database: <>
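The recursion above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: sequences are lists of frozensets of single-character items, each projected database is kept as (sequence index, position) pointers into the original database rather than as copied postfixes, and the length parameter l is omitted. The names parse, fmt, prefixspan and grow are hypothetical.

```python
from collections import defaultdict

def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of frozensets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq

def fmt(pattern):
    """Render a pattern (tuple of sorted item tuples) in slide notation."""
    return "<" + "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in pattern) + ">"

def prefixspan(db, min_sup):
    """Return {pattern: support} for all patterns with support >= min_sup."""
    found = {}

    def grow(pattern, proj):
        # proj holds (sid, k): k is where the last element of pattern matched earliest.
        last = frozenset(pattern[-1]) if pattern else None
        top = pattern[-1][-1] if pattern else None
        assemble = defaultdict(list)   # step 1a: b joins the last element
        append = defaultdict(list)     # step 1b: <b> becomes a new element
        for sid, k in proj:
            seq = db[sid]
            seen_assemble, seen_append = set(), set()
            for j in range(max(k, 0), len(seq)):
                if last is not None and last <= seq[j]:
                    for b in seq[j] - seen_assemble:
                        if b > top:    # only items after the current last item
                            seen_assemble.add(b)
                            assemble[b].append((sid, j))
                if j > k:
                    for b in seq[j] - seen_append:
                        seen_append.add(b)
                        append[b].append((sid, j))
        # Steps 2 and 3: output each frequent extension and recurse on its projection.
        for b, sub in sorted(assemble.items()):
            if len(sub) >= min_sup:
                new = pattern[:-1] + (pattern[-1] + (b,),)
                found[fmt(new)] = len(sub)
                grow(new, sub)
        for b, sub in sorted(append.items()):
            if len(sub) >= min_sup:
                new = pattern + ((b,),)
                found[fmt(new)] = len(sub)
                grow(new, sub)

    grow((), [(sid, -1) for sid in range(len(db))])
    return found

DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]
patterns = prefixspan(DB, 2)
print(len(patterns), patterns["<dcb>"])   # 53 2
```

Keeping pointers instead of copied postfixes corresponds to the pseudo-projection idea, so the sketch trades the physical projected databases of the example for less copying. No candidate sequences are generated: every extension comes from items actually seen in the projected sequences.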
PrefixSpan - characteristics
- No candidate sequence needs to be generated by PrefixSpan
- Projected databases keep shrinking
- The major cost of PrefixSpan is the construction of projected databases
- How to reduce this cost? Different projection methods:
  - Bi-level projection
    - reduces the number and the size of projected databases
  - Pseudo-projection
    - reduces the cost of projection when the projected database can be held in main memory

Bi-level Projection

  id   Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

  min_support = 2

- Scan to get length-1 sequences
- Construct a triangular matrix (S-matrix) instead of projected databases for each length-1 pattern
- The matrix gives all length-2 sequential patterns

       a         b         c         d         e         f
  a    2
  b    (4,2,2)   1
  c    (4,2,1)   (3,3,2)   3
  d    (2,1,1)   (2,2,0)   (1,3,0)   0
  e    (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
  f    (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1

  Entry (c, a) = (4,2,1): Support(<ac>) = 4, Support(<ca>) = 2, Support(<(ac)>) = 1
  Entry (c, c) = 3: Support(<cc>) = 3

Bi-level projection (2)
- For each length-2 sequential pattern α, construct the α-projected database and find the frequent items
- Construct the corresponding S-matrix

  <ab>-projected database: <(_c)(ac)(cf)>, <(_c)a>, <c>

  a   b   c   (_c)   d   (_d)   e   (_e)   f   (_f)
  2   0   2    2     0    1     0    0     1    0

  Frequent: <aba>, <abc>, <a(bc)>

         a         c         (_c)
  a      0
  c      (1,0,1)   1
  (_c)   (φ,2,φ)   (φ,1,φ)   φ

  Resulting pattern: <a(bc)a>

Bi-level projection (3) - optimization
- "Do we need to include every item in a postfix in the projected databases?"
- No. Items can be pruned from projected databases by 3-way Apriori checking.
- If <ac> is not frequent, any super-sequence of it can never be a sequential pattern, so c can be excluded from the construction of the <ab>-projected database.
- If <a(bd)> is not frequent, then to construct the <a(bc)>-projected database, the sequence <a(bcde)df> should be projected to <(_e)df> instead of <(_de)df>.
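The entries of the first S-matrix can be reproduced with a short sketch. It assumes sequences are lists of frozensets and fills each cell by brute-force support counting, which is simpler than building the matrix in a single scan. The names parse, is_subsequence, support and s_matrix are hypothetical. A cell in row x, column y holds (Support(<yx>), Support(<xy>), Support(<(xy)>)), and a diagonal cell holds Support(<xx>).

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of frozensets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq

def is_subsequence(alpha, beta):
    """True if each element of alpha is a subset of a distinct, later element of beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support(alpha, db):
    """Number of sequences in db that contain alpha."""
    return sum(is_subsequence(alpha, s) for s in db)

def s_matrix(db, items):
    """Lower-triangular S-matrix over the given frequent items, keyed by (row, column)."""
    m = {}
    for i, x in enumerate(items):
        fx = frozenset(x)
        m[(x, x)] = support([fx, fx], db)                 # diagonal: <xx>
        for y in items[:i]:
            fy = frozenset(y)
            m[(x, y)] = (support([fy, fx], db),           # <yx>
                         support([fx, fy], db),           # <xy>
                         support([fx | fy], db))          # <(xy)>
    return m

DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]
M = s_matrix(DB, ["a", "b", "c", "d", "e", "f"])
print(M[("c", "a")], M[("c", "c")])   # (4, 2, 1) 3
```

Any cell value of at least min_support names a length-2 sequential pattern, which is how the matrix replaces the per-item projected databases at this level.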