Faster Subsequence and Don't-Care Pattern Matching on Compressed Texts
Takanori Yamamoto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Department of Informatics, Kyushu University, Japan
Originally presented at CPM 2011
Self introduction
Name: Shunsuke Inenaga (稲永 俊介)
Affiliation: Kyushu University, Japan
Research interests: string matching, text compression, algorithms, data structures
Agenda
Subsequence Pattern Matching
Compressed String Processing
Straight Line Program (SLP)
Algorithms
◦ Minimal Subsequence Occurrences on SLP
◦ Fixed Length Don't Care Matching on SLP
◦ Variable Length Don't Care Matching on SLP
Summary
Subsequences
String P of length m is a subsequence of string T of length N if there exist indices i_0, ..., i_{m-1} such that 0 ≤ i_0 < ... < i_{m-1} ≤ N-1 and P[j] = T[i_j] for all j = 0, ..., m-1.
Example
T = accbabbcab (positions 0123456789), P = abc
Occurrences of P in T as index tuples (i_0, i_1, i_2): (0, 3, 7), (0, 5, 7), (0, 6, 7), (4, 5, 7), (4, 6, 7)
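The example above can be reproduced with a brute-force enumeration. This is only an illustrative sketch on the uncompressed text, not part of the talk's algorithm; the function name `subsequence_occurrences` is an assumption made here.

```python
from itertools import combinations


def subsequence_occurrences(text, pattern):
    """Return every increasing index tuple (i_0, ..., i_{m-1}) that spells `pattern` in `text`."""
    m = len(pattern)
    return [
        idx
        for idx in combinations(range(len(text)), m)  # increasing tuples, lexicographic order
        if all(text[i] == pattern[j] for j, i in enumerate(idx))
    ]


# The slide's example: T = accbabbcab, P = abc.
occurrences = subsequence_occurrences("accbabbcab", "abc")
print(occurrences)
```

The enumeration tries all C(N, m) index tuples, so it is only usable on toy inputs.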
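There can be too many occurrences
T = ababababab (positions 0123456789), P = aaa
Any 3 of the 5 a's in T form an occurrence, so this small example already has C(5, 3) = 10 occurrences. In general the number of choices of indices can be as large as C(N, m) = O(N^m).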
Consider only start & end
T = ababababab (positions 0123456789), P = aaa
Two occurrences are equivalent if they start and end at the same positions. For example, all occurrences that start at position 0 and end at position 6 are represented by the single pair (0, 6).
There still exist O(N^2) non-equivalent occurrences.
Minimal Subsequence Occurrences
An occurrence (i_0, i_{m-1}) of subsequence P in T is minimal if there is no occurrence of P in T[i_0 : i_{m-1} - 1] or in T[i_0 + 1 : i_{m-1}].
In other words, (i_0, i_{m-1}) is minimal if there is no other occurrence of P within T[i_0 : i_{m-1}].
Minimal Subsequence Occurrences
T = ababababab (positions 0123456789), P = aaa
Minimal occurrences: (0, 4), (2, 6), ...
There are only O(N) minimal occurrences.
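The definition of a minimal occurrence can be checked directly on an uncompressed text. The sketch below is a naive quadratic-time check, not the compressed-text algorithm of the talk; the helper names `earliest_end` and `minimal_occurrences` are assumptions, and the pattern is assumed to be non-empty.

```python
def earliest_end(text, pattern, start):
    """Greedy scan from `start`: index at which `pattern` is fully matched, or None."""
    j = 0
    for i in range(start, len(text)):
        if text[i] == pattern[j]:
            j += 1
            if j == len(pattern):
                return i
    return None


def minimal_occurrences(text, pattern):
    """List all minimal occurrences (u, v) of a non-empty `pattern` in `text`."""
    result = []
    for u in range(len(text)):
        if text[u] != pattern[0]:
            continue
        v = earliest_end(text, pattern, u)
        if v is None:
            break  # no occurrence starts at u or later
        # (u, v) is minimal iff no occurrence fits inside text[u+1 : v].
        nxt = earliest_end(text, pattern, u + 1)
        if nxt is None or nxt > v:
            result.append((u, v))
    return result


print(minimal_occurrences("ababababab", "aaa"))
```

Because the greedy scan already yields the earliest end for each start, only the "shift the start to the right" half of the definition needs an explicit check.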
Problem setting
We want to compute the minimal occurrences of a query pattern when the text is given in compressed form.
Compressed String Processing
[Figure: a big string is compressed into a light compressed representation; rather than decompressing it and processing the string, the compressed representation is processed directly.]
Processing without explicit decompression can dramatically save time and space.
Straight Line Program [1/2]
An SLP S is a sequence of n assignments X_1 = expr_1; X_2 = expr_2; ...; X_n = expr_n, where each X_k is a variable and each expr_k is either a single character a (a ∈ Σ) or a concatenation X_i X_j (i, j < k).
An SLP S for string T is a context-free grammar in Chomsky normal form such that L(S) = {T}.
Straight Line Program [2/2]
SLP S (n = 8 assignments):
X_1 = a
X_2 = b
X_3 = X_1 X_2
X_4 = X_3 X_1
X_5 = X_3 X_4
X_6 = X_5 X_5
X_7 = X_4 X_6
X_8 = X_7 X_5
[Figure: derivation tree of T rooted at X_8, whose children are X_7 and X_5.]
T = abaababaababaababa is the string derived from X_8. In general, the length N of T satisfies N = O(2^n).
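To make the example concrete, the SLP above can be stored as a small dictionary and expanded. This is a minimal sketch under an assumed encoding (a rule is either a one-character string or a pair of earlier variable numbers); the names `SLP`, `lengths` and `expand` are illustrative.

```python
# Rule k is either a single character or a pair (i, j) with i, j < k meaning X_k = X_i X_j.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4), 6: (5, 5), 7: (4, 6), 8: (7, 5)}


def lengths(slp):
    """Compute |X_k| for every variable in O(n) time, without expanding anything."""
    size = {}
    for k in sorted(slp):
        rule = slp[k]
        size[k] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def expand(slp, k):
    """Decompress X_k explicitly (can take time exponential in n; for checking only)."""
    rule = slp[k]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


print(expand(SLP, 8), lengths(SLP)[8])
```

Computing lengths bottom-up is the typical first step of algorithms on SLPs, since it needs only one pass over the n rules.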
SLP: Abstract model of compression
Output of grammar-based compression algorithms (e.g., Re-pair, Sequitur, LZ78) of size n can be trivially converted to SLPs of size O(n) in O(n) time.
Output of LZ77 of size r can be converted to an SLP of size O(r log N) in O(r log N) time.
Therefore, an algorithm that works on SLPs can be applied to various types of compressed strings.
Our contribution
Given an SLP-compressed text and an uncompressed pattern, we propose O(nm) algorithms for:
◦ Subsequence pattern matching
◦ FLDC (fixed length don't care) pattern matching
◦ VLDC (variable length don't care) pattern matching
n = size of SLP, m = length of pattern
Subsequence matching
Subsequence Problems on SLP [Cégielski et al. 2006]
Minimal Subsequence Occurrences
Input: SLP of size n representing string T, string P
Output: # of minimal subsequence occurrences of P in T
Several variations, e.g.:
Bounded Minimal Subsequence Occurrences
Input: SLP of size n representing T, string P, integer w
Output: # of minimal subsequence occurrences (i_0, i_{m-1}) of P in T satisfying i_{m-1} - i_0 ≤ w
Comparison to previous work
Algorithm                                     Time
Decompression & [Troníček 2001]               O(Nm) = O(2^n m)
[Cégielski et al. 2006]                       O(nm^2 log m)
[Tiskin 2009]                                 O(nm^1.5)
[Tiskin 2011]                                 O(nm log m)
This work: subsequence problems               O(nm)
This work: extensions to pattern matching
  with fixed/variable length don't care
  symbols                                     O(nm)
串: Stabbed occurrences
For X_i = X_l X_r, an occurrence (u, v) of P is said to be a stabbed occurrence in X_i if 0 ≤ u < |X_l| ≤ v ≤ |X_i| - 1.
[Figure: X_i split into X_l and X_r, with the intervals (u, v) and (u', v') marked; a picture of oden (Japanese food on a skewer).]
串 (KUSHI) is a Kanji character meaning "skewer", used to stab food.
Every occurrence is stabbed
Observation: For any interval [u, v] with 0 ≤ u ≤ v ≤ N-1, there exists a variable X_i which stabs [u, v].
[Figure: the interval [u, v] inside T = X_n (positions 0 to N-1), stabbed by a variable X_i.]
Counting minimal occurrences
M_i: # of minimal occurrences of P in X_i
M_串(l, r): # of stabbed minimal occurrences of P in X_i = X_l X_r
M_n is the solution to our problem.
Computing M_i:
• If X_i = a (a ∈ Σ): M_i = 1 if m = 1 and P = a; M_i = 0 if m ≠ 1 or P ≠ a
• If X_i = X_l X_r (l, r < i): M_i = M_l + M_r + M_串(l, r)
Example: P = abc, X_i = X_l X_r = aabcxabaxcbxcxabxc. Here M_l = 1, M_r = 1 and M_串(l, r) = 2 (stabbed minimal occurrences), so M_i = 4.
June 27-29, CPM 2011 @ Palermo
Computing M_串(l, r)
There are at most m - 1 stabbed minimal occurrences.
R(l, k): length of the shortest suffix of X_l containing P[0 : m-k-1] as a subsequence
L(r, m-k): length of the shortest prefix of X_r containing P[m-k : m-1] as a subsequence
Example: P = abbba (m = 5), X_i = X_l X_r with X_l = aabbab and X_r = abbaba
k    R(l, k)    L(r, m-k)
0    ∞          -
1    5          1
2    5          4
3    2          4
4    2          6
5    -          6
Computing M_串(l, r)
Lemma: M_串(l, r) for all X_i = X_l X_r can be computed in a total of O(nm) time using L and R.
    C := 0, rmin := R(l, 0)
    for k := 1 to m - 1
        if rmin > R(l, k) and L(r, m-k) < L(r, m-k-1) then
            C := C + 1
            rmin := R(l, k)
        end if
    end for
    M_串(l, r) := C
L(i, j): length of the shortest prefix of X_i such that P[j : m-1] is a subsequence
R(i, j): length of the shortest suffix of X_i such that P[0 : m-j-1] is a subsequence
[Figure: X_i = X_l X_r with the suffix lengths rmin and R(l, k) marked in X_l and the prefix lengths L(r, m-k) and L(r, m-k-1) marked in X_r.]
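The counting loop of the lemma can be tried out on a small uncompressed example. In the sketch below the tables R(l, ·) and L(r, ·) are filled by brute force on plain strings, which is an assumption made only to keep the example short (the talk computes them on the SLP); the strings `X_l = aabbab`, `X_r = abbaba` and all function names are illustrative.

```python
import math


def shortest_prefix(s, p):
    """Length of the shortest prefix of s that has p as a subsequence (inf if none)."""
    if not p:
        return 0
    j = 0
    for i, ch in enumerate(s):
        if ch == p[j]:
            j += 1
            if j == len(p):
                return i + 1
    return math.inf


def shortest_suffix(s, p):
    """Same as shortest_prefix, but measured from the right end of s."""
    return shortest_prefix(s[::-1], p[::-1])


def count_stabbed(R_l, L_r, m):
    """Counting loop from the slide: R_l[k] = R(l, k), L_r[j] = L(r, j)."""
    count, rmin = 0, R_l[0]
    for k in range(1, m):
        if rmin > R_l[k] and L_r[m - k] < L_r[m - k - 1]:
            count += 1
            rmin = R_l[k]
    return count


P, X_l, X_r = "abbba", "aabbab", "abbaba"
m = len(P)
R_l = [shortest_suffix(X_l, P[: m - k]) for k in range(m + 1)]
L_r = [shortest_prefix(X_r, P[j:]) for j in range(m + 1)]
print(R_l, L_r, count_stabbed(R_l, L_r, m))
```

Updating `rmin` only when an occurrence is counted means that candidates for different k describing the same interval are counted once.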
Computing Q (to compute L)
Q(i, j): length of the longest prefix of P[j:] which is also a subsequence of X_i (i = 1, ..., n, j = 0, ..., m)
Computing Q(i, j):
• If X_i = a (a ∈ Σ): Q(i, j) = 1 if P[j] = a; Q(i, j) = 0 if P[j] ≠ a
• If X_i = X_l X_r (l, r < i): Q(i, j) = Q(l, j) + Q(r, j'), where j' = j + Q(l, j)
[Figure: the first Q(l, j) characters of P[j:] are matched in X_l; the next Q(r, j') characters, starting at P[j'], are matched in X_r.]
Computing Q (to compute L)
Example: X_i = X_l X_r = xxaxbxcdxexx, P[j:] = abcdef
Q(l, j) = 2, j' := j + Q(l, j), P[j':] = cdef, Q(r, j') = 3
Q(i, j) := Q(l, j) + Q(r, j') = 2 + 3 = 5
Lemma [Cégielski et al.]: For all i = 1, ..., n and j = 0, ..., m, Q(i, j) can be calculated in O(nm) time using DP.
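The recurrence for Q translates into a short bottom-up dynamic program. The following is a minimal sketch, assuming rules are stored as a dictionary in which a rule is either a character or a pair of earlier variable numbers; the encoding, the name `compute_Q` and the test pattern `abba` are assumptions, and the grammar is the eight-rule SLP from the earlier slide.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4), 6: (5, 5), 7: (4, 6), 8: (7, 5)}


def compute_Q(slp, pattern):
    """Q[i][j] = length of the longest prefix of pattern[j:] that is a subsequence of X_i."""
    m = len(pattern)
    Q = {}
    for i in sorted(slp):              # children always have smaller numbers
        rule = slp[i]
        row = [0] * (m + 1)            # Q(i, m) = 0: nothing left to match
        for j in range(m):
            if isinstance(rule, str):
                row[j] = 1 if pattern[j] == rule else 0
            else:
                l, r = rule
                jp = j + Q[l][j]       # j' = j + Q(l, j), never exceeds m
                row[j] = Q[l][j] + Q[r][jp]
        Q[i] = row
    return Q


Q = compute_Q(SLP, "abba")
print(Q[8])
```

Each of the n(m + 1) entries takes constant time, which gives the O(nm) bound stated in the lemma.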
Computing L
L(i, j): length of the shortest prefix of X_i such that P[j:] is a subsequence (i = 1, ..., n, j = 0, ..., m); ∞ if P[j:] is not a subsequence of X_i
Computing L(i, j):
• If X_i = a (a ∈ Σ): L(i, j) = 0 if j = m; L(i, j) = 1 if P[j:] = a; L(i, j) = ∞ if P[j:] ≠ a
• If X_i = X_l X_r: L(i, j) = L(l, j) if j' = m; L(i, j) = |X_l| + L(r, j') if j' < m, where j' = j + Q(l, j)
[Figure: when P[j:] does not fit inside X_l, the prefix covers all of X_l plus the first L(r, j') characters of X_r, so L(i, j) = |X_l| + L(r, j').]
Computing L
Lemma: L(i, j) can be computed for all i = 1, ..., n and j = 0, ..., m in a total of O(nm) time using Q(i, j). (Previously: O(nm^2 log m) [Cégielski et al. 2006].)
Example: P[j:] = abcdef, X_i = X_l X_r = xabxcxxdexfxx with X_l = xabxcx (|X_l| = 6) and X_r = xdexfxx
L(l, j) = ∞, Q(l, j) = 3, j' := j + 3, P[j':] = def, L(r, j') = 5
L(i, j) = |X_l| + L(r, j') = 6 + 5 = 11
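The slide's example can be replayed with a small dynamic program that fills Q and L together. This is a sketch under assumptions: rules are stored in a dictionary (character, or pair of earlier variable numbers), and the helper `add_string`, which builds a simple left-to-right chain of rules for a given string, exists only to set up the example.

```python
import math


def add_string(slp, s):
    """Append rules deriving s (one leaf per character, chained left to right); return the root id."""
    root = None
    for ch in s:
        leaf = len(slp) + 1
        slp[leaf] = ch
        if root is None:
            root = leaf
        else:
            slp[leaf + 1] = (root, leaf)
            root = leaf + 1
    return root


def compute_Q_and_L(slp, pattern):
    """Q as on the previous slides; L[i][j] = shortest prefix of X_i containing pattern[j:]."""
    m = len(pattern)
    size, Q, L = {}, {}, {}
    for i in sorted(slp):
        rule = slp[i]
        if isinstance(rule, str):
            size[i] = 1
            Q[i] = [int(j < m and pattern[j] == rule) for j in range(m + 1)]
            L[i] = [math.inf] * m + [0]          # L(i, m) = 0
            if m and pattern[-1] == rule:
                L[i][m - 1] = 1                  # P[j:] is exactly this character
        else:
            l, r = rule
            size[i] = size[l] + size[r]
            Q[i], L[i] = [], []
            for j in range(m + 1):
                jp = j + Q[l][j]
                Q[i].append(Q[l][j] + Q[r][jp])
                L[i].append(L[l][j] if jp == m else size[l] + L[r][jp])
    return size, Q, L


slp = {}
left = add_string(slp, "xabxcx")
right = add_string(slp, "xdexfxx")
top = len(slp) + 1
slp[top] = (left, right)
size, Q, L = compute_Q_and_L(slp, "abcdef")
print(Q[left][0], L[left][0], L[right][3], L[top][0])
```

Any other grammar for the same two strings would give the same values at `left`, `right` and `top`, since Q and L depend only on the derived strings.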
Result
Minimal Subsequence Occurrences Problem
Input: SLP of size n representing string T, string P
Output: # of minimal occurrences of subsequence P in T
Theorem: Given an SLP of size n and a pattern of length m, minimal subsequence occurrences can be computed in O(nm) time and space.
Algorithm                            Time
Decompression & [Troníček 2001]      O(Nm) = O(2^n m)
[Cégielski et al. 2006]              O(nm^2 log m)
[Tiskin 2009]                        O(nm^1.5)
[Tiskin 2011]                        O(nm log m)
This work                            O(nm)
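Putting the pieces together, the whole counting procedure fits in a short program. The sketch below is one possible reading of the slides, not the authors' code: the dictionary encoding of rules and the function names are assumptions, and R is obtained by running the L computation on the mirrored grammar (children swapped) with the reversed pattern, a shortcut chosen here for brevity.

```python
import math


def prefix_tables(slp, pattern):
    """L[i][j]: length of the shortest prefix of X_i containing pattern[j:] (inf if none)."""
    m = len(pattern)
    size, Q, L = {}, {}, {}
    for i in sorted(slp):
        rule = slp[i]
        if isinstance(rule, str):
            size[i] = 1
            Q[i] = [int(j < m and pattern[j] == rule) for j in range(m + 1)]
            L[i] = [math.inf] * m + [0]
            if m and pattern[-1] == rule:
                L[i][m - 1] = 1
        else:
            l, r = rule
            size[i] = size[l] + size[r]
            Q[i], L[i] = [], []
            for j in range(m + 1):
                jp = j + Q[l][j]
                Q[i].append(Q[l][j] + Q[r][jp])
                L[i].append(L[l][j] if jp == m else size[l] + L[r][jp])
    return L


def count_minimal_occurrences(slp, pattern):
    """Return M_n, the number of minimal occurrences of `pattern` in the text of the SLP."""
    m = len(pattern)
    L = prefix_tables(slp, pattern)
    # R(i, k) equals L on the mirrored grammar with the reversed pattern.
    mirrored = {i: r if isinstance(r, str) else (r[1], r[0]) for i, r in slp.items()}
    R = prefix_tables(mirrored, pattern[::-1])
    M = {}
    for i in sorted(slp):
        rule = slp[i]
        if isinstance(rule, str):
            M[i] = int(m == 1 and pattern == rule)
        else:
            l, r = rule
            stabbed, rmin = 0, R[l][0]
            for k in range(1, m):
                if rmin > R[l][k] and L[r][m - k] < L[r][m - k - 1]:
                    stabbed += 1
                    rmin = R[l][k]
            M[i] = M[l] + M[r] + stabbed
    return M[max(slp)]


SLP = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4), 6: (5, 5), 7: (4, 6), 8: (7, 5)}
print(count_minimal_occurrences(SLP, "aa"))
```

All tables have n(m + 1) entries and every entry is filled in constant time, which matches the O(nm) time and space of the theorem.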
FLDC matching