
Faster Subsequence and Don't-Care Pattern Matching on Compressed Texts



  1. Faster Subsequence and Don't-Care Pattern Matching on Compressed Texts. Takanori Yamamoto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda (Department of Informatics, Kyushu University, Japan). Originally presented at CPM 2011.

  2. Self introduction
     • Name: Shunsuke Inenaga (稲永 俊介)
     • Affiliation: Kyushu University, Japan
     • Research interests: string matching, text compression, algorithms, data structures

  3. Agenda
     • Subsequence Pattern Matching
     • Compressed String Processing
     • Straight Line Program (SLP)
     • Algorithms
       ◦ Minimal Subsequence Occurrences on SLP
       ◦ Fixed Length Don't Care Matching on SLP
       ◦ Variable Length Don't Care Matching on SLP
     • Summary

  4. Subsequences. A string $P$ of length $m$ is a subsequence of a string $T$ of length $N$ if and only if there exist indices $i_0, \ldots, i_{m-1}$ such that $0 \le i_0 < \cdots < i_{m-1} \le N-1$ and $P[j] = T[i_j]$ for all $j = 0, \ldots, m-1$.

  5-10. Example (built up one occurrence per slide). $T$ = accbabbcab (positions 0123456789), $P$ = abc. The occurrences of $P$ as a subsequence of $T$, written as index tuples $(i_0, i_1, i_2)$, are (0, 3, 7), (0, 5, 7), (0, 6, 7), (4, 5, 7) and (4, 6, 7).
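The example above can be reproduced by brute force. The following is a minimal sketch, not part of the original slides: the function name `subsequence_occurrences` is hypothetical, and it simply tries every increasing index tuple, which is only feasible for tiny inputs.

```python
from itertools import combinations

def subsequence_occurrences(text, pattern):
    """Return every index tuple (i_0, ..., i_{m-1}) with text[i_j] == pattern[j]."""
    m = len(pattern)
    return [
        idx
        for idx in combinations(range(len(text)), m)  # strictly increasing tuples
        if all(text[i] == pattern[j] for j, i in enumerate(idx))
    ]

occurrences = subsequence_occurrences("accbabbcab", "abc")
print(occurrences)
```

For $T$ = accbabbcab and $P$ = abc this lists the five tuples shown on the slides.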

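  11. There can be too many occurrences. Example: $T$ = ababababab (positions 0123456789), $P$ = aaa. Every choice of three of the five a's is an occurrence; in general the number of choices of indices can be as large as $\binom{N}{m} = O(N^m)$.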

  12. Consider only start & end. Example: $T$ = ababababab, $P$ = aaa. Two occurrences are equivalent if and only if they start and end at the same positions, e.g. all occurrences that span (0, 6). There still exist $O(N^2)$ non-equivalent occurrences.

  13. Minimal Subsequence Occurrences
     • An occurrence $(i_0, i_{m-1})$ of subsequence $P$ in $T$ is minimal if there is no occurrence of $P$ in $T[i_0 : i_{m-1} - 1]$ or in $T[i_0 + 1 : i_{m-1}]$.
     • In other words, $(i_0, i_{m-1})$ is minimal if there is no other occurrence of $P$ within $T[i_0 : i_{m-1}]$.

  14. Minimal Subsequence Occurrences (example). $T$ = ababababab (positions 0123456789), $P$ = aaa. The minimal occurrences are (0, 4), (2, 6), … There are only $O(N)$ minimal occurrences.
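To make the definition concrete, here is a minimal sketch that computes minimal occurrences on an uncompressed text. It is a quadratic-time illustration and not the algorithm of the talk; the names `earliest_end` and `minimal_occurrences` are hypothetical. It uses the fact that a leftmost greedy scan finds the earliest possible end of an occurrence.

```python
def earliest_end(text, pattern, start):
    """End index of the leftmost greedy embedding of pattern in text[start:], or None."""
    j = 0
    for i in range(start, len(text)):
        if text[i] == pattern[j]:
            j += 1
            if j == len(pattern):
                return i
    return None

def minimal_occurrences(text, pattern):
    """All minimal occurrences (start, end) of a non-empty pattern in text."""
    result = []
    for u in range(len(text)):
        if text[u] != pattern[0]:
            continue
        v = earliest_end(text, pattern, u)
        if v is None:
            break  # no occurrence starts at u or later
        nxt = earliest_end(text, pattern, u + 1)
        # (u, v) is minimal iff no occurrence fits inside text[u+1 : v]
        if nxt is None or nxt > v:
            result.append((u, v))
    return result

print(minimal_occurrences("ababababab", "aaa"))
```

On the earlier example ($T$ = accbabbcab, $P$ = abc) the same function reports the single minimal occurrence (4, 7).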

  15. Problem setting. We want to compute the minimal occurrences of a query pattern when the text is given in compressed form.

  16. Compressed String Processing. [Diagram: a BIG string is compressed into a light compressed representation. Ordinary string processing would decompress it first; compressed string processing works on the compressed string directly.] Processing without explicit decompression can dramatically save time and space.

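  17. Straight Line Program [1/2]. An SLP $S$ is a sequence of $n$ assignments $X_1 = \mathit{expr}_1;\ X_2 = \mathit{expr}_2;\ \ldots;\ X_n = \mathit{expr}_n$, where each $X_k$ is a variable and each $\mathit{expr}_k$ is either a single character $a$ ($a \in \Sigma$) or a concatenation $X_i X_j$ ($i, j < k$). An SLP $S$ for a string $T$ is a context-free grammar in Chomsky normal form such that $L(S) = \{T\}$.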

  18-19. Straight Line Program [2/2]. Example SLP $S$ with $n = 8$ assignments:
     $X_1$ = a; $X_2$ = b; $X_3 = X_1 X_2$; $X_4 = X_3 X_1$; $X_5 = X_3 X_4$; $X_6 = X_5 X_5$; $X_7 = X_4 X_6$; $X_8 = X_7 X_5$
     [Figure: derivation tree of $X_8$, whose children are $X_7$ and $X_5$; its leaves spell $T$.] Here $T$ = abaababaababaababa, so $N = 18$. In general $N = O(2^n)$.
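The example SLP can be written down and expanded directly. This is a minimal sketch: the dictionary encoding (a terminal is a one-character string, a production is a pair of smaller variable indices) and the names `RULES`, `lengths` and `expand` are assumptions made for illustration. Expansion is shown only to check the grammar; the point of the talk is to avoid it.

```python
# Example SLP from the slides. Index k stands for the variable X_k.
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4),
         6: (5, 5), 7: (4, 6), 8: (7, 5)}

def lengths(rules):
    """|X_k| for every variable, computed bottom-up in O(n) time."""
    length = {}
    for k in sorted(rules):  # children always have smaller indices
        rhs = rules[k]
        length[k] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length

def expand(rules):
    """Decompressed string of every variable (can be exponentially long in n)."""
    text = {}
    for k in sorted(rules):
        rhs = rules[k]
        text[k] = rhs if isinstance(rhs, str) else text[rhs[0]] + text[rhs[1]]
    return text

print(expand(RULES)[8], lengths(RULES)[8])
```

Note that `lengths` never builds a string, which is why it stays linear in $n$ even when $N$ is exponential.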

  20. SLP: Abstract model of compression
     • The output of grammar-based compression algorithms (e.g., Re-pair, Sequitur, LZ78) of size $n$ can be trivially converted to an SLP of size $O(n)$ in $O(n)$ time.
     • The output of LZ77 of size $r$ can be converted to an SLP of size $O(r \log N)$ in $O(r \log N)$ time.
     • Therefore, algorithms that work on SLPs are widely useful: they can be applied to many types of compressed strings.

  21. Our contribution. Given an SLP-compressed text and an uncompressed pattern, we propose $O(nm)$ algorithms for:
       ◦ Subsequence pattern matching
       ◦ FLDC (fixed length don't care) pattern matching
       ◦ VLDC (variable length don't care) pattern matching
     where $n$ = size of the SLP and $m$ = length of the pattern.

  22. Subsequence matching

  23. Subsequence Problems on SLP [Cégielski et al. 2006]
     Minimal Subsequence Occurrences
     • Input: SLP of size $n$ representing string $T$, string $P$
     • Output: number of minimal subsequence occurrences of $P$ in $T$
     Several variations exist, e.g. Bounded Minimal Subsequence Occurrences
     • Input: SLP of size $n$ representing $T$, string $P$, integer $w$
     • Output: number of minimal subsequence occurrences $(i_0, i_{m-1})$ of $P$ in $T$ satisfying $i_{m-1} - i_0 \le w$

  24. Comparison to previous work (time complexity)
     • Decompression & [Troníček 2001]: $O(Nm) = O(2^n m)$
     • [Cégielski et al. 2006]: $O(nm^2 \log m)$
     • [Tiskin 2009]: $O(nm^{1.5})$
     • [Tiskin 2011]: $O(nm \log m)$
     • This work, subsequence problems: $O(nm)$
     • This work, extensions to pattern matching with fixed/variable length don't care symbols: $O(nm)$

  25. 串: Stabbed occurrences. For $X_i = X_l X_r$, an occurrence $(u, v)$ of $P$ is said to be a stabbed occurrence in $X_i$ if $0 \le u < |X_l| \le v \le |X_i| - 1$. [Figure: $X_i$ split into $X_l$ and $X_r$ with an occurrence $(u, v)$ crossing the boundary, shown next to a skewer of おでん (oden).] 串 (kushi) is a kanji character meaning "skewer", which is used to stab food.

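  26. Every occurrence is stabbed. Observation: for any interval $[u, v]$ with $0 \le u \le v \le N-1$, there exists a variable $X_i$ which stabs $[u, v]$. [Figure: the derivation tree of $X_n$ over positions $0, \ldots, N-1$, with a node $X_i$ whose boundary falls inside $[u, v]$.]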
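The observation can be illustrated by walking down the derivation tree until the boundary between the two children falls inside the interval. This is a minimal sketch under assumptions: the SLP encoding and the name `stabbing_variable` are hypothetical, the grammar is the example SLP from the earlier slides, and for a single position ($u = v$) the walk ends at a leaf, which the sketch returns as a degenerate case.

```python
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4),
         6: (5, 5), 7: (4, 6), 8: (7, 5)}

def lengths(rules):
    length = {}
    for k in sorted(rules):
        rhs = rules[k]
        length[k] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length

def stabbing_variable(rules, u, v):
    """Return (i, offset): X_i stabs [u, v]; this copy of X_i starts at position offset."""
    length = lengths(rules)
    i, offset = max(rules), 0  # start at the root X_n
    assert 0 <= u <= v < length[i]
    while not isinstance(rules[i], str):
        left, right = rules[i]
        boundary = offset + length[left]  # first position covered by the right child
        if v < boundary:
            i = left                      # interval lies entirely in the left child
        elif u >= boundary:
            i, offset = right, boundary   # interval lies entirely in the right child
        else:
            return i, offset              # u < boundary <= v: X_i stabs [u, v]
    return i, offset                      # u == v: a single-character leaf

print(stabbing_variable(RULES, 13, 17))
```

The walk takes time proportional to the height of the derivation tree, and each step only needs the precomputed lengths.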

  27. Counting minimal occurrences
     • $M_i$: number of minimal occurrences of $P$ in $X_i$
     • $M_{\text{串}}(l, r)$: number of stabbed minimal occurrences of $P$ in $X_i = X_l X_r$
     • $M_n$ is the solution to our problem.
     Computing $M_i$:
     • If $X_i = a$ ($a \in \Sigma$): $M_i = 1$ if $m = 1$ and $P = a$; $M_i = 0$ if $m \neq 1$ or $P \neq a$.
     • If $X_i = X_l X_r$ ($l, r < i$): $M_i = M_l + M_r + M_{\text{串}}(l, r)$.
     Example: $P$ = abc, $X_i = X_l X_r$ = aabcxabaxcbxcxabxc. Here $M_l = 1$, $M_r = 1$ and $M_{\text{串}}(l, r) = 2$ (stabbed minimal occurrences), so $M_i = 4$.
     (CPM 2011, June 27-29, Palermo)
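The recurrence $M_i = M_l + M_r + M_{\text{串}}(l, r)$ can be checked by brute force on the example string. This is a minimal sketch: the transcript does not preserve where $X_l$ ends, so the split after 8 characters is an assumption chosen to be consistent with the stated values, and `minimal_occurrences` is a hypothetical quadratic-time helper rather than the talk's algorithm.

```python
def minimal_occurrences(text, pattern):
    """All minimal occurrences (start, end) of a non-empty pattern, by brute force."""
    def earliest_end(start):
        j = 0
        for i in range(start, len(text)):
            if text[i] == pattern[j]:
                j += 1
                if j == len(pattern):
                    return i
        return None

    found = []
    for u in range(len(text)):
        v = earliest_end(u)
        if v is None:
            break
        nxt = earliest_end(u + 1)
        if text[u] == pattern[0] and (nxt is None or nxt > v):
            found.append((u, v))
    return found

pattern = "abc"
left, right = "aabcxaba", "xcbxcxabxc"  # assumed split of aabcxabaxcbxcxabxc
m_l = len(minimal_occurrences(left, pattern))
m_r = len(minimal_occurrences(right, pattern))
all_minimal = minimal_occurrences(left + right, pattern)
stabbed = [(u, v) for (u, v) in all_minimal if u < len(left) <= v]
m_i = len(all_minimal)
print(m_l, m_r, len(stabbed), m_i)
```

Every minimal occurrence of the concatenation lies in the left part, lies in the right part, or crosses the boundary, which is exactly what the three terms of the recurrence count.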

  28-29. Computing $M_{\text{串}}(l, r)$. There are at most $m - 1$ stabbed (crossing) minimal occurrences. For each $k$, $R(l, k)$ is the length of the shortest suffix of $X_l$ containing $P[0 : m-k-1]$, and $L(r, m-k)$ is the length of the shortest prefix of $X_r$ containing $P[m-k : m-1]$.
     Example: $P$ = abbba, $X_i = X_l X_r$ = aabbababbaba with $X_l$ = aabbab and $X_r$ = abbaba.
       k   R(l, k)   L(r, m-k)
       0   ∞         -
       1   5         1
       2   5         4
       3   2         4
       4   2         6
       5   -         6

  30. Computing $M_{\text{串}}(l, r)$. Lemma: $M_{\text{串}}(l, r)$ for all $X_i = X_l X_r$ can be computed in a total of $O(nm)$ time using $L$ and $R$.
       C := 0, rmin := R(l, 0)
       for k := 1 to m - 1
         if rmin > R(l, k) and L(r, m-k) < L(r, m-k-1) then
           C := C + 1
           rmin := R(l, k)
         end if
       end for
       M串(l, r) := C
     • $L(i, j)$: length of the shortest prefix of $X_i$ such that $P[j : m-1]$ is a subsequence of it
     • $R(i, j)$: length of the shortest suffix of $X_i$ such that $P[0 : m-j-1]$ is a subsequence of it
     [Figure: $X_i = X_l X_r$ with the suffix lengths rmin and $R(l, k)$ marked in $X_l$ and the prefix lengths $L(r, m-k)$ and $L(r, m-k-1)$ marked in $X_r$.]
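The counting loop of the lemma can be tried on plain strings. This is a minimal sketch under assumptions: $L$ and $R$ are computed here by greedy scans over the decompressed halves (the talk computes them on the SLP instead), the helper names are hypothetical, and the halves aabbab / abbaba with $P$ = abbba are an assumed reading of the example from the preceding slides.

```python
INF = float("inf")

def shortest_prefix(s, q):
    """Length of the shortest prefix of s containing q as a subsequence (INF if none)."""
    if not q:
        return 0
    j = 0
    for i, c in enumerate(s):
        if c == q[j]:
            j += 1
            if j == len(q):
                return i + 1
    return INF

def shortest_suffix(s, q):
    """Mirror image of shortest_prefix: scan both strings from the right."""
    return shortest_prefix(s[::-1], q[::-1])

def count_stabbed_minimal(left, right, pattern):
    m = len(pattern)
    R = [shortest_suffix(left, pattern[:m - k]) for k in range(m + 1)]  # R(l, k)
    L = [shortest_prefix(right, pattern[j:]) for j in range(m + 1)]     # L(r, j)
    count, rmin = 0, R[0]
    for k in range(1, m):
        # candidate k starts later than the last counted one and ends earlier than k+1
        if rmin > R[k] and L[m - k] < L[m - k - 1]:
            count += 1
            rmin = R[k]
    return count

print(count_stabbed_minimal("aabbab", "abbaba", "abbba"))
```

Once the two tables are available the loop itself is $O(m)$ per variable, which gives the $O(nm)$ total of the lemma.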

  31. Computing $Q$ (to compute $L$). $Q(i, j)$: length of the longest prefix of $P[j:]$ which is also a subsequence of $X_i$ ($i = 1, \ldots, n$; $j = 0, \ldots, m$).
     Computing $Q(i, j)$:
     • If $X_i = a$ ($a \in \Sigma$): $Q(i, j) = \begin{cases} 1 & \text{if } P[j] = a \\ 0 & \text{if } P[j] \neq a \end{cases}$
     • If $X_i = X_l X_r$ ($l, r < i$): $Q(i, j) = Q(l, j) + Q(r, j')$, where $j' = j + Q(l, j)$
     [Figure: the first $Q(l, j)$ characters of $P[j:]$ are matched in $X_l$ and the next $Q(r, j')$ characters, starting at $P[j']$, are matched in $X_r$.]

  32. Computing $Q$ (to compute $L$), example. $X_i = X_l X_r$ = xxaxbxcdxexx and $P[j:]$ = abcdef. The characters ab are matched in $X_l$, so $Q(l, j) = 2$ and $j' := j + Q(l, j)$, giving $P[j':]$ = cdef. The characters cde are matched in $X_r$, so $Q(r, j') = 3$. Hence $Q(i, j) := Q(l, j) + Q(r, j') = 2 + 3 = 5$.
     Lemma [Cégielski et al.]: for all $i = 1, \ldots, n$ and $j = 0, \ldots, m$, $Q(i, j)$ can be calculated in $O(nm)$ time using DP.
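The dynamic program for $Q$ translates almost line by line into code. This is a minimal sketch: the SLP encoding and the name `compute_Q` are assumptions, the grammar is the example SLP from the earlier slides, the pattern abba is chosen only for illustration, and $Q(i, m) = 0$ is used for the empty suffix of $P$.

```python
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4),
         6: (5, 5), 7: (4, 6), 8: (7, 5)}

def compute_Q(rules, pattern):
    """Q[i][j]: length of the longest prefix of pattern[j:] that is a subsequence of X_i."""
    m = len(pattern)
    Q = {}
    for i in sorted(rules):  # children have smaller indices, so they are ready
        rhs = rules[i]
        if isinstance(rhs, str):
            Q[i] = [1 if j < m and pattern[j] == rhs else 0 for j in range(m + 1)]
        else:
            l, r = rhs
            # j' = j + Q(l, j) never exceeds m, so the lookup is always in range
            Q[i] = [Q[l][j] + Q[r][j + Q[l][j]] for j in range(m + 1)]
    return Q

Q = compute_Q(RULES, "abba")
print(Q[5])
```

Each of the $n(m+1)$ table entries is filled in constant time, which matches the $O(nm)$ bound of the lemma.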

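  33. Computing $L$. $L(i, j)$: length of the shortest prefix of $X_i$ such that $P[j:]$ is a subsequence of it ($i = 1, \ldots, n$; $j = 0, \ldots, m$), and $\infty$ if $P[j:]$ is not a subsequence of $X_i$.
     Computing $L(i, j)$:
     • If $X_i = a$ ($a \in \Sigma$): $L(i, j) = \begin{cases} 0 & \text{if } j = m \\ 1 & \text{if } P[j:] = a \\ \infty & \text{if } P[j:] \neq a \end{cases}$
     • If $X_i = X_l X_r$: $L(i, j) = \begin{cases} L(l, j) & \text{if } j' = m \\ |X_l| + L(r, j') & \text{if } j' < m \end{cases}$, where $j' = j + Q(l, j)$
     [Figure: when $P[j:]$ does not fit in $X_l$, the prefix covers all of $X_l$ plus the shortest prefix of $X_r$ containing $P[j':]$, so $L(i, j) = |X_l| + L(r, j')$.]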

  34. Computing $L$ (previous work [Cégielski et al. 2006]: $O(nm^2 \log m)$). Lemma: $L(i, j)$ can be computed for all $i = 1, \ldots, n$ and $j = 0, \ldots, m$ in a total of $O(nm)$ time using $Q(i, j)$.
     Example: $P[j:]$ = abcdef and $X_i = X_l X_r$ = xabxcxxdexfxx with $|X_l| = 6$. Then $L(l, j) = \infty$ and $Q(l, j) = 3$, so $j' := j + 3$ and $P[j':]$ = def. With $L(r, j') = 5$ we get $L(i, j) = |X_l| + L(r, j') = 6 + 5 = 11$.
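The recurrence for $L$ can be filled in alongside $Q$ in one bottom-up pass. This is a minimal sketch: the SLP encoding and the name `compute_L` are assumptions, the grammar is the example SLP from the earlier slides, and the pattern abba is chosen only for illustration.

```python
INF = float("inf")
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4),
         6: (5, 5), 7: (4, 6), 8: (7, 5)}

def compute_L(rules, pattern):
    """L[i][j]: length of the shortest prefix of X_i containing pattern[j:] (INF if none)."""
    m = len(pattern)
    length, Q, L = {}, {}, {}
    for i in sorted(rules):
        rhs = rules[i]
        if isinstance(rhs, str):
            length[i] = 1
            Q[i] = [1 if j < m and pattern[j] == rhs else 0 for j in range(m + 1)]
            L[i] = [0 if j == m else 1 if pattern[j:] == rhs else INF
                    for j in range(m + 1)]
        else:
            l, r = rhs
            length[i] = length[l] + length[r]
            Q[i], L[i] = [], []
            for j in range(m + 1):
                jp = j + Q[l][j]  # j' on the slides
                Q[i].append(Q[l][j] + Q[r][jp])
                # pattern[j:] fits in X_l, or it spills over into X_r
                L[i].append(L[l][j] if jp == m else length[l] + L[r][jp])
    return L

L = compute_L(RULES, "abba")
print(L[8])
```

For $X_8$ = abaababaababaababa the shortest prefix containing abba is abaaba, so the first entry printed is 6.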

  35. Result. Minimal Subsequence Occurrences Problem
     • Input: SLP of size $n$ representing string $T$, string $P$
     • Output: number of minimal occurrences of subsequence $P$ in $T$
     Theorem: given an SLP of size $n$ and a pattern of length $m$, minimal subsequence occurrences can be computed in $O(nm)$ time and space.
     Comparison:
     • Decompression & [Troníček 2001]: $O(Nm) = O(2^n m)$
     • [Cégielski et al. 2006]: $O(nm^2 \log m)$
     • [Tiskin 2009]: $O(nm^{1.5})$
     • [Tiskin 2011]: $O(nm \log m)$
     • This work: $O(nm)$
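Putting the pieces together gives an end-to-end count of minimal occurrences without decompressing the text. This is a minimal sketch of how the tables from the preceding slides might be combined, not the authors' implementation. The SLP encoding and function names are assumptions, and $R$ is obtained here by running the $L$ computation on the mirrored grammar with the reversed pattern, which is one possible way to get the symmetric table.

```python
INF = float("inf")
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4),
         6: (5, 5), 7: (4, 6), 8: (7, 5)}

def prefix_table(rules, pattern):
    """L[i][j]: shortest prefix length of X_i containing pattern[j:] (INF if none)."""
    m = len(pattern)
    length, Q, L = {}, {}, {}
    for i in sorted(rules):
        rhs = rules[i]
        if isinstance(rhs, str):
            length[i] = 1
            Q[i] = [1 if j < m and pattern[j] == rhs else 0 for j in range(m + 1)]
            L[i] = [0 if j == m else 1 if pattern[j:] == rhs else INF
                    for j in range(m + 1)]
        else:
            l, r = rhs
            length[i] = length[l] + length[r]
            Q[i], L[i] = [], []
            for j in range(m + 1):
                jp = j + Q[l][j]
                Q[i].append(Q[l][j] + Q[r][jp])
                L[i].append(L[l][j] if jp == m else length[l] + L[r][jp])
    return L

def count_minimal(rules, pattern):
    """M_n: number of minimal occurrences of a non-empty pattern in the text of X_n."""
    m = len(pattern)
    L = prefix_table(rules, pattern)
    mirrored = {i: rhs if isinstance(rhs, str) else (rhs[1], rhs[0])
                for i, rhs in rules.items()}
    R = prefix_table(mirrored, pattern[::-1])  # R[i][k]: suffix of X_i, pattern[:m-k]
    M = {}
    for i in sorted(rules):
        rhs = rules[i]
        if isinstance(rhs, str):
            M[i] = 1 if pattern == rhs else 0
            continue
        l, r = rhs
        stabbed, rmin = 0, R[l][0]
        for k in range(1, m):
            if rmin > R[l][k] and L[r][m - k] < L[r][m - k - 1]:
                stabbed += 1
                rmin = R[l][k]
        M[i] = M[l] + M[r] + stabbed
    return M[max(rules)]

print(count_minimal(RULES, "abba"))
```

Both tables have $n(m+1)$ entries and the counting loop is $O(m)$ per variable, so the sketch stays within the $O(nm)$ time and space of the theorem.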

  36. FLDC matching
