Pattern Matching on Compressed Texts II

Pattern Matching on Compressed Texts II, Shunsuke Inenaga, Kyushu University, Japan



  1. Pattern Matching on Compressed Texts II. Shunsuke Inenaga, Kyushu University, Japan

  2. Agenda
     • Fully Compressed Pattern Matching
     • Straight Line Program
     • Compressed String Comparison
     • Period of Compressed String
     • Pattern Discovery from Compressed String (Palindrome and Square)
     • FCPM for 2D SLP
     • Open Problems

  3. Fully Compressed Pattern Matching [1/3]
     • pattern: Dagstuhl
     • compressed pattern: &(aG
     • compressed text: [illustration: a long block of opaque compressed data, beginning geoiy083qa0gj(#*gpfomo)#(JGWRE$(U)%ARY)(J…]

  4. Fully Compressed Pattern Matching [2/3]
     • classical pattern matching algorithm: uncompressed text, uncompressed pattern
     • compressed pattern matching algorithm: compressed text, uncompressed pattern
     • fully compressed pattern matching algorithm: compressed text, compressed pattern

  5. Possible Application of FCPM: compressed text where.jpg, compressed pattern wally.jpg (the match is labeled "I’m here.")

  6. Fully Compressed Pattern Matching [3/3] FCPM Problem
     Input: $\mathcal{T} = \mathrm{compress}(T)$ and $\mathcal{P} = \mathrm{compress}(P)$.
     Output: the set $Occ(T, P)$ of substring occurrences of pattern $P$ in text $T$, where
     $Occ(T, P) = \{\, |u| + 1 : T = uPw,\ u, w \in \Sigma^{*} \,\}$.
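To make the definition of Occ(T, P) concrete, here is a minimal sketch that computes it by brute force on uncompressed strings. The function name `occ` is chosen for illustration only; it is not an FCPM algorithm, since it reads the whole text.

```python
def occ(text, pattern):
    """Return the sorted 1-based positions |u| + 1 such that text = u + pattern + w."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(occ("abaababaab", "aba"))  # positions are 1-based, as in the slide
```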

  7. Straight Line Program [1/2]
     SLP $\mathcal{T}$: a sequence of assignments $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$, where each $X_k$ is a variable and
     $expr_k = a$ with $a \in \Sigma$, or $expr_k = X_i X_j$ with $i, j < k$.
     An SLP $\mathcal{T}$ for a string $T$ is a CFG in Chomsky normal form such that $L(\mathcal{T}) = \{T\}$.

  8. Straight Line Program [2/2]
     SLP $\mathcal{T}$ with $n = 8$ assignments:
     $X_1 = a$, $X_2 = b$, $X_3 = X_1 X_2$, $X_4 = X_3 X_1$, $X_5 = X_3 X_4$, $X_6 = X_5 X_5$, $X_7 = X_4 X_6$, $X_8 = X_7 X_5$.
     $|T| = N$, $N = O(2^n)$.

  9. Straight Line Program [2/2] (continued)
     [Figure: derivation tree of the same SLP, with root $X_8$ and children $X_7$ and $X_5$; $|T| = N$, $N = O(2^n)$.]
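The example SLP can be explored with a short sketch. The dictionary encoding of the rules and the helper names `lengths`, `char_at` and `expand` are assumptions made for illustration. The first two work on the rules alone, without decompressing; `expand` decompresses and is only there for checking.

```python
# The SLP from the slide. A rule is either a single character
# or a pair (i, j) meaning X_i X_j.
SLP = {
    1: "a", 2: "b",
    3: (1, 2), 4: (3, 1), 5: (3, 4), 6: (5, 5), 7: (4, 6), 8: (7, 5),
}


def lengths(slp):
    """Compute |X_k| for every variable without expanding any string."""
    size = {}
    for k in sorted(slp):  # rules only refer to smaller indices
        rule = slp[k]
        size[k] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def char_at(slp, size, k, pos):
    """Return the pos-th character (1-based) of X_k by walking down the rules."""
    while not isinstance(slp[k], str):
        left, right = slp[k]
        if pos <= size[left]:
            k = left
        else:
            pos -= size[left]
            k = right
    return slp[k]


def expand(slp, k):
    """Fully decompress X_k (exponential in general; for checking only)."""
    rule = slp[k]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


size = lengths(SLP)
text = expand(SLP, 8)
print(size[8], text)
```

Because each step of `char_at` moves to a variable with a smaller index, a single lookup takes at most n steps even when the text has length exponential in n.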

  10. From LZ77 to SLP For any string $T$ given in LZ77-compressed form of size $k$, an SLP generating $T$ of size $O(k^2)$ can be constructed in $O(k^2)$ time. [Rytter ’00, ’03, ’04]

  11. FCPM for SLP
      FCPM Problem for SLP
      Input: SLP $\mathcal{T}$ for text $T$ and SLP $\mathcal{P}$ for pattern $P$.
      Output: a compact representation of the set $Occ(T, P)$ of substring occurrences of $P$ in $T$.
      • We want to solve the problem efficiently (i.e., in time and space polynomial in $n$ and $m$).
        ◦ $n$ = the size of SLP $\mathcal{T}$, $m$ = the size of SLP $\mathcal{P}$
      • $T$ (and likewise $P$) cannot be decompressed, since $|T| = O(2^n)$.
      • A compact representation is needed, since $|Occ(T, P)| = O(2^n)$.

  12. Key Definition
      $Occ^{\triangle}(X, Y) = \{\, i \in Occ(X, Y) : |X_l| - |Y| \le i \le |X_l| \,\}$, where $X = X_l X_r$ is a variable of $\mathcal{T}$ and $Y$ is a variable of $\mathcal{P}$.
      This is the set of occurrences of $Y$ in $X$ that cover or touch the boundary of $X_l$ and $X_r$.

  13. Key Lemma [Miyazaki et al. ’97] $Occ^{\triangle}(X, Y)$ forms a single arithmetic progression, so it can be stored in $O(1)$ space.

  14. Key Observation [Miyazaki et al. ’97]
      $Occ(X, Y) = Occ(X_l, Y) \cup Occ^{\triangle}(X, Y) \cup (Occ(X_r, Y) \oplus |X_l|)$,
      where $\oplus\,|X_l|$ adds $|X_l|$ to every element of the set.
      Computing $Occ(X, Y)$ is reduced to computing $Occ^{\triangle}(X, Y)$.
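The decomposition of Occ(X, Y) can be checked on explicit strings. The sketch below is a brute-force illustration, not the algorithm of Miyazaki et al.: the helper names are hypothetical, and the boundary set is the one from the Key Definition (occurrences i of Y with |X_l| - |Y| <= i <= |X_l|), found here by scanning the decompressed string.

```python
def occ(text, pattern):
    """All 1-based start positions of pattern in text."""
    m = len(pattern)
    return {i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern}


def occ_boundary(x_left, x_right, y):
    """Occurrences i of y in x_left + x_right with |X_l| - |Y| <= i <= |X_l|."""
    lo, hi = len(x_left) - len(y), len(x_left)
    return {i for i in occ(x_left + x_right, y) if lo <= i <= hi}


def occ_by_decomposition(x_left, x_right, y):
    """Left occurrences, boundary occurrences, and right occurrences shifted by |X_l|."""
    shifted = {i + len(x_left) for i in occ(x_right, y)}
    return occ(x_left, y) | occ_boundary(x_left, x_right, y) | shifted


def is_arithmetic_progression(values):
    """True if the sorted values have a constant difference (trivially true for <= 2 values)."""
    v = sorted(values)
    return all(v[k + 1] - v[k] == v[1] - v[0] for k in range(1, len(v) - 1))


# X_8 = X_7 X_5 from the example SLP, with the pattern "aba".
x_left, x_right, y = "abaababaababa", "ababa", "aba"
print(sorted(occ_by_decomposition(x_left, x_right, y)))
print(sorted(occ_boundary(x_left, x_right, y)))
```

In both test cases the boundary set is an arithmetic progression, as the Key Lemma states.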

  15. DP for $Occ^{\triangle}(X_i, Y_j)$
      [Figure: an $n \times m$ table with one entry $Occ^{\triangle}(X_i, Y_j)$ for each variable $X_i$ ($1 \le i \le n$) and $Y_j$ ($1 \le j \le m$); each entry takes $O(1)$ space, and the entry for $(X_n, Y_m)$ yields $Occ(T, P)$.]
      The table is a compact representation of $Occ(T, P)$ which answers a membership query to $Occ(T, P)$ in $O(n)$ time.
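The membership query can be sketched as a walk down the text SLP. This is a simplified illustration under stated assumptions, not the DP from the slide: the boundary occurrences are stored as Python sets rather than arithmetic progressions, they are filled in by decompressing each variable, and the pattern is a plain non-empty string. All names (`build_table`, `is_occurrence`) are hypothetical. Only the query itself follows the idea of the slide.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4), 6: (5, 5), 7: (4, 6), 8: (7, 5)}


def expand(slp, k):
    """Fully decompress X_k (used only to fill the table in this sketch)."""
    rule = slp[k]
    return rule if isinstance(rule, str) else expand(slp, rule[0]) + expand(slp, rule[1])


def build_table(slp, pattern):
    """For every variable, store |X_k| and the occurrences at the X_l / X_r boundary."""
    size, boundary, m = {}, {}, len(pattern)
    for k in sorted(slp):
        rule = slp[k]
        if isinstance(rule, str):
            size[k] = 1
            boundary[k] = {1} if rule == pattern else set()
        else:
            left, right = rule
            size[k] = size[left] + size[right]
            x = expand(slp, k)
            lo, hi = max(size[left] - m, 1), size[left]
            boundary[k] = {i for i in range(lo, hi + 1) if x[i - 1:i - 1 + m] == pattern}
    return size, boundary


def is_occurrence(slp, size, boundary, k, i, m):
    """Decide whether the pattern of length m occurs at 1-based position i of X_k."""
    while True:
        if i in boundary[k]:
            return True
        rule = slp[k]
        if isinstance(rule, str):
            return False
        left, right = rule
        if i + m - 1 <= size[left]:      # candidate lies entirely inside X_l
            k = left
        elif i > size[left]:             # candidate lies entirely inside X_r
            k, i = right, i - size[left]
        else:                            # crosses the boundary but is not stored
            return False


size, boundary = build_table(SLP, "aba")
hits = [i for i in range(1, size[8] + 1) if is_occurrence(SLP, size, boundary, 8, i, 3)]
print(hits)
```

Each iteration of the loop moves to a variable with a smaller index, so a query inspects at most n table entries.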

  16. Known Results
      • Miyazaki et al. ’97: $O(m^2 n^2)$ time, $O(mn)$ space, compression: SLP
      • Lifshits ’07: $O(mn^2)$ time, $O(mn)$ space, compression: SLP
      • Hirao et al. ’00: $O(mn)$ time, $O(mn)$ space, compression: Balanced SLP

  17. Fully Compressed Subsequence Pattern Matching [1/2]
      FC Subsequence PM Problem
      Input: SLP $\mathcal{T}$ for text $T$ and SLP $\mathcal{P}$ for pattern $P$.
      Output: whether $P$ is a subsequence of $T$.
      • $P$ is said to be a subsequence of $T$ if $P$ can be obtained by removing zero or more characters from $T$.

  18. Fully Compressed Subsequence Pattern Matching [2/2] The Fully Compressed Subsequence Pattern Matching Problem on SLP compressed strings is NP-hard. [Lifshits & Lohrey ’06]

  19. Compressed String Comparison [1/2]
      CSC Problem
      Input: SLPs $\mathcal{T}$ and $\mathcal{S}$ for strings $T$ and $S$, respectively.
      Output: the (dis)similarity of $T$ and $S$.

  20. Compressed String Comparison [2/2]
      • Equality: $O(mn^2)$ time, $O(mn)$ space [Lifshits ’07]
      • Hamming Distance: #P-complete, PSPACE [Lifshits ’07]
      • Longest Common Substring: $O((m+n)^4 \log(m+n))$ time, $O((m+n)^3)$ space [Matsubara et al. ’08]
      • Longest Common Subsequence: NP-hard, PSPACE [Lifshits & Lohrey ’06]

  21. Property of common substrings [1/3]
      • For each common substring $Z$ of strings $S$ and $T$, there always exist variables $X_i = X_l X_r$ and $Y_j = Y_L Y_R$ such that:
        ◦ $Z$ is a common substring of $X_i$ and $Y_j$
        ◦ $Z$ contains an overlap between $X_l$ and $Y_R$
      [Figure: $X_i = X_l X_r$ and $Y_j = Y_L Y_R$ sharing the common substring $Z$, which contains the overlap $w$.]

  22. Property of common substrings [2/3]
      • For each common substring $Z$ of strings $S$ and $T$, there always exists a string $w$ such that:
        – $w$ is a substring of $Z$
        – $w$ is an overlap of variables of $S$ and $T$

  23. Property of common substrings [3/3]
      • For each common substring $Z$ of strings $S$ and $T$, there always exists a string $w$ such that:
        ◦ $Z$ can be calculated by expanding $w$
      [Figure: the overlap $w$ is expanded to the left and to the right to obtain the common substring $Z$.]

  24. Computing Overlaps
      Lemma [Karpinski et al. ’97] For any variables $X_i$ and $X_j$ of SLP $\mathcal{T}$, $OL(X_i, X_j)$ can be represented by $O(n)$ arithmetic progressions.
      Theorem [Karpinski et al. ’97] For any SLP $\mathcal{T}$, $OL(X_i, X_j)$ can be computed for all $i, j$ in a total of $O(n^4 \log n)$ time and $O(n^3)$ space.

  25. Periods of Compressed String [1/2]
      Compressed Period Problem
      Input: SLP $\mathcal{T}$ for string $T$.
      Output: a compact representation of the set $Period(T)$ of periods of $T$, where
      $Period(T) = \{\, |T| - |u| : T = uv = wu,\ v, w \in \Sigma^{+} \,\}$.

  26. Periods of Compressed String [2/2] An $O(n)$-size representation of $Period(T)$ can be computed in $O(n^4)$ time with $O(n^3)$ space. [Lifshits ’06, ’07]
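For intuition about what Period(T) contains, the following brute-force sketch lists the periods of an uncompressed string. It assumes periods are positive and that |T| itself counts as a period (the case where the border u is empty). It is not the compressed algorithm cited above.

```python
def periods(text):
    """Return every p in 1..|T| such that T[i] == T[i + p] wherever both are defined."""
    n = len(text)
    return [p for p in range(1, n + 1) if text[p:] == text[:n - p]]


print(periods("abaaba"))
```

Each period p corresponds to a border u of length |T| - p, that is, a string that is both a prefix and a suffix of T.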

  27. Compressed Palindrome Discovery [1/2]
      Compressed Palindrome Discovery Problem
      Input: SLP $\mathcal{T}$ for string $T$.
      Output: a compact representation of the set $Pal(T)$ of maximal palindromes of $T$, where
      $Pal(T) = \{\, (p, q) : T[p:q] \text{ is the maximal palindrome centered at } \lfloor (p+q)/2 \rfloor \,\}$.
      • ex. $T$ = baabbaa

  28. Compressed Palindrome Discovery [2/2] An $O(n^2)$-size representation of $Pal(T)$ can be computed in $O(n^4)$ time with $O(n^2)$ space. [Matsubara et al. ’08]
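As a point of reference for Pal(T), here is a minimal center-expansion sketch on an uncompressed string, run on the example T = baabbaa. It is a quadratic-time illustration, not the compressed algorithm of Matsubara et al. Two conventions are assumptions of this sketch: odd-length and even-length centers are handled separately, and centers whose maximal palindrome is empty are skipped.

```python
def maximal_palindromes(text):
    """Return (p, q), 1-based and inclusive, for the maximal non-empty palindrome at each center."""
    n = len(text)
    result = []
    for center in range(2 * n - 1):
        # Even values of `center` sit on a character, odd values between two characters.
        left, right = center // 2, (center + 1) // 2
        while left >= 0 and right < n and text[left] == text[right]:
            left -= 1
            right += 1
        # The palindrome is text[left + 1 : right] in 0-based, half-open terms.
        if right - left - 1 > 0:
            result.append((left + 2, right))
    return result


print(maximal_palindromes("baabbaa"))
```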

  29. Composition System
      CS $\mathcal{T}$: a sequence of assignments $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$, where each $X_k$ is a variable and
      $expr_k = a$ with $a \in \Sigma$, or $expr_k = X_i X_j$ with $i, j < k$, or $expr_k = {}^{[p]}X_i \cdot X_j^{[q]}$ with $i, j < k$.
      • ${}^{[p]}X = X[1:p]$
      • $X^{[q]} = X[|X| - q + 1 : |X|]$
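A small sketch can show how the extra rule of a composition system behaves. The tuple encoding and the example system `CS` below are made up for illustration: a string is a terminal, `(i, j)` stands for X_i X_j, and `(p, i, j, q)` stands for the length-p prefix of X_i followed by the length-q suffix of X_j. The evaluator decompresses everything, so it only demonstrates the semantics.

```python
def evaluate(cs):
    """Expand every variable of a composition system (illustration only)."""
    value = {}
    for k in sorted(cs):  # rules only refer to smaller indices
        rule = cs[k]
        if isinstance(rule, str):
            value[k] = rule
        elif len(rule) == 2:
            i, j = rule
            value[k] = value[i] + value[j]
        else:
            p, i, j, q = rule  # assumes 0 <= p <= |X_i| and 0 <= q <= |X_j|
            value[k] = value[i][:p] + value[j][len(value[j]) - q:]
    return value


# Hypothetical example: X_5 is the prefix of length 3 of X_4 followed by the suffix of length 2 of X_4.
CS = {1: "a", 2: "b", 3: (1, 2), 4: (3, 3), 5: (3, 4, 4, 2)}
print(evaluate(CS)[5])
```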

  30. From LZ77 to CS For any string $T$ given in LZ77-compressed form of size $k$, a CS generating $T$ of size $O(k \log k)$ can be constructed in polynomial time. [Gasieniec et al. ’96]

  31. Compressed Square Discovery [1/2]
      Compressed Square Problem
      Input: CS $\mathcal{T}$ for string $T$.
      Output: whether $T$ is square-free (i.e., whether $T$ contains no square).
      • A square is any non-empty string of the form $xx$.

  32. Compressed Square Discovery [2/2] We can test the square-freeness of $T$ in time polynomial in the size of the given composition system $\mathcal{T}$. [Gasieniec et al. ’96, Rytter ’00]
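For comparison with the compressed setting, square-freeness of an uncompressed string can be tested by brute force. The sketch below simply tries every start position and every half-length; it is not the polynomial-time algorithm on composition systems cited above.

```python
def has_square(text):
    """Return True if text contains a non-empty substring of the form xx."""
    n = len(text)
    for half in range(1, n // 2 + 1):
        for start in range(n - 2 * half + 1):
            if text[start:start + half] == text[start + half:start + 2 * half]:
                return True
    return False


print(has_square("abcacb"), has_square("abcbc"))
```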

  33. 2D SLP
      2D SLP $\mathcal{T}$: a sequence of assignments $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$, where each $X_k$ is a variable and $expr_k$ is one of:
      • $a$ with $a \in \Sigma$
      • the horizontal concatenation of $X_i$ and $X_j$ ($i, j < k$, $height(X_i) = height(X_j)$), which places $X_j$ to the right of $X_i$
      • the vertical concatenation of $X_i$ and $X_j$ ($i, j < k$, $width(X_i) = width(X_j)$), which places $X_j$ below $X_i$

  34. FCPM for 2D SLP The Fully Compressed Pattern Matching Problem for 2D SLP is $\Sigma_2^P$-complete. [Berman et al. ’97, Rytter ’00]

  35. Open Problems [1/2]
      • Edit distance of two SLP-compressed strings.
      • Compact representation of all maximal runs of an SLP-compressed string.
        ◦ A run is any string $x$ whose minimal period $p$ satisfies $p \le |x|/2$.
        ◦ ex. $(aab)^{8/3}$ = aabaabaa
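The run condition from the slide can be checked directly on an uncompressed string. The helper names below are chosen for illustration; the sketch tests a single given string and does not enumerate the maximal runs of a text.

```python
from fractions import Fraction


def minimal_period(x):
    """Smallest p >= 1 such that x[i] == x[i + p] wherever both are defined (x non-empty)."""
    n = len(x)
    return next(p for p in range(1, n + 1) if x[p:] == x[:n - p])


def is_run(x):
    """The slide's condition: the minimal period p satisfies p <= |x| / 2."""
    return len(x) > 0 and 2 * minimal_period(x) <= len(x)


def exponent(x):
    """|x| divided by the minimal period, e.g. 8/3 for aabaabaa."""
    return Fraction(len(x), minimal_period(x))


print(minimal_period("aabaabaa"), is_run("aabaabaa"), exponent("aabaabaa"))
```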

  36. Max Number of Runs in a String
      [Figure: bounds of the form $cN$ plotted on an axis from 0 to $5N$, where $N$ is the (uncompressed) text length.]
      Upper bounds:
      • $cN$ [Kolpakov & Kucherov ’99]
      • $5N$ [Rytter ’06]
      • $3.44N$ [Rytter ’07]
      • $3.48N$ [Puglisi et al. ’08]
      • $1.6N$ [Crochemore & Ilie ’08]
      • $1.048N$ [Crochemore et al. ’08]
      Lower bounds:
      • $0.927N$ [Franek et al. ’03]
      • $0.944565N$ [Kusano et al. ’08]

  37. Open Problems [2/2]
      • Fully Compressed Tree Pattern Matching for grammar-based XML compression.
        ◦ TGCA (Tree Grammar Compression Algorithm) [Onuma et al. ’06]
