algorithms on grammar compressed strings

Algorithms on grammar compressed strings (Shunsuke Inenaga, Kyushu University)



  1. Dagstuhl Seminar 13232 Algorithms on grammar compressed strings Shunsuke Inenaga Kyushu University, Japan

  2. What we did after Dagstuhl Seminar 08261
     - In Dagstuhl Seminar 08261 (in 2008), I gave a survey talk about algorithmic results on grammar-based compressed strings that had been achieved before 2008.
     - Today, I will talk about the new(er) results we have achieved since 2008.

  3. Collaborations
     - Japanese: Hideo Bannai, Tomohiro I, Masayuki Takeda, Keisuke Goto, Yuto Nakashima, Kouji Shimohira, Takanori Yamamoto (Kyushu U.); Ayumi Shinohara, Kazuyuki Narisawa, Wataru Matsubara (Tohoku U.)
     - International: Paweł Gawrychowski (Max Planck), Travis Gagie (U. Helsinki), Gad M. Landau (U. Haifa), Moshe Lewenstein (Bar Ilan U.)

  4. Compressed String Processing (CSP)
     [Figure: two pipelines. Non-CSP: compressed data, decompress, uncompressed data, process, output. CSP: compressed data, process, output.]
     In CSP we do not decompress the whole data.

  5. Compressed String Processing [Cont.]
     - Suppose that huge string data is stored in a compressed form.
     - Given a compressed string, our goal is to perform various kinds of processing on the compressed string, without decompressing the whole string.
     - Our input is a straight-line program (SLP).

  6. Straight Line Program (SLP)
     An SLP is a sequence of productions $X_1 = \mathit{expr}_1, X_2 = \mathit{expr}_2, \ldots, X_n = \mathit{expr}_n$, where each $\mathit{expr}_i$ has one of two forms:
     - $\mathit{expr}_i = a$ ($a \in \Sigma$)
     - $\mathit{expr}_i = X_l X_r$ ($l, r < i$)

     Remarks:
     - The size of the SLP is the number $n$ of productions.
     - An SLP is essentially a CFG deriving a single string.
     - SLPs model outputs of grammar-based compression algorithms (e.g., Re-pair, Sequitur, LZ78, etc.).

  7. Example of SLP
     SLP $S$:
     - $X_1 = a$
     - $X_2 = b$
     - $X_3 = X_1 X_1$
     - $X_4 = X_1 X_2$
     - $X_5 = X_3 X_4$
     - $X_6 = X_5 X_4$
     - $X_7 = X_5 X_6$

     [Figure: derivation tree $T$ of SLP $S$, rooted at $X_7$.]
     String represented by SLP $S$: aaabaaabab
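The example above can be checked with a small sketch. The dictionary encoding of productions and the helper name `expand` are assumptions made for illustration, not something given in the talk. Note that `expand` decompresses the whole string, which is exactly what CSP algorithms avoid; it is used here only to verify the example.

```python
# Productions of the example SLP S, indexed as on the slide.
# A str is a terminal production X_i = a; a pair (l, r) means X_i = X_l X_r.
EXAMPLE_SLP = {
    1: "a",
    2: "b",
    3: (1, 1),
    4: (1, 2),
    5: (3, 4),
    6: (5, 4),
    7: (5, 6),
}


def expand(slp, i):
    """Return the string derived by X_i (full decompression, for checking only)."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(slp, left) + expand(slp, right)


print(expand(EXAMPLE_SLP, 7))  # prints "aaabaaabab"
```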

  8. DAG view of SLP
     [Figure: the DAG for SLP $S$ (one node per variable $X_1, \ldots, X_7$) next to the derivation tree $T$ of SLP $S$.]
     - The DAG is a compressed representation of the derivation tree.
     - The SLP is a compressed representation of the string.

  9. Important Remark
     [Figure: derivation tree of $X_6 = X_5 X_4$, deriving aaabab.]
     - Derivation trees are used only for explanation, and are never constructed in our algorithms.
     - CSP on SLPs can be seen as an algorithmic technique for performing various kinds of operations on the DAG for the SLP, not on the derivation tree.

  10. Notations
      - $n$: the size of a given SLP $S$
      - $h$: the height of the derivation tree $T$ of $S$
      - $N$: the length of the decompressed string $w$ that is represented by SLP $S$

      Remarks:
      - $\log_2 N \le h \le n$ always holds.
      - In theory, $N = O(2^n)$.
      - Solutions polynomial in $n$ are beneficial.
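As a rough illustration of these quantities, the sketch below computes $N$ and $h$ from the productions alone, in one pass and without decompressing. The encoding of productions, the function name `lengths_and_heights`, and the convention that a terminal production has height 0 (height counted in edges between variable nodes) are assumptions of this sketch.

```python
import math


def lengths_and_heights(slp):
    """Compute |X_i| and the height of X_i for every variable, bottom-up."""
    length, height = {}, {}
    for i in sorted(slp):  # children always have smaller indices (l, r < i)
        rule = slp[i]
        if isinstance(rule, str):
            length[i], height[i] = 1, 0
        else:
            left, right = rule
            length[i] = length[left] + length[right]
            height[i] = 1 + max(height[left], height[right])
    return length, height


# The example SLP from the talk: X_7 derives a string of length 10.
example = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}
length, height = lengths_and_heights(example)
n, N, h = len(example), length[7], height[7]
print(n, N, h)  # prints "7 10 4"
assert math.log2(N) <= h <= n

# A doubling SLP (X_1 = a, X_i = X_{i-1} X_{i-1}) shows how N can be exponential in n.
doubling = {1: "a", 2: (1, 1), 3: (2, 2), 4: (3, 3), 5: (4, 4)}
print(lengths_and_heights(doubling)[0][5])  # prints "16"
```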

  11. Pattern Mining

      | problem | time | space (words) |
      |---|---|---|
      | $q$-gram frequencies | $O(qn)$ | $O(qn)$ |
      | $q$-gram frequencies | $O(N - \alpha)$ | $O(N - \alpha)$ |
      | non-overlapping $q$-gram frequencies | $O(q^2 n)$ | $O(qn)$ |
      | longest repeating substring | $O(n^4 \log n)$ | $O(n^3)$ |

      $N - \alpha \le \min(qn, N)$ always holds.

  12. SLP Text vs. Uncompressed Pattern

      | problem | time | space (words) |
      |---|---|---|
      | (window) subsequence matching | $O(nM)$ | $O(nM)$ |
      | (window) VLDC pattern matching | $O(nM)$ | $O(nM)$ |
      | convolution | $O((N - \alpha) \log M)$ | $O((N - \alpha) \log M)$ |

      - $M$ is the length of the uncompressed pattern.
      - $N - \alpha \le \min(nM, N)$ always holds.

  13. String Regularities

      | problem | time | space (words) |
      |---|---|---|
      | square freeness | $O(n^3 h \log N)$ | $O(n^2)$ |
      | repetitions (runs & squares) | $O(n^3 h)$ | $O(n^2)$ |
      | palindromes | $O(nh(n + h \log N))$ | $O(n^2)$ |
      | gapped palindromes | $O(nh(n^2 + g \log N))$ | $O(n(n + g))$ |
      | periods | $O(n^2 h)$ | $O(n^2)$ |
      | covers | $O(nh(n + \log^2 N))$ | $O(n^2)$ |

      $g$ is the fixed gap length.

  14. Factorization

      | problem | time | space (words) |
      |---|---|---|
      | LZ78 factorization | $O(n + s \log N)$ | $O(n + s)$ |
      | LZ78 factorization | $O(n + s \log s)$ | $O(n + s \log s)$ |
      | LZ77 factorization | $O(z n^2 h \log N)$ | $O(n^2 + z)$ |
      | Lyndon factorization | $O(n^4 + m n^3 h)$ | $O(n^2)$ |
      | Lyndon factorization | $O(nh(n + \log^2 N))$ | $O(n^2)$ |

      - $s$ is the number of LZ78 factors.
      - $z$ is the number of LZ77 factors.
      - $m$ is the number of Lyndon factors.

  15. And Some Others

      | problem | time | space (words) |
      |---|---|---|
      | longest common substring | $O(n^4 \log n)$ | $O(n^2 \log N)$ |
      | longest common extension | $O(n^3 h)$ preprocess, $O(h \log N)$ query | $O(n^2)$ |
      | Aho-Corasick automaton | $O(n^4 \log n)$ | $O(n^2 \log N)$ |

      Our SLP-based Aho-Corasick automaton runs in $O(|u| (k + h + \log|\Sigma|))$ time on uncompressed text $u$, where $k$ is the number of patterns.

  16. $q$-gram Frequency on SLP
      Problem 1 ($q$-gram frequencies on SLP): Given an SLP $S$ representing string $w$ and a positive integer $q$, compute $\mathit{Occ}(w, p)$ for all substrings $p$ of $w$ of length $q$.
      $\mathit{Occ}(w, p)$: the number of occurrences of $p$ in $w$.

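  17. Solution for Uncompressed String
      - Given the uncompressed string $w$, we can solve the $q$-gram frequencies problem in $O(N)$ time, using the suffix array (SA) and LCP array of $w$.
      - Example with $q = 3$:

      | SA | LCP | suffix |
      |---|---|---|
      | 8 | - | `$` |
      | 7 | 0 | `a$` |
      | 5 | 1 | `aba$` |
      | 3 | 3 | `ababa$` |
      | 1 | 5 | `abababa$` |
      | 6 | 0 | `ba$` |
      | 4 | 2 | `baba$` |
      | 2 | 4 | `bababa$` |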


  19. Solution for Uncompressed String [Cont.]
      Scan the SA and LCP arrays from top to bottom with $q = 3$. The suffix `aba$` (SA = 5) has LCP = 1 < 3, so it starts a new group of suffixes that share the same 3-gram.

  20. Solution for Uncompressed String [Cont.]
      The next two suffixes, `ababa$` (SA = 3) and `abababa$` (SA = 1), have LCP = 3 and LCP = 5, both $\ge 3$, so they belong to the same group as `aba$`.

  21. Solution for Uncompressed String [Cont.]
      The suffix `ba$` (SA = 6) has LCP = 0 < 3, which closes the group. Output (pos, q, #occ) = (5, 3, 3): the 3-gram at position 5 occurs 3 times.

  22. Solution for Uncompressed String [Cont.]
      The suffix `baba$` (SA = 4) has LCP = 2 < 3 and starts the next group; `bababa$` (SA = 2) has LCP = 4 $\ge 3$ and joins it. Output (pos, q, #occ) = (4, 3, 2).

  23. Solution for Uncompressed String [Cont.]
      Outputs (pos, q, #occ) for $q = 3$: (5, 3, 3) and (4, 3, 2).
      In the sequel, I will show how to simulate this $O(N)$-time algorithm in $O(qn)$ time.
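The scan over the SA and LCP arrays can be sketched as follows. This is only an illustration under stated assumptions: the function name `qgram_frequencies` is hypothetical, and the suffix array and LCP array are built naively by sorting and direct comparison, so the sketch does not achieve the $O(N)$ bound (that would need linear-time SA and LCP construction). Positions are reported 1-based, as on the slides.

```python
def qgram_frequencies(w, q):
    """Return (pos, q, #occ) triples for every distinct q-gram of w."""
    text = w + "$"  # sentinel, assumed smaller than every character of w
    n = len(text)
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(n), key=lambda i: text[i:])
    # Naive LCP array: lcp[k] = longest common prefix of suffixes sa[k-1], sa[k].
    lcp = [0] * n
    for k in range(1, n):
        a, b = sa[k - 1], sa[k]
        m = 0
        while a + m < n and b + m < n and text[a + m] == text[b + m]:
            m += 1
        lcp[k] = m
    out = []
    k = 0
    while k < n:
        start = sa[k]
        if start + q > len(w):  # fewer than q characters before the sentinel
            k += 1
            continue
        j = k + 1
        while j < n and lcp[j] >= q:  # same group while LCP >= q
            j += 1
        out.append((start + 1, q, j - k))
        k = j
    return out


print(qgram_frequencies("abababa", 3))  # prints [(5, 3, 3), (4, 3, 2)]
```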

  24. Stab
      An integer interval $[b, e]$ ($1 \le b \le e \le N$) is said to be stabbed by a variable $X_i$ if the LCA of the $b$-th and $e$-th leaves of the derivation tree $T$ is labeled by $X_i$.
      [Figure: derivation tree $T$ of the example SLP over the string aaabaaabab, with leaf positions 1 to 10.]
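One way to find the stabbing variable without building the derivation tree is to walk down from the root using the lengths $|X_i|$. The sketch below assumes the same dictionary encoding of productions as an illustration device; the names `lengths` and `stabbing_variable` are hypothetical. The walk takes time proportional to the height $h$ once the lengths are known.

```python
def lengths(slp):
    """Length of the string derived by each variable, computed bottom-up."""
    length = {}
    for i in sorted(slp):
        rule = slp[i]
        length[i] = 1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]]
    return length


def stabbing_variable(slp, root, b, e):
    """Return i such that X_i stabs the interval [b, e] (1-based, b <= e)."""
    length = lengths(slp)
    i, lo = root, 1  # X_i currently covers positions starting at lo
    while True:
        rule = slp[i]
        if isinstance(rule, str):
            return i  # b == e: the LCA is the leaf itself
        left, right = rule
        mid = lo + length[left] - 1  # last position covered by the left child
        if e <= mid:
            i = left  # interval lies entirely in the left child
        elif b > mid:
            i, lo = right, mid + 1  # interval lies entirely in the right child
        else:
            return i  # interval crosses the boundary: X_i is the LCA


example = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}
print(stabbing_variable(example, 7, 4, 5))  # prints 7
print(stabbing_variable(example, 7, 8, 9))  # prints 6
print(stabbing_variable(example, 7, 7, 8))  # prints 4
```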

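  25. Observation
      - Assume that the occurrence of a $q$-gram $p$ starting at position $j$ is stabbed by variable $X_i$.
      - Then, in any other occurrence of $X_i$ in $T$, there is another stabbed occurrence of $p$.

      [Figure: tree $T$ over string $w$ with several occurrences of $X_i$, each stabbing an occurrence of $p$; one occurrence spans positions $j$ to $j + q - 1$.]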

  26. Sub-problems
      Hence, the $q$-gram frequencies problem on SLP reduces to the following sub-problems:
      - Problem 2: For each variable $X_i$, count the number of occurrences of $X_i$ in the derivation tree $T$.
      - Problem 3: For each variable $X_i$, count the number of occurrences of each $q$-gram stabbed by $X_i$.

  27. Solving Problem 2
      Lemma 1: Problem 2 can be solved in $O(n)$ time.
      [Figure: the DAG for SLP $S$.]

  28. Solving Problem 2 [Cont.]
      - The root occurs exactly once: $X_7$ has count 1.

  29. Solving Problem 2 [Cont.]
      - For each node in a topological order, propagate its number of occurrences to its children.
      - After processing $X_7$: $X_6$ has count 1 and $X_5$ has count 1.

  30. Solving Problem 2 [Cont.]
      - After processing $X_6$: $X_5$ has count 2 and $X_4$ has count 1.

  31. Solving Problem 2 [Cont.]
      - After processing $X_5$: $X_3$ has count 2 and $X_4$ has count 3.

  32. Solving Problem 2 [Cont.]
      - After processing $X_3$: $X_1$ has count 4.

  33. Solving Problem 2 [Cont.]
      - After processing $X_4$: $X_1$ has count 7 and $X_2$ has count 3.

  34. Solving Problem 2 [Cont.]
      Final counts: $X_7$: 1, $X_6$: 1, $X_5$: 2, $X_3$: 2, $X_4$: 3, $X_1$: 7, $X_2$: 3.
      [Figure: the DAG annotated with these counts, next to the derivation tree $T$ over aaabaaabab.]
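The propagation steps above can be sketched as a single pass over the productions. The dictionary encoding and the function name `derivation_tree_occurrences` are assumptions of this sketch. Visiting variables in decreasing index order is one valid topological order, since every production $X_i = X_l X_r$ has $l, r < i$; the walkthrough above uses a different but equally valid order.

```python
def derivation_tree_occurrences(slp, root):
    """Count how often each variable labels a node of the derivation tree."""
    occ = {i: 0 for i in slp}
    occ[root] = 1  # the root occurs exactly once
    for i in sorted(slp, reverse=True):  # parents before children
        rule = slp[i]
        if isinstance(rule, str):
            continue  # terminal production: nothing to propagate
        left, right = rule
        occ[left] += occ[i]
        occ[right] += occ[i]
    return occ


example = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}
print(derivation_tree_occurrences(example, 7))
# prints {1: 7, 2: 3, 3: 2, 4: 3, 5: 2, 6: 1, 7: 1}
```

As a sanity check, the counts for $X_1$ and $X_2$ (7 and 3) equal the numbers of a's and b's in aaabaaabab.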
