Composite repetition-aware text indexing
Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot
Compressed text indexes ◮ LZ family: LZ77 or LZ78. ◮ BWT family: FM index or Run-length encoded BWT (RLBWT). ◮ Compact directed acyclic word graph.
Repetition measures ◮ Number of phrases in the Lempel-Ziv parsing (LZ77). ◮ Number of runs in the Burrows-Wheeler transform (RLBWT). ◮ Number of maximal repeats. ◮ Number of right extensions and/or left extensions of maximal repeats (CDAWG).
Repetition measures (notation) ◮ Number of phrases in the Lempel-Ziv parsing: $|Z_T|$ (LZ77). ◮ Number of runs in the BWT of $T$: $|R_T|$ (RLBWT). ◮ Number of runs in the BWT of the reverse of $T$: $|R_{\overline{T}}|$ (RLBWT). ◮ Number of right extensions of maximal repeats: $|E^r_T \cup F^r_T|$ (CDAWG). ◮ Number of left extensions of maximal repeats: $|E^\ell_T \cup F^\ell_T|$ (CDAWG).
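The first two measures can be made concrete with a small Python sketch. It is illustrative only and not the authors' implementation: the BWT is built by sorting rotations, `$` is assumed to be a terminator smaller than every character of the text, and the LZ77 variant chosen here (longest earlier match, overlaps allowed, followed by one explicit character) is an assumption, since the slides do not fix one. All function names are hypothetical.

```python
def bwt(text, sentinel="$"):
    """BWT via sorted rotations (quadratic; for illustration only).

    Assumes `sentinel` does not occur in `text` and sorts before all of it.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or s[i - 1] != c)


def lz77_phrases(text):
    """Greedy LZ77 parse (assumed variant): each phrase is the longest
    prefix of the unparsed suffix that also starts at an earlier position
    (overlaps allowed), extended by one explicit character when available.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Restricting find() to text[:i + length] forces the match to
        # start at some position j < i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        end = min(i + length + 1, n)
        phrases.append(text[i:end])
        i = end
    return phrases


if __name__ == "__main__":
    t = "abababab"
    print(lz77_phrases(t), count_runs(bwt(t)))
```

On a highly repetitive string such as `abababab` both counts stay small compared with the length of the string, which is the behaviour the measures are meant to capture.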
Repetition measures ◮ Highly repetitive strings: 39 Saccharomyces cerevisiae genomes. ◮ Distinct measures of repetition all grow sublinearly.
Poster: Composite repetition-aware data structures. Djamal Belazzougui (1), Fabio Cunial (2), Travis Gagie (1), Nicola Prezza (3), Mathieu Raffinot (4).
(1) Department of Computer Science, University of Helsinki, Finland.
(2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
(3) Department of Mathematics and Computer Science, University of Udine, Italy.
(4) LIAFA, Paris Diderot University - Paris 7, France.
[1] Paolo Ferragina and Gonzalo Navarro. Pizza&Chili repetitive corpus. http://pizzachili.dcc.uchile.cl/repcorpus.html. Accessed: 2015-01-25.
Results: combining repetition-aware data structures ◮ Locating: space (in words) and query time of RLBWT+CDAWG and RLBWT+LZ77, built from the RLBWT of T [1], the CDAWG of T, and the LZ77 index [2]. ◮ Suffix tree representation.
[1] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
[2] Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013.
Locating with RLBWT+LZ77 ◮ The pattern P[1..m] is split into P[1..k-1] and P[k..m]. ◮ Rank/select on the RLBWT of T: predecessor data structure. ◮ Primary occurrences: 4-sided range reporting. ◮ Secondary occurrences: 2-sided range reporting.
[1] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17(2):81–84, 1983.
[2] Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the Twenty-seventh Annual Symposium on Computational Geometry, pages 1–10. ACM, 2011.
[3] Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP'96), pages 141–155, 1996.
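The distinction between primary and secondary occurrences can be illustrated with a short Python sketch. It uses one common convention, assumed here rather than taken from the slides: an occurrence is primary when it crosses a boundary between two LZ77 phrases, and secondary when it lies inside a single phrase. The index finds such occurrences with range reporting; the sketch scans the text by brute force. The function name and the input format (a list of phrase lengths) are hypothetical.

```python
from bisect import bisect_right


def classify_occurrences(text, pattern, phrase_lengths):
    """Split the occurrences of `pattern` in `text` into primary and
    secondary ones with respect to a parse given by `phrase_lengths`.

    Assumed convention: primary = spans at least two phrases,
    secondary = fully contained in one phrase.
    """
    # Starting position of every phrase in the text.
    starts, pos = [], 0
    for length in phrase_lengths:
        starts.append(pos)
        pos += length

    primary, secondary = [], []
    s = text.find(pattern)
    while s != -1:
        e = s + len(pattern) - 1
        # bisect_right(starts, x) - 1 is the index of the phrase holding x,
        # so equal values mean both ends fall in the same phrase.
        if bisect_right(starts, s) != bisect_right(starts, e):
            primary.append(s)
        else:
            secondary.append(s)
        s = text.find(pattern, s + 1)  # allow overlapping occurrences
    return primary, secondary


if __name__ == "__main__":
    # Parse a | b | ababab of the text abababab.
    print(classify_occurrences("abababab", "ab", [1, 1, 6]))
```

For a primary occurrence, the position of the crossed phrase boundary inside the occurrence corresponds to the split of P into P[1..k-1] and P[k..m].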
Locating with RLBWT+CDAWG
[Figure: blind matching of the pattern P = W in the CDAWG of T, starting from ε, with W_1 = aX; labels shown include (a, |X|), (c, |Y|), (c, p), |V| and |W|.]
[1] Maxime Crochemore and Christophe Hancart. Automata for matching patterns. In Handbook of Formal Languages, pages 399–462. Springer, 1997.
Suffix tree operations with the CDAWG ◮ CDAWG for locating. ◮ Matching statistics. ◮ Constant-space traversal.
Maximal repeats and LZ factorization: rightmost maximal repeats and LZ factors
[Figure: LZ factors T_i, T_{i+1} (with W_i and following character c) and T_j, T_{j+1} (with W_j and following character d), aligned with occurrences of a maximal repeat X.]
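The notion of maximal repeat used on this slide can be checked on small inputs with a brute-force Python sketch. It assumes the usual definition: a nonempty string that occurs at least twice, is preceded by at least two distinct characters, and is followed by at least two distinct characters. The two text boundaries are treated as distinct extensions and the empty string is left out; both choices are assumptions of this sketch. The CDAWG yields the same set far more efficiently, and the function name is hypothetical.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Return the sorted list of nonempty maximal repeats of `text`.

    Brute force over all substrings (cubic time); for illustration only.
    """
    count = defaultdict(int)
    left = defaultdict(set)   # characters seen immediately before each substring
    right = defaultdict(set)  # characters seen immediately after each substring
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = text[i:j]
            count[w] += 1
            # None stands for a text boundary and counts as its own symbol.
            left[w].add(text[i - 1] if i > 0 else None)
            right[w].add(text[j] if j < n else None)
    return sorted(
        w for w in count
        if count[w] >= 2 and len(left[w]) >= 2 and len(right[w]) >= 2
    )


if __name__ == "__main__":
    print(maximal_repeats("abababab"))
```

For `abababab` the repeats `a`, `b`, `ba`, `aba` and `bab` are excluded because each of them is always preceded or always followed by the same character.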