
Indexing in repetition-aware space: Nicola Prezza, University of ...



  1. Indexing in repetition-aware space. Nicola Prezza, University of Udine, Department of Computer Science. Dagstuhl Seminar 16431: "Computation over Compressed Structured Data". Sections: Overview; LZ77 in RLE space; lz-rlbwt, in practice.

  2. Topics:
     - LZ77 computation in $O(|\mathrm{RLBWT}|)$ space
     - RLBWT ↔ LZ77 conversions in $O(|\mathrm{RLBWT}| + |\mathrm{LZ77}|)$ space
     - lz-rlbwt construction in asymptotically optimal space
     - The DYNAMIC library
     - Practical variants of the lz-rlbwt index, with results

  3. Lempel-Ziv parsing. Problem: LZ77 computation.
     - LZ77 can be computed with an index on $T[1..n]$.
     - Problem: on extremely repetitive texts, an entropy-compressed FM-index can be exponentially larger than $|\mathrm{LZ77}|$.
     - $r$ = number of runs in $\mathrm{BWT}(T)$. $r$ is a good measure of repetitiveness and can be exponentially smaller than $n$ on repetitive texts.
     - Goal: build LZ77 in $O(r)$ working space.

  4. Run-length compression of the BWT. T = "abcabbcaabcabcabbc". Sorted rotations of T#:

         #abcabbcaabcabcabbc
         aabcabcabbc#abcabbc
         abbc#abcabbcaabcabc
         abbcaabcabcabbc#abc
         abcabbc#abcabbcaabc
         abcabbcaabcabcabbc#
         abcabcabbc#abcabbca
         bbc#abcabbcaabcabca
         bbcaabcabcabbc#abca
         bc#abcabbcaabcabcab
         bcaabcabcabbc#abcab
         bcabbc#abcabbcaabca
         bcabbcaabcabcabbc#a
         bcabcabbc#abcabbcaa
         c#abcabbcaabcabcabb
         caabcabcabbc#abcabb
         cabbc#abcabbcaabcab
         cabbcaabcabcabbc#ab
         cabcabbc#abcabbcaab

     BWT(abcabbcaabcabcabbc) = ccccc#aaabbaaabbbbb
     $\mathrm{RLE}(\mathrm{BWT}(T)) = \mathrm{RLBWT}(T) = \langle 5, c \rangle \langle 1, \# \rangle \langle 3, a \rangle \langle 2, b \rangle \langle 3, a \rangle \langle 5, b \rangle$, so $r = 6$.
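
The construction on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: it sorts all rotations explicitly, which takes quadratic space and is nothing like the online, compressed-space construction the talk is about. The function names `bwt` and `run_length_encode` are chosen here for illustration and do not come from the slides.

```python
from itertools import groupby

def bwt(text, terminator="#"):
    """Burrows-Wheeler transform via explicitly sorted rotations (illustration only)."""
    s = text + terminator  # '#' sorts before lowercase letters in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def run_length_encode(s):
    """Collapse a string into (run length, character) pairs."""
    return [(len(list(group)), ch) for ch, group in groupby(s)]

T = "abcabbcaabcabcabbc"
L = bwt(T)
runs = run_length_encode(L)
print(L)          # ccccc#aaabbaaabbbbb
print(runs)       # [(5, 'c'), (1, '#'), (3, 'a'), (2, 'b'), (3, 'a'), (5, 'b')]
print(len(runs))  # 6, the value of r on the slide
```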

  5. Case study: highly repetitive text collections. Motivational example (from pizzachili.dcc.uchile.cl): all revisions (since 2001, text only) of en.wikipedia.org/wiki/Albert_Einstein.
     - Uncompressed: 456 MB
     - $z \approx 76 \cdot 10^3$ (number of LZ77 phrases): $z \log n + z \log \sigma \approx 310$ KB (7-Zip: 314 KB), a 1400x compression rate
     - $r \approx 290 \cdot 10^3$: $r \log(n/r) + r \log \sigma \approx 544$ KB, an 840x compression rate

  6. LZ77 in RLE space: overview of the algorithm.
     1. Build (online) $\mathrm{RLBWT}(\overleftarrow{T})$, the RLBWT of the reversed text.
     2. Forward-navigate $T$ using $\mathrm{RLBWT}(\overleftarrow{T})$ and:
        - keep, for each BWT run, only the two outermost SA samples;
        - search for the current LZ77 factor on $\mathrm{RLBWT}(\overleftarrow{T})$.
     It can be shown that this SA sampling is sufficient to locate at least one previous occurrence of each LZ factor.

  7. Example. T = bababaa. Sorted rotations of the reversed text (aababab) followed by #:

         #aababab
         aababab#
         ab#aabab
         abab#aab
         ababab#a
         b#aababa
         bab#aaba
         babab#aa

     Between horizontal lines: range of the current LZ phrase (here, the full range).
     Parsing rule 1/3: the current BWT range contains more than one run and none of the "b" inside the range is marked with an SA sample: new LZ phrase $\langle -, 0, b \rangle$.
     Sampling rule 1/2: always add an SA sample on the current position.

  8. Example. T = bababaa. LZ77(T) = $\langle -, 0, b \rangle$. Rotations with their SA samples:

         #aababab   0
         aababab#
         ab#aabab
         abab#aab
         ababab#a
         b#aababa
         bab#aaba
         babab#aa

     Parsing rule 1: new LZ phrase $\langle -, 0, a \rangle$.
     Sampling rule 1: add a new SA sample on the current position.

  9. Example. T = bababaa. LZ77(T) = $\langle -, 0, b \rangle \langle -, 0, a \rangle$. Rotations with their SA samples:

         #aababab   0
         aababab#
         ab#aabab
         abab#aab
         ababab#a
         b#aababa   1
         bab#aaba
         babab#aa

     Parsing rule 2/3: the current BWT range contains more than one run and there is a sampled "b" in the current range. occ = sample − length = 0 − 0 = 0. Update the BWT range ("b").
     Sampling rule 1: add a new SA sample on the current position.

  10. Example. T = bababaa. LZ77(T) = $\langle -, 0, b \rangle \langle -, 0, a \rangle$. Rotations with their SA samples:

          #aababab   0
          aababab#
          ab#aabab   2
          abab#aab
          ababab#a
          b#aababa   1
          bab#aaba
          babab#aa

      Parsing rule 3/3: the current range contains only one run. Keep the previous occ (= 0) and update the BWT range ("ab").
      Sampling rule 1: add a new SA sample on the current position.

  11. Example. T = bababaa. LZ77(T) = $\langle -, 0, b \rangle \langle -, 0, a \rangle$. Rotations with their SA samples:

          #aababab   0
          aababab#
          ab#aabab   2
          abab#aab
          ababab#a
          b#aababa   1
          bab#aaba   3
          babab#aa

      Parsing rule 2: occ = sample − length = 2 − 2 = 0. Update the BWT range ("bab").
      Sampling rule 1: add a new SA sample on the current position.

  12. Example. T = bababaa. LZ77(T) = $\langle -, 0, b \rangle \langle -, 0, a \rangle$. Rotations with their SA samples:

          #aababab   0
          aababab#
          ab#aabab   2
          abab#aab   4
          ababab#a
          b#aababa   1
          bab#aaba   3
          babab#aa

      Parsing rule 3: keep the previous occ (= 0) and update the BWT range ("abab").
      Sampling rule 1: add a new SA sample on the current position.
      Sampling rule 2/2: the current a-run now has 3 samples, so delete the sample in the middle (3).

  13. Example. T = bababaa. LZ77(T) = $\langle -, 0, b \rangle \langle -, 0, a \rangle$. Rotations with their SA samples:

          #aababab   0
          aababab#
          ab#aabab   2
          abab#aab   4
          ababab#a
          b#aababa   1
          bab#aaba   3 (deleted)
          babab#aa   5

      Parsing rule 1: output the new phrase $\langle \mathit{occ}, \mathit{length}, \mathit{current\_char} \rangle = \langle 0, 4, a \rangle$ and reset the BWT range.
      Sampling rule 1: add a new SA sample on the current position.
      Sampling rule 2: the current a-run now has 3 samples, so delete the sample in the middle (1).

  14. Example. T = bababaa. LZ77(T) = $\langle -, 0, b \rangle \langle -, 0, a \rangle \langle 0, 4, a \rangle$. Finish! Rotations with their SA samples:

          #aababab   0
          aababab#
          ab#aabab   2
          abab#aab   4
          ababab#a   6
          b#aababa   1 (deleted)
          bab#aaba
          babab#aa   5
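
To check the parse obtained in the example, a naive reference factorization is enough. The sketch below is not the RLBWT-based algorithm of the slides: it scans all earlier positions directly, in quadratic time and with the whole text in memory. It assumes the phrase format used in the example, $\langle \mathit{occ}, \mathit{length}, \mathit{char} \rangle$, with `None` standing in for the "−" of phrases that copy nothing. The names `lz77_naive` and `lz77_decode` are hypothetical.

```python
def lz77_naive(text):
    """Naive LZ77 factorization into (occ, length, next_char) triples.

    occ is the start of an earlier occurrence of the copied part (None when
    length is 0); the source may overlap the phrase being copied.
    """
    n = len(text)
    phrases = []
    i = 0
    while i < n:
        best_len, best_occ = 0, None
        for j in range(i):  # candidate earlier starting positions
            l = 0
            # Stop one character early so every phrase ends with an explicit character.
            while i + l < n - 1 and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:  # strict comparison keeps the leftmost longest match
                best_len, best_occ = l, j
        phrases.append((best_occ, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases

def lz77_decode(phrases):
    """Rebuild the text; copying one character at a time handles overlapping sources."""
    out = []
    for occ, length, ch in phrases:
        for k in range(length):
            out.append(out[occ + k])
        out.append(ch)
    return "".join(out)

parse = lz77_naive("bababaa")
print(parse)               # [(None, 0, 'b'), (None, 0, 'a'), (0, 4, 'a')]
print(lz77_decode(parse))  # bababaa
```

The last phrase copies four characters from position 0 while writing from position 2, so source and target overlap, which is why the decoder copies character by character.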

  15. Data structures. Important: at each stage, there are at most $2r$ SA samples.
      Result: LZ77 can be computed in $O(r)$ space (words) and $O(n \log r)$ time.
      Simon G. does not like big-O notation... $r\,(4 \log(n/r) + 2 \log n + \log \sigma)(1 + o(1))$ bits.

  16. Applications:
      - The lz-rlbwt index can be built in $O(r + z)$ words of space and $O(n \cdot (\log r + \log z))$ time.
      - Conversion between compressed formats and compressed indexes: see the next slide.

  17. Conversion between compressed formats.
      $\mathrm{LZ77} = \langle \pi_i, \lambda_i, c_i \rangle_{i = 1, \dots, z}$
      $\mathrm{RLBWT} = \langle \lambda_i, c_i \rangle_{i = 1, \dots, r}$
      Results:
      - We can compute RLBWT → LZ77 in $O(n \log r)$ time and $O(r)$ words of space.
      - We can compute LZ77 → RLBWT in $O(n (\log r + \log z))$ time and $O(r + z)$ words of space (not discussed here).

  18. The DYNAMIC library: github.com/nicolaprezza/DYNAMIC

  19. Theoretical bounds, two examples:
      - SPSI: sequence $s_1, \dots, s_m$ of total sum $M$. Space: $\approx 1.3 \cdot m \cdot \log(M/m)$ bits. $O(\log m)$-time sum, search, update, and insert.
      - Run-length encoded string with $r$ runs. Space: $\approx r \cdot (1.1 \cdot \log|\Sigma| + 2.6 \cdot \log(n/r))$ bits. $O(\log r)$-time rank, select, access, and insert.
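
To make the four SPSI operations concrete, here is a naive list-backed reference. It is a sketch of the interface only: every operation costs $O(m)$ time instead of the $O(\log m)$ quoted above, and it is not the DYNAMIC library's implementation or API. The class name `NaiveSPSI`, the 0-based indexing, and the exact semantics (in particular, `search(x)` returning the smallest index whose prefix sum is at least `x`) are assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate

class NaiveSPSI:
    """Searchable partial sums with inserts over non-negative integers (naive reference)."""

    def __init__(self, values=()):
        self.values = list(values)

    def sum(self, i):
        """Partial sum values[0] + ... + values[i]."""
        return sum(self.values[: i + 1])

    def search(self, x):
        """Smallest index i such that sum(i) >= x (assumed semantics)."""
        prefix = list(accumulate(self.values))  # non-decreasing, so bisect applies
        i = bisect_left(prefix, x)
        if i == len(prefix):
            raise ValueError("x exceeds the total sum")
        return i

    def update(self, i, delta):
        """Add delta to the i-th element."""
        self.values[i] += delta

    def insert(self, i, value):
        """Insert value before position i."""
        self.values.insert(i, value)

# Run lengths and run heads of ccccc#aaabbaaabbbbb
run_lengths = NaiveSPSI([5, 1, 3, 2, 3, 5])
heads = "c#abab"
print(run_lengths.sum(5))             # 19, the length of the string
print(heads[run_lengths.search(6)])   # '#': the 6th character (1-based) lies in run 1
print(heads[run_lengths.search(7)])   # 'a'
```

Storing run lengths in such a structure is one way a run-length encoded string can answer access queries: the run containing a position is found with a single search.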

  20. LZ77 construction algorithms: benchmark datasets.

          File          Size (MB)   7-Zip-compressed size (MB)   Rate
          cere             440.0          8.10                   0.0184
          para             410.0          9.80                   0.0239
          influenzae       148.0          2.50                   0.0169
          escherichia      108.0          7.10                   0.0657
          sdsl            1024.0          0.60                   0.0006
          samtools        1024.0          1.20                   0.0012
          boost           1024.0          0.20                   0.0002
          bwa              419.0          0.38                   0.0009
          Einstein        1024.0          1.60                   0.0016
          earth           1024.0          1.70                   0.0017
          Bush            1024.0          1.90                   0.0019
          wikipedia       1024.0          2.40                   0.0023

  21. Some results. [Figure: RAM ($\log_{10}$ MB) against time ($\log_{10}$ s), one panel per dataset: cere, para, influenzae, escherichia, sdsl, samtools, boost, bwa, einstein, earth, bush, wikipedia. Legend: ISA6r, KKP1s, LZscan, h0-lz77, rle-lz77-1, rle-lz77-2, plain size, 7-zip.]

  22. lz-rlbwt: implementation. C++ implementations of the lz-rlbwt index (using SDSL):
      - https://github.com/nicolaprezza/lz-rlbwt
      - https://github.com/nicolaprezza/lz-rlbwt-sparse

  23. Variants. We propose 3 variants of the index: full, bidirectional, sparse.
      RLBWT sparsification: we implemented the RLBWT (SDSL) using sparsification on the gap-encoded bitvectors: $(1 + \epsilon)\, r \log(n/r) + r \log \sigma$ bits of space ⇒ half of the space of the RLCSA.

  24. Full index: $\mathrm{RLBWT}(T)$, $\mathrm{RLBWT}(\overleftarrow{T})$, 4-sided and 2-sided range structures, subset of suffix-tree nodes.
      Theorem. The lz-rlbwt-f index takes $\left(6 z \log n + (2 + \epsilon)\, r \log(n/r) + 2 r \log \sigma\right) \cdot (1 + o(1))$ bits of space and supports:
      - Count in $O(m \cdot (\log(n/r) + \log \sigma))$ time
      - Locate in $O((m + occ) \cdot \log n)$ time
      for any constant $0 < \epsilon \le 1$ (RLBWT sparsification).
