in place longest common extensions

In-place Longest Common Extensions Nicola Prezza University of - PowerPoint PPT Presentation



  1. In-place Longest Common Extensions. Nicola Prezza, University of Udine, Department of Computer Science. Dagstuhl Seminar 16431: "Computation over Compressed Structured Data". Outline: Overview; Monte Carlo LCE structure; Deterministic data structure.

  2. Longest Common Extension queries. LCE(i, j) is the length of the longest common prefix of the suffixes of T starting at positions i and j.
     Example, with positions numbered 0 to 9: T = a a b a b a b a a b, LCE(1, 5) = 3.
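To make the definition concrete, here is a minimal sketch of a naive LCE query that scans the text directly, which corresponds to the "store only T" baseline with O(ℓ) query time. The function name `lce_naive` is an illustrative choice, not something taken from the slides.

```python
def lce_naive(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    length = 0
    # Advance while both positions are inside the text and the characters agree.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


T = "aabababaab"
print(lce_naive(T, 1, 5))  # 3, as in the example on the slide
```

The scan touches ℓ + 1 characters at most, which is why the structures discussed in the talk aim at query times that are polylogarithmic in ℓ.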

  3. State of the art

     Space (bits)            Query time    Build time         Reference
     O(n log n)              O(1)          O(n)               ST + LCA
     O(n log n)              O(1)          O(n)               RMQ + LCP
     n⌈log₂ σ⌉ + O(nw/τ)     O(τ)          O(n^(2+ε))         [Bille2015]
     n⌈log₂ σ⌉ + O(nw/τ)     O(τ)          O(n^(3/2)) exp.    [Bille2015]
     n⌈log₂ σ⌉ + O(nw/τ)     O(τ log τ)    O(nτ)              [Tanimura2016]
     n⌈log₂ σ⌉               O(ℓ)          -                  store only T

     Here ℓ = LCE(i, j).

  4. Result presented. A deterministic data structure of size n⌈log₂ σ⌉ bits supporting:
     optimal O(m log σ / w)-time extraction of T[i, ..., i + m − 1];
     O(log² ℓ)-time LCE queries.
     Construction: O(n log n) expected time and O(n) words of space.
     In-place data structure: no little-o terms.
     LCE query time improvable to O(log ℓ) using O(log n) words of additional space.

  5. Applications. In-place algorithms to:
     compute the suffix array in O(n log² n) expected time (exact);
     compute the LCP array in O(n log² n) expected time (exact);
     perform sparse suffix sorting (Monte Carlo).

  6. Steps.
     Replace the text with the Karp-Rabin fingerprints of a subset of its prefixes.
     Choose the modulus q at random, in such a way that the fingerprints can be statistically compressed down to n⌈log₂ σ⌉ bits.
     De-randomize.
     For simplicity, only the binary case σ = 2 is considered here. The extension to σ ∈ O(w) is easy.

  7. (1) Choose a block size τ ∈ Θ(w).
     (2) Choose a random τ-bit prime q (the modulus of the KR function).
     (3) Choose a uniform seed s̄ ∈ [0, q).
     (4) Left-pad T with s̄.
     (5) Break the text into τ-bit blocks: array B[1, ..., n/τ] of τ-bit integers.
     Example: τ = 5, q = 10001 (= 17), s̄ = 00101
       B = 00101 01011 11010 10101 11010 00001
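The padding and blocking steps can be sketched as follows. This is only an illustration under simplifying assumptions: the text is given as a Python string of '0'/'1' characters, the padded length is assumed to be a multiple of τ, and the random choice of q and of the seed is left out (the seed is passed in). The name `make_blocks` is hypothetical.

```python
def make_blocks(text_bits: str, seed_bits: str, tau: int) -> list[int]:
    """Left-pad the binary text with the seed and cut it into tau-bit integers."""
    if len(seed_bits) != tau:
        raise ValueError("the seed must be exactly tau bits long")
    padded = seed_bits + text_bits
    if len(padded) % tau != 0:
        raise ValueError("this sketch assumes the padded length is a multiple of tau")
    # Each slice of tau characters is read as a binary number.
    return [int(padded[k:k + tau], 2) for k in range(0, len(padded), tau)]


# Text and seed taken from the example: the seed 00101 becomes the first block.
blocks = make_blocks("0101111010101011101000001", "00101", 5)
print([format(b, "05b") for b in blocks])
```

With the example values this reproduces the six blocks listed for B.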

  8. Build the array P′ of Karp-Rabin fingerprints of the prefixes ending at block boundaries, and add a bitvector D[1, ..., n/τ] marking P′[i] ≥ q.
     Example: τ = 5, q = 10001 (= 17)
       B  = 00101 01011 11010 10101 11010 00001
       P′ = 01101 10010 01110 10101 00101 01011
       D  = 0     1     0     1     0     0
     Property 1. With P′ and D we can recover B (and therefore T):
       (1) if B[i] < q, then B[i] = (P′[i] − 2^τ · P′[i − 1]) mod q;
       (2) if B[i] ≥ q, then B[i] mod q = B[i] − q, so add q to the value in (1).
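A small sketch of the fingerprint recurrence and of the recovery rule in Property 1 may help. It rests on several assumptions: P′[0] is taken to be 0, the fingerprint of a prefix is its integer value modulo q, and the bitvector marks the blocks with B[i] ≥ q, which is the case distinction Property 1 relies on. The function names are hypothetical, and the numbers this sketch computes follow the recurrence directly; they are not meant to reproduce the bit patterns printed in the example.

```python
def build_fingerprints(blocks: list[int], q: int, tau: int) -> tuple[list[int], list[int]]:
    """Fingerprints of the prefixes ending at block boundaries, plus the marks."""
    fingerprints = []
    prev = 0  # assumed fingerprint of the empty prefix
    for b in blocks:
        # Appending a tau-bit block shifts the prefix by tau bits, then reduces mod q.
        prev = (prev * (1 << tau) + b) % q
        fingerprints.append(prev)
    marks = [1 if b >= q else 0 for b in blocks]
    return fingerprints, marks


def recover_blocks(fingerprints: list[int], marks: list[int], q: int, tau: int) -> list[int]:
    """Invert the construction: case (1) of Property 1, plus q when the mark is set."""
    blocks = []
    prev = 0
    for fp, mark in zip(fingerprints, marks):
        value = (fp - (1 << tau) * prev) % q  # equals B[i] mod q
        blocks.append(value + q if mark else value)
        prev = fp
    return blocks


B = [0b00101, 0b01011, 0b11010, 0b10101, 0b11010, 0b00001]
fps, marks = build_fingerprints(B, q=17, tau=5)
print(recover_blocks(fps, marks, q=17, tau=5) == B)  # True
```

Adding q is enough in case (2) because q is a τ-bit number, so every block value is smaller than 2q.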

  9. P′ and D take n + n/τ bits of space and support:
     optimal-time text extraction;
     computation of the Karp-Rabin fingerprint of any text substring, hence LCE queries in O(log ℓ) steps of exponential + binary search (O(log² ℓ) total time, because we need to compute powers of 2 mod q).
     Can we reduce the space to n bits?
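The exponential + binary search can be sketched as below. To keep the example short it makes simplifying assumptions that differ from the structure in the talk: a prefix fingerprint is stored for every position rather than only at block boundaries, the modulus is a fixed Mersenne prime instead of a random prime, and the binary phase compares prefixes of arbitrary length. All names are hypothetical.

```python
Q = (1 << 61) - 1  # fixed prime modulus, an assumption of this sketch


def prefix_fingerprints(bits: str) -> list[int]:
    """fp[k] is the integer value of bits[:k] modulo Q."""
    fp = [0]
    for ch in bits:
        fp.append((2 * fp[-1] + int(ch)) % Q)
    return fp


def substring_fp(fp: list[int], start: int, length: int) -> int:
    """Fingerprint of bits[start:start+length], from two prefix fingerprints."""
    # The modular power of 2 is the extra cost mentioned on the slide.
    return (fp[start + length] - fp[start] * pow(2, length, Q)) % Q


def lce(fp: list[int], n: int, i: int, j: int) -> int:
    """LCE by exponential search followed by binary search on fingerprints."""
    limit = n - max(i, j)  # longest extension that stays inside the text

    def matches(length: int) -> bool:
        return substring_fp(fp, i, length) == substring_fp(fp, j, length)

    lo, hi = 0, 1
    while hi <= limit and matches(hi):  # exponential phase: double the length
        lo, hi = hi, 2 * hi
    hi = min(hi, limit + 1)
    while hi - lo > 1:  # binary phase: lo always matches, hi never does
        mid = (lo + hi) // 2
        if matches(mid):
            lo = mid
        else:
            hi = mid
    return lo


bits = "0010101001"  # the example text with a -> 0 and b -> 1
print(lce(prefix_fingerprints(bits), len(bits), 1, 5))  # 3
```

Both phases take O(log ℓ) fingerprint comparisons. Equal fingerprints only imply equal substrings with high probability, which is what makes this variant Monte Carlo.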

  10. Idea: pick q in such a way that few P′[i] start with a 1.
      Property: each P′[i] is a uniform number in [0, q) (thanks to the seed).
      Combinations of block values with τ = 4, q = 1011 (= 11):
        0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 (= q) 1100 1101 1110 1111
      P(P′[i] begins with 1) = red / (red + black) = (q − 2^(τ−1)) / q,
      where red counts the values below q that begin with 1 (here 1000, 1001, 1010) and black counts those that begin with 0 (0000 to 0111).

  11. Goal: few P′[i] starting with 1. Solve (q − 2^(τ−1)) / q ≤ 1/n, which gives q ≤ 2^(τ−1) · n / (n − 1).
      Result: pick q uniformly from Z = [2^(τ−1), 2^(τ−1) · n / (n − 1)].
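The bound on q can be checked numerically with exact fractions. The sketch below is illustrative only: the helper names are made up, n ≥ 2 is assumed, and the upper end of the interval is rounded down because q is an integer.

```python
from fractions import Fraction


def leading_one_probability(q: int, tau: int) -> Fraction:
    """Fraction of the values in [0, q) whose tau-bit encoding starts with 1."""
    half = 1 << (tau - 1)
    return Fraction(max(q - half, 0), q)


def q_upper_bound(n: int, tau: int) -> int:
    """Largest integer q with (q - 2^(tau-1)) / q <= 1/n, assuming n >= 2."""
    return (1 << (tau - 1)) * n // (n - 1)


print(leading_one_probability(11, 4))  # 3/11, the case tau = 4, q = 11
print(q_upper_bound(4, 4))             # 10
```

For n = 4 and τ = 4, the value q = 10 gives probability 1/5 ≤ 1/4, while q = 11 gives 3/11 > 1/4, which is consistent with the inequality being tight at the upper end of the interval.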

  12. Final step: build the array P by removing the first bit from each P′[i], and store the ranks of the P′-blocks starting with 1 in an array S.
      Example: τ = 5
        P = 1101 0010 1110 0101 0101 1011
        D = 0 1 0 1 0 0
        S = {2, 4}
      E[|S| + |P| + |D|] = n + O(w) bits.
      Construction: pick pairs ⟨q, s̄⟩ until the overall size is n bits (+ O(1) words), which gives O(n) expected construction time.
      τ = (8 + c)w for any constant c (see why in the paper).
      LCE failure probability ≤ n^(−c) (proof in the paper).
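The bit-stripping step is easy to illustrate on the example values. The sketch below represents each P′ block as a string of bits for readability; a real implementation would pack the bits. The function names are hypothetical.

```python
def split_fingerprints(fp_bits: list[str]) -> tuple[list[str], list[int]]:
    """Drop the leading bit of every block and record which blocks had a leading 1."""
    p = [block[1:] for block in fp_bits]
    # Ranks are 1-based, as on the slide.
    s = [rank for rank, block in enumerate(fp_bits, start=1) if block[0] == "1"]
    return p, s


def restore_fingerprints(p: list[str], s: list[int]) -> list[str]:
    """Inverse operation: put the leading bit back using the ranks stored in S."""
    marked = set(s)
    return [("1" if rank in marked else "0") + block
            for rank, block in enumerate(p, start=1)]


P_prime = ["01101", "10010", "01110", "10101", "00101", "01011"]
P, S = split_fingerprints(P_prime)
print(P, S)
```

Each block shrinks from τ to τ − 1 bits, so P and D together take n bits, and S is small whenever few blocks start with 1.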

  13. In-place construction. We can replace T with our structure in O(n) expected time while using only O(1) additional words of working space. The construction can be inverted in the same space and time (restoring the text).

  14. Applications: suffix sorting. It is easy to compare two text suffixes lexicographically using LCE queries.
      Result 1 (in-place sparse suffix sorting): any set S = {i_1, ..., i_b} of b suffixes of a text T ∈ Σ^n can be sorted correctly with high probability in O(n + b log b · log² n) expected time using O(1) words of space on top of T and S.
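The reduction from suffix comparison to LCE queries can be sketched as follows: after skipping the common extension, a single character comparison decides the order. This is an illustration only. It uses a naive character-scanning LCE in place of the fingerprint structure, and Python's `sorted`, which allocates a new list and is therefore not in-place. All function names are hypothetical.

```python
from functools import cmp_to_key


def lce_naive(text: str, i: int, j: int) -> int:
    """Stand-in LCE oracle; the talk's structure would answer this query instead."""
    length = 0
    while max(i, j) + length < len(text) and text[i + length] == text[j + length]:
        length += 1
    return length


def compare_suffixes(text: str, i: int, j: int) -> int:
    """Negative, zero or positive as text[i:] is smaller than, equal to or larger than text[j:]."""
    if i == j:
        return 0
    ell = lce_naive(text, i, j)
    a, b = i + ell, j + ell
    if a == len(text):  # suffix i is a proper prefix of suffix j
        return -1
    if b == len(text):  # suffix j is a proper prefix of suffix i
        return 1
    return -1 if text[a] < text[b] else 1


def sort_suffixes(text: str, positions: list[int]) -> list[int]:
    """Sort a sparse set of suffix start positions with O(b log b) comparisons."""
    return sorted(positions, key=cmp_to_key(lambda i, j: compare_suffixes(text, i, j)))


print(sort_suffixes("aabababaab", [1, 5, 7]))  # [7, 5, 1]
```

Each comparison costs one LCE query plus O(1) work, which is where the b log b · log² n term in the stated bound comes from.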

  15. Important: while computing LCE queries, in the exponential/binary searches we only compare (fingerprints of) text substrings of length 2^e.
      Theorem 1. In O(n log n) expected time and O(n) words of space we can check whether the KR function is collision-free over all pairs of substrings of T having the same length k = 2^e, for all 0 ≤ e ≤ log₂ n.
      Theorem 2. In O(n log² n) worst-case time and n words of space (on top of T) we can check whether the KR function is collision-free over all pairs of substrings of T having the same length k = 2^e, for all 0 ≤ e ≤ log₂ n.
      ⇒ our deterministic structure can be built in O(n log n) expected time and linear space.
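What "collision-free over all pairs of substrings of length 2^e" means can be shown with a brute-force check. The sketch below is not the algorithm behind Theorem 1 or Theorem 2: it stores the substrings themselves in a dictionary, so it uses far more than O(n) words, and it assumes a binary text given as a string with the fingerprint defined as the substring's integer value modulo q. The function name is hypothetical.

```python
def is_collision_free(bits: str, q: int) -> bool:
    """True if no two distinct substrings of equal power-of-two length share a fingerprint."""
    n = len(bits)
    fp = [0]
    for ch in bits:
        fp.append((2 * fp[-1] + int(ch)) % q)
    length = 1
    while length <= n:
        shift = pow(2, length, q)
        seen = {}  # fingerprint -> first substring observed with that fingerprint
        for start in range(n - length + 1):
            value = (fp[start + length] - fp[start] * shift) % q
            substring = bits[start:start + length]
            if seen.setdefault(value, substring) != substring:
                return False  # two different substrings collide
        length *= 2
    return True


bits = "0010101001"
print(is_collision_free(bits, (1 << 61) - 1))  # True
print(is_collision_free(bits, 3))              # False: 0010 and 0101 collide mod 3
```

Once a modulus passes such a check, every comparison made by the exponential/binary search is exact, which is what turns the Monte Carlo structure into a deterministic one.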

  16. Applications.
      In-place SA construction: the suffix array SA of T ∈ Σ^n can be computed in O(n log² n) expected time using O(1) words of space on top of T and SA. This does not improve the state of the art [Franceschini2007]. The following does.
      In-place LCP construction: the Longest Common Prefix (LCP) array can be computed in O(n log² n) expected time using O(1) words of space on top of the text and the LCP array. The previous fastest in-place LCP array construction algorithm runs in quadratic time.
