Sparse Suffix Tree Construction in Small Space


  1. Sparse Suffix Tree Construction in Small Space. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj (Technical University of Denmark); Johannes Fischer (Karlsruhe Institute of Technology); Tsvi Kopelowitz (Weizmann Institute of Science); Benjamin Sach (University of Warwick)

  2-6. The sparse suffix array (SSA). Text T = bananas, of length n = 7. Its suffixes, listed by starting position:
       1  bananas
       2  ananas
       3  nanas
       4  anas
       5  nas
       6  as
       7  s
     Sort the suffixes lexicographically.

  7-15. The sparse suffix array (SSA). The suffixes of T = bananas (length n), sorted lexicographically:
       2  ananas
       4  anas
       6  as
       1  bananas
       3  nanas
       5  nas
       7  s
     Suffix Array: 2 4 6 1 3 5 7
     • Can be built in O(n) time and O(n) extra space
     • What if we only care about a few of the suffixes? In the figure, b of them are highlighted (ananas, nas and as), with the label O(b).

  16-18. The sparse suffix array (SSA). For T = bananas (length n), the Suffix Array is 2 4 6 1 3 5 7. For the b highlighted suffixes ananas (2), as (6) and nas (5), the Sparse Suffix Array is 2 6 5: it has b entries and takes O(b) space.
     • The suffix array can be built in O(n) time and O(n) extra space
     • What if we only care about a few of the suffixes? The sparse text indexing problem has been open since the 1960s, with first, partial results from 1996 onwards.
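To make the bananas example concrete, here is a minimal Python sketch that reproduces both arrays by plain sorting. It is only an illustration: the function names are made up for this example, positions are 1-based to match the slides, and the naive sort uses far more time and space than the bounds discussed in the talk.

```python
def suffix_array(text):
    """Naive suffix array: sort all 1-based start positions by their suffix."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def sparse_suffix_array(text, positions):
    """Naive sparse suffix array: sort only the chosen 1-based positions."""
    return sorted(positions, key=lambda i: text[i - 1:])


T = "bananas"
print(suffix_array(T))                    # [2, 4, 6, 1, 3, 5, 7]
print(sparse_suffix_array(T, [2, 5, 6]))  # [2, 6, 5]
```

Materialising every suffix as a Python string is what makes this sketch expensive; the point of the talk is to get the second array using only O(b) extra space.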

  19. The sparse suffix array (SSA). For T = bananas (length n), the Suffix Array is 2 4 6 1 3 5 7 and the Sparse Suffix Array of the b highlighted suffixes is 2 6 5. The sparse suffix array can be built in:
     • O(n log^2 b) time (Monte-Carlo)
     • O((n + b^2) log^2 b) time with high probability (Las-Vegas)
     • both in O(b) extra space

  20. The sparse suffix tree (SST). [Figure: suffix tree for T = bananas (length n), with edge labels including a, na, nas, s and bananas; the tree for the b highlighted suffixes is labelled O(b).]
     • O(n log^2 b) time (Monte-Carlo)
     • O((n + b^2) log^2 b) time with high probability (Las-Vegas)
     • both in O(b) space
     Conversion between SSA and SST is simple and takes O(n log b) time.

  21-27. LCPs - a fundamental tool for string algorithms. [Figure: a text T of length n over the letters a, b, c, with two positions i and j marked; the example pairs shown have LCP values 3 and 4.] For any (i, j), the longest common prefix (LCP) is the largest ℓ such that T[i ... i+ℓ−1] = T[j ... j+ℓ−1]; it's the furthest you can go before hitting a mismatch.
     • LCP data structures are typically based on the suffix array or suffix tree.
     • We do the opposite - we use batched LCP queries to construct the sparse suffix array.
     • These LCP queries will be answered using Karp-Rabin fingerprints to ensure that the space remains small.
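As a small illustration of the LCP definition, the following Python sketch scans both suffixes until the first mismatch. It assumes 0-based positions and reuses the text bananas from the earlier slides, since the example text in this figure is not recoverable from the transcript; the function name is made up for this example.

```python
def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:] (0-based)."""
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length


T = "bananas"
print(lcp(T, 1, 3))  # "ananas" vs "anas"  -> 3
print(lcp(T, 2, 4))  # "nanas"  vs "nas"   -> 2
```

A single query done this way costs time proportional to the answer, which is why the talk turns to fingerprints and batching.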

  28-30. Karp-Rabin fingerprints of strings. Example string: S = abaccbabcb.
     φ(S) = Σ_{k=0}^{|S|−1} S[k] · r^k mod p
     Here p = Θ(n^4) is a prime and 1 ≤ r < p is a random integer.
     • S1 = S2 iff φ(S1) = φ(S2), with high probability.
     • Observe that φ(S) fits in an O(log n) bit word.
     • Given φ(S[0, ℓ]) and φ(S[0, r]) we can compute φ(S[ℓ+1, r]) in O(1) time.
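The fingerprint formula and the substring rule can be sketched in Python as follows. Several things here are assumptions made for the illustration and are not from the slides: characters are mapped to integers with ord, a fixed Mersenne prime stands in for a prime p = Θ(n^4), the random integer r is called base (to avoid clashing with the index r), and the indices are called lo and hi. The subtraction works because φ(S[0, hi]) − φ(S[0, lo]) equals φ(S[lo+1, hi]) multiplied by base^(lo+1), so multiplying by the inverse power recovers it.

```python
import random

P = (1 << 61) - 1  # assumption: fixed Mersenne prime instead of a prime p = Theta(n^4)


def fingerprint(s, base, p=P):
    """phi(s) = sum over k of ord(s[k]) * base^k, taken mod p."""
    value, power = 0, 1
    for ch in s:
        value = (value + ord(ch) * power) % p
        power = power * base % p
    return value


def substring_fingerprint(phi_lo, phi_hi, lo, base, p=P):
    """From phi(S[0..lo]) and phi(S[0..hi]) (inclusive, lo < hi), get phi(S[lo+1..hi])."""
    inverse_power = pow(base, -(lo + 1), p)  # modular inverse power, Python 3.8+
    return (phi_hi - phi_lo) * inverse_power % p


base = random.randrange(1, P)
S = "abaccbabcb"
lo, hi = 2, 6
derived = substring_fingerprint(
    fingerprint(S[:lo + 1], base), fingerprint(S[:hi + 1], base), lo, base)
print(derived == fingerprint(S[lo + 1:hi + 1], base))  # True
```

Here pow computes the inverse power on the fly; the O(1) bound on the slide presumably relies on the required power of r already being available.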

  31. Simple, Monte-Carlo batched LCP queries.
     Input: a string T of length n and b pairs (i, j).
     Output: for each pair (i, j), the largest ℓ such that T[i ... i+ℓ−1] = T[j ... j+ℓ−1].
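The input and output of this problem can be illustrated with the Python sketch below. It is not the small-space algorithm from the talk: as a simplification it stores all n prefix fingerprints (O(n) extra space) and binary searches each query independently. It only shows the Monte-Carlo flavour, where equality of substrings is decided by comparing fingerprints. The prime, the function name, the seed parameter and the 0-based positions are assumptions of this example.

```python
import random

P = (1 << 61) - 1  # assumption: fixed Mersenne prime for this sketch


def batched_lcp(text, pairs, seed=None):
    """Answer every (i, j) query (0-based) by binary search on fingerprint equality.

    Monte-Carlo: a fingerprint collision could make an answer too large.
    """
    base = random.Random(seed).randrange(1, P)
    n = len(text)
    # prefix[m] = fingerprint of text[:m]; powers[m] = base^m mod P
    prefix, powers = [0] * (n + 1), [1] * (n + 1)
    for m, ch in enumerate(text):
        prefix[m + 1] = (prefix[m] + ord(ch) * powers[m]) % P
        powers[m + 1] = powers[m] * base % P

    def same(i, j, length):
        # Compare text[i:i+length] with text[j:j+length]; both sides are
        # scaled to the common factor base^(i+j), so no inverse is needed.
        left = (prefix[i + length] - prefix[i]) * powers[j] % P
        right = (prefix[j + length] - prefix[j]) * powers[i] % P
        return left == right

    answers = []
    for i, j in pairs:
        lo, hi = 0, n - max(i, j)  # the first lo characters are known to match
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if same(i, j, mid):
                lo = mid
            else:
                hi = mid - 1
        answers.append(lo)
    return answers


# Expected [3, 2, 0] unless a fingerprint collision occurs.
print(batched_lcp("bananas", [(1, 3), (2, 4), (0, 6)], seed=1))
```

Each query costs O(log n) fingerprint comparisons here; the memory for the prefix table is exactly what a small-space method has to avoid.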
