


  1. Tries and Suffix Trees Inge Li Gørtz

  2. String indexing problem • String matching problem. Given strings T (text) and P (pattern) over an alphabet Σ, report the starting positions of all occurrences of P in T. Here n = |T| and m = |P|. • Finite automaton: O(m|Σ| + n) time and space • KMP: O(m + n) time and space • String indexing problem. Given a string S of characters from an alphabet Σ, preprocess S into a data structure to support • Search(P): Return the starting positions of all occurrences of P in S. • Today: Data structure using O(n) space and supporting Search(P) in O(m) time, plus the time to report the occurrences. • Applications: • Search engines, e.g. prefix searches. • Finding common substrings of many biological strings • Finding repeating substructures in biological strings • Detecting DNA contamination
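
As a point of reference for the O(m + n) bound, here is a minimal Python sketch of KMP matching. The function name `kmp_search` and the choice to report 1-based positions (to match the leaf labels used for banana$ later in the slides) are assumptions made for this illustration; they do not come from the slides.

```python
def kmp_search(T, P):
    """Return the 1-based starting positions of all occurrences of P in T."""
    n, m = len(T), len(P)
    if m == 0:
        return []  # convention chosen here: the empty pattern reports nothing

    # fail[i] = length of the longest proper prefix of P[:i+1]
    # that is also a suffix of P[:i+1].
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and P[i] != P[k]:
            k = fail[k - 1]
        if P[i] == P[k]:
            k += 1
        fail[i] = k

    # Scan T once; k is the number of pattern characters currently matched.
    positions = []
    k = 0
    for i in range(n):
        while k > 0 and T[i] != P[k]:
            k = fail[k - 1]
        if T[i] == P[k]:
            k += 1
        if k == m:
            positions.append(i - m + 2)  # 0-based start i-m+1, reported 1-based
            k = fail[k - 1]              # continue, allowing overlapping matches
    return positions


print(kmp_search("banana", "ana"))  # [2, 4]
```

KMP preprocesses the pattern, so every new query rescans the whole text. The indexing problem instead preprocesses the text once so that each query depends only on m and the number of occurrences.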

  3. Outline • Tries • Compressed tries • Suffix trees • Applications of suffix trees

  4. Tries

  5. Tries • Text retrieval • Trie over the strings: sells, by, the, sea, shells, tea. [Figure: the trie, with leaves labeled S1 to S6.]

  6. Tries • Text retrieval • Prefix-free? • Trie over the strings: sells, by, the, sea, shells, tea, she. [Figure: the trie, with leaves labeled S1 to S6; "she" is a prefix of "shells" and has no leaf of its own.]

  7. Tries • Text retrieval • Prefix-free? • Trie over the strings: sells, by, the, sea, shells, tea, she. [Figure: the trie with a terminator $ appended to every string, with leaves labeled S1 to S7.]

  8. Tries • Text retrieval • Search for “sea” • Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$. [Figure: the trie, with leaves labeled S1 to S7; the search follows the path s, e, a, $ from the root.]

  20. Tries • Text retrieval • Search for “short” • Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$. [Figure: the trie, with leaves labeled S1 to S7; the search follows s, h and then stops because no edge is labeled o.]

  25. Tries • Build a trie over the strings: by$, sells$, sea$. [Figure: the resulting trie, with leaves S2 (by$), S4 (sea$) and S1 (sells$).]
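
The build and search steps walked through on the preceding slides can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the names `TrieNode`, `build_trie` and `search` are hypothetical, and children are stored in a dictionary per node.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.label = None    # set on leaves only, e.g. "S2"


def build_trie(strings):
    """Build a trie from (label, text) pairs, where each text ends in '$'."""
    root = TrieNode()
    for label, text in strings:
        node = root
        for ch in text:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.label = label   # the node reached by '$' is a leaf
    return root


def search(root, word):
    """Return the leaf label for word, or None if word is not stored."""
    node = root
    for ch in word + "$":    # the terminator makes the search exact
        node = node.children.get(ch)
        if node is None:
            return None
    return node.label


trie = build_trie([("S2", "by$"), ("S1", "sells$"), ("S4", "sea$")])
print(search(trie, "sea"))    # S4
print(search(trie, "short"))  # None
print(search(trie, "se"))     # None: "se" is only a prefix of stored strings
```

Appending `$` during the search is what distinguishes a stored word from a mere prefix of one.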

  26. Trie • Properties of the trie. A trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties: • How many children can a node have? • How many leaves does T have? • What is the height of T? • What is the number of nodes in T?

  27. Trie • Search time: O(d) in each node => O(dm). • O(m) if d is constant. • If d is not constant: store the children of each node in a dictionary • Hashing: O(1) expected time per node • Balanced BST: O(log d) time per node • Time and space for a trie (for small/constant d): • O(m) for searching for a string of length m. • O(n) space. • Preprocessing: O(n)

  28. Tries • Prefix search: return all words in the trie starting with “se” [Figure: the trie, with leaves labeled S1 to S7; the search follows s, e and reports the words in the subtree below.]

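  31. Trie • Time for prefix search: O(m) + time to report all occurrences. • The reporting time could be large: the subtree below the final node can contain many more nodes than there are occurrences. • Solution: compact tries.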
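
A minimal Python sketch of prefix search on an uncompacted trie, to make the two cost terms visible: an O(m) walk down, followed by a traversal of the whole subtree. The nested-dictionary representation and the names `build_trie` and `prefix_search` are assumptions chosen for brevity.

```python
END = "$"  # terminator, as in the slides


def build_trie(words):
    """Nested-dict trie: each node maps a character to its child node."""
    root = {}
    for w in words:
        node = root
        for ch in w + END:
            node = node.setdefault(ch, {})
    return root


def prefix_search(root, prefix):
    """Return all stored words that start with prefix, in sorted order."""
    node = root
    for ch in prefix:            # O(m) walk down to the node for the prefix
        if ch not in node:
            return []
        node = node[ch]

    results = []
    stack = [(node, prefix)]
    while stack:                 # visits every node in the subtree below
        cur, text = stack.pop()
        for ch, child in cur.items():
            if ch == END:
                results.append(text)
            else:
                stack.append((child, text + ch))
    return sorted(results)


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
trie = build_trie(words)
print(prefix_search(trie, "se"))  # ['sea', 'sells']
print(prefix_search(trie, "sh"))  # ['she', 'shells']
```

The traversal touches every node of the subtree, including long single-child chains, which is the cost that compaction removes.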

  32. Compact tries

  33. Tries • Compact trie: Chains of nodes with a single child are merged into a single edge. [Figure: the uncompacted trie, with leaves labeled S1 to S7.]

  34. Tries • Compact trie: Chains of nodes with a single child are merged into a single edge. [Figure: the trie before and after compaction, with leaves labeled S1 to S7.]

  35. Trie • Properties of the compact trie. A compact trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties: • Every internal node of T has at least 2 and at most d children. • T has s leaves. • The number of nodes in T is < 2s. • Time and space for a compact trie (constant d): • O(m) for searching for a string of length m. • O(m + occ) for prefix search, where occ = #occurrences. • O(s) space for the tree itself, in addition to the stored strings (each edge label can be kept as a reference into one of the strings). • Preprocessing: O(n)
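
The compaction step and the node bound can be checked on the running example with a short Python sketch. It builds the plain trie first and then merges chains, which is only one possible approach. The helper names are hypothetical, and edge labels are stored as explicit strings rather than as references into the input, so this sketch does not achieve the space bound stated above.

```python
def build_trie(words, end="$"):
    """Nested-dict trie: each node maps a character to its child node."""
    root = {}
    for w in words:
        node = root
        for ch in w + end:
            node = node.setdefault(ch, {})
    return root


def compact(node):
    """Return a copy of the trie in which every chain of single-child nodes
    is merged into one edge labeled with the concatenated characters."""
    result = {}
    for label, child in node.items():
        # Extend the edge while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        result[label] = compact(child)
    return result


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.values())


def count_leaves(node):
    if not node:             # a node without outgoing edges is a leaf
        return 1
    return sum(count_leaves(c) for c in node.values())


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
tree = compact(build_trie(words))
print(sorted(tree["s"]))                       # ['e', 'he']
print(count_leaves(tree), count_nodes(tree))   # 7 12
```

With s = 7 strings, the compact trie has 7 leaves and 12 nodes, below the bound of 2s = 14.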

  36. Suffix trees

  37. Suffix tree • String indexing problem. Given a string S of characters from an alphabet Σ, preprocess S into a data structure to support • Search(P): Return the starting positions of all occurrences of P in S. • Build a compressed trie over all suffixes of S (suffix tree). Label leaves with the index of the suffix. • Observation: An occurrence of P is a prefix of a suffix of S. [Figure: an occurrence of P shown at the start of a suffix of S.]

  38. Suffix tree • String indexing problem. Given a string S of characters from an alphabet Σ, preprocess S into a data structure to support • Search(P): Return the starting positions of all occurrences of P in S. • Build a compressed trie over all suffixes of S (suffix tree). Label leaves with the index of the suffix. • Observation: An occurrence of P is a prefix of a suffix of S. [Figure: an occurrence of P shown at the start of a suffix of S.] • Example: P = ana. [Figure: the occurrences of P highlighted in an example string, each at the start of a suffix of S.]
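
The observation can be stated directly in code: position i is an occurrence exactly when P is a prefix of the suffix starting at i. The sketch below is a naive O(nm) check, not an index. The function name `occurrences` and the example string banana (borrowed from the later slides) are choices made for this illustration.

```python
def occurrences(S, P):
    """Return the 1-based positions i such that P is a prefix of the
    suffix of S that starts at position i."""
    return [i + 1 for i in range(len(S)) if S.startswith(P, i)]


print(occurrences("banana", "ana"))  # [2, 4]
```

A suffix tree organizes all these suffixes in one compressed trie, so the suffixes that start with P can be found without testing each one separately.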

  39. Suffix Tree • Suffix tree over the string banana$. [Figure: the suffix tree, with leaves labeled 1 to 7 by the starting position of their suffix.]

  40. Suffix Tree • Suffix tree over the string banana$. [Figure: the suffix tree, with leaves labeled 1 to 7 by the starting position of their suffix.] • Search for P. • Report the labels of all leaves below the final node.

  41. Suffix Tree • Suffix tree over the string banana$ • Find all occurrences of P=“an” [Figure: the suffix tree, with leaves labeled 1 to 7 by the starting position of their suffix.] • Search for P. • Report the labels of all leaves below the final node.
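
The search procedure can be tried out with the following Python sketch. It is a rough illustration under stated assumptions: the tree is built naively by inserting every suffix into a trie and then merging chains, which takes O(n²) time and is not the linear-time preprocessing referred to in the slides. The names `Node`, `build_suffix_tree`, `compact` and `search` are hypothetical, S is assumed not to contain `$`, and edge labels are stored as explicit strings.

```python
class Node:
    def __init__(self):
        self.edges = {}    # edge label (string) -> Node
        self.index = None  # 1-based start of the suffix, set on leaves only


def build_suffix_tree(S):
    text = S + "$"
    root = Node()
    # Step 1: plain trie over all suffixes, one character per edge.
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node.edges:
                node.edges[ch] = Node()
            node = node.edges[ch]
        node.index = i + 1
    # Step 2: merge chains of single-child nodes into single edges.
    compact(root)
    return root


def compact(node):
    merged = {}
    for label, child in node.edges.items():
        while len(child.edges) == 1:
            (next_label, next_child), = child.edges.items()
            label += next_label
            child = next_child
        compact(child)
        merged[label] = child
    node.edges = merged


def search(root, P):
    """Return the sorted 1-based starting positions of P in S."""
    node, rest = root, P
    while rest:
        for label, child in node.edges.items():
            k = min(len(label), len(rest))
            if label[:k] == rest[:k]:    # edge agrees with the remaining pattern
                node, rest = child, rest[k:]
                break
        else:
            return []                    # no edge matches: P does not occur
    # Report the labels of all leaves below the final node.
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.index is not None:
            found.append(cur.index)
        stack.extend(cur.edges.values())
    return sorted(found)


tree = build_suffix_tree("banana")
print(search(tree, "an"))   # [2, 4]
print(search(tree, "nab"))  # []
```

For P = “an” the walk ends in the middle of an edge below the node for “a”; the leaves below that edge are 2 and 4, which are the two starting positions of “an” in banana.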
