String Indexing for Patterns with Wildcards
Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind
Technical University of Denmark, DTU Informatics
SWAT 2012, Helsinki, July 6, 2012
Problem Definition
Build an index for a string t ∈ Σ∗ that, given a query pattern p, can quickly report where p occurs in t. The pattern has the form p = p_0 ∗ p_1 ∗ ... ∗ p_j, where each ∗ is a wildcard that matches one arbitrary character.
Example: t = combinatorialpatternmatching (positions 1 to 28), p = ∗at∗∗∗n.
[Figure: p aligned below t at its two occurrences.]
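The problem statement can be made concrete with a naive matcher that uses no index at all. The sketch below is only a baseline under stated assumptions: the name `find_occurrences` and the use of `*` as the wildcard symbol are illustrative choices, not part of the talk, and the scan takes O(nm) time rather than the index bounds discussed later.

```python
WILDCARD = "*"  # assumed wildcard symbol; must not occur in the text


def find_occurrences(text, pattern):
    """Return the 1-indexed start positions where pattern matches text.

    A wildcard matches exactly one arbitrary character.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        # Compare position by position, skipping wildcard positions.
        if all(pc == WILDCARD or pc == text[start + k]
               for k, pc in enumerate(pattern)):
            hits.append(start + 1)
    return hits


print(find_occurrences("combinatorialpatternmatching", "*at***n"))
```

On the slide's example this reports the two start positions 14 and 21, matching the two alignments in the figure.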
Two Simple Solutions: Suffix Tree Search
Example: t = bananas (positions 1 to 7), p = ∗na∗.
[Figure: suffix tree of bananas; leaves are labeled with the suffix start positions 1 to 7, and successive frames trace the search for p.]
Time: O(σ^j m + occ)
Space: O(n)
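The branching search can be sketched as follows. This is a simplified illustration, not the structure from the slides: it uses an uncompressed suffix trie (O(n^2) nodes) instead of a suffix tree, and the names `build_suffix_trie` and `search` as well as the `*` wildcard symbol are assumptions. The point it shows is that each wildcard forces the search to follow every child, which is where the σ^j factor in the time bound comes from.

```python
WILDCARD = "*"  # assumed wildcard symbol; must not occur in the text


def build_suffix_trie(text):
    """Uncompressed trie of all suffixes; each node records the
    0-indexed start positions of the suffixes passing through it."""
    root = {"kids": {}, "starts": list(range(len(text)))}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["kids"].setdefault(ch, {"kids": {}, "starts": []})
            node["starts"].append(start)
    return root


def search(node, pattern):
    """Follow pattern from node, branching to every child on a wildcard."""
    if not pattern:
        return list(node["starts"])
    head, rest = pattern[0], pattern[1:]
    if head == WILDCARD:
        children = list(node["kids"].values())
    else:
        child = node["kids"].get(head)
        children = [child] if child is not None else []
    hits = []
    for child in children:
        hits.extend(search(child, rest))
    return hits


trie = build_suffix_trie("bananas")
print(sorted(s + 1 for s in search(trie, "*na*")))
```

For t = bananas and p = ∗na∗ the search ends below the node spelling "ana" and reports the start positions 2 and 4.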
Two Simple Solutions: Simple Linear Time Index
Example: t = bananas$, p = ∗na∗.
[Figure: suffix tree of bananas$ extended step by step with wildcard (∗) edges leading to additional subtrees; the final frames show the query p.]
Time: O(m + j + occ)
Space: O(n^(k+1))
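One way to picture an index of this kind is a trie in which every node also has a precomputed wildcard child that merges all of its ordinary children, so a query follows a single path and never branches. The sketch below makes several assumptions: it uses an uncompressed trie without the $ sentinel, the names `build_index` and `query` are hypothetical, `k` is taken to be the maximum number of wildcards supported, and the node count of this toy version is not claimed to match the O(n^(k+1)) bound on the slide.

```python
WILDCARD = "*"  # assumed wildcard symbol; must not occur in the text


def build_index(text, k, starts=None, depth=0):
    """Trie over suffixes with an extra wildcard child per node.

    The wildcard child merges all ordinary children, so a query never
    branches. Only patterns with at most k wildcards are supported.
    """
    if starts is None:
        starts = list(range(len(text)))
    node = {"starts": starts, "kids": {}}
    groups = {}
    for s in starts:
        if s + depth < len(text):
            groups.setdefault(text[s + depth], []).append(s)
    for ch, group in groups.items():
        node["kids"][ch] = build_index(text, k, group, depth + 1)
    # Suffixes that still have a character left can absorb a wildcard.
    alive = [s for s in starts if s + depth < len(text)]
    if k > 0 and alive:
        node["kids"][WILDCARD] = build_index(text, k - 1, alive, depth + 1)
    return node


def query(index, pattern):
    """Walk one path; return 1-indexed occurrences (empty if none)."""
    node = index
    for ch in pattern:
        node = node["kids"].get(ch)
        if node is None:
            return []
    return [s + 1 for s in node["starts"]]


index = build_index("bananas", k=2)
print(query(index, "*na*"))
```

The query loop touches one node per pattern character, which illustrates why the query time no longer depends on σ; the price is the much larger precomputed structure.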
The Longest Common Prefix Data Structure [1]
LCP Queries
Let C_i be a set of substrings of the indexed string. Consider the following query on the compressed trie T(C_i) storing the strings in C_i.
LCP(x, i, ℓ): the location where the search for x ∈ Σ∗ stops when starting at location ℓ ∈ T(C_i).
Example: x = angry and C_i = suff(bananas).
[Figure: the trie T(C_i) for suff(bananas), with the start location ℓ and the resulting location LCP(x, i, ℓ) marked.]
[1] R. Cole, L. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. Proc. 36th STOC, 2004.
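The semantics of an LCP query can be illustrated with a naive walk. This sketch only shows what the query returns, not how the data structure answers it quickly: it uses an uncompressed trie of nested dictionaries, drops the index i because only one trie is involved, and walks x character by character in O(|x|) time. The function names are assumptions.

```python
def build_suffix_trie(text):
    """Uncompressed trie of all suffixes of text, as nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def lcp(location, x):
    """Search for x starting at `location`.

    Returns the location where the search stops and the number of
    characters of x that were matched.
    """
    matched = 0
    for ch in x:
        if ch not in location:
            break
        location = location[ch]
        matched += 1
    return location, matched


root = build_suffix_trie("bananas")
stop, matched = lcp(root, "angry")
print(matched, sorted(stop))
```

For x = angry started at the root, the search matches "an" and stops there, since no suffix of bananas continues with "g".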
The Longest Common Prefix Data Structure [1]
An Application
Search for subpatterns in the suffix tree using the LCP data structure:
- Build the LCP data structure for the suffix tree.
- Search with a query pattern containing wildcards:
  - Search for complete subpatterns using LCP queries.
  - Branch on a wildcard as in the simple suffix tree solution.
How fast can an LCP query be answered?
- O(log log n) time and O(n log n) space. ⇒ Index with query time O(m + σ^j log log n + occ) and space O(n log n).
- We show that O(log n) time and O(n) space is also possible. ⇒ Index with query time O(m + σ^j log n + occ) and space O(n).
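The two-level search described above (one LCP query per subpattern, one branching step per wildcard) might be organized as in the following sketch. It is an illustration under assumptions: the `lcp` function is a naive character-by-character stand-in for the real LCP data structure, the trie is uncompressed, and all names and the `*` wildcard symbol are hypothetical.

```python
WILDCARD = "*"  # assumed wildcard symbol; must not occur in the text


def build_suffix_trie(text):
    """Uncompressed suffix trie; nodes store 0-indexed suffix starts."""
    root = {"kids": {}, "starts": list(range(len(text)))}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["kids"].setdefault(ch, {"kids": {}, "starts": []})
            node["starts"].append(start)
    return root


def lcp(node, x):
    """Naive stand-in for an LCP query: walk x downward from node."""
    matched = 0
    for ch in x:
        nxt = node["kids"].get(ch)
        if nxt is None:
            break
        node, matched = nxt, matched + 1
    return node, matched


def search(root, pattern):
    subpatterns = pattern.split(WILDCARD)  # p_0, p_1, ..., p_j
    frontier = [root]
    for i, sub in enumerate(subpatterns):
        if i > 0:
            # Wildcard between p_(i-1) and p_i: branch to every child.
            frontier = [c for v in frontier for c in v["kids"].values()]
        survivors = []
        for v in frontier:
            stop, matched = lcp(v, sub)  # one LCP query per subpattern
            if matched == len(sub):
                survivors.append(stop)
        frontier = survivors
    return sorted(s + 1 for v in frontier for s in v["starts"])


print(search(build_suffix_trie("bananas"), "*na*"))
```

Replacing the stand-in `lcp` with a faster query is what changes the per-branch cost from the subpattern length to the O(log log n) or O(log n) terms in the bounds above.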
Solution 1: An Unbounded Wildcard Index Using Linear Space
Query time: O(m + σ^j log log n + occ)
Space usage: O(n)
An Unbounded Wildcard Index Using Linear Space
ART Decomposition [2]
Definition:
- A bottom tree is a maximal subtree with at most log n leaves.
- Vertices not in a bottom tree constitute the top tree.
Example: a tree with n = 16 leaves (log n = 4).
[Figure: the example tree partitioned into bottom trees B1 to B9.]
Property: the top tree has O(n / log n) leaves.
[2] S. Alstrup, T. Husfeldt, and T. Rauhe. Marked ancestor problems. Proc. 39th FOCS, 1998.
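The definition can be turned into a short procedure. The sketch below is an illustration under assumptions: the tree is given as a hypothetical `children` mapping, logarithms are base 2, and the test input is a complete binary tree rather than the tree drawn on the slide. A vertex roots a bottom tree when its subtree has at most log n leaves and its parent's subtree has more.

```python
import math


def art_decomposition(children, root):
    """Split a rooted tree into top-tree vertices and bottom-tree roots.

    children maps each vertex to a list of its children ([] for leaves).
    """
    leaves = {}

    def count(v):
        kids = children[v]
        leaves[v] = 1 if not kids else sum(count(c) for c in kids)
        return leaves[v]

    n = count(root)
    limit = math.log2(n)
    top = [v for v in children if leaves[v] > limit]
    bottom_roots = []
    stack = [root]
    while stack:
        v = stack.pop()
        if leaves[v] <= limit:
            bottom_roots.append(v)  # maximal: every ancestor is too large
        else:
            stack.extend(children[v])
    return sorted(top), sorted(bottom_roots)


# Complete binary tree with 16 leaves, heap-numbered 1..31 (illustrative).
tree = {v: ([2 * v, 2 * v + 1] if v < 16 else []) for v in range(1, 32)}
top, bottom_roots = art_decomposition(tree, 1)
print(top, bottom_roots)
```

In this input the top tree consists of vertices 1, 2, and 3, and its two leaves (vertices 2 and 3) stay within the n / log n = 4 bound stated in the property.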