String indexing in the Word RAM model, part 4 Paweł Gawrychowski University of Wrocław & Max-Planck-Institut für Informatik Paweł Gawrychowski String indexing in the Word RAM model IV 1 / 32
We consider a fundamental data structure question: how to represent a tree?

(Compacted) Trie
A trie is simply a tree with edges labeled by single characters. A compacted trie is created by replacing maximal chains of unary vertices with single edges labeled by (possibly long) words.

Navigation queries
Given a pattern p, we want to traverse the edges of a compacted trie to find the node corresponding to p. If there is no such node, we would like to compute its longest prefix for which the corresponding node does exist.
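The navigation query can be illustrated with a minimal sketch. The names `Node`, `navigate`, `root` and `inner`, and the small example trie, are assumptions made for illustration and do not come from the slides. The sketch stores each edge in a plain Python dict keyed by the first character of its label, and it reports a position inside an edge as an offset past the last explicit node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Outgoing edges keyed by the first character of the edge label.
    # Each value is a pair (full edge label, child node).
    edges: dict = field(default_factory=dict)


def navigate(root, p):
    """Walk down from root reading p.

    Returns (node, offset, matched): the deepest explicit node reached,
    the number of characters matched on the edge below it, and the
    length of the longest prefix of p that can be read in the trie.
    """
    node, i = root, 0
    while i < len(p):
        entry = node.edges.get(p[i])
        if entry is None:
            return node, 0, i          # no edge starts with p[i]
        label, child = entry
        k = 0
        while k < len(label) and i + k < len(p) and label[k] == p[i + k]:
            k += 1
        if k < len(label):
            return node, k, i + k      # stopped strictly inside the edge
        node, i = child, i + k         # whole label matched, descend
    return node, 0, i


# Hypothetical example trie storing abrakadabra, abracadabra and banana.
inner = Node({"k": ("kadabra", Node()), "c": ("cadabra", Node())})
root = Node({"a": ("abra", inner), "b": ("banana", Node())})
```

With this trie, `navigate(root, "abrakad")` stops three characters into the edge `kadabra` below `inner`, while `navigate(root, "abrax")` stops at `inner` itself after matching four characters. A plain dict hides the cost of choosing the outgoing edge, which is exactly the part the rest of the talk is about.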
Consider p = wewpxcwrehyzrt and the following compacted trie.

[Figure: a compacted trie whose edge labels include qoidkbasdk, wewpxc, qtkjdknewnbog, povmnxd, tovndfed, hyugfecvbx and khjkdjd, together with several single-character edges. In the final build the edge hyugfecvbx is shown split after hy.]
Splitting an edge
Given an edge, we want to split it into two parts by (possibly) creating a node, and adding a new edge outgoing from this middle node.

[Figure: an edge labeled abrakadabra, annotated with z, y, x.]

Notice that this covers adding a new edge outgoing from an existing node.
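A minimal sketch of the split operation follows, under the same assumptions as any dict-based trie: `Node`, `split_edge` and the parameter names are hypothetical, and the new edge label `zyx` is an arbitrary choice. Passing a split position equal to the label length covers the case where no middle node is created and the new edge is attached to the existing lower endpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # first character of label -> (full edge label, child node)
    edges: dict = field(default_factory=dict)


def split_edge(parent, first_char, k, new_label):
    """Split the edge leaving `parent` whose label starts with `first_char`
    after its first k characters, then attach a new leaf edge `new_label`
    to the split point. Returns the node the new edge hangs from."""
    label, child = parent.edges[first_char]
    if not 0 < k <= len(label):
        raise ValueError("split position must lie on the edge")
    if k == len(label):
        mid = child                      # split point is an existing node
    else:
        mid = Node()                     # create the middle node
        mid.edges[label[k]] = (label[k:], child)
        parent.edges[first_char] = (label[:k], mid)
    if new_label[0] in mid.edges:
        raise ValueError("an edge starting with this character already exists")
    mid.edges[new_label[0]] = (new_label, Node())
    return mid


# Hypothetical example: split abrakadabra after "abra" and add edge "zyx".
leaf = Node()
root = Node({"a": ("abrakadabra", leaf)})
mid = split_edge(root, "a", 4, "zyx")
```

After the call, `root` has the edge `abra` leading to `mid`, and `mid` has the two outgoing edges `kadabra` and `zyx`.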
Static case (yesterday)
Given a compacted trie, can we quickly construct a small structure which allows us to execute navigation queries efficiently?

Dynamic case
Can we maintain a compacted trie so that:
1. the resulting structure is small,
2. we can execute navigation queries efficiently,
3. we can split any edge efficiently?

There are clearly three parameters: the number of nodes in the compacted trie n, the size of the alphabet σ, and the length of the pattern m. We aim to achieve good bounds in terms of n, σ, and m.
It seems reasonable to consider the scenario where $\sigma$ is non-constant, yet (significantly) smaller than $n$. Hence we get the following question: what are the best possible time bounds in terms of $\sigma$?

Gawrychowski and Fischer
There exists a deterministic linear-size structure supporting navigation in $O\left(m + \frac{\log^2\log\sigma}{\log\log\log\sigma}\right)$ time and splitting edges in $O\left(\frac{\log^2\log\sigma}{\log\log\log\sigma}\right)$.

To make the above result useful, we develop a suffix tree oracle which can be used to locate the edge which should be split after prepending a letter to the current text in $O\left(\log\log n + \frac{\log^2\log\sigma}{\log\log\log\sigma}\right)$ time.
Let us consider the dynamic case, and assume that $n = O(\sigma)$. Here instead of the simple two-level scheme used in the static case we need to partition the nodes into more groups.

Levels of nodes
Let $f(\ell) = 2^{(3/2)^{\ell}}$. We say that a node $v$ is of level $\ell$ when the number of leaves in its subtree belongs to $[f(\ell), 2f(\ell+1)]$. We will maintain an invariant that the level of $v$ does not exceed the level of its parent.

A fragment is a part of the tree consisting of nodes at the same level.
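To get a feel for the level definition, here is a small sketch that lists the levels a node may have for a given number of leaves, assuming f(ℓ) = 2^((3/2)^ℓ) and base-2 logarithms. The function name `admissible_levels` and the cap `max_level` are hypothetical. The comparison is done on logarithms, since f(ℓ) grows doubly exponentially and overflows floating point quickly.

```python
import math


def admissible_levels(leaves, max_level=40):
    """Levels l with f(l) <= leaves <= 2 * f(l + 1), where
    f(l) = 2 ** (1.5 ** l). Taking log2 of all three sides gives
    1.5 ** l <= log2(leaves) <= 1 + 1.5 ** (l + 1)."""
    if leaves < 2:
        return []  # f(0) = 2, so the definition as stated needs >= 2 leaves
    lg = math.log2(leaves)
    return [l for l in range(max_level)
            if 1.5 ** l <= lg <= 1 + 1.5 ** (l + 1)]
```

Consecutive intervals overlap, so some leaf counts admit two levels (for example, 4 leaves fit both level 0 and level 1). This slack is what lets a node keep its level across many updates before it has to move up. The largest level for n leaves is about log_{3/2} log n, which is Θ(log log σ) when n = O(σ).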
Now, we classify the edges into two types:
1. from a node to a node of the same level,
2. from a node to a node of a smaller level.

Edges of type (1) are stored in a static dictionary with constant access time. We already know that such a dictionary can be constructed in close-to-linear time, and this turns out to be enough because of the way we defined the levels. More precisely, it cannot happen too often that the level of a node increases.

Edges of type (2) are stored in a dynamic dictionary structure. For this we develop a weighted variant of the exponential search trees of Andersson and Thorup, which we call the wexponential search trees.

Andersson and Thorup 2002
An exponential search tree is a dynamic predecessor structure storing a subset of $[1, U]$ with $O\left(\frac{\log^2\log U}{\log\log\log U}\right)$ time for insertions and predecessor queries.
Even without the modification, the query complexity is fairly decent, namely $O\left(m + \frac{\log^3\log\sigma}{\log\log\log\sigma}\right)$. This is because there are at most $t = \Theta(\log\log\sigma)$ edges of type (2) on any path descending from the root.

[Figure: a path descending from the root through fragments of weights $w_t, w_{t-1}, w_{t-2}, w_{t-3}$, where $w_i \in [f(i), 2f(i+1)]$.]
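The arithmetic behind this bound can be spelled out as follows. This is a sketch that assumes the dictionary keys are characters, so that the universe size is $U = \sigma$, and that every type (2) edge on the path costs one query to an unmodified exponential search tree.

```latex
% Number of levels: f(t) <= n = O(sigma) forces
\[
  \left(\tfrac{3}{2}\right)^{t} \le \log\sigma + O(1)
  \quad\Longrightarrow\quad
  t \le \log_{3/2}\log\sigma + O(1) = \Theta(\log\log\sigma).
\]
% Each of the at most t type (2) edges costs one exponential search tree query:
\[
  O(m) + t \cdot O\!\left(\frac{\log^{2}\log\sigma}{\log\log\log\sigma}\right)
  = O\!\left(m + \frac{\log^{3}\log\sigma}{\log\log\log\sigma}\right).
\]
```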
We want to be faster though. The subsequent accesses to the dynamic dictionary structures are not completely independent, so there is hope!

Wexponential search trees
There exists a linear-size dynamic structure storing a collection of $n$ weighted elements from $[1, U]$ with the following bounds:
1. predecessor search takes $O\left(\log\frac{\log W}{\log w} \cdot \frac{\log\log U}{\log\log\log U}\right)$, where $W$ is the current total weight, and $w$ is the weight of the predecessor,
2. inserting a new element of weight 1 takes $O(\log\log W)$,
3. increasing the weight of an element of weight $w$ by 1 takes $O\left(\log\frac{\log W}{\log w}\right)$.
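One plausible way to see why weight-sensitive bounds help is a telescoping sum. The derivation below is a sketch under assumptions not stated on the slide: the predecessor bound is read as $O\left(\log\frac{\log W}{\log w} \cdot \frac{\log\log U}{\log\log\log U}\right)$, the $j$-th dictionary queried on a root-to-leaf path has total weight $W_j$ and returns an element of weight $w_j \ge 2$, and weights shrink along the path so that $W_{j+1} \le w_j$ (as they would if weights were subtree sizes).

```latex
\[
  \sum_{j=1}^{k} \log\frac{\log W_j}{\log w_j}
  = \sum_{j=1}^{k} \bigl(\log\log W_j - \log\log w_j\bigr)
  \le \log\log W_1 - \log\log w_k
  \le \log\log W_1 ,
\]
% because W_{j+1} <= w_j makes consecutive terms cancel.
% With W_1 <= n = O(sigma) and U = sigma, the total dictionary cost is
\[
  O\!\left(\log\log\sigma \cdot \frac{\log\log\sigma}{\log\log\log\sigma}\right)
  = O\!\left(\frac{\log^{2}\log\sigma}{\log\log\log\sigma}\right),
\]
% which matches the additive term in the navigation bound stated earlier.
```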