Suffix Trees: Construction and Applications
João Carreira, 2008
Outline
● Why Suffix Trees?
● Definition
● Ukkonen's Algorithm (construction)
● Applications
Why Suffix Trees?
● Asymptotically fast.
● The basis of state-of-the-art data structures.
● You don't need a PhD to use them.
● Challenging.
● Expose interesting algorithmic ideas.
Definition
Suffix tree for an m-character string S:
● m leaves, numbered 1 to m
● edge-label: the nonempty substring of S written on an edge; node-label: the concatenation of the edge-labels on the path from the root to that node
● each internal node other than the root has at least two children
● the label of leaf j is S[j..m]
● no two edges out of the same node can have edge-labels beginning with the same character
Definition
Example
● String: xabxac
● Length (m): 6 characters
● Number of leaves: 6
● Label of leaf 5: ac (the suffix S[5..6])
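The definition can be checked on the example string with a small sketch. This is not Ukkonen's algorithm: it inserts the suffixes one at a time in quadratic time, and the names `Node`, `build_naive` and `leaf_labels` are illustrative choices, not part of the slides. It assumes that no suffix is a prefix of another suffix, which holds for xabxac.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge-label -> (edge-label, child)
        self.leaf = None     # 1-based suffix start for leaves, None for internal nodes


def build_naive(s):
    """Insert every suffix of s in turn, splitting an edge where paths diverge."""
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            if not rest:
                raise ValueError("a suffix is a prefix of another suffix")
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Paths diverge inside the edge: split it with a new internal node.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to its node-label (concatenated edge-labels)."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    out = {}
    for label, child in node.children.values():
        out.update(leaf_labels(child, prefix + label))
    return out


labels = leaf_labels(build_naive("xabxac"))
print(len(labels))   # 6
print(labels[5])     # ac
```

Storing each edge-label as a copied string keeps the sketch short; a practical implementation would store a pair of indices into S instead.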
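Implicit vs Explicit
● What if we have “axabx”?
● The suffix “x” is a prefix of the suffix “xabx”, so its path ends inside the tree and no leaf is labeled with it: the tree is only implicit.
● Appending a terminal character that appears nowhere else in the string (commonly written $) makes every suffix end at its own leaf.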
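A quick way to see which suffixes would be hidden is to test each suffix against all the others. The helper name `hidden_suffixes` is hypothetical, and the `$` terminator is the usual convention rather than something the slide specifies.

```python
def hidden_suffixes(s):
    """Return 1-based start positions of suffixes that are a proper prefix
    of another suffix, i.e. suffixes that would not end at a leaf."""
    m = len(s)
    return [j + 1 for j in range(m)
            if any(k != j and s[k:].startswith(s[j:]) for k in range(m))]


print(hidden_suffixes("axabx"))    # [5]
print(hidden_suffixes("axabx$"))   # []
print(hidden_suffixes("xabxac"))   # []
```

The check is quadratic in the number of suffixes and is only meant for illustration.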
Ukkonen's Algorithm
suffix tree construction
● Text: S[1..m]
● m phases
● phase i + 1 is divided into i + 1 extensions
In extension j of phase i + 1:
● find the end of the path from the root labeled with substring S[j..i]
● extend the substring by adding the character S(i + 1) to its end
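Extension Rules
Let β = S[j..i] be the substring handled by extension j of phase i + 1.
● Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
● Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β; if β ends inside an edge, that edge is first split by a new internal node.
● Rule 3: Some path from the end of β starts with S(i + 1), so βS(i + 1) is already in the tree and we do nothing.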
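The phases, extensions and rules can be sketched as below. To keep it short, this version assumes an uncompressed trie (one character per edge, nested dictionaries) instead of a compressed suffix tree, so Rule 2 never has to split an edge. Indices are 0-based, and the names `build_trie_by_rules` and `contains` are illustrative.

```python
def build_trie_by_rules(s):
    """Naive phase/extension loop; returns the trie and how often each rule fired."""
    root = {}
    rule_counts = {1: 0, 2: 0, 3: 0}
    for i in range(len(s)):            # one phase per character s[i]
        for j in range(i + 1):         # one extension per suffix start j
            node = root
            for ch in s[j:i]:          # walk beta = s[j:i] from the root
                node = node[ch]
            nxt = s[i]
            if nxt in node:
                rule_counts[3] += 1    # Rule 3: beta + nxt is already present
                continue
            is_leaf = not node and j < i   # no children and beta is nonempty
            rule_counts[1 if is_leaf else 2] += 1
            node[nxt] = {}             # Rule 1 grows a leaf, Rule 2 adds a new one
    return root, rule_counts


def contains(root, t):
    """True if t labels a path from the root, i.e. t is a substring of s."""
    node = root
    for ch in t:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie, counts = build_trie_by_rules("xabxac")
print(counts)                  # {1: 12, 2: 6, 3: 3}
print(contains(trie, "bxa"))   # True
```

For a string of length 6 there are 1 + 2 + ... + 6 = 21 extensions in total, which matches the sum of the three counters.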
Ukkonen's Algorithm
suffix tree construction
Complexity of the naive version:
● m phases
● phase i + 1 -> i + 1 extensions
● find the end of the path of substring β: O(|β|) = O(m)
● apply the extension rule once the end is found: O(1)
● Total: O(m^3)
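The cubic bound can be written out as a sum, using the slide's notation and assuming the cost of extension j of phase i + 1 is dominated by walking β = S[j..i], which has i - j + 1 characters:

```latex
\sum_{i=0}^{m-1} \sum_{j=1}^{i+1} O(i - j + 1)
  = O\!\left( \sum_{i=0}^{m-1} \frac{i(i+1)}{2} \right)
  = O\!\left( \frac{(m-1)\,m\,(m+1)}{6} \right)
  = O(m^3)
```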
“First make it run, then make it run fast.” Brian Kernighan
Suffix Links
Definition:
● Let x be a single character and α a (possibly empty) string. For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.
Suffix Links
Lemma:
● If a new internal node v with path-label xα is added to the current tree in extension j of some phase i + 1, then either the path labeled α already ends at an internal node, or an internal node at the end of the string α will be created in extension j + 1 of the same phase.
Proof sketch: a new internal node is created only if Rule 2 applies in extension j, so:
● S[j..i] continues with some character c ≠ S(i + 1)
● hence S[j + 1..i] also continues with c.
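Suffix links in a finished tree can be illustrated without building the tree at all. The sketch below assumes the string ends with a unique terminator, and relies on the fact that the internal nodes are then the substrings followed by at least two different characters (the root corresponds to the empty string). The function names are illustrative.

```python
from collections import defaultdict


def internal_labels(s):
    """Path-labels of internal nodes: substrings followed by >= 2 distinct characters."""
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            followers[s[i:j]].add(s[j])   # s[i:j] is followed by s[j]
    return {sub for sub, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map each non-root internal label x + alpha to alpha, its suffix-link target."""
    labels = internal_labels(s)
    return {v: v[1:] for v in labels if v and v[1:] in labels}


print(suffix_links("xabxac$") == {"xa": "a", "a": ""})   # True
```

The enumeration is quadratic in the string length and is meant as a check of the definition, not as a construction method.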
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[j - 1..i] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[j - 1..i].
2. If v is the root, follow the path for S[j..i] (as in the naive algorithm). Otherwise traverse the suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[j..i]S(i + 1) is in the tree.
4. If a new internal node w with path-label xα was created in extension j - 1 (by Rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).
Node Depth
● The node-depth of v is at most one greater than the node-depth of s(v).
[Figure: nodes with path-labels xβ, xα, xλ and nodes with path-labels β, α, λ; annotations: node depth 4, node depth 3, equal node-depth 3.]
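The node-depth claim can be spot-checked on concrete strings. The sketch assumes a string with a unique terminator, identifies internal nodes by their path-labels (substrings followed by at least two different characters), and counts the root as depth 0. All function names are hypothetical.

```python
from collections import defaultdict


def internal_labels(s):
    """Path-labels of internal nodes: substrings followed by >= 2 distinct characters."""
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            followers[s[i:j]].add(s[j])
    return {sub for sub, chars in followers.items() if len(chars) >= 2}


def node_depth(label, labels):
    """Number of internal nodes strictly above the node with this path-label."""
    return sum(1 for k in range(len(label)) if label[:k] in labels)


def depth_claim_holds(s):
    """Check depth(v) <= depth(s(v)) + 1 for every non-root internal node v."""
    labels = internal_labels(s)
    return all(node_depth(v, labels) <= node_depth(v[1:], labels) + 1
               for v in labels if v)


print(depth_claim_holds("xabxac$"))        # True
print(depth_claim_holds("mississippi$"))   # True
```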
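Skip/count Trick
● |γ| = number of characters on an edge
● “Directly implemented” edge traversal: O(|γ|)
● “Jump” from node to node: the string being followed is known to be in the tree, so only the first character of each edge is checked and the rest of the edge is skipped using its length.
● K = number of nodes on a path
● Time to traverse a path: O(K)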
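Suffix links and the skip/count trick come together in the sketch below. It is not a transcription of the single extension algorithm from the slides: it uses the widely used "active point" bookkeeping (an active node, edge and length plus a count of pending suffixes), and it stores edge-labels as index pairs with open-ended leaf edges. All identifiers are illustrative.

```python
class Node:
    def __init__(self, start, end=None):
        self.start = start     # the edge into this node is labeled s[start:end]
        self.end = end         # None marks a leaf edge that grows with every phase
        self.children = {}     # first character of edge-label -> child node
        self.link = None       # suffix link (set for internal nodes)


def build_suffix_tree(s):
    root = Node(0, 0)
    node, edge, length = root, 0, 0    # active point
    remainder = 0                      # suffixes still to be inserted
    for i, c in enumerate(s):          # one phase per character
        remainder += 1
        pending = None                 # internal node waiting for its suffix link
        while remainder > 0:
            if length == 0:
                edge = i
            first = s[edge]
            if first not in node.children:
                node.children[first] = Node(i)       # Rule 2: new leaf at a node
                if pending is not None:
                    pending.link = node
                    pending = None
            else:
                nxt = node.children[first]
                edge_len = (i + 1 if nxt.end is None else nxt.end) - nxt.start
                if length >= edge_len:               # skip/count: jump over the edge
                    edge += edge_len
                    length -= edge_len
                    node = nxt
                    continue
                if s[nxt.start + length] == c:       # Rule 3: already present
                    length += 1
                    if pending is not None:
                        pending.link = node
                    break
                split = Node(nxt.start, nxt.start + length)   # Rule 2 with a split
                node.children[first] = split
                split.children[c] = Node(i)
                nxt.start += length
                split.children[s[nxt.start]] = nxt
                if pending is not None:
                    pending.link = split
                pending = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link if node.link is not None else root
    return root


def leaf_suffixes(root, s):
    """Collect the path-label of every leaf, sorted."""
    out, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        end = len(s) if node.end is None else node.end
        label = prefix + s[node.start:end]
        if node.children:
            stack.extend((child, label) for child in node.children.values())
        else:
            out.append(label)
    return sorted(out)


print(leaf_suffixes(build_suffix_tree("xabxac"), "xabxac"))
# ['abxac', 'ac', 'bxac', 'c', 'xabxac', 'xac']
```

Rule 1 never appears explicitly because leaf edges are open-ended: a leaf's label extends automatically as `i` advances.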
Ukkonen's Algorithm
Using the skip/count trick:
● any phase of Ukkonen's algorithm takes O(m) time.
Proof:
● There are i + 1 ≤ m extensions in phase i + 1.
● In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
● The up-walk decreases the current node-depth by at most one.
● Each suffix link traversal decreases the node-depth by at most another one.
● Each down-walk moves to a node of greater depth.
● No node has depth greater than m, and over the whole phase the up-walks and suffix link traversals decrease the node-depth at most 2m times, so the down-walks skip at most 3m nodes in total.