Suffix Trees Construction and Applications João Carreira 2008

Outline
● Why Suffix Trees?
● Definition
● Ukkonen's Algorithm (construction)
● Applications

Why Suffix Trees?
● Asymptotically fast.
● The basis of state-of-the-art data structures.
● You don't need a PhD to use them.
● Challenging.
● Expose interesting algorithmic ideas.

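Definition
Suffix Tree for an m-character string S:
● m leaves, numbered 1 to m
● edge-label vs node-label: each edge is labeled with a nonempty substring of S; the label of a node is the concatenation of the edge-labels on the path from the root to that node
● each internal node other than the root has at least two children
● the label of leaf j is the suffix S[j..m]
● no two edges out of the same node can have edge-labels beginning with the same character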

Definition: Example
● String: xabxac
● Length (m): 6 characters
● Number of leaves: 6
● Label of leaf 5: ac
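The properties in the definition can be checked on the example with a small sketch. This is not Ukkonen's algorithm: it inserts the suffixes one at a time (quadratic time), and it assumes the last character of the string occurs nowhere else, as in xabxac. The names `Node`, `naive_suffix_tree`, `leaf_labels` and `branching_ok` are illustrative only.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge -> (edge_label, child)
        self.leaf = None     # 1-based suffix number, set on leaves only


def naive_suffix_tree(s):
    """Insert every suffix of s in turn (assumes the last character is unique)."""
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to the string spelled on its root-to-leaf path."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    labels = {}
    for label, child in node.children.values():
        labels.update(leaf_labels(child, prefix + label))
    return labels


def branching_ok(node, is_root=True):
    """Every internal node other than the root has at least two children."""
    if node.leaf is not None:
        return True
    if not is_root and len(node.children) < 2:
        return False
    return all(branching_ok(c, False) for _, c in node.children.values())


s = "xabxac"
labels = leaf_labels(naive_suffix_tree(s))
print(len(labels), labels[5])
```

Because children are keyed by the first character of their edge, the "no two edges start with the same character" property holds by construction.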

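Implicit vs Explicit
● What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path does not end at a leaf and the tree is only implicit.
● Appending a terminal character (such as $) that occurs nowhere else in the string makes every suffix end at its own leaf.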
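Whether a string runs into this problem can be tested directly: it happens exactly when some suffix is a proper prefix of another suffix. A brute-force sketch (the function name `has_nested_suffix` is made up for this example):

```python
def has_nested_suffix(s):
    """True if some suffix of s is a proper prefix of another suffix of s."""
    suffixes = [s[j:] for j in range(len(s))]
    return any(a != b and b.startswith(a) for a in suffixes for b in suffixes)


print(has_nested_suffix("axabx"))    # "x" is a prefix of "xabx"
print(has_nested_suffix("axabx$"))   # with a unique terminal character
```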

Ukkonen's Algorithm (suffix tree construction)
● Text: S[1..m]
● m phases
● Phase i + 1 is divided into i + 1 extensions. In extension j of phase i + 1:
  ● find the end of the path from the root labeled with substring S[j..i]
  ● extend the substring by adding the character S(i + 1) to its end

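Extension Rules
Let β = S[j..i] be the substring found in extension j of phase i + 1.
● Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
● Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created there; if β ends inside an edge, a new internal node is created at that point first.
● Rule 3: Some path from the end of β starts with S(i + 1). The string βS(i + 1) is already in the tree, so we do nothing.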
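The phase/extension structure and the three rules can be sketched in a few lines. To keep it short, this sketch assumes an uncompressed trie (one character per edge) instead of substring-labeled edges, uses 0-based indices, and walks down from the root in every extension, so it is the slow naive version. The names `build_naive`, `contains` and `count_leaves` are illustrative.

```python
def build_naive(s):
    """Naive phase/extension construction on an uncompressed suffix trie."""
    root = {}                           # each node is a dict: character -> child
    rule_counts = {1: 0, 2: 0, 3: 0}
    for i in range(len(s)):             # phase i + 1 adds character s[i]
        c = s[i]
        for j in range(i + 1):          # one extension per beta = s[j:i]
            node = root
            for ch in s[j:i]:           # find the end of beta: O(|beta|)
                node = node[ch]
            if c in node:
                rule_counts[3] += 1     # Rule 3: beta + c is already there
            elif not node and node is not root:
                node[c] = {}            # Rule 1: beta ended at a leaf
                rule_counts[1] += 1
            else:
                node[c] = {}            # Rule 2: start a new branch
                rule_counts[2] += 1
    return root, rule_counts


def contains(root, pattern):
    """True if pattern can be spelled starting from the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_leaves(node):
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie, counts = build_naive("xabxac")
print(counts, count_leaves(trie))
```

Every substring of the text ends up spellable from the root, and the total number of extensions is 1 + 2 + ... + m.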

Ukkonen's Algorithm (suffix tree construction)
Complexity of the naive version:
● m phases
● phase i + 1: i + 1 ≤ m extensions
● find the end of the path of substring β: O(|β|) = O(m)
● apply the extension rule once the end is found: O(1)
● Total: O(m^3)
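Written out as a sum, under the assumption that extension j of phase i + 1 costs time proportional to |S[j..i]| + 1 (walk down from the root, then apply a rule in constant time), the cubic bound reads:

```latex
\begin{align*}
T(m) &= \sum_{i=0}^{m-1} \sum_{j=1}^{i+1} O\bigl(|S[j..i]| + 1\bigr)
      = \sum_{i=0}^{m-1} \sum_{j=1}^{i+1} O(i - j + 2) \\
     &= \sum_{i=0}^{m-1} O\!\left(\frac{(i+1)(i+2)}{2}\right)
      = O(m^3)
\end{align*}
```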

“First make it run, then make it run fast.” Brian Kernighan

Suffix Links
Definition:
● For an internal node v with path-label xα (x a single character, α a possibly empty string), if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

Suffix Links
Lemma:
● If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of the string α will be created in the next extension of the same phase.
Proof sketch. A new internal node is created only when Rule 2 applies, so:
● S[j..i] continues with some character c ≠ S(i + 1)
● S[j + 1..i] therefore also continues with c, so extension j + 1 either finds an internal node at the end of α or creates one there.
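A consequence of the lemma is that, in the finished tree, every internal node other than the root has a suffix link. This can be checked without building the tree at all, using the observation that the path-labels of internal nodes are exactly the substrings that are followed by at least two different characters. The sketch below relies on that characterization; the function names are illustrative.

```python
def internal_labels(s):
    """Path-labels of internal nodes: substrings followed by >= 2 distinct characters."""
    followers = {}
    for i in range(len(s)):
        for k in range(i, len(s)):
            # s[i:k] occurs at position i and is followed there by s[k].
            followers.setdefault(s[i:k], set()).add(s[k])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each non-root internal path-label x + alpha to alpha."""
    return {w: w[1:] for w in internal_labels(s) if w}


links = suffix_links("xabxac")
print(links)
```

For xabxac the internal nodes are the root (empty label), a and xa, so the links are xa → a and a → root.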

Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[j - 1..i] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[j - 1..i].
2. If v is the root, follow the path for S[j..i] (as in the naive algorithm). Otherwise traverse the suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[j..i]S(i + 1) is in the tree.
4. If a new internal node w was created in extension j - 1 (by Rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).
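For reference, here is a compact sketch of a complete construction. It is not a line-by-line transcription of the four steps above: it follows the widely used "active point" formulation, which also stops a phase at the first Rule 3 and lets leaf edges grow implicitly, two refinements these slides have not introduced. Treat the class and function names as assumptions of this example.

```python
class Node:
    def __init__(self, start, end, link=None):
        self.start = start    # the edge into this node is text[start:end]
        self.end = end        # None marks a leaf edge running to the end of the text
        self.children = {}    # first character of edge -> child Node
        self.link = link      # suffix link (used on internal nodes)


def build_suffix_tree(text):
    root = Node(0, 0)
    root.link = root
    node, edge, length = root, 0, 0   # active point: node, edge start index, offset
    remainder = 0                     # suffixes not yet inserted explicitly
    for pos, c in enumerate(text):    # phase pos + 1
        remainder += 1
        last_new = None               # internal node still waiting for its link
        while remainder > 0:
            if length == 0:
                edge = pos
            first = text[edge]
            child = node.children.get(first)
            if child is None:
                node.children[first] = Node(pos, None)    # Rule 2 at a node
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                end = pos + 1 if child.end is None else child.end
                edge_len = end - child.start
                if length >= edge_len:                    # skip/count over the edge
                    edge += edge_len
                    length -= edge_len
                    node = child
                    continue
                if text[child.start + length] == c:       # Rule 3: end the phase
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                # Rule 2 inside an edge: split it and hang a new leaf.
                split = Node(child.start, child.start + length, link=root)
                node.children[first] = split
                split.children[c] = Node(pos, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = pos - remainder + 1
            elif node is not root:
                node = node.link                          # follow the suffix link
    return root


def suffixes(node, text, prefix=""):
    """Collect the strings spelled on all root-to-leaf paths."""
    if node.end is None:
        return [prefix]
    out = []
    for child in node.children.values():
        end = len(text) if child.end is None else child.end
        out += suffixes(child, text, prefix + text[child.start:end])
    return out


text = "xabxac"
print(sorted(suffixes(build_suffix_tree(text), text)))
```

The result is a true suffix tree only if the last character of the text is unique; otherwise some suffixes stay implicit.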

Node Depth
● The node-depth of v is at most one greater than the node-depth of s(v).
[Figure: the path to v through ancestors with path-labels xβ, xα, xλ (node depth 4), next to the path to s(v) through nodes with path-labels β, α, λ (node depth 3).]

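Skip/count Trick
● γ: the string to traverse, with |γ| characters
● “Directly implemented” edge traversal compares every character: O(|γ|)
● Since the path for γ is known to exist, it is enough to look at the first character and the length of each edge and “jump” from node to node.
● K = number of nodes in the path
● Time to traverse the path: O(K)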
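A minimal sketch of the jump, on a hand-built tree for xabxac. The node layout (a dict from first character to `(edge_start, edge_end, child)`) and the function name are assumptions of this example; the walk never compares more than one character per edge.

```python
text = "xabxac"

# Hand-built suffix tree for "xabxac"; an edge is (start, end, child) into text.
xa = {"b": (2, 6, {}), "c": (5, 6, {})}
a = {"b": (2, 6, {}), "c": (5, 6, {})}
root = {"x": (0, 2, xa), "a": (1, 2, a), "b": (2, 6, {}), "c": (5, 6, {})}


def skip_count_walk(root, text, start, length):
    """Walk down text[start:start + length], which must already be in the tree.

    Returns (node, edge_char, offset, hops): the last node reached, the first
    character of the edge the walk ends inside (None if it ends at a node),
    the number of characters consumed on that edge, and the nodes jumped to.
    """
    node, hops = root, 0
    while length > 0:
        first = text[start]
        edge_start, edge_end, child = node[first]
        edge_len = edge_end - edge_start
        if length < edge_len:
            return node, first, length, hops   # ends inside this edge
        node = child                           # jump over the whole edge
        start += edge_len
        length -= edge_len
        hops += 1
    return node, None, 0, hops


node, edge_char, offset, hops = skip_count_walk(root, text, 0, 4)   # "xabx"
print(edge_char, offset, hops)
```

Here the walk for xabx jumps over the edge xa in one step and stops two characters into the edge that starts with b.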

Ukkonen's Algorithm
Using the skip/count trick:
● Any phase of Ukkonen's algorithm takes O(m) time.
Proof:
● There are i + 1 ≤ m extensions in phase i + 1.
● In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
● The up-walk decreases the current node-depth by at most one.
● Each suffix link traversal decreases the node-depth by at most another one.
● Each down-walk moves to a node of greater depth.
● Node-depth never exceeds m and drops by at most two per extension, so all the down-walks of a phase together pass at most 3m nodes: O(m) for the whole phase.
