
CSI5126. Algorithms in bioinformatics: Suffix Trees

Marcel Turcotte, School of Electrical Engineering and Computer Science
Sections: Preamble, Notation, Boyer-Moore, Suffix Trees


  1. Notation (continued)
     S[i..j] denotes the (contiguous) substring of S that starts at position i and stops at position j, that is S(i) S(i+1) ... S(j); a substring is also called a factor.
     S[1..i] is the prefix of S that ends at position i.
     S[i..|S|] is the suffix of S that starts at position i.
     A substring, prefix or suffix is proper if it is neither the entire string nor empty.

  2. Notation (continued)
     We say that S is a subsequence of T if there exists an increasing sequence of indices of T, i_1 < i_2 < ... < i_m, such that S = T(i_1) T(i_2) ... T(i_m). In other words, the string S can be obtained by deleting zero or more characters of T. E.g. tie is a subsequence of otherwise.
     We say that two characters match if they are the same; otherwise we say it is a mismatch.
     Let P denote a pattern (query, for now a string) and T be a text (think of it as a database); in general |P| << |T|.
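The subsequence definition above suggests a simple check: scan T once from left to right and greedily match the characters of S in order. The following Java sketch illustrates this; the class and method names (`Subsequence`, `isSubsequence`) are illustrative and do not come from the slides.

```java
/** Illustrative helper (not from the slides): subsequence test by one left-to-right scan. */
final class Subsequence {
    /** Returns true if s can be obtained from t by deleting zero or more characters. */
    static boolean isSubsequence(String s, String t) {
        int i = 0; // index of the next character of s that still has to be matched
        for (int j = 0; j < t.length() && i < s.length(); j++) {
            if (s.charAt(i) == t.charAt(j)) {
                i++; // greedily use the earliest possible match in t
            }
        }
        return i == s.length(); // every character of s was matched, in order
    }
}
```

For the example on the slide, `Subsequence.isSubsequence("tie", "otherwise")` holds, whereas `"eit"` is not a subsequence of `"otherwise"` because the order of the characters matters.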

  3. Motivation
     Algorithms on text (strings) have long been studied in computer science, and computation on molecular sequence data (strings) is at the heart of computational molecular biology. Present and potential algorithms for string computation provide a significant intersection between computer science and molecular biology.
     Problem: Given a pattern P and a text T, determine if P occurs in T.
     Problem: Given a pattern P and a text T, find all occurrences of P in T.
     Example: P = string.
     How do you approach such a problem?

  4. Naïve algorithm
     A window of size |P| is moved along the text (T).
     In the worst case, for every starting location, 1 ... |T| − |P| + 1, all |P| symbols of the pattern must be considered. This requires on the order of |T| × |P| comparisons.

  5. Naïve algorithm (continued)
         // Counts the occurrences of the pattern p in the text t.
         public static int findall(String p, String t) {
             int lp = p.length(), lt = t.length(), count = 0;
             if (lp == 0) {
                 return 0; // guard: charAt(0) below would fail on an empty pattern
             }
             // Slide a window of size lp along the text.
             for (int pos = 0; pos <= lt - lp; pos++) {
                 int offset = 0;
                 boolean done = p.charAt(offset) != t.charAt(pos + offset);
                 while (!done) {
                     offset++;
                     if (offset == lp) {
                         count++; // all lp symbols matched
                         done = true;
                     } else {
                         done = p.charAt(offset) != t.charAt(pos + offset);
                     }
                 }
             }
             return count;
         }

  6. Discussion
     In practice, what behaviour do you expect?

  7. Discussion
     First, |P| << |T|.
     But also, with probability (1 − 1/|Σ|), the algorithm skips the while loop for a given iteration of the exterior for loop (simplified reasoning).
     Assuming a random pattern and text, one would expect to find 1 complete exact match every |Σ|^|P| positions.
     What is the maximum length of a pattern that you would expect to find at least once in the human genome? log_4 3,000,000,000 ≈ 16.
     Conclusion: you would expect the inner loop to stop rapidly.
     How do you speed it up?
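The estimate on this slide can be reproduced by setting the expected number of matches, |T| / |Σ|^|P|, to 1 and solving for |P|, which gives |P| = log_|Σ| |T|. The short Java sketch below does the arithmetic; the class and method names are illustrative, and the genome size of 3,000,000,000 is the figure used on the slide.

```java
/** Illustrative arithmetic (names are not from the slides). */
final class ExpectedMatchLength {
    /**
     * Pattern length at which a random pattern is expected to occur once
     * in a random text: the logarithm of textLength in base alphabetSize.
     */
    static double expectedLength(double textLength, int alphabetSize) {
        return Math.log(textLength) / Math.log(alphabetSize);
    }
}
```

`ExpectedMatchLength.expectedLength(3_000_000_000.0, 4)` evaluates to about 15.74, which the slide rounds to 16.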

  8. Speeding up
     There are two fundamentally different approaches:
     Pre-processing P (e.g. Boyer-Moore)
     Pre-processing T (e.g. Suffix Trees)
     ⇒ But what do you mean by pre-processing? Let's consider the Boyer-Moore algorithm first: before comparing P and T, we are willing to spend time and space analyzing P, pre-calculating indices that we know will be useful later and will reduce the total number of comparison and shift operations needed. In the case of suffix trees, we are willing to spend time and space on the analysis of T.

  9. Boyer-Moore: ideas
     When comparing two strings, P and T, the Boyer-Moore algorithm proceeds from right to left:
     T: xpbctbxabpqxctbpq
     P:   tpabxab
            *^^^^
     P:     tpabxab   (after the shift)
     Once a mismatch has been found, it applies one of 2 rules to shift the position of the pattern with respect to the text (instead of systematically shifting the pattern one position to the right, as the naïve algorithm does):
     Bad character rule
     (Strong) good suffix rule

  10. Bad character rule
      Definition. R(x) is the position of the rightmost occurrence of the character x in P. R(x) = 0 if x does not occur in P.
      Preprocessing. Calculate R(x) ∀ x ∈ Σ (alphabet); this necessitates O(n) operations, where n = |P|.
      Example: P = tpabxab, Σ = {a, b, p, t, x}
      x:     a  b  p  t  x
      R(x):  6  7  2  1  5
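The preprocessing step for R(x) can be sketched in Java with a single left-to-right pass over P, letting later positions overwrite earlier ones. This is a minimal illustration under the assumption that a hash map stands in for the table over Σ; the names `BadCharacterTable`, `build` and `rightmost` are not from the slides.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the R(x) table for the bad character rule. */
final class BadCharacterTable {
    /** Maps each character x of p to R(x), its rightmost 1-based position in p. */
    static Map<Character, Integer> build(String p) {
        Map<Character, Integer> r = new HashMap<>();
        for (int i = 0; i < p.length(); i++) {
            r.put(p.charAt(i), i + 1); // a later occurrence overwrites an earlier one
        }
        return r;
    }

    /** R(x), with R(x) = 0 when x does not occur in the pattern. */
    static int rightmost(Map<Character, Integer> r, char x) {
        return r.getOrDefault(x, 0);
    }
}
```

For P = tpabxab this yields R(a) = 6, R(b) = 7, R(p) = 2, R(t) = 1 and R(x) = 5, in agreement with the table on the slide.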

  11. Bad character rule (continued)
      T: xpbctbxabpqxctbpq
      P:   tpabxab
             *^^^^
      P:     tpabxab   (after the shift)
      The naïve algorithm would shift the pattern one position to the right and compare the two strings again. However, we could have known in advance that a mismatch would occur, because the rightmost occurrence of t in P is on the left-hand side of the symbol p (the symbol that will be aligned with the t of T when P is shifted one position to the right).
      Here the mismatch is at position 3 of P and R(t) = 1, so P can be shifted two positions to the right, which aligns the t of P with the t of T.
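To make the shift concrete, here is a hedged Java sketch of a search that uses the bad character rule only. It assumes the usual textbook statement of the rule, which the slides do not spell out: when the mismatch occurs at position i of P against the text character c, shift P right by max(1, i − R(c)). The good suffix rule is deliberately left out, so this is not the full Boyer-Moore algorithm, and the names `BadCharacterSearch` and `findAll` are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: right-to-left comparison with bad character shifts only. */
final class BadCharacterSearch {
    /** Returns the 1-based starting positions of all occurrences of p in t. */
    static List<Integer> findAll(String p, String t) {
        List<Integer> hits = new ArrayList<>();
        int n = p.length(), m = t.length();
        if (n == 0 || n > m) {
            return hits;
        }
        // R(x): rightmost 1-based position of x in p (absent characters count as 0).
        Map<Character, Integer> r = new HashMap<>();
        for (int i = 0; i < n; i++) {
            r.put(p.charAt(i), i + 1);
        }
        int start = 0; // 0-based position of t aligned with the first character of p
        while (start <= m - n) {
            int i = n; // 1-based position in p; compare from right to left
            while (i >= 1 && p.charAt(i - 1) == t.charAt(start + i - 1)) {
                i--;
            }
            if (i == 0) {
                hits.add(start + 1); // full match
                start += 1;          // conservative shift after a match
            } else {
                char bad = t.charAt(start + i - 1); // mismatching text character
                start += Math.max(1, i - r.getOrDefault(bad, 0));
            }
        }
        return hits;
    }
}
```

On the slide's example (P = tpabxab, T = xpbctbxabpqxctbpq) the search visits the alignment shown above, shifts by two after the mismatch on t, and reports no occurrence.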

  12. Idea behind the strong good suffix rule
      [Figure: T with a substring t preceded by the character x; P aligned below it, with its suffix t preceded by the character y.]
      ⇒ The boxes labeled t are identical substrings; the characters x of T and y of P are distinct (i.e. the first mismatch).

  13. Idea behind the strong good suffix rule (continued)
      [Figure: as before, with a second copy t′ of t further to the left in P, preceded by the character z.]
      ⇒ t′ is a substring of P that matches the suffix t; furthermore, characters y and z are distinct.

  14. Idea behind the strong good suffix rule (continued)
      [Figure: P before and after the shift; after the shift, t′ lies below t in T.]
      ⇒ Shift P so that t′ now aligns with t in T; since characters y and z are distinct, z actually has a chance of matching x.

  15. Remarks
      It can be shown that the pre-processing to calculate the index for the bad character rule and the strong suffix rule can be done in linear time w.r.t. the size of the pattern.
      The resulting algorithm runs in expected linear time w.r.t. the size of the database.
      The Boyer-Moore method can be extended so that in the worst case it also runs in linear time.
      Other well-known algorithms are Knuth-Morris-Pratt, Apostolico-Giancarlo and Aho-Corasick, to name a few.

  16. Suffix Trees (ST)
      With suitable extensions, Boyer-Moore and other exact string matching algorithms run in linear time with respect to the size of the database (T, i.e. text), and their preprocessing necessitates on the order of |P| operations.
      Suffix tree algorithms run in linear time with respect to the size of the query (P, i.e. pattern) but necessitate O(|T|) preprocessing time/space.
      In most applications, |P| << |T|.
      Search is done in O(|P|), i.e. independent of the size of the database (!), once the preprocessing has been done.
      ST have many more applications, such as finding the longest common substring of two strings or finding the longest repeat. More later.

  21. Substring problem
      Original problem. “One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T”.
      In practice, the preprocessing takes time and necessitates a lot of disk space; it is therefore used in situations where the database is static and the queries are frequent.
      The preprocessing requires O(m) memory; however, the constant can be as large as a hundred, with the best known implementation (and most complex one) requiring 28 bytes per input byte, i.e. the suffix tree of a 3 Gbytes string would require 84 Gbytes.

  24. Chronology
      Weiner (1973): first linear time algorithm for constructing suffix trees. Declared “algorithm of the year” by Knuth.
      McCreight (1976): presents a simpler algorithm which is also more space efficient.
      Ukkonen (1995): this linear algorithm also allows for online left-to-right processing and is conceptually easier to understand than the previous two methods (method of choice).
      Recent developments (last 10–15 years) with suffix arrays imply that suffix trees are mainly used as conceptual and/or didactic tools.
      ⇒ Our discussion follows Gusfield (1997).

  25. A related topic: Trie, keyword-tree, A+-tree
      The name trie comes from the word retrieval (re-trie-val).
      A trie is a multi-way tree used to store strings (or key values of varying sizes).
      A trie is built in such a way that all the strings sharing a common prefix are represented with a single path from the root to an internal node representing the prefix, and all the descendants of this node represent all the possible suffixes.

  26. A related topic: Trie, keyword-tree, A+-tree (continued)
      Here is the trie for the words: a, an, al, all, bi and bio.
      [Figure: trie whose root has the edges a and b; below a are the edges n and l, with a further edge l below l; below b is the edge i, followed by the edge o.]

  27. A+-tree
      Given an alphabet, A, an A+-tree is a finite rooted tree such that:
      1. the edges of the tree are labelled with non-empty strings over A;
      2. the labels of the outgoing edges of a node all start with a different letter.
      Corollary: all internal nodes have up to |A| + 1 children; one child for each letter of the alphabet plus one to represent the end of a string.
      In an A+-tree, the nodes are allowed to have a single child.
      Given a trie, to determine if a string occurs in the tree, it suffices to find a path from the root to a leaf such that the concatenation of the labels spells out the string.
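The trie lookup described above can be sketched in Java as follows. This is a minimal illustration with one character per edge; instead of an explicit end-of-string child it uses a boolean flag on the node, which is an implementation choice made here and not something the slides prescribe. The names `Trie`, `insert`, `contains` and `hasPrefix` are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative trie with one character per edge (names are not from the slides). */
final class Trie {
    private static final class Node {
        // Outgoing edges; no two edges of a node start with the same letter.
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfWord; // plays the role of the end-of-string marker
    }

    private final Node root = new Node();

    /** Adds word to the trie, sharing the path of any common prefix. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    /** True if word was inserted as a complete string. */
    boolean contains(String word) {
        Node node = walk(word);
        return node != null && node.endOfWord;
    }

    /** True if some inserted string starts with prefix. */
    boolean hasPrefix(String prefix) {
        return walk(prefix) != null;
    }

    /** Follows the path spelled by s from the root; null if the path does not exist. */
    private Node walk(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }
}
```

After inserting a, an, al, all, bi and bio, `contains("al")` holds, while `contains("b")` does not even though `hasPrefix("b")` does, since b is only a prefix of stored words.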

  28. Observation
      A string S occurs at position i of T iff S is a prefix of the i-th suffix of T.
      The suffixes of T = mississippi:
       1  mississippi
       2  ississippi
       3  ssissippi
       4  sissippi
       5  issippi
       6  ssippi
       7  sippi
       8  ippi
       9  ppi
      10  pi
      11  i
      E.g. there are two occurrences of issi, at positions 2 and 5.
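The observation translates directly into a (deliberately inefficient) Java sketch: enumerate the suffixes of T and test whether S is a prefix of each. The point is only to illustrate the equivalence that suffix trees exploit; the names `SuffixPrefixOccurrences` and `occurrences` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative check of: S occurs at position i of T iff S is a prefix of the i-th suffix. */
final class SuffixPrefixOccurrences {
    /** Returns the 1-based positions i such that s is a prefix of the i-th suffix of t. */
    static List<Integer> occurrences(String s, String t) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 1; i <= t.length(); i++) {
            String suffix = t.substring(i - 1); // the i-th suffix, T[i..|T|]
            if (suffix.startsWith(s)) {
                positions.add(i);
            }
        }
        return positions;
    }
}
```

With S = issi and T = mississippi, the method returns the positions 2 and 5 listed on the slide.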

  29. Suffix Tree
      A suffix tree is a (PATRICIA*) trie that 1) contains all the suffixes of a given string S and 2) is compact.
      An A+-tree is compact if all the nodes are branching nodes (2 or more successors) or leaves; except for the root, which is allowed to have a single successor.
      The concatenation of all the arc labels from the root to a leaf constitutes a suffix of the string S. By traversing the tree it is possible to enumerate all the suffixes of the string S.
      Nodes with a single descendant can be removed; the incoming and outgoing arcs are also removed and replaced by a new edge whose label is the concatenation of the two labels.
      * Practical Algorithm to Retrieve Information Coded in Alphanumeric

  34. Suffix Tree for xabxac
      A suffix tree is a data structure to hold all the suffixes of T.
      The suffixes of T = xabxac (positions 1 to 6):
      1  xabxac
      2  abxac
      3  bxac
      4  xac
      5  ac
      6  c
      [Figure: suffix tree for xabxac. The root has four outgoing edges: xa, leading to a node with the edges bxac (leaf 1) and c (leaf 4); a, leading to a node with the edges bxac (leaf 2) and c (leaf 5); bxac (leaf 3); and c (leaf 6).]

  35. Definitions
      A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.
      Edges are labeled with (non-empty) substrings of S.
      No two edges out of a node can have edge-labels beginning with the same character.
      For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, i.e. S[i..m].

  39. The Need for a Terminator (xabxa)
      [Figure: tree for xabxa, in which leaves 4 and 5 hang from edges labeled ε.]
      In the tree for xabxa, xa and a are two suffixes that are each a prefix of another suffix, which means that to insert them in the tree we would have to have empty labels, denoted by ε, and this would violate our definition of a suffix tree.

  40. Suffix tree for xabxac
      [Figure: suffix tree for xabxac, with leaves numbered 1 to 6.]
      ∀ i, the concatenation of all edge labels from the root to leaf i spells out the suffix that starts at position i, S[i..m], where m = |S|.
      If one suffix of S matches a prefix of another suffix of S, then no suffix tree can be built. To circumvent the problem, a termination character (a symbol which is not part of Σ) is added to the end of S, i.e. S$.
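The condition that makes a terminator necessary can be tested directly. The Java sketch below checks, by brute force, whether some suffix of S is a prefix of another (longer) suffix; the name `needsTerminator` is an illustrative choice, and the quadratic scan is only meant for small examples.

```java
/** Illustrative brute-force check (names are not from the slides). */
final class TerminatorCheck {
    /** True if some suffix of s is also a prefix of another, longer suffix of s. */
    static boolean needsTerminator(String s) {
        for (int i = 1; i < s.length(); i++) {
            String shorter = s.substring(i); // suffix starting at 0-based index i
            for (int j = 0; j < i; j++) {
                // Does the longer suffix starting at index j begin with `shorter`?
                if (s.startsWith(shorter, j)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For xabxa the check succeeds (xa is a prefix of xabxa), whereas for xabxac and for xabxa$ it fails, which is why appending a symbol outside Σ resolves the problem.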

  41. Using ST: Find all occurrences of P in T
      Propose an algorithm for finding all the occurrences of P in T, once a suffix tree of T has been built.
      What is the time complexity of your algorithm?

  42. Using ST: Find all occurrences of P in T (continued)
          { Initialization }
          Let n = |P| and m = |T|
          Build a suffix tree ST for T in O(m)
          { Search stage }
          i := 1;
          while i ≤ n and match P(i) in ST do
              i := i + 1;
          od;
          if i ≤ n then
              report failure, P does not appear anywhere in T
          else
              report success, every leaf below the point of the last match is numbered with a starting location of P in T.
          fi;

  43. Example: finding all xa's in xabxac
      [Figure: suffix tree for xabxac (positions 1 to 6); matching xa from the root ends at the node whose subtree contains leaves 1 and 4.]

  44. Features
      If P occurs in T, then it ought to be a prefix of a suffix of T.
      The path is unique because there are no two edges out of a node starting with the same letter; thus each branching decision is unique.
      To further report all occurrences requires traversing the subtree; this necessitates time proportional to the number of occurrences, k, and is independent of the size of the labels leading to those k leaves.
      The topology of a suffix tree is unique; in other words, the suffix trees produced by any two algorithms should be identical, except for the order of the children.
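The search procedure from the preceding slides can be sketched in Java under a strong simplification: instead of a compact suffix tree built in linear time, the code below builds an uncompacted suffix trie (one character per edge) by inserting every suffix of T$ naively, which takes quadratic time and space. Because the trie keeps nodes with a single child, collecting the leaves is also not bounded by the number of occurrences k as it is in a compact tree. All names (`NaiveSuffixTrie`, `findAll`, `leafNumber`) are illustrative, and `$` is assumed not to occur in the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative, uncompacted suffix trie; not a linear-time suffix tree. */
final class NaiveSuffixTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int leafNumber = 0; // 1-based start of the suffix ending here; 0 if not a leaf
    }

    private final Node root = new Node();

    /** Inserts the suffixes of text + '$' that start at positions 1..|text|. */
    NaiveSuffixTrie(String text) {
        String s = text + "$"; // terminator: no suffix is a prefix of another
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < s.length(); j++) {
                node = node.children.computeIfAbsent(s.charAt(j), k -> new Node());
            }
            node.leafNumber = i + 1;
        }
    }

    /** Returns the sorted 1-based starting positions of pattern in the text. */
    List<Integer> findAll(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) { // match P(i) from the root
            node = node.children.get(pattern.charAt(i));
            if (node == null) {
                return new ArrayList<>(); // failure: P does not appear in T
            }
        }
        List<Integer> positions = new ArrayList<>();
        collectLeaves(node, positions); // every leaf below the last match
        Collections.sort(positions);
        return positions;
    }

    private static void collectLeaves(Node node, List<Integer> out) {
        if (node.leafNumber > 0) {
            out.add(node.leafNumber);
        }
        for (Node child : node.children.values()) {
            collectLeaves(child, out);
        }
    }
}
```

For T = xabxac, `new NaiveSuffixTrie("xabxac").findAll("xa")` reports the positions 1 and 4, matching the example of finding all xa's in xabxac.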

  45. Example: building a suffix tree. S = xabxac, with positions 1 to m. [Figure, animated over slides 45 to 58: the suffixes of S are inserted one at a time, for i = 1 to 6. The resulting tree has four edges out of the root: xa, leading to an internal node with leaf edges bxac (leaf 1) and c (leaf 4); a, leading to an internal node with leaf edges bxac (leaf 2) and c (leaf 5); bxac (leaf 3); and c (leaf 6).]

  59. Naïve algorithm to build a suffix tree in O(m^2). { Initialization } Create a new tree T and enter the single edge S[1..m]$.

  60. Naïve algorithm to build a suffix tree in O(m^2) (cont.)
      { Successively add S[i..m]$ to the growing tree T }
      for i from 2 to m do
          find the longest match for S[i..m]$ in T, starting at the root
          let S(j) be the first character that does not match
          if the mismatch occurs at a node, say w, then
              add a new child to w, with the edge labelled S[j..m]$
          else the mismatch occurs in the middle of an edge, say (u, v), then
              insert a new node w: replace (u, v) by (u, w) and (w, v), where (u, w) carries the portion of the label of (u, v) that was matched and (w, v) the remaining part
              insert a new edge (w, i) labelled S[j..m]$
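The steps above can be sketched in Java as follows. This is an illustration under stated assumptions, not the course library: the class name `NaiveSuffixTree` and its methods are hypothetical, indices are 0-based, the first suffix is inserted by the same loop instead of a separate initialization step, and each edge label is stored as a pair of indices into the text rather than as a string.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the naive O(m^2) construction; 0-based indices.
final class NaiveSuffixTree {
    private static final class Node {
        int start;            // label of the edge entering this node: s[start, end)
        int end;
        int suffixStart = -1; // >= 0 for leaves only
        final Map<Character, Node> children = new TreeMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    private final String s;
    private final Node root = new Node(0, 0);

    NaiveSuffixTree(String text) {
        if (text.indexOf('$') >= 0) {
            throw new IllegalArgumentException("text must not contain '$'");
        }
        s = text + "$"; // unique terminator: no suffix is a prefix of another
        for (int i = 0; i < s.length(); i++) {
            insertSuffix(i);
        }
    }

    private void insertSuffix(int i) {
        Node node = root;
        int j = i; // next unmatched position of the suffix
        while (true) {
            Node child = node.children.get(s.charAt(j));
            if (child == null) { // mismatch at a node: add a new leaf
                addLeaf(node, j, i);
                return;
            }
            int k = child.start;
            while (k < child.end && s.charAt(k) == s.charAt(j)) {
                k++;
                j++;
            }
            if (k < child.end) { // mismatch inside an edge: split it at k
                Node w = new Node(child.start, k);
                node.children.put(s.charAt(w.start), w);
                child.start = k;
                w.children.put(s.charAt(k), child);
                addLeaf(w, j, i);
                return;
            }
            node = child; // edge fully matched, keep descending
        }
    }

    private void addLeaf(Node parent, int from, int suffixStart) {
        Node leaf = new Node(from, s.length());
        leaf.suffixStart = suffixStart;
        parent.children.put(s.charAt(from), leaf);
    }

    // True if pattern occurs in the text, i.e. is a prefix of some suffix.
    boolean contains(String pattern) {
        Node node = root;
        int j = 0;
        while (j < pattern.length()) {
            Node child = node.children.get(pattern.charAt(j));
            if (child == null) {
                return false;
            }
            for (int k = child.start; k < child.end && j < pattern.length(); k++, j++) {
                if (s.charAt(k) != pattern.charAt(j)) {
                    return false;
                }
            }
            node = child;
        }
        return true;
    }

    int nodeCount() {
        return count(root);
    }

    private static int count(Node node) {
        int total = 1;
        for (Node child : node.children.values()) {
            total += count(child);
        }
        return total;
    }

    public static void main(String[] args) {
        NaiveSuffixTree tree = new NaiveSuffixTree("xabxac");
        System.out.println(tree.contains("bxa")); // true
        System.out.println(tree.contains("xb"));  // false
        System.out.println(tree.nodeCount());     // 10
    }
}
```

For xabxac$ the sketch produces 7 leaves and 3 internal nodes (the root and the nodes reached by xa and a). Each insertion may rescan up to m characters, which is where the quadratic running time comes from.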

  61. Example: rococo

  62. Exercises. Build by hand a suffix tree for some of these words: molecule, allele, rococo, tarantara, tartar, repetitive, murmurs, mathematic, banana and monotonous.

  63. Size of the tree: memory usage. The total number of nodes is O(|S|). Consider the naïve algorithm. The n = |S| suffixes are added one by one. When a suffix is added, it forces the creation of at most one internal node (sometimes, the algorithm only needs to add a branch out of an existing node). Therefore, the maximum number of internal nodes is O(|S|), and the number of leaves is O(|S|). Hence, the total number of nodes is O(|S|). Since the total number of nodes is O(|S|), does this mean that the space requirement will be O(|S|) also? The presentation suggests that the labels on the arcs of the tree are strings themselves. Since there are O(|S|) edges, each of them labelled with a string up to O(|S|) long, this implies O(|S|^2) space! [ Hackers, can you do better? ]
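As a worked count, assume that n = |S| includes the terminator, so that the tree has exactly n leaves and every internal node has at least two children. The last line takes the worst case for explicit string labels, a string whose characters are all distinct, where the root has n leaf edges labelled with the suffixes of lengths n, n − 1, ..., 1:

```latex
\begin{align*}
\#\text{leaves} &= n, \\
\#\text{internal nodes} &\le n - 1, \\
\#\text{nodes} &\le 2n - 1 = O(n), \\
\text{total label length} &= \sum_{i=1}^{n} i = \frac{n(n+1)}{2} = \Theta(n^2).
\end{align*}
```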

  72. Edge label compression. We know that the total number of nodes for the suffix tree of a terminated string S is at most 2|S| − 1, and therefore the number of edges is at most 2|S| − 2. Representing each edge label with two numbers, the start and end positions of the label within the original string, allows us to use a constant amount of space per label, and therefore the space requirement is linear. In practice, this can cause a lot of paging if the string and the tree cannot be fitted together in main memory.
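A minimal sketch of this representation, assuming 1-based inclusive start and end positions as in the lecture notes; the class name `CompressedLabel` and its methods are hypothetical and are not part of the course library.

```java
// Hypothetical sketch: an edge label stored as two positions into the text.
final class CompressedLabel {
    final int start; // 1-based position of the first character of the label
    final int end;   // 1-based position of the last character (inclusive)

    CompressedLabel(int start, int end) {
        this.start = start;
        this.end = end;
    }

    int length() {
        return end - start + 1;
    }

    // Recovers the label on demand; Java strings are 0-based, hence start - 1.
    String resolve(String text) {
        return text.substring(start - 1, end);
    }

    public static void main(String[] args) {
        String text = "monotonous$";
        System.out.println(new CompressedLabel(5, 11).resolve(text)); // tonous$
        System.out.println(new CompressedLabel(9, 11).resolve(text)); // us$
    }
}
```

Each label costs two integers regardless of its length, but resolving it requires access to the original string, which is why the string and the tree need to stay in memory together.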

  73. [Figure: suffix tree for monotonous$ (positions 1 to 11), rooted at r, with leaves 1 to 11 and edge labels written as strings: monotonous$, o, no, tonous$, s$, us$ and $.]

  74. [Figure: the same tree with each edge label replaced by a pair of numbers, read here as first position and length: 1,11 for monotonous$, 2,1 for o, 3,2 for no, 5,7 for tonous$, 10,2 for s$, 9,3 for us$ and 11,1 for $.]

  75. The path label of a node is the concatenation of all the edge labels on the unique path from the root to that node, e.g. path-label(z) = TTA. [Figure: suffix tree for CATTATTAGGA$, with root r, internal nodes u, v, w, x, y and z, and leaves 1 to 12.]

  76. The string depth of a node is the length of its path label, e.g. string-depth(z) = length(path-label(z)) = 3. [Figure: suffix tree for CATTATTAGGA$, as on the previous slide.]

  77. Linear Time Construction. Several algorithms exist for the linear time construction of suffix trees. Ukkonen’s algorithm is often considered the method of choice. Ukkonen (1995): this linear algorithm also allows for online left-to-right processing and is conceptually easier to understand than the two earlier linear-time methods (Weiner, 1973; McCreight, 1976). The presentation of the linear time algorithm is beyond the scope of the course. The library presented in the next few slides has an implementation of Ukkonen’s algorithm. Note: in the lecture notes, the first index of a string is 1, the convention used in most textbooks and publications, but in the Java implementation the first index is 0.

  78. Suffix Tree Library. On the course web site, you will find a Java library that implements a suffix tree data structure. It was developed by Daniela Cernea in 2003 for her Honours project. The next slides present an overview of the classes involved.

  79. Suffix Tree Library. Creating a suffix tree:
          SuffixTree tree = new SuffixTree("acgt");
      where "acgt" is the alphabet. For adding strings to the tree, we will need a builder:
          TreeBuilder builder = new TreeBuilder(tree);
      The method addToken is used to insert a string into an existing tree:
          builder.addToken("cattattagga");

  81. [Figure sequence, slides 81 to 85: the library's internal representation of the suffix tree for tamtam$ (positions 0 to 6). The root has four edges, tam, am, m and $; each of tam, am and m leads to an internal node with two leaf edges, tam$ and $. Each leaf carries the coordinate (start position, 0 to 6) of its suffix. Nodes are linked through firstChild and rightSybling references. Edge labels are stored as leftIndex,length pairs: 0,3 for tam, 1,2 for am, 2,1 for m, 6,1 for $ and 3,4 for tam$.]
