csi5126 algorithms in bioinformatics

CSI5126. Algorithms in bioinformatics. Suffix Trees. Marcel Turcotte

Sections: Preamble, Repeats, Generalized Suffix Tree, More Repeats, LCA, LCE.


  2. Naïve algorithm. Imagine an algorithm to find the longest repeated substring without using a suffix tree. What is its time complexity: \(O(n^4)\), \(O(n^3)\), \(O(n^2)\), \(O(n)\)?

  3. Longest repeated substring. Let \(C[i, j]\) be the length of the longest common extension of the suffixes \(i\) and \(j\) of \(S\). Clearly, the largest \(C[i, j]\) value is the solution to the longest repeat problem.

  4. Base conditions. Let \(C[i, |S|] = 1\) if \(S(i) = S(|S|)\), and \(C[i, |S|] = 0\) if \(S(i) \neq S(|S|)\), for \(1 \le i < |S|\). For S = mississippi, the last column of the table (the final character i) is therefore 0, 1, 0, 0, 1, 0, 0, 1, 0, 0 for the rows m, i, s, s, i, s, s, i, p, p.

  5. General case. \(C[i, j] = 0\) if \(S(i) \neq S(j)\), and \(C[i, j] = 1 + C[i+1, j+1]\) if \(S(i) = S(j)\), for \(1 \le i < j < |S|\). Table for S = mississippi (row \(i\), column \(j\), upper triangle only):

        m  i  s  s  i  s  s  i  p  p  i
     m  -  0  0  0  0  0  0  0  0  0  0
     i     -  0  0  4  0  0  1  0  0  1
     s        -  1  0  3  1  0  0  0  0
     s           -  0  1  2  0  0  0  0
     i              -  0  0  1  0  0  1
     s                 -  1  0  0  0  0
     s                    -  0  0  0  0
     i                       -  0  0  1
     p                          -  1  0
     p                             -  0
     i                                -
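As a rough illustration, the recurrence for C[i, j] can be filled in starting from the bottom-right corner of the table. The sketch below is a minimal Java version under assumptions that are not in the slides: the names `LongestRepeatDP` and `longestRepeatedSubstring` are made up for this example, indices are 0-based, and an extra row and column of zeros plays the role of the base conditions. It uses quadratic time and space.

```java
class LongestRepeatDP {
    // Returns one longest substring that occurs at least twice in s
    // (occurrences may overlap), or "" if no character repeats.
    static String longestRepeatedSubstring(String s) {
        int n = s.length();
        // c[i][j] = length of the longest common extension of suffixes i and j
        // (0-based). Row n and column n stay at 0 and act as the boundary.
        int[][] c = new int[n + 1][n + 1];
        int bestLen = 0;
        int bestStart = 0;
        for (int i = n - 1; i >= 0; i--) {
            for (int j = n - 1; j > i; j--) {
                if (s.charAt(i) == s.charAt(j)) {
                    c[i][j] = 1 + c[i + 1][j + 1];
                    if (c[i][j] > bestLen) {
                        bestLen = c[i][j];
                        bestStart = i;
                    }
                }
                // otherwise c[i][j] keeps its default value 0
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }
}
```

For S = mississippi, `LongestRepeatDP.longestRepeatedSubstring("mississippi")` returns `issi`, which corresponds to the entry 4 in the table.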

  6. Exercise (easy). Solve the longest common substring problem using dynamic programming. Problem: given two strings, S and T, as input, find the longest substrings that are common to both S and T.
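One possible approach to this exercise, offered only as a sketch, reuses the same kind of table with rows indexed by positions of S and columns by positions of T. The names `LongestCommonSubstringDP` and `longestCommonSubstring` are hypothetical, indices are 0-based, and only one longest common substring is returned even when several exist.

```java
class LongestCommonSubstringDP {
    // Returns one longest substring common to s and t, or "" if they share no character.
    static String longestCommonSubstring(String s, String t) {
        // c[i][j] = length of the longest common prefix of s[i..] and t[j..];
        // the extra row and column of zeros act as the boundary.
        int[][] c = new int[s.length() + 1][t.length() + 1];
        int bestLen = 0;
        int bestStart = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            for (int j = t.length() - 1; j >= 0; j--) {
                if (s.charAt(i) == t.charAt(j)) {
                    c[i][j] = 1 + c[i + 1][j + 1];
                    if (c[i][j] > bestLen) {
                        bestLen = c[i][j];
                        bestStart = i;
                    }
                }
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }
}
```

The running time is proportional to |S| × |T|, so this is a baseline to compare against rather than a linear-time method.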

  7. Suffix tree-based algorithm. Outline a suffix tree-based algorithm for finding repeats. What characterizes a repeat? [Figure: suffix tree for CATTATTAGGA$, with root r, internal nodes u, v, w, x, y, z, and leaves 1 to 12.]

  8. Definition. Let's define a branching node, sometimes called a fork, as a node having two or more children. The path-label of a node is the concatenation of all the edge labels along the path from the root to the node. [Figure: a fork with path-label δ whose outgoing edges start with a and b and lead to leaves i and j; the corresponding occurrences δa and δb are shown in S.]

  9. Finding the longest repeated substring. Let's define a branching node, sometimes called a fork, as a node having two or more children. It suffices to traverse the tree and find a node that 1) is a fork and 2) has the longest path-label. Finding the longest repeated substring takes \(O(|T|)\).

  10. [Figure: suffix tree for the string xabxac (positions 1 to 6), with leaves labelled 1 to 6.]

  11. [Figure: suffix tree for tamtam$ (positions 0 to 6), with leaves labelled 0 to 6. Each edge label is stored as a (start, length) pair: tam = (0,3), am = (1,2), m = (2,1), tam$ = (3,4), $ = (6,1).]

  12. Java class Annotation:

        public class Annotation implements Info {

            private int pathLength;
            private Info next;

            Annotation(int pathLength) { this.pathLength = pathLength; }

            public Info getNextInfo() { return next; }
            public void setNextInfo(Info next) { this.next = next; }
            public int getPathLength() { return pathLength; }

            // Annotates every node below the root with the length of its path-label.
            public static void addPathLength(SuffixTree tree) {
                InternalNode root = (InternalNode) tree.getRoot();
                if (root != null)
                    addPathLength(0, (NodeInterface) root.getFirstChild());
            }

            // prefix is the path length of the parent of node.
            private static void addPathLength(int prefix, NodeInterface node) {
                if (node == null)
                    return;
                int pathLength = prefix + node.getLength();
                node.setInfo(new Annotation(pathLength));
                if (node instanceof InternalNode)
                    addPathLength(pathLength, (NodeInterface) ((InternalNode) node).getFirstChild());
                // Siblings share the same parent, hence the same prefix.
                addPathLength(prefix, (NodeInterface) node.getRightSybling());
            }
        }

  13. Longest repeated substring algorithm. Build a suffix tree for S, the input string. Perform a top-down traversal of the tree, adding path-label information to each node. Record the longest path-label so far. Report the longest path-label recorded.
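The four steps above can be tried out on small inputs with the following sketch. It makes a deliberate simplification that is not in the slides: instead of a compressed suffix tree built in linear time, it builds an uncompressed suffix trie naively in quadratic time, which has the same forks and the same path-labels. The names `NaiveSuffixTrie` and `longestRepeatedSubstring` are hypothetical, and the input is assumed not to contain the terminator `$`.

```java
import java.util.Map;
import java.util.TreeMap;

class NaiveSuffixTrie {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
    }

    // Returns the path-label of a deepest fork, i.e. one longest repeated substring.
    static String longestRepeatedSubstring(String s) {
        String text = s + "$"; // the terminator makes every suffix end at a leaf
        Node root = new Node();
        // Step 1: insert every suffix, one character per edge.
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int k = start; k < text.length(); k++) {
                node = node.children.computeIfAbsent(text.charAt(k), c -> new Node());
            }
        }
        // Steps 2 to 4: top-down traversal that tracks the current path-label.
        String[] best = {""};
        traverse(root, new StringBuilder(), best);
        return best[0];
    }

    private static void traverse(Node node, StringBuilder path, String[] best) {
        // A fork has two or more children; keep the longest path-label seen so far.
        if (node.children.size() >= 2 && path.length() > best[0].length()) {
            best[0] = path.toString();
        }
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            path.append(edge.getKey());
            traverse(edge.getValue(), path, best);
            path.setLength(path.length() - 1);
        }
    }
}
```

On the string CATTATTAGGA from the earlier figure, this returns `ATTA`. The trie can hold a quadratic number of nodes, so the sketch only illustrates the traversal, not the linear-time bound.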

  17. Generalized suffix tree. To find the longest common substring of a set of strings, we need to introduce the concept of a generalized suffix tree. A generalized suffix tree represents all the suffixes of a set of strings \(\{S_1, S_2, \ldots, S_k\}\). In the suffix tree for a single sequence, leaves are labeled with the starting position of the suffix within the string. In a generalized suffix tree, the leaves are labeled with a tuple, with a first index indicating the string this suffix belongs to, 1..k, and the second index indicating the starting position. Because some of the k strings might have a common suffix, some leaves might contain more than one tuple. Alternatively, a unique terminator can be appended to each string so that a leaf designates a unique suffix.

  20. Generalized suffix tree: an example. \(S_1\) = axbaxb and \(S_2\) = bxbab. [Figure: generalized suffix tree for \(S_1\) and \(S_2\); the leaves are labelled (1,1) to (1,6) for the suffixes of \(S_1\) and (2,1) to (2,5) for the suffixes of \(S_2\).]

  21. (Generalized) Substring Problem. Definition. A set of strings, or database, is known in advance and fixed. After spending a linear amount of time pre-processing the input database, the algorithm will be presented with a collection of strings and, for each string, the algorithm should be able to tell whether the string is present in one or more strings from the input.

  22. Application. DNA identification. The U.S. army sequences a portion of the DNA of each member of its personnel. The sequence is selected so that 1) it is easy to retrieve that exact sequence and 2) it is unique to each individual. In the case of a severe casualty, this particular DNA sequence can be used to uniquely identify a person. Solution. A generalized suffix tree is built that contains all the input sequences. This takes time proportional to the sum of the lengths. Identifying a person takes time proportional to the length of the sequence identifier. The solution would also work if the sequence identifier can only be partially identified (in extreme cases).
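The database lookup described above can be sketched as follows. This is only an illustration under stated assumptions: it uses an uncompressed generalized suffix trie built in quadratic time rather than a linear-time generalized suffix tree, every node stores the ids of the strings that pass through it (which is memory hungry), and the names `SubstringIndex`, `add` and `find` are hypothetical. What it does preserve is the query cost, which is proportional to the length of the pattern.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubstringIndex {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Ids (1, 2, ...) of the database strings containing this node's path-label.
        final Set<Integer> owners = new TreeSet<>();
    }

    private final Node root = new Node();
    private int count = 0;

    // Pre-processing: insert every suffix of s, tagging visited nodes with the id of s.
    void add(String s) {
        int id = ++count;
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int k = start; k < s.length(); k++) {
                node = node.children.computeIfAbsent(s.charAt(k), c -> new Node());
                node.owners.add(id);
            }
        }
    }

    // Query: ids of the database strings that contain the (non-empty) pattern.
    Set<Integer> find(String pattern) {
        Node node = root;
        for (int k = 0; k < pattern.length(); k++) {
            node = node.children.get(pattern.charAt(k));
            if (node == null) {
                return new TreeSet<>(); // the pattern occurs in no database string
            }
        }
        return new TreeSet<>(node.owners);
    }
}
```

Because any substring of a database string is a prefix of one of its suffixes, a partially recovered identifier can be looked up in the same way as a complete one.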

  23. Longest Common Substring (LCS). Finding the longest common substring of a set of strings is a recurring problem, and one which has many applications in bioinformatics. The longest common substring of \(S_1\) = axbaxb and \(S_2\) = bxbab is xba. In 1970, Donald Knuth conjectured that it would be impossible to find a linear time algorithm to solve this problem. This problem can be elegantly solved in \(O(|S_1| + |S_2|)\) using generalized suffix trees. How?

  24. Longest Common Substring Algorithm. Let's consider the case of two sequences; the generalization to k strings is trivial.
      1. Construct a generalized suffix tree for \(S_1\) and \(S_2\);
      2. In linear time, traverse the tree and label each node with (1), (2) or (1,2) if the subtree underneath the node contains only leaves from the first string, only leaves from the second string, or a mixture of the two (hint: use a bottom-up traversal);
      3. In linear time, find the node such that 1) it is labeled (1,2) and 2) it has the longest path-label.
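The three steps above can be sketched in Java as follows. As an assumption made for brevity, the sketch uses an uncompressed generalized suffix trie built in quadratic time instead of a linear-time generalized suffix tree, and it encodes the labels (1), (2) and (1,2) as the bit masks 1, 2 and 3. The names `GeneralizedSuffixTrie`, `addString`, `label` and `longestCommonSubstring` are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class GeneralizedSuffixTrie {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int owners; // bit 0 set: suffix of string 1 below; bit 1 set: suffix of string 2 below
    }

    private final Node root = new Node();
    private String best = "";

    // Step 1: insert every suffix of s and mark the node where the suffix ends.
    void addString(String s, int id) {
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int k = start; k < s.length(); k++) {
                node = node.children.computeIfAbsent(s.charAt(k), c -> new Node());
            }
            node.owners |= 1 << (id - 1);
        }
    }

    // Steps 2 and 3: bottom-up labelling; a node labelled (1,2) has owners == 3.
    private int label(Node node, StringBuilder path) {
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            path.append(edge.getKey());
            node.owners |= label(edge.getValue(), path);
            path.setLength(path.length() - 1);
        }
        if (node.owners == 3 && path.length() > best.length()) {
            best = path.toString();
        }
        return node.owners;
    }

    static String longestCommonSubstring(String s1, String s2) {
        GeneralizedSuffixTrie trie = new GeneralizedSuffixTrie();
        trie.addString(s1, 1);
        trie.addString(s2, 2);
        trie.label(trie.root, new StringBuilder());
        return trie.best;
    }
}
```

`GeneralizedSuffixTrie.longestCommonSubstring("axbaxb", "bxbab")` returns `xba`. Extending the mask to k bits would give the set of strings below each node, although that simple extension is no longer linear in the total input length.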

  25. Longest Common Substring. [Figure: generalized suffix tree for \(S_1\) = axbaxb and \(S_2\) = bxbab, with leaves labelled (1,1) to (1,6) and (2,1) to (2,5).] ⇒ The node with prefix xba is the deepest node (longest path-label) that has descendants in both strings.

  26. DNA contamination problem. A host organism can be used to store foreign DNA molecules. Clone library. A foreign DNA segment can be inserted in a host organism in a way that makes it easy to retrieve the segment for later use. The host will be selected for its ability to replicate rapidly, yeast for example, and therefore to make an endless number of copies of the original information.

  27. DNA contamination problem. It sometimes occurs that the retrieved segments are contaminated with DNA from the host. The DNA contamination problem consists in finding all the substrings that are common to the host, \(S_1\), and the segment, \(S_2\), and are at least \(l\) nucleotides long.

  28. DNA contamination problem. Solution: build a generalized suffix tree for \(S_1\) and \(S_2\). Traverse the tree and annotate all the nodes whose subtree contains leaves from both sequences; this takes a linear amount of time. Traverse the tree again and, for each node annotated with both 1 and 2 whose path-label is at least \(l\) characters long, print the string and its locations; this traversal also takes a linear amount of time.

  29. String Repeats. Repetitive sequences (strings) constitute a large fraction of the genomes. Transposable elements represent: 35.0–50% of the Homo sapiens genome (human); 50.0% of Zea mays (maize, corn); 15.0% of Drosophila melanogaster (fruit fly); 2.0% of Arabidopsis thaliana (a flowering plant); 1.8% of Caenorhabditis elegans (a nematode, round worm); 3.1% of Saccharomyces cerevisiae (baker's yeast). ⇒ Certain repeats have been related to diseases, regulation and molecular evolution.

  30. Human Genome Organization. [Figure: hierarchy of the human genome]
      Human genome (3 Gb)
        Genes (900 Mb)
          Coding DNA (90 Mb)
          Noncoding DNA (810 Mb): pseudogenes; gene fragments; introns, leaders and trailers
        Extragenic DNA (2.1 Gb)
          Unique and low copy number (1.6 Gb)
          Repetitive DNA (420 Mb)
            Tandem repeats: satellite, minisatellite, microsatellite
            Interspersed: LTRs, LINEs, SINEs, transposons

  31. Sequence repeats: classification. Satellites: located near the centromeres or telomeres, up to one million bp long. Microsatellite: 2 to 5 bp, 100 copies, found at the end of the eukaryotic chromosomes (telomeres); in humans, hundreds of copies of TTAGGG. Minisatellite: up to 25 bp, 30 to 2,000 copies.

  32. Sequence repeats: classification. Transposable elements: sequences that have the ability to move from one location of the genome to another. They play an important role in evolution and are classified according to their mechanism of transposition.

  33. class I: RNA mediated. SINES: short interspersed nuclear elements, 80-300 bp; one particular family is called Alu; in the human genome, there are 1.2 million copies (10%) (other sources say 300,000 copies, i.e. ca. 5% of the genome). LINES: long interspersed nuclear elements, 6-800 Kbp; one particular family is called LINE1; the human genome contains 593,000 copies (14.6%). Long terminal repeat (LTR): retrotransposons (related to retroviruses). class II: DNA mediated. Human genome has ca. 200,000 copies of elements of this type.

  34. class III: has features of class I and class II. MITEs (miniature inverted repeat transposable elements): 400 bp, discovered in flowering plants, frequently associated with regulatory regions of genes.

  35. Sequence repeats.

        >ALU Human ALU interspersed repetitive sequence - a consensus.
        ggccgggcgcggtggctcacgcctgtaatcccagcactttgggaggccgaggcgggaggatcacttgagc
        ccaggagttcgagaccagcctgggcaacatagtgaaaccccgtctctacaaaaaatacaaaaattagccg
        ggcgtggtggcgcgcgcctgtagtcccagctactcgggaggctgaggcaggaggatcgcttgagcccggg
        aggtcgaggctgcagtgagccgtgatcgcgccactgcactccagcctgggcgacagagcgagaccctgtc
        tcaaaaaaaa

      The Alu itself is constituted of repeats of length approx. 40. Often flanked by a tandem repeat, length 7-10, such that the left and right sequences are complementary palindromes. 300,000+ nearly, but not exactly, identical copies dispersed throughout the genome.

  39. Finding all repetitive structures. For a given string of length \(n\), there are \(\Theta(n^2)\) substrings (one of length \(n\), two of length \(n-1\), three of length \(n-2\), ..., \(n\) substrings of length 1). There are therefore \(\Theta(n^4)\) possible pairs: \(8.1 \times 10^{37}\) possible pairs in the case of the human genome! We must carefully define what pairs are interesting, otherwise too many results will be returned to the user!
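The figure quoted for the human genome can be checked with a short calculation, assuming a genome length of about \(n = 3 \times 10^{9}\) (the 3 Gb given on the Human Genome Organization slide) and ignoring the constant factor hidden by the \(\Theta\) notation:

```latex
\[
\sum_{\ell=1}^{n} (n - \ell + 1) = \frac{n(n+1)}{2} = \Theta(n^2),
\qquad
n^4 = \left(3 \times 10^{9}\right)^4 = 81 \times 10^{36} = 8.1 \times 10^{37}.
\]
```

The sum counts the \(n - \ell + 1\) substrings of each length \(\ell\); pairing every substring with every other one then gives the order of \(n^4\) pairs.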

  41. Definition. A maximal pair (or maximal repeat pair) is a pair of identical substrings α and β that cannot be extended either to the left or to the right without causing a mismatch. In other words, the character to the immediate left of α is different from the one to the immediate left of β and, similarly to the right, the characters immediately following α and β are different. ⇒ A maximal pair will be denoted \((p_\alpha, p_\beta, n')\), where \(p_\alpha\) and \(p_\beta\) are the starting positions and \(n'\) their length. The set of all the maximal pairs of S will be noted \(R(S)\).

  42. Maximal pairs. Consider xyzbcdeeebcdxyzbcd. The first and second occurrences of bcd form a maximal pair, (4, 10, 3); the second and third occurrences form a maximal pair, (10, 16, 3); but occurrences one and three do not. Are the two occurrences of xyzbcd forming a maximal pair? To ensure that suffixes and prefixes can participate in maximal pairs, a terminator is added at both ends: `$xyzbcdeeebcdxyzbcd$`. Our definition does not prevent overlapping substrings, and this is fine.
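The definition can be checked directly on this example with a small brute-force test. The sketch below is an illustration only: the names `MaximalPairs`, `isMaximalPair` and `differ` are hypothetical, positions are 1-based as in the slides, and the terminators are modelled by treating any position outside the string as a character that differs from everything.

```java
class MaximalPairs {
    // True if the substrings of length len starting at 1-based positions p and q
    // are identical and cannot be extended to the left or to the right.
    static boolean isMaximalPair(String s, int p, int q, int len) {
        if (p == q || len <= 0 || p < 1 || q < 1
                || p + len - 1 > s.length() || q + len - 1 > s.length()) {
            return false;
        }
        if (!s.regionMatches(p - 1, s, q - 1, len)) {
            return false; // the two substrings are not identical
        }
        boolean leftDiffers = differ(s, p - 1, q - 1);
        boolean rightDiffers = differ(s, p + len, q + len);
        return leftDiffers && rightDiffers;
    }

    // Compares the characters at 1-based positions a and b; a position outside
    // the string stands for a terminator and is considered different.
    private static boolean differ(String s, int a, int b) {
        if (a < 1 || a > s.length() || b < 1 || b > s.length()) {
            return true;
        }
        return s.charAt(a - 1) != s.charAt(b - 1);
    }
}
```

With s = xyzbcdeeebcdxyzbcd, `isMaximalPair(s, 4, 10, 3)` and `isMaximalPair(s, 10, 16, 3)` return true, while `isMaximalPair(s, 4, 16, 3)` returns false because both occurrences are preceded by z.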

  50. . Repeats . . . . . . Preamble Repeats Generalized Suffjx Tree More Repeats LCA LCE Preamble Generalized Suffjx Tree . More Repeats LCA LCE Where to fjnd maximal pairs? Construct a suffjx tree for S . its path-label . What next? Let’s take care of the right hand side. How can you make pairs of suffjxes such that each of the two elements of the How would you take care of the left hand side? For every Still many possible pairs of suffjxes! Let’s consider a more constrained problem. Marcel Turcotte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CSI5126 . Algorithms in bioinformatics Repeats are found at internal nodes , so let V be the current internal node under consideration. Let α denote sure that α cannot be extended on the right ? Select pair is from a distinct child of V . pair of suffjxes i , j , S [ i ] − 1 ̸ = S [ j ] − 1.

  51. Where to find maximal pairs? Construct a suffix tree for S. Repeats are found at internal nodes, so let V be the current internal node under consideration. Let α denote its path-label. What next? Let's take care of the right hand side. How can you make sure that α cannot be extended on the right? Select pairs of suffixes such that each of the two elements of the pair is from a distinct child of V. How would you take care of the left hand side? For every pair of suffixes i, j, check that S[i − 1] ≠ S[j − 1]. Still many possible pairs of suffixes! Let's consider a more constrained problem.
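
The two conditions on this slide (no extension on the right, no extension on the left) can be checked directly by brute force. The sketch below is not the suffix-tree method of the slides; it is a cubic-time illustration of the definition. The function name `maximal_pairs` and the 0-based `(i, j, length)` output are assumptions made for this example.

```python
def maximal_pairs(s):
    """Return (i, j, length) for every maximal pair of s, with 0-based i < j."""
    n = len(s)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-hand side: skip the pair if both occurrences are
            # preceded by the same character (it could be extended left).
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Right-hand side: extend until the first mismatch, so the
            # match cannot be extended any further to the right.
            k = 0
            while j + k < n and s[i + k] == s[j + k]:
                k += 1
            if k > 0:
                pairs.append((i, j, k))
    return pairs


print(maximal_pairs("xabcyabcz"))  # [(1, 5, 3)]
```

In the suffix tree, the right-hand test corresponds to taking the two suffixes from distinct children of V, and the left-hand test is the same character comparison as in the code.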

  52. Maximal Unique Matches (MUMs). Algorithms to compare biological sequences (to be presented later) run in quadratic time and space. In the case of complete genomic sequences this is not feasible. To circumvent this limitation, algorithms have been developed that first find a set of MUMs, which are used as starting points (anchors) for further processing by conventional sequence alignment techniques.

  53. Maximal Unique Matches (MUMs). Given 2 sequences S1 and S2 ∈ A* and l > 0, a maximal unique match is a string u such that: |u| ≥ l; u occurs exactly once in S1 and exactly once in S2; and for all a ∈ A, neither au nor ua occurs in both S1 and S2.
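
The definition can be turned into a direct, very slow check, which may help when testing a faster implementation. This is only a sketch of the definition, not the suffix-tree algorithm; `naive_mums` and `count_occurrences` are hypothetical helper names, and the example strings are chosen for illustration.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(
        1
        for k in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, k)
    )


def naive_mums(s1, s2, min_len=1):
    """Return the sorted list of maximal unique matches of length >= min_len."""
    alphabet = set(s1) | set(s2)
    found = set()
    for i in range(len(s1)):
        for j in range(i + min_len, len(s1) + 1):
            u = s1[i:j]
            # u must occur exactly once in each sequence.
            if count_occurrences(s1, u) != 1 or count_occurrences(s2, u) != 1:
                continue
            # Neither au nor ua may occur in both sequences.
            extendable = any(
                (a + u in s1 and a + u in s2) or (u + a in s1 and u + a in s2)
                for a in alphabet
            )
            if not extendable:
                found.add(u)
    return sorted(found)


print(naive_mums("GATCG", "CTCGT"))  # ['TCG']
```

Here `TC` and `CG` are also unique in both strings, but they are rejected because `TCG` extends them in both sequences.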

  54. Maximal Unique Matches (MUMs): example sequences.
      ACAAGTCTTCTATCAGACTCCAGAAAAGTATCAGAGAGCAATGAA
      CCACACTGCCTACCAGGTGTATCAGACCCACAAGTCCTTCTTAGA

  65. Where to look for MUMs? Construct a generalized suffix tree for S1 and S2. Repeats and common substrings are found at internal nodes, so look for an internal node that has leaves from both S1 and S2 beneath it; let's call it V, and let u be its path-label. Can V have more than 2 children? No, this would mean that u occurs more than once in one or both input strings. Is it possible that there are internal nodes along one of the paths from V to a leaf? No, again it would mean that u occurs more than once in one or both input strings. So, we have that V has to be an internal node that has exactly 2 children that are leaves.

  73. Where to look for MUMs? Is it enough? No. Is it possible that u is embedded in a longer motif? In other words, that u is not maximal. u can certainly not be extended to the right. But how about the left? Yes, it is quite possible that u is in fact part of a larger motif, say au. How to check for that? The leaves beneath V contain the starting positions i and j of the string u in S1 and S2, therefore it suffices to compare S1[i − 1] and S2[j − 1]. Time and space complexity?

  74. [Figure: generalized suffix tree for two strings terminated by $ and &. Edge labels include GATCG$, CTCGT&, ATCG$ and TCGT&; each leaf is labelled with the string and start position of its suffix, S1,1 to S1,5 and S2,1 to S2,5.]

  75. all_mums(node v)
        if v is a leaf
          return
        if #children is 2
          if left child is a leaf and right child is a leaf
            set u to the path label of the path
            if the char to the left of u in S1 differs from the char to the left of u in S2 and the path is long enough then
              display mum information
          else
            all_mums(left child)
            all_mums(right child)
        else
          for each child of v
            all_mums(child)
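
The traversal on this slide can be sketched in Python. Everything below is an assumption made for illustration: the tree is built by naive quadratic suffix insertion (not a linear-time construction), `$` and `&` are the terminators, positions are 0-based, and the extra check that the two leaves come from different strings is added explicitly. The strings GATCG and CTCGT appear to be the ones drawn on the preceding figure.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first char of edge label -> (edge label, child)
        self.leaf = leaf    # (string id, start) for leaves, None otherwise


def insert_suffix(root, suffix, leaf_info):
    """Naively insert one terminated suffix into a compressed trie."""
    node, rest = root, suffix
    while True:
        first = rest[0]
        if first not in node.children:
            node.children[first] = (rest, Node(leaf_info))
            return
        label, child = node.children[first]
        k = 0
        while k < len(label) and label[k] == rest[k]:
            k += 1
        if k < len(label):
            # Mismatch inside the edge: split the edge at offset k.
            middle = Node()
            middle.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], middle)
            child = middle
        node, rest = child, rest[k:]


def build_generalized_suffix_tree(s1, s2):
    assert "$" not in s1 + s2 and "&" not in s1 + s2
    root = Node()
    for sid, text in ((1, s1 + "$"), (2, s2 + "&")):
        for start in range(len(text)):
            insert_suffix(root, text[start:], (sid, start))
    return root


def all_mums(node, u, s1, s2, min_len, out):
    """Append (u, start in s1, start in s2) to out for every MUM below node."""
    if node.leaf is not None:
        return
    kids = [child for _, child in node.children.values()]
    if len(kids) == 2 and all(kid.leaf is not None for kid in kids):
        (sid_a, a), (sid_b, b) = sorted(kid.leaf for kid in kids)
        if (sid_a, sid_b) == (1, 2) and len(u) >= min_len:
            # Left-maximal if an occurrence starts its string or the
            # characters to the left of the two occurrences differ.
            if a == 0 or b == 0 or s1[a - 1] != s2[b - 1]:
                out.append((u, a, b))
        return
    for label, child in node.children.values():
        all_mums(child, u + label, s1, s2, min_len, out)


root = build_generalized_suffix_tree("GATCG", "CTCGT")
mums = []
all_mums(root, "", "GATCG", "CTCGT", 1, mums)
print(mums)  # [('TCG', 2, 1)]
```

The node for CG also has two leaf children, one per string, but both occurrences are preceded by T, so it is rejected by the left-hand test.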

  76. Genome Alignment. The subsequent steps of a complete algorithm for the alignment of two genomic sequences involve: finding the longest sequence of MUMs occurring in the same order in the two sequences; applying an alignment algorithm (to be presented later) to the pairs of regions in between two consecutive MUMs.
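
The first step is an instance of the longest increasing subsequence problem: sort the MUMs by their position in the first sequence, then keep the longest chain whose positions in the second sequence are increasing. The sketch below assumes each MUM is given as a hypothetical `(pos1, pos2, length)` tuple, and it ignores overlaps between consecutive MUMs, which a real implementation would have to handle.

```python
from bisect import bisect_left


def longest_ordered_chain(mums):
    """Longest list of MUMs whose positions increase in both sequences."""
    mums = sorted(mums)          # order by position in the first sequence
    tail_index = []              # tail_index[k]: last MUM of best chain of length k + 1
    tail_pos2 = []               # position in the second sequence of that MUM
    previous = [None] * len(mums)
    for idx, (_, pos2, _) in enumerate(mums):
        k = bisect_left(tail_pos2, pos2)
        if k == len(tail_index):
            tail_index.append(idx)
            tail_pos2.append(pos2)
        else:
            tail_index[k] = idx
            tail_pos2[k] = pos2
        previous[idx] = tail_index[k - 1] if k > 0 else None
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    idx = tail_index[-1] if tail_index else None
    while idx is not None:
        chain.append(mums[idx])
        idx = previous[idx]
    return chain[::-1]


example = [(0, 10, 5), (7, 2, 4), (12, 8, 6), (20, 15, 3), (30, 12, 4)]
print(longest_ordered_chain(example))
```

With the binary search this runs in O(k log k) time for k MUMs. The regions between consecutive MUMs of the chain are then the pairs handed to the alignment algorithm.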

  77. Lowest Common Ancestor. Definition. The lowest common ancestor (lca) of any two nodes of a rooted tree is the deepest node which is an ancestor* of both nodes. The lca of 5 and 7 is 6, the lca of 1 and 3 is 2, and so on. * A node u is an ancestor of a node v if u is a node that occurs on the unique path from the root to v. [Figure: rooted tree with nodes numbered 1 to 7.]

  78. Lowest Common Ancestor. How would you find the lowest common ancestor? What is the time complexity? [Figure: rooted tree with nodes numbered 1 to 7.]

  79. Lowest Common Ancestor in O(3n) = O(n) time. Using two stacks Si and Sj. Starting at node i, visit all the parent nodes until reaching the root of the tree; each visited node is pushed onto Si. Repeat the same operations starting at node j; this time, each visited node is pushed onto Sj. Whilst the top nodes are identical, pop(Si) and pop(Sj). The last identical node is the lowest common ancestor.
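
A minimal Python sketch of the two-stack procedure. The tree is represented by a hypothetical `parent` dictionary; the 7-node tree used below is an assumed shape (root 4, internal nodes 2 and 6) chosen to agree with the examples lca(5, 7) = 6 and lca(1, 3) = 2 given earlier.

```python
def lca(parent, i, j):
    """Lowest common ancestor of i and j, given a child -> parent mapping."""

    def path_to_root(v):
        stack = []
        while v is not None:
            stack.append(v)      # push the node, then move to its parent
            v = parent[v]
        return stack             # the root ends up on top of the stack

    si, sj = path_to_root(i), path_to_root(j)
    last_identical = None
    # Pop while the tops agree; the last agreeing node is the lca.
    while si and sj and si[-1] == sj[-1]:
        last_identical = si.pop()
        sj.pop()
    return last_identical


# Assumed tree: 4 is the root, 2 and 6 its children, 1, 3, 5, 7 the leaves.
parent = {4: None, 2: 4, 6: 4, 1: 2, 3: 2, 5: 6, 7: 6}
print(lca(parent, 5, 7))  # 6
print(lca(parent, 1, 3))  # 2
```

Each of the two climbs visits at most n nodes and the popping phase removes at most n nodes from each stack, hence the linear bound per query.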

  80. Lowest Common Ancestor Problem (Overview). Given an input tree with n nodes. Let's assume that n < 4,294,967,296 nodes. In the unit-cost RAM model, O(log n) bits can be read, written or used as an address in constant time. Words of 32 bits.

  81. Lowest Common Ancestor Problem (Overview). (Although not necessary) Let's also make the following assumptions. 1. O(log n) bits can be compared, added, subtracted, multiplied, or divided in constant time. 2. Bit-level operations on O(log n)-bit numbers can be performed in constant time, including AND, OR, XOR, left or right shift by up to O(log n) bits, creating masks of 1s, and finding the position of the left-most or right-most 1. It can be shown, but we will not, that after a linear amount of time pre-processing the input tree, linear w.r.t. the number of nodes, the lca of any two nodes can be found in constant time! See (Gusfield 1997) § 8.
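
For readers less familiar with the bit-level operations listed above, here is how they might look in Python. The helper names are hypothetical, and Python integers are arbitrary-precision, so the constant-time cost is a property of the RAM model with fixed-size words rather than something this code demonstrates.

```python
def rightmost_one(x):
    """0-based position (from the right) of the right-most 1 bit of x > 0."""
    return (x & -x).bit_length() - 1   # x & -x isolates the lowest set bit


def leftmost_one(x):
    """0-based position (from the right) of the left-most 1 bit of x > 0."""
    return x.bit_length() - 1


def mask_of_ones(k):
    """Integer whose k low-order bits are all 1."""
    return (1 << k) - 1


x = 0b0110
print(rightmost_one(x), leftmost_one(x))   # 1 2
print(format(mask_of_ones(4), "04b"))      # 1111
print(format(0b1100 ^ 0b1010, "04b"))      # 0110
```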

  82. LCA Algorithm: Overview. For this overview of the lca algorithm, let's consider the case of a complete rooted binary tree. This tree has p leaves and n nodes, where n = 2p − 1. [Figure: complete binary tree with 15 nodes labelled in-order, each shown with its 4-bit binary form. Root: 8 (1000). Depth 1: 4 (0100), 12 (1100). Depth 2: 2 (0010), 6 (0110), 10 (1010), 14 (1110). Leaves: 1 (0001), 3 (0011), 5 (0101), 7 (0111), 9 (1001), 11 (1011), 13 (1101), 15 (1111).]

  86. LCA Algorithm: Overview. Furthermore, consider the in-order (Left-Root-Right) labelling of the tree and its interpretation as binary numbers. How much does it cost to label this tree? O(n) time. This is the pre-processing step/time. [Figure: complete binary tree with in-order labels 1 to 15 and their 4-bit binary forms.]

  88. LCA Algorithm: Overview. The number of edges on any path from the root to any leaf is d = log2 p. Let's now interpret the numbers (labels) as (d + 1)-bit path numbers, i.e. starting from the left hand side of the number, each bit represents a direction, 0 = left, 1 = right. [Figure: complete binary tree with in-order labels 1 to 15 and their 4-bit binary forms.]
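
The link between in-order labels and path numbers can be checked with a short script. The sketch assumes the usual convention that the right-most 1 bit of a label marks the end of the path, so only the bits to its left are directions; this detail is not spelled out on the slide. The function names and the 'L'/'R' path strings are hypothetical.

```python
def inorder_labels(d):
    """Map each root-to-node path ('L'/'R' string) to its in-order label
    in a complete binary tree whose leaves are at depth d."""
    labels = {}
    counter = 0

    def visit(path):
        nonlocal counter
        if len(path) < d:
            visit(path + "L")
        counter += 1               # Left-Root-Right: number the node here
        labels[path] = counter
        if len(path) < d:
            visit(path + "R")

    visit("")
    return labels


def path_from_label(label, d):
    """Decode a (d + 1)-bit label into the path from the root."""
    bits = format(label, f"0{d + 1}b")
    stop = bits.rindex("1")        # the right-most 1 terminates the path
    return "".join("L" if b == "0" else "R" for b in bits[:stop])


labels = inorder_labels(3)         # p = 8 leaves, n = 15 nodes, 4-bit labels
print(labels[""], labels["LR"])    # 8 6
print(path_from_label(6, 3))       # LR
```

For example, 0110 (node 6) reads as left, right, then the terminating 1, and node 6 is indeed the right child of the left child of the root.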
