algorithms in bioinformatics a practical introduction
play

Algorithms in Bioinformatics: A Practical Introduction Genome - PowerPoint PPT Presentation

Algorithms in Bioinformatics: A Practical Introduction Genome Alignment Complete genomes About 1000 bacterial genomes plus a few dozen higher organisms are sequenced It is expected many genomes will be available in the future. How


  1. Algorithms in Bioinformatics: A Practical Introduction Genome Alignment

  2. Complete genomes  About 1000 bacterial genomes plus a few dozen higher organisms are sequenced  It is expected many genomes will be available in the future.  How can we study the similarities and differences of two or more related genomes?

  3. Can we compare two genomes using Smith-Waterman (SW) algorithm?  SW algorithm is designed for comparing two genes or two proteins.  SW algorithm is too slow for comparing genomes  Also, SW algorithm just reports locally similar regions (conserved regions) between two genomes. It cannot report the overall similarity.  We need better solutions for comparing genomes.

  4. Existing tools for comparing genomes  MUMmer, Mutation Sensitive Alignment, SSAHA, AVID, MGA, BLASTZ, and LAGAN  All existing genome alignment methods assume the two genomes should share some conserved regions.  They align the conserved regions first; then, they extend the alignment to cover the non- conserved regions.

  5. General Framework Phase 1: Identifies potential anchors  (highly similar regions). Phase 2: Identify a set of co-linear  anchors, which forms the basis of the alignment (conserved regions). Phase 3: Close the gaps between  the anchors to obtain the final alignment.

  6. Agenda  MUMmer (Versions 1 to 3)  Mutation Sensitive Alignment

  7. MUMmer 1  Phase 1: Identifying anchors based on Maximal Unique Match (MUM).  Phase 2: Identify a set of co-linear MUMs based on longest common subsequence.  Phase 3: Fill-in the gaps

  8. PHASE 1: I DENTI FYI NG ANCHORS (MUM)

  9. What is an anchor? Although conserved regions rarely contain the same entire  sequence, they usually share some short common substrings which are unique in the two genomes. For example, consider below two regions.  GCTA is a common substring that is unique.  GCTAC is a maximal common substring that is unique  TAC is a common substring that is not unique.  GCTACT is not a common substring.  Genome1: ACGACTCAGCTACTGGTCAGCTATTACTTACCGC Genome2: ACTTCTCTGCTACGGTCAGCTATTCACTTACCGC

  10. Maximal Unique Match (MUM)  In this lecture, we discuss MUM, a type of anchor.  MUM is a common substring of two genomes of length at least some threshold d such that  The common substring is maximal, that is, the substring cannot be extended on either end without incurring a mismatch  The common substring is unique in both sequences  For instance, almost all conserved genes share some short common substrings which are unique since they need to preserve functionality. In our experiment, we set d= 20.

  11. Examples of finding MUMs  S= acgactcagctactggtcagctattacttaccgc#  T= acttctctgctacggtcagctattcacttaccgc$  There are four MUMs: ctc, gctac, ggtcagctatt, acttaccgc.  Consider S= acgat# , T= cgta$  There are two MUMs: cg and t

  12. How to find MUMs?  Brute-force approach: I nput : Two genome sequences S[1..m 1 ] and T[1..m 2 ]  For every position i in S  For every position j in T  Find the longest common prefix P of S[i..m 1 ] and T[j..m 2 ]  Check whether |P| ≥ d and whether P is unique in both genomes. If yes, report it as a MUM.  This solution requires at least O(m 1 m 2 ) time. Too slow!

  13. Finding MUMs by suffix tree! MUMs can be found in O(m 1 + m 2 ) time by suffix tree!  Build a generalized suffix tree for S and T 1. Mark all the internal nodes that have exactly two 2. leaf children, which represent both suffixes of S and T. For each marked node, WLOG, suppose it 3. represents the i-th suffix S i of S and the j-th suffix T j of T. We check whether S[i-1]= T[j-1]. If not, the path label of this marked node is an MUM.

  14. Example of finding MUMs (I)  Consider S= acgat# , T= cgta$  Step 1: Build the generalized suffix tree for S and T. c a # g t $ g 5 6 c t a a t t a g t $ a a t a # # t $ # $ # $ # 4 4 2 1 3 2 1 5 3

  15. Example of finding MUMs (II)  Step 2: Mark all the internal nodes that have exactly two leaf children, which represent both suffixes of S and T. c a # g t $ g 5 6 c t a a t t a g t $ t a a t a # # $ # $ # $ # 4 4 2 1 3 2 1 5 3

  16. Example of finding MUMs (III) Step 3: For each marked node, WLOG, suppose it  represents the i-th suffix S i of S and the j-th suffix T j of T. We check whether S[i-1]= T[j-1]. If not, the path label of this marked node is an MUM. Output: c a # g t $ g cg, t 5 6 c t a a t t a g t $ a t a # a # $ S= acgat# , t # $ # $ # T= cgta$ 4 4 2 1 3 2 1 5 3

  17. Phase 1: Time analysis  Step 1: Building suffix tree takes O(m 1 + m 2 ) time.  Step 2: Mark internal nodes takes O(m 1 + m 2 ) time.  Step 3: Extracting MUM also takes O(m 1 + m 2 ) time.  In total, O(m 1 + m 2 ) time.

  18. Validating MUMs  Mouse and human are Mouse Human # of Published Chr No. Chr No. Gene Pairs closely related species. They 2 15 51 share a lot of gene pairs. 7 19 192 14 3 23  Do those gene pairs share 14 8 38 15 12 80 MUMs? 15 22 72 16 16 31 16 21 64 16 22 30  Good news! 17 6 150  MUMs appear in nearly 100% 17 16 46 17 19 30 of the known conserved gene 18 5 64 pairs. 19 9 22 19 11 93 Data is extracted from http://www.ncbi.nlm.nih.gov/Homology

  19. Are all MUMs correspond to conserved regions?  It seems that the answer is No! Mouse Human # of Published # of Chr No. Chr No. Gene Pairs MUMs 2 15 51 96,473 7 19 192 52,394 14 3 23 58,708 14 8 38 38,818 15 12 80 88,305 15 22 72 71,613 No. of MUMs > > no. of gene pairs! 16 16 31 66,536 16 21 64 51,009 There are many noise! 16 22 30 61,200 17 6 150 94,095 17 16 46 29,001 How can we extract the right MUMs? 17 19 30 56,536 18 5 64 131,850 19 9 22 62,296 19 11 93 29,814

  20. PHASE 2: I DENTI FYI NG CO-LI NEAR MUMS

  21. Why identify co-linear MUMs ?  Two related species should preserve the ordering of most conserved genes!  Example:  Conserved genes in Mouse Chromosome 16 and Human Chromosome 16

  22. How to identify co-linear MUMs?  Instead of reporting all MUMs,  we compute the longest common subsequence (LCS) of all MUMs.  We report only the MUMs on the LCS.  Hopefully, this approach will not report a lot of noise.

  23. Example of LCS 12345678 12345678 41325768 41325768

  24. The longest common subsequence (LCS) problem  Suppose there are n MUMs.  Input: two sequences A[1..n]= a 1 a 2 … a n and B[1..n]= b 1 b 2 … b n of n distinct characters.  Output: the longest common subsequence.

  25. How to find LCS in O(n 2 ) time?  Let C i [j] be the length of the longest common subsequence of A[1..i] and B[1..j].  Let δ (i) be the index of the character in B such that a i = b δ (i)  Note that C i [0]= C 0 [j]= 0 for all 0 ≤ j ≤ n; and  [ ] C j − = 1 i  [ ] max C j + δ − ≥ δ i  1 [ ( ) 1 ] ( ) C i if j i − i 1  Note that the length of the LCS(A,B) = C n [n].  By dynamic programming, the problem is solved in O(n 2 ) time.

  26. Example (I) C 0 1 2 3 4 5 6 7 8  A[1..8]= 12345678 0  B[1..8]= 41325768 1 2 3 4 5 6 7 8

  27. Example (II) C 0 1 2 3 4 5 6 7 8  A[1..8]= 12345678 0 0 0 0 0 0 0 0 0 0  B[1..8]= 41325768 1 0 2 0  Base cases: 3 0  C i [0]= 0 and C 0 [j]= 0 4 0 5 0 6 0 7 0 8 0

  28. Example (III) A[1..8]= 12345678 C 0 1 2 3 4 5 6 7 8  B[1..8]= 41325768  0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 Fill the table row by row with  2 0 0 1 1 the recursive formula:  [ ] C j − = i 1  [ ] max 3 0 C j + δ − ≥ δ i  1 [ ( ) 1 ] ( ) C i if j i − i 1 4 0 5 0 C 2 [4]= LCS(A[1..2],B[1..4]). 6 0 7 0 By the recursive formula, C 2 [4]= max{ C 1 [4], 8 0 1+ C 1 [3]} = 2 (Note: δ (2)= 4 )

  29. Example (IV) A[1..8]= 12345678 C 0 1 2 3 4 5 6 7 8  B[1..8]= 41325768  0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 Fill the table row by row with  2 0 0 1 1 2 2 2 2 2 the recursive formula:  [ ] C j − = i 1  [ ] max 3 0 0 1 2 2 2 2 2 2 C j + δ − ≥ δ i  1 [ ( ) 1 ] ( ) C i if j i − i 1 4 0 1 1 2 2 2 2 2 2 5 0 1 1 2 2 3 3 3 3 6 0 1 1 2 2 3 3 4 4 LCS(A,B)= C 8 [8]= 5. 7 0 1 1 2 2 3 4 4 4 8 0 1 1 2 2 3 4 4 5

  30. We can extract more information  As a side-product, we can retrieve LCS(A[1..i], B[1..j]), where 1 ≤ i,j ≤ n, in O(1) time.

  31. Sparsification  Observation:  C i [1] ≤ C i [2] ≤ … ≤ C i [n]  Instead of storing C i [0..n] explicitly, we can store only the boundaries at which the values change.  Precisely, we store (j,C i [j]) for all j such that C i [j]> C i [j-1].  Example: suppose C i [0..9]= 0001122233.  We store (3,1), (5,2), (8,3).  We can store all the tuples (j,C i [j]) in a binary search tree T i . Then, every search, insert, delete operation takes O(log n) time.

  32.  Given T i-1 , we can compute C i [j] in O(log n) time as follows.  [ ] C j − = 1 i  [ ] max C j + δ − ≥ δ i  1 [ ( ) 1 ] ( ) C i if j i − 1 i

Recommend


More recommend