csci 490
play

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment - PowerPoint PPT Presentation

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment Outline Whats the biological problem Algorithm issues Intro to dynamic programming Global sequence alignment Local sequence alignment More efficient


  1. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 A -3 3

  2. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3

  3. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  4. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A j = 0 0 -1 -2 -3 -4 Optimal Alignment: F(4,3) = 2 1 A -1 1 0 -1 -2 This only tells us the 2 T -2 0 0 1 0 best score 3 A -3 -1 -1 0 2

  5. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 A 2 T -2 0 0 1 0 A 3 A -3 -1 -1 0 2

  6. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 T A 2 T -2 0 0 1 0 T A 3 A -3 -1 -1 0 2

  7. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 G T A 2 T -2 0 0 1 0 - T A 3 A -3 -1 -1 0 2

  8. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 A G T A 2 T -2 0 0 1 0 A - T A 3 A -3 -1 -1 0 2

  9. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 AGTA 2 T -2 0 0 1 0 A − TA 3 A -3 -1 -1 0 2

  10. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 2 T -2 3 A -3

  11. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 3 A -3

  12. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3

  13. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  14. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  15. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  16. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  17. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 AGTA 2 T -2 0 0 1 0 A − TA 3 A -3 -1 -1 0 2

  18. The Needleman-Wunsch Algorithm 1. Initialization. a. F(0, 0) = 0 = - j  d b. F(0, j) = - i  d c. F(i, 0) 2. Main Iteration. Filling in scores For each i = 1……M a. For each j = 1……N F(i-1,j) – d [case 1] F(i, j-1) – d F(i, j) = max [case 2] F(i-1, j-1) + σ (x i , y j ) [case 3] UP, if [case 1] Ptr(i,j) = LEFT if [case 2] DIAG if [case 3] 3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

  19. Complexity • Time: O(NM) • Space: O(NM) • Linear-space algorithms do exist (with the same time complexity)

  20. Equivalent graph problem S1 = G A T A (0,0) → : a gap in the 2 nd sequence 1 1 S2 = A  : a gap in the 1 st sequence 1 : match / mismatch T Value on vertical/horizontal line: -d 1 1 Value on diagonal: m or -s A (3,4) • Number of steps: length of the alignment • Path length: alignment score • Optimal alignment: find the longest path from (0, 0) to (3, 4) • General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.

  21. Question • If we change the scoring scheme, will the optimal alignment be changed? – Old: Match = 1, mismatch = gap = -1 – New: match = 2, mismatch = gap = 0 – New: Match = 2, mismatch = gap = -2?

  22. Question • What kind of alignment is represented by these paths? A A A A A B B B B B C C C C C A- A-- --A -A- -A BC -BC BC- B-C BC Alternating gaps are impossible if – s > -2d

  23. A variant of the basic algorithm Scoring scheme: m = s = d: 1 Seq1: CAGCA-CTTGGATTCTCGG || |:||| Score = -7 Seq2: ---CAGCGTGG-------- Seq1: CAGCACTTGGATTCTCGG |||| | | || Seq2: CAGC-----G-T----GG Score = -2 The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1)

  24. A variant of the basic algorithm • Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- • Then, we don ’ t want to penalize gaps in the ends

  25. The Overlap Detection variant Changes: x 1 ……………………………… x M y N ……………………………… y 1 1. Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2. Termination max i F(i, N) F OPT = max max j F(M, j)

  26. Different types of overlaps x x y y

  27. The local alignment problem X = x 1 ……x M , Given two strings Y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c -de x y

  28. Why local alignment • Conserved regions may be a small part of the whole – Global alignment might miss them if flanking “junk” outweighs similar regions • Genes are shuffled between genomes C D B A D B A C

  29. Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n 2 ) choices for A, O(m 2 ) for B, O(nm) for DP, so O(n 3 m 3 ) total.

  30. Reminder • The overlap detection algorithm – We do not give penalty to gaps at either end Free gap Free gap

  31. The local alignment idea • Do not penalize the unaligned regions (gaps or mismatches) • The alignment can start anywhere and ends anywhere • Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment – By resetting alignment score to zero

  32. The Smith-Waterman algorithm Initialization : F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d Iteration : F(i, j) = max F(i, j – 1) – d F(i – 1, j – 1) +  (x i , y j )

  33. The Smith-Waterman algorithm Termination : 1. If we want the best local alignment… F OPT = max i,j F(i, j) 2. If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

  34. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 Gap: -1 b 0 c 0 x 0 d 0 e 0 x 0

  35. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 x 0 d 0 e 0 x 0

  36. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 d 0 e 0 x 0

  37. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 e 0 x 0

  38. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 x 0

  39. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0

  40. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0 2 2 2 1 1 4

  41. Trace back x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0 2 2 2 1 1 4

  42. Trace back x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 cxde | || c 0 0 0 0 2 1 0 c-de x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 x-de | || e 0 0 0 0 0 2 5 xcde x 0 2 2 2 1 1 4

  43. • No negative values in local alignment DP array • Optimal local alignment will never have a gap on either end • Local alignment: “ Smith-Waterman ” • Global alignment: “ Needleman-Wunsch ”

  44. Analysis • Time: – O(MN) for finding the best alignment – Time to report all alignments depends on the number of sub-opt alignments • Memory: – O(MN) – O(M+N) possible

  45. More efficient alignment algorithms

  46. • Given two sequences of length M, N • Time: O(MN) – Ok, but still slow for long sequences • Space: O(MN) – bad – 1Mb seq x 1Mb seq = 1TB memory • Can we do better?

  47. Bounded alignment Good alignment should appear near the diagonal

  48. Bounded Dynamic Programming If we know that x and y are very similar Assumption: # gaps(x, y) < k x i | i – j | < k Then, | implies y j

  49. Bounded Dynamic Programming x 1 ………………………… x M Initialization: y N ………………………… y 1 F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+  (x i , y j ) F(i, j – 1) – d, if j > i – k F(i, j) = max F(i – 1, j) – d, if j < i + k Termination: same k

  50. Analysis • Time: O(kM) << O(MN) • Space: O(kM) with some tricks => M M 2k 2k

  51. • Given two sequences of length M, N • Time: O(MN) – ok • Space: O(MN) – bad – 1mb seq x 1mb seq = 1TB memory • Can we do better?

  52. Linear space algorithm • If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment?

  53. Linear space algorithm • When we finish, we know how we have aligned the ends of the sequences X M Y N Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN))

  54. (0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

  55. (0,0) (3,2) (3,0) (3,4) (3,6) (6,6) • Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k))

  56. Hirschberg’s idea • Divide and conquer! Y X Forward algorithm Align x 1 x 2 …x M/2 with Y M/2 F(M/2, k) represents the best alignment between x 1 x 2 …x M/2 and y 1 y 2 …y k

  57. Backward Algorithm Y X Backward algorithm Align reverse(x M/2+1 …x M ) with reverse(Y) M/2 B(M/2, k) represents the best alignment between reverse(x M/2+1 …x M ) and reverse(y k y k+1 …y N )

  58. Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2

  59. Linear-space alignment Now, we can find k * maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k * Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2

Recommend


More recommend