Sequence Comparison: Dynamic Programming


  1. Sequence Comparison: Dynamic Programming. Genome 373: Genomic Informatics. Elhanan Borenstein.

  2. Mission: Find the best alignment between two sequences. This takes two ingredients: a method for scoring alignments, and a “search” algorithm for finding the alignment with the best score (which one?).

  3. Scoring Aligned Bases
     • Substitution matrix:

             A    C    G    T
        A   10   -5    0   -5
        C   -5   10   -5    0
        G    0   -5   10   -5
        T   -5    0   -5   10

     • Gap penalty:
       • Linear gap penalty
       • Affine gap penalty

     Example with a linear gap penalty d = -4:

        GAAT-C
        CA-TAC
        -5 + 10 + -4 + 10 + -4 + 10 = 17
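
The scoring example above can be checked with a short sketch. Only the substitution matrix and d = -4 come from the slide; the names `SCORES`, `GAP`, `substitution`, and `score_alignment` are illustrative, and only a linear gap penalty is handled.

```python
# Substitution matrix from the slide (rows and columns in the order A, C, G, T).
BASES = "ACGT"
SCORES = [
    [10, -5, 0, -5],
    [-5, 10, -5, 0],
    [0, -5, 10, -5],
    [-5, 0, -5, 10],
]
GAP = -4  # linear gap penalty d from the slide


def substitution(a: str, b: str) -> int:
    """Look up the substitution score for two aligned bases."""
    return SCORES[BASES.index(a)][BASES.index(b)]


def score_alignment(top: str, bottom: str, gap: int = GAP) -> int:
    """Sum the column scores of a gapped alignment (illustrative helper)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap  # a column containing a gap costs d
        else:
            total += substitution(a, b)
    return total


print(score_alignment("GAAT-C", "CA-TAC"))  # 17, as on the slide
```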

  4. Exhaustive search
     • Align the two sequences: GAATC and CATAC

        GAATC    GAAT-C   -GAAT-C   GAAT-C
        CATAC    C-ATAC   C-A-TAC   C-ATAC

        GAATC-   GAAT-C   GA-ATC    GAAT-C
        CA-TAC   CA-TAC   CATA-C    CA-TAC

     Simple (exhaustive search) algorithm:
     1) Construct all possible alignments
     2) Use the substitution matrix and gap penalty to score each alignment
     3) Pick the alignment with the best score

  5. Exhaustive search: what is the complexity of this simple algorithm (construct all possible alignments, score each one, pick the best)?
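
To get a feel for the cost of exhaustive search, here is a minimal sketch that counts alignments instead of constructing them. It assumes every alignment column is residue/residue, residue/gap, or gap/residue, and that alignments differing only in the order of adjacent gaps count as different. `count_alignments` is a hypothetical helper name, not something from the slides.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Count gapped alignments of two sequences of lengths n and m.

    Assumption: columns are residue/residue, residue/gap, or gap/residue,
    and alignments that differ only in gap order are counted separately.
    """
    if n == 0 or m == 0:
        return 1  # only one way left: pad the rest with gaps
    return (
        count_alignments(n - 1, m - 1)  # last column aligns two residues
        + count_alignments(n - 1, m)    # last column: residue against a gap
        + count_alignments(n, m - 1)    # last column: gap against a residue
    )


for length in (1, 2, 5):
    print(length, count_alignments(length, length))
```

Under these assumptions two sequences of length 5, such as GAATC and CATAC, already have 1683 alignments, and the count grows rapidly with sequence length.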

  6. Mission: Find the best alignment between two sequences. A method for scoring alignments, plus a “search” algorithm for finding the alignment with the best score: the Needleman–Wunsch algorithm.

  7. The Needleman–Wunsch Algorithm
     • An algorithm for global alignment of two sequences
     • A Dynamic Programming (DP) approach
       – Yes, it’s a weird name.
       – DP is closely related to recursion and to mathematical induction
     • We can prove that the resulting score is optimal.

  8. DP matrix

          j    0    1    2    3   etc.
     i              G    A    A    T    C
     0
     1    C
     2    A
     3    T
     4    A
     5    C

  9. DP matrix: row i = 0 and column j = 0 are the initial row and column.

  10. DP matrix: which value are we interested in? The value at (i, j) is the score of the best alignment of the first i characters of one sequence versus the first j characters of the other sequence. For example, the entry at i = 2, j = 2 is 5, the score of the best alignment of GA to CA.

  11. DP matrix: the entry in the bottom-right corner (i = 5, j = 5) is the score of the best alignment of the two sequences.

  12. Moving in the DP matrix, starting from the entry 5 at i = 2, j = 2 (the best alignment of GA to CA).

  13. DP matrix: moving horizontally in the matrix introduces a gap in the sequence along the left edge. One step to the right of the 5 gives 1, the score of

          GAA
          CA-

  14. DP matrix: moving vertically in the matrix introduces a gap in the sequence along the top edge. One step down from the 5 gives 1, the score of

          GA-
          CAT

  15. DP matrix: moving diagonally in the matrix aligns two residues. One diagonal step from the 5 gives 0, the score of

          GAA
          CAT
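
The three moves can be sketched as a small function that extends an alignment ending at cell (i, j). This is only an illustration: `moves_from` is a hypothetical helper, and it assumes GAATC runs along the top edge, CATAC along the left edge, with the substitution matrix and d = -4 used earlier in the slides.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
D = -4  # linear gap penalty
TOP, LEFT = "GAATC", "CATAC"


def moves_from(i, j, score, top_aln, left_aln):
    """Return the three one-step extensions of an alignment ending at (i, j).

    At cell (i, j), i characters of LEFT and j characters of TOP are used,
    so the next unused characters are LEFT[i] and TOP[j] (0-based).
    """
    return {
        # Horizontal: use a TOP character, gap in the left-edge sequence.
        "horizontal": (score + D, top_aln + TOP[j], left_aln + "-"),
        # Vertical: use a LEFT character, gap in the top-edge sequence.
        "vertical": (score + D, top_aln + "-", left_aln + LEFT[i]),
        # Diagonal: align the next two residues with each other.
        "diagonal": (score + S[(LEFT[i], TOP[j])], top_aln + TOP[j], left_aln + LEFT[i]),
    }


moves = moves_from(2, 2, 5, "GA", "CA")
for name, (score, top_aln, left_aln) in moves.items():
    print(f"{name:10s} {score:3d}  {top_aln} / {left_aln}")
```

Starting from the entry 5 (GA over CA), the sketch yields 1, 1, and 0 for the horizontal, vertical, and diagonal moves.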

  16. Initialization: start from an empty DP matrix, with GAATC along the top edge and CATAC along the left edge.

  17. Initialization: the top-left entry (i = 0, j = 0) is 0.

  18. Introducing a gap: aligning G to a gap (G over -) gives -4 at i = 0, j = 1.

  19. Introducing a gap: aligning a gap to C (- over C) gives -4 at i = 1, j = 0.

  20. Complete first row and column

                 G    A    A    T    C
            0   -4   -8  -12  -16  -20
       C   -4
       A   -8
       T  -12
       A  -16
       C  -20

      The entry -20 at the bottom of the first column is the score of

          -----
          CATAC
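
The initialization can be sketched as follows. `init_matrix` is an illustrative name; unfilled cells are marked with `None`, which is a choice made here and not something the slides prescribe.

```python
def init_matrix(top: str, left: str, d: int):
    """Build a DP matrix with only the first row and column filled in."""
    n_rows, n_cols = len(left) + 1, len(top) + 1
    F = [[None] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        F[0][j] = j * d  # j leading gaps in the left-edge sequence
    for i in range(n_rows):
        F[i][0] = i * d  # i leading gaps in the top-edge sequence
    return F


F = init_matrix("GAATC", "CATAC", -4)
print(F[0])                    # first row
print([row[0] for row in F])   # first column
```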

  21. With the first row and column complete, what about the entry at i = 1, j = 1 (row C, column G)?

  22. Three ways to get to i = 1, j = 1. First way, arriving from above: -4 + -4 = -8, the score of

          G-
          -C

  23. Three ways to get to i = 1, j = 1. Second way, arriving from the left: -4 + -4 = -8, the score of

          -G
          C-

  24. Three ways to get to i = 1, j = 1. Third way, arriving diagonally: 0 + -5 = -5, the score of

          G
          C

  25. Accept the highest scoring of the three: -5 is entered at i = 1, j = 1. Then simply repeat the same rule progressively across the matrix.
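
The rule for a single cell can be written as a small sketch. `best_of_three` is a hypothetical helper name; its arguments are the three neighbouring entries, the substitution score for the two residues at this cell, and the gap penalty d.

```python
def best_of_three(diag: int, up: int, left: int, sub: int, d: int):
    """Return (best score, winning move, all candidates) for one DP cell."""
    candidates = {
        "diagonal": diag + sub,   # align the two residues
        "vertical": up + d,       # gap in the top-edge sequence
        "horizontal": left + d,   # gap in the left-edge sequence
    }
    move = max(candidates, key=candidates.get)
    return candidates[move], move, candidates


# Cell i = 1, j = 1: neighbours 0 (diagonal), -4 (above), -4 (left); s(C, G) = -5.
print(best_of_three(0, -4, -4, -5, -4))

# Cell i = 2, j = 1: neighbours -4 (diagonal), -5 (above), -8 (left); s(A, G) = 0.
print(best_of_three(-4, -5, -8, 0, -4))
```

When two candidates tie, `max` keeps the first one in dictionary order; a tie simply means that more than one alignment reaches the same score.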

  26. DP matrix: next, what is the value at i = 2, j = 1 (row A, column G)?

  27. DP matrix: three ways to get to i = 2, j = 1

          -G           G-           --G
          CA           CA           CA-
          -4+0=-4      -5+-4=-9     -8+-4=-12

  28. DP matrix: of the three candidates (-4, -9, -12), the highest, -4, is entered at i = 2, j = 1.

  29. DP matrix: what are the remaining values in column G (j = 1), in rows T, A, and C?

  30. DP matrix: column G (j = 1) is now complete. From row C down to the last row, its values are -5, -4, -8, -12, -16.

  31. DP matrix: what are the values in the next column (the first A column, j = 2)?

  32. Traceback

                 G    A    A    T    C
            0   -4   -8  -12  -16  -20
       C   -4   -5   -9
       A   -8   -4    5
       T  -12   -8    1
       A  -16  -12    2
       C  -20  -16   -2

      What is the alignment associated with the entry 2 (i = 4, j = 2)?

  33. Traceback: what is the alignment associated with the entry 2 (i = 4, j = 2)? Just follow the arrows back; this is called the traceback. Here it gives

          -G-A
          CATA
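
A traceback from an arbitrary entry can be sketched as below. Instead of storing arrows, this version recomputes at each cell which neighbour could have produced the score; that is a choice made for brevity, and `fill_matrix` and `traceback` are illustrative names. When several moves tie, it prefers diagonal, then vertical, then horizontal, so it returns just one of the possible alignments.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
D = -4


def fill_matrix(top, left, s, d):
    """Fill the whole DP matrix (rows follow `left`, columns follow `top`)."""
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for j in range(1, len(top) + 1):
        F[0][j] = F[0][j - 1] + d
    for i in range(1, len(left) + 1):
        F[i][0] = F[i - 1][0] + d
        for j in range(1, len(top) + 1):
            F[i][j] = max(F[i - 1][j - 1] + s[(left[i - 1], top[j - 1])],
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
    return F


def traceback(F, top, left, i, j, s, d):
    """Walk back from cell (i, j) to (0, 0), building one alignment."""
    top_aln, left_aln = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s[(left[i - 1], top[j - 1])]:
            top_aln.append(top[j - 1]); left_aln.append(left[i - 1])
            i, j = i - 1, j - 1          # diagonal: two residues aligned
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top_aln.append("-"); left_aln.append(left[i - 1])
            i -= 1                       # vertical: gap in the top sequence
        else:
            top_aln.append(top[j - 1]); left_aln.append("-")
            j -= 1                       # horizontal: gap in the left sequence
    return "".join(reversed(top_aln)), "".join(reversed(left_aln))


F = fill_matrix("GAATC", "CATAC", S, D)
partial = traceback(F, "GAATC", "CATAC", 4, 2, S, D)
print(partial[0])
print(partial[1])
```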

  34. Full Alignment: with columns 0 to 2 of the DP matrix filled in, continue and find the optimal global alignment, and its score.

  35. Full Alignment

                 G    A    A    T    C
            0   -4   -8  -12  -16  -20
       C   -4   -5   -9  -13  -12   -6
       A   -8   -4    5    1   -3   -7
       T  -12   -8    1    0   11    7
       A  -16  -12    2   11    7    6
       C  -20  -16   -2    7   11   17
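
The complete matrix can be reproduced with a short sketch of the fill step. `fill_matrix` is an illustrative name; the substitution matrix and d = -4 are the ones used throughout the slides, with GAATC along the top edge and CATAC along the left edge.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
D = -4


def fill_matrix(top, left, s, d):
    """Fill the DP matrix row by row; rows follow `left`, columns follow `top`."""
    n_rows, n_cols = len(left) + 1, len(top) + 1
    F = [[0] * n_cols for _ in range(n_rows)]
    for j in range(1, n_cols):          # first row: gaps only
        F[0][j] = F[0][j - 1] + d
    for i in range(1, n_rows):          # first column: gaps only
        F[i][0] = F[i - 1][0] + d
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            F[i][j] = max(
                F[i - 1][j - 1] + s[(left[i - 1], top[j - 1])],  # diagonal
                F[i - 1][j] + d,                                  # vertical
                F[i][j - 1] + d,                                  # horizontal
            )
    return F


F = fill_matrix("GAATC", "CATAC", S, D)
for row in F:
    print(" ".join(f"{value:4d}" for value in row))
print("best score:", F[-1][-1])
```

Each cell is computed once from three already-filled neighbours, so the work is proportional to the number of cells in the matrix.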

  36. Full Alignment: the best alignment starts at bottom right (score 17) and follows traceback arrows to top left.

  37. One best traceback:

          GA-ATC
          CATA-C

  38. Another best traceback:

          GAAT-C
          -CATAC

  39. Both best tracebacks shown together in the DP matrix:

          GAAT-C    GA-ATC
          -CATAC    CATA-C

  40. Multiple solutions
      • When a program returns a single sequence alignment, it may not be the only best alignment but it is guaranteed to be one of them.
      • In our example, all of the following alignments have equal scores:

          GA-ATC    GAAT-C    GAAT-C    GAAT-C
          CATA-C    CA-TAC    C-ATAC    -CATAC
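
To see where the multiple solutions come from, the sketch below follows every tied move during traceback instead of picking one. It is an illustration under the same assumptions as the example (GAATC on top, CATAC on the left, d = -4); `fill_matrix` and `all_tracebacks` are hypothetical names, and the number of co-optimal alignments can grow quickly for longer sequences.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
D = -4


def fill_matrix(top, left, s, d):
    """Fill the DP matrix (rows follow `left`, columns follow `top`)."""
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for j in range(1, len(top) + 1):
        F[0][j] = F[0][j - 1] + d
    for i in range(1, len(left) + 1):
        F[i][0] = F[i - 1][0] + d
        for j in range(1, len(top) + 1):
            F[i][j] = max(F[i - 1][j - 1] + s[(left[i - 1], top[j - 1])],
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
    return F


def all_tracebacks(F, top, left, s, d):
    """Collect every alignment whose path reproduces the optimal score."""
    results = []

    def walk(i, j, top_aln, left_aln):
        if i == 0 and j == 0:
            results.append((top_aln, left_aln))
            return
        # Follow each move that is consistent with the stored score.
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s[(left[i - 1], top[j - 1])]:
            walk(i - 1, j - 1, top[j - 1] + top_aln, left[i - 1] + left_aln)
        if i > 0 and F[i][j] == F[i - 1][j] + d:
            walk(i - 1, j, "-" + top_aln, left[i - 1] + left_aln)
        if j > 0 and F[i][j] == F[i][j - 1] + d:
            walk(i, j - 1, top[j - 1] + top_aln, "-" + left_aln)

    walk(len(left), len(top), "", "")
    return results


F = fill_matrix("GAATC", "CATAC", S, D)
alignments = all_tracebacks(F, "GAATC", "CATAC", S, D)
for top_aln, left_aln in alignments:
    print(top_aln)
    print(left_aln)
    print()
```

For this example the sketch finds exactly the four alignments listed on the slide.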

  41. What’s the complexity of this algorithm?

  42. Practice problem: Find a best pairwise alignment of GAATC and AATTC. Start from a DP matrix with GAATC along the top edge, AATTC along the left edge, and 0 in the top-left entry.

  43. DP in equation form
      • Align sequence x and y.
      • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

      $$F(0,0) = 0$$

      $$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) \\ F(i-1,\,j) + d \\ F(i,\,j-1) + d \end{cases}$$
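
Since DP is closely related to recursion, the equation can also be written almost literally as a memoized recursive function. This is a minimal sketch, not the form used in the slides: `nw_score` is a hypothetical name, x is assumed to be the sequence indexed by i and y the one indexed by j, and on the first row and column only the moves that stay inside the matrix are considered.

```python
from functools import lru_cache

BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}


def nw_score(x: str, y: str, s: dict, d: int) -> int:
    """Optimal global alignment score via the recurrence for F(i, j)."""

    @lru_cache(maxsize=None)          # memoization: each F(i, j) is computed once
    def F(i: int, j: int) -> int:
        if i == 0 and j == 0:
            return 0                  # F(0, 0) = 0
        candidates = []
        if i > 0 and j > 0:
            candidates.append(F(i - 1, j - 1) + s[(x[i - 1], y[j - 1])])
        if i > 0:
            candidates.append(F(i - 1, j) + d)
        if j > 0:
            candidates.append(F(i, j - 1) + d)
        return max(candidates)

    return F(len(x), len(y))


print(nw_score("CATAC", "GAATC", S, -4))  # 17 for the example in the slides
```

Plain recursion is fine for short sequences like these; for long sequences an explicit matrix filled iteratively avoids Python's recursion limit.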

  44. DP equation graphically: F(i, j) is reached from F(i-1, j-1) by adding s(x_i, y_j), from F(i-1, j) by adding d, and from F(i, j-1) by adding d; take the max of these three.
