
Sequence Comparison: Dynamic Programming (Genome 373: Genomic Informatics)


  1. Sequence Comparison: Dynamic Programming. Genome 373: Genomic Informatics. Elhanan Borenstein.

  2. Mission: Find the best alignment between two sequences (for example, GAATC and CATAC).
     • A method for scoring alignments: substitution matrix, gap penalties
     • A “search” algorithm for finding the alignment with the best score: dynamic programming

  3. Scoring Aligned Bases
     • Substitution matrix:
              A    C    G    T
         A   10   -5    0   -5
         C   -5   10   -5    0
         G    0   -5   10   -5
         T   -5    0   -5   10
     • Gap penalty (d = -4 in the example):
       – Linear gap penalty
       – Affine gap penalty
     Example:
         GAAT-C
         CA-TAC
         -5 + 10 + -4 + 10 + -4 + 10 = 17
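
The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the names `SUBST`, `GAP` and `score_alignment` are made up here, and the sketch assumes the linear gap penalty d = -4 (an affine penalty would need extra bookkeeping for gap opening versus extension).

```python
# Substitution matrix from the slide, rows and columns in the order A, C, G, T.
BASES = "ACGT"
SCORES = [
    [10, -5, 0, -5],  # A
    [-5, 10, -5, 0],  # C
    [0, -5, 10, -5],  # G
    [-5, 0, -5, 10],  # T
]
SUBST = {(a, b): SCORES[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}
GAP = -4  # linear gap penalty d from the slide


def score_alignment(top, bottom, subst=SUBST, gap=GAP):
    """Sum the column scores of two equal-length gapped strings ('-' is a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        total += gap if "-" in (a, b) else subst[(a, b)]
    return total


print(score_alignment("GAAT-C", "CA-TAC"))  # 17, the sum shown on the slide
```

The same function gives 5 for the ungapped alignment GAATC over CATAC, so the two gaps pay for themselves here.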

  4. Exhaustive search

         GAATC     GAAT-C    -GAAT-C
         CATAC     C-ATAC    C-A-TAC

         GAATC-    GAAT-C    GA-ATC
         CA-TAC    CA-TAC    CATA-C

  5. How many possibilities?
     • How many different possible alignments of two sequences of length n exist?

  6. How many possibilities?
     • How many different possible alignments of two sequences of length n exist?

         n     alignments
         5     2.5 x 10^2
         10    1.8 x 10^5
         20    1.4 x 10^11
         30    1.2 x 10^17
         40    1.1 x 10^23
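
The slide does not say how these counts were obtained. The figures do match the central binomial coefficient C(2n, n), so the sketch below uses that formula as an assumption to reproduce the table; the function name `approx_alignment_count` is illustrative only.

```python
from math import comb


def approx_alignment_count(n):
    """Central binomial coefficient C(2n, n), assumed here to be the slide's count."""
    return comb(2 * n, n)


for n in (5, 10, 20, 30, 40):
    print(f"{n:>2}  {approx_alignment_count(n):.1e}")
```

Whatever the exact counting convention, the growth is exponential in n, which is why exhaustive search is not practical.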

  7. The Needleman–Wunsch Algorithm
     • An algorithm for global alignment of two sequences
     • A Dynamic Programming (DP) approach
       – Yes, it’s a weird name.
       – DP is closely related to recursion and to mathematical induction
     • We can prove that the resulting score is optimal.
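
To make the link between DP and recursion concrete, here is a minimal sketch that computes the best global alignment score as a memoized recursion. It assumes the substitution matrix and linear gap penalty d = -4 from slide 3; `sub` and `best_score` are hypothetical helper names, and the deck itself fills a table rather than recursing.

```python
from functools import lru_cache

BASES = "ACGT"
SCORES = [
    [10, -5, 0, -5],  # A
    [-5, 10, -5, 0],  # C
    [0, -5, 10, -5],  # G
    [-5, 0, -5, 10],  # T
]
GAP = -4  # linear gap penalty d


def sub(a, b):
    """Substitution score for aligning base a with base b."""
    return SCORES[BASES.index(a)][BASES.index(b)]


def best_score(top, left):
    """Best global alignment score of `top` versus `left`."""

    @lru_cache(maxsize=None)
    def f(i, j):
        # f(i, j): best score for the first i characters of `left`
        # versus the first j characters of `top`.
        if i == 0:
            return j * GAP  # only gaps in `left`
        if j == 0:
            return i * GAP  # only gaps in `top`
        return max(
            f(i - 1, j - 1) + sub(left[i - 1], top[j - 1]),  # align two residues
            f(i - 1, j) + GAP,                               # gap in `top`
            f(i, j - 1) + GAP,                               # gap in `left`
        )

    return f(len(left), len(top))


print(best_score("GAATC", "CATAC"))  # 17
```

The cache means each (i, j) pair is solved once, which is the same work the DP matrix does explicitly.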

  8. DP matrix
     Columns are indexed by j = 0, 1, 2, 3, etc. along the top sequence; rows are indexed by i = 0 to 5 down the left sequence. Row 0 and column 0 are the initial row and column.

              j    0    1    2    3    4    5
         i              G    A    A    T    C
         0
         1    C
         2    A              5
         3    T
         4    A
         5    C

     The value at (i, j) is the score of the best alignment of the first i characters of one sequence versus the first j characters of the other sequence. The entry 5 shown here is the score of
         GA
         CA

  9. DP matrix
     Moving horizontally in the matrix introduces a gap in the sequence along the left edge. Stepping right from the entry 5 to the entry 1 in row A extends the alignment to
         GAA
         CA-

  10. DP matrix
      Moving vertically in the matrix introduces a gap in the sequence along the top edge. Stepping down from the entry 5 in row A to the entry 1 in row T extends the alignment to
          GA-
          CAT

  11. DP matrix
      Moving diagonally in the matrix aligns two residues. Stepping diagonally from the entry 5 in row A to the entry 0 in row T extends the alignment to
          GAA
          CAT

  12. Initialization
      Start at top left and move progressively.

                    G    A    A    T    C
               0
          C
          A
          T
          A
          C

  13. Introducing a gap
      Moving one step right from the top-left 0 gives -4:
          G
          -

                    G    A    A    T    C
               0   -4
          C
          A
          T
          A
          C

  14. Introducing a gap
      Moving one step down from the top-left 0 gives -4:
          -
          C

                    G    A    A    T    C
               0   -4
          C   -4
          A
          T
          A
          C

  15. Complete first row and column
      The last entry of the first column, -20, corresponds to
          -----
          CATAC

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4
          A   -8
          T  -12
          A  -16
          C  -20
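
The initialization shown on this slide can be sketched as follows, assuming the linear gap penalty d = -4. The function name `init_matrix` and the use of `None` for cells not yet filled are choices made for this illustration.

```python
GAP = -4  # linear gap penalty d


def init_matrix(top, left, gap=GAP):
    """Create the DP matrix with only the first row and column filled in."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[None] * cols for _ in range(rows)]
    for j in range(cols):
        F[0][j] = j * gap  # j leading gaps in the left-edge sequence
    for i in range(rows):
        F[i][0] = i * gap  # i leading gaps in the top-edge sequence
    return F


F = init_matrix("GAATC", "CATAC")
print(F[0])                    # [0, -4, -8, -12, -16, -20]
print([row[0] for row in F])   # [0, -4, -8, -12, -16, -20]
```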

  16. Three ways to get to i=1, j=1
      From the entry -4 at (i=0, j=1), moving vertically gives -8:
          G-
          -C

  17. Three ways to get to i=1, j=1
      From the entry -4 at (i=1, j=0), moving horizontally gives -8:
          -G
          C-

  18. Three ways to get to i=1, j=1
      From the entry 0 at (i=0, j=0), moving diagonally gives -5:
          G
          C

  19. Accept the highest scoring of the three
      Then simply repeat the same rule progressively across the matrix.

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5
          A   -8
          T  -12
          A  -16
          C  -20
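
The "highest of the three" rule for a single cell can be sketched like this. The function `cell_score` and its argument names are hypothetical; `match` stands for the substitution score of the two residues that a diagonal move would align, and the gap penalty is assumed to be the linear d = -4.

```python
def cell_score(diag, up, left, match, gap=-4):
    """Return (best score, move) for one cell, given its three neighbours.

    diag, up, left: already-computed scores of the neighbouring cells.
    match: substitution score for the residue pair aligned by a diagonal move.
    """
    candidates = [
        (diag + match, "diagonal"),   # align two residues
        (up + gap, "vertical"),       # gap in the sequence along the top edge
        (left + gap, "horizontal"),   # gap in the sequence along the left edge
    ]
    # On ties, max() keeps the first candidate in the list.
    return max(candidates, key=lambda c: c[0])


# Cell (i=1, j=1): C versus G scores -5 in the substitution matrix.
print(cell_score(diag=0, up=-4, left=-4, match=-5))   # (-5, 'diagonal')
# Cell (i=2, j=1): A versus G scores 0.
print(cell_score(diag=-4, up=-5, left=-8, match=0))   # (-4, 'diagonal')
```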

  20. DP matrix

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5
          A   -8    ?
          T  -12
          A  -16
          C  -20

  21. DP matrix
      Three candidates for the entry marked ?:

          -G          G-          --G
          CA          CA          CA-
          -4+0=-4     -5+-4=-9    -8+-4=-12

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5
          A   -8    ?
          T  -12
          A  -16
          C  -20

  22. DP matrix
      The highest of the three candidates, -4, is entered:

          -G          G-          --G
          CA          CA          CA-
          -4+0=-4     -5+-4=-9    -8+-4=-12

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5
          A   -8   -4
          T  -12
          A  -16
          C  -20

  23. DP matrix

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5
          A   -8   -4
          T  -12    ?
          A  -16    ?
          C  -20    ?

  24. DP matrix

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5
          A   -8   -4
          T  -12   -8
          A  -16  -12
          C  -20  -16

  25. DP matrix

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5    ?
          A   -8   -4    ?
          T  -12   -8    ?
          A  -16  -12    ?
          C  -20  -16    ?

  26. Traceback
      What is the alignment associated with this entry (the 2 in the second row A)? Just follow the arrows back; this is called the traceback.
          -G-A
          CATA

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5   -9
          A   -8   -4    5
          T  -12   -8    1
          A  -16  -12    2
          C  -20  -16   -2

  27. Full Alignment
      Continue and find the optimal global alignment, and its score.

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5   -9
          A   -8   -4    5
          T  -12   -8    1
          A  -16  -12    2
          C  -20  -16   -2                   ?

  28. Full Alignment

                    G    A    A    T    C
               0   -4   -8  -12  -16  -20
          C   -4   -5   -9  -13  -12   -6
          A   -8   -4    5    1   -3   -7
          T  -12   -8    1    0   11    7
          A  -16  -12    2   11    7    6
          C  -20  -16   -2    7   11   17
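
The completed matrix can be reproduced with a short fill routine. This is a minimal sketch assuming the substitution matrix from slide 3 and the linear gap penalty d = -4; the names `sub` and `fill_matrix` are illustrative.

```python
BASES = "ACGT"
SCORES = [
    [10, -5, 0, -5],  # A
    [-5, 10, -5, 0],  # C
    [0, -5, 10, -5],  # G
    [-5, 0, -5, 10],  # T
]
GAP = -4  # linear gap penalty d


def sub(a, b):
    """Substitution score for aligning base a with base b."""
    return SCORES[BASES.index(a)][BASES.index(b)]


def fill_matrix(top, left, gap=GAP):
    """Fill the DP matrix; rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(left[i - 1], top[j - 1]),  # diagonal
                F[i - 1][j] + gap,                               # vertical
                F[i][j - 1] + gap,                               # horizontal
            )
    return F


F = fill_matrix("GAATC", "CATAC")
for label, row in zip(" CATAC", F):
    print(label, " ".join(f"{v:4d}" for v in row))
```

The bottom-right entry, `F[5][5]`, is the optimal global alignment score of 17.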

  29. Full Alignment
      Best alignment starts at bottom right and follows traceback arrows to top left. (Completed DP matrix as on slide 28.)
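
A traceback can be sketched without storing arrows, by re-checking at each cell which neighbour could have produced its score. The sketch assumes the slide 3 scoring scheme with d = -4; `fill_matrix`, `traceback` and the tie-breaking order (diagonal, then vertical, then horizontal) are choices made for this illustration, so a different order may return a different but equally good alignment.

```python
BASES = "ACGT"
SCORES = [
    [10, -5, 0, -5],  # A
    [-5, 10, -5, 0],  # C
    [0, -5, 10, -5],  # G
    [-5, 0, -5, 10],  # T
]
GAP = -4  # linear gap penalty d


def sub(a, b):
    return SCORES[BASES.index(a)][BASES.index(b)]


def fill_matrix(top, left):
    """Fill the DP matrix; rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * GAP
    for i in range(1, rows):
        F[i][0] = i * GAP
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + sub(left[i - 1], top[j - 1]),
                          F[i - 1][j] + GAP,
                          F[i][j - 1] + GAP)
    return F


def traceback(F, top, left, i=None, j=None):
    """Walk from cell (i, j) back to (0, 0); defaults to the bottom-right cell."""
    i = len(left) if i is None else i
    j = len(top) if j is None else j
    top_aln, left_aln = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(left[i - 1], top[j - 1]):
            top_aln.append(top[j - 1])      # diagonal: align two residues
            left_aln.append(left[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + GAP:
            top_aln.append("-")             # vertical: gap in the top sequence
            left_aln.append(left[i - 1])
            i -= 1
        else:
            top_aln.append(top[j - 1])      # horizontal: gap in the left sequence
            left_aln.append("-")
            j -= 1
    return "".join(reversed(top_aln)), "".join(reversed(left_aln))


F = fill_matrix("GAATC", "CATAC")
print(traceback(F, "GAATC", "CATAC"))            # ('GAAT-C', '-CATAC')
print(traceback(F, "GAATC", "CATAC", i=4, j=2))  # ('-G-A', 'CATA')
```

Starting from an interior cell, as in the second call, gives the alignment associated with that entry rather than the full global alignment.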

  30. One best traceback
          GA-ATC
          CATA-C
      (Completed DP matrix as on slide 28.)

  31. Another best traceback
          GAAT-C
          -CATAC
      (Completed DP matrix as on slide 28.)

  32. Both tracebacks
          GAAT-C    GA-ATC
          -CATAC    CATA-C
      (Completed DP matrix as on slide 28.)

  33. Multiple solutions
      • When a program returns a single sequence alignment, it may not be the only best alignment but it is guaranteed to be one of them.
      • In our example, all of the alignments below have equal scores.

          GA-ATC    GAAT-C    GAAT-C    GAAT-C
          CATA-C    CA-TAC    C-ATAC    -CATAC
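
To see where the equally good alignments come from, the traceback can follow every tied move instead of picking one. The sketch below assumes the slide 3 scoring scheme with d = -4; `all_tracebacks` is a hypothetical helper, and since the number of tied paths can grow quickly for longer sequences it is meant for small examples only.

```python
BASES = "ACGT"
SCORES = [
    [10, -5, 0, -5],  # A
    [-5, 10, -5, 0],  # C
    [0, -5, 10, -5],  # G
    [-5, 0, -5, 10],  # T
]
GAP = -4  # linear gap penalty d


def sub(a, b):
    return SCORES[BASES.index(a)][BASES.index(b)]


def fill_matrix(top, left):
    """Fill the DP matrix; rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * GAP
    for i in range(1, rows):
        F[i][0] = i * GAP
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + sub(left[i - 1], top[j - 1]),
                          F[i - 1][j] + GAP,
                          F[i][j - 1] + GAP)
    return F


def all_tracebacks(F, top, left):
    """Return every alignment whose path explains the bottom-right score."""

    def walk(i, j, top_rev, left_rev):
        # top_rev / left_rev hold the alignment built so far, in reverse order.
        if i == 0 and j == 0:
            yield top_rev[::-1], left_rev[::-1]
            return
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(left[i - 1], top[j - 1]):
            yield from walk(i - 1, j - 1, top_rev + top[j - 1], left_rev + left[i - 1])
        if i > 0 and F[i][j] == F[i - 1][j] + GAP:
            yield from walk(i - 1, j, top_rev + "-", left_rev + left[i - 1])
        if j > 0 and F[i][j] == F[i][j - 1] + GAP:
            yield from walk(i, j - 1, top_rev + top[j - 1], left_rev + "-")

    return list(walk(len(left), len(top), "", ""))


F = fill_matrix("GAATC", "CATAC")
for top_aln, left_aln in all_tracebacks(F, "GAATC", "CATAC"):
    print(top_aln)
    print(left_aln)
    print()
```

For GAATC versus CATAC this yields exactly the four alignments listed on the slide, each with score 17.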

  34. Practice problem: Find a best pairwise alignment of GAATC and AATTC

      Substitution matrix (gap penalty d = -4):
               A    C    G    T
          A   10   -5    0   -5
          C   -5   10   -5    0
          G    0   -5   10   -5
          T   -5    0   -5   10

      DP matrix to fill in:
                    G    A    A    T    C
               0
          A
          A
          T
          T
          C
