
Algorithms in Bioinformatics: A Practical Introduction Multiple - PowerPoint PPT Presentation



  1. Algorithms in Bioinformatics: A Practical Introduction Multiple Sequence Alignment

  2. Multiple Sequence Alignment
     - Given k sequences $S = \{S_1, S_2, \ldots, S_k\}$.
     - A multiple alignment of S is a set of k equal-length sequences $\{S'_1, S'_2, \ldots, S'_k\}$, where $S'_i$ is obtained by inserting gaps into $S_i$.
     - The multiple sequence alignment problem aims to find a multiple alignment that optimizes a certain score.

  3. Example: multiple alignment of 4 sequences
     S_1 = ACG--GAGA
     S_2 = -CGTTGACA
     S_3 = AC-T-GA-A
     S_4 = CCGTTCAC-

  4. Applications of multiple sequence alignment
     - Align the domains of proteins
     - Align the same genes/proteins from multiple species
     - Help predict protein structure

  5. Sum-of-Pair (SP) Score
     - Consider the multiple alignment S' of S.
     - $\text{SP-score}(a_1, \ldots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j)$, where each $a_i$ can be any character or a space.
     - The SP-score of S' is $\sum_x \text{SP-score}(S'_1[x], \ldots, S'_k[x])$.

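  6. Example: multiple alignment of 4 sequences
     S_1 = ACG--GAGA
     S_2 = -CGTTGACA
     S_3 = AC-T-GA-A
     S_4 = CCGTTCAC-
     - Assume the score of a match is 2 and the score of a mismatch/insert/delete is -2. Two aligned spaces score 0, as the value -6 at position 5 implies.
     - For position 1, SP-score(A,-,A,C) = 2δ(A,-) + 2δ(A,C) + δ(A,A) + δ(C,-) = -8.
     - For position 8, the column (G,C,-,C) has one matching pair and five mismatch/indel pairs, so its score is 2 - 10 = -8.
     - SP-score = -8 + 12 + 0 + 0 - 6 + 0 + 12 - 8 + 0 = 2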
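
The column-by-column computation in this example can be checked with a short script. This is only a sketch: the function names are illustrative, and scoring two aligned gaps as 0 is an assumption inferred from the value at position 5.

```python
from itertools import combinations

GAP = "-"

def delta(a, b, match=2, mismatch=-2):
    """Pair score from the example: 2 for a match, -2 for mismatch/insert/delete.
    Assumption: a pair of two gaps contributes 0."""
    if a == GAP and b == GAP:
        return 0
    return match if a == b else mismatch

def column_score(column):
    """SP-score of one alignment column: sum of delta over all pairs of rows."""
    return sum(delta(a, b) for a, b in combinations(column, 2))

def sp_score(alignment):
    """SP-score of a whole alignment: sum of the column scores."""
    return sum(column_score(col) for col in zip(*alignment))

alignment = ["ACG--GAGA", "-CGTTGACA", "AC-T-GA-A", "CCGTTCAC-"]
print([column_score(col) for col in zip(*alignment)])  # one score per position
print(sp_score(alignment))
```

Printing the per-column list makes it easy to compare each position against the hand calculation.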

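  7. Sum-of-Pair (SP) distance
     - Equivalently, we have SP-dist. Here δ measures the distance between two characters, so a good alignment has a small SP-dist.
     - Consider the multiple alignment S' of S.
     - $\text{SP-dist}(a_1, \ldots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j)$, where each $a_i$ can be any character or a space.
     - The SP-dist of S' is $\sum_x \text{SP-dist}(S'_1[x], \ldots, S'_k[x])$.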

  8. Agenda
     - Exact result
       - Dynamic programming
     - Approximation algorithm
       - Center star method
     - Heuristics
       - ClustalW: progressive alignment
       - MUSCLE: iterative method

  9. Dynamic Programming for aligning two sequences
     - Recall that the optimal alignment for two sequences can be found as follows.
     - Let $V(i_1, i_2)$ be the score of the optimal alignment between $S_1[1..i_1]$ and $S_2[1..i_2]$.
       $V(i_1, i_2) = \max \begin{cases} V(i_1 - 1, i_2 - 1) + \delta(S_1[i_1], S_2[i_2]) \\ V(i_1 - 1, i_2) + \delta(S_1[i_1], \_) \\ V(i_1, i_2 - 1) + \delta(\_, S_2[i_2]) \end{cases}$
     - The equation can be rephrased as
       $V(i_1, i_2) = \max_{(b_1, b_2) \in \{0,1\}^2 - \{(0,0)\}} \{ V(i_1 - b_1, i_2 - b_2) + \delta(S_1[i_1 b_1], S_2[i_2 b_2]) \}$
       where $S_j[0]$ denotes the space character.
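
A minimal sketch of this two-sequence recurrence follows. The scoring values (2 for a match, -2 otherwise) are borrowed from the earlier SP-score example and are an assumption here, as are the function names.

```python
def delta(a, b, match=2, mismatch=-2):
    """Assumed scoring: 2 for a match, -2 for a mismatch or a character against '_'."""
    return match if a == b else mismatch

def pairwise_score(s1, s2):
    """Fill V(i1, i2) bottom-up and return the optimal global alignment score."""
    n1, n2 = len(s1), len(s2)
    V = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):          # first column: s1 prefix against spaces
        V[i][0] = V[i - 1][0] + delta(s1[i - 1], "_")
    for j in range(1, n2 + 1):          # first row: spaces against s2 prefix
        V[0][j] = V[0][j - 1] + delta("_", s2[j - 1])
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + delta(s1[i - 1], s2[j - 1]),  # (b1, b2) = (1, 1)
                V[i - 1][j] + delta(s1[i - 1], "_"),            # (b1, b2) = (1, 0)
                V[i][j - 1] + delta("_", s2[j - 1]),            # (b1, b2) = (0, 1)
            )
    return V[n1][n2]

print(pairwise_score("ACGT", "AGT"))
```

The three branches of `max` correspond one to one to the three non-zero choices of (b_1, b_2).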

  10. Dynamic Programming for aligning k sequences (I)
      - Let $V(i_1, i_2, \ldots, i_k)$ be the SP-score of the optimal alignment of $S_1[1..i_1], S_2[1..i_2], \ldots, S_k[1..i_k]$.
      - Observation: in the last column of the optimal alignment, row j holds either $S_j[i_j]$ or '-'.
      - Hence, the score for the last column is $\text{SP-score}(S_1[b_1 i_1], S_2[b_2 i_2], \ldots, S_k[b_k i_k])$ for some $(b_1, b_2, \ldots, b_k) \in \{0,1\}^k$. The index $b_j i_j$ is a product: it equals $i_j$ when $b_j = 1$ and 0 when $b_j = 0$.
      - (Assume that $S_j[0]$ = '-'.)

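  11. Dynamic programming for aligning k sequences (II)
      - Based on the observation, we have
        $V(i_1, \ldots, i_k) = \max_{(b_1, \ldots, b_k) \in \{0,1\}^k - \{(0,\ldots,0)\}} \{ V(i_1 - b_1, \ldots, i_k - b_k) + \text{SP-score}(S_1[b_1 i_1], \ldots, S_k[b_k i_k]) \}$
        with $V(0, \ldots, 0) = 0$. As in the two-sequence case, the all-zero vector is excluded because it would not consume any character.
      - The SP-score of the optimal multiple alignment of $S = \{S_1, S_2, \ldots, S_k\}$ is $V(n_1, n_2, \ldots, n_k)$, where $n_i$ is the length of $S_i$.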

  12. Dynamic Programming for aligning k sequences (III)
      - By filling in the dynamic programming table, we compute $V(n_1, n_2, \ldots, n_k)$.
      - By back-tracing, we recover the multiple alignment.
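
The table filling and back-tracing for k sequences can be sketched as below. It is a sketch for very small inputs only, under assumed scores (match 2, mismatch/indel -2, gap against gap 0); the dictionary-based table and the name `msa_dp` are illustrative choices, not part of the slides.

```python
from itertools import combinations, product

GAP = "-"

def delta(a, b, match=2, mismatch=-2):
    """Assumed pair score; two gaps score 0."""
    if a == GAP and b == GAP:
        return 0
    return match if a == b else mismatch

def sp_column(col):
    return sum(delta(a, b) for a, b in combinations(col, 2))

def msa_dp(seqs):
    """Exact SP-score alignment of k sequences; returns (score, aligned rows)."""
    k = len(seqs)
    lengths = [len(s) for s in seqs]
    moves = [b for b in product((0, 1), repeat=k) if any(b)]  # all b except (0,...,0)
    V = {(0,) * k: 0}
    back = {}
    # product() yields indices in lexicographic order, so every predecessor
    # idx - b has already been filled when idx is processed.
    for idx in product(*(range(n + 1) for n in lengths)):
        if not any(idx):
            continue
        best = None
        for b in moves:
            prev = tuple(i - bi for i, bi in zip(idx, b))
            if min(prev) < 0:
                continue
            col = tuple(s[i - 1] if bi else GAP for s, i, bi in zip(seqs, idx, b))
            cand = V[prev] + sp_column(col)
            if best is None or cand > best:
                best, back[idx] = cand, (prev, col)
        V[idx] = best
    # Back-trace from (n_1, ..., n_k) to the origin, collecting columns.
    cols, idx = [], tuple(lengths)
    while any(idx):
        idx, col = back[idx]
        cols.append(col)
    rows = ["".join(r) for r in zip(*reversed(cols))]
    return V[tuple(lengths)], rows

score, rows = msa_dp(["ACGT", "AGT", "ACT"])
print(score)
print("\n".join(rows))
```

Each table entry tries every non-zero vector b, which is where the exponential factor in the running time comes from.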

  13. Complexity
      - Time:
        - The table V has $n_1 n_2 \cdots n_k$ entries.
        - Filling in one entry takes $O(2^k k^2)$ time: there are $2^k$ choices of $(b_1, \ldots, b_k)$, and each SP-score sums over $O(k^2)$ pairs.
        - Total running time is $O(2^k k^2 n_1 n_2 \cdots n_k)$.
      - Space:
        - $O(n_1 n_2 \cdots n_k)$ space to store the table V.
      - Dynamic programming is expensive in both time and space. It is rarely used for aligning more than 3 or 4 sequences.

  14. Center star method
      - Computing the optimal multiple alignment takes exponential time.
      - Can we find a good approximation in polynomial time?
      - We introduce the center star method, which approximately minimizes the Sum-of-Pair distance.

  15. Idea
      - Find a string $S_c$.
      - Align all other strings with respect to $S_c$.
      - Illustrated by an example (figure not reproduced).

  16. Converting pair-wise alignment to multiple alignment

  17. Detailed algorithm for the center star method
      - Steps 1 to 4 (the step-by-step figure is not reproduced here).

  18. Running time of the center star method
      Assume all k sequences are of length n.
      - Step 1 takes $O(k^2 n^2)$ time.
      - Step 2 takes $O(k^2)$ time to find the center string $S_c$.
      - Step 3 takes $O(k n^2)$ time to compute the alignment between $S_c$ and $S_i$ for all i.
      - Step 4 introduces spaces into the multiple alignment, which takes $O(k^2 n)$ time.
      In total, the running time is $O(k^2 n^2)$.
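
The four steps can be sketched as follows. This is a hedged illustration, not the slide's exact procedure: it assumes that Step 1 computes all pairwise optimal distances, it uses unit-cost edit distance (0 for a match, 1 for a mismatch or indel) as the distance δ, and the names `align` and `center_star` are hypothetical.

```python
def align(a, b):
    """Global alignment minimizing unit-cost edit distance; returns (dist, a', b')."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    i, j, ra, rb = n, m, [], []
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ra)), "".join(reversed(rb))

def center_star(seqs):
    k = len(seqs)
    dist = [[align(a, b)[0] for b in seqs] for a in seqs]      # Step 1
    c = min(range(k), key=lambda i: sum(dist[i]))              # Step 2
    rows = {c: seqs[c]}                                        # alignment built so far
    for i in range(k):
        if i == c:
            continue
        _, ca, sa = align(seqs[c], seqs[i])                    # Step 3
        old, new = rows[c], {j: [] for j in list(rows) + [i]}  # Step 4: merge
        p = q = 0  # p walks the old center row, q walks the pairwise center row
        while p < len(old) or q < len(ca):
            old_gap = p < len(old) and old[p] == "-"
            new_gap = q < len(ca) and ca[q] == "-"
            if old_gap and not new_gap:      # gap column known only to the old rows
                for j in rows:
                    new[j].append(rows[j][p])
                new[i].append("-"); p += 1
            elif new_gap and not old_gap:    # new sequence inserts against the center
                for j in rows:
                    new[j].append("-")
                new[i].append(sa[q]); q += 1
            else:                            # same center character, or a gap in both
                for j in rows:
                    new[j].append(rows[j][p])
                new[i].append(sa[q]); p += 1; q += 1
        rows = {j: "".join(r) for j, r in new.items()}
    return c, [rows[j] for j in range(k)]

c, rows = center_star(["PPGVKSDCAS", "PADGVKDCAS", "PPDGKSDS", "GADGKDCCS", "GADGKDCAS"])
print("center index:", c)
print("\n".join(rows))
```

The merge keeps every pairwise alignment with the center intact, which is the property the approximation argument relies on. For simplicity the sketch recomputes full alignments in Step 1 instead of only the distances.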

  19. Why is the center star method good? (I)
      - Let M* be the optimal alignment.
      - The SP-dist of M* (the bound shown on the original slide is not reproduced here).

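  20. Why is the center star method good? (II)
      - The SP-dist of M, the alignment produced by the center star method (the bound shown on the original slide is not reproduced here).
      - The SP-dist of M is at most twice that of M* (the optimal alignment).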
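
The factor of two can be derived with the standard argument sketched below. It is a reconstruction, not the text of the slides, and it rests on two assumptions: δ satisfies the triangle inequality, and δ(-,-) = 0. The symbols $D(X,Y)$ (optimal pairwise distance) and $d_M(i,j)$ (distance between rows i and j induced by alignment M) are introduced here for illustration.

```latex
\begin{align*}
\text{SP-dist}(M^*) &= \sum_{i<j} d_{M^*}(i,j)
   \;\ge\; \sum_{i<j} D(S_i,S_j)
   \;=\; \tfrac{1}{2}\sum_{i}\sum_{j\ne i} D(S_i,S_j) \\
  &\ge\; \tfrac{1}{2}\sum_{i}\sum_{j\ne c} D(S_c,S_j)
   \;=\; \tfrac{k}{2}\sum_{j\ne c} D(S_c,S_j)
   && \text{($S_c$ minimizes the row sum)} \\[4pt]
\text{SP-dist}(M) &= \sum_{i<j} d_{M}(i,j)
   \;\le\; \sum_{i<j} \bigl(d_M(i,c) + d_M(c,j)\bigr)
   && \text{(triangle inequality)} \\
  &=\; \sum_{i<j} \bigl(D(S_i,S_c) + D(S_c,S_j)\bigr)
   \;=\; (k-1)\sum_{j\ne c} D(S_c,S_j) \\[4pt]
\frac{\text{SP-dist}(M)}{\text{SP-dist}(M^*)} &\le \frac{2(k-1)}{k} \;<\; 2 .
\end{align*}
```

The equality $d_M(i,c) = D(S_i,S_c)$ holds because the center star method keeps each pairwise alignment with $S_c$ optimal when it merges them.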

  21. Progressive alignment
      - Progressive alignment was first proposed by Feng and Doolittle (1987).
      - It is a heuristic for obtaining a good multiple alignment.
      - Basic idea:
        - Align the two closest sequences.
        - Progressively align the next most closely related sequences until all sequences are aligned.
      - Examples of progressive alignment methods include ClustalW, T-Coffee, and ProbCons.
        - ProbCons is described as the most accurate MSA algorithm at the time of this presentation.
        - ClustalW is the most popular software.

  22. Basic algorithm
      1. Compute pairwise distance scores for all pairs of sequences.
      2. Generate the guide tree, which ensures similar sequences are nearer in the tree.
      3. Align the sequences one by one according to the guide tree.

  23. ClustalW
      - A popular progressive alignment method to globally align a set of sequences.
      - Input: a set of sequences
      - Output: the multiple alignment of these sequences

  24. Step 1: pairwise distance scores
      - Example: for S_1 and S_2, the global alignment is
        S_1 = P-PGVKSDCAS
        S_2 = PADGVK-DCAS
      - There are 9 non-gap positions and 8 match positions.
      - The distance is 1 - 8/9 = 0.111.

      Input sequences:
        S_1: PPGVKSDCAS
        S_2: PADGVKDCAS
        S_3: PPDGKSDS
        S_4: GADGKDCCS
        S_5: GADGKDCAS

      Pairwise distances:
              S_1    S_2    S_3    S_4    S_5
        S_1   0      0.111  0.25   0.555  0.444
        S_2          0      0.375  0.222  0.111
        S_3                 0      0.5    0.5
        S_4                        0      0.111
        S_5                               0
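
The distance used in this example (one minus the fraction of matching positions among the positions without a gap) can be written in a few lines. The function name is hypothetical, and the sketch takes an already computed pairwise alignment as input.

```python
def alignment_distance(row1, row2, gap="-"):
    """1 - (matching positions) / (positions where neither row has a gap)."""
    pairs = [(a, b) for a, b in zip(row1, row2) if a != gap and b != gap]
    matches = sum(a == b for a, b in pairs)
    return 1 - matches / len(pairs)  # raises ZeroDivisionError if no ungapped column

print(round(alignment_distance("P-PGVKSDCAS", "PADGVK-DCAS"), 3))
```

Applied to the aligned pair S_1, S_2 above, the 9 ungapped positions contain 8 matches, which gives the 1 - 8/9 of the example.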

  25. Step 2: generate guide tree
      - By neighbor-joining, generate the guide tree from the pairwise distance matrix of Step 1.
      - (Figure: guide tree with leaves s_1, s_2, s_3, s_4, s_5.)

  26. Step 3: align the sequences according to the guide tree (I)
      - Aligning S_1 and S_2, we get
        S_1 = P-PGVKSDCAS
        S_2 = PADGVK-DCAS
      - Aligning S_4 and S_5, we get
        S_4 = GADGKDCCS
        S_5 = GADGKDCAS

  27. Step 3: align the sequences according to the guide tree (II)
      - Aligning (S_1, S_2) with S_3, we get
        S_1 = P-PGVKSDCAS
        S_2 = PADGVK-DCAS
        S_3 = PPDG-KSD--S
      - Aligning (S_1, S_2, S_3) with (S_4, S_5), we get
        S_1 = P-PGVKSDCAS
        S_2 = PADGVK-DCAS
        S_3 = PPDG-KSD--S
        S_4 = GADG-K-DCCS
        S_5 = GADG-K-DCAS

  28. Summary
      - Input sequences:
        S_1: PPGVKSDCAS
        S_2: PADGVKDCAS
        S_3: PPDGKSDS
        S_4: GADGKDCCS
        S_5: GADGKDCAS
      - Step 1 gives the pairwise distance matrix, and Step 2 gives the guide tree over s_1, ..., s_5.
      - Final alignment:
        S_1: P-PGVKSDCAS
        S_2: PADGVK-DCAS
        S_3: PPDG-KSD--S
        S_4: GADG-K-DCCS
        S_5: GADG-K-DCAS

  29. Detail of Profile-Profile alignment (I)
      - Given two aligned sets of sequences A_1 and A_2.
      - Example:
        - A_1 is a length-11 alignment of S_1, S_2, S_3
          S_1 = P-PGVKSDCAS
          S_2 = PADGVK-DCAS
          S_3 = PPDG-KSD--S
        - A_2 is a length-9 alignment of S_4, S_5
          S_4 = GADGKDCCS
          S_5 = GADGKDCAS
      - Similar to sequence alignment, the profile-profile alignment introduces gaps into A_1 and A_2 so that both of them have the same length.

  30. Detail of Profile-Profile Alignment (II)
      - To determine the alignment, we need a scoring function PSP(A_1[i], A_2[j]).
      - In ClustalW, the score is defined as follows:
        $PSP(A_1[i], A_2[j]) = \sum_{x,y} g_x^i \, g_y^j \, \delta(x,y)$
        where $g_x^i$ is the observed frequency of amino acid x in column i of A_1, and $g_y^j$ is the observed frequency of amino acid y in column j of A_2.
      - This is a natural scoring for maximizing the SP-score.
      - Our aim is to find an alignment between A_1 and A_2 that maximizes the PSP score.

  31. Example
      - A_1[1..11] is the alignment of S_1, S_2, S_3
        S_1 = P-PGVKSDCAS
        S_2 = PADGVK-DCAS
        S_3 = PPDG-KSD--S
      - A_2[1..9] is the alignment of S_4, S_5
        S_4 = GADGKDCCS
        S_5 = GADGKDCAS
      - PSP(A_1[3], A_2[3]) = 1·2·δ(P,D) + 2·2·δ(D,D)
      - PSP(A_1[9], A_2[8]) = 2δ(C,C) + 2δ(C,A) + δ(-,C) + δ(-,A)
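
The two PSP values in this example can be reproduced with a small sketch. The slides do not give numeric values for δ on amino acids, so the scores below (2 for a match, -2 for a mismatch or a character against a gap, 0 for two gaps) are placeholders; the frequencies are taken as occurrence counts, which is what the coefficients in the example suggest.

```python
from collections import Counter

GAP = "-"

def delta(x, y, match=2, mismatch=-2):
    """Placeholder pair score (an assumption, not ClustalW's substitution matrix)."""
    if x == GAP and y == GAP:
        return 0
    return match if x == y else mismatch

def column(alignment, i):
    """Column i of an alignment, using the 1-based indexing of the slides."""
    return [row[i - 1] for row in alignment]

def psp(col1, col2):
    """Sum over symbol pairs (x, y) of count(x) * count(y) * delta(x, y)."""
    g1, g2 = Counter(col1), Counter(col2)
    return sum(n1 * n2 * delta(x, y) for x, n1 in g1.items() for y, n2 in g2.items())

A1 = ["P-PGVKSDCAS", "PADGVK-DCAS", "PPDG-KSD--S"]
A2 = ["GADGKDCCS", "GADGKDCAS"]
print(psp(column(A1, 3), column(A2, 3)))  # 1*2*delta(P,D) + 2*2*delta(D,D)
print(psp(column(A1, 9), column(A2, 8)))  # 2*delta(C,C) + 2*delta(C,A) + delta(-,C) + delta(-,A)
```

Counting symbols first means each distinct pair (x, y) is scored once and weighted, rather than looping over every pair of rows.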

  32. Dynamic Programming
      - Let V(i,j) be the score of the best alignment between A_1[1..i] and A_2[1..j].
      - We have
        $V(i,j) = \max \begin{cases} V(i-1, j-1) + PSP(A_1[i], A_2[j]) \\ V(i-1, j) + PSP(A_1[i], -) \\ V(i, j-1) + PSP(-, A_2[j]) \end{cases}$
      - By filling in the dynamic programming table, we can find the optimal alignment.
      - Time complexity: $O(k_1 n_1 + k_2 n_2 + n_1 n_2)$, where A_1 has $k_1$ sequences of length $n_1$ and A_2 has $k_2$ sequences of length $n_2$.
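
A sketch of this profile-profile recurrence with back-tracing follows. The δ values are the same placeholder scores as an assumption (2 match, -2 mismatch or gap, 0 gap against gap), so the alignment it prints need not coincide with ClustalW's. It also recomputes column counts in every cell instead of precomputing them, so it does not achieve the stated time bound.

```python
from collections import Counter

GAP = "-"

def delta(x, y, match=2, mismatch=-2):
    """Placeholder pair score (assumption); two gaps score 0."""
    if x == GAP and y == GAP:
        return 0
    return match if x == y else mismatch

def psp(col1, col2):
    g1, g2 = Counter(col1), Counter(col2)
    return sum(n1 * n2 * delta(x, y) for x, n1 in g1.items() for y, n2 in g2.items())

def profile_align(A1, A2):
    """Align two alignments column by column; returns (score, merged rows)."""
    cols1, cols2 = list(zip(*A1)), list(zip(*A2))
    n1, n2 = len(cols1), len(cols2)
    gap1, gap2 = (GAP,) * len(A1), (GAP,) * len(A2)  # all-gap columns
    V = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        V[i][0] = V[i - 1][0] + psp(cols1[i - 1], gap2)
    for j in range(1, n2 + 1):
        V[0][j] = V[0][j - 1] + psp(gap1, cols2[j - 1])
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            V[i][j] = max(V[i - 1][j - 1] + psp(cols1[i - 1], cols2[j - 1]),
                          V[i - 1][j] + psp(cols1[i - 1], gap2),
                          V[i][j - 1] + psp(gap1, cols2[j - 1]))
    i, j, merged = n1, n2, []
    while i > 0 or j > 0:  # back-trace, emitting merged columns right to left
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + psp(cols1[i - 1], cols2[j - 1]):
            merged.append(cols1[i - 1] + cols2[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + psp(cols1[i - 1], gap2):
            merged.append(cols1[i - 1] + gap2); i -= 1
        else:
            merged.append(gap1 + cols2[j - 1]); j -= 1
    rows = ["".join(r) for r in zip(*reversed(merged))]
    return V[n1][n2], rows

A1 = ["P-PGVKSDCAS", "PADGVK-DCAS", "PPDG-KSD--S"]
A2 = ["GADGKDCCS", "GADGKDCAS"]
score, rows = profile_align(A1, A2)
print(score)
print("\n".join(rows))
```

Existing gaps inside A_1 and A_2 are never removed; the procedure only inserts whole gap columns into one profile or the other.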

  33. Example
      - By profile-profile alignment, we have
        S_1 = P-PGVKSDCAS
        S_2 = PADGVK-DCAS
        S_3 = PPDG-KSD--S
        S_4 = GADG-K-DCCS
        S_5 = GADG-K-DCAS
