Algorithms in Bioinformatics: A Practical Introduction Multiple Sequence Alignment
Multiple Sequence Alignment Given k sequences $S = \{S_1, S_2, \ldots, S_k\}$, a multiple alignment of S is a set of k equal-length sequences $\{S'_1, S'_2, \ldots, S'_k\}$, where $S'_i$ is obtained by inserting gaps into $S_i$. The multiple sequence alignment problem asks for a multiple alignment that optimizes a given score.
Example: multiple alignment of 4 sequences
S1 = ACG--GAGA
S2 = -CGTTGACA
S3 = AC-T-GA-A
S4 = CCGTTCAC-
Applications of multiple sequence alignment: aligning the domains of proteins; aligning the same genes/proteins from multiple species; helping to predict protein structure.
Sum-of-Pair (SP) Score Consider the multiple alignment S' of S. For a column $(a_1, \ldots, a_k)$, where each $a_i$ can be any character or a space,
$$\text{SP-score}(a_1, \ldots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j).$$
The SP-score of S' is $\sum_x \text{SP-score}(S'_1[x], \ldots, S'_k[x])$.
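Example: multiple alignment of 4 sequences
S1 = ACG--GAGA
S2 = -CGTTGACA
S3 = AC-T-GA-A
S4 = CCGTTCAC-
Assume a match scores 2, a mismatch/insert/delete scores -2, and $\delta(-,-) = 0$. For position 1, SP-score(A,-,A,C) = $2\delta(A,-) + 2\delta(A,C) + \delta(A,A) + \delta(C,-) = -8$. Position 8 holds (G,C,-,C): five of its six pairs score -2 and the pair (C,C) scores 2, giving -8. Summing over the 9 positions, SP-score = -8 + 12 + 0 + 0 - 6 + 0 + 12 - 8 + 0 = 2.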
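The column-by-column computation can be checked mechanically. The following is a minimal Python sketch, not part of the original slides. The function names (`delta`, `sp_column`, `sp_score`) are hypothetical, and it assumes that a pair of two gaps scores 0, which the slide does not state explicitly.

```python
from itertools import combinations

def delta(a, b, match=2, mismatch=-2):
    """Pair score. Assumption: a gap paired with a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    # A letter paired with a gap counts as an insert/delete, scored like a mismatch.
    return match if a == b else mismatch

def sp_column(column):
    """SP-score of one column: sum of delta over all unordered pairs of rows."""
    return sum(delta(a, b) for a, b in combinations(column, 2))

def sp_score(alignment):
    """SP-score of an alignment given as a list of equal-length strings."""
    return sum(sp_column(col) for col in zip(*alignment))

alignment = ["ACG--GAGA", "-CGTTGACA", "AC-T-GA-A", "CCGTTCAC-"]
column_scores = [sp_column(col) for col in zip(*alignment)]
print(column_scores, sp_score(alignment))
```

Printing the per-column scores next to the total makes it easy to compare each term with a hand calculation.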
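Sum-of-Pair (SP) distance Equivalently, we can work with an SP distance, where $\delta$ is a distance (cost) function and the aim is to minimize the total. Consider the multiple alignment S' of S. For a column $(a_1, \ldots, a_k)$, where each $a_i$ can be any character or a space,
$$\text{SP-dist}(a_1, \ldots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j).$$
The SP-dist of S' is $\sum_x \text{SP-dist}(S'_1[x], \ldots, S'_k[x])$.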
Agenda Exact result: dynamic programming. Approximation algorithm: center star method. Heuristics: ClustalW (progressive alignment) and MUSCLE (iterative method).
Dynamic Programming for aligning two sequences Recall that the optimal alignment for two sequences can be found as follows. Let $V(i_1, i_2)$ be the score of the optimal alignment between $S_1[1..i_1]$ and $S_2[1..i_2]$. Then
$$V(i_1, i_2) = \max \begin{cases} V(i_1 - 1, i_2 - 1) + \delta(S_1[i_1], S_2[i_2]) \\ V(i_1 - 1, i_2) + \delta(S_1[i_1], \_) \\ V(i_1, i_2 - 1) + \delta(\_, S_2[i_2]) \end{cases}$$
The equation can be rephrased as
$$V(i_1, i_2) = \max_{(b_1, b_2) \in \{0,1\}^2 - \{(0,0)\}} \left\{ V(i_1 - b_1, i_2 - b_2) + \delta(S_1[b_1 i_1], S_2[b_2 i_2]) \right\}$$
where $S_j[0]$ denotes the gap symbol.
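As a concrete reading of the two-sequence recurrence, here is a minimal Python sketch that fills the table V and returns the optimal score. The function name and the scoring values (match 2, mismatch -2, gap -2) are assumptions chosen for illustration.

```python
def global_alignment_score(s1, s2, match=2, mismatch=-2, gap=-2):
    """Optimal global alignment score of s1 and s2 (score only, no traceback)."""
    n1, n2 = len(s1), len(s2)
    # V[i][j] = best score for aligning s1[:i] with s2[:j].
    V = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):          # s1 prefix against the empty string
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, n2 + 1):          # empty string against s2 prefix
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + d,    # align s1[i] with s2[j]
                V[i - 1][j] + gap,      # align s1[i] with a gap
                V[i][j - 1] + gap,      # align a gap with s2[j]
            )
    return V[n1][n2]
```

For example, `global_alignment_score("ACGT", "AGT")` aligns three matching letters and one gap.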
Dynamic Programming for aligning k sequences (I) Let $V(i_1, i_2, \ldots, i_k)$ be the SP-score of the optimal alignment of $S_1[1..i_1], S_2[1..i_2], \ldots, S_k[1..i_k]$. Observation: in the last column of the optimal alignment, row j holds either $S_j[i_j]$ or '-'. Hence, the score of the last column is SP-score($S_1[b_1 i_1], S_2[b_2 i_2], \ldots, S_k[b_k i_k]$) for some $(b_1, b_2, \ldots, b_k) \in \{0,1\}^k - \{(0, \ldots, 0)\}$, where $b_j = 1$ means that row j holds $S_j[i_j]$. (Assume that $S_j[0]$ = '-'.)
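Dynamic programming for aligning k sequences (II) Based on the observation, we have
$$V(i_1, \ldots, i_k) = \max_{(b_1, \ldots, b_k) \in \{0,1\}^k - \{(0, \ldots, 0)\}} \left\{ V(i_1 - b_1, \ldots, i_k - b_k) + \text{SP-score}(S_1[b_1 i_1], \ldots, S_k[b_k i_k]) \right\}$$
The all-zero vector is excluded, as in the two-sequence case, because it would add a column of gaps without consuming any character. The SP-score of the optimal multiple alignment of $S = \{S_1, S_2, \ldots, S_k\}$ is $V(n_1, n_2, \ldots, n_k)$, where $n_i$ is the length of $S_i$.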
Dynamic Programming for aligning k sequences (III) By filling in the dynamic programming table, we compute $V(n_1, n_2, \ldots, n_k)$. By back-tracing, we recover the multiple alignment.
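The k-sequence recurrence and the back-tracing step can be sketched in Python as follows. This is an illustrative sketch only: `msa_dp` and its helpers are hypothetical names, the scoring (match 2, mismatch/indel -2, gap-gap 0) is an assumption, and the memoized recursion is practical only for a handful of very short sequences. The all-zero choice of $(b_1, \ldots, b_k)$ is skipped because it would not consume any character.

```python
from functools import lru_cache
from itertools import combinations, product

def delta(a, b, match=2, mismatch=-2):
    """Pair score. Assumption: a gap paired with a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    return match if a == b else mismatch

def msa_dp(seqs):
    """Return (optimal SP-score, aligned rows) by exhaustive dynamic programming."""
    k = len(seqs)
    # All vectors b in {0,1}^k except the all-zero vector.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    def column(idx, b):
        # Row j contributes S_j[i_j] if b_j = 1, otherwise a gap.
        return tuple(seqs[j][idx[j] - 1] if b[j] else "-" for j in range(k))

    def col_score(col):
        return sum(delta(x, y) for x, y in combinations(col, 2))

    @lru_cache(maxsize=None)
    def V(idx):
        """Best (score, last move) for the prefixes ending at idx."""
        if not any(idx):
            return 0, None
        best = None
        for b in moves:
            if any(bj > ij for bj, ij in zip(b, idx)):
                continue  # cannot consume a character from an empty prefix
            prev = tuple(i - bj for i, bj in zip(idx, b))
            score = V(prev)[0] + col_score(column(idx, b))
            if best is None or score > best[0]:
                best = (score, b)
        return best

    idx = tuple(len(s) for s in seqs)
    total = V(idx)[0]
    cols = []
    while any(idx):                      # back-trace from (n_1, ..., n_k)
        b = V(idx)[1]
        cols.append(column(idx, b))
        idx = tuple(i - bj for i, bj in zip(idx, b))
    rows = ["".join(col[j] for col in reversed(cols)) for j in range(k)]
    return total, rows
```

Calling `msa_dp(["ACG", "AG"])` reduces to the two-sequence case, which is a convenient sanity check before trying three or more sequences.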
Complexity Time: the table V has $n_1 n_2 \cdots n_k$ entries. Filling in one entry takes $O(2^k k^2)$ time. The total running time is $O(2^k k^2 n_1 n_2 \cdots n_k)$. Space: $O(n_1 n_2 \cdots n_k)$ to store the table V. Dynamic programming is expensive in both time and space. It is rarely used for aligning more than 3 or 4 sequences.
Center star method Computing the optimal multiple alignment by dynamic programming takes exponential time. Can we find a good approximation in polynomial time? We introduce the center star method, which approximately minimizes the sum-of-pair distance.
Idea Find a center string $S_c$. Align all other strings with respect to $S_c$.
Converting pair-wise alignment to multiple alignment
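Detailed algorithm for the center star method
Step 1: compute the optimal pairwise alignment distance $D(S_i, S_j)$ for all pairs i, j.
Step 2: find the center string $S_c$ that minimizes $\sum_i D(S_c, S_i)$.
Step 3: for every $S_i \ne S_c$, compute the optimal alignment between $S_c$ and $S_i$.
Step 4: introduce spaces into $S_c$ so that the pairwise alignments merge into one multiple alignment.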
Running time of center star method Assume all k sequences are of length n. Step 1 takes $O(k^2 n^2)$ time. Step 2 takes $O(k^2)$ time to find the center string $S_c$. Step 3 takes $O(k n^2)$ time to compute the alignment between $S_c$ and $S_i$ for all i. Step 4 introduces spaces into the multiple alignment, which takes $O(k^2 n)$ time. In total, the running time is $O(k^2 n^2)$.
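The four steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the slides' own code: the names `align_pair`, `merge` and `center_star` are hypothetical, the distance is unit cost (mismatch 1, gap 1), and ties for the center are broken by taking the first candidate.

```python
def align_pair(s, t, mismatch=1, gap=1):
    """Minimum-distance global alignment; returns (distance, s_aligned, t_aligned)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub, D[i - 1][j] + gap, D[i][j - 1] + gap)
    a, b, i, j = [], [], n, m            # trace back from the bottom-right cell
    while i > 0 or j > 0:
        sub = mismatch if i > 0 and j > 0 and s[i - 1] != t[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def merge(rows, c, ca, sa, new_id):
    """Add sequence new_id to the alignment `rows` (dict id -> gapped row).

    ca/sa is the pairwise alignment of the center with the new sequence;
    rows[c] is the center as it currently appears in the multiple alignment.
    """
    center = rows[c]
    out = {i: [] for i in rows}
    new_row, p, q = [], 0, 0
    while p < len(center) or q < len(ca):
        msa_gap = p < len(center) and center[p] == "-"
        pair_gap = q < len(ca) and ca[q] == "-"
        if msa_gap and not pair_gap:     # column only in the multiple alignment
            for i in rows:
                out[i].append(rows[i][p])
            new_row.append("-"); p += 1
        elif pair_gap and not msa_gap:   # column only in the pairwise alignment
            for i in rows:
                out[i].append("-")
            new_row.append(sa[q]); q += 1
        else:                            # same center letter (or a gap in both)
            for i in rows:
                out[i].append(rows[i][p])
            new_row.append(sa[q]); p += 1; q += 1
    merged = {i: "".join(r) for i, r in out.items()}
    merged[new_id] = "".join(new_row)
    return merged

def center_star(seqs):
    """Return (index of the center, aligned rows in input order)."""
    k = len(seqs)
    # Steps 1-2: all pairwise distances, then the sequence with the smallest total.
    dist = [[align_pair(seqs[i], seqs[j])[0] for j in range(k)] for i in range(k)]
    c = min(range(k), key=lambda i: sum(dist[i]))
    # Steps 3-4: align every other sequence to the center and merge.
    rows = {c: seqs[c]}
    for i in range(k):
        if i != c:
            _, ca, sa = align_pair(seqs[c], seqs[i])
            rows = merge(rows, c, ca, sa, i)
    return c, [rows[i] for i in range(k)]
```

For instance, `center_star(["ACGT", "AGT", "ACT"])` picks the first sequence as the center, since all three have the same total distance under the unit costs assumed here.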
Why is the center star method good? (I) Let M* be the optimal alignment, and consider the SP-dist of M*.
Why is the center star method good? (II) Let M be the alignment produced by the center star method. The SP-dist of M is at most twice that of M* (the optimal alignment).
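The factor of two follows from a standard argument, sketched here in LaTeX. The notation is an assumption introduced for this sketch: $D(X, Y)$ is the optimal pairwise alignment distance, $d_A(i, j)$ is the distance of the pairwise alignment that a multiple alignment $A$ induces on $S_i$ and $S_j$, and $\delta$ is assumed to satisfy the triangle inequality with $\delta(-,-) = 0$. The sketch also uses the fact that M preserves the optimal alignment between the center $S_c$ and every $S_j$, so $d_M(c, j) = D(S_c, S_j)$.

```latex
% Lower bound on the optimal alignment M*: each induced pairwise alignment
% costs at least the optimal pairwise distance, and S_c minimizes the row sum.
\begin{align*}
\text{SP-dist}(M^*) &= \sum_{1 \le i < j \le k} d_{M^*}(i, j)
  \;\ge\; \sum_{1 \le i < j \le k} D(S_i, S_j)
  \;=\; \frac{1}{2} \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j) \\
  &\ge\; \frac{k}{2} \sum_{j \ne c} D(S_c, S_j).
\end{align*}

% Upper bound on the center star alignment M, by the triangle inequality.
\begin{align*}
\text{SP-dist}(M) &= \sum_{1 \le i < j \le k} d_M(i, j)
  \;\le\; \sum_{1 \le i < j \le k} \bigl( D(S_c, S_i) + D(S_c, S_j) \bigr)
  \;=\; (k - 1) \sum_{j \ne c} D(S_c, S_j).
\end{align*}

% Combining the two bounds.
\[
\frac{\text{SP-dist}(M)}{\text{SP-dist}(M^*)} \;\le\; \frac{2(k-1)}{k} \;<\; 2 .
\]
```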
Progressive alignment Progressive alignment was first proposed by Feng and Doolittle (1987). It is a heuristic for obtaining a good multiple alignment. Basic idea: align the two closest sequences, then progressively align the next most closely related sequences until all sequences are aligned. Examples of progressive alignment methods include ClustalW, T-Coffee and ProbCons. ProbCons is currently regarded as one of the most accurate MSA algorithms. ClustalW is the most popular software.
Basic algorithm
1. Compute pairwise distance scores for all pairs of sequences.
2. Generate the guide tree, which ensures that similar sequences are nearer in the tree.
3. Align the sequences one by one according to the guide tree.
ClustalW A popular progressive alignment method to globally align a set of sequences. Input: a set of sequences Output: the multiple alignment of these sequences
Step 1: pairwise distance scores
Input sequences:
S1: PPGVKSDCAS
S2: PADGVKDCAS
S3: PPDGKSDS
S4: GADGKDCCS
S5: GADGKDCAS
Example: for S1 and S2, the global alignment is
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
There are 9 non-gap positions and 8 match positions. The distance is 1 - 8/9 = 0.111.
Pairwise distances:
      S1   S2     S3     S4     S5
S1    0    0.111  0.25   0.555  0.444
S2         0      0.375  0.222  0.111
S3                0      0.5    0.5
S4                       0      0.111
S5                              0
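The distance used in this step can be computed directly from a pairwise alignment. The sketch below assumes the two rows are already globally aligned; the function name `alignment_distance` is hypothetical.

```python
def alignment_distance(a, b):
    """Distance = 1 - (matches / positions where neither row has a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    # Keep only the columns in which both rows hold a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return 1 - matches / len(pairs)

d12 = alignment_distance("P-PGVKSDCAS", "PADGVK-DCAS")
print(round(d12, 3))
```

Applying the same function to every aligned pair fills the upper triangle of the distance matrix.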
Step 2: generate guide tree By neighbor-joining, generate the guide tree from the pairwise distance matrix of Step 1.
[Figure: guide tree with leaves s1, s2, s3, s4, s5]
Step 3: align the sequences according to the guide tree (I)
Aligning S1 and S2, we get
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
Aligning S4 and S5, we get
S4 = GADGKDCCS
S5 = GADGKDCAS
Step 3: align the sequences according to the guide tree (II)
Aligning (S1, S2) with S3, we get
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
Aligning (S1, S2, S3) with (S4, S5), we get
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
S4 = GADG-K-DCCS
S5 = GADG-K-DCAS
Summary
Input sequences:
S1: PPGVKSDCAS
S2: PADGVKDCAS
S3: PPDGKSDS
S4: GADGKDCCS
S5: GADGKDCAS
Pairwise distances:
      S1   S2     S3     S4     S5
S1    0    0.111  0.25   0.555  0.444
S2         0      0.375  0.222  0.111
S3                0      0.5    0.5
S4                       0      0.111
S5                              0
Final alignment, following the guide tree over s1, s2, s3, s4, s5:
S1: P-PGVKSDCAS
S2: PADGVK-DCAS
S3: PPDG-KSD--S
S4: GADG-K-DCCS
S5: GADG-K-DCAS
Detail of Profile-Profile alignment (I) Given two aligned sets of sequences $A_1$ and $A_2$.
Example: $A_1$ is a length-11 alignment of S1, S2, S3
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
$A_2$ is a length-9 alignment of S4, S5
S4 = GADGKDCCS
S5 = GADGKDCAS
As in sequence alignment, the profile-profile alignment introduces gaps into $A_1$ and $A_2$ so that both of them have the same length.
Detail of Profile-Profile Alignment (II) To determine the alignment, we need a scoring function $PSP(A_1[i], A_2[j])$. In ClustalW, the score is defined as follows:
$$PSP(A_1[i], A_2[j]) = \sum_{x, y} g^i_x \, g^j_y \, \delta(x, y)$$
where $g^i_x$ is the observed frequency (number of occurrences) of amino acid x in column i of $A_1$, and $g^j_y$ is the observed frequency of amino acid y in column j of $A_2$. This is a natural scoring for maximizing the SP-score. Our aim is to find an alignment between $A_1$ and $A_2$ that maximizes the PSP score.
Example $A_1[1..11]$ is the alignment of S1, S2, S3
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
$A_2[1..9]$ is the alignment of S4, S5
S4 = GADGKDCCS
S5 = GADGKDCAS
$PSP(A_1[3], A_2[3]) = 1 \times 2 \times \delta(P,D) + 2 \times 2 \times \delta(D,D)$
$PSP(A_1[9], A_2[8]) = 2\delta(C,C) + 2\delta(C,A) + \delta(-,C) + \delta(-,A)$
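The two PSP values in this example can be evaluated with a short Python sketch. The names `psp` and `column` are hypothetical, and the concrete $\delta$ (match 2, mismatch or letter-gap -2, gap-gap 0) is an assumption made only so that the sums produce numbers; the slide itself leaves $\delta$ symbolic.

```python
from collections import Counter

def delta(x, y, match=2, mismatch=-2):
    """Assumed pair score: gap-gap 0, equal letters 2, anything else -2."""
    if x == "-" and y == "-":
        return 0
    return match if x == y else mismatch

def psp(col1, col2):
    """Sum over symbol pairs (x, y) of count_1(x) * count_2(y) * delta(x, y)."""
    g1, g2 = Counter(col1), Counter(col2)
    return sum(n1 * n2 * delta(x, y) for x, n1 in g1.items() for y, n2 in g2.items())

def column(alignment, i):
    """Column i of an alignment, 1-based as on the slides."""
    return [row[i - 1] for row in alignment]

A1 = ["P-PGVKSDCAS", "PADGVK-DCAS", "PPDG-KSD--S"]
A2 = ["GADGKDCCS", "GADGKDCAS"]

print(psp(column(A1, 3), column(A2, 3)))   # 1*2*delta(P,D) + 2*2*delta(D,D)
print(psp(column(A1, 9), column(A2, 8)))   # 2*delta(C,C) + 2*delta(C,A) + delta(-,C) + delta(-,A)
```

Counting symbols first and multiplying the counts gives the same total as summing $\delta$ over every cross pair of rows, which is why PSP fits the SP-score.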
Dynamic Programming Let $V(i,j)$ be the score of the best alignment between $A_1[1..i]$ and $A_2[1..j]$. We have
$$V(i,j) = \max \begin{cases} V(i-1, j-1) + PSP(A_1[i], A_2[j]) \\ V(i-1, j) + PSP(A_1[i], -) \\ V(i, j-1) + PSP(-, A_2[j]) \end{cases}$$
By filling in the dynamic programming table, we can find the optimal alignment. Time complexity: $O(k_1 n_1 + k_2 n_2 + n_1 n_2)$, where $k_1$ and $k_2$ are the numbers of sequences in $A_1$ and $A_2$, and $n_1$ and $n_2$ are their lengths.
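The profile-profile recurrence can be sketched in Python as follows. This is an illustrative sketch, not ClustalW's actual implementation: `align_profiles` is a hypothetical name, the $\delta$ values (match 2, mismatch or letter-gap -2, gap-gap 0) are assumptions, and a whole-column gap is modelled as a column made of gap symbols. For simplicity it recomputes the column counts in every cell instead of precomputing the profiles.

```python
from collections import Counter

def delta(x, y, match=2, mismatch=-2):
    """Assumed pair score: gap-gap 0, equal letters 2, anything else -2."""
    if x == "-" and y == "-":
        return 0
    return match if x == y else mismatch

def psp(col1, col2):
    g1, g2 = Counter(col1), Counter(col2)
    return sum(n1 * n2 * delta(x, y) for x, n1 in g1.items() for y, n2 in g2.items())

def align_profiles(A1, A2):
    """Align two alignments; returns (PSP score, merged rows: A1's rows then A2's)."""
    n1, n2 = len(A1[0]), len(A2[0])
    cols1 = [[r[i] for r in A1] for i in range(n1)]
    cols2 = [[r[j] for r in A2] for j in range(n2)]
    gap1, gap2 = ["-"] * len(A1), ["-"] * len(A2)   # all-gap columns
    V = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        V[i][0] = V[i - 1][0] + psp(cols1[i - 1], gap2)
    for j in range(1, n2 + 1):
        V[0][j] = V[0][j - 1] + psp(gap1, cols2[j - 1])
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + psp(cols1[i - 1], cols2[j - 1]),
                V[i - 1][j] + psp(cols1[i - 1], gap2),
                V[i][j - 1] + psp(gap1, cols2[j - 1]),
            )
    out1, out2, i, j = [], [], n1, n2               # trace back
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + psp(cols1[i - 1], cols2[j - 1]):
            out1.append(cols1[i - 1]); out2.append(cols2[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + psp(cols1[i - 1], gap2):
            out1.append(cols1[i - 1]); out2.append(gap2); i -= 1
        else:
            out1.append(gap1); out2.append(cols2[j - 1]); j -= 1
    out1.reverse(); out2.reverse()
    rows = ["".join(col[r] for col in out1) for r in range(len(A1))]
    rows += ["".join(col[r] for col in out2) for r in range(len(A2))]
    return V[n1][n2], rows
```

Under a different $\delta$ (for example a substitution matrix with affine gap penalties) the optimal alignment may differ, so the output of this sketch need not match a published alignment column for column.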
Example By profile-profile alignment, we have
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
S4 = GADG-K-DCCS
S5 = GADG-K-DCAS