BINF6201/8201 Sequence alignment algorithms 10-04-2016
What is an efficient algorithm ? Ø Our first goal is to make a pairwise alignment between two sequences that reveals the evolutionary relationship between them using a computational method or an algorithm. Ø An algorithm is a series of executable step-by-step instructions that lead to a solution to a problem. Ø For some problems, there are no algorithms to solve them, therefore they cannot be solved on any computer. Ø For some other problems, there are algorithms to solve them. Ø The speed of an algorithm can increase quickly or slowly when the size of the problem increases. Ø If the speed of an algorithm increases quickly when the size increases, the problem cannot be solved in a practical period of time when its size is large. Ø Computer scientists strive to find a faster algorithm for computer- solvable problems.
Sorting algorithms Ø The sorting problem: given a list of numbers, put them in a ascending/descending order. Ø The insertion sorting algorithm: For each number from the second number to the last number, do Compare it with the numbers before it; If the number before it is greater than it, swap the two numbers; If the number before it is less than it, stop comparison. Ø If there are n numbers in the list, in the worst case, the algorithm needs to do (1+2+…+n) comparisons, which is in the order of n 2 . Ø We say the program has O ( n 2 ) time complexity.
Merger two sorted lists in one sorted list Ø There are other sorting algorithms running at the same speed as or slower, or faster speed than the insertion sorting algorithm. The known fastest one is the merger sorting algorithm. Ø An algorithm to merger two sorted lists of numbers in a larger sorted list: 9 9 27 18 Merge 22 18 35 27 35 22 49 49 71 71 82 82 Merger algorithm: When two lists are not empty do: Compare the first two numbers in the two lists; put the smaller in the third list in order; Copy the number in the non-empty list to the third list in order.
Sorting algorithms Ø The merger sorting algorithm: First, divide the unsorted list into two equal parts repetitively until only one or no number remains, then apply the merger sorting procedure to each part starting from the smallest. Ø If there are n numbers 71 22 9 18 35 82 27 49 in the list, in the worst case, the algorithm needs to do nlog 2 n 71 22 9 18 35 82 27 49 comparisons. Ø We say that the 71 22 9 18 35 82 27 49 program has O ( nlog 2 n ) time complexity. 71 9 35 27 22 18 82 49 Ø Therefore the merger 22 71 9 18 35 82 27 49 sorting algorithm is faster than the 9 18 22 71 27 35 49 82 insertion sorting algorithm when n is 9 18 22 27 35 49 71 82 big.
Polynomial and Non-polynomial algorithms Ø Both insertion sorting and merger sorting algorithms belong to a class of algorithms called the polynomial algorithms (P) , because they run in the order of O ( n r ) , where r is a constant. Ø All practical algorithms belong to P. Ø For many theoretically computer-solvable problems, a polynomial algorithm has not been found. Therefore, they can only be exactly solved in a very small scale using a non-polynomial algorithm. A subset of these problems are called NP-hard problems Ø The travelling salesman problem (TSP) is one of NP-hard problems: Find the shortest path for a salesman to visit each of n cities exactly once and then return to his starting city. Ø Often, we can use a heuristic algorithm to solve these problems approximately without guarantee of an optimal solution.
Pairwise sequence alignment algorithms Ø An example of a pairwise alignment between two sequences: a: CAGT-AGATATTTACGGCAGTATC---- b: CAATCAGGATTTT--GGCAGACTGGTTG Ø Here, in order to achieve a better alignment between the two sequences, indels or gaps have to be included. Ø We already know how to score a pairwise gap-free alignment, using l the formula, ∑ S ( a , b ) S ( a , b ) . = alignemnt k k k 1 = Ø When a gap is introduced in the alignment, a penalty score must be given to reflect such a potentially deleterious mutational event. Ø The commonly used penalty scoring functions include, Linear penalty : W ( l ) gl , and = Affine gap penalty : W ( l ) g g ( l 1 ). = + − open ext Where l is the length of the gap; g , penalty for one space; g open , penalty for opening a gap; g ext , penalty for extending a gap by one space.
Pairwise sequence alignment algorithms Ø The affine gap penalty function is close enough to the ideal gap penalty function shown in the figure, which should increases fast when l is small, and slowly when l is large. CAGT-AGATATTTACGGCAGTATC---- CAATCAGGATTTT--GGCAGACTGGTTG Ø Given a scoring matrix such as PAM250 or BLOSUM50, and a gap penalty function, we can compute the score of an alignment as, S ( a , b ) alignemnt ∑ ∑ S ( a , b ) W ( l ) . = − k k aligned gaps pairs
Pairwise sequence alignment algorithms Ø The pairwise alignment problem can be formulated as follows. Given a scoring system for matches and gaps, find the alignment of two sequences that has the maximum score among all possible ways to align the two sequences. ∑ ∑ S ( a , b ) S ( a , b ) W ( l ) . = − k k aligned gaps pairs Ø If sequence a has a length m , and sequence b has a length n , then they are ways to align them. m n m n + + ⎛ ⎞ ⎛ ⎞ ⎜ ⎟ ⎜ ⎟ = ⎜ ⎟ ⎜ ⎟ n m ⎝ ⎠ ⎝ ⎠ Ø To derive this formula, think the ways that we can intercalate the two sequences while preserving their respective order. Sequence a: This alignment corresponds to CAGT-AGATA the following intercalation: CAATCAGGAT Sequence b: CCAAGATTCAAGGAGTAAT Intercalation:
Pairwise sequence alignment algorithm Ø Therefore, if we want to find the optimal alignment by evaluating all of these possible alignments, the solution is not polynomial, in stead, it is exponential, because if m = n , then, m n 2 n ( 2 n )! 2 + ⎛ ⎞ ⎛ ⎞ 2 n . ⎜ ⎟ ⎜ ⎟ = = ≈ ⎜ ⎟ ⎜ ⎟ 2 n n ( n ! ) n π ⎝ ⎠ ⎝ ⎠ Ø Fortunately, a polynomial algorithm ( O ( mn )) was found to align two sequences globally--- the Needleman and Wunsch algorithm (1970), using a dynamic programming method. Ø Dynamic programming is a type of algorithms to find an optimal value of a function based on previously computed optimal values. Ø A dynamic programming algorithm speeds up the computation by avoiding repetitive calculations.
Needleman-Wunsch global alignment algorithm Ø Let a and b be two sequences of length m and n , and H ( s, t ) be the score to align the first s letters in a with the first t letter in b : H ( s, t ) a : a 1 a 2 ... a s − 1 a s ... a m s m b : b 1 b 2 ... b t − 1 b t ... b n n t Ø Using linear gap penalty function, we can compute H ( i, j ) based on the three possible ways to add the i-th and j-th letter to the previous alignments, and choose the alignment that maximizes H ( i, j ) . H ( i-1, j-1 ) S ( a i b , ) j a i align with b j i-1 i H ( i , j ) H ( i 1 , j 1 ) S ( a i b , ) = − − + j j-1 j a i align with a gap b j align with a gap g − g H ( i-1, j ) H ( i, j-1 ) − H ( i , j ) H ( i , j ) i-1 i i H ( i 1 , j ) g H ( i , j 1 ) g j j-1 j = − − = − −
Needleman-Wunsch global alignment algorithm Ø If we use the linear gap penalty function, then mathematically, we have the following recursion relations : H ( i 1 , j 1 ) S ( a , b ) − − + ⎧ i j ⎪ H ( i , j ) max H ( i 1 , j ) g = − − ⎨ ⎪ H ( i , j 1 ) g − − ⎩ with the initial condition H (0,0)=0, H ( i ,0)=- ig , and H (0, j )=- jg. Ø This corresponds to filling out the matrix H with the elements being H ( i,j ). H is called the dynamic programming matrix of a and b . Ø We fill the matrix H from the upper-left corner to the bottom-right corner. Ø To recover the optimal alignment later, we use another matrix to keep the track on how each H ( i,j ) is computed. Ø As H has mn elements to fill, this algorithm runs in the order of O ( mn ) .
Needleman-Wunsch global alignment algorithm Ø Initialize the first row and column, and then compute each cell from the upper left corner to the bottom right corner. b : b b b ... b b ... b 0 1 2 j 1 j n − a : a 0 - g - 2 g ... - ( j- 1 ) g - jg ... - ng 0 s ( a 1 b , ) + g − 1 a - g g H ( 1 , 1 ) 1 − a - 2 g 2 ... ... a - ( i- 1 ) g H ( i 1 , j ) H ( i 1 , j 1 ) − − − i 1 − g − S ( a i b , ) + j a - ig H ( i , j ) H ( i , j 1 ) − i g − ... ... a - mg m
Needleman-Wunsch global alignment algorithm Ø Let’s use an example to illustrate this matrix filling process. Ø We want to align the following two amino acid sequences using the PAM250 scoring matrix and a linear gap penalty function W( l ) = -6 l. a: SHAKE b: SPEARE Amino acid alignment scores taken from PAM250 2 1 1 0
Recommend
More recommend