Algorithms in Bioinformatics: A Practical Introduction Multiple - PowerPoint PPT Presentation

Algorithms in Bioinformatics: A Practical Introduction Multiple Sequence Alignment

Multiple Sequence Alignment  Given k sequences $S = \{S_1, S_2, \ldots, S_k\}$.  A multiple alignment of S is a set of k equal-length sequences $\{S'_1, S'_2, \ldots, S'_k\}$, where $S'_i$ is obtained by inserting gaps into $S_i$.  The multiple sequence alignment problem aims to find a multiple alignment that optimizes a given score.

Example: multiple alignment of 4 sequences
S1 = ACG--GAGA
S2 = -CGTTGACA
S3 = AC-T-GA-A
S4 = CCGTTCAC-

Applications of multiple sequence alignment  Align the domains of proteins.  Align the same genes/proteins from multiple species.  Help predict protein structure.

Sum-of-Pair (SP) Score  Consider the multiple alignment S' of S.
$$\text{SP-score}(a_1, \ldots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j)$$
where each $a_i$ can be any character or a space.  The SP-score of S' is
$$\sum_{x} \text{SP-score}(S'_1[x], \ldots, S'_k[x]).$$

Example: multiple alignment of 4 sequences
S1 = ACG--GAGA
S2 = -CGTTGACA
S3 = AC-T-GA-A
S4 = CCGTTCAC-
Assume the score of a match is 2 and the score of a mismatch/insert/delete is -2. A pair of two spaces scores 0, as the -6 at position 5 implies.
For position 1, SP-score(A,-,A,C) = 2δ(A,-) + 2δ(A,C) + δ(A,A) + δ(C,-) = -8
For position 8, SP-score(G,C,-,C) = 2δ(G,C) + δ(G,-) + 2δ(C,-) + δ(C,C) = -8
SP-score = -8 + 12 + 0 + 0 - 6 + 0 + 12 - 8 + 0 = 2
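The column-by-column arithmetic can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, the match/mismatch values are the 2 and -2 of the example, and it assumes a pair of two spaces scores 0.

```python
from itertools import combinations

def delta(a, b, match=2, mismatch=-2):
    """Pair score: match 2, mismatch/insert/delete -2 (values from the example).
    Assumption: a pair of two spaces scores 0."""
    if a == "-" and b == "-":
        return 0
    return match if a == b else mismatch

def sp_column(column):
    """SP-score of one alignment column: sum of delta over all pairs i < j."""
    return sum(delta(a, b) for a, b in combinations(column, 2))

def sp_score(alignment):
    """SP-score of a multiple alignment given as equal-length gapped strings."""
    assert len({len(row) for row in alignment}) == 1, "rows must have equal length"
    return sum(sp_column(col) for col in zip(*alignment))

alignment = ["ACG--GAGA", "-CGTTGACA", "AC-T-GA-A", "CCGTTCAC-"]
column_scores = [sp_column(col) for col in zip(*alignment)]
total = sp_score(alignment)
```

Here `column_scores` holds the per-position terms of the sum and `total` is the SP-score of the whole alignment.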

Sum-of-Pair (SP) distance  Equivalently, we have the SP-dist, where $\delta$ is read as a distance (cost) and the aim is to minimize it.  Consider the multiple alignment S' of S.
$$\text{SP-dist}(a_1, \ldots, a_k) = \sum_{1 \le i < j \le k} \delta(a_i, a_j)$$
where each $a_i$ can be any character or a space.  The SP-dist of S' is
$$\sum_{x} \text{SP-dist}(S'_1[x], \ldots, S'_k[x]).$$

Agenda
Exact result: dynamic programming
Approximation algorithm: center star method
Heuristics: ClustalW (progressive alignment), MUSCLE (iterative method)

Dynamic Programming for aligning two sequences  Recall that the optimal alignment for two sequences can be found as follows.  Let $V(i_1, i_2)$ be the score of the optimal alignment between $S_1[1..i_1]$ and $S_2[1..i_2]$.
$$V(i_1, i_2) = \max \begin{cases} V(i_1 - 1, i_2 - 1) + \delta(S_1[i_1], S_2[i_2]) \\ V(i_1 - 1, i_2) + \delta(S_1[i_1], \_) \\ V(i_1, i_2 - 1) + \delta(\_, S_2[i_2]) \end{cases}$$
The equation can be rephrased as
$$V(i_1, i_2) = \max_{(b_1, b_2) \in \{0,1\}^2 - \{(0,0)\}} \left\{ V(i_1 - b_1, i_2 - b_2) + \delta(S_1[i_1 b_1], S_2[i_2 b_2]) \right\}$$
where $S_j[0]$ denotes the space "_".
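The three-case recurrence can be sketched as a table fill. This is an illustrative sketch only: the function names and the score values (match 2, everything else -2) are assumptions carried over from the SP-score example, not something this slide fixes.

```python
def delta(a, b, match=2, mismatch=-2):
    """Assumed scores: match 2; mismatch or alignment with a space "_" is -2."""
    return match if a == b and a != "_" else mismatch

def global_alignment_score(s1, s2):
    """Fill V(i1, i2) bottom-up and return V(n1, n2)."""
    n1, n2 = len(s1), len(s2)
    V = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    # Base cases: a prefix aligned against the empty string is all spaces.
    for i in range(1, n1 + 1):
        V[i][0] = V[i - 1][0] + delta(s1[i - 1], "_")
    for j in range(1, n2 + 1):
        V[0][j] = V[0][j - 1] + delta("_", s2[j - 1])
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + delta(s1[i - 1], s2[j - 1]),  # (b1, b2) = (1, 1)
                V[i - 1][j] + delta(s1[i - 1], "_"),            # (b1, b2) = (1, 0)
                V[i][j - 1] + delta("_", s2[j - 1]),            # (b1, b2) = (0, 1)
            )
    return V[n1][n2]
```

The comments map each case to the $(b_1, b_2)$ vector of the rephrased form, which is the view that generalizes to k sequences.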

Dynamic Programming for aligning k sequences (I)  Let $V(i_1, i_2, \ldots, i_k)$ = the SP-score of the optimal alignment of $S_1[1..i_1], S_2[1..i_2], \ldots, S_k[1..i_k]$.  Observation: in the last column of the optimal alignment, row j is either $S_j[i_j]$ or '-'.  Hence, the score for the last column is SP-score($S_1[b_1 i_1], S_2[b_2 i_2], \ldots, S_k[b_k i_k]$) for some $(b_1, b_2, \ldots, b_k) \in \{0,1\}^k - \{(0, \ldots, 0)\}$.  (Assume that $S_j[0]$ = '-'.)

Dynamic programming for aligning k sequences (II)  Based on the observation, we have
$$V(i_1, \ldots, i_k) = \max_{(b_1, \ldots, b_k) \in \{0,1\}^k - \{(0, \ldots, 0)\}} \left\{ V(i_1 - b_1, \ldots, i_k - b_k) + \text{SP-score}(S_1[b_1 i_1], \ldots, S_k[b_k i_k]) \right\}$$
The SP-score of the optimal multiple alignment of $S = \{S_1, S_2, \ldots, S_k\}$ is $V(n_1, n_2, \ldots, n_k)$, where $n_i$ is the length of $S_i$.

Dynamic Programming for aligning k sequences (III)  By filling in the dynamic programming table, we compute $V(n_1, n_2, \ldots, n_k)$.  By back-tracing, we recover the multiple alignment.
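The k-dimensional recurrence can be sketched with memoized recursion over index tuples. This is a hedged illustration, not the slides' implementation: the function names are hypothetical, the scores (match 2, mismatch/indel -2, space against space 0) are assumed from the SP-score example, the all-zero vector is excluded from the moves, and only the score is computed (no back-tracing). It is practical only for a few short sequences.

```python
from functools import lru_cache
from itertools import combinations, product

def delta(a, b, match=2, mismatch=-2):
    """Assumed pair score; two spaces score 0."""
    if a == "-" and b == "-":
        return 0
    return match if a == b else mismatch

def sp_column(column):
    return sum(delta(a, b) for a, b in combinations(column, 2))

def optimal_sp_score(seqs):
    """Return V(n_1, ..., n_k), the best SP-score over all multiple alignments."""
    k = len(seqs)
    # All vectors (b_1, ..., b_k) in {0,1}^k except the all-zero vector.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    @lru_cache(maxsize=None)
    def V(idx):
        if not any(idx):          # V(0, ..., 0) = 0
            return 0
        best = None
        for b in moves:
            # b_j = 1 consumes S_j[i_j], so it needs i_j >= 1.
            if any(bj > ij for bj, ij in zip(b, idx)):
                continue
            prev = tuple(ij - bj for ij, bj in zip(idx, b))
            column = [s[ij - 1] if bj else "-" for s, ij, bj in zip(seqs, idx, b)]
            cand = V(prev) + sp_column(column)
            if best is None or cand > best:
                best = cand
        return best

    return V(tuple(len(s) for s in seqs))
```

Each table entry tries up to $2^k - 1$ vectors and scores one column per vector, which is where the exponential factor in the running time comes from.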

Complexity  Time:  The table V has $n_1 n_2 \cdots n_k$ entries.  Filling in one entry takes $O(2^k k^2)$ time ($2^k$ choices of $(b_1, \ldots, b_k)$, each needing an SP-score over $O(k^2)$ pairs).  Total running time is $O(2^k k^2 n_1 n_2 \cdots n_k)$.  Space:  $O(n_1 n_2 \cdots n_k)$ space to store the table V.  Dynamic programming is expensive in both time and space. It is rarely used for aligning more than 3 or 4 sequences.

Center star method  Computing an optimal multiple alignment takes exponential time.  Can we find a good approximation in polynomial time?  We introduce the center star method, which approximately minimizes the Sum-of-Pair distance.

Idea  Find a string $S_c$.  Align all other strings with respect to $S_c$.  Illustrated by an example:

Converting pair-wise alignment to multiple alignment

Detailed algorithm for center star method  Step 1  Step 2  Step 3  Step 4

Running time of center star method  Assume all k sequences are of length n.  Step 1 takes $O(k^2 n^2)$ time.  Step 2 takes $O(k^2)$ time to find the center string $S_c$.  Step 3 takes $O(kn^2)$ time to compute the alignment between $S_c$ and $S_i$ for all i.  Step 4 introduces spaces into the multiple alignment, which takes $O(k^2 n)$ time.  In total, the running time is $O(k^2 n^2)$.
Step 1: $O(k^2 n^2)$
Step 2: $O(k^2)$
Step 3: $O(kn^2)$
Step 4: $O(k^2 n)$
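The four steps can be sketched end to end. This is a minimal sketch under stated assumptions, not the slides' own pseudocode: it follows the usual formulation of center star (Step 1 all pairwise distances, Step 2 the string with the smallest distance sum, Step 3 pairwise alignments against the center, Step 4 merging by inserting spaces), uses unit-cost edit distance as the assumed $\delta$, and all function names are hypothetical.

```python
def align_pair(s, t, mismatch=1, gap=1):
    """Global alignment minimizing distance; returns (distance, s_aligned, t_aligned)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    sub = lambda i, j: 0 if s[i - 1] == t[j - 1] else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub(i, j), D[i - 1][j] + gap, D[i][j - 1] + gap)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # back-trace from D[n][m]
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def merge(msa, c_aln, t_aln):
    """Step 4: msa[0] is the gapped center; add t_aln, inserting spaces as needed."""
    out = [[] for _ in range(len(msa) + 1)]
    cur, i, j = msa[0], 0, 0
    while i < len(cur) or j < len(c_aln):
        if i < len(cur) and j < len(c_aln) and (cur[i] == "-") == (c_aln[j] == "-"):
            for r, row in enumerate(msa):      # same center position (or both spaces)
                out[r].append(row[i])
            out[-1].append(t_aln[j]); i += 1; j += 1
        elif i < len(cur) and cur[i] == "-":   # space only in the existing alignment
            for r, row in enumerate(msa):
                out[r].append(row[i])
            out[-1].append("-"); i += 1
        else:                                  # space only in the new pairwise alignment
            for r in range(len(msa)):
                out[r].append("-")
            out[-1].append(t_aln[j]); j += 1
    return ["".join(r) for r in out]

def center_star(seqs):
    k = len(seqs)
    dist = [[align_pair(s, t)[0] for t in seqs] for s in seqs]   # Step 1
    c = min(range(k), key=lambda i: sum(dist[i]))                # Step 2
    msa, order = [seqs[c]], [c]
    for i in range(k):
        if i != c:
            _, c_aln, t_aln = align_pair(seqs[c], seqs[i])       # Step 3
            msa = merge(msa, c_aln, t_aln)                       # Step 4
            order.append(i)
    result = [None] * k
    for row, idx in zip(msa, order):   # restore the input order
        result[idx] = row
    return result

msa = center_star(["ACGGAGA", "CGTTGACA", "ACTGAA", "CCGTTCAC"])
```

The merge step keeps every space already placed in the center row ("once a gap, always a gap"), so each pairwise alignment with $S_c$ is preserved inside the final multiple alignment.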

Why is the center star method good? (I)  Let M* be the optimal alignment.  The SP-dist of M*

Why is the center star method good? (II)  Let M be the alignment computed by the center star method.  The SP-dist of M is at most twice that of M* (the optimal alignment).
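A sketch of the standard argument behind this bound, stated under assumptions: $\delta$ satisfies the triangle inequality, $D(X, Y)$ denotes the optimal pairwise alignment distance, and $d_M(i, j)$ denotes the distance of the pairwise alignment of $S_i$ and $S_j$ induced by a multiple alignment M. This notation is introduced here for illustration and may differ from the original slides.

```latex
% Upper bound for the center star alignment M:
% M preserves the optimal alignment of S_c with every S_i, so d_M(i,c) = D(S_i, S_c).
\begin{align*}
\text{SP-dist}(M) &= \sum_{1 \le i < j \le k} d_M(i, j)
  \le \sum_{1 \le i < j \le k} \bigl( D(S_i, S_c) + D(S_c, S_j) \bigr)
  = (k - 1) \sum_{i=1}^{k} D(S_c, S_i). \\[4pt]
% Lower bound for the optimal alignment M*:
% S_c minimizes the row sum \sum_j D(S_i, S_j) over all i.
\text{SP-dist}(M^*) &= \sum_{1 \le i < j \le k} d_{M^*}(i, j)
  \ge \sum_{1 \le i < j \le k} D(S_i, S_j)
  = \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} D(S_i, S_j)
  \ge \frac{k}{2} \sum_{j=1}^{k} D(S_c, S_j). \\[4pt]
\frac{\text{SP-dist}(M)}{\text{SP-dist}(M^*)} &\le \frac{2(k - 1)}{k} < 2.
\end{align*}
```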

Progressive alignment  Progressive alignment was first proposed by Feng and Doolittle (1987).  It is a heuristic to get a good multiple alignment.  Basic idea:  Align the two closest sequences.  Progressively align the next most closely related sequences until all sequences are aligned.  Examples of progressive alignment methods include ClustalW, T-Coffee, and ProbCons.  ProbCons was the most accurate MSA algorithm when these slides were written.  ClustalW is the most popular software.

Basic algorithm
1. Compute pairwise distance scores for all pairs of sequences.
2. Generate the guide tree, which ensures similar sequences are nearer in the tree.
3. Align the sequences one by one according to the guide tree.

ClustalW  A popular progressive alignment method to globally align a set of sequences.  Input: a set of sequences  Output: the multiple alignment of these sequences

Step 1: pairwise distance scores  Example: For S1 and S2, the global alignment is
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
There are 9 non-gap positions and 8 match positions.  The distance is 1 - 8/9 = 0.111.
Sequences:
S1: PPGVKSDCAS
S2: PADGVKDCAS
S3: PPDGKSDS
S4: GADGKDCCS
S5: GADGKDCAS
Pairwise distances:
     S1     S2     S3     S4     S5
S1   0      0.111  0.25   0.555  0.444
S2          0      0.375  0.222  0.111
S3                 0      0.5    0.5
S4                        0      0.111
S5                               0
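The distance used in this step can be sketched as a small function over one pairwise alignment. This is an illustrative sketch with a hypothetical function name; it only reproduces the formula stated above (1 minus matches over non-gap positions) and takes the pairwise alignment as given.

```python
def alignment_distance(row1, row2):
    """Distance = 1 - (match positions) / (non-gap positions) of a pairwise alignment."""
    assert len(row1) == len(row2), "aligned rows must have equal length"
    # Keep only columns where neither row has a gap.
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in pairs)
    return 1 - matches / len(pairs)

d12 = alignment_distance("P-PGVKSDCAS", "PADGVK-DCAS")
d45 = alignment_distance("GADGKDCCS", "GADGKDCAS")
```

Both `d12` and `d45` evaluate to 1 - 8/9, which rounds to the 0.111 entries of the matrix.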

Step 2: generate guide tree  By neighbor-joining, generate the guide tree from the pairwise distance matrix of Step 1.  The guide tree has leaves S1, S2, S3, S4, S5.  As Step 3 shows, it joins S1 with S2 and S4 with S5, then adds S3 to (S1, S2), and finally joins the two groups.

Step 3: align the sequences according to the guide tree (I)
Aligning S1 and S2, we get
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
Aligning S4 and S5, we get
S4 = GADGKDCCS
S5 = GADGKDCAS

Step 3: align the sequences according to the guide tree (II)
Aligning (S1, S2) with S3, we get
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
Aligning (S1, S2, S3) with (S4, S5), we get
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
S4 = GADG-K-DCCS
S5 = GADG-K-DCAS

Summary
Input sequences:
S1: PPGVKSDCAS
S2: PADGVKDCAS
S3: PPDGKSDS
S4: GADGKDCCS
S5: GADGKDCAS
Pairwise distances:
     S1     S2     S3     S4     S5
S1   0      0.111  0.25   0.555  0.444
S2          0      0.375  0.222  0.111
S3                 0      0.5    0.5
S4                        0      0.111
S5                               0
Final multiple alignment:
S1: P-PGVKSDCAS
S2: PADGVK-DCAS
S3: PPDG-KSD--S
S4: GADG-K-DCCS
S5: GADG-K-DCAS

Details of Profile-Profile alignment (I)  Given two aligned sets of sequences $A_1$ and $A_2$.  Example:
$A_1$ is a length-11 alignment of S1, S2, S3
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
$A_2$ is a length-9 alignment of S4, S5
S4 = GADGKDCCS
S5 = GADGKDCAS
Similar to sequence alignment, the profile-profile alignment introduces gaps into $A_1$ and $A_2$ so that both of them have the same length.

Details of Profile-Profile Alignment (II)  To determine the alignment, we need a scoring function PSP($A_1[i]$, $A_2[j]$).  In ClustalW, the score is defined as follows.
$$\text{PSP}(A_1[i], A_2[j]) = \sum_{x, y} g_x^i \, g_y^j \, \delta(x, y)$$
where $g_x^i$ is the observed frequency (number of occurrences) of amino acid x in column i of $A_1$, and $g_y^j$ is the observed frequency of amino acid y in column j of $A_2$.  This is a natural scoring for maximizing the SP-score.  Our aim is to find an alignment between $A_1$ and $A_2$ that maximizes the PSP score.

Example
$A_1[1..11]$ is the alignment of S1, S2, S3
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
$A_2[1..9]$ is the alignment of S4, S5
S4 = GADGKDCCS
S5 = GADGKDCAS
PSP($A_1[3]$, $A_2[3]$) = 1×2×δ(P,D) + 2×2×δ(D,D)
PSP($A_1[9]$, $A_2[8]$) = 2δ(C,C) + 2δ(C,A) + δ(-,C) + δ(-,A)
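The two PSP values of this example can be evaluated with a short sketch. The function names are hypothetical, and the $\delta$ values are an assumption borrowed from the SP-score example (match 2, mismatch or space -2, space against space 0); ClustalW itself uses a substitution matrix and gap penalties that this sketch does not model.

```python
from collections import Counter

def delta(a, b, match=2, mismatch=-2):
    """Assumed pair score; two spaces score 0."""
    if a == "-" and b == "-":
        return 0
    return match if a == b else mismatch

def column_counts(alignment, i):
    """Observed frequency g_x of each character x in column i (1-based)."""
    return Counter(row[i - 1] for row in alignment)

def psp(counts1, counts2):
    """PSP = sum over x, y of g_x * g_y * delta(x, y)."""
    return sum(g1 * g2 * delta(x, y)
               for x, g1 in counts1.items()
               for y, g2 in counts2.items())

A1 = ["P-PGVKSDCAS", "PADGVK-DCAS", "PPDG-KSD--S"]
A2 = ["GADGKDCCS", "GADGKDCAS"]
psp_3_3 = psp(column_counts(A1, 3), column_counts(A2, 3))
psp_9_8 = psp(column_counts(A1, 9), column_counts(A2, 8))
```

With these assumed scores, `psp_3_3` is 1×2×(-2) + 2×2×2 = 4 and `psp_9_8` is 2×2 + 2×(-2) + (-2) + (-2) = -4.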

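Dynamic Programming  Let V(i,j) = the score of the best alignment between $A_1[1..i]$ and $A_2[1..j]$.  We have V(i,j) = maximum of
V(i-1,j-1) + PSP($A_1[i]$, $A_2[j]$)
V(i-1,j) + PSP($A_1[i]$, -)
V(i,j-1) + PSP(-, $A_2[j]$)
where "-" denotes a column consisting only of spaces.  By filling in the dynamic programming table, we can find the optimal alignment.  Time complexity: $O(k_1 n_1 + k_2 n_2 + n_1 n_2)$ time, where $k_1$ and $k_2$ are the numbers of sequences in $A_1$ and $A_2$ and $n_1$ and $n_2$ are their lengths ($O(k_1 n_1 + k_2 n_2)$ to compute the column frequencies and $O(n_1 n_2)$ to fill the table, for a constant-size alphabet).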
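The recurrence can be sketched as follows. This is a hedged illustration rather than ClustalW's actual procedure: the function names are hypothetical and $\delta$ is assumed to be match 2, mismatch or space -2, space against space 0, so the alignment it returns depends on those assumed scores.

```python
from collections import Counter

def delta(a, b, match=2, mismatch=-2):
    """Assumed pair score; two spaces score 0."""
    if a == "-" and b == "-":
        return 0
    return match if a == b else mismatch

def psp(c1, c2):
    return sum(g1 * g2 * delta(x, y) for x, g1 in c1.items() for y, g2 in c2.items())

def align_profiles(A1, A2):
    """Align two alignments column by column; returns (score, merged rows)."""
    P1 = [Counter(col) for col in zip(*A1)]   # O(k1 * n1)
    P2 = [Counter(col) for col in zip(*A2)]   # O(k2 * n2)
    gap1, gap2 = Counter({"-": len(A1)}), Counter({"-": len(A2)})  # all-space columns
    n1, n2 = len(P1), len(P2)
    V = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        V[i][0] = V[i - 1][0] + psp(P1[i - 1], gap2)
    for j in range(1, n2 + 1):
        V[0][j] = V[0][j - 1] + psp(gap1, P2[j - 1])
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            V[i][j] = max(V[i - 1][j - 1] + psp(P1[i - 1], P2[j - 1]),
                          V[i - 1][j] + psp(P1[i - 1], gap2),
                          V[i][j - 1] + psp(gap1, P2[j - 1]))
    # Back-trace: record which column of A1 / A2 (or None) fills each output column.
    cols, i, j = [], n1, n2
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + psp(P1[i - 1], P2[j - 1]):
            cols.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + psp(P1[i - 1], gap2):
            cols.append((i - 1, None)); i -= 1
        else:
            cols.append((None, j - 1)); j -= 1
    cols.reverse()
    rows1 = ["".join(r[a] if a is not None else "-" for a, _ in cols) for r in A1]
    rows2 = ["".join(r[b] if b is not None else "-" for _, b in cols) for r in A2]
    return V[n1][n2], rows1 + rows2

A1 = ["P-PGVKSDCAS", "PADGVK-DCAS", "PPDG-KSD--S"]
A2 = ["GADGKDCCS", "GADGKDCAS"]
score, merged = align_profiles(A1, A2)
```

Whole columns are the unit of alignment here, so the relative alignment inside $A_1$ and inside $A_2$ is never changed; only all-space columns are inserted.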

Example  By profile-profile alignment, we have
S1 = P-PGVKSDCAS
S2 = PADGVK-DCAS
S3 = PPDG-KSD--S
S4 = GADG-K-DCCS
S5 = GADG-K-DCAS
