CSCE 471/871 Lecture 6: Multiple Sequence Alignments
Stephen Scott
sscott@cse.unl.edu

Introduction

- Start with a set of sequences
- In each column, residues are homologous
  - Residues occupy similar positions in 3D structure
  - Residues diverge from a common ancestral residue
  - Figure 6.1
- Can be done manually, but requires expertise and is very tedious
- Often there is no single, unequivocally "correct" alignment
  - Problems from low sequence identity & structural evolution

Outline

- Scoring a multiple alignment
  - Minimum entropy scoring
  - Sum of pairs (SP) scoring
- Multidimensional dynamic programming
  - Standard MDP algorithm
  - MSA
- Progressive alignment methods
  - Feng-Doolittle
  - Profile alignment
  - CLUSTALW
  - Iterative refinement
- Multiple alignment via profile HMMs
  - Multiple alignment with known profile HMM
  - Profile HMM training from unaligned sequences
    - Initial model
    - Baum-Welch
    - Avoiding local maxima
    - Model surgery

Scoring a Multiple Alignment

- Ideally, scoring is based in evolution, as in e.g., PAM and BLOSUM matrices
- Contrasts with pairwise alignments:
  1. Position-specific scoring (some positions more conserved than others)
  2. Ideally, need to consider entire phylogenetic tree to explain evolution of entire family
- I.e., build complete probabilistic model of evolution
  - Not enough data to parameterize such a model ⇒ use approximations
- Assume columns statistically independent:
  \[ S(m) = G + \sum_i S(m_i) \]
  - $m_i$ is column $i$ of MA $m$, $G$ is (affine) score of gaps in $m$

Scoring a Multiple Alignment: Minimum Entropy Scoring

- $m_i^j$ = symbol in column $i$ in sequence $j$, $c_{ia}$ = observed count of residue $a$ in column $i$
- Assume sequences are statistically independent, i.e., residues independent within columns
- Then probability of column $m_i$ is
  \[ P(m_i) = \prod_a p_{ia}^{c_{ia}} , \]
  where $p_{ia}$ = probability of $a$ in column $i$

Scoring a Multiple Alignment: Minimum Entropy Scoring (2)

- Set score to be
  \[ S(m_i) = -\log P(m_i) = -\sum_a c_{ia} \log p_{ia} \]
  - Proportional to Shannon entropy
- Define optimal alignment as
  \[ m^* = \operatorname*{argmin}_m \left\{ \sum_{m_i \in m} S(m_i) \right\} \]
- Independence assumption valid only if all evolutionary subfamilies are represented equally; otherwise bias skews results
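The minimum entropy column score can be sketched in a few lines. This is only an illustration under stated assumptions, not the course's implementation: the function names `column_score` and `alignment_score` are hypothetical, $p_{ia}$ is estimated as the observed frequency of residue $a$ in the column (no pseudocounts), and gap characters are skipped on the assumption that gaps are scored separately through $G$.

```python
import math
from collections import Counter


def column_score(column):
    """Minimum entropy score S(m_i) = -sum_a c_ia * log(p_ia) for one column.

    Assumptions (not from the slides): p_ia is the observed frequency
    c_ia / (number of residues in the column), and '-' gap characters are
    ignored because gaps are taken to be scored separately.
    """
    residues = [r for r in column if r != "-"]
    counts = Counter(residues)
    total = len(residues)
    return -sum(c * math.log(c / total) for c in counts.values())


def alignment_score(rows):
    """Sum of column scores, assuming columns are statistically independent."""
    return sum(column_score(col) for col in zip(*rows))


if __name__ == "__main__":
    # Toy alignment: rows are sequences, columns are alignment positions.
    rows = ["LLA", "LLA", "LGC", "LLC"]
    for i, col in enumerate(zip(*rows)):
        print(i, "".join(col), round(column_score(col), 3))
    print("total", round(alignment_score(rows), 3))
```

A fully conserved column scores 0, and the score grows as the column becomes more mixed, which is why the optimal alignment is defined as the one that minimizes the total.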
Scoring a Multiple Alignment: Sum of Pairs (SP) Scores

- Treat multiple alignment as $\binom{N}{2}$ pairwise alignments
- If $s(a, b)$ = substitution score from e.g., PAM or BLOSUM:
  \[ S(m_i) = \sum_{k < \ell} s(m_i^k, m_i^\ell) \]
- Caveat: $s(a, b)$ was derived for pairwise comparisons, not $N$-way comparisons
  \[
  \overbrace{\log \frac{p_{abc}}{q_a q_b q_c}}^{\text{correct}}
  \quad \text{vs.} \quad
  \overbrace{\log \frac{p_{ab}}{q_a q_b} + \log \frac{p_{bc}}{q_b q_c} + \log \frac{p_{ac}}{q_a q_c}}^{\text{SP}}
  = \log \frac{p_{ab}\, p_{bc}\, p_{ac}}{q_a^2 q_b^2 q_c^2}
  \]

Scoring a Multiple Alignment: SP Problem

- Given an alignment with only "L" in column $i$, using BLOSUM50 yields an SP score of
  \[ S_1 = 5 \binom{N}{2} = 5N(N-1)/2 \]
- If one "L" is replaced with "G", then SP score is
  \[ S_2 = S_1 - 9(N-1) \]
- Problem:
  \[ \frac{S_2}{S_1} = 1 - \frac{9(N-1)}{5N(N-1)/2} = 1 - \frac{18}{5N} , \]
  i.e., as $N$ increases, $S_2 / S_1 \to 1$
- But large $N$ should give more support for "L" in $m_i$ relative to $S_2$, not less (i.e., should have $S_2 / S_1$ decreasing)

Multidimensional Dynamic Programming

- Generalization of DP for pairwise alignments
- Assume statistical independence of columns and linear gap penalty (can also handle affine gap penalties)
- $S(m) = \sum_i S(m_i)$, and $\alpha_{i_1, i_2, \ldots, i_N}$ = max score of alignment of subsequences $x^1_{1 \ldots i_1}, x^2_{1 \ldots i_2}, \ldots, x^N_{1 \ldots i_N}$
  \[
  \alpha_{i_1, i_2, \ldots, i_N} = \max
  \begin{cases}
  \alpha_{i_1 - 1,\, i_2 - 1,\, i_3 - 1,\, \ldots,\, i_N - 1} + S\left(x^1_{i_1}, x^2_{i_2}, x^3_{i_3}, \ldots, x^N_{i_N}\right), \\
  \alpha_{i_1,\, i_2 - 1,\, i_3 - 1,\, \ldots,\, i_N - 1} + S\left(-, x^2_{i_2}, x^3_{i_3}, \ldots, x^N_{i_N}\right), \\
  \alpha_{i_1 - 1,\, i_2,\, i_3 - 1,\, \ldots,\, i_N - 1} + S\left(x^1_{i_1}, -, x^3_{i_3}, \ldots, x^N_{i_N}\right), \\
  \quad \vdots \\
  \alpha_{i_1 - 1,\, i_2 - 1,\, i_3 - 1,\, \ldots,\, i_N} + S\left(x^1_{i_1}, x^2_{i_2}, x^3_{i_3}, \ldots, -\right), \\
  \alpha_{i_1,\, i_2,\, i_3 - 1,\, \ldots,\, i_N - 1} + S\left(-, -, x^3_{i_3}, \ldots, x^N_{i_N}\right), \\
  \quad \vdots
  \end{cases}
  \]
- In each column, take all gap-residue combinations except 100% gaps

Multidimensional Dynamic Programming (2)

- Assume all $N$ sequences are of length $L$
- Space complexity = Θ( )
- Time complexity = Θ( )
- Is it practical?

MSA [Carrillo & Lipman 88; Lipman et al. 89]

- Uses MDP, but eliminates many entries from consideration to save time
- Can optimally solve problems with $L = 300$ and $N = 7$ (old numbers), $L = 150$ and $N = 50$, $L = 500$ and $N = 25$, and $L = 1000$ and $N = 10$ (newer numbers)
- Uses SP scoring: $S(a) = \sum_{k < \ell} S(a^{k\ell})$, where $a$ is any MA and $a^{k\ell}$ is PA between $x^k$ and $x^\ell$ induced by $a$
- If $\hat{a}^{k\ell}$ is optimal PA between $x^k$ and $x^\ell$ (easily computed), then $S(a^{k\ell}) \le S(\hat{a}^{k\ell})$ for all $k$ and $\ell$

MSA (2)

- Assume we have lower bound $\sigma(a^*)$ on score of optimal alignment $a^*$:
  \[
  \sigma(a^*) \le S(a^*) = \sum_{k' < \ell'} S(a^{*\,k'\ell'})
  = S(a^{*\,k\ell}) + \sum_{\substack{k' < \ell' \\ (k', \ell') \ne (k, \ell)}} S(a^{*\,k'\ell'})
  \le S(a^{*\,k\ell}) + \sum_{\substack{k' < \ell' \\ (k', \ell') \ne (k, \ell)}} S(\hat{a}^{k'\ell'})
  \]
- Thus
  \[
  S(a^{*\,k\ell}) \ge \beta^{k\ell} = \sigma(a^*) - \sum_{\substack{k' < \ell' \\ (k', \ell') \ne (k, \ell)}} S(\hat{a}^{k'\ell'})
  \]
- When filling in matrix, only need to consider PAs that score at least $\beta^{k\ell}$ (Figure 6.3)
- Can get $\sigma(a^*)$ from other (heuristic) alignment methods
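The per-pair thresholds used by MSA follow directly from the inequality above: take the SP score of any heuristic multiple alignment as the lower bound, then subtract the optimal pairwise scores of all other pairs. The sketch below is a toy illustration, not the MSA program itself. The scoring constants (`MATCH`, `MISMATCH`, `GAP`), the linear gap penalty, the zero score for a gap aligned to a gap, and all function names are assumptions made for the example.

```python
from itertools import combinations

# Toy scoring scheme (assumption; the slides use PAM/BLOSUM-style scores).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score one aligned pair of symbols under the toy scheme."""
    if a == "-" and b == "-":
        return 0  # the column vanishes in the induced pairwise alignment
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def optimal_pairwise_score(x, y):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * GAP for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * GAP]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + pair_score(x[i - 1], y[j - 1]),
                           prev[j] + GAP,
                           cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


def sp_score(alignment):
    """SP score of a multiple alignment given as equal-length gapped rows."""
    return sum(pair_score(a, b)
               for row_k, row_l in combinations(alignment, 2)
               for a, b in zip(row_k, row_l))


def pair_thresholds(seqs, heuristic_alignment):
    """Return the lower bound and the per-pair thresholds.

    The lower bound is the SP score of the heuristic alignment; each
    threshold is that bound minus the optimal pairwise scores of all
    other pairs.
    """
    sigma = sp_score(heuristic_alignment)
    best = {(k, l): optimal_pairwise_score(seqs[k], seqs[l])
            for k, l in combinations(range(len(seqs)), 2)}
    total = sum(best.values())
    return sigma, {pair: sigma - (total - s) for pair, s in best.items()}


if __name__ == "__main__":
    seqs = ["ACGT", "ACT", "AGT"]
    heuristic = ["ACGT", "AC-T", "A-GT"]  # any valid alignment of seqs
    sigma, beta = pair_thresholds(seqs, heuristic)
    print("lower bound:", sigma)
    print("thresholds:", beta)
```

A tighter heuristic alignment raises the lower bound and therefore every threshold, which prunes more cells from the multidimensional matrix.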