0. Multiple Sequence Alignment using Profile HMM based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana R˘ at ¸oi, Diana Popovici
1. PLAN 1. From profiles to Profile HMMs 2. Setting the parameters of a profile HMM; the optimal (MAP) model construction 3. Basic algorithms for profile HMMs 4. Profile HMM training from unaligned sequences: Getting the model and the multiple alignment simultane- ously 5. Profile HMM variants for non-global alignments 6. Weighting the training sequences
2. 1 From profiles to Profile HMMs Problem Given a multiple alignment (obtained either manually or using one of the methods presented in Ch. 6 and Ch. 7), and the profile associated to a set of marked ( X = match) columns, design a HMM that would perform sequence alignments to that profile. Example X X . . . X 1 2 . . . 3 bat A G - - - C A 4/5 0 0 rat A - A G - C C 0 0 4/5 cat A G - A A - G 0 3/5 0 gnat - - A A A C T 0 0 0 goat A G - - - C - 1/5 2/5 1/5
3. Building up a solution At first sight, not taking into account gaps: M Begin End j What about insert residues? I j M Begin End j
4. What about gaps? M Begin End j A better treatment of gaps: D j M Begin End j
5. Could we put it all together? D j I j Begin M End j Transition structure of a profile HMM
6. X X . . . X bat A G - - - C rat A - A G - C Does it work? cat A G - A A - gnat - - A A A C goat A G - - - C 1 2 . . . 3 D D D I I I I M M M Begin End 0 1 2 3 4
7. Any resemblance to pair HMMs? δ X ε 1−2δ−τ δ q x i τ 1−ε−τ τ Begin p M End x y 1 i j 1−2δ−τ 1−ε−τ τ δ Y ε q y δ j τ Doesn’t seem so...
8. However, remember... An example of the state assignments for global alignment using the affine gap model: V L S P A D − K H L − − A E S K I I I x I I I I I x x x x x x x M M M M M M M M I y I y I y I y I y I y I I y y When making the extension to multiple sequence alignment, think of generating only one string (instead of a pair of strings); use I x for inserting residues, and I y to produce a gap; use one triplet of states ( M, I x , I y ) for each column in the alignment; finally define (appropriate) edges in the resulting FSA.
9. Consequence It shouldn’t be difficult to re-write the basic HMM algorithms for profile HMMs! one moment, though...
10. 2 Setting the parameters of a Profile HMM 2.1 Using Maximum Likelihood Estimates (MLE) for transition and emission probabilities For instance — assuming a given multiple alignment with match states marked ( X ) —, the emission probabilities are computed as c ja e M j ( a ) = � a ′ c ja ′ where c ja is the observed frequency of residue a in the column j of the multiple alignment.
11. Counts for our example 0 1 2 3 _ A 4 0 0 match X X . . . X _ C 0 0 4 emissions bat A G - - - C _ G 0 3 0 rat A - A G - C _ T 0 0 0 cat A G - A A - gnat - - A A A C A 0 0 6 0 goat A G - - - C insert C 0 0 0 0 1 2 . . . 3 emissions G 0 0 1 0 T 0 0 0 0 M−M 4 3 2 4 D D D M−D 1 1 0 0 M−I 0 0 1 0 I−M 0 0 2 0 I I I state I transitions I−D 0 0 1 0 I−I 0 0 4 0 _ D−M 0 0 1 Begin M M M End _ D−D 1 0 0 0 1 2 3 4 _ D−I 0 2 0
12. What about zero counts, i.e. unseen emissions/transitions? One solution: use pseudocounts (generalising Laplace’s law...): c ja + Aq a e M j ( a ) = � a ′ c ja ′ + A where A is a weight put on pseudocounts as compared to real counts ( c ja ), and q a is the frequency with which a appears in a random model. A = 20 works well for protein alignments. Note: At the intuitive level, using pseudocount makes a lot of sense: • e M j ( a ) is approximately equal to q a if very little data is available, i.e. all real counts are very small compared to A ; • when a large amount of data is available, e M j ( a ) is essentially equal to the maximum likelihood solution. For other solutions (e.g. Dirichlet mixtures, substitution matrix mix- tures, estimation based on an ancestor), you may see Durbin et al., 1998, Section 6.5.
13. 2.2 Setting the L parameter • The process of model construction represents a way to decide which columns of the alignment should be assigned to match states, and which to insert states. • There are 2 L combinations of marking for alignment of L columns, and hence 2 L different profile HMMs to choose from. In a marked column, symbols are assigned to match states and gaps are assigned to delete states In an unmarked column, symbols are assigned to insert states and gaps are ignored. • There are at least tree ways to determine the marking: ◦ manual construction: the user marks alignment columns by hand; ◦ heuristic construction: e.g. a column might be marked when the proportion of gap symbols in it is below a certain threshold; ◦ Maximum A Posteriori (MAP) model construction: next slides.
14. The MAP (maximum a posteriori) model construction • Objective: we search for the model µ that maximises the likelihood of the given data, namely: argmax P ( C | µ ) µ where C is a set of aligned sequences. Note: The sequences in C are assumed to be statistically independent. • Idea: The MAP model construction algorithm recursively calculates S j , the log probability of the optimal model for the alignment up to and including column j , assuming that the column j is marked. • More specifically: S j is calculated from smaller subalignments ending at a marked column i , by incrementing S i with the summed log prob- ability of the transitions and emissions for the columns between i and j .
15. MAP model construction algorithm: Notations • c xy — the observed state transition counts • a xy — the transition probabilities, estimated from the c xy in the usual fashion (MLE) c xy a xy = � y c xy • S j — the log probability of the optimal model for the alignment up to and including column j , assuming that column j is marked • T ij — the summed log probability of all the state transitions between marked columns i and j � T ij = c xy log a xy x,y ∈ M,D,I • M j — the log probability contribution for match state symbol emis- sions in the column j • L i,j — the log probability contribution for the insert state symbol emissions for the columns i + 1 , . . ., j − 1 (for j − i > 1 ).
16. The MAP model construction algorithm Initialization: S 0 = 0 , M L +1 = 0 Recurrence: for j = 1 , . . . , L + 1 S j = max 0 ≤ i<j S i + T ij + M j + L i +1 ,j − 1 + λ σ j = arg max 0 ≤ i<j S i + T ij + M j + L i +1 ,j − 1 + λ Traceback: from j = σ L +1 , while j > 0 : mark column j as a match column; j = σ j Complexity: O ( L ) in memory and O ( L 2 ) in time for an alignment of L columns... with some care in implementation! Note: λ is a penalty used to favour models with fewer match states. In Bayesian terms, λ is the log of the prior probability of marking each col- umn. It implies a simple but adequate exponentially decreasing prior dis- tribution over model lengths.
17. 3 Basic algorithms for Profile HMMs Notations • v M j ( i ) — the probability of the best path matching the subsequence x 1 ...i to the (profile) submodel up to the column j , ending with x i being emitted by the state M j ; v I j ( i ) — the probability of the best path ending in x i being emitted by I j ; v D j ( i ) — the probability of the best path ending in D j ( x i being the last character emitted before D j ). • V M j ( i ) , V I j ( i ) , V D j ( i ) — the log-odds scores corresponding respectively to v M j ( i ) , v I j ( i ) , v D j ( i ) . • f M j ( i ) — the combined probability of all alignments up to x i that end in state M j , and similarly f I j ( i ) , f D j ( i ) . • b M j ( i ) , b I j ( i ) , b D j ( i ) — the corresponding backward probabilities.
18. The Viterbi algorithm for profile HMM Initialization: rename the Begin state as M 0 , and set v M 0 (0) = 1 ; rename the End state as M L +1 Recursion: v M j − 1 ( i − 1) a M j − 1 M j v I j − 1 ( i − 1) a I j − 1 M j v M j ( i ) = e M j ( x i ) max v D j − 1 ( i − 1) a D j − 1 M j v M j ( i − 1) a M j I j , v I j ( i ) = e I j ( x i ) max v I j ( i − 1) a I j I j v D j ( i − 1) a D j I j v M j − 1 ( i ) a M j − 1 D j v I j − 1 ( i ) a I j − 1 D j v D j ( i ) = max v D j − 1 ( i ) a D j − 1 D j Termination: the final score is v M L +1 ( n ) , calculated using the top recursion relation.
19. The Viterbi algorithm for profile HMMs: log-odds version Initialization: V M 0 (0) = 0 ; (the Begin state is M 0 , and the End state is M L +1 ) Recursion: V M j − 1 ( i − 1) + log a M j − 1 M j e Mj ( x i ) V M j ( i ) = log + max V I j − 1 ( i − 1) + log a I j − 1 M j q xi V D j − 1 ( i − 1) + log a D j − 1 M j V M j ( i − 1) + log a M j I j , e Ij ( x i ) V I j ( i − 1) + log a I j I j V I j ( i ) = log + max q xi V D j ( i − 1) + log a D j I j V M j − 1 ( i ) + log a M j − 1 D j V D j ( i ) = max V I j − 1 ( i ) + log a I j − 1 D j V D j − 1 ( i ) + log a D j − 1 D j Termination: the final score is V M L +1 ( n ) , calculated using the top recursion relation.
Recommend
More recommend