CSCE 970 Lecture 3: HMM Application: Biological Sequence Analysis
Stephen D. Scott
Introduction
• Idea: Given a collection S of related biological sequences, build a (profile) hidden Markov model M to generate the sequences
• Then test new sequence X against M using (which algorithm?) to predict X's membership in S
• Can also align X against M using (which algorithm?) to see how X matches up position by position against sequences in S
Introduction (cont'd)
• Will build M based on a multiple alignment of sequences in S:

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...

• In alignments, will differentiate matches, insertions, and deletions
Outline
• Ungapped regions
• Insert and delete states
• Deriving profile HMMs from multiple alignments
• Searching with profile HMMs
• Variants for non-global alignments
• Estimating probabilities
Organization of a Profile HMM
• Start with a trivial HMM M (not really hidden at this point):

    [Figure: chain B → M_1 → M_2 → ... → M_L → E, each transition with probability 1]

• Each match state has its own set of emission probs, so we can compute prob of a new sequence x being part of this family:

    P(x | M) = ∏_{i=1}^{L} e_i(x_i)
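The product above is straightforward to compute directly. A minimal sketch (names and toy emission values are mine, not from the lecture):

```python
# Probability of x under an ungapped profile model with one match
# state per column; e[i] is a hypothetical dict of emission
# probabilities for match state M_{i+1}.

def ungapped_prob(x, e):
    """P(x | M) = product over i of e_i(x_i) for an ungapped profile."""
    p = 1.0
    for i, residue in enumerate(x):
        p *= e[i][residue]
    return p

e = [{"V": 0.7, "F": 0.2, "I": 0.1},   # toy emissions for M_1
     {"G": 0.5, "E": 0.3, "A": 0.2}]   # toy emissions for M_2
print(ungapped_prob("VG", e))  # 0.7 * 0.5
```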
Organization of a Profile HMM (cont'd)
• But this assumes ungapped alignments!
• To handle gaps, consider insertions and deletions
  – Insertion: part of x that doesn't match anything in multiple alignment (use insert states)

    [Figure: self-looping insert state I_i attached to the match chain at M_i]
Organization of a Profile HMM (cont'd)
  – Deletion: parts of multiple alignment not matched by any residue (symbol) in x (use silent delete states)

    [Figure: silent delete state D_i bypassing match state M_i]
General Profile HMM Structure

    [Figure: full profile HMM between B and E, with match states M_j, self-looping insert states I_j, and silent delete states D_j]
Handling non-Global Alignments
• Original profile HMMs model entire sequence
• Add flanking model states (or free insertion modules) to generate non-local residues

    [Figure: profile HMM with self-looping flanking states between B, the core model, and E]
Building a Model
• Given a multiple alignment, how to build an HMM?
  – General structure defined, but how many match states?

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...
Building a Model (cont'd)
• Given a multiple alignment, how to build an HMM?
  – General structure defined, but how many match states?
  – Heuristic: if more than half of characters in a column are non-gaps, include a match state for that column

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...
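The majority-non-gap heuristic is easy to apply programmatically. A sketch using the example alignment (function name is mine):

```python
# A column of the multiple alignment becomes a match state when more
# than half of its characters are non-gaps.

def match_columns(alignment):
    """Return indices of columns with a strict majority of non-gap characters."""
    ncols = len(alignment[0])
    nrows = len(alignment)
    cols = []
    for j in range(ncols):
        non_gaps = sum(1 for row in alignment if row[j] != "-")
        if 2 * non_gaps > nrows:  # strictly more than half
            cols.append(j)
    return cols

alignment = ["VGA--HAGEY",
             "V----NVDEV",
             "VEA--DVAGH",
             "VKG------D",
             "VYS--TYETS",
             "FNA--NIPKH",
             "IAGADNGAGV"]
print(match_columns(alignment))  # columns 3 and 4 become insert columns
```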
Building a Model (cont'd)
• Now, find parameters
• Multiple alignment + HMM structure → state sequence:
  – Non-gap in match column → match state
  – Gap in match column → delete state
  – Non-gap in insert column → insert state
  – Gap in insert column → ignore

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...

(Durbin Fig 5.4, p. 109)
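The four rules above can be walked row by row to recover each sequence's state path. A sketch under my own naming (state labels M/I/D follow the slides):

```python
# Map one aligned row to a profile-HMM state path, given which
# columns are match columns (per the Fig 5.4 rules).

def state_path(row, match_cols):
    """Return the list of state names visited by one alignment row."""
    path = []
    j = 0  # index of the current match state (1-based, as on the slides)
    for col, ch in enumerate(row):
        if col in match_cols:
            j += 1
            path.append(f"M{j}" if ch != "-" else f"D{j}")
        elif ch != "-":
            path.append(f"I{j}")  # insert between match states j and j+1
        # gap in an insert column: ignore
    return path

match_cols = {0, 1, 2, 5, 6, 7, 8, 9}   # from the example alignment
print(state_path("VGA--HAGEY", match_cols))
print(state_path("VKG------D", match_cols))  # deletions in the middle
```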
Building a Model (cont'd)
• Count number of transitions and emissions and compute:

    a_{kl} = A_{kl} / Σ_{l'} A_{kl'}

    e_k(b) = E_k(b) / Σ_{b'} E_k(b')

• Still need to beware of some counts = 0
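Both estimates are the same normalization applied to a count table. A minimal sketch with made-up counts (the zero entry shows why the next slide's pseudocounts are needed):

```python
# Maximum-likelihood estimate: divide each count by the row total.
# Works identically for transition counts A_kl and emission counts E_k(b).

def normalize(counts):
    """a_kl = A_kl / sum over l' of A_kl' (same form for emissions)."""
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

A_M1 = {"M2": 6, "I1": 1, "D2": 0}  # hypothetical transition counts out of M1
print(normalize(A_M1))  # D2 gets probability 0 -- the zero-count problem
```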
Weighted Pseudocounts
• Let c_{ja} = observed count of residue a in position j of multiple alignment:

    e_{M_j}(a) = (c_{ja} + A q_a) / (Σ_{a'} c_{ja'} + A)

• q_a = background probability of a, A = weight placed on pseudocounts (sometimes use A ≈ 20)
• Also called a prior distribution
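A sketch of this estimator over a toy four-letter alphabet (the background q and counts below are illustrative, not from the lecture):

```python
# Weighted-pseudocount emission estimate: observed counts c_ja are
# smoothed toward the background distribution q with total weight A.

def emission_with_pseudocounts(c_j, q, A=20.0):
    """e_Mj(a) = (c_ja + A*q_a) / (sum over a' of c_ja' + A)."""
    total = sum(c_j.get(a, 0) for a in q) + A
    return {a: (c_j.get(a, 0) + A * q[a]) / total for a in q}

q = {"V": 0.25, "F": 0.25, "I": 0.25, "L": 0.25}  # toy background
c_1 = {"V": 5, "F": 1, "I": 1}                    # toy counts for column 1
e = emission_with_pseudocounts(c_1, q)
print(e["V"], e["L"])  # unseen L still gets nonzero probability
```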
Dirichlet Mixtures
• Can be thought of as a mixture of pseudocounts
• The mixture has different components, each representing a different context of a protein sequence
  – E.g. in parts of a sequence folded near protein's surface, more weight (higher q_a) can be given to hydrophilic residues (ones that readily bind with water)
• Will find a different mixture for each position of the alignment based on the distribution of residues in that column
Dirichlet Mixtures (cont’d) α k (so α k • Each component k consists of a vector of pseudocounts � a corresponds to Aq a ) and a mixture coefficient ( m k , for now) that is the probability that component k is selected • Pseudocount model k is the “correct” one with probability m k • We’ll set the mixture coefficients for each column based on which vec- tors best fit the residues in that column – E.g. first column of our example alignment is dominated by V, so α k that favors V will get a higher m k any vector � 16
Dirichlet Mixtures (cont'd)
• Let c⃗_j be vector of counts in column j:

    e_{M_j}(a) = Σ_k P(k | c⃗_j) · (c_{ja} + α^k_a) / Σ_{a'} (c_{ja'} + α^k_{a'})

• P(k | c⃗_j) are the posterior mixture coefficients, which are easily computed [Sjölander et al. 1996], yielding:

    e_{M_j}(a) = X_a / Σ_{a'} X_{a'},  where

    X_a = Σ_k m_k^0 · (c_{ja} + α^k_a) / Σ_{a'} (c_{ja'} + α^k_{a'}) · exp( ln B(α⃗^k + c⃗_j) − ln B(α⃗^k) )

    ln B(x⃗) = Σ_i ln Γ(x_i) − ln Γ( Σ_i x_i )
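A sketch of this computation over a toy three-letter alphabet; `math.lgamma` plays the role of C's lgamma, and the two-component mixture below is invented for illustration:

```python
# Dirichlet-mixture emission estimate using the ln B identity above.
from math import exp, lgamma

def ln_B(x):
    """ln B(x) = sum_i ln Gamma(x_i) - ln Gamma(sum_i x_i)."""
    return sum(lgamma(xi) for xi in x) - lgamma(sum(x))

def dirichlet_mixture_emission(c, alphas, m0):
    """e_Mj(a) = X_a / sum_a' X_a', with X_a as on the slide."""
    X = [0.0] * len(c)
    for alpha, mk in zip(alphas, m0):
        # unnormalized posterior weight of component k given counts c
        w = mk * exp(ln_B([a + ci for a, ci in zip(alpha, c)]) - ln_B(alpha))
        denom = sum(c) + sum(alpha)
        for a in range(len(c)):
            X[a] += w * (c[a] + alpha[a]) / denom
    s = sum(X)
    return [x / s for x in X]

c = [5.0, 1.0, 1.0]                            # toy counts for one column
alphas = [[10.0, 1.0, 1.0], [1.0, 1.0, 1.0]]   # two made-up pseudocount vectors
print(dirichlet_mixture_emission(c, alphas, m0=[0.5, 0.5]))
```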
Dirichlet Mixtures (cont'd)
• Γ is the gamma function, and ln Γ is computed via lgamma and related functions in C
• m_k^0 is the prior probability of component k (= q in Sjölander Table 1)
Searching for Homologues
• Score a candidate match x by using log-odds:
  – P(x, π* | M) is probability that x came from model M via most likely path π* ⇒ find using Viterbi
  – P(x | M) is probability that x came from model M summed over all possible paths ⇒ find using forward algorithm
  – score(x) = log( P(x | M) / P(x | φ) )
    ∗ φ is a "null model", which is often the distribution of amino acids in the training set or the AA distribution over each individual column
    ∗ If x matches M much better than φ, then score is large and positive
Viterbi Equations
• V^M_j(i) = log-odds score of best path matching x_{1..i} to the model, where x_i is emitted by state M_j (similarly define V^I_j(i) and V^D_j(i))
• Rename B as M_0 with V^M_0(0) = 0, and rename E as M_{L+1} (V^M_{L+1} = final)

    V^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max{ V^M_{j-1}(i-1) + log a_{M_{j-1} M_j},
                                                    V^I_{j-1}(i-1) + log a_{I_{j-1} M_j},
                                                    V^D_{j-1}(i-1) + log a_{D_{j-1} M_j} }

    V^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max{ V^M_j(i-1) + log a_{M_j I_j},
                                                    V^I_j(i-1) + log a_{I_j I_j},
                                                    V^D_j(i-1) + log a_{D_j I_j} }

    V^D_j(i) = max{ V^M_{j-1}(i) + log a_{M_{j-1} D_j},
                    V^I_{j-1}(i) + log a_{I_{j-1} D_j},
                    V^D_{j-1}(i) + log a_{D_{j-1} D_j} }
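A compact sketch of these recurrences in log space. The data layout and toy model are my own; insert-state emissions are assumed equal to the background, so their log-odds emission term vanishes:

```python
from math import log, inf

def viterbi_score(x, L, e_M, q, a):
    """Best log-odds path score of x through a length-L profile HMM.

    e_M[j][c]: emission prob of residue c at match state j (1..L)
    q[c]:      background probability of residue c
    a:         dict of transition probs, keyed like (("M",0),("M",1))
    """
    n, NEG = len(x), -inf
    V = {("M", 0, 0): 0.0}  # B renamed M_0

    def get(s, j, i):
        return V.get((s, j, i), NEG)

    def tr(s, js, t, jt):
        p = a.get(((s, js), (t, jt)), 0.0)
        return log(p) if p > 0 else NEG

    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:
                em = log(e_M[j][x[i - 1]] / q[x[i - 1]])
                V[("M", j, i)] = em + max(
                    get("M", j - 1, i - 1) + tr("M", j - 1, "M", j),
                    get("I", j - 1, i - 1) + tr("I", j - 1, "M", j),
                    get("D", j - 1, i - 1) + tr("D", j - 1, "M", j))
            if i >= 1:  # insert emission term is 0 under the assumption above
                V[("I", j, i)] = max(
                    get("M", j, i - 1) + tr("M", j, "I", j),
                    get("I", j, i - 1) + tr("I", j, "I", j),
                    get("D", j, i - 1) + tr("D", j, "I", j))
            if j >= 1:  # delete states are silent: i does not advance
                V[("D", j, i)] = max(
                    get("M", j - 1, i) + tr("M", j - 1, "D", j),
                    get("I", j - 1, i) + tr("I", j - 1, "D", j),
                    get("D", j - 1, i) + tr("D", j - 1, "D", j))
    # final transition into E (renamed M_{L+1})
    return get("M", L, n) + tr("M", L, "M", L + 1)

# Toy two-state model over {V, G}; all mass on the match-match path.
e_M = {1: {"V": 0.8, "G": 0.2}, 2: {"V": 0.2, "G": 0.8}}
q = {"V": 0.5, "G": 0.5}
a = {(("M", 0), ("M", 1)): 1.0,
     (("M", 1), ("M", 2)): 1.0,
     (("M", 2), ("M", 3)): 1.0}
print(viterbi_score("VG", 2, e_M, q, a))  # 2 * log(0.8 / 0.5)
```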
Forward Equations

    F^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + log[ a_{M_{j-1} M_j} exp( F^M_{j-1}(i-1) )
                                                  + a_{I_{j-1} M_j} exp( F^I_{j-1}(i-1) )
                                                  + a_{D_{j-1} M_j} exp( F^D_{j-1}(i-1) ) ]

    F^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + log[ a_{M_j I_j} exp( F^M_j(i-1) )
                                                  + a_{I_j I_j} exp( F^I_j(i-1) )
                                                  + a_{D_j I_j} exp( F^D_j(i-1) ) ]

    F^D_j(i) = log[ a_{M_{j-1} D_j} exp( F^M_{j-1}(i) )
                  + a_{I_{j-1} D_j} exp( F^I_{j-1}(i) )
                  + a_{D_{j-1} D_j} exp( F^D_{j-1}(i) ) ]

• exp(·) needed to use sums and logs
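In practice the exp-then-log pattern above is computed with the standard log-sum-exp trick (not spelled out on the slide) so the exponentials don't underflow. A minimal sketch:

```python
# Stable evaluation of log(sum_i exp(v_i)): factor out the maximum
# so every exponent is <= 0 and cannot overflow or fully underflow.
from math import exp, inf, log

def log_sum_exp(values):
    """log(sum_i exp(v_i)), computed stably."""
    m = max(values)
    if m == -inf:
        return -inf
    return m + log(sum(exp(v - m) for v in values))

# One forward term: log(a1*exp(F1) + a2*exp(F2) + a3*exp(F3))
F = [-10.0, -12.0, -11.0]   # toy log-space forward values
a = [0.8, 0.1, 0.1]         # toy transition probabilities
print(log_sum_exp([log(ai) + Fi for ai, Fi in zip(a, F)]))
```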
Aligning a Sequence with a Model (Multiple Alignment)
• Given a string x, use Viterbi to find most likely path π* and use the state sequence as the alignment
• More detail in Durbin, Section 6.5
  – Also discusses building an initial multiple alignment and HMM simultaneously via Baum-Welch