

CSCE 471/871 Lecture 4: Profile Hidden Markov Models
Stephen D. Scott

Introduction
• Designed to model (profile) a multiple alignment of a protein family (e.g. p. 102)
• Gives a probabilistic model of the proteins in the family
• Useful for searching databases for more homologues and for aligning strings to the family

Outline
• Organization of a profile HMM
  – Ungapped regions
  – Insert and delete states
• Building a model
  – Structure
  – Estimating probabilities
• Searching with HMMs

Organization of a Profile HMM
• Start with a trivial HMM M (not really hidden at this point): a begin state B, a chain of match states M_1, ..., M_L joined by probability-1 transitions, and an end state E
• Each match state has its own set of emission probabilities, so we can compute the probability of a new sequence x being part of this family:
  $$P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$
• Can, as usual, convert probabilities to a log-odds score (a code sketch follows this section)

Organization of a Profile HMM (cont'd)
• But this assumes ungapped alignments!
• To handle gaps, consider insertions and deletions:
  – Deletion: parts of the multiple alignment not matched by any residue in x (use silent delete states D_j)
  – Insertion: parts of x that don't match anything in the multiple alignment (use insert states I_j)
[Figure: profile HMM with match states M_j in a chain, delete states D_j above them, and self-looping insert states I_j between them]
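To make the ungapped scoring concrete, here is a minimal sketch (ours, not from the slides) of the log-odds computation against a background model; the two-column emission table and the background frequencies are hypothetical stand-ins.

```python
import math

def ungapped_log_odds(x, match_emissions, background):
    """Log-odds score of x against an ungapped profile:
    sum_i log( e_i(x_i) / q_{x_i} ).
    Assumes len(x) equals the number of match states L."""
    assert len(x) == len(match_emissions)
    return sum(math.log(match_emissions[i][a] / background[a])
               for i, a in enumerate(x))

# Hypothetical two-column toy profile over a six-letter alphabet:
e = [{'V': 0.8, 'I': 0.1, 'F': 0.1},   # e_1: a column dominated by V
     {'G': 0.6, 'A': 0.3, 'N': 0.1}]   # e_2
q = {a: 1 / 6 for a in "VIFGAN"}       # uniform background, for illustration
print(ungapped_log_odds("VG", e, q))   # ~2.85: positive, so "VG" fits the profile
```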

Handling non-Global Alignments
• Original profile HMMs model the entire sequence
• Add flanking model states (or free insertion modules) to generate non-local residues
[Figure: general profile HMM structure, with flanking insertion states between B and the model and between the model and E]

Building a Model
• Given a multiple alignment, how do we build an HMM?
  – The general structure is defined, but how many match states?

  ... V G A - - H A G E Y ...
  ... V - - - - N V D E V ...
  ... V E A - - D V A G H ...
  ... V K G - - - - - - D ...
  ... V Y S - - T Y E T S ...
  ... F N A - - N I P K H ...
  ... I A G A D N G A G V ...

Building a Model (cont'd)
• Heuristic: if more than half of the characters in a column are non-gaps, include a match state for that column
• In the alignment above, every column except the two gap-dominated ones in the middle becomes a match column

Building a Model (cont'd)
• Now, find parameters
• Multiple alignment + HMM structure → state sequence (see the sketch after this list):
  – Non-gap in a match column → match state (e.g. M_1)
  – Gap in a match column → delete state (e.g. D_3)
  – Non-gap in an insert column → insert state (e.g. I_3)
  – Gap in an insert column → ignore
  (Durbin Fig. 5.4, p. 109)
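A small sketch of the match-column heuristic and the alignment-to-state-sequence mapping, assuming '-' is the gap character (the function names are ours, not from the slides):

```python
def match_columns(alignment):
    """Heuristic from the slides: a column becomes a match column when
    more than half of its characters are non-gaps."""
    nrows = len(alignment)
    return [sum(row[c] != '-' for row in alignment) > nrows / 2
            for c in range(len(alignment[0]))]

def state_path(row, is_match):
    """Map one aligned row to profile-HMM states: non-gap in a match
    column -> M_j, gap in a match column -> D_j, non-gap in an insert
    column -> I_j, gap in an insert column -> ignored."""
    states, j = [], 0
    for c, ch in enumerate(row):
        if is_match[c]:
            j += 1
            states.append(f"{'M' if ch != '-' else 'D'}{j}")
        elif ch != '-':
            states.append(f"I{j}")
    return states

# The alignment from the slides:
aln = ["VGA--HAGEY", "V----NVDEV", "VEA--DVAGH", "VKG------D",
       "VYS--TYETS", "FNA--NIPKH", "IAGADNGAGV"]
cols = match_columns(aln)        # only the two gappy middle columns are inserts
print(state_path(aln[1], cols))  # ['M1', 'D2', 'D3', 'M4', 'M5', 'M6', 'M7', 'M8']
print(state_path(aln[6], cols))  # last row's middle residues land in I3
```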

Building a Model (cont'd)
• Count the number of transitions and emissions in the state sequences and compute:
  $$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
  (A_{kl} = count of transitions from state k to state l, E_k(b) = count of emissions of b from state k)
• Still need to beware of counts that are 0

Weighted Pseudocounts
• Let c_{ja} = observed count of residue a in position j of the multiple alignment:
  $$e_{M_j}(a) = \frac{c_{ja} + A q_a}{\sum_{a'} c_{ja'} + A}$$
• q_a = background probability of a; A = weight placed on the pseudocounts (sometimes use A ≈ 20)
• Background probabilities are also called a prior distribution
• (A code sketch of this estimator follows this section)

Dirichlet Mixtures
• Can be thought of as a mixture of pseudocounts
• The mixture has different components, each representing a different context of a protein sequence
  – E.g. in parts of a sequence folded near the protein's surface, more weight (higher q_a) can be given to hydrophilic residues
  – But in other regions, we may want to give more weight to hydrophobic residues

Dirichlet Mixtures (cont'd)
• Each component k consists of a vector of pseudocounts \vec{\alpha}^k (so \alpha^k_a corresponds to A q_a) and a mixture coefficient (m_k, for now) that is the probability that component k is selected
• Pseudocount model k is the "correct" one with probability m_k
• We'll set the mixture coefficients for each column based on which vectors best fit the residues in that column
  – E.g. the first column of the alignment shown earlier is dominated by V, so any vector \vec{\alpha}^k that favors V will get a higher m_k
• We will find a different mixture for each position of the alignment, based on the distribution of residues in that column

Dirichlet Mixtures (cont'd)
• Let \vec{c}_j be the vector of counts in column j; then
  $$e_{M_j}(a) = \sum_k P(k \mid \vec{c}_j)\, \frac{c_{ja} + \alpha^k_a}{\sum_{a'} \left( c_{ja'} + \alpha^k_{a'} \right)}$$
• P(k | \vec{c}_j) are the posterior mixture coefficients, which are easily computed [Sjölander et al. 1996], yielding:
  $$P(k \mid \vec{c}_j) = \frac{m_k \exp\left( \ln B(\vec{\alpha}^k + \vec{c}_j) - \ln B(\vec{\alpha}^k) \right)}{\sum_{k'} m_{k'} \exp\left( \ln B(\vec{\alpha}^{k'} + \vec{c}_j) - \ln B(\vec{\alpha}^{k'}) \right)},$$
  where
  $$\ln B(\vec{x}) = \sum_i \ln \Gamma(x_i) - \ln \Gamma\!\left( \sum_i x_i \right)$$
• Γ is the gamma function, and ln Γ is computed via lgamma and related functions in C
• m_k is the prior probability of component k (= q in Sjölander Table 1)
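A minimal sketch of the weighted-pseudocount estimator above; the column counts and uniform background in the demo are hypothetical, and A = 20 follows the slides' suggestion.

```python
def emissions_with_pseudocounts(counts, background, A=20.0):
    """e_{M_j}(a) = (c_{ja} + A * q_a) / (sum_{a'} c_{ja'} + A)."""
    total = sum(counts.values()) + A
    return {a: (counts.get(a, 0.0) + A * q) / total
            for a, q in background.items()}

# Column 1 of the earlier alignment: five V's, one F, one I.
q = {a: 1 / 20 for a in "ACDEFGHIKLMNPQRSTVWY"}   # uniform background, for illustration
e1 = emissions_with_pseudocounts({'V': 5, 'F': 1, 'I': 1}, q)
print(round(e1['V'], 3))  # 0.222: boosted well above the 0.05 background
print(round(e1['W'], 3))  # 0.037: an unseen residue still gets nonzero probability
```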

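A companion sketch for the Dirichlet-mixture estimator, using Python's math.lgamma for ln Γ; the two-component mixture in the demo is made up for illustration, not taken from Sjölander's tables.

```python
import math
from math import lgamma

def log_beta(xs):
    """ln B(x) = sum_i ln Gamma(x_i) - ln Gamma(sum_i x_i)."""
    return sum(lgamma(x) for x in xs) - lgamma(sum(xs))

def dirichlet_mixture_emissions(c, alphas, m):
    """e_{M_j}(a) = sum_k P(k|c_j) (c_a + alpha^k_a) / sum_a' (c_a' + alpha^k_a').
    c: residue counts in column j; alphas: pseudocount vectors alpha^k;
    m: prior mixture coefficients m_k."""
    # Posterior P(k | c_j) is proportional to
    # m_k * exp( ln B(alpha^k + c) - ln B(alpha^k) ), computed in log space.
    logw = [math.log(mk)
            + log_beta([a + ci for a, ci in zip(alpha, c)])
            - log_beta(alpha)
            for alpha, mk in zip(alphas, m)]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]   # subtract the max for stability
    total = sum(w)
    post = [wi / total for wi in w]
    # Mix the per-component pseudocount estimates with the posterior weights.
    e = [0.0] * len(c)
    for pk, alpha in zip(post, alphas):
        denom = sum(ci + ai for ci, ai in zip(c, alpha))
        for a in range(len(c)):
            e[a] += pk * (c[a] + alpha[a]) / denom
    return e

# Toy 3-letter alphabet, two made-up components: one favoring letter 0, one flat.
e = dirichlet_mixture_emissions(c=[5, 1, 0],
                                alphas=[[8.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
                                m=[0.5, 0.5])
print([round(p, 3) for p in e])  # heavy on letter 0, small but nonzero elsewhere
```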
Searching for Homologues
• Score a candidate match x by using log-odds:
  – P(x, π* | M) is the probability that x came from model M via the most likely path π*
    ⇒ find using Viterbi
  – P(x | M) is the probability that x came from model M, summed over all possible paths
    ⇒ find using the forward algorithm
  – score(x) = log( P(x | M) / P(x | φ) )
    * φ is a "null model", often the distribution of amino acids in the training set, or the amino-acid distribution over each individual column
    * If x matches M much better than φ, then the score is large and positive

Viterbi Equations
• V^M_j(i) = log-odds score of the best path matching x_{1..i} to the model, where x_i is emitted by state M_j (similarly define V^I_j(i) and V^D_j(i))
• Rename B as M_0 (with V^M_0(0) = 0), and rename E as M_{L+1} (V^M_{L+1} is the final score)

  $$V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max\begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1} M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1} M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1} M_j} \end{cases}$$

  $$V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max\begin{cases} V^M_j(i-1) + \log a_{M_j I_j} \\ V^I_j(i-1) + \log a_{I_j I_j} \\ V^D_j(i-1) + \log a_{D_j I_j} \end{cases}$$

  $$V^D_j(i) = \max\begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1} D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1} D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1} D_j} \end{cases}$$

• Similar to Chapter 2's gapped alignment, but with a position-specific scoring scheme

Forward Equations

  $$F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\left[ a_{M_{j-1} M_j} \exp\left(F^M_{j-1}(i-1)\right) + a_{I_{j-1} M_j} \exp\left(F^I_{j-1}(i-1)\right) + a_{D_{j-1} M_j} \exp\left(F^D_{j-1}(i-1)\right) \right]$$

  $$F^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\left[ a_{M_j I_j} \exp\left(F^M_j(i-1)\right) + a_{I_j I_j} \exp\left(F^I_j(i-1)\right) + a_{D_j I_j} \exp\left(F^D_j(i-1)\right) \right]$$

  $$F^D_j(i) = \log\left[ a_{M_{j-1} D_j} \exp\left(F^M_{j-1}(i)\right) + a_{I_{j-1} D_j} \exp\left(F^I_{j-1}(i)\right) + a_{D_{j-1} D_j} \exp\left(F^D_{j-1}(i)\right) \right]$$

• exp(·) is needed to use sums together with logs (can still be fast; see p. 78)
• (A code sketch of these recurrences appears at the end of this section)

Aligning a Sequence with a Model (Multiple Alignment)
• Given a string x, use Viterbi to find the most likely path π* and use the state sequence as the alignment
• More detail in Durbin, Section 6.5
  – Also discusses building an initial multiple alignment and HMM simultaneously via Baum-Welch

Topic summary due in 1 week!
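To tie the recurrences together, here is a minimal log-space sketch (our own, not the slides' code). It assumes insert states emit from the background, so their log-odds emission term is zero, and it omits the I_0 state for brevity; the slides keep the general e_{I_j} term.

```python
import math

NEG_INF = float("-inf")

def viterbi_profile(x, e_match, q, log_a, L):
    """Log-odds Viterbi over a profile HMM with L match states.
    x         : query sequence
    e_match[j]: dict of emission probs at match state M_j (j = 1..L; index 0 unused)
    q         : background probabilities q_a
    log_a[j]  : dict of log transition probs entering column j, with keys
                'MM','IM','DM' (from column j-1 into M_j),
                'MI','II','DI' (from column j into I_j),
                'MD','ID','DD' (from column j-1 into D_j);
                log_a[L+1] holds the transitions into the end state E."""
    n = len(x)
    # V*[j][i] = best log-odds score ending in that state after consuming
    # x_1..x_i.  B is renamed M_0, so VM[0][0] = 0.
    VM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0
    for j in range(1, L + 1):
        for i in range(n + 1):
            # Delete states are silent: same i, previous column.
            VD[j][i] = max(VM[j-1][i] + log_a[j]['MD'],
                           VI[j-1][i] + log_a[j]['ID'],
                           VD[j-1][i] + log_a[j]['DD'])
        for i in range(1, n + 1):
            em = math.log(e_match[j][x[i-1]] / q[x[i-1]])
            VM[j][i] = em + max(VM[j-1][i-1] + log_a[j]['MM'],
                                VI[j-1][i-1] + log_a[j]['IM'],
                                VD[j-1][i-1] + log_a[j]['DM'])
            # e_{I_j} = q assumed, so the insert emission term vanishes.
            VI[j][i] = max(VM[j][i-1] + log_a[j]['MI'],
                           VI[j][i-1] + log_a[j]['II'],
                           VD[j][i-1] + log_a[j]['DI'])
    # Transition into E (renamed M_{L+1}) gives the final score.
    return max(VM[L][n] + log_a[L+1]['MM'],
               VI[L][n] + log_a[L+1]['IM'],
               VD[L][n] + log_a[L+1]['DM'])

def logsumexp(vals):
    """Replacing each max(a, b, c) above with logsumexp([a, b, c]) turns
    Viterbi into the forward recurrences F (sums of probabilities in log space)."""
    m = max(vals)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in vals))
```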
