CSCE 970 Lecture 3: HMM Application: Biological Sequence Analysis
Stephen D. Scott
Introduction
• Idea: Given a collection S of related biological sequences, build a (profile) hidden Markov model M to generate the sequences
• Then test new sequence X against M using (which algorithm?) to predict X's membership in S
• Can also align X against M using (which algorithm?) to see how X matches up position by position against sequences in S
Introduction (cont'd)
• Will build M based on a multiple alignment of sequences in S:

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...

• In alignments, will differentiate matches, insertions, and deletions
Outline
• Ungapped regions
• Insert and delete states
• Deriving profile HMMs from multiple alignments
• Searching with profile HMMs
• Variants for non-global alignments
• Estimating probabilities
Organization of a Profile HMM
• Start with a trivial HMM M (not really hidden at this point):

    [Figure: chain B → M_1 → M_2 → ... → M_L → E, each transition with probability 1]

• Each match state has its own set of emission probs, so we can compute prob of a new sequence x being part of this family:

    P(x | M) = ∏_{i=1}^{L} e_i(x_i)
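The product above is straightforward to compute directly. A minimal sketch (names and toy emission values are mine, not from the lecture):

```python
# Probability of x under an ungapped profile model with one match
# state per column; e[i] is a hypothetical dict of emission
# probabilities for match state M_{i+1}.

def ungapped_prob(x, e):
    """P(x | M) = product over i of e_i(x_i) for an ungapped profile."""
    p = 1.0
    for i, residue in enumerate(x):
        p *= e[i][residue]
    return p

e = [{"V": 0.7, "F": 0.2, "I": 0.1},   # toy emissions for M_1
     {"G": 0.5, "E": 0.3, "A": 0.2}]   # toy emissions for M_2
print(ungapped_prob("VG", e))  # 0.7 * 0.5
```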
Organization of a Profile HMM (cont'd)
• But this assumes ungapped alignments!
• To handle gaps, consider insertions and deletions
  – Insertion: part of x that doesn't match anything in multiple alignment (use insert states)

    [Figure: self-looping insert state I_i attached to the match chain at M_i]
Organization of a Profile HMM (cont'd)
  – Deletion: parts of multiple alignment not matched by any residue (symbol) in x (use silent delete states)

    [Figure: silent delete state D_i bypassing match state M_i]
General Profile HMM Structure

    [Figure: full profile HMM between B and E, with match states M_j, self-looping insert states I_j, and silent delete states D_j]
Handling non-Global Alignments
• Original profile HMMs model entire sequence
• Add flanking model states (or free insertion modules) to generate non-local residues

    [Figure: profile HMM with self-looping flanking states between B, the core model, and E]
Building a Model
• Given a multiple alignment, how to build an HMM?
  – General structure defined, but how many match states?

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...
Building a Model (cont'd)
• Given a multiple alignment, how to build an HMM?
  – General structure defined, but how many match states?
  – Heuristic: if more than half of characters in a column are non-gaps, include a match state for that column

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...
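The majority-non-gap heuristic is easy to apply programmatically. A sketch using the example alignment (function name is mine):

```python
# A column of the multiple alignment becomes a match state when more
# than half of its characters are non-gaps.

def match_columns(alignment):
    """Return indices of columns with a strict majority of non-gap characters."""
    ncols = len(alignment[0])
    nrows = len(alignment)
    cols = []
    for j in range(ncols):
        non_gaps = sum(1 for row in alignment if row[j] != "-")
        if 2 * non_gaps > nrows:  # strictly more than half
            cols.append(j)
    return cols

alignment = ["VGA--HAGEY",
             "V----NVDEV",
             "VEA--DVAGH",
             "VKG------D",
             "VYS--TYETS",
             "FNA--NIPKH",
             "IAGADNGAGV"]
print(match_columns(alignment))  # columns 3 and 4 become insert columns
```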
Building a Model (cont'd)
• Now, find parameters
• Multiple alignment + HMM structure → state sequence:
  – Non-gap in match column → match state
  – Gap in match column → delete state
  – Non-gap in insert column → insert state
  – Gap in insert column → ignore

    ... V G A - - H A G E Y ...
    ... V - - - - N V D E V ...
    ... V E A - - D V A G H ...
    ... V K G - - - - - - D ...
    ... V Y S - - T Y E T S ...
    ... F N A - - N I P K H ...
    ... I A G A D N G A G V ...

(Durbin Fig 5.4, p. 109)
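The four rules above can be walked row by row to recover each sequence's state path. A sketch under my own naming (state labels M/I/D follow the slides):

```python
# Map one aligned row to a profile-HMM state path, given which
# columns are match columns (per the Fig 5.4 rules).

def state_path(row, match_cols):
    """Return the list of state names visited by one alignment row."""
    path = []
    j = 0  # index of the current match state (1-based, as on the slides)
    for col, ch in enumerate(row):
        if col in match_cols:
            j += 1
            path.append(f"M{j}" if ch != "-" else f"D{j}")
        elif ch != "-":
            path.append(f"I{j}")  # insert between match states j and j+1
        # gap in an insert column: ignore
    return path

match_cols = {0, 1, 2, 5, 6, 7, 8, 9}   # from the example alignment
print(state_path("VGA--HAGEY", match_cols))
print(state_path("VKG------D", match_cols))  # deletions in the middle
```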
Building a Model (cont'd)
• Count number of transitions and emissions and compute:

    a_{kl} = A_{kl} / Σ_{l'} A_{kl'}

    e_k(b) = E_k(b) / Σ_{b'} E_k(b')

• Still need to beware of some counts = 0
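Both estimates are the same normalization applied to a count table. A minimal sketch with made-up counts (the zero entry shows why the next slide's pseudocounts are needed):

```python
# Maximum-likelihood estimate: divide each count by the row total.
# Works identically for transition counts A_kl and emission counts E_k(b).

def normalize(counts):
    """a_kl = A_kl / sum over l' of A_kl' (same form for emissions)."""
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

A_M1 = {"M2": 6, "I1": 1, "D2": 0}  # hypothetical transition counts out of M1
print(normalize(A_M1))  # D2 gets probability 0 -- the zero-count problem
```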
Weighted Pseudocounts
• Let c_{ja} = observed count of residue a in position j of multiple alignment:

    e_{M_j}(a) = (c_{ja} + A q_a) / (Σ_{a'} c_{ja'} + A)

• q_a = background probability of a, A = weight placed on pseudocounts (sometimes use A ≈ 20)
• Also called a prior distribution
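A sketch of this estimator over a toy four-letter alphabet (the background q and counts below are illustrative, not from the lecture):

```python
# Weighted-pseudocount emission estimate: observed counts c_ja are
# smoothed toward the background distribution q with total weight A.

def emission_with_pseudocounts(c_j, q, A=20.0):
    """e_Mj(a) = (c_ja + A*q_a) / (sum over a' of c_ja' + A)."""
    total = sum(c_j.get(a, 0) for a in q) + A
    return {a: (c_j.get(a, 0) + A * q[a]) / total for a in q}

q = {"V": 0.25, "F": 0.25, "I": 0.25, "L": 0.25}  # toy background
c_1 = {"V": 5, "F": 1, "I": 1}                    # toy counts for column 1
e = emission_with_pseudocounts(c_1, q)
print(e["V"], e["L"])  # unseen L still gets nonzero probability
```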
Dirichlet Mixtures
• Can be thought of as a mixture of pseudocounts
• The mixture has different components, each representing a different context of a protein sequence
  – E.g. in parts of a sequence folded near protein's surface, more weight (higher q_a) can be given to hydrophilic residues (ones that readily bind with water)
• Will find a different mixture for each position of the alignment based on the distribution of residues in that column
Dirichlet Mixtures (cont’d) α k (so α k • Each component k consists of a vector of pseudocounts � a corresponds to Aq a ) and a mixture coefficient ( m k , for now) that is the probability that component k is selected • Pseudocount model k is the “correct” one with probability m k • We’ll set the mixture coefficients for each column based on which vec- tors best fit the residues in that column – E.g. first column of our example alignment is dominated by V, so α k that favors V will get a higher m k any vector � 16
Dirichlet Mixtures (cont'd)
• Let c⃗_j be vector of counts in column j:

    e_{M_j}(a) = Σ_k P(k | c⃗_j) · (c_{ja} + α^k_a) / Σ_{a'} (c_{ja'} + α^k_{a'})

• P(k | c⃗_j) are the posterior mixture coefficients, which are easily computed [Sjölander et al. 1996], yielding:

    e_{M_j}(a) = X_a / Σ_{a'} X_{a'},  where

    X_a = Σ_k m_k^0 · (c_{ja} + α^k_a) / Σ_{a'} (c_{ja'} + α^k_{a'}) · exp( ln B(α⃗^k + c⃗_j) − ln B(α⃗^k) )

    ln B(x⃗) = Σ_i ln Γ(x_i) − ln Γ( Σ_i x_i )
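A sketch of this computation over a toy three-letter alphabet; `math.lgamma` plays the role of C's lgamma, and the two-component mixture below is invented for illustration:

```python
# Dirichlet-mixture emission estimate using the ln B identity above.
from math import exp, lgamma

def ln_B(x):
    """ln B(x) = sum_i ln Gamma(x_i) - ln Gamma(sum_i x_i)."""
    return sum(lgamma(xi) for xi in x) - lgamma(sum(x))

def dirichlet_mixture_emission(c, alphas, m0):
    """e_Mj(a) = X_a / sum_a' X_a', with X_a as on the slide."""
    X = [0.0] * len(c)
    for alpha, mk in zip(alphas, m0):
        # unnormalized posterior weight of component k given counts c
        w = mk * exp(ln_B([a + ci for a, ci in zip(alpha, c)]) - ln_B(alpha))
        denom = sum(c) + sum(alpha)
        for a in range(len(c)):
            X[a] += w * (c[a] + alpha[a]) / denom
    s = sum(X)
    return [x / s for x in X]

c = [5.0, 1.0, 1.0]                            # toy counts for one column
alphas = [[10.0, 1.0, 1.0], [1.0, 1.0, 1.0]]   # two made-up pseudocount vectors
print(dirichlet_mixture_emission(c, alphas, m0=[0.5, 0.5]))
```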
Dirichlet Mixtures (cont'd)
• Γ is the gamma function, and ln Γ is computed via lgamma and related functions in C
• m_k^0 is the prior probability of component k (= q in Sjölander Table 1)
Searching for Homologues
• Score a candidate match x by using log-odds:
  – P(x, π* | M) is probability that x came from model M via most likely path π* ⇒ find using Viterbi
  – P(x | M) is probability that x came from model M summed over all possible paths ⇒ find using forward algorithm
  – score(x) = log( P(x | M) / P(x | φ) )
    ∗ φ is a "null model", which is often the distribution of amino acids in the training set or the AA distribution over each individual column
    ∗ If x matches M much better than φ, then score is large and positive
Viterbi Equations
• V^M_j(i) = log-odds score of best path matching x_{1..i} to the model, where x_i is emitted by state M_j (similarly define V^I_j(i) and V^D_j(i))
• Rename B as M_0 with V^M_0(0) = 0, and rename E as M_{L+1} (V^M_{L+1} = final)

    V^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max{ V^M_{j-1}(i-1) + log a_{M_{j-1} M_j},
                                                    V^I_{j-1}(i-1) + log a_{I_{j-1} M_j},
                                                    V^D_{j-1}(i-1) + log a_{D_{j-1} M_j} }

    V^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max{ V^M_j(i-1) + log a_{M_j I_j},
                                                    V^I_j(i-1) + log a_{I_j I_j},
                                                    V^D_j(i-1) + log a_{D_j I_j} }

    V^D_j(i) = max{ V^M_{j-1}(i) + log a_{M_{j-1} D_j},
                    V^I_{j-1}(i) + log a_{I_{j-1} D_j},
                    V^D_{j-1}(i) + log a_{D_{j-1} D_j} }
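A compact sketch of these recurrences in log space. The data layout and toy model are my own; insert-state emissions are assumed equal to the background, so their log-odds emission term vanishes:

```python
from math import log, inf

def viterbi_score(x, L, e_M, q, a):
    """Best log-odds path score of x through a length-L profile HMM.

    e_M[j][c]: emission prob of residue c at match state j (1..L)
    q[c]:      background probability of residue c
    a:         dict of transition probs, keyed like (("M",0),("M",1))
    """
    n, NEG = len(x), -inf
    V = {("M", 0, 0): 0.0}  # B renamed M_0

    def get(s, j, i):
        return V.get((s, j, i), NEG)

    def tr(s, js, t, jt):
        p = a.get(((s, js), (t, jt)), 0.0)
        return log(p) if p > 0 else NEG

    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:
                em = log(e_M[j][x[i - 1]] / q[x[i - 1]])
                V[("M", j, i)] = em + max(
                    get("M", j - 1, i - 1) + tr("M", j - 1, "M", j),
                    get("I", j - 1, i - 1) + tr("I", j - 1, "M", j),
                    get("D", j - 1, i - 1) + tr("D", j - 1, "M", j))
            if i >= 1:  # insert emission term is 0 under the assumption above
                V[("I", j, i)] = max(
                    get("M", j, i - 1) + tr("M", j, "I", j),
                    get("I", j, i - 1) + tr("I", j, "I", j),
                    get("D", j, i - 1) + tr("D", j, "I", j))
            if j >= 1:  # delete states are silent: i does not advance
                V[("D", j, i)] = max(
                    get("M", j - 1, i) + tr("M", j - 1, "D", j),
                    get("I", j - 1, i) + tr("I", j - 1, "D", j),
                    get("D", j - 1, i) + tr("D", j - 1, "D", j))
    # final transition into E (renamed M_{L+1})
    return get("M", L, n) + tr("M", L, "M", L + 1)

# Toy two-state model over {V, G}; all mass on the match-match path.
e_M = {1: {"V": 0.8, "G": 0.2}, 2: {"V": 0.2, "G": 0.8}}
q = {"V": 0.5, "G": 0.5}
a = {(("M", 0), ("M", 1)): 1.0,
     (("M", 1), ("M", 2)): 1.0,
     (("M", 2), ("M", 3)): 1.0}
print(viterbi_score("VG", 2, e_M, q, a))  # 2 * log(0.8 / 0.5)
```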
Forward Equations

    F^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + log[ a_{M_{j-1} M_j} exp( F^M_{j-1}(i-1) )
                                                  + a_{I_{j-1} M_j} exp( F^I_{j-1}(i-1) )
                                                  + a_{D_{j-1} M_j} exp( F^D_{j-1}(i-1) ) ]

    F^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + log[ a_{M_j I_j} exp( F^M_j(i-1) )
                                                  + a_{I_j I_j} exp( F^I_j(i-1) )
                                                  + a_{D_j I_j} exp( F^D_j(i-1) ) ]

    F^D_j(i) = log[ a_{M_{j-1} D_j} exp( F^M_{j-1}(i) )
                  + a_{I_{j-1} D_j} exp( F^I_{j-1}(i) )
                  + a_{D_{j-1} D_j} exp( F^D_{j-1}(i) ) ]

• exp(·) needed to use sums and logs
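In practice the exp-then-log pattern above is computed with the standard log-sum-exp trick (not spelled out on the slide) so the exponentials don't underflow. A minimal sketch:

```python
# Stable evaluation of log(sum_i exp(v_i)): factor out the maximum
# so every exponent is <= 0 and cannot overflow or fully underflow.
from math import exp, inf, log

def log_sum_exp(values):
    """log(sum_i exp(v_i)), computed stably."""
    m = max(values)
    if m == -inf:
        return -inf
    return m + log(sum(exp(v - m) for v in values))

# One forward term: log(a1*exp(F1) + a2*exp(F2) + a3*exp(F3))
F = [-10.0, -12.0, -11.0]   # toy log-space forward values
a = [0.8, 0.1, 0.1]         # toy transition probabilities
print(log_sum_exp([log(ai) + Fi for ai, Fi in zip(a, F)]))
```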
Aligning a Sequence with a Model (Multiple Alignment)
• Given a string x, use Viterbi to find most likely path π* and use the state sequence as the alignment
• More detail in Durbin, Section 6.5
  – Also discusses building an initial multiple alignment and HMM simultaneously via Baum-Welch