
CSI5126. Algorithms in bioinformatics: Probabilistic Sequence Motifs

Outline: Preamble, Motivation, Profile HMM, Definitions, Example, Decoding, Likelihood, Model Specification, Estimation


  1. Motivational example (contd). Marcel Turcotte, CSI5126 Algorithms in bioinformatics.
     Let's create a new node (state) to model [ACGT]*. Sequences 1 and 4 do not need to visit that state. Sequences 3 and 5 visit this state once, whilst the second sequence visits the state three times. Make sure to understand how the (emission) probabilities for that state are computed: the five inserted symbols are A, C, T (sequence 2), C (sequence 3) and G (sequence 5), hence A .2, C .4, G .2, T .2.

     Alignment:
       1  A C A - - - A T G
       2  T C A A C T A T C
       3  A C A C - - A G C
       4  A G A - - - A T C
       5  A C C G - - A T C

     Emission probabilities:
       State 1: A .8, T .2
       State 2: C .8, G .2
       State 3: A .8, C .2
       State 4 (insertions, [ACGT]*): A .2, C .4, G .2, T .2
       State 5: A 1
       State 6: T .8, G .2
       State 7: C .8, G .2

  2. Motivational example (contd). Transitions 1 → 2 and 2 → 3 occur with probability 1.0. Two out of five sequences do not visit state 4 and go directly from state 3 to state 5. Let's assign a (transition) probability 2/5 to that edge, and 3/5 to the other outgoing edge, so that the sum of all the probabilities is 1.
     [Figure: the alignment and the model, annotated with the transition probabilities 1 → 2 = 1.0, 2 → 3 = 1.0, 3 → 4 = 0.6 and 3 → 5 = 0.4.]

  3. Motivational example (contd). Once in state 4, five events occur before state 5 is reached. Sequences 3 and 5 make a transition to state 5 immediately after their C and G have been emitted/matched. In the case of sequence 2, after the first A has been matched, two transitions to state 4 are made, matching C and T, before the transition to state 5 is made. Therefore, 2 out of 5 transitions leaving state 4 are made to state 4 and 3 to state 5.
     [Figure: the alignment and the model, annotated with the transition probabilities 1 → 2 = 1.0, 2 → 3 = 1.0, 3 → 4 = 0.6 and 3 → 5 = 0.4.]
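
The counting argument of the first three slides can be sketched in a few lines of Python. This is only an illustration: the function name `insert_state_statistics` and the choice of columns 3 to 5 (0-based) as the region modeled by state 4 are assumptions made here, not part of the slides.

```python
from collections import Counter

# Alignment from the motivational example; '-' marks a gap.
ALIGNMENT = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
INSERT_COLUMNS = range(3, 6)  # assumed: 0-based columns modeled by state 4


def insert_state_statistics(alignment, insert_columns):
    """Estimate, by counting, the parameters attached to the insert state.

    Returns (emission probabilities, P(3 -> 4), P(4 -> 4)).
    """
    # Symbols that each sequence emits while in the insert state.
    inserted = ["".join(row[c] for c in insert_columns).replace("-", "")
                for row in alignment]
    counts = Counter("".join(inserted))
    total = sum(counts.values())
    emissions = {b: counts[b] / total for b in "ACGT"}
    # A sequence goes from state 3 to state 4 if it inserts at least one symbol.
    enter = sum(1 for s in inserted if s) / len(alignment)
    # Every inserted symbol is followed by another insertion or by state 5.
    stay = sum(len(s) - 1 for s in inserted if s) / total
    return emissions, enter, stay


emissions, enter, stay = insert_state_statistics(ALIGNMENT, INSERT_COLUMNS)
print(emissions, enter, stay)
```

With the alignment above, the counts give the emission distribution A .2, C .4, G .2, T .2, a probability of 3/5 for entering state 4 from state 3, and a probability of 2/5 for staying in state 4.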

  4. Motivational example (contd). Finally, transitions from states 5 to 6, and 6 to 7, occur with probability 1. This is the basic idea behind Hidden Markov Models (HMMs), as applied to model sequence motifs.
     [Figure: the alignment and the complete model, with transition probabilities 1 → 2 = 1.0, 2 → 3 = 1.0, 3 → 4 = 0.6, 3 → 5 = 0.4, 4 → 4 = 0.4, 4 → 5 = 0.6, 5 → 6 = 1.0 and 6 → 7 = 1.0.]

  5. Motivational example (contd). It's now easy to score the probability of any sequence. In particular, for the consensus sequence,
     $$P(\mathrm{ACACATC}) = 0.8 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.6 \times 0.4 \times 0.6 \times 1 \times 1 \times 0.8 \times 1 \times 0.8 \times 1 = 0.0472$$

  6. Motivational example (contd). And, for the exceptional one,
     $$P(\mathrm{TGCTAGG}) = 0.2 \times 1 \times 0.2 \times 1 \times 0.2 \times 0.6 \times 0.2 \times 0.6 \times 1 \times 1 \times 0.2 \times 1 \times 0.2 = 0.000023$$
     The consensus sequence is about 2,050 times more likely than the exceptional one (2,052 using the rounded values above, 2,048 using the unrounded products).
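
The two products above can be checked with a short Python sketch. The dictionaries below transcribe the emission and transition probabilities of the motivational example; the function name `path_probability` and the dictionary layout are illustrative choices, and the state path is supplied by hand (for a sequence of length 7 the only possible path visits the insert state exactly once).

```python
# Emission probabilities of the seven states (state 4 is the insert state).
EMISSIONS = {
    1: {"A": 0.8, "T": 0.2},
    2: {"C": 0.8, "G": 0.2},
    3: {"A": 0.8, "C": 0.2},
    4: {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    5: {"A": 1.0},
    6: {"T": 0.8, "G": 0.2},
    7: {"C": 0.8, "G": 0.2},
}
# Transition probabilities between states.
TRANSITIONS = {
    (1, 2): 1.0, (2, 3): 1.0,
    (3, 4): 0.6, (3, 5): 0.4,
    (4, 4): 0.4, (4, 5): 0.6,
    (5, 6): 1.0, (6, 7): 1.0,
}


def path_probability(sequence, path):
    """Probability of emitting `sequence` along the given state `path`."""
    p = 1.0
    for i, (symbol, state) in enumerate(zip(sequence, path)):
        p *= EMISSIONS[state].get(symbol, 0.0)
        if i + 1 < len(path):
            p *= TRANSITIONS.get((state, path[i + 1]), 0.0)
    return p


consensus = path_probability("ACACATC", [1, 2, 3, 4, 5, 6, 7])
exception = path_probability("TGCTAGG", [1, 2, 3, 4, 5, 6, 7])
print(f"{consensus:.4f} {exception:.6f} {consensus / exception:.0f}")
```

Symbols or transitions that are absent from the dictionaries are given probability 0, which matches the model as drawn but leaves no room for unseen events; that choice is an assumption of this sketch.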

  7. Remarks. We used to assume that all the positions are identically distributed; not anymore! We used to model the length of the gaps only; now the distribution of the symbols is also modeled. There is one small problem to be fixed: the computed probability highly depends on the sequence length (number of times the insertion state is used).

  8. Length dependency. To remove the length dependency of the score, the probability of a sequence given the model just described is compared to the probability of that sequence given a random (NULL) model. As usual, it's convenient to express the ratio as a log-odds score. For our random model, let's assume that nucleotides are equiprobable. For a sequence $S$ of length $n$, the log-odds score becomes
     $$\log_2 \frac{P(S \mid M)}{P(S \mid R)} = \log_2 \frac{P(S \mid M)}{0.25^n}$$
     Since the two models score the same number of symbols, and since the log of a product is the sum of the logs, each probability value in the hidden Markov model can be transformed into a log-odds score, in our case by dividing the emission probability values by 0.25 (or by probability estimates from actual data). This also helps avoid underflow problems.

  9. Length dependency (contd).

       Sequence motifs           S           P(S|M) × 100   Log-odds
       Consensus                 ACAC--ATC   4.7            6.7
                                 ACA---ATG   3.3            4.9
                                 TCAACTATC   0.0075         3.0
                                 ACAC--AGC   1.2            5.3
                                 AGA---ATC   3.3            4.9
                                 ACCG--ATC   0.59           4.6
       Exception                 TGCT--AGG   0.0023         -0.97

     Notice that two matches cannot be compared in probability space because of the length dependency. Consider the scores for the second sequence (TCAACTATC) and the exceptional one: their raw probability scores are of the same order of magnitude (0.0075 and 0.0023) but their log-odds scores are quite different (3.0 and -0.97). In the case of the exceptional sequence, the log-odds score is negative, indicating that the null model is a better fit. The log-odds values in this table correspond to natural logarithms, e.g. $\ln(0.0472 / 0.25^7) \approx 6.7$; in base 2 the same ratio would score about 9.6.
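
As a rough check of the table, the log-odds score can be computed as a difference of logarithms, which also avoids forming the tiny quantity $0.25^n$ explicitly. The function name `log_odds` and its `base` parameter are illustrative assumptions; the two probabilities are the unrounded products from the motivational example.

```python
import math


def log_odds(p_model, length, background=0.25, base=math.e):
    """log_base( P(S|M) / background**length ), computed as a difference of logs."""
    natural = math.log(p_model) - length * math.log(background)
    return natural / math.log(base)


# Unrounded path probabilities of two sequences of length 7.
p_consensus = 0.8**5 * 0.6 * 0.4 * 0.6  # ACACATC
p_exception = 0.2**5 * 0.6 * 0.2 * 0.6  # TGCTAGG

ln_consensus = log_odds(p_consensus, 7)            # natural logarithm
ln_exception = log_odds(p_exception, 7)
log2_consensus = log_odds(p_consensus, 7, base=2)  # base 2, as in the formula
print(round(ln_consensus, 2), round(ln_exception, 2), round(log2_consensus, 2))
```

With natural logarithms the consensus scores about 6.65 and the exception about -0.97, in line with the table; the base-2 score of the consensus is about 9.59, so the base only rescales the scores and does not change their sign or ranking.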

  10. Profile-HMM. There is a particular type of HMM that is often used to model families of sequences; they are called profile-HMMs. They resemble normal sequence profiles and make it possible to model insertions and deletions in a position-specific manner.
     [Figure: profile-HMM topology, shown around position j.]

  11. Profile-HMM. The bottom nodes are called main or match states. Each $M_j$ corresponds to a particular column in a multiple sequence alignment; the probability distribution at that node corresponds to the probability distribution of the column it models in the alignment.
     [Figure: profile-HMM topology with Begin and End states and, at position j, the states $D_j$, $I_j$ and $M_j$.]

  12. Profile-HMM. The diamond-shaped nodes are called insertion states, noted $I_j$. They make it possible to model variable regions. The amino acid probability distribution at those nodes could be set to the overall probability distribution of amino acids, for example.
     [Figure: profile-HMM topology with Begin and End states and, at position j, the states $D_j$, $I_j$ and $M_j$.]

  13. Profile-HMM. Finally, the round nodes are called delete or silent states, noted $D_j$. They make it possible to model deletions, i.e. to skip certain columns of the alignment. As you can see, insertions and deletions are modelled separately. Also, their respective probabilities are allowed to vary along the profile.
     [Figure: profile-HMM topology with Begin and End states and, at position j, the states $D_j$, $I_j$ and $M_j$.]

  14. Profile-HMM. ⇒ Emission probabilities are now associated with each $M_j$; in the case of profile HMMs they correspond to a column in a profile (alignment).
     [Figure: profile-HMM topology with Begin and End states and the states $D_j$, $I_j$ and $M_j$; an emission distribution over the amino acids is attached to a match state.]

  15. Remarks. The profile-HMM topology (states and interconnections) is specific to bioinformatics. In bioinformatics, many other topologies are used, including specific topologies for modeling eukaryotic gene structures (exons and introns), sequence alignments and trans-membrane proteins. Examples will be seen later.

  16. An HMM can be seen as a generative model. ⇒ Starting from the begin state, move to an adjacent state, $i$, according to some transition probability distribution; emit a symbol according to the emission probability distribution of that state; move to an adjacent state, again according to the transition probabilities; repeat until the end state has been reached.
     [Figure: profile-HMM topology with Begin and End states and the states $D_j$, $I_j$ and $M_j$; an emission distribution over the amino acids is attached to a match state.]
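
A minimal sketch of this generative procedure, using the seven-state motif model of the motivational example rather than a full profile-HMM. The `"begin"` and `"end"` labels, and the probability-1 transitions from begin to state 1 and from state 7 to end, are assumptions added here so that the walk has a start and a stop; the function name `generate` is also illustrative.

```python
import random

EMISSIONS = {
    1: {"A": 0.8, "T": 0.2},
    2: {"C": 0.8, "G": 0.2},
    3: {"A": 0.8, "C": 0.2},
    4: {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    5: {"A": 1.0},
    6: {"T": 0.8, "G": 0.2},
    7: {"C": 0.8, "G": 0.2},
}
# state -> list of (next state, probability); begin/end edges are assumed.
TRANSITIONS = {
    "begin": [(1, 1.0)],
    1: [(2, 1.0)],
    2: [(3, 1.0)],
    3: [(4, 0.6), (5, 0.4)],
    4: [(4, 0.4), (5, 0.6)],
    5: [(6, 1.0)],
    6: [(7, 1.0)],
    7: [("end", 1.0)],
}


def generate(rng):
    """Walk from begin to end, emitting one symbol per visited state."""
    state, symbols, path = "begin", [], []
    while True:
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights)[0]
        if state == "end":
            return "".join(symbols), path
        letters, probs = zip(*EMISSIONS[state].items())
        symbols.append(rng.choices(letters, weights=probs)[0])
        path.append(state)


sequence, path = generate(random.Random(0))
print(sequence, path)
```

An observer would see only `sequence`; `path` is the part that stays hidden.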

  17. What's hidden? Seen as a generative model, at each step this abstract machine moves to a new state and produces a symbol. The observer only sees the sequence of symbols, not the sequence of state transitions, which are hidden. What is Markovian?
     [Figure: profile-HMM topology with Begin and End states and the states $D_j$, $I_j$ and $M_j$; an emission distribution over the amino acids is attached to a match state.]

  18. Definitions. We need to distinguish between the sequence of states ($\pi$) and the sequence of symbols ($S$). The sequence of states, denoted by $\pi$ and called the path, is modeled as a Markov chain; these transitions are not directly observable (they are hidden),
     $$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$
     where $a_{kl}$ is the transition probability from state $k$ to state $l$. Each state has emission probabilities associated with it:
     $$e_k(b) = P(S(i) = b \mid \pi_i = k)$$
     the probability of observing/emitting the symbol $b$ when in state $k$.

  19. Definitions. The alphabet of emitted symbols, $\Sigma$, the set of (hidden) states, $Q$, a matrix of transition probabilities, $A$, as well as the emission probabilities, $E$, are the parameters of an HMM, $M = \langle \Sigma, Q, A, E \rangle$.

  20. Interesting questions.
     1. $P(S, \pi)$: the joint probability of a sequence of symbols $S$ and a sequence of states $\pi$. The decoding problem consists of finding a path $\pi$ such that $P(S, \pi)$ is maximum;
     2. $P(S \mid \theta)$: the probability of a sequence of symbols $S$ given the model $\theta$. It represents the likelihood that sequence $S$ has been produced by this HMM; let's call this the likelihood problem;
     3. Finally, how are the parameters of the model (HMM), $\theta$, determined? Let's call this the parameter estimation problem.

  21. Definitions. Joint probability of a sequence of symbols $S$ and a sequence of states $\pi$:
     $$P(S, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(S(i)) \, a_{\pi_i \pi_{i+1}}$$
     where $L$ is the length of $S$, state 0 is the begin state and $\pi_{L+1}$ is the end state. For example,
     $$P(S = \mathrm{VGPGGAHA}, \pi = \mathrm{BEG}, M_1, M_2, I_3, I_3, I_3, M_3, M_4, M_5, \mathrm{END})$$
     ⇒ However, in practice, the state sequence $\pi$ is not known in advance.
     [Figure: profile-HMM topology with Begin and End states and the states $D_j$, $I_j$ and $M_j$.]

  22. Worked example: the occasionally dishonest player. A simplified example will help to better understand the characteristics of HMMs. I want to play a game. I will be tossing a coin $n$ times. This information can be represented as follows: {H, T, T, H, T, T, …} or {0, 1, 1, 0, 1, 1, …}. In fact, I will be using two coins! One is fair, i.e. head and tail are equiprobable outcomes, but the other one is loaded (biased): it returns head with probability 1/4 and tail with probability 3/4. I will not reveal when I am exchanging the coins. This information is hidden from you. Objective: looking at a series of observations, $S$, can you predict when the exchanges of coins occurred?

  23. Worked example: the occasionally dishonest player. Such a game can be modeled using an HMM where each state represents a coin, with its own emission probability distribution, and the transition probabilities represent exchanging the coins.
     [Figure: two-state HMM. State $\pi_1$ (fair coin): P(0) = 1/2, P(1) = 1/2. State $\pi_2$ (loaded coin): P(0) = 1/4, P(1) = 3/4. Transitions: $\pi_1 \to \pi_1$ = .9, $\pi_1 \to \pi_2$ = .1, $\pi_2 \to \pi_1$ = .2, $\pi_2 \to \pi_2$ = .8.]

  24. Worked example: the occasionally dishonest player. Given an input sequence of heads and tails, such as 0, 1, 1, 0, 1, 1, 1, which sequence of states has the highest probability?
     [Figure: two-state HMM. State $\pi_1$ (fair coin): P(0) = 1/2, P(1) = 1/2. State $\pi_2$ (loaded coin): P(0) = 1/4, P(1) = 3/4. Transitions: $\pi_1 \to \pi_1$ = .9, $\pi_1 \to \pi_2$ = .1, $\pi_2 \to \pi_1$ = .2, $\pi_2 \to \pi_2$ = .8.]

  25. Worked example: the occasionally dishonest player.

       S   0   1   1   0   1   1   1
       π   π1  π1  π1  π1  π1  π1  π1
       π   π1  π1  π1  π1  π1  π1  π2
       ...
       π   π2  π2  π1  π1  π2  π2  π2
       ...
       π   π2  π2  π2  π2  π2  π2  π2

     Since the game consists of predicting the series of switches from one coin to the other, selecting the path with the highest joint probability, $P(S, \pi)$, seems appropriate. Here, there are $2^7 = 128$ possible paths, so enumerating all of them is feasible; in general, however, the number of states, and consequently the number of possible paths, is much larger.
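
The exhaustive enumeration mentioned above can be sketched as follows. States are indexed 0 for $\pi_1$ (fair) and 1 for $\pi_2$ (loaded), and the transition and emission values are those of the two-coin figure. The slides do not specify how the first coin is chosen, so the uniform `INITIAL` distribution is an assumption, as are the names used.

```python
from itertools import product

INITIAL = [0.5, 0.5]               # assumed: either coin may start the game
TRANS = [[0.9, 0.1], [0.2, 0.8]]   # TRANS[k][l] = a_kl
EMIT = [[0.5, 0.5], [0.25, 0.75]]  # EMIT[k][b] = e_k(b)
S = [0, 1, 1, 0, 1, 1, 1]


def joint_probability(obs, path):
    """P(S, path): product of the emissions and transitions along the path."""
    p = INITIAL[path[0]] * EMIT[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][obs[i]]
    return p


# All 2**7 = 128 state sequences of length 7.
paths = list(product(range(2), repeat=len(S)))
best_path = max(paths, key=lambda path: joint_probability(S, path))
best_p = joint_probability(S, best_path)
print(len(paths), best_path, best_p)
```

Under these assumptions the path that never leaves the fair coin has the highest joint probability for this short sequence, narrowly ahead of the path that always uses the loaded coin. The cost of this approach is proportional to the number of paths, which is what the dynamic programming formulation avoids.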

  26. The decoding problem. Given an observed sequence of symbols, $S$, the decoding problem consists of finding a sequence of states, $\pi$, such that the joint probability of $S$ and $\pi$ is maximum,
     $$\operatorname{argmax}_{\pi} P(S, \pi)$$
     For our game, the sequence of states is of interest because it serves to predict the exchanges of coins.

  27. The decoding problem. If the observed sequence of symbols was of length one, the sequence of states would also be of length one (in our restricted example). Which state would you predict if the observed symbol was a 0? What if it was a 1? Now consider an observed sequence of length two, and let's assume that the last symbol is 1. What is the probability of that symbol being emitted from state $\pi_1$? There are two ways of ending up in $\pi_1$ while producing $S(2)$: 1) $S(1)$ could have been produced from $\pi_1$, and the state remained $\pi_1$, or 2) $S(1)$ could have been produced from $\pi_2$, and there was a transition from $\pi_2$ to $\pi_1$. The two joint probabilities would be

  28. The decoding problem (cont.).
     $$P(S(1) \mid \pi_1) \, P(\pi_1 \to \pi_1) \, P(S(2) \mid \pi_1)$$
     and
     $$P(S(1) \mid \pi_2) \, P(\pi_2 \to \pi_1) \, P(S(2) \mid \pi_1).$$

  29. The decoding problem. Now consider an observed sequence of length three, and let's assume that the last symbol is 1. What is the probability of that symbol being emitted from state $\pi_1$? There are two ways of ending up in $\pi_1$ while producing $S(3)$: 1) the last state that led to the production of the sequence of symbols $S[1,2]$ was $\pi_1$ and the state remained $\pi_1$, with probability $a_{11}$, or 2) the last state that led to the production of the sequence of symbols $S[1,2]$ was $\pi_2$ and it is followed by a transition from $\pi_2$ to $\pi_1$, with probability $a_{21}$. Let's define $v_k(i)$ as the probability of the most probable path ending in state $k$ while producing the observation $i$. Using this notation, the probabilities for the above two scenarios combine into
     $$v_1(3) = \max[\, v_1(2) \times a_{11} \times e_1(1), \; v_2(2) \times a_{21} \times e_1(1) \,]$$
     where $e_1(1)$ is the probability of emitting the last symbol, 1, from state $\pi_1$.

  30. The decoding problem. For our two-state HMM, we can write the following equations,
     $$v_1(i) = \max[\, v_1(i-1) \times a_{11} \times e_1(S(i)), \; v_2(i-1) \times a_{21} \times e_1(S(i)) \,]$$
     $$v_2(i) = \max[\, v_1(i-1) \times a_{12} \times e_2(S(i)), \; v_2(i-1) \times a_{22} \times e_2(S(i)) \,]$$

  31. The decoding problem.
     [Figure: trellis for the observations 0 1 1 0 1 1 1, with one row per state ($\pi_1$, $\pi_2$).]

  32. The decoding problem. The most probable path can be found recursively. The score for the most probable path ending in state $l$ with observation $i$, noted $v_l(i)$, is given by,
     $$v_l(i) = e_l(S(i)) \max_k [\, v_k(i-1) \, a_{kl} \,]$$

  33. The decoding problem (cont.).
     [Figure: computing $v_l(i)$ in state $l$ from the values $v_k(i-1)$ of the predecessor states $k$, the transition probabilities $a_{kl}$ and the emission probability $e_l(S(i))$.]

  34. The decoding problem (cont.). where $k$ runs over the states for which $a_{kl}$ is defined.

  35. The decoding problem. The algorithm for solving the decoding problem is known as the Viterbi algorithm. It finds the best (most probable) path using the dynamic programming technique.
     Initialization: $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
     Recurrence: $v_l(i) = e_l(S(i)) \max_k ( v_k(i-1) \, a_{kl} )$
     where $v_k(i)$ represents the probability of the most probable path ending in state $k$ and position $i$ in $S$. A pointer (backward) is kept from $l$ to the value of $k$ that maximizes $v_k(i-1) \, a_{kl}$.
     ⇒ Implementation issue: because the product of many (small) probabilities leads to underflow, the algorithm is implemented using

  36. The decoding problem (cont.). the logarithm of the values, and therefore the products become sums.
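
A minimal log-space sketch of the Viterbi recurrence with traceback, written for a small HMM without explicit begin and end states. The function name `viterbi_log`, the `initial` distribution (taken uniform in the usage, which is an assumption) and the list-based layout are illustrative choices. The parameter values in the usage are those of the Perl listing in this deck, where the loaded coin emits 1 with probability 0.95.

```python
import math


def viterbi_log(obs, initial, trans, emit):
    """Return (most probable state path, its log joint probability)."""
    def log(p):
        return math.log(p) if p > 0 else -math.inf

    states = range(len(initial))
    # v[l]: log probability of the best path ending in state l at the
    # current position; back[i][l]: predecessor of l on that path.
    v = [log(initial[l]) + log(emit[l][obs[0]]) for l in states]
    back = []
    for symbol in obs[1:]:
        new_v, pointers = [], []
        for l in states:
            k_best = max(states, key=lambda k: v[k] + log(trans[k][l]))
            new_v.append(v[k_best] + log(trans[k_best][l]) + log(emit[l][symbol]))
            pointers.append(k_best)
        v = new_v
        back.append(pointers)
    # Traceback from the best final state.
    state = max(states, key=lambda l: v[l])
    best = v[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best


trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.50, 0.50], [0.05, 0.95]]
obs = [0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]
path, log_p = viterbi_log(obs, [0.5, 0.5], trans, emit)
print(path, math.exp(log_p))
```

Sums of logarithms replace the products, so the scores stay in a comfortable numeric range even for long sequences; zero probabilities map to negative infinity and are never selected by the maximization unless every alternative is also impossible.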

  37. The decoding problem.
     [Figure: general trellis, with one column per observation S(1), S(2), S(3), ..., S(n-1), S(n) and one row per state $\pi_1$, $\pi_2$, ..., $\pi_m$.]

  38. The decoding problem.

         # transition probabilities (t): state 0 is the fair coin, state 1 the loaded coin
         $t[0][0] = 0.9; $t[0][1] = 0.1;
         $t[1][0] = 0.2; $t[1][1] = 0.8;

         # emission probabilities (e)
         # (here the loaded coin returns 1 with probability 0.95, not 3/4 as in the figure)
         $e[0][0] = 0.50; $e[0][1] = 0.50;
         $e[1][0] = 0.05; $e[1][1] = 0.95;

         # observed sequence (S)
         @S = (0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1);

         # initialization (d is the dynamic programming table)
         $d[ 0 ][ 0 ] = $e[ 0 ][ $S[ 0 ] ];
         $d[ 1 ][ 0 ] = $e[ 1 ][ $S[ 0 ] ];

  39. The decoding problem.

         for ( $j=1; $j < @S; $j++ ) {          # positions in S
           for ( $i=0; $i <= 1; $i++ ) {        # current state
             $m = 0;
             for ( $k=0; $k <= 1; $k++ ) {      # previous state
               $v = $d[$k][$j-1]*$t[$k][$i]*$e[$i][$S[$j]];
               if ( $v > $m ) {
                 $m = $v; $from = $k; $to = $i;
               }
             }
             $d[ $i ][ $j ] = $m;               # best score ending in state i at position j
             $tr[ $i ][ $j ] = "($from->$to)";  # transition that achieved it
           }
         }

  40. The decoding problem.

         for ( $i=0; $i <= 1; $i++ ) {
           for ( $j=0; $j < @S; $j++ ) {
             printf "\t%5.5f", $d[ $i ][ $j ];
           }
           print "\n";
           for ( $j=0; $j < @S; $j++ ) {
             printf "\t %s", $tr[ $i ][ $j ];
           }
           print "\n";
         }

  41. The decoding problem. Output of the program, shown here with one row per position $j$ instead of one row per state:

       e[0][0] = 0.50 ; e[0][1] = 0.50 ; e[1][0] = 0.05 ; e[1][1] = 0.95 ;
       t[0][0] = 0.9 ; t[0][1] = 0.1 ; t[1][0] = 0.2 ; t[1][1] = 0.8 ;

       j    S(j)   d[0][j]   tr[0][j]   d[1][j]   tr[1][j]
       0    0      0.50000              0.05000
       1    1      0.22500   (0->0)     0.04750   (0->1)
       2    0      0.10125   (0->0)     0.00190   (1->1)
       3    1      0.04556   (0->0)     0.00962   (0->1)
       4    0      0.02050   (0->0)     0.00038   (1->1)
       5    1      0.00923   (0->0)     0.00195   (0->1)
       6    1      0.00415   (0->0)     0.00148   (1->1)
       7    1      0.00187   (0->0)     0.00113   (1->1)
       8    1      0.00084   (0->0)     0.00086   (1->1)
       9    1      0.00038   (0->0)     0.00065   (1->1)
       10   1      0.00017   (0->0)     0.00049   (1->1)
       11   1      0.00008   (0->0)     0.00038   (1->1)

     The larger final value is in state 1 (0.00038). Following the recorded transitions backward from there gives the most probable path 0 0 0 0 0 1 1 1 1 1 1 1: the fair coin for the first five tosses, then the loaded coin.

  42. The decoding problem. Given an HMM representing a protein family as well as an unknown protein sequence, the solution to the decoding problem reveals the internal structure of the unknown sequence, showing the location of the insertions and deletions, core elements, etc.

  43. The likelihood problem: calculating $P(S \mid \theta)$. In the case of a Markov chain there is a single path for a given sequence $S$ and therefore $P(S \mid \theta)$ is given by,
     $$P(S \mid \theta) = P(S(1)) \prod_{i=2}^{n} a_{S(i-1) S(i)}$$
     In the case of an HMM, there are several paths producing the same $S$ (some paths will be more likely than others) and $P(S \mid \theta)$ should be defined as the sum of the probabilities of all possible paths producing $S$,
     $$P(S \mid \theta) = \sum_{\pi} P(S, \pi)$$
     The number of paths grows exponentially with respect to the length of the sequence, therefore the paths cannot simply be enumerated and summed.

  44. The likelihood problem: forward algorithm. Modifying the Viterbi algorithm by replacing the maximization with a sum calculates the probability of the observed sequence up to position $i$, ending in state $l$,
     $$f_l(i) = e_l(S(i)) \sum_k f_k(i-1) \, a_{kl}$$

  45. The likelihood problem
      The score, noted $f_l(i)$, represents the probability of the sequence up to (and including) $S(i)$, and is given by,
      $$f_l(i) = e_l(S(i)) \sum_{k} \left[ f_k(i-1)\, a_{kl} \right]$$

  46. The likelihood problem (cont.)
      [Figure: a state $k$, carrying $f_k(i-1)$, linked by an arc labelled $a_{kl}$ to state $l$, which emits $S(i)$ with probability $e_l(S(i))$.]
      Here $k$ runs over the states for which $a_{kl}$ is defined.
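
The forward recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the two-state model (`STATES`, `INIT`, `TRANS`, `EMIT`) is hypothetical, the model is assumed to have an initial distribution and no explicit end state, and `brute_force` is included only to check the result on short sequences.

```python
from itertools import product

def forward(seq, states, init, trans, emit):
    """Return P(seq | model) by summing over all state paths."""
    # f[l] is f_l(i): probability of seq[:i+1] with the path ending in state l.
    f = {l: init[l] * emit[l][seq[0]] for l in states}
    for symbol in seq[1:]:
        # f_l(i) = e_l(S(i)) * sum_k f_k(i-1) * a_kl; a missing arc counts as 0.
        f = {
            l: emit[l][symbol] * sum(f[k] * trans[k].get(l, 0.0) for k in states)
            for l in states
        }
    # Assumption: no explicit end state, so the final values are simply summed.
    return sum(f.values())

def brute_force(seq, states, init, trans, emit):
    """Same quantity by enumerating every path (exponential; for checking only)."""
    total = 0.0
    for path in product(states, repeat=len(seq)):
        p = init[path[0]] * emit[path[0]][seq[0]]
        for i in range(1, len(seq)):
            p *= trans[path[i - 1]].get(path[i], 0.0) * emit[path[i]][seq[i]]
        total += p
    return total

# Hypothetical two-state model; every number below is made up for illustration.
STATES = ("island", "background")
INIT = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

likelihood = forward("CGCG", STATES, INIT, TRANS, EMIT)
```

The forward pass costs a number of operations proportional to the sequence length times the square of the number of states, whereas the enumeration visits every one of the exponentially many paths. For long sequences the products underflow, so practical code would work with logarithms or rescale `f` at each position.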

  47. Forward Algorithm
      Can you think of an application for the forward algorithm?
      Pfam is a large collection of HMMs covering many common protein domains and families, one HMM per domain or family; version 30.0 (June 2016) contains 16306 families.
      Given a new sequence, the forward algorithm can be used for finding the family to which it belongs (if any).
      ⇒ pfam.xfam.org

  48. Model Specification
      We now turn to our third and final question. How to determine the parameters of the model?
      Let $x_1, \ldots, x_m$ be $m$ independent examples forming the training set (typically, $m$ sequences); the objective is to find a set of parameters, $\theta$, that maximizes the likelihood of the training set,
      $$\max_{\theta} \prod_{i=1}^{m} P(x_i \mid \theta)$$

  49. Model Specification
      Structure: states + interconnect (this is an occasion to include domain-specific information!);
      Estimating the transition/emission probabilities.

  50. Modeling the length
      [Figure: model for sequences at least 5 symbols long.]

  51. Modeling the length (cont.)
      [Figure: model for sequences 2 to 8 symbols long.]

  52. Arbitrary Deletions
      Too expensive, too many parameters to evaluate!

  53. Arbitrary Deletions (cont.)
      Silent (null) states do not emit symbols.
      ⇒ Silent states prevent modeling specific distant transitions.

  54. Profile HMMs
      [Figure: profile HMM with a Begin state, an End state, and, for each position $j$, a match state $M_j$, an insert state $I_j$ and a delete state $D_j$.]
      ⇒ Models insertions and deletions separately.

  55. Trans-membrane (helical) proteins

  56. Trans-membrane (helical) proteins (cont.)
      www.cbs.dtu.dk/services/TMHMM-2.0

  57. Gene prediction
      [Figure: gene structure from 5' to 3'. Labels: flanking region, CAAT box, TATA box, GC boxes, transcription initiation, 5'UTR, initiation codon, Exon 1, Intron I (GT ... AG), Exon 2, ..., Exon n, stop codon, 3'UTR, Poly(A), flanking region.]

  58. Gene prediction (cont.)
      genes.mit.edu/GENSCAN.html
      [Figure: GENSCAN state diagram. States: E0, E1, E2, I0, I1, I2, Einit, Eterm, Esngl, 5'UTR, 3'UTR, Inter, Promo, PolyA.]

  59. The parameter estimation problem
      Given: a fixed topology; $n$ independent positive examples: $S_1, S_2, \ldots, S_n$.
      Problem: estimate the $a_{st}$ and $e_k(b)$ probabilities.
      $$\log P(S_1, S_2, \ldots, S_n \mid \theta) = \sum_{j=1}^{n} \log P(S_j \mid \theta)$$
      Two scenarios:
      The paths are known (CG islands, secondary structure, gene prediction);
      The paths are unknown.

  60. Parameter estimation / known paths
      Maximum likelihood estimators are
      $$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \quad \text{and} \quad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
      Necessitates a large number of positive examples;
      If a state $k$ is not visited then numerator and denominator are zero;
      $P(x, \pi)$ is a product of probabilities, what happens if an arc/emission probability is zero?
      Workaround?
      $$A_{kl} \leftarrow A_{kl} + r_{kl} \qquad E_k(b) \leftarrow E_k(b) + r_k(b)$$

  61. Parameter estimation / known paths (cont.)
      where $r_{kl}$ and $r_k(b)$ are pseudocounts. The simplest pseudocount would be $r_{kl} = 1$ and $r_k(b) = 1$. Better pseudocounts would reflect our prior bias, using the observed frequency of amino acids or values derived from substitution scores.
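
As a minimal sketch of the estimators with pseudocounts, the function below adds a constant to every count before normalizing each row. The function name `estimate`, the state name `M1` and the counts are hypothetical, and a single constant pseudocount is assumed rather than a prior-weighted one.

```python
def estimate(counts, pseudocount=1.0):
    """Normalize each row of a count table after adding a pseudocount to every cell.

    counts maps a state k to a dict of counts, e.g. A[k][l] or E[k][b].
    """
    probs = {}
    for k, row in counts.items():
        total = sum(row.values()) + pseudocount * len(row)
        if total == 0:
            # State never visited and no pseudocount: the estimator is 0/0.
            raise ValueError(f"no observations for state {k!r}")
        probs[k] = {x: (c + pseudocount) / total for x, c in row.items()}
    return probs

# Hypothetical emission counts for one match state: C and T were never observed.
E = {"M1": {"A": 4, "C": 0, "G": 1, "T": 0}}

e_ml = estimate(E, pseudocount=0.0)   # plain maximum likelihood: e(C) = 0
e_pc = estimate(E, pseudocount=1.0)   # r = 1: e(C) = 1/9, no zero probabilities
```

With `pseudocount=0.0` any path that emits C from `M1` gets probability zero; with `pseudocount=1.0` the same path keeps a small non-zero probability.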

  62. Parameter estimation: remarks
      Some (emission, transition) probabilities can be zero; if this is the case then all paths involving those probabilities would have probability zero as well. In particular, this would happen if the number of sequences used to build the model is low: “strong conclusions would be drawn from very little evidence”.
      To circumvent that problem, pseudocounts are added prior to calculating the frequencies. The simplest pseudocounts consist in initializing all the counts to one, rather than zero, before counting the number of occurrences of each event.
      In the case of the emission probabilities, this would be assuming that all amino acids are equiprobable. Since counts don’t need to be integers, a solution would be to initialize the counts with a value between zero and one, proportional to the overall distribution of the amino acids.

  63. Parameter estimation: remarks
      More sophisticated pseudocounts would reflect the distribution of the amino acids at that position. For example, if leucine occurs with a high frequency at that position, you would expect that isoleucine would occur with a high frequency too, but not arginine: in the PAM250 scoring matrix, the score for substituting leucine and isoleucine is 2.80 whilst the score for substituting leucine and arginine is -2.2.

  64. Parameter estimation / unknown paths
      It is also possible to estimate the emission/transition probabilities when the paths are unknown.
      In the case of profile HMMs, it is possible to estimate the parameters of the HMM starting with a set of unaligned sequences.
      The details of these methods are complex, but the general idea is as follows: the model is initialized with more or less random values (we say more or less because one can use prior knowledge about the distribution of the amino acids or a rough sequence alignment as a starting point).
      The model is used to align the sequences from the training set; the alignment is then used to improve the parameters of the model. The “improved” model is used to align the training sequences again; in general, this will lead to a slightly improved alignment, which is used again to improve the probabilities of the model.

  65. Parameter estimation / unknown paths (cont.)
      The process is repeated until no improvement of the sequence alignment is observed.
      The scheme for parameter estimation is called “Expectation-Maximization”; one of the standard algorithms for model estimation is called the Baum-Welch or forward-backward algorithm.
      One of the main problems or limitations with this technique is that it converges toward a local optimum, i.e. it is not guaranteed to find the most probable model given the observed data.

  66. Expectation-Maximization (EM) algorithm
      1. Choose an initial model. If no prior information is available, make all the transition probabilities equiprobable, similarly for the emission probabilities;
      2. Use the decoding algorithm for finding the maximum likelihood path for each input sequence;
      3. Using these alignments, tally statistics for estimating all $a_{kl}$ and $e_k(b)$ values;
      4. Repeat 2 and 3 until the parameter estimates converge.
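
The four steps above can be sketched as follows. Because each iteration decodes a single best path per sequence and then counts, this particular loop is often called Viterbi training; it is a sketch under stated assumptions, not the Baum-Welch algorithm. The model (`STATES`, `ALPHABET`, `INIT`, `TRANS`, `EMIT`), the pseudocount `r`, the fixed initial distribution and the stopping rule (decoded paths no longer change) are all assumptions made for illustration.

```python
import math

def viterbi(seq, states, init, trans, emit):
    """Most probable state path for seq, computed in log space."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    v = {l: lg(init[l]) + lg(emit[l][seq[0]]) for l in states}
    back = []
    for sym in seq[1:]:
        ptr, nxt = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + lg(trans[k][l]))
            ptr[l] = best
            nxt[l] = v[best] + lg(trans[best][l]) + lg(emit[l][sym])
        back.append(ptr)
        v = nxt
    path = [max(states, key=lambda l: v[l])]
    for ptr in reversed(back):          # follow the back-pointers to position 1
        path.append(ptr[path[-1]])
    return path[::-1]

def viterbi_training(seqs, states, alphabet, init, trans, emit, r=1.0, max_iter=50):
    """Decode, tally counts (plus pseudocount r), re-estimate, repeat."""
    previous = None
    for _ in range(max_iter):
        paths = [viterbi(s, states, init, trans, emit) for s in seqs]    # step 2
        if paths == previous:           # same paths give the same counts: converged
            break
        previous = paths
        A = {k: {l: r for l in states} for k in states}                  # step 3
        E = {k: {b: r for b in alphabet} for k in states}
        for s, p in zip(seqs, paths):
            for i, (sym, k) in enumerate(zip(s, p)):
                E[k][sym] += 1
                if i > 0:
                    A[p[i - 1]][k] += 1
        trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
        emit = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return trans, emit

# Hypothetical starting model (step 1); the numbers are made up.
STATES = ("island", "background")
ALPHABET = "ACGT"
INIT = {"island": 0.5, "background": 0.5}
TRANS = {k: {l: 0.5 for l in STATES} for k in STATES}
EMIT = {
    "island": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
    "background": {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45},
}

new_trans, new_emit = viterbi_training(["CGCGAATT", "ATATCGCG"],
                                       STATES, ALPHABET, INIT, TRANS, EMIT)
```

Like the scheme on the slides, this loop can only reach a local optimum, so the choice of the starting model matters.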

  67. Summary
      Like Markov Chains, Hidden Markov Models (HMMs) consist of a finite number of states, $\pi_1, \pi_2, \ldots$, and transition probabilities, $P(\pi_i \to \pi_j)$.
      Unlike Markov Chains, HMMs also “emit” a symbol (letter) at each (most) states.
      Sequence of states $\pi = \pi_1, \pi_2, \ldots$
      Sequence of observed symbols $S = S(1), S(2), \ldots$
      Given a new observation, the sequence of symbols is known (observed) but not the sequence of states, “it is hidden”.

  68. Historical note
      HMMs were first developed for solving speech recognition problems in the early 1970s.
      1. A speech signal is divided into frames of 10 to 20 milliseconds;
      2. A process called vector quantization assigns a predefined category to each frame (typically 256 predefined categories);
      The input is now represented as a long sequence of category labels (symbols).
      The next task is to recognize words in this long sequence of categories.
      However, variations are observed (that will be seen as category substitutions, insertions and deletions).
      HMMs were developed in such context.

  69. Historical note (cont.)

  70. Example: CG islands
      [From Durbin et al. Biological Sequence Analysis.]
      Certain regions of the human genome are known as CG (or CpG) islands.
      These regions, located around promoters or start regions of many genes, show a higher frequency of CG dinucleotides than elsewhere*.
      Those regions are a few hundred to a few thousand bases long.
      * This is because whenever C is followed by G, the chances that C will be methylated are higher (adding a CH$_3$ group to its base); also, methylated Cs mutate to T with high frequency, therefore CG dinucleotides are observed less frequently than expected by chance, $P(C) \times P(G)$; finally, methylation is suppressed in biologically important regions such as the start of a gene, which explains the fact that CGs occur there more frequently than elsewhere.

  71. Example: CG islands
      Problem 1: Given an unlabeled (short) sequence of DNA, can we decide if it comes from a CG island or not?
      ⇒ promoter: a site on DNA to which RNA polymerase will bind and initiate transcription.

  72. Example: CG islands (cont.)
      LOCUS       AL162458   150791 bp   DNA   PRI   29-SEP-2000
      DEFINITION  Human DNA sequence from clone RP11-465L10 on chromosome 20.
      misc_feature  20670..21997
                    /note="CpG island"
      20641 ...C GCGCGGGTGC CAGGACCCAG GTCCTTGCTA
      20701 CGTCCGGAGC CTACGTCACC ACGATGCCTC CCCTGGGCCG GCGGCAGAAC CCGAGACCCC
      20761 CGCAGGTTCT AAGACAGCCC CCACGCCCCC CAGTGCGCAC GCTCAGTCCA ACCCCGCCGC
      20821 GCACCGCCCA CCGCGAACAT CCGGCTCCTG CGTGTGTGCT CGAGGGGGAA ACTGAGGCGG
      20881 GGACGTGCCA GTGAATTCAT TCCTTCCTCA GTCCACCCGC AGGCCTACAA AGCTGTCTCC
      20941 CCTTCCTCAG CGCCACAAGG AACAGCAGGG ACGGATGGGA AGAAGGGGAG GGGGCCGAAA
      21001 GCAAGCTGGG TGCGAGGAGC CAGCCGACCC TGCCACACTC AAGATGGCGG CGCGGCCGCG
      21061 GCGAGGTCCC TCAGAGGCGG TACCAGCGCA TGCGCAGCGC GGAGTCCCGG CCCGGGACAC
      21121 AAGATGGCGG CAGCGGCGCT GGGGAGGGCG AGGCGGAGGC GGCAAAACGG GCGGTCGAGC
      21181 AGAACGTGTA GCCGCGTCCC CTCCAGTCCG CTCCGGGCAG GTAAGAGTCC CAGGAAGCCA
      21241 TGGTCCCGCA GCGAGCCGCG CCAGGGTCTG GGGATCCGAA GCTGGGGGGC GGCGGCCCCT
      21301 CCGGCGCTTT CTGCTCGGGA CTGCCGCTTG CCCTGTCTCT GTTGCCGCCG CCATCTTAGA
      21361 CCCGCGGGTG GGCGGCCGCG CCGGTGGCCG AAGTGAGGGA GGTGGGCCCG GAGAGCCCCA
      21421 GCGGAGCGGG CTCTAGGGCC CCTCCGCTGC TGCCGCCGCC ACCGCCTTTG TGTCGGGCTC
      21481 CGACTCTGAG TCGCCTCAGC CCGGGGGCGG GAGCGCGCGG CGGGGCGGGG GGCGGAGCCC
      21541 GAGAGATGGG CCGGCGCGCG CGCGCGCGCC AAACAGCCCA CCCTCGCTGG GGTAGGGGGA
      21601 GGGGAAGGTG CGCGCGCGCG CGCGCGCTGG AGCTCGCCTC TCGCCTTCGT GCGCCGTCGC
      21661 GCCTGCGTAC TTTGTTCGCC CTTTGACTCC TCCCTACTGG GCCGGAGAAT TCTGATTGGT
      21721 ACATTGCGGA GATGGTCCCG CCCCACGTGC CTCCAATCCC GGACTCGGAC TCTGGCTTCT
      21781 GGTGGGTTTT TCTGGTTGCG CAGATAGAGT TGTTTATCCT TGAGCAGCGG TAATTCTCAA
      21841 ACTGCGGTAT GCGTGGGGGT CGGGAAGCCA CAGGATAAAT AAAGACGTTA ACTTAAGAGC
      21901 AGTTATGTCT TACTGGGAGC GTACAATGCT GGACTCTACA TATAACGGTC GAGTGATTCC
      21961 GGTTTATAAG CCGGAAAGCA GAAGGGCCCG GAATCCG...

  73. Probabilistic model of a sequence
      $$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$$
      (by application of the general multiplication rule)
      For example, the probability of CGAT:
      $$P(CGAT) = P(T, A, G, C) = P(T \mid A, G, C)\, P(A \mid G, C)\, P(G \mid C)\, P(C)$$
      Why can’t this framework be used for modeling CG islands?
      First and foremost, the model requires estimating a large number of parameters, which in turn implies an exceptionally large number of examples.

  74. Probabilistic model of a sequence
      Under the assumption that positions are independent from one another, $P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i)$,
      $$P(x) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1) = P(x_L)\, P(x_{L-1}) \cdots P(x_1)$$
      $$P(CGAT) = P(T \mid A, G, C)\, P(A \mid G, C)\, P(G \mid C)\, P(C) = P(T)\, P(A)\, P(G)\, P(C)$$
      Why can’t this framework be used for modeling CG islands?
      Dinucleotides are playing a critical role and the above model ignores the dependencies completely.

  75. Markov Chains
      However, under the assumption of an underlying (first-order) Markovian process (memoryless), $P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1})$, and the previous equation can be rewritten as follows:
      $$P(x) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1)$$
      In the previous example:
      $$P(CGAT) = P(T, A, G, C) = P(T \mid A)\, P(A \mid G)\, P(G \mid C)\, P(C)$$
      This seems to be the right model: the dinucleotide dependencies are represented.

  76. Graphical Formalism for Markov Chains

  77. Graphical Formalism for Markov Chains (cont.)
      [Figure: graph with one node per symbol: A, T, C, G.]

  78. Graphical Formalism for Markov Chains (cont.)
      $$a_{st} = P(S(i) = t \mid S(i-1) = s)$$
      ⇒ transition probabilities, $a_{st}$, are associated with the arcs of this graph.

  79. Markov Chains
      $a_{st} = P(S(i) = t \mid S(i-1) = s)$ is the probability that symbol $t$ is observed at position $i$ knowing that $s$ occurs at position $i-1$. Therefore $P(S)$ can now be written as follows,
      $$P(S) = P(S(1)) \prod_{i=2}^{n} a_{S(i-1)S(i)}$$
      Here the concept of time (involved in the development of PAM matrices) has been replaced by that of space, with similar observations,
      memoryless: the probability that symbol $a$ occurs at position $i$ depends only on what symbol is found at position $i-1$, and not any other $i' < i-1$;
      homogeneity of space: the probability that symbol $a$ occurs at position $i$ does not depend on the particular value of $i$ (e.g. $i = 123$ or $i = 162{,}144$).
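
The product formula for $P(S)$ translates directly into code. The sketch below works with logarithms to avoid underflow on long sequences; the function name and the transition table are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
import math

def markov_chain_log_prob(seq, initial, a):
    """log P(S) = log P(S(1)) + sum over i >= 2 of log a[S(i-1)][S(i)]."""
    logp = math.log(initial[seq[0]])
    for s, t in zip(seq, seq[1:]):      # consecutive pairs (S(i-1), S(i))
        logp += math.log(a[s][t])
    return logp

# Hypothetical parameters: uniform everywhere except the row for C.
ALPHABET = "ACGT"
INITIAL = {x: 0.25 for x in ALPHABET}
A = {s: {t: 0.25 for t in ALPHABET} for s in ALPHABET}
A["C"] = {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1}

# P(CGAT) = P(C) * a_CG * a_GA * a_AT = 0.25 * 0.4 * 0.25 * 0.25
p_cgat = math.exp(markov_chain_log_prob("CGAT", INITIAL, A))
```

Because of homogeneity, the same table `A` is used at every position, which is why a single 4 × 4 table is enough to score a sequence of any length.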

  80. Markov Chains (cont.)
      Higher-order Markov models are interesting for modeling DNA sequences; in particular for modeling coding regions, the codon structure.
      A Markov chain of order $k$ is a model where the probability that symbol $a$ occurs at position $i$ depends only on what symbol is found at positions $i-1, i-2, \ldots, i-k$, and not any other $i' < i-k$.

  81. Markov Chains (cont.)
      Markov chains are particularly convenient for two reasons,
      $P(S(i) \mid S(i-1) \ldots S(1))$ would be difficult to estimate (do you see why?);
      They lead to computationally efficient algorithms.

  82. Markov Chains (cont.)
      [Figure: graph with one node per symbol: A, T, C, G.]

  83. Markov Chains (cont.)
      ⇒ In the above model a sequence can start and end anywhere.

  84. Markov Chains (cont.)
      [Figure: the same graph (nodes A, T, C, G) with added Start and Stop states.]

  85. Markov Chains (cont.)
      ⇒ 1) Allows modeling start/end effects, $P(\text{Stop} \mid T)$ could be different than $P(\text{Stop} \mid G)$, 2) models the distribution of lengths of the sequences, 3) defines a probability distribution over all possible sequences (of any length) (sums to 1), 4) the probability of a length decays exponentially.
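
Point 4 can be made concrete with a short derivation. Assume, for illustration only, that every symbol state moves to Stop with the same probability $\tau$ (a symbol not used on the slides) and that Start never goes directly to Stop. A sequence of length $L$ then requires $L-1$ decisions to continue followed by one decision to stop:

```latex
P(\text{length} = L) = (1-\tau)^{L-1}\,\tau, \qquad L \ge 1,
\qquad
\sum_{L=1}^{\infty} (1-\tau)^{L-1}\,\tau = \frac{\tau}{1-(1-\tau)} = 1 .
```

This is a geometric distribution: the probability shrinks by the constant factor $1-\tau$ for each additional symbol, and the mean length is $1/\tau$. When the stop probabilities differ between states the distribution is no longer exactly geometric, but it still has an exponentially decaying tail.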

  86. Methodology
      Durbin et al. collected a large number of positive and negative examples of CG islands, almost 60,000 nucleotides in all;
      Construct a Markov Model for the positive examples and one for the negative examples; this involves estimating the transition probabilities;
      To use the models for discrimination, calculate the log-odds ratio.
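
The discrimination step can be sketched as a sum of per-transition log-ratios. The two tables below are hypothetical (they are not the values estimated by Durbin et al.), and the sketch assumes the first symbol contributes nothing to the score, i.e. both models share the same initial distribution.

```python
import math

def log_odds(seq, plus, minus):
    """Sum of log2(a+_st / a-_st) over consecutive pairs; > 0 favours the '+' model."""
    return sum(math.log2(plus[s][t] / minus[s][t]) for s, t in zip(seq, seq[1:]))

# Hypothetical transition tables: identical except for the row of C,
# where C -> G is common in islands ('+') and rare elsewhere ('-').
ALPHABET = "ACGT"
PLUS = {s: {t: 0.25 for t in ALPHABET} for s in ALPHABET}
MINUS = {s: {t: 0.25 for t in ALPHABET} for s in ALPHABET}
PLUS["C"] = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
MINUS["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}

score_island = log_odds("CGCG", PLUS, MINUS)      # two C -> G steps, each log2(3)
score_elsewhere = log_odds("CACA", PLUS, MINUS)   # two C -> A steps, each log2(2/3)
```

In practice the score is often divided by the sequence length so that sequences of different lengths can be compared against a single threshold.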

  87. Maximum Likelihood Estimators
      $$a^{+}_{st} = \frac{c^{+}_{st}}{\sum_{t'} c^{+}_{st'}}$$
      [Figure: graph over the states A, T, C, G, with counts 816, 902, 1296 and 1776 on the arcs leaving C.]
      ⇒ $a^{+}_{CA} = 816 / (816 + 902 + 1296 + 1776) = 0.17$
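
As a minimal sketch, the estimator amounts to counting adjacent pairs in the training sequences and normalizing each row. The function name `transition_mle` and the toy training sequence are hypothetical; the last lines simply redo the slide's arithmetic for $a^{+}_{CA}$ from the four counts shown.

```python
from collections import Counter

def transition_mle(sequences, alphabet="ACGT"):
    """a[s][t] = c_st / sum over t' of c_st', with c_st counted over adjacent pairs."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    a = {}
    for s in alphabet:
        row_total = sum(counts[(s, t)] for t in alphabet)
        # A symbol that never appears as a predecessor gets an all-zero row here;
        # pseudocounts are the usual remedy.
        a[s] = {t: counts[(s, t)] / row_total if row_total else 0.0 for t in alphabet}
    return a

# Toy training set: the pairs are C->A, A->C and C->G.
a_toy = transition_mle(["CACG"])

# The slide's numbers: 816 is the C -> A count, the other three are the
# remaining counts out of C.
counts_from_c = [816, 902, 1296, 1776]
a_ca = 816 / sum(counts_from_c)
```

The same function would be applied once to the positive examples and once to the negative examples to obtain the two tables needed for the log-odds ratio.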
