hmms for pairwise sequence alignment based on ch 4 from
play

HMMs for Pairwise Sequence Alignment based on Ch. 4 from - PowerPoint PPT Presentation

0. HMMs for Pairwise Sequence Alignment based on Ch. 4 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgement: M.Sc. student Oana R at oi [ B. subtilis PRPP synthase ] 1. PLAN 1 How to build a pair HMM for


  1. 0. HMMs for Pairwise Sequence Alignment based on Ch. 4 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgement: M.Sc. student Oana R˘ at ¸oi [ B. subtilis PRPP synthase ]

  2. 1. PLAN 1 How to build a pair HMM for pairwise alignment with affine gap penal- ties ...starting from the FSA for more complex pairwise alignments introduced in Ch. 2. Advantages of using such a full probabilistic model for pairwise alignment: 2 evaluate the similarity of two sequences independently of any specific alignment, by weighting all alternatives, and assess the reliability of the alignment obtained by dynamic program- ming 3 explore suboptimal alignments by probabilistic sampling 4 evaluate the posterior probability that x i is aligned to y j ; define the overall accuracy of an alignment; design an algorithm for getting the maximal overall accuracy alignment

  3. 2. Remember: A simple FSA for global pairwise sequence alignment with affine gap penalties Computing the values of the states I x − e (+1,+0) • Initialisations: s(x ,y ) i j V M (0 , 0) = 0 , V X (0 , 0) = V Y (0 , 0) = 0 − d M • Recurrence relations: s(x ,y )  V M ( i − 1 , j − 1) , (+1,+1) i j  V M ( i, j ) = s ( x i , y j ) + max V X ( i − 1 , j − 1) , − d V Y ( i − 1 , j − 1) ,  � V M ( i − 1 , j ) − d, V X ( i, j ) = max s(x ,y ) I y V X ( i − 1 , j ) − e, i j − e (+0,+1) � V M ( i, j − 1) − d, V Y ( i, j ) = max V Y ( i, j − 1) − e, Example: VLSPAD-K HL--AESK

  4. 3. 1 How to build a Pair HMM? A first, FSA-inspired version δ X X ε − e (+1,+0) q x i s(x ,y ) ε 1− i j δ − d M M s(x ,y ) 1−2 δ i j (+1,+1) p x y i j δ − d 1−2 δ ε 1− s(x ,y ) Y Y i j ε − e (+0,+1) q y j δ

  5. 4. The parameters • Set probabilities for transitions between states – transition from M to an insert state: δ – probability of staying in an insert state: ǫ • Set probabilities for emissions of symbols from the states – p ab for emitting an aligned pair a : b from a M state – q a for emitting symbol a against a gap • Set the initial transitions to each state to be the same as from the M state. Remark: The HMM shown on the previous slide generates pair alignments, except the pair of empty sequences. To get a full probabilistic version: • Define a new state End, and add a new parameter τ for the probability of the transition(s) into the End state.

  6. 5. Pair HMM: a full probabilistic version δ X ε δ q 1−2δ−τ x i τ τ 1−ε−τ τ p M End x y i j 1−2δ−τ 1−ε−τ τ δ Y ε q y j δ In this HMM, the emission of (pairs of) symbols is done when reaching the (non-End) states. To simplify the initialisation relations in the pair HMM algorithms (and also for symmetry with the End state), we will also add a silent state Begin.

  7. 6. Pair HMM: the final version δ X ε 1−2δ−τ δ q x i τ 1−ε−τ Begin τ p M End 1 x y i j 1−2δ−τ 1−ε−τ τ δ Y ε q δ y j τ

  8. 7. How does this pair HMM work? • start in the Begin state; • pick the next state according to the distribution of transi- tion probabilities leaving the current state; • pick a symbol pair to be added to the alignment according to the emission distribution in the new state; • cycle over the previous two steps, while the End state is not reached.

  9. 8. Remark Throughout this chapter and also in the next one (Profile HMMs), the emission probabilities, that have been denoted b ijk in [ Manning & Sch¨ utze, 2000 ] Ch. 9, are considered independent of the origin state i . As a consequence, slightly different definitions (and notations) will be used for forward, backward and Viterbi probabilities, compared to our HMM slides, which were based on [ Manning & Sch¨ utze, 2000 ] . f k ( i ) = P ( x 1 . . . x i , π i = k ) b k ( i ) = P ( x i +1 . . . x L | π i = k ) v k ( i ) = π 1 ...π i P ( x 1 . . . x i , π 1 . . . π i , π i = k ) max See also [ Durbin et al, 1998 ] , pages 56–59.

  10. 9. The Viterbi algorithm for pair HMMs • Notation: v k ( i, j ) = max π P ( x 1 . . . x i , y 1 . . . y j , π ij = k ) will correspond to the most probable alignment up to the positions i, j , ending in state k . Remark: To simplify the equations to be given below, the Begin state is defined as being the M state. • Initialisation: v M (0 , 0) = 1 , and all other v • ( i, 0) = 0 , v • (0 , j ) = 0 (See Remark on page 116 in [ Borodowsky and Ekisheva, 2006 ] .) • Recurrence relations:  (1 − 2 δ − τ ) v M ( i − 1 , j − 1)  v M ( i, j ) = p x i y j max (1 − ǫ − τ ) v X ( i − 1 , j − 1) for i ≥ 1 , j ≥ 1 (1 − ǫ − τ ) v Y ( i − 1 , j − 1)  � δ v M ( i − 1 , j ) v X ( i, j ) = q x i max for i ≥ 1 , j ≥ 0 ǫ v X ( i − 1 , j ) � δ v M ( i, j − 1) v Y ( i, j ) = q y j max for i ≥ 0 , j ≥ 1 ǫ v Y ( i, j − 1) • Termination: v E = τ max ( v M ( n, m ) , v X ( n, m ) , v Y ( n, m ))

  11. 10. An independence model for pairwise sequence alignment 1−η η End Begin η 1−η X Y 1−η η 1 q q x y j i η 1−η • The main states are X and Y, which emit the two sequences in turn, independently of each other. • There is a silent state between X and Y, that doesn’t emit any symbols. • The probability of a pair of sequences x and y : n m n m � � � � P ( x, y | R ) = η (1 − η ) n q x i η (1 − η ) m q y j = η 2 (1 − η ) n + m q x i q y j i =1 j =1 i =1 j =1

  12. 11. A pair HMM for local alignment 1−η 1−η δ X RX 1 RX q i ε q i q i 1−2δ−τ x x 2 x η δ η 1−η 1−η τ End 1−ε−τ M τ η η P η x y j η i 1−ε−τ 1−η 1−2δ−τ 1−η Begin η τ η δ RY Y RY ε q q y j q 1 2 y j δ y j 1−η 1−η τ Note: This model is composed of the global model (states M, X and Y ) flanked by two copies of the independence model (states RX 1 , RY 1 , RX 2 , RY 2 ), because the sequences in the flanking regions are unaligned.

  13. 12. 2 The full probability of aligning x and y , summing over all paths Problem: When two sequences have a weak similarity, it is hard to identify the correct (biologically meaningful) alignment. Alternative: Using the pair HMM forward algorithm, we can calculate the proba- bility that a given pair of sequences are related by any alignment. � P ( x, y ) = P ( x, y, π ) alignments π Note: If there is an unambiguous best/Viterbi alignment π ⋆ , almost all of the probability P ( x, y ) will be contributed by P ( x, y, π ⋆ ) . But when there are many comparable alternative alignments or alterna- tive variations, the probability P ( x, y ) can be significantly different from P ( x, y, π ⋆ ) .

  14. 13. Algorithm: Forward calculation for pair HMM • Notation: f k ( i, j ) is the combined probability of all alignments up to the positions i, j , ending in state k : f k ( i, j ) = P ( x 1 . . . x i , y 1 . . . y j , π ij = k ) . • Initialisation: f M (0 , 0) = 1 and all other f • ( i, 0) = 0 , f • (0 , j ) = 0 • Recursion relations: f M ( i, j ) = p x i y j [(1 − 2 δ − τ ) f M ( i − 1 , j − 1) + (1 − ǫ − τ )( f X ( i − 1 , j − 1) + f Y ( i − 1 , j − 1))] , for i ≥ 1 , j ≥ 1 f X ( i, j ) = q x i [ δf M ( i − 1 , j ) + ǫf X ( i − 1 , j )] , for i ≥ 1 , j ≥ 0 f Y ( i, j ) = q y j [ δf M ( i, j − 1) + ǫf Y ( i, j − 1)] , for i ≥ 0 , j ≥ 1 • Termination: f E ( n, m ) = τ ( f M ( n, m ) + f X ( n, m ) + f Y ( n, m ))

  15. 14. Note The log-odds ratio of the full probability P ( x, y ) = f E ( n, m ) to the independence model probability is a measure of the likelihood that the two sequences are related to each other by some unspecified alignment, as opposed to being unrelated.

  16. 15. The posterior probability P ( π | x, y ) over alignments π , given x and y P ( π | x, y ) = P ( x, y, π ) P ( x, y ) In particular, for the Viterbi path: P ( π ⋆ | x, y ) = P ( x, y, π ⋆ ) = v E ( n, m ) P ( x, y ) f E ( n, m ) Note: As already mentioned, this probability can be very small. For example, for the alignment of two highly diverged proteins shown below — the human alpha globin and the leghaemoglobin from yellow lupin —, this probability is 4 . 6 × 10 − 6 . HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

  17. 16. 3 Suboptimal alignment by probabilistic sampling Idea: [How to generate a sample alignment] While tracing back through the matrix of f k ( i, j ) values instead of taking the highest choice at each step, make a probabilistic choice based on the relative strengths of the three components.

Recommend


More recommend