Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties


  1. Guideline Introduction Methods Results
     Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties
     Alexander Schönhuth, Raheleh Salari, Cenk Sahinalp
     Centrum Wiskunde & Informatica, Amsterdam
     Computational Biology Lab, School of Computing Science, Simon Fraser University
     September 8, 2010

  2. Guideline
     Introduction
       • Motivation
       • Affine Gap Cost Alignments
     Methods
       • Indel Statistics: Problem Formulation
       • Solution: Pair HMMs
     Results
       • Data and Parameters
       • Evaluation Strategies


  6. Motivation
     • Inaccuracies in sequence alignments abundant
     • Detrimental effects in downstream analyses
     • [Lunter et al. (2007)]: Accuracy 96% far away from gaps, 56% surrounding gaps
     • [Lunter et al. (2007)]: Gap attraction and gap annihilation
     • [Loeytynoja et al. (2005), Polyanovsky et al. (2008)]: Downward bias in the number of computationally inferred indels
     • Numbers and sizes of indels may serve as indicator of indel quality


  9. Indel Facts
     Evolutionary processes behind insertions and deletions are not comprehensively understood.
     Classical alignment procedures with affine gap penalties:
     • Geometric distribution.
     Truly evolutionary distributions of indel length:
     • Mixtures of exponentials [Qian and Goldstein, 2003], [Pang et al., 2005]
     • Zipfian distribution [Chang and Benner, 2004]
     Moreover, from small-scale studies:
     • Indels often occur in the proteins' loop regions and cause significant structural changes [Fechteler et al., 1995]
     • Indels occur in disease-causing mutational hot spots [Kondrashov et al., 2004]
     • Thanks to the structural changes: novel approaches to antibacterial drug design [Cherkasov et al. 2005, 2006], [Nandan et al. 2007]
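To make the contrast between the two length models concrete: an affine penalty charges a fixed amount per extra gap position, so in a probabilistic reading each extension multiplies the gap probability by a constant factor, which is a geometric law. A Zipfian law decays polynomially instead. The following is a minimal sketch; the parameter values (`p_extend = 0.5`, `exponent = 1.8`, truncation at length 50) and all function names are hypothetical and not taken from the talk.

```python
def geometric_pmf(p_extend, max_len):
    """Truncated geometric pmf: P(l) proportional to p_extend ** (l - 1)."""
    weights = [p_extend ** (l - 1) for l in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def zipf_pmf(exponent, max_len):
    """Truncated Zipfian pmf: P(l) proportional to l ** (-exponent)."""
    weights = [l ** -exponent for l in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def tail(pmf, k):
    """P(length >= k) for a pmf whose first entry belongs to length 1."""
    return sum(pmf[k - 1:])


# Hypothetical parameters, chosen only for illustration (not fitted to data).
geo = geometric_pmf(p_extend=0.5, max_len=50)
zipf = zipf_pmf(exponent=1.8, max_len=50)

for k in (2, 5, 10, 20):
    print(k, tail(geo, k), tail(zipf, k))
```

With these illustrative parameters the Zipfian tail is much heavier for long indels, which is the kind of mismatch between the alignment model and observed indel lengths that the slide points at.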


  11. Affine Gap Cost Alignments
      Smith-Waterman and Needleman-Wunsch
      • Computationally efficient and still most popular.
      • SW and NW alignments are Viterbi paths in pair HMMs.
      • BLAST employs affine gap penalty scoring schemes.
      ☞ Many more advanced approaches to sequence alignment are based on pair HMMs.
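For readers who want to see what an affine gap cost alignment computes, here is a minimal sketch of the global (Needleman-Wunsch style) score using the standard three-matrix recursion usually attributed to Gotoh. The scoring values and the function name are hypothetical; a gap of length g is assumed to cost `gap_open + (g - 1) * gap_extend`. The three matrices play the role of the match and gap states of a pair HMM, and the maximisation corresponds to the Viterbi path.

```python
from math import inf


def affine_nw_score(x, y, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gap costs (three-matrix recursion)."""
    n, m = len(x), len(y)
    # M: last column aligns x[i-1] with y[j-1]
    # X: last column aligns x[i-1] with a gap
    # Y: last column aligns y[j-1] with a gap
    M = [[-inf] * (m + 1) for _ in range(n + 1)]
    X = [[-inf] * (m + 1) for _ in range(n + 1)]
    Y = [[-inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_open, extending an existing one gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_nw_score("ACGT", "ACGT"))
print(affine_nw_score("ACGT", "AT"))
```

The sketch returns only the score; recovering the alignment itself would need a traceback through the three matrices.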

  12. Methods: Problem Formulation

  14. Indel Statistics
      Problem Definition
      • Let $T$ be a pool of protein pairs of interest, and let $L_A(x,y)$, $\mathrm{Sim}_A(x,y)$ and $I_{d,A}(x,y)$ be the length, the similarity and the length of the $d$-th longest indel of the alignment of $(x,y) \in T$, as computed by the affine gap cost algorithm $A$.
      • We are interested in the computation of
        $$P\big(I_{d,A}(x,y) \ge k \mid L_A(x,y) = n,\ \mathrm{Sim}_A(x,y) \in [\sigma_1, \sigma_2]\big)$$
        where $(x,y)$ has been sampled from the pool $T$.

  15. Indel Statistics
      Problem Definition
      Multiple Indel Length Probability Problem
      Input: A sequence pair $x, y$, a pool of sequence pairs $T$ s.t. $(x,y) \in T$, integers $d, k$.
      Output: An estimate of $P\big(I_{d,A}(x,y) \ge k \mid L_A(x,y) = n,\ \mathrm{Sim}_A(x,y) \in [\sigma_1, \sigma_2]\big)$.
      Remark
      • Replacing $I_{d,A}(x,y)$ by $S_A(x,y)$, the score of the $A$-alignment of $x$ and $y$, yields the classical problem of score statistics.
      • Approximate solution (exact for ungapped local alignments): Altschul-Dembo-Karlin statistics [Karlin and Altschul, 1990], [Dembo and Karlin, 1991].
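The quantities in the problem definition can be read off a finished alignment. The sketch below computes the alignment length and the length of the d-th longest indel from two gapped rows. It rests on assumptions that the transcript does not state: an indel is taken to be a maximal run of columns with a gap in either row (matching the single IN state used later), the alignment length is the number of columns, and the d-th longest indel has length 0 when the alignment has fewer than d indels. The similarity measure is left out because the transcript does not define it. All names are hypothetical.

```python
from itertools import groupby


def indel_lengths(row_x, row_y, gap="-"):
    """Lengths of maximal gap runs (gap in either row), longest first."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    is_gap = [a == gap or b == gap for a, b in zip(row_x, row_y)]
    runs = [sum(1 for _ in grp) for flag, grp in groupby(is_gap) if flag]
    return sorted(runs, reverse=True)


def dth_longest_indel(row_x, row_y, d):
    """Length of the d-th longest indel, or 0 if there are fewer than d."""
    lengths = indel_lengths(row_x, row_y)
    return lengths[d - 1] if d <= len(lengths) else 0


# Toy alignment, invented for illustration.
row_x = "AC---GTAC"
row_y = "ACTTTG-AC"
print(len(row_x), indel_lengths(row_x, row_y))
```

The event "the d-th longest indel has length at least k" is then simply `dth_longest_indel(row_x, row_y, d) >= k`.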

  16. Motivation
      Systematically answer questions like: “Are 4 gaps of size at least 6 in an alignment of length 200 and similarity 50 likely to reflect true indels?”

  17. Solution: Pair HMMs

  19. Solution Strategy: Pair HMMs
      Idea: Approximate Viterbi path statistics by generative statistics of a Markov chain.


  23. Solution Strategy: Pair HMMs
      Markov Chain Training
      1: Align all sequence pairs from $T$.
      2: Infer $q_1, q_2, q_3, q_4, q_5, q_6$ by “Viterbi training” the Markov chain with the alignments of $\mathrm{Sim} \in [\sigma_1, \sigma_2]$.
      ☞ Markov chain $M_{\sigma_1,\sigma_2}$.
      $C_{d,k} \subset \{B, M_1, M_2, M_3, \mathrm{IN}, E\}^*$: at least $d$ $\mathrm{IN}$-stretches of length at least $k$
      $A_n \subset \{B, M_1, M_2, M_3, \mathrm{IN}, E\}^*$: alignment region of length $n$
      Example:
      $B\ M_1\ M_1\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ M_2\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ M_3\ E \in C_{2,3} \cap A_{10}$
      $B\ M_1\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ M_2\ \mathrm{IN}\ M_2\ \mathrm{IN}\ M_3\ M_3\ E \in C_{3,1} \cap C_{1,4} \cap A_{11}$
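Viterbi training of a Markov chain amounts to counting transitions along the observed state paths and normalising each row. The sketch below does this for an unconstrained first-order chain. This is a simplification: the chain in the talk is parameterised by six values $q_1, \dots, q_6$ whose placement is given in a figure that the transcript does not reproduce, and the rule that maps alignment columns to $M_1$, $M_2$, $M_3$ is not stated either. The function name is hypothetical, and the two training paths are just the two example paths from the slide, used as toy data.

```python
from collections import Counter, defaultdict


def estimate_transitions(paths):
    """Maximum-likelihood transition probabilities from observed state paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


# The two example paths from the slide, reused here as toy training data.
path_1 = "B M1 M1 IN IN IN M2 IN IN IN M3 E".split()
path_2 = "B M1 IN IN IN IN M2 IN M2 IN M3 M3 E".split()

probs = estimate_transitions([path_1, path_2])
print(probs["IN"])
```

In the setting of the slide, the input paths would come only from alignments whose similarity falls in $[\sigma_1, \sigma_2]$, giving one chain per similarity band.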

  24. Solution Strategy: Pair HMMs
      Computation of Probabilities
      $M = M_{\sigma_1,\sigma_2}$
      1: $n$, $d$, $k$ as from the alignment of $x$ and $y$
      2: Compute $P_M(C_{d,k} \cap A_n)$ as well as $P_M(A_n)$
      3: Output $P_M(C_{d,k} \mid A_n) = \dfrac{P_M(C_{d,k} \cap A_n)}{P_M(A_n)}$
      Markov chain $M_{\sigma_1,\sigma_2}$ for $P(I_d \ge k \mid L = n,\ \mathrm{Sim} \in [\sigma_1, \sigma_2])$.
      $C_{d,k} \subset \{B, M_1, M_2, M_3, \mathrm{IN}, E\}^*$: at least $d$ $\mathrm{IN}$-stretches of length at least $k$
      $A_n \subset \{B, M_1, M_2, M_3, \mathrm{IN}, E\}^*$: alignment region of length $n$
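The three steps can be checked on toy inputs by brute force. The sketch below tests membership in $A_n$ and $C_{d,k}$ and computes $P_M(C_{d,k} \mid A_n)$ by enumerating every path of length $n$, which is only feasible for very small $n$ and is not the method of the talk. Assumptions: IN-stretches are maximal runs of IN, the chain is first order, and the transition table `toy_trans` is invented for illustration.

```python
from itertools import groupby, product

INNER = ("M1", "M2", "M3", "IN")  # states allowed strictly between B and E


def in_A(path, n):
    """True if path is B, then n inner states, then E."""
    return (len(path) == n + 2 and path[0] == "B" and path[-1] == "E"
            and all(s in INNER for s in path[1:-1]))


def in_C(path, d, k):
    """True if path has at least d maximal IN-stretches of length >= k."""
    runs = [len(list(grp)) for state, grp in groupby(path) if state == "IN"]
    return sum(r >= k for r in runs) >= d


def path_prob(path, trans):
    """Probability of a state path under first-order transitions trans[a][b]."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= trans.get(a, {}).get(b, 0.0)
    return p


def conditional_prob(trans, n, d, k):
    """P(C_{d,k} | A_n) by enumerating all 4**n inner paths (toy sizes only)."""
    p_joint = p_len = 0.0
    for inner in product(INNER, repeat=n):
        path = ("B",) + inner + ("E",)
        p = path_prob(path, trans)
        p_len += p
        if in_C(path, d, k):
            p_joint += p
    return p_joint / p_len


# Hypothetical transition table, not the trained chain from the talk.
toy_trans = {
    "B": {"M1": 1.0},
    "M1": {"M1": 0.7, "IN": 0.2, "E": 0.1},
    "IN": {"IN": 0.5, "M1": 0.5},
}
print(conditional_prob(toy_trans, n=3, d=1, k=1))
```

The enumeration visits $4^n$ paths, so it serves only as a reference against which a faster computation can be validated.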


  27. Computation of $P(C_{d,k} \cap A_n)$: Naive Approach
      Let $B_{n,t,k}$ be sequences of the type ($Z \in \{M_1, M_2, M_3, \mathrm{IN}\}$):
      $$B_0\, Z_1 \dots Z_{t-1}\, \mathrm{IN}_t \dots \mathrm{IN}_{t+k-1}\, Z_{t+k} \dots Z_n\, E_{n+1}.$$
      For example ($d = 1$), it holds that
      $$C_{1,k} \cap A_n = \bigcup_{t=1}^{n-k+1} B_{n,t,k}.$$
      However, proceeding by inclusion-exclusion
      $$P(C_{1,k} \cap A_n) = \sum_{m=1}^{n-k+1} (-1)^{m+1} \sum_{1 \le t_1 < \dots < t_m \le n-k+1} P(B_{n,t_1,k} \cap \dots \cap B_{n,t_m,k})$$
      results in computation of
      $$\sum_{m=1}^{n-k+1} \binom{n-k+1}{m} = 2^{n-k+1} - 1$$
      terms of the type $P(B_{n,t_1,k} \cap \dots \cap B_{n,t_m,k})$, hence computation of $P(C_{1,k} \cap A_n)$ would be exponential in $n$.
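The term count can be verified directly for small inputs. The sketch below enumerates the index sets $1 \le t_1 < \dots < t_m \le n-k+1$ that inclusion-exclusion sums over and compares their number with the closed form. Function names are hypothetical.

```python
from itertools import combinations
from math import comb


def count_terms_by_enumeration(n, k):
    """Count all non-empty sets of window start positions t_1 < ... < t_m."""
    starts = range(1, n - k + 2)  # t ranges over 1 .. n - k + 1
    return sum(1 for m in range(1, n - k + 2)
               for _ in combinations(starts, m))


def count_terms_closed_form(n, k):
    """Sum of binomial coefficients, equal to 2 ** (n - k + 1) - 1."""
    return sum(comb(n - k + 1, m) for m in range(1, n - k + 2))


print(count_terms_by_enumeration(10, 3), count_terms_closed_form(10, 3))
```

For the example question from the talk (length 200, gap size at least 6) the closed form gives $2^{195} - 1$ terms, which rules out the naive approach in practice.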
