RNA sequence analysis using covariance models


1. “RNA sequence analysis using covariance models,” Eddy & Durbin, Nucleic Acids Research, 1994, vol. 22, no. 11, 2079–2088

2. What
   • A probabilistic model for RNA families
     – The “Covariance Model” (CM)
     – ≈ a stochastic context-free grammar
     – A generalization of a profile HMM
   • Algorithms for training
     – From aligned or unaligned sequences
     – Automates “comparative analysis”
     – Complements Nussinov/Zuker RNA folding
   • Algorithms for searching

3. Main Results
   • Very accurate search for tRNA (the precursor to tRNAscan-SE, the current favorite)
   • Given sufficient data, model construction is comparable to, but not quite as good as, human experts
   • Some quantitative info on the importance of pseudoknots and other tertiary features

4. Probabilistic Model Search
   • As with HMMs, given a sequence, you calculate the likelihood ratio that the model, vs. a background model, could generate the sequence
   • You set a score threshold
   • Anything above threshold → a “hit”
   • Scoring:
     – “Forward” / “Inside” algorithm: sum over all paths
     – Viterbi approximation: find the single best path (bonus: alignment & structure prediction)
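To make the search loop concrete, here is a minimal sketch of log-odds hit calling, assuming a toy 0th-order “family” composition model scored over sliding windows. MODEL, BACKGROUND, and scan are invented for illustration; a real CM also scores secondary structure and alignment.

```python
import math

# Hypothetical 0th-order "family" composition vs. a uniform background.
# This only illustrates the likelihood-ratio-vs-threshold mechanics.
MODEL = {'A': 0.15, 'C': 0.35, 'G': 0.35, 'U': 0.15}
BACKGROUND = {b: 0.25 for b in "ACGU"}

def log_odds(window):
    """log2 likelihood ratio: P(window | model) / P(window | background)."""
    return sum(math.log2(MODEL[b] / BACKGROUND[b]) for b in window)

def scan(seq, w, threshold):
    """Slide a width-w window; anything scoring above threshold is a 'hit'."""
    scores = ((i, log_odds(seq[i:i + w])) for i in range(len(seq) - w + 1))
    return [(i, s) for i, s in scores if s >= threshold]

print(scan("AUAUGCGCGCCGCGAUAU", 8, 2.0))   # GC-rich windows score as hits
```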

  5. Example: searching for tRNAs

  6. Alignment Quality

7. Comparison to tRNAscan
   • Fichant & Burks: the best heuristic at the time (slightly different evaluation criteria)
     – 97.5% true positives
     – 0.37 false positives per MB
   • CM A1415 (trained on a trusted alignment)
     – > 99.98% true positives
     – < 0.2 false positives per MB
   • The current method of choice is tRNAscan-SE, a CM-based scan with heuristic pre-filtering (including tRNAscan?) for performance reasons

8. Profile HMM Structure
   • M_j: match states (20 emission probabilities)
   • I_j: insert states (background emission probabilities)
   • D_j: delete states (silent; no emission)
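As a concrete data layout, one node of a profile HMM might look like the sketch below. The class and its numbers are hypothetical, and a 4-letter RNA alphabet is used in place of the 20-letter protein case the slide's match states refer to.

```python
from dataclasses import dataclass, field
from typing import Dict

ALPHABET = "ACGU"   # the slide's "20 emission probabilities" is the protein case

@dataclass
class ProfileHMMNode:
    """One column of a profile HMM: match (M_j), insert (I_j), delete (D_j)."""
    match_emit: Dict[str, float]       # M_j: learned per-letter emission probs
    insert_emit: Dict[str, float]      # I_j: background emission probs
    transitions: Dict[str, float] = field(default_factory=dict)
    # D_j is silent: no emission table, only transitions in and out.

node = ProfileHMMNode(
    match_emit={'A': 0.7, 'C': 0.1, 'G': 0.1, 'U': 0.1},
    insert_emit={b: 0.25 for b in ALPHABET},
    transitions={'M->M': 0.9, 'M->I': 0.05, 'M->D': 0.05},
)
```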

9. CM Structure
   • A: sequence + structure
   • B: the CM “guide tree”
   • C: probabilities of letters/pairs & of indels
   • Think of each branch as an HMM emitting both sides of a helix (but with the 3′ side emitted in reverse order)

10. Overall CM Architecture
   • One box (“node”) per node of the guide tree
   • BEG/MATL/INS/DEL: just like an HMM
   • MATP & BIF are the key additions: MATP emits pairs of symbols, modeling base pairs; BIF allows multiple helices

11. CM Viterbi Alignment
   • $x_i$: the $i$th letter of the input
   • $x_{ij}$: the substring $i, \ldots, j$ of the input
   • $T_{yz} = P(\text{transition } y \to z)$
   • $E^y_{x_i, x_j} = P(\text{emission of } x_i, x_j \text{ from state } y)$
   • $S^y_{ij} = \max_\pi \log P(x_{ij} \text{ generated starting in state } y \text{ via path } \pi)$

12. $S^y_{ij} = \max_\pi \log P(x_{ij} \text{ generated starting in state } y \text{ via path } \pi)$

$$
S^y_{ij} = \max
\begin{cases}
\max_z \left[ S^z_{i+1,j-1} + \log T_{yz} + \log E^y_{x_i,x_j} \right] & \text{match pair} \\
\max_z \left[ S^z_{i+1,j} + \log T_{yz} + \log E^y_{x_i} \right] & \text{match/insert left} \\
\max_z \left[ S^z_{i,j-1} + \log T_{yz} + \log E^y_{x_j} \right] & \text{match/insert right} \\
\max_z \left[ S^z_{i,j} + \log T_{yz} \right] & \text{delete} \\
\max_{i<k\le j} \left[ S^{y_{\mathrm{left}}}_{i,k} + S^{y_{\mathrm{right}}}_{k+1,j} \right] & \text{bifurcation}
\end{cases}
$$
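The recurrence translates almost line-for-line into code. Below is a minimal sketch on an invented three-state grammar (a pair-emitting stem state, a left-emitting loop state, and an end state); the right-emit, delete, and bifurcation cases follow the same pattern and are omitted, and all probabilities are made up.

```python
import math
from functools import lru_cache

# Toy grammar: 'S' pair-emits stem bases, 'H' left-emits loop bases,
# 'E' ends. 'type' P/L/E = pair-emit / left-emit / silent end.
STATES = {
    'S': {'type': 'P', 'next': {'S': math.log(0.5), 'H': math.log(0.5)},
          'emit': lambda a, b: math.log(0.24 if a + b in ("AU", "UA", "CG", "GC")
                                        else 0.01)},
    'H': {'type': 'L', 'next': {'H': math.log(0.7), 'E': math.log(0.3)},
          'emit': lambda a: math.log(0.25)},
    'E': {'type': 'E'},
}
NEG_INF = float("-inf")

def best_score(x):
    @lru_cache(maxsize=None)
    def S(i, j, y):
        """S^y_{ij}: max log-prob of generating x[i..j] starting in state y."""
        st = STATES[y]
        nxt = st.get('next', {})
        if st['type'] == 'E':                # end state generates only ""
            return 0.0 if i > j else NEG_INF
        if st['type'] == 'P':                # match pair: consume x[i] and x[j]
            if i >= j:
                return NEG_INF
            return max(S(i + 1, j - 1, z) + t + st['emit'](x[i], x[j])
                       for z, t in nxt.items())
        if i > j:                            # left emit needs x[i] to exist
            return NEG_INF
        return max(S(i + 1, j, z) + t + st['emit'](x[i]) for z, t in nxt.items())
    return S(0, len(x) - 1, 'S')

print(best_score("GCAAAGC"))   # best parse pairs G-C and C-G, loops out AAA
```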

  13. Model Training

14. Mutual Information

$$
M_{ij} = \sum_{x_i, x_j} f_{x_i, x_j} \log_2 \frac{f_{x_i, x_j}}{f_{x_i}\, f_{x_j}}\,, \qquad 0 \le M_{ij} \le 2
$$

   • Max when there is no sequence conservation but perfect pairing
   • MI = expected score gain from using a pair state
   • Finding the optimal MI (i.e., the optimal pairing of columns) is NP-hard(?)
   • Finding the optimal MI without pseudoknots can be done by dynamic programming

15. M.I. Example

   Toy alignment (columns 1–9), with per-column base counts:

      A G A U A A U C U          counts   1  2  3  4  5  6  7  8  9
      A G A U C A U C U             A:   16  0  4  2  4  4  4  0  0
      A G A C G U U C U             C:    0  0  4  4  4  4  4 16  0
      A G A U U U U C U             G:    0 16  4  2  4  4  4  0  0
      A G C C A G G C U             U:    0  0  4  8  4  4  4  0 16
      A G C G C G G C U
      A G C U G C G C U
      A G C A U C G C U
      A G G U A G C C U
      A G G G C G C C U
      A G G U G U C C U
      A G G C U U C C U
      A G U A A A A C U
      A G U C C A A C U
      A G U U G C A C U
      A G U U U C A C U

   Pairwise MI (bits), nonzero entries: M(3,7) = 2, M(3,6) = M(5,6) = M(6,7) = 1,
   M(4,6) = 0.55, M(4,5) = 0.42, M(3,4) = M(4,7) = 0.30; every pair involving
   columns 1, 2, 8, or 9 has MI 0.

   • Cols 1 & 9, 2 & 8: perfect conservation; they might be base-paired, but it is unclear whether they are. M.I. = 0
   • Cols 3 & 7: completely unconserved, but always W-C pairs, so it seems likely that they do base-pair. M.I. = 2 bits
   • Cols 6 & 7: unconserved, but each letter in column 7 has only 2 possible mates in column 6. M.I. = 1 bit
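The M_ij values above can be reproduced with a few lines of Python: count column frequencies on the slide's toy alignment and apply the formula from slide 14. This is a sketch for this example only, with no gap handling or sequence weighting.

```python
import math
from collections import Counter

ALN = ["AGAUAAUCU", "AGAUCAUCU", "AGACGUUCU", "AGAUUUUCU",
       "AGCCAGGCU", "AGCGCGGCU", "AGCUGCGCU", "AGCAUCGCU",
       "AGGUAGCCU", "AGGGCGCCU", "AGGUGUCCU", "AGGCUUCCU",
       "AGUAAAACU", "AGUCCAACU", "AGUUGCACU", "AGUUUCACU"]

def mi(i, j, aln=ALN):
    """M_ij = sum over (a, b) of f_ab * log2(f_ab / (f_a * f_b)), in bits."""
    n = len(aln)
    fi = Counter(s[i] for s in aln)               # marginal of column i
    fj = Counter(s[j] for s in aln)               # marginal of column j
    fij = Counter((s[i], s[j]) for s in aln)      # joint of columns i and j
    return sum((c / n) * math.log2((c / n) / (fi[a] * fj[b] / n ** 2))
               for (a, b), c in fij.items())

print(mi(0, 8))   # cols 1 & 9 (0-based): perfectly conserved, M.I. = 0
print(mi(2, 6))   # cols 3 & 7: unconserved but always W-C paired, M.I. = 2
print(mi(5, 6))   # cols 6 & 7: two possible mates per letter, M.I. = 1
```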

16. MI-Based Structure Learning
   • Find the best (max total MI) subset of column pairs among columns i, …, j, subject to the absence of pseudoknots (see the sketch below):

$$
S_{i,j} = \max
\begin{cases}
S_{i+1,j} \\
S_{i,j-1} \\
S_{i+1,j-1} + M_{i,j} \\
\max_{i<k<j} \left[ S_{i,k} + S_{k+1,j} \right]
\end{cases}
$$

   • “Just like Nussinov/Zuker folding”
   • BUT: needs enough data, i.e., enough sequences at the right phylogenetic distance
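Continuing the previous sketch (it reuses mi and ALN), here is that Nussinov-style DP in plain Python. A real implementation would threshold small MI values so that near-zero covariation never forces a pair; this version is just the recurrence.

```python
def max_mi_pairing(M, n):
    """S[i][j] = max total MI over nested (pseudoknot-free) pairings of i..j."""
    S = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            S[i][j] = max(
                S[i + 1][j],                          # column i unpaired
                S[i][j - 1],                          # column j unpaired
                S[i + 1][j - 1] + M[i][j],            # pair column i with j
                max((S[i][k] + S[k + 1][j] for k in range(i + 1, j)),
                    default=0.0),                     # bifurcation
            )
    return S[0][n - 1]

n = len(ALN[0])
M = [[mi(i, j) if i != j else 0.0 for j in range(n)] for i in range(n)]
print(max_mi_pairing(M, n))   # 3.0 bits: cols 3 & 7 (2 bits) + cols 5 & 6 (1 bit)
```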

17. Pseudoknots
   [Figure: nested pairings (allowed) vs. crossing pairings, i.e. pseudoknots (disallowed)]

$$
\sum_{i=1}^{n/2} \max_j M_{i,j}
$$
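Reading the slide's formula as an upper bound on total MI when pairing is unconstrained (pseudoknots allowed) is an assumption here; the deck gives no further context. Under that reading, a quick check on the toy alignment reuses mi and ALN from the earlier sketch: give each column its best partner and halve, since each pair is counted from both ends. This is an illustrative approximation, not the paper's exact computation.

```python
# Assumed reading: bound on total MI achievable if pseudoknots were allowed.
# Reuses mi() and ALN from the sketch after slide 15.
n = len(ALN[0])
best_partner = [max(mi(i, j) for j in range(n) if j != i) for i in range(n)]
print(sum(best_partner) / 2)   # ~3.3 bits, vs. 3.0 for the best nested pairing
```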

18. Rfam: an RNA family DB (Griffiths-Jones et al., NAR ’03, ’05)
   • The biggest scientific computing user in Europe: a 1000-CPU cluster for a month per release
   • Rapidly growing:
     – Rel 1.0, 1/03: 25 families, 55k instances
     – Rel 7.0, 3/05: 503 families, >300k instances

19. Rfam
   • Input (hand-curated):
     – MSA “seed alignment”
     – SS_cons (consensus secondary structure)
     – Score threshold T
     – Window length W
   • Output:
     – CM
     – Scan results & “full alignment”

   IRE (partial seed alignment):

      Hom.sap.  GUUCCUGCUUCAACAGUGUUUGGAUGGAAC
      Hom.sap.  UUUCUUC.UUCAACAGUGUUUGGAUGGAAC
      Hom.sap.  UUUCCUGUUUCAACAGUGCUUGGA.GGAAC
      Hom.sap.  UUUAUC..AGUGACAGAGUUCACU.AUAAA
      Hom.sap.  UCUCUUGCUUCAACAGUGUUUGGAUGGAAC
      Hom.sap.  AUUAUC..GGGAACAGUGUUUCCC.AUAAU
      Hom.sap.  UCUUGC..UUCAACAGUGUUUGGACGGAAG
      Hom.sap.  UGUAUC..GGAGACAGUGAUCUCC.AUAUG
      Hom.sap.  AUUAUC..GGAAGCAGUGCCUUCC.AUAAU
      Cav.por.  UCUCCUGCUUCAACAGUGCUUGGACGGAGC
      Mus.mus.  UAUAUC..GGAGACAGUGAUCUCC.AUAUG
      Mus.mus.  UUUCCUGCUUCAACAGUGCUUGAACGGAAC
      Mus.mus.  GUACUUGCUUCAACAGUGUUUGAACGGAAC
      Rat.nor.  UAUAUC..GGAGACAGUGACCUCC.AUAUG
      Rat.nor.  UAUCUUGCUUCAACAGUGUUUGGACGGAAC
      SS_cons   <<<<<...<<<<<......>>>>>.>>>>>

20. Rfam: Key Issues
   • Overly narrow families
   • Variant structures / unstructured RNAs
   • Spliced RNAs
   • RNA pseudogenes
     – Human Alu is SRP-related, with 1.1 million copies
     – Mouse B2 repeat (350k copies) is tRNA-related
   • Speed & sensitivity
   • Motif discovery

21. Faster Genome Annotation of Non-coding RNAs Without Loss of Accuracy
    Zasha Weinberg & W. L. Ruzzo, RECOMB ’04, ISMB ’04

22. Covariance Model
   • Key difference of a CM vs. an HMM: pair states emit paired symbols, corresponding to base-paired nucleotides; 16 emission probabilities here.

23. CMs Are Good, but Slow
    Rfam reality:  EMBL → BLAST (junk discarded) → CM → hits    (1 month, 1000 computers)
    Our work:      EMBL → rigorous filter (junk discarded) → CM → hits    (~2 months, 1000 computers)
    Rfam goal:     EMBL → CM → hits    (10 years, 1000 computers)

24. Oversimplified CM (for pedagogical purposes only)
    [Figure: a toy CM whose pair states emit a left symbol from {A, C, G, U, –} and a right symbol from {A, C, G, U, –}]

25. CM to HMM
    [Figure: each CM pair state, with 5 × 5 = 25 emissions over {A, C, G, U, –} × {A, C, G, U, –}, is split into two HMM states, one emitting the left symbol and one the right]
    • CM: 25 emissions per state
    • HMM: 5 emissions per state, 2× the states

26. Key Issue: 25 Scores → 10
    [Figure: one CM pair state vs. its two corresponding HMM states]
    • Need: log Viterbi scores, CM ≤ HMM

27. Viterbi/Forward Scoring
   • A path π defines transitions/emissions
   • Score(π) = product of the “probabilities” on π
   • NB: it is OK if the “probs” aren’t probabilities, e.g. ∑ ≠ 1 (e.g., in a CM, emissions are odds ratios vs. a 0th-order background)
   • For any nucleotide sequence x:
     – Viterbi-score(x) = max{ score(π) | π emits x }
     – Forward-score(x) = ∑{ score(π) | π emits x }
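A brute-force toy illustration of the two scores, with made-up numbers: a two-state model whose emissions are odds ratios (so they need not sum to 1, exactly the situation the slide allows), scored over every state path for a short sequence.

```python
from itertools import product

# Made-up 2-state model; emission rows don't sum to 1 (odds ratios).
EMIT = {'M': {'A': 2.0, 'C': 0.5}, 'I': {'A': 1.0, 'C': 1.0}}
TRANS = {('M', 'M'): 0.8, ('M', 'I'): 0.2, ('I', 'M'): 0.5, ('I', 'I'): 0.5}

def path_score(path, x):
    """Product of the 'probabilities' (transitions, emissions) along path."""
    score = EMIT[path[0]][x[0]]
    for k in range(1, len(x)):
        score *= TRANS[(path[k - 1], path[k])] * EMIT[path[k]][x[k]]
    return score

x = "ACA"
scores = [path_score(p, x) for p in product("MI", repeat=len(x))]
print("Viterbi:", max(scores))   # single best path
print("Forward:", sum(scores))   # sum over all paths
```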

28. Key Issue: 25 Scores → 10
    [Figure: CM pair state vs. HMM left (L) and right (R) states. NB: the HMM is not a probabilistic model]
    • Need: log Viterbi scores, CM ≤ HMM; i.e., for every pair emission:
      P_AA ≤ L_A + R_A      P_CA ≤ L_C + R_A      …
      P_AC ≤ L_A + R_C      P_CC ≤ L_C + R_C      …
      P_AG ≤ L_A + R_G      P_CG ≤ L_C + R_G      …
      P_AU ≤ L_A + R_U      P_CU ≤ L_C + R_U      …
      P_A– ≤ L_A + R_–      P_C– ≤ L_C + R_–      …

29. Rigorous Filtering
    • Any scores satisfying the linear inequalities (P_AA ≤ L_A + R_A, P_AC ≤ L_A + R_C, P_AG ≤ L_A + R_G, P_AU ≤ L_A + R_U, P_A– ≤ L_A + R_–, …) give rigorous filtering
    • Proof: CM Viterbi path score ≤ “corresponding” HMM path score ≤ Viterbi HMM path score (even if the HMM’s Viterbi path does not correspond to any CM path)

30. Some Scores Filter Better
    • Constraints (assuming A, C, G, U each ≈ 25%):
      P_UA = 1 ≤ L_U + R_A
      P_UG = 4 ≤ L_U + R_G
    • Option 1: L_U = R_A = R_G = 2, so expected score L_U + (R_A + R_G)/2 = 4
    • Option 2: L_U = 0, R_A = 1, R_G = 4, so expected score L_U + (R_A + R_G)/2 = 2.5
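This toy example can be posed as a linear program: minimize the expected HMM emission score subject to the rigorous-filtering inequalities. The sketch below uses scipy.optimize.linprog and, like the slide, assumes a uniform background so the objective is L_U + (R_A + R_G)/2; the non-negativity bounds are an added simplification, and the paper's real objective (slide 31's expected Forward score over the whole model) is more involved than this.

```python
from scipy.optimize import linprog

# Variables v = [L_U, R_A, R_G].
# Objective: E = L_U + R_A/2 + R_G/2 (uniform background over A and G here).
c = [1.0, 0.5, 0.5]
# Filtering constraints rewritten for linprog's A_ub @ v <= b_ub form:
#   P_UA = 1 <= L_U + R_A   ->   -L_U - R_A <= -1
#   P_UG = 4 <= L_U + R_G   ->   -L_U - R_G <= -4
A_ub = [[-1, -1, 0],
        [-1, 0, -1]]
b_ub = [-1.0, -4.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, res.fun)   # an optimum such as L_U=0, R_A=1, R_G=4, with E = 2.5
```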

31. Optimizing Filtering
   • For any nucleotide sequence x:
     – Viterbi-score(x) = max{ score(π) | π emits x }
     – Forward-score(x) = ∑{ score(π) | π emits x }
   • Expected Forward score: E(L_i, R_i) = ∑_{all sequences x} Forward-score(x) · Pr(x), under a 0th-order background model
     – NB: E is a function of the L_i, R_i only
   • Optimization: minimize E(L_i, R_i) subject to the score linear inequalities
     – The objective is a heuristic (“lower Forward score ⇒ lower Viterbi score ⇒ fewer filter hits”)
     – But the filter remains rigorous, because of the “subject to the score L.I.s” constraints
