rna search and motif discovery
play

RNA Search and Motif Discovery CSEP 527 Computational Biology - PowerPoint PPT Presentation

RNA Search and Motif Discovery CSEP 527 Computational Biology Previous Lecture Many biologically interesting roles for RNA RNA secondary structure prediction 1 2 Approaches to Structure Prediction Maximum Pairing + works on


  1. RNA Search and � Motif Discovery CSEP 527 � Computational Biology

  2. Previous Lecture � Many biologically interesting roles for RNA RNA secondary structure prediction 1

  3. 2

  4. Approaches to Structure Prediction Maximum Pairing � + works on single sequences � + simple � - too inaccurate Minimum Energy � + works on single sequences � - ignores pseudoknots � - only finds “optimal” fold Partition Function � + finds all folds � - ignores pseudoknots 3

  5. “Optimal pairing of r i ... r j ” � Two possibilities i j Unpaired: � Find best pairing of r i ... r j-1 j j-1 j Paired (with some k): � i Find best r i ... r k-1 + � k-1 best r k+1 ... r j-1 plus 1 k j Why is it slow? � k+1 j-1 Why do pseudoknots matter? 4

  6. Nussinov: � A Computation Order Or energy B(i,j) = # pairs in optimal pairing of r i ... r j B(i,j) = 0 for all i, j with i ≥ j-4; otherwise B(i,j) = max of: K=2 B(i,j-1) 3 4 5 max { B(i,k-1)+1+B(k+1,j-1) | � i ≤ k < j-4 and r k -r j may pair} Time: O(n 3 ) Loop-based energy version is better; recurrences similar, slightly messier 5

  7. Loop -based Energy Minimization 1 Detailed experiments show it’s � more accurate to model based � 2 on loops , rather than just pairs 3 Loop types 1. Hairpin loop 2. Stack 4 3. Bulge 4. Interior loop 5. Multiloop 5 6

  8. Zuker: Loop-based Energy, I W(i,j) = energy of optimal pairing of r i ... r j V(i,j) = as above, but forcing (i.e., subset with) pair i•j W(i,j) = V(i,j) = ∞ for all i, j with i ≥ j-4 W(i,j) = min( W(i,j-1), � min { W(i,k-1)+V(k,j) | i ≤ k < j-4 } ) 7

  9. Zuker: Loop-based Energy, II bulge/ multi- hairpin stack interior loop V(i,j) = min(eh(i,j), es(i,j)+V(i+1,j-1), VBI(i,j), VM(i,j)) VM(i,j) = min { W(i,k)+W(k+1,j) | i < k < j } VBI(i,j) = min { ebi(i,j,i’,j’) + V(i’, j’) | � i < i’ < j’ < j & i’-i+j-j’ > 2 } Time: O(n 4 ) bulge/ interior O(n 3 ) possible if ebi(.) is “nice” 8

  10. Single Seq Prediction Accuracy Mfold, Vienna,... [Nussinov, Zuker, Hofacker, McCaskill] Estimates suggest ~50-75% of base pairs predicted correctly in sequences of up to ~300nt Definitely useful, but obviously imperfect 9

  11. Approaches, II Comparative sequence analysis � + handles all pairings (potentially incl. pseudoknots) � - requires several (many?) aligned, � Today appropriately diverged sequences Stochastic Context-free Grammars � Roughly combines min energy & comparative, but no pseudoknots Physical experiments (x-ray crystalography, NMR) 10

  12. Covariation is strong evidence for base pairing 11

  13. A L19 ( rplS ) mRNA leader P1 TSS Excess L19 represses L19 (RF00556; 555-559 similar) P2 -35 -10 RBS Start Example: Ribosomal Autoregulation: C B B. subtilis L19 mRNA leader nucleotide nucleotide UU U C identity present UU U A G U U A U U A G G N 97% 97% G G A U G U N 90% 90% G U G A C G ? C G N 75% 75% G C G L19 P2 C G G C U 50% 3' U G C U U G C G Y C A A G A G stem loop U G C G U A A C C always present U A A C G U U G U C G U C G G A G C C G C C G C G compensatory mutations G U A U A U A 3' A compatible mutations R U A U G G A R Y G P1 G C G U R G A U A G C Watson-Crick base pair U G G R U G A U A U U 5' 3' G Y U U G A other base interaction 5' C G C U G 12 G U A G 5' C G AAAA C

  14. Mutual Information f xi , xj ∑ M ij = f xi , xj log 2 ; 0 ≤ M ij ≤ 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing MI = expected score gain from using a pair state (below) Finding optimal MI, (i.e. opt pairing of cols) is hard(?) Finding optimal MI without pseudoknots can be done by dynamic programming 13

  15. M.I. Example (Artificial) * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G G C U U C C U A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 C 0 0 4 4 4 4 4 16 0 only 2 possible mates in 6. M.I. = 1 bit. G 0 16 4 2 4 4 4 0 0 U 0 0 4 8 4 4 4 0 16 14

  16. 15

  17. MI-Based Structure-Learning � Problem: Find best (max total MI) pseudo-knot-free subset of column pairs among i…j. Solution: “Just like Nussinov/Zucker folding” $ S i , j = max S i , j − 1 j unpaired % max i ≤ k < j − 4 S i , k − 1 + M k , j + S k + 1, j − 1 j paired & BUT, need the right data—enough sequences at the right phylogenetic distance 16

  18. Computational Problems How to predict secondary structure How to model an RNA “motif” � (I.e., sequence/structure pattern) Given a motif, how to search for instances Given (unaligned) sequences, find motifs How to score discovered motifs How to leverage prior knowledge 17

  19. Motif Description 18

  20. RNA Motif Models “Covariance Models” (Eddy & Durbin 1994) aka profile stochastic context-free grammars aka hidden Markov models on steroids Model position-specific nucleotide preferences and base-pair preferences Pro: accurate Con: model building hard, search slow 19

  21. Eddy & Durbin 1994: What A probabilistic model for RNA families The “Covariance Model” ≈ A Stochastic Context-Free Grammar A generalization of a profile HMM Algorithms for Training From aligned or unaligned sequences Automates “comparative analysis” Complements Nusinov/Zucker RNA folding Algorithms for searching 20

  22. Main Results Very accurate search for tRNA (Precursor to tRNAscanSE – a very good tRNA-finder) Given sufficient data, model construction comparable to, but not quite as good as, � human experts Some quantitative info on importance of pseudoknots and other tertiary features 21

  23. Probabilistic Model Search As with HMMs, given a sequence: You calculate likelihood ratio that the model could generate the sequence, vs a background model You set a score threshold Anything above threshold → a “hit” Scoring: “Forward” / “Inside” algorithm - sum over all paths Viterbi approximation - find single best path � (Bonus: alignment & structure prediction) 22

  24. Example: searching for tRNAs � 23

  25. Profile Hmm Structure M j : Match states (20 emission probabilities) I j : Insert states (Background emission probabilities) D j : Delete states (silent - no emission) 24

  26. How to model an RNA “Motif”? Conceptually, start with a profile HMM: from a multiple alignment, estimate nucleotide/ insert/delete preferences for each position given a new seq, estimate likelihood that it could be generated by the model, & align it to the model ins mostly G del all G 25

  27. How to model an RNA “Motif”? Add “column pairs” and pair emission probabilities for base-paired regions <<<<<<< >>>>>>> paired columns … … 26

  28. Profile Hmm Structure M j : Match states (20 emission probabilities) I j : Insert states (Background emission probabilities) D j : Delete states (silent - no emission) 27

  29. CM Structure A: Sequence + structure B: the CM “guide tree” C: probabilities of letters/ pairs & of indels Think of each branch being an HMM emitting both sides of a helix (but 3’ side emitted in reverse order) 28

  30. CM Viterbi Alignment � (the “inside” algorithm) = i th letter of input x i x ij = substring i ,..., j of input T yz = P (transition y → z ) y E x i , x j = P (emission of x i , x j from state y ) y S ij = max π log P ( x ij gen'd starting in state y via path π ) 29

  31. CM Viterbi Alignment � (the “inside” algorithm) y = max π log P ( x ij generated starting in state y via path π ) S ij % z y max z [ S i + 1, j − 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] z S ij & max z [ S i , j − 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y right ] y left + S k + 1, j ' max i < k ≤ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 30

  32. Primary vs Secondary Info disallowing / allowing pseudoknots ( ) / 2 n ∑ max j M i,j i = 1 31

  33. An Important Application: � Rfam A Database of RNA Families 32

Recommend


More recommend