Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA " Genome 541 ! RNA secondary structure prediction " Intro to Computational ! Molecular Biology " 3 Approaches to Structure Prediction " Maximum Pairing ! " + works on single sequences ! " + simple ! " - too inaccurate " Minimum Energy ! " + works on single sequences ! " - ignores pseudoknots ! " - only finds “optimal” fold " Partition Function ! " + finds all folds ! " - ignores pseudoknots " 4 Nussinov: ! Approaches, II " A Computation Order " Comparative sequence analysis ! Or energy " + handles all pairings (potentially incl. pseudoknots) ! B(i,j) = # pairs in optimal pairing of r i ... r j " " - requires several (many?) aligned, ! B(i,j) = 0 for all i, j with i � j-4; otherwise " " appropriately diverged sequences " Stochastic Context-free Grammars ! B(i,j) = max of: " Roughly combines min energy & comparative, but K=2 B(i,j-1) " 3 no pseudoknots " 4 5 Physical experiments (x-ray crystalography, NMR) " max { B(i,k-1)+1+B(k+1,j-1) | ! i ! k < j-4 and r k -r j may pair} " Time: O(n 3 )
Day 2 " Computational Problems " How to predict secondary structure " Day 1: ! Many biologically interesting roles for RNA " How to model an RNA “motif” ! (I.e., sequence/structure pattern) " RNA secondary structure prediction " Given a motif, how to search for instances " Today: " Given (unaligned) sequences, find motifs " Covariance Models (CMs) represent ! How to score discovered motifs " RNA sequence/structure motifs " How to leverage prior knowledge " Fast CM search " 8 9 RNA Motif Models " “Covariance Models” (Eddy & Durbin 1994) " Motif Description " aka profile stochastic context-free grammars " aka hidden Markov models on steroids " Model position-specific nucleotide preferences and base-pair preferences " Pro: accurate " Con: model building hard, search slow " 15 What " Main Results " A probabilistic model for RNA families " Very accurate search for tRNA " The “Covariance Model” " (Precursor to tRNAscanSE - current favorite) " " A Stochastic Context-Free Grammar " Given sufficient data, model construction A generalization of a profile HMM " comparable to, but not quite as good as, ! Algorithms for Training " human experts " From aligned or unaligned sequences " Some quantitative info on importance of Automates “comparative analysis” " pseudoknots and other tertiary features " Complements Nusinov/Zucker RNA folding " Algorithms for searching " 16 17
Example: searching for Probabilistic Model Search " tRNAs ! As with HMMs, given a sequence, you calculate likelihood ratio that the model could generate the sequence, vs a background model " You set a score threshold " Anything above threshold # a “hit” " Scoring: " “Forward” / “Inside” algorithm - sum over all paths " Viterbi approximation - find single best path ! (Bonus: alignment & structure prediction) " 18 19 How to model an RNA “Motif”? " How to model an RNA “Motif”? " Add “column pairs” and pair emission probabilities Conceptually, start with a profile HMM: " for base-paired regions " from a multiple alignment, estimate nucleotide/ insert/delete preferences for each position " given a new seq, estimate likelihood that it could be generated by the model, & align it to the model " <<<<<<< >>>>>>> paired columns … … mostly G del ins all G 21 22 Profile Hmm Structure " CM Structure " A: Sequence + structure " B: the CM “guide tree” " C: probabilities of letters/ pairs & of indels " Think of each branch being an HMM emitting both sides of a helix (but M j : " Match states (20 emission probabilities) " 3’ side emitted in I j : " Insert states (Background emission probabilities) " reverse order) " D j : " Delete states (silent - no emission) " 23 29
Overall CM CM Viterbi Alignment ! Architecture " (the “inside” algorithm) " One box (“node”) per node = i th letter of input of guide tree " x i BEG/MATL/INS/DEL just x ij = substring i ,..., j of input like an HMM " T yz = P (transition y " z ) MATP & BIF are the key y E x i , x j = P (emission of x i , x j from state y ) additions: MATP emits pairs of symbols, modeling base- y S ij = max # log P ( x ij gen'd starting in state y via path # ) pairs; BIF allows multiple helices " 30 31 CM Viterbi Alignment ! (the “inside” algorithm) " y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] S ij z & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y right ] y left + S k + 1, j ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 32 34 Covariation is strong evidence for base pairing mRNA leader Mutual Information " f xi , xj " M ij = f xi , xj log 2 ; 0 # M ij # 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing " MI = expected score gain from using a pair state " mRNA leader switch? Finding optimal MI, (i.e. opt pairing of cols) is hard(?) " Finding optimal MI without pseudoknots can be done by dynamic programming " 35 36
M.I. Example (Artificial) " * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U A G G C U U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 only 2 possible mates in 6. M.I. = 1 bit. " C 0 0 4 4 4 4 4 16 0 G 0 16 4 2 4 4 4 0 0 37 40 U 0 0 4 8 4 4 4 0 16 Primary vs Secondary Info " Comparison to TRNASCAN " Fichant & Burks - best heuristic then " evaluation criteria Slightly different 97.5% true positive " 0.37 false positives per MB " CM A1415 (trained on trusted alignment) " > 99.98% true positives " < 0.2 false positives per MB " Current method-of-choice is “tRNAscanSE”, a CM- based scan with heuristic pre-filtering (including disallowing / allowing TRNASCAN?) for performance reasons. " pseudoknots # n & % " ( /2 max j M i , j $ ' i = 1 42 45 tRNAScanSE " Uses 3 older heuristic tRNA finders as An Important Application: ! prefilter " Rfam " Uses CM built as described for final scoring " Actually 3(?) different CMs " " eukaryotic nuclear " " prokaryotic " " organellar " Used in all genome annotation projects " 46
Rfam – an RNA family DB ! Example Rfam Family " Griffiths-Jones, et al., NAR ’03, ’05, ’08 " IRE (partial seed alignment): ! Biggest scientific computing user in Europe - Input (hand-curated): " Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC 1000 cpu cluster for a month per release " MSA “seed alignment” " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Rapidly growing: " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC DB size: Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Rel 1.0, 1/03: 25 families, 55k instances " Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG ~8GB Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Rel 7.0, 3/05: 503 families, 363k instances " Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC Rel 9.0, 7/08: 603 families, 636k instances ! CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Rel 9.1, 1/09: 1372 families, 1148k instances " Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rel 10.0, 1/10: 1446 families, " 3193k instances " ~160GB Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC phylogeny, etc. " SS_cons <<<<<...<<<<<......>>>>>.>>>>> 48 53
Recommend
More recommend