Outline � RNA Search and � Whirlwind tour of ncRNA search & discovery � Motif Discovery � RNA motif description (Covariance Model Review) � Algorithms for searching � Rigorous & heuristic filtering � Motif discovery � Lecture 9 � Applications � CSEP 590A � Autumn 2008 � 2 RNA Motif Models � “Covariance Models” (Eddy & Durbin 1994) � aka profile stochastic context-free grammars � Motif Description � aka hidden Markov models on steroids � Model position-specific nucleotide preferences and base-pair preferences � Pro: accurate � Con: model building hard, search sloooow � 7 8
Profile Hmm Structure � Example: searching for tRNAs M j : � Match states (20 emission probabilities) � I j : � Insert states (Background emission probabilities) � D j : � Delete states (silent - no emission) � 13 17 CM Structure � CM Viterbi Alignment � A: Sequence + structure � = i th letter of input x i B: the CM “guide tree” � x ij = substring i ,..., j of input C: probabilities of T yz = P (transition y � z ) letters/ pairs & of indels � y E x i , x j = P (emission of x i , x j from state y ) Think of each branch being an HMM emitting y S ij = max � log P ( x ij gen'd starting in state y via path � ) both sides of a helix (but 3’ side emitted in reverse order) � 18 20
mRNA leader y = max � log P ( x ij generated starting in state y via path � ) S ij � z y max z [ S i + 1, j � 1 + log T yz + log E x i , x j ] match pair � y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left � � y = y ] z S ij � max z [ S i , j � 1 + log T yz + log E x j match/insert right mRNA leader switch? � z max z [ S i , j + log T yz ] delete � y left + S k + 1, j y right ] � max i < k � j [ S i , k bifurcation � Time O(qn 3 ), q states, seq len n 21 23 Mutual Information � f xi , xj � M ij = f xi , xj log 2 ; 0 � M ij � 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing � MI = expected score gain from using a pair state � Finding optimal MI, (i.e. opt pairing of cols) is hard(?) � Finding optimal MI without pseudoknots can be done by dynamic programming � 24 26
Pseudoknots � � n disallowed allowed � � � /2 max j M i , j � � i = 1 29 31 Rfam – an RNA family DB � Rfam � Griffiths-Jones, et al., NAR ‘03,’05 � IRE (partial seed alignment): � Biggest scientific computing user in Europe - Input (hand-curated): � Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC 1000 cpu cluster for a month per release � MSA “seed alignment” � Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons � Rapidly growing: � Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T � Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Rel 1.0, 1/03: 25 families, 55k instances � Window Len W � Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: � Rel 7.0, 3/05: 503 families, >300k instances � Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM � Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG scan results & “full Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” � Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 33 34
Faster Genome Annotation � Faster Search � of Non-coding RNAs � Without Loss of Accuracy � Zasha Weinberg � & W.L. Ruzzo � Recomb ‘04, ISMB ‘04, Bioinfo ‘06 � 37 38 RaveNnA: Genome Scale � CM’s are good, but slow � RNA Search � Rfam Reality Our Work Rfam Goal Typically 100x speedup over raw CM, w/ no loss in accuracy: � EMBL EMBL EMBL � � drop structure from CM to create a (faster) HMM � � use that to pre-filter sequence; � BLAST Ravenna � discard parts where, provably, CM score < threshold; � � actually run CM on the rest (the promising parts) � � assignment of HMM transition/emission scores is key � CM CM CM �� (large convex optimization problem) � junk junk hits hits hits 1 month, ~2 months, 10 years, Weinberg & Ruzzo, Bioinformatics , 2004, 2006 1000 computers 39 1000 computers 1000 computers 41
CM to HMM � Key Issue: 25 scores � 10 � CM HMM CM HMM A A A A A A A A C C C C C C C C P L R G G G G G G G G U U U U U U U U – – – – – – – – A A A A C C C C G G G G Need: log Viterbi scores CM � HMM � U U U U – – – – 25 emisions per state 5 emissions per state, 2x states 43 44 Viterbi/Forward Scoring � Key Issue: 25 scores � 10 � CM HMM Path � defines transitions/emissions � A A A A C C C C Score( � ) = product of “probabilities” on � � P L R G G G G U U U U NB: ok if “probs” aren’t, e.g. !" 1 � – – – – NB: HMM not a prob. model (e.g. in CM, emissions are odds ratios vs � 0th-order background) � Need: log Viterbi scores CM � HMM � For any nucleotide sequence x: � P AA � L A + R A P CA � L C + R A … Viterbi-score(x) = max{ score( � ) | � emits x} � P AC � L A + R C P CC � L C + R C … P CG � L C + R G … Forward-score(x) = ! { score( � ) | � emits x} � P AG � L A + R G P AU � L A + R U P CU � L C + R U … P A– � L A + R – P C– � L C + R – … 45 46
P AA � L A + R A P AC � L A + R C Rigorous Filtering � Some scores filter better � P AG � L A + R G P AU � L A + R U P A– � L A + R – … P UA = 1 � L U + R A � Any scores satisfying the linear inequalities P UG = 4 � L U + R G � give rigorous filtering � � � � � � � Assuming ACGU � 25% � Proof: � Option 1: � � � � Opt 1: � CM Viterbi path score � L U = R A = R G = 2 � � L U + (R A + R G )/2 = 4 � � “corresponding” HMM path score � � Viterbi HMM path score � Option 2: � � � � Opt 2: � (even if it does not correspond to any CM path) � L U = 0, R A = 1, R G = 4 � L U + (R A + R G )/2 = 2.5 � 47 48 Optimizing filtering � Calculating E(L i , R i ) � For any nucleotide sequence x: � E(L i , R i ) = ! x Forward-score(x)*Pr(x) � Viterbi-score(x) = max{ score( � ) | � emits x } � Forward-score(x) = ! { score( � ) | � emits x } � Forward-like: for every state, calculate Expected Forward Score � expected score for all paths ending there; E(L i , R i ) = ! all sequences x Forward-score(x)*Pr(x) � NB: E is a function of L i , R i only � easily calculated from expected scores of Under 0th-order background model Optimization: � predecessors & transition/emission Minimize E(L i , R i ) subject to score Lin.Ineq.s � probabilities/scores � This is heuristic (“forward � � Viterbi � � filter � ”) � But still rigorous because “subject to score Lin.Ineq.s” � 49 50
Minimizing E(L i , R i ) � “Convex” Optimization � Calculate E(L i , R i ) Convex: � Nonconvex: � Forward: local max = global max; � can be many local maxima, symbolically , in terms of << global max; � simple “hill climbing” works � emission scores, so we “hill-climbing” fails � can do partial derivatives Viterbi: for numerical convex optimization algorithm � � E ( L 1 , L 2 ,...) � L i 51 52 Results: New ncRNA’s? � Estimated Filtering Efficiency � # found � # found # new � (139 Rfam 4.0 families) � rigorous filter Name � BLAST � + CM � + CM � Filtering # families # families fraction � (compact) � (expanded) � Pyrococcus snoRNA � 57 � 180 � 123 � Iron response element � 201 � 322 � 121 � < 10 -4 � 105 � 110 � ~100x Histone 3’ element � 1004 � 1106 � 102 � speedup 10 -4 - 10 -2 � 8 � 17 � Purine riboswitch � 69 � 123 � 54 � Retron msr � 11 � 59 � 48 � .01 - .10 � 11 � 3 � Hammerhead I � 167 � 193 � 26 � .10 - .25 � 2 � 2 � Hammerhead III � 251 � 264 � 13 � U4 snRNA � 283 � 290 � 7 � .25 - .99 � 6 � 4 � S-box � 128 � 131 � 3 � U6 snRNA � 1462 � 1464 � 2 � .99 - 1.0 � 7 � 3 � U5 snRNA � 199 � 200 � 1 � U7 snRNA � 312 � 313 � 1 � 53 54
RNA Motif Discovery � Typical problem: given a ~10-20 unaligned sequences of ~1kb, most of which contain Motif Discovery � instances of one RNA motif of, say, 150bp -- find it. � Example: 5’ UTRs of orthologous glycine cleavage genes from � -proteobacteria � 60 61 Searching for noncoding RNAs � Cmfinder--A Covariance � Model Based RNA Motif � CM’s are great, but where do they come from? � Finding Algorithm An approach: comparative genomics � Bioinformatics , 2006, 22(4): 445-452 Search for motifs with common secondary structure in a set of functionally related sequences. � Challenges � Zizhen Yao � Three related tasks � Zasha Weinberg � Locate the motif regions. � Walter L. Ruzzo � Align the motif instances. � University of Washington, Seattle � Predict the consensus secondary structure. � Motif search space is huge! � Motif location space, alignment space, structure space. � 62 63
Recommend
More recommend