Day 3 ! " 5 slide synopsis of last lecture A: Sequence + structure " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " B: the CM “guide tree” " They allow accurate search " C: probabilities of letters/ pairs & of indels " But " Think of each branch " a) search is slow " being an HMM emitting " b) model construction is laborious " both sides of a helix (but 3’ side emitted in reverse order) " 56 57 Example: CM Viterbi Alignment ! searching for (the “inside” algorithm) " tRNAs y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] S ij z & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y right ] y left + S k + 1, j ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 58 59 Example Rfam Family " Today’s Goals " IRE (partial seed alignment): ! Input (hand-curated): " Faster Search " Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC MSA “seed alignment” " Infernal & RaveNnA " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Automated Model-building " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU CMfinder " Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 60 61
Impact of RNA homology search " (Barrick, et al., 2004) � Homology search " glycine operon � riboswitch � Sequence-based " B. subtilis � Smith-Waterman " L. innocua � FASTA " BLAST " A. tumefaciens � V. cholera � Sharp decline in sensitivity at ~60-70% identity " M. tuberculosis � So, use structure, too " (and 19 more species) � 64 65 Impact of RNA homology search " (Barrick, et al., 2004) � (Mandal, et al., 2004) � glycine operon � riboswitch � Faster Genome Annotation ! B. subtilis � of Non-coding RNAs ! L. innocua � Without Loss of Accuracy " A. tumefaciens � Zasha Weinberg " V. cholera � & W.L. Ruzzo " M. tuberculosis � Recomb ‘04, ISMB ‘04, Bioinfo ‘06 " (and 42 more species) � (and 19 more species) � BLAST-based CM-based 66 RaveNnA: Genome Scale ! CM’s are good, but slow ! RNA Search " Rfam Reality Our Work Rfam Goal Typically 100x speedup over raw CM, w/ no loss in accuracy: " EMBL EMBL EMBL " " Drop structure from CM to create a (faster) HMM " " Use that to pre-filter sequence; " BLAST Ravenna " Discard parts where, provably, CM score < threshold; " " Actually run CM on the rest (the promising parts) " " Assignment of HMM transition/emission scores is key " CM CM CM " " (a large convex optimization problem) " junk junk hits hits hits 1 month, ~2 months, 10 years, Weinberg & Ruzzo, Bioinformatics , 2004, 2006 1000 computers 69 1000 computers 1000 computers 71
Covariance ! Oversimplified CM ! (for pedagogical purposes only) " Model " Key difference of CM vs HMM: A A C C Pair states emit paired symbols, G G corresponding to base-paired U U – – nucleotides; 16 emission probabilities here. A A C C G G U U – – 72 73 CM to HMM " Key Issue: 25 scores " 10 " CM HMM CM HMM A A A A A A A A C C C C C C C C P L R G G G G G G G G U U U U U U U U – – – – – – – – A A A A C C C C G G G G Need: log Viterbi scores CM ! HMM " U U U U – – – – 25 emisions per state 5 emissions per state, 2x states 74 75 P AA ! L A + R A P AC ! L A + R C Key Issue: 25 scores " 10 " Rigorous Filtering " P AG ! L A + R G P AU ! L A + R U CM HMM P A– ! L A + R – … Any scores satisfying the linear inequalities A A A A C C C C P L R give rigorous filtering ! G G G G U U U U – – – – NB: HMM not a prob. model Proof: ! Need: log Viterbi scores CM ! HMM " CM Viterbi path score ! ! “corresponding” HMM path score ! P AA ! L A + R A P CA ! L C + R A … P AC ! L A + R C P CC ! L C + R C … ! Viterbi HMM path score ! P AG ! L A + R G P CG ! L C + R G … (even if it does not correspond to any CM path) " P AU ! L A + R U P CU ! L C + R U … P C– ! L C + R – … P A– ! L A + R – 77 78
Some scores filter better " Assignment of scores/ “probabilities” " P UA = 1 ! L U + R A " Convex optimization problem " P UG = 4 ! L U + R G " Constraints: enforce rigorous property " Objective function: filter as aggressively as " " " " " " Assuming ACGU # 25% " possible " Option 1: " " " " Opt 1: " Problem sizes: " L U = R A = R G = 2 " " L U + (R A + R G )/2 = 4 " 1000-10000 variables " Option 2: " " " " Opt 2: " 10000-100000 inequality constraints " L U = 0, R A = 1, R G = 4 " L U + (R A + R G )/2 = 2.5 " 79 83 Estimated Filtering Efficiency ! “Convex” Optimization " (139 Rfam 4.0 families) " Filtering # families # families Convex: ! Nonconvex: ! fraction " (compact) " (expanded) " local max = global max; " can be many local maxima, ≪ global max; ! simple “hill climbing” works " < 10 -4 " 105 " 110 " # break even ~100x “hill-climbing” fails " 10 -4 - 10 -2 " 8 " 17 " speedup .01 - .10 " 11 " 3 " .10 - .25 " 2 " 2 " .25 - .99 " 6 " 4 " .99 - 1.0 " 7 " 3 " 84 85 Averages 283 times faster than CM � Results: new ncRNAs (?) " Heuristic Filters " Name " # Known " # New " Rigorous filters optimized for worst case " (BLAST + CM) " (rigorous filter + CM) " Pyrococcus snoRNA " 57 " 123 " Possible to trade improved speed for small Iron response element " 201 " 121 " loss in sensitivity? " Histone 3’ element " 1004 " 102* " Yes – profile HMMs as before, but optimized Retron msr " 11 " 48 " Hammerhead I " 167 " 26 " for average case " Hammerhead III " 251 " 13 " Often 10x faster, modest loss in sensitivity " U6 snRNA " 1462 " 2 " U7 snRNA " 312 " 1 " cobalamin riboswitch " 170 " 7 " 13 other families " 5-1107 " 0 " 87 102
Software " CM Search Summary " Ravenna implements both rigorous and Still slower than we might like, but dramatic heuristic filters " speedup over raw CM is possible with: " Infernal (engine behind Rfam) implements No loss in sensitivity (provably), or " heuristic filters and some other accelerations " Even faster with modest (and estimable) loss in sensitivity " E,g., dynamic “banding” of dynamic programming matrix based on the insight that large deviations from consensus length must have low scores. " 107 108 RNA Motif Discovery " RNA Motif Discovery " CM’s are great, but where do they come from? " Typical problem: given a 10-20 unaligned An approach: comparative genomics " sequences of 1-10kb, most of which contain Search for motifs with common secondary structure in a instances of one RNA motif of 100-200bp set of functionally related sequences. " -- find it. " Challenges " Example: 5’ UTRs of orthologous glycine Three related tasks " Locate the motif regions. " cleavage genes from $ -proteobacteria " Align the motif instances. " Example: corresponding introns of Predict the consensus secondary structure. " Motif search space is huge! " orthogolous vertebrate genes " Motif location space, alignment space, structure space. " 115 117 Pitfall for sequence-alignment- Approaches " first approach " Align-First: Align sequences, then look for Structural conservation ≠ Sequence conservation " Alignment without structure information is unreliable " common structure " CLUSTALW alignment of SECIS elements with flanking regions Fold-First: Predict structures, then try to align them " Joint: Do both together " same-colored boxes should be aligned 118 120
Recommend
More recommend