Day 3 ! " 5 slide synopsis of last lecture A: Sequence + - PDF document

Day 3 ! " 5 slide synopsis of last lecture A: Sequence + structure " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " B: the CM “guide tree” " They allow accurate search " C: probabilities of letters/ pairs & of indels " But " Think of each branch " a) search is slow " being an HMM emitting " b) model construction is laborious " both sides of a helix (but 3’ side emitted in reverse order) " 56 57 Example: CM Viterbi Alignment ! searching for (the “inside” algorithm) " tRNAs y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] S ij z & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y right ] y left + S k + 1, j ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 58 59 Example Rfam Family " Today’s Goals " IRE (partial seed alignment): ! Input (hand-curated): " Faster Search " Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC MSA “seed alignment” " Infernal & RaveNnA " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Automated Model-building " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU CMfinder " Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 60 61

Impact of RNA homology search " (Barrick, et al., 2004) � Homology search " glycine operon � riboswitch � Sequence-based " B. subtilis � Smith-Waterman " L. innocua � FASTA " BLAST " A. tumefaciens � V. cholera � Sharp decline in sensitivity at ~60-70% identity " M. tuberculosis � So, use structure, too " (and 19 more species) � 64 65 Impact of RNA homology search " (Barrick, et al., 2004) � (Mandal, et al., 2004) � glycine operon � riboswitch � Faster Genome Annotation ! B. subtilis � of Non-coding RNAs ! L. innocua � Without Loss of Accuracy " A. tumefaciens � Zasha Weinberg " V. cholera � & W.L. Ruzzo " M. tuberculosis � Recomb ‘04, ISMB ‘04, Bioinfo ‘06 " (and 42 more species) � (and 19 more species) � BLAST-based CM-based 66 RaveNnA: Genome Scale ! CM’s are good, but slow ! RNA Search " Rfam Reality Our Work Rfam Goal Typically 100x speedup over raw CM, w/ no loss in accuracy: " EMBL EMBL EMBL " " Drop structure from CM to create a (faster) HMM " " Use that to pre-filter sequence; " BLAST Ravenna " Discard parts where, provably, CM score < threshold; " " Actually run CM on the rest (the promising parts) " " Assignment of HMM transition/emission scores is key " CM CM CM " " (a large convex optimization problem) " junk junk hits hits hits 1 month, ~2 months, 10 years, Weinberg & Ruzzo, Bioinformatics , 2004, 2006 1000 computers 69 1000 computers 1000 computers 71

Covariance ! Oversimplified CM ! (for pedagogical purposes only) " Model " Key difference of CM vs HMM: A A C C Pair states emit paired symbols, G G corresponding to base-paired U U – – nucleotides; 16 emission probabilities here. A A C C G G U U – – 72 73 CM to HMM " Key Issue: 25 scores " 10 " CM HMM CM HMM A A A A A A A A C C C C C C C C P L R G G G G G G G G U U U U U U U U – – – – – – – – A A A A C C C C G G G G Need: log Viterbi scores CM ! HMM " U U U U – – – – 25 emisions per state 5 emissions per state, 2x states 74 75 P AA ! L A + R A P AC ! L A + R C Key Issue: 25 scores " 10 " Rigorous Filtering " P AG ! L A + R G P AU ! L A + R U CM HMM P A– ! L A + R – … Any scores satisfying the linear inequalities A A A A C C C C P L R give rigorous filtering ! G G G G U U U U – – – – NB: HMM not a prob. model Proof: ! Need: log Viterbi scores CM ! HMM " CM Viterbi path score ! ! “corresponding” HMM path score ! P AA ! L A + R A P CA ! L C + R A … P AC ! L A + R C P CC ! L C + R C … ! Viterbi HMM path score ! P AG ! L A + R G P CG ! L C + R G … (even if it does not correspond to any CM path) " P AU ! L A + R U P CU ! L C + R U … P C– ! L C + R – … P A– ! L A + R – 77 78

Some scores filter better " Assignment of scores/ “probabilities” " P UA = 1 ! L U + R A " Convex optimization problem " P UG = 4 ! L U + R G " Constraints: enforce rigorous property " Objective function: filter as aggressively as " " " " " " Assuming ACGU # 25% " possible " Option 1: " " " " Opt 1: " Problem sizes: " L U = R A = R G = 2 " " L U + (R A + R G )/2 = 4 " 1000-10000 variables " Option 2: " " " " Opt 2: " 10000-100000 inequality constraints " L U = 0, R A = 1, R G = 4 " L U + (R A + R G )/2 = 2.5 " 79 83 Estimated Filtering Efficiency ! “Convex” Optimization " (139 Rfam 4.0 families) " Filtering # families # families Convex: ! Nonconvex: ! fraction " (compact) " (expanded) " local max = global max; " can be many local maxima, ≪ global max; ! simple “hill climbing” works " < 10 -4 " 105 " 110 " # break even ~100x “hill-climbing” fails " 10 -4 - 10 -2 " 8 " 17 " speedup .01 - .10 " 11 " 3 " .10 - .25 " 2 " 2 " .25 - .99 " 6 " 4 " .99 - 1.0 " 7 " 3 " 84 85 Averages 283 times faster than CM � Results: new ncRNAs (?) " Heuristic Filters " Name " # Known " # New " Rigorous filters optimized for worst case " (BLAST + CM) " (rigorous filter + CM) " Pyrococcus snoRNA " 57 " 123 " Possible to trade improved speed for small Iron response element " 201 " 121 " loss in sensitivity? " Histone 3’ element " 1004 " 102* " Yes – profile HMMs as before, but optimized Retron msr " 11 " 48 " Hammerhead I " 167 " 26 " for average case " Hammerhead III " 251 " 13 " Often 10x faster, modest loss in sensitivity " U6 snRNA " 1462 " 2 " U7 snRNA " 312 " 1 " cobalamin riboswitch " 170 " 7 " 13 other families " 5-1107 " 0 " 87 102

Software " CM Search Summary " Ravenna implements both rigorous and Still slower than we might like, but dramatic heuristic filters " speedup over raw CM is possible with: " Infernal (engine behind Rfam) implements No loss in sensitivity (provably), or " heuristic filters and some other accelerations " Even faster with modest (and estimable) loss in sensitivity " E,g., dynamic “banding” of dynamic programming matrix based on the insight that large deviations from consensus length must have low scores. " 107 108 RNA Motif Discovery " RNA Motif Discovery " CM’s are great, but where do they come from? " Typical problem: given a 10-20 unaligned An approach: comparative genomics " sequences of 1-10kb, most of which contain Search for motifs with common secondary structure in a instances of one RNA motif of 100-200bp set of functionally related sequences. " -- find it. " Challenges " Example: 5’ UTRs of orthologous glycine Three related tasks " Locate the motif regions. " cleavage genes from $ -proteobacteria " Align the motif instances. " Example: corresponding introns of Predict the consensus secondary structure. " Motif search space is huge! " orthogolous vertebrate genes " Motif location space, alignment space, structure space. " 115 117 Pitfall for sequence-alignment- Approaches " first approach " Align-First: Align sequences, then look for Structural conservation ≠ Sequence conservation " Alignment without structure information is unreliable " common structure " CLUSTALW alignment of SECIS elements with flanking regions Fold-First: Predict structures, then try to align them " Joint: Do both together " same-colored boxes should be aligned 118 120

Day 3 ! " 5 slide synopsis of last lecture A: Sequence + - PDF document

Day 3 ! " 5 slide synopsis of last lecture A: Sequence + structure " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " B: the CM guide tree " They allow accurate search " C: probabilities

At Creation Common Holy Day 1 Day 2 Day 8 Day 9 Day 3 Day 4 Day 5 Day 6 Day 7 7 Days The

Science with a Little Altitude | QS18 Fah Sathirapongsasuti, PhD EBC Everest Day 1 Day 2 Day

ENGLAND | APRIL 12 20, 2020 8 DAY TOU R SUGGE STE D ITI N E R ARY* DAY 0 DAY 1 DAY 2

Day 1 Day 1 Staging area Buses & Ambulances In Use Day 1 Day 2 Days 2 & 3 Day 4

Introduction to R Day 4: Functions October 10, 2019 Agenda Day 1: Figures Day 2: Selecting,

Module 4 AFA CyberCamp Format Day T wo Day Three Day Four Day Five Day One Windows

Workflow 6 Touchpoints After First Visit Day 0 - Sunday Day 2 - Tuesday Day 6 -

Summer School Overview Day 0: R bootcamp Day 1: Workflow, Google App Engine Day 2:

2014 Investor Day DECEMBER 10, 2014 5 | 2014 INVESTOR DAY | 2014 INVESTOR DAY Welcome MARK

BJC BJC BJC BJC Opportunity Day Opportunity Day 4Q09 Opportunity Day Opportunity Day

CACTM Patricia Arizmendi Garcia LauraCalleja Diez Fernando Carmona Mateos Andrea Magn

Europe 2014 A Million Heartbeats by rt Ahlin 2 Who am I? 3 What do I do? 4 Why QS? 5

2020 Effective Mentoring Program Combined Program (School and Early Childhood) Day 2 1 2020 SB

King, Jr. Day Remember! Celebrate! Act! A Day On, Not A Day Off! January 18, 2016 Dr. Martin

KINDERGARTEN EXPERIENCE How will the full-day kindergarten differ from the half-day

In the United State we have celebrated Independence Day and Presidents Day since the 1870s.

Markov Decision Processes: Biosens II E. Jrgensen & Lars R. Nielsen Department of Genetics

Introduction to probability distributions and testing Lecture 2 Summary of this week This

Welcome Our plans to make Bicknacre future- ready with the fastest, most dependable

From the Spectral Stokes solvers to the Stokes eigenmodes in square/cube, until

Agenda Dairy Cattle Milk Recording WG Aarhus, May 28, 2013 Agenda Opening, Kai, Pavel, Franz,

Boolean Analysis Unveils Universal Rules in the Microbiome Data Daniella Vo, Shayal Singh, Sara

Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels Alan M. Dunn

Introduction to the Dictionary of Medieval Names from European Sources Dr. Sara L. Uckelman

Day 3 ! " 5 slide synopsis of last lecture A: Sequence + - PDF document

Day 3 ! " 5 slide synopsis of last lecture A: Sequence + structure " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " B: the CM guide tree " They allow accurate search " C: probabilities

At Creation Common Holy Day 1 Day 2 Day 8 Day 9 Day 3 Day 4 Day 5 Day 6 Day 7 7 Days The

Science with a Little Altitude | QS18 Fah Sathirapongsasuti, PhD EBC Everest Day 1 Day 2 Day

ENGLAND | APRIL 12 20, 2020 8 DAY TOU R SUGGE STE D ITI N E R ARY* DAY 0 DAY 1 DAY 2

Day 1 Day 1 Staging area Buses &amp; Ambulances In Use Day 1 Day 2 Days 2 &amp; 3 Day 4

Introduction to R Day 4: Functions October 10, 2019 Agenda Day 1: Figures Day 2: Selecting,

Module 4 AFA CyberCamp Format Day T wo Day Three Day Four Day Five Day One Windows

Workflow 6 Touchpoints After First Visit Day 0 - Sunday Day 2 - Tuesday Day 6 -

Summer School Overview Day 0: R bootcamp Day 1: Workflow, Google App Engine Day 2:

2014 Investor Day DECEMBER 10, 2014 5 | 2014 INVESTOR DAY | 2014 INVESTOR DAY Welcome MARK

BJC BJC BJC BJC Opportunity Day Opportunity Day 4Q09 Opportunity Day Opportunity Day

CACTM Patricia Arizmendi Garcia LauraCalleja Diez Fernando Carmona Mateos Andrea Magn

Europe 2014 A Million Heartbeats by rt Ahlin 2 Who am I? 3 What do I do? 4 Why QS? 5

2020 Effective Mentoring Program Combined Program (School and Early Childhood) Day 2 1 2020 SB

King, Jr. Day Remember! Celebrate! Act! A Day On, Not A Day Off! January 18, 2016 Dr. Martin

KINDERGARTEN EXPERIENCE How will the full-day kindergarten differ from the half-day

In the United State we have celebrated Independence Day and Presidents Day since the 1870s.

Markov Decision Processes: Biosens II E. Jrgensen &amp; Lars R. Nielsen Department of Genetics

Introduction to probability distributions and testing Lecture 2 Summary of this week This

Welcome Our plans to make Bicknacre future- ready with the fastest, most dependable

From the Spectral Stokes solvers to the Stokes eigenmodes in square/cube, until

Agenda Dairy Cattle Milk Recording WG Aarhus, May 28, 2013 Agenda Opening, Kai, Pavel, Franz,

Boolean Analysis Unveils Universal Rules in the Microbiome Data Daniella Vo, Shayal Singh, Sara

Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels Alan M. Dunn

Introduction to the Dictionary of Medieval Names from European Sources Dr. Sara L. Uckelman

Day 1 Day 1 Staging area Buses & Ambulances In Use Day 1 Day 2 Days 2 & 3 Day 4

Markov Decision Processes: Biosens II E. Jrgensen & Lars R. Nielsen Department of Genetics