Day 1 " Last lecture: ! RNA Search and ! many biologically - PowerPoint PPT Presentation

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA " Motif Discovery " Today: " Covariance Models (CMs) represent CSE 527 ! conserved RNA sequence/structure motifs " Computational Biology " 2 Computational Problems " How to predict secondary structure " How to model an RNA “motif” ! (I.e., sequence/structure pattern) " Given a motif, how to search for instances " Given (unaligned) sequences, find motifs " How to score discovered motifs " How to leverage prior knowledge " 3 4

RNA Motif Models " “Covariance Models” (Eddy & Durbin 1994) " aka profile stochastic context-free grammars " Motif Description " aka hidden Markov models on steroids " Model position-specific nucleotide preferences and base-pair preferences " Pro: accurate " Con: model building hard, search sloooow " 10 What " Main Results " A probabilistic model for RNA families " Very accurate search for tRNA " The “Covariance Model” " (Precursor to tRNAscanSE - current favorite) " ! A Stochastic Context-Free Grammar " Given sufficient data, model construction A generalization of a profile HMM " comparable to, but not quite as good as, ! Algorithms for Training " human experts " From aligned or unaligned sequences " Some quantitative info on importance of Automates “comparative analysis” " pseudoknots and other tertiary features " Complements Nusinov/Zucker RNA folding " Algorithms for searching " 11 12

Example: searching for Probabilistic Model Search " tRNAs ! As with HMMs, given a sequence, you calculate likelihood ratio that the model could generate the sequence, vs a background model " You set a score threshold " Anything above threshold " a “hit” " Scoring: " “Forward” / “Inside” algorithm - sum over all paths " Viterbi approximation - find single best path ! (Bonus: alignment & structure prediction) " 13 14 How to model an RNA “Motif”? " How to model an RNA “Motif”? " Add “column pairs” and pair emission probabilities Conceptually, start with a profile HMM: " for base-paired regions " from a multiple alignment, estimate nucleotide/ insert/delete preferences for each position " given a new seq, estimate likelihood that it could be generated by the model, & align it to the model " <<<<<<< >>>>>>> paired columns … … mostly G ins all G del 16 17

Profile Hmm Structure " CM Structure " A: Sequence + structure " B: the CM “guide tree” " C: probabilities of letters/ pairs & of indels " Think of each branch being an HMM emitting both sides of a helix (but M j : " Match states (20 emission probabilities) " 3’ side emitted in I j : " Insert states (Background emission probabilities) " reverse order) " D j : " Delete states (silent - no emission) " 18 24 Overall CM CM Viterbi Alignment ! Architecture " (the “inside” algorithm) " One box (“node”) per node = i th letter of input of guide tree " x i BEG/MATL/INS/DEL just x ij = substring i ,..., j of input like an HMM " T yz = P (transition y " z ) MATP & BIF are the key y E x i , x j = P (emission of x i , x j from state y ) additions: MATP emits pairs of symbols, modeling base- y S ij = max # log P ( x ij gen'd starting in state y via path # ) pairs; BIF allows multiple helices " 25 26

CM Viterbi Alignment ! (the “inside” algorithm) " y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] z S ij & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y left + S k + 1, j y right ] ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 27 29 Covariation is strong evidence for base pairing mRNA leader Mutual Information " f xi , xj " M ij = f xi , xj log 2 ; 0 # M ij # 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing " MI = expected score gain from using a pair state " mRNA leader switch? Finding optimal MI, (i.e. opt pairing of cols) is hard(?) " Finding optimal MI without pseudoknots can be done by dynamic programming " 30 31

M.I. Example (Artificial) " * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U A G G C U U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 only 2 possible mates in 6. M.I. = 1 bit. " C 0 0 4 4 4 4 4 16 0 G 0 16 4 2 4 4 4 0 0 32 35 U 0 0 4 8 4 4 4 0 16 MI-Based Structure-Learning ! Primary vs Secondary Info " Find best (max total MI) subset of column pairs among i…j, subject to absence of pseudo-knots ! $ S i , j = max S i , j " 1 j unpaired % max i # k < j " 4 S i , k " 1 + M k , j + S k + 1, j " 1 j paired & “Just like Nussinov/Zucker folding” " disallowing allowing BUT, need enough data---enough sequences at right pseudoknots phylogenetic distance " # n & % " ( /2 max j M i , j $ ' i = 1 36 37

Comparison to TRNASCAN " tRNAScanSE " Fichant & Burks - best heuristic then " Uses 3 older heuristic tRNA finders as evaluation criteria Slightly different 97.5% true positive " prefilter " 0.37 false positives per MB " Uses CM built as described for final scoring " CM A1415 (trained on trusted alignment) " Actually 3(?) different CMs " > 99.98% true positives " " eukaryotic nuclear " < 0.2 false positives per MB " Current method-of-choice is “tRNAscanSE”, a CM- " prokaryotic " based scan with heuristic pre-filtering (including " organellar " TRNASCAN?) for performance reasons. " Used in all genome annotation projects " 40 41 Rfam – an RNA family DB ! Griffiths-Jones, et al., NAR ’03, ’05, ’08 " Biggest scientific computing user in Europe - An Important Application: ! 1000 cpu cluster for a month per release " Rfam " Rapidly growing: " Rel 1.0, 1/03: 25 families, 55k instances " Rel 7.0, 3/05: 503 families, >300k instances " Rel 9.0, 7/08: 603 families, 896k instances ! Rel 9.1, 1/09: 1372 families, ??? instances " 43

Rfam database ! Example Rfam Family " http://www.sanger.ac.uk/Software/Rfam/ ! (Release 7.0, 3/2005) " IRE (partial seed alignment): ! Input (hand-curated): " 503 ncRNA families � Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC MSA “seed alignment” " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC 280,000 annotated ncRNAs � Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU 8 riboswitches, 235 small nucleolar RNAs, Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG 8 spliceosomal RNAs, 10 bacterial Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU antisense RNAs, 46 microRNAs, 9 Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG ribozymes, 122 cis RNA regulatory Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC elements, … � alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 44 46 Day 2 ! Rfam – key issues "" " 5 slide synopsis of last lecture Overly narrow families " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " Variant structures/unstructured RNAs " Spliced RNAs " They allow accurate search " RNA pseudogenes " But " Human ALU is SRP related w/ 1.1m copies " " a) search is slow " Mouse B2 repeat (350k copies) tRNA related " " b) model construction is laborious " Speed & sensitivity " Motif discovery " 48 49

Day 1 " Last lecture: ! RNA Search and ! many biologically - PowerPoint PPT Presentation

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA " Motif Discovery " Today: " Covariance Models (CMs) represent CSE 527 ! conserved RNA sequence/structure motifs " Computational Biology

At Creation Common Holy Day 1 Day 2 Day 8 Day 9 Day 3 Day 4 Day 5 Day 6 Day 7 7 Days The

Science with a Little Altitude | QS18 Fah Sathirapongsasuti, PhD EBC Everest Day 1 Day 2 Day

ENGLAND | APRIL 12 20, 2020 8 DAY TOU R SUGGE STE D ITI N E R ARY* DAY 0 DAY 1 DAY 2

Day 1 Day 1 Staging area Buses & Ambulances In Use Day 1 Day 2 Days 2 & 3 Day 4

Introduction to R Day 4: Functions October 10, 2019 Agenda Day 1: Figures Day 2: Selecting,

Module 4 AFA CyberCamp Format Day T wo Day Three Day Four Day Five Day One Windows

Workflow 6 Touchpoints After First Visit Day 0 - Sunday Day 2 - Tuesday Day 6 -

Summer School Overview Day 0: R bootcamp Day 1: Workflow, Google App Engine Day 2:

2014 Investor Day DECEMBER 10, 2014 5 | 2014 INVESTOR DAY | 2014 INVESTOR DAY Welcome MARK

BJC BJC BJC BJC Opportunity Day Opportunity Day 4Q09 Opportunity Day Opportunity Day

CACTM Patricia Arizmendi Garcia LauraCalleja Diez Fernando Carmona Mateos Andrea Magn

Europe 2014 A Million Heartbeats by rt Ahlin 2 Who am I? 3 What do I do? 4 Why QS? 5

2020 Effective Mentoring Program Combined Program (School and Early Childhood) Day 2 1 2020 SB

King, Jr. Day Remember! Celebrate! Act! A Day On, Not A Day Off! January 18, 2016 Dr. Martin

KINDERGARTEN EXPERIENCE How will the full-day kindergarten differ from the half-day

In the United State we have celebrated Independence Day and Presidents Day since the 1870s.

Prex: Finding Guidance for Forward and Backward Porting of Linux Device Drivers Julia Lawall,

Part II: Applications in Database Systems Some slides courtesy of Amol Deshpande Outline

Restructuring Scientific Software using Semantic Patching with Coccinelle Michele MARTONE

Linux Qualification - Coding Style / Type issues in IEC 61508 Nicholas Mc Guire <

Chapter 24 Life in the Universe 24.1 Life on Earth Our goals for learning When did life

The earliest evidence of life earliest evidence of life?

THE THEORY ORY OF EVOLUTION UTION Evoluti tion on Process by which living organisms descend

Music, Hearing Loss, and Cochlear Implants The Next Frontier Charles J. Limb, M.D. Francis A.

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

Day 1 " Last lecture: ! RNA Search and ! many biologically - PowerPoint PPT Presentation

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA " Motif Discovery " Today: " Covariance Models (CMs) represent CSE 527 ! conserved RNA sequence/structure motifs " Computational Biology

At Creation Common Holy Day 1 Day 2 Day 8 Day 9 Day 3 Day 4 Day 5 Day 6 Day 7 7 Days The

Science with a Little Altitude | QS18 Fah Sathirapongsasuti, PhD EBC Everest Day 1 Day 2 Day

ENGLAND | APRIL 12 20, 2020 8 DAY TOU R SUGGE STE D ITI N E R ARY* DAY 0 DAY 1 DAY 2

Day 1 Day 1 Staging area Buses &amp; Ambulances In Use Day 1 Day 2 Days 2 &amp; 3 Day 4

Introduction to R Day 4: Functions October 10, 2019 Agenda Day 1: Figures Day 2: Selecting,

Module 4 AFA CyberCamp Format Day T wo Day Three Day Four Day Five Day One Windows

Workflow 6 Touchpoints After First Visit Day 0 - Sunday Day 2 - Tuesday Day 6 -

Summer School Overview Day 0: R bootcamp Day 1: Workflow, Google App Engine Day 2:

2014 Investor Day DECEMBER 10, 2014 5 | 2014 INVESTOR DAY | 2014 INVESTOR DAY Welcome MARK

BJC BJC BJC BJC Opportunity Day Opportunity Day 4Q09 Opportunity Day Opportunity Day

CACTM Patricia Arizmendi Garcia LauraCalleja Diez Fernando Carmona Mateos Andrea Magn

Europe 2014 A Million Heartbeats by rt Ahlin 2 Who am I? 3 What do I do? 4 Why QS? 5

2020 Effective Mentoring Program Combined Program (School and Early Childhood) Day 2 1 2020 SB

King, Jr. Day Remember! Celebrate! Act! A Day On, Not A Day Off! January 18, 2016 Dr. Martin

KINDERGARTEN EXPERIENCE How will the full-day kindergarten differ from the half-day

In the United State we have celebrated Independence Day and Presidents Day since the 1870s.

Prex: Finding Guidance for Forward and Backward Porting of Linux Device Drivers Julia Lawall,

Part II: Applications in Database Systems Some slides courtesy of Amol Deshpande Outline

Restructuring Scientific Software using Semantic Patching with Coccinelle Michele MARTONE

Linux Qualification - Coding Style / Type issues in IEC 61508 Nicholas Mc Guire &lt;

Chapter 24 Life in the Universe 24.1 Life on Earth Our goals for learning When did life

The earliest evidence of life earliest evidence of life?

THE THEORY ORY OF EVOLUTION UTION Evoluti tion on Process by which living organisms descend

Music, Hearing Loss, and Cochlear Implants The Next Frontier Charles J. Limb, M.D. Francis A.

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

Day 1 Day 1 Staging area Buses & Ambulances In Use Day 1 Day 2 Days 2 & 3 Day 4

Linux Qualification - Coding Style / Type issues in IEC 61508 Nicholas Mc Guire <