Day 1 " RNA Search and ! Motif Discovery " Many - PDF document

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA " Genome 541 ! RNA secondary structure prediction " Intro to Computational ! Molecular Biology " 3 Approaches to Structure Prediction " Maximum Pairing ! " + works on single sequences ! " + simple ! " - too inaccurate " Minimum Energy ! " + works on single sequences ! " - ignores pseudoknots ! " - only finds “optimal” fold " Partition Function ! " + finds all folds ! " - ignores pseudoknots " 4 Nussinov: ! Approaches, II " A Computation Order " Comparative sequence analysis ! Or energy " + handles all pairings (potentially incl. pseudoknots) ! B(i,j) = # pairs in optimal pairing of r i ... r j " " - requires several (many?) aligned, ! B(i,j) = 0 for all i, j with i � j-4; otherwise " " appropriately diverged sequences " Stochastic Context-free Grammars ! B(i,j) = max of: " Roughly combines min energy & comparative, but K=2 B(i,j-1) " 3 no pseudoknots " 4 5 Physical experiments (x-ray crystalography, NMR) " max { B(i,k-1)+1+B(k+1,j-1) | ! i ! k < j-4 and r k -r j may pair} " Time: O(n 3 )

Day 2 " Computational Problems " How to predict secondary structure " Day 1: ! Many biologically interesting roles for RNA " How to model an RNA “motif” ! (I.e., sequence/structure pattern) " RNA secondary structure prediction " Given a motif, how to search for instances " Today: " Given (unaligned) sequences, find motifs " Covariance Models (CMs) represent ! How to score discovered motifs " RNA sequence/structure motifs " How to leverage prior knowledge " Fast CM search " 8 9 RNA Motif Models " “Covariance Models” (Eddy & Durbin 1994) " Motif Description " aka profile stochastic context-free grammars " aka hidden Markov models on steroids " Model position-specific nucleotide preferences and base-pair preferences " Pro: accurate " Con: model building hard, search slow " 15 What " Main Results " A probabilistic model for RNA families " Very accurate search for tRNA " The “Covariance Model” " (Precursor to tRNAscanSE - current favorite) " " A Stochastic Context-Free Grammar " Given sufficient data, model construction A generalization of a profile HMM " comparable to, but not quite as good as, ! Algorithms for Training " human experts " From aligned or unaligned sequences " Some quantitative info on importance of Automates “comparative analysis” " pseudoknots and other tertiary features " Complements Nusinov/Zucker RNA folding " Algorithms for searching " 16 17

Example: searching for Probabilistic Model Search " tRNAs ! As with HMMs, given a sequence, you calculate likelihood ratio that the model could generate the sequence, vs a background model " You set a score threshold " Anything above threshold # a “hit” " Scoring: " “Forward” / “Inside” algorithm - sum over all paths " Viterbi approximation - find single best path ! (Bonus: alignment & structure prediction) " 18 19 How to model an RNA “Motif”? " How to model an RNA “Motif”? " Add “column pairs” and pair emission probabilities Conceptually, start with a profile HMM: " for base-paired regions " from a multiple alignment, estimate nucleotide/ insert/delete preferences for each position " given a new seq, estimate likelihood that it could be generated by the model, & align it to the model " <<<<<<< >>>>>>> paired columns … … mostly G del ins all G 21 22 Profile Hmm Structure " CM Structure " A: Sequence + structure " B: the CM “guide tree” " C: probabilities of letters/ pairs & of indels " Think of each branch being an HMM emitting both sides of a helix (but M j : " Match states (20 emission probabilities) " 3’ side emitted in I j : " Insert states (Background emission probabilities) " reverse order) " D j : " Delete states (silent - no emission) " 23 29

Overall CM CM Viterbi Alignment ! Architecture " (the “inside” algorithm) " One box (“node”) per node = i th letter of input of guide tree " x i BEG/MATL/INS/DEL just x ij = substring i ,..., j of input like an HMM " T yz = P (transition y " z ) MATP & BIF are the key y E x i , x j = P (emission of x i , x j from state y ) additions: MATP emits pairs of symbols, modeling base- y S ij = max # log P ( x ij gen'd starting in state y via path # ) pairs; BIF allows multiple helices " 30 31 CM Viterbi Alignment ! (the “inside” algorithm) " y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] S ij z & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y right ] y left + S k + 1, j ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 32 34 Covariation is strong evidence for base pairing mRNA leader Mutual Information " f xi , xj " M ij = f xi , xj log 2 ; 0 # M ij # 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing " MI = expected score gain from using a pair state " mRNA leader switch? Finding optimal MI, (i.e. opt pairing of cols) is hard(?) " Finding optimal MI without pseudoknots can be done by dynamic programming " 35 36

M.I. Example (Artificial) " * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U A G G C U U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 only 2 possible mates in 6. M.I. = 1 bit. " C 0 0 4 4 4 4 4 16 0 G 0 16 4 2 4 4 4 0 0 37 40 U 0 0 4 8 4 4 4 0 16 Primary vs Secondary Info " Comparison to TRNASCAN " Fichant & Burks - best heuristic then " evaluation criteria Slightly different 97.5% true positive " 0.37 false positives per MB " CM A1415 (trained on trusted alignment) " > 99.98% true positives " < 0.2 false positives per MB " Current method-of-choice is “tRNAscanSE”, a CM- based scan with heuristic pre-filtering (including disallowing / allowing TRNASCAN?) for performance reasons. " pseudoknots # n & % " ( /2 max j M i , j $ ' i = 1 42 45 tRNAScanSE " Uses 3 older heuristic tRNA finders as An Important Application: ! prefilter " Rfam " Uses CM built as described for final scoring " Actually 3(?) different CMs " " eukaryotic nuclear " " prokaryotic " " organellar " Used in all genome annotation projects " 46

Rfam – an RNA family DB ! Example Rfam Family " Griffiths-Jones, et al., NAR ’03, ’05, ’08 " IRE (partial seed alignment): ! Biggest scientific computing user in Europe - Input (hand-curated): " Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC 1000 cpu cluster for a month per release " MSA “seed alignment” " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Rapidly growing: " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC DB size: Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Rel 1.0, 1/03: 25 families, 55k instances " Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG ~8GB Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Rel 7.0, 3/05: 503 families, 363k instances " Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC Rel 9.0, 7/08: 603 families, 636k instances ! CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Rel 9.1, 1/09: 1372 families, 1148k instances " Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rel 10.0, 1/10: 1446 families, " 3193k instances " ~160GB Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC phylogeny, etc. " SS_cons <<<<<...<<<<<......>>>>>.>>>>> 48 53

Day 1 " RNA Search and ! Motif Discovery " Many - PDF document

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA " Genome 541 ! RNA secondary structure prediction " Intro to Computational ! Molecular Biology " 3 Approaches to Structure Prediction

RNA Search and Whirlwind tour of ncRNA search & discovery Motif Discovery RNA motif

Motif Discovery Upper Bound An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

RNA Search and Motif Discovery CSEP 527 Computational Biology Previous Lecture

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

RNA Search and Motif Discovery Lecture 9 CSEP 590A Summer 2006 Outline Whirlwind tour of

TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE Bioinformatics Senior Project Wasay Hussain Spring

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

RNA Search and Motif Discovery Lectures 18-19 CSE 527 Autumn 2007 The Human Parts List, circa

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

Regulatory Motif Prediction in DNA Regulatory Motif Prediction in DNA Introduction: toward

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University Four

2 nd semester Topic 22: Talking about the arts For many people there is nothing more highbrow

When Inequality Matters for Macro and Macro Matters for Inequality Discussion by Christopher

CAMPAIGN FINANCE LAWS AND IDEOLOGY IN REGARDS TO CITIZENS UNITED V. FEDERAL ELECTIONS COMMISSION

MOL2NET, 2017 , 3, doi:10.3390/mol2net-03-xxxx 1. Introduction The Asteraceae family

MOL2NET, 2017 , 3, doi:10.3390/mol2net-03-04938 2 Introduction Many active compounds from

Lepton Flavour Universality tests with heavy flavour decays at LHCb Including a new R K result

with ALICE at the LHC M. Gagliardi (INFN Torino) for the ALICE collaboration Workshop on

Day 1 " RNA Search and ! Motif Discovery " Many - PDF document

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA " Genome 541 ! RNA secondary structure prediction " Intro to Computational ! Molecular Biology " 3 Approaches to Structure Prediction

RNA Search and Whirlwind tour of ncRNA search &amp; discovery Motif Discovery RNA motif

Motif Discovery Upper Bound An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

RNA Search and Motif Discovery CSEP 527 Computational Biology Previous Lecture

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Day 1 &quot; Last lecture: ! RNA Search and ! many biologically interesting roles for RNA &quot;

RNA Search and Motif Discovery Lecture 9 CSEP 590A Summer 2006 Outline Whirlwind tour of

TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE Bioinformatics Senior Project Wasay Hussain Spring

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

RNA Search and Motif Discovery Lectures 18-19 CSE 527 Autumn 2007 The Human Parts List, circa

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

Regulatory Motif Prediction in DNA Regulatory Motif Prediction in DNA Introduction: toward

DNA AND RNA ATI TEAS SCIENCE DNA &amp; RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University Four

2 nd semester Topic 22: Talking about the arts For many people there is nothing more highbrow

When Inequality Matters for Macro and Macro Matters for Inequality Discussion by Christopher

CAMPAIGN FINANCE LAWS AND IDEOLOGY IN REGARDS TO CITIZENS UNITED V. FEDERAL ELECTIONS COMMISSION

MOL2NET, 2017 , 3, doi:10.3390/mol2net-03-xxxx 1. Introduction The Asteraceae family

MOL2NET, 2017 , 3, doi:10.3390/mol2net-03-04938 2 Introduction Many active compounds from

Lepton Flavour Universality tests with heavy flavour decays at LHCb Including a new R K result

with ALICE at the LHC M. Gagliardi (INFN Torino) for the ALICE collaboration Workshop on

RNA Search and Whirlwind tour of ncRNA search & discovery Motif Discovery RNA motif

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics