
The Double Helix. CSE 421: Intro to Algorithms, Summer 2007, W. L. Ruzzo



  1. The Double Helix. CSE 421: Intro to Algorithms, Summer 2007, W. L. Ruzzo. Dynamic Programming, II: RNA Folding. Los Alamos Science. http://www.rcsb.org/pdb/explore.do?structureId=1GAT
     The “Central Dogma” of Molecular Biology: DNA → RNA → Protein
     [Figure labels: Protein, gene, DNA (chromosome), RNA (messenger), cell]
     Non-coding RNA
     • Messenger RNA: codes for proteins
     • Non-coding RNA: all the rest
       – Before, say, the mid-1990s, 1-2 dozen known (critically important, but narrow roles: e.g. ribosomal and transfer RNA, splicing, SRP)
     • Since the mid-90s, dramatic discoveries
     • Regulation, transport, stability/degradation
     • E.g. “microRNA”: hundreds in humans
     • E.g. “riboswitches”: thousands in bacteria

  2. DNA structure: dull
     …ACCGCTAGATG…
     …TGGCGATCTAC…
     RNA Structure: Rich
     • RNAs fold, and function
     • Nature uses what works
     http://www.rcsb.org/pdb/explore.do?structureId=1EVV
     RNA Secondary Structure: Not everything, but important, easier than 3d

  3. Why is structure important?
     • For protein-coding, similarity in sequence is a powerful tool for finding related sequences
       – e.g. “hemoglobin” is easily recognized in all vertebrates
     • For non-coding RNA, many different sequences have the same structure, and structure is most important for function.
       – So, using structure plus sequence, one can find related sequences at much greater evolutionary distances
     Q: What’s so hard?
     A: Structure often more important than sequence
     [Figure: RNA secondary-structure diagrams, captioned “6S mimics an open promoter”. Labels: Chloroflexi, Chloroflexus aurantiacus; δ-Proteobacteria, Geobacter metallireducens, Geobacter sulphurreducens; Symbiobacterium thermophilum; E.coli; “Used by CMfinder”; “Found by scan”. Barrick et al. RNA 2005; Trotochaud et al. NSMB 2005; Willkomm et al. NAR 2005]

  4. “Central Dogma” = “Central Chicken & Egg”?
     DNA → RNA → Protein
     [Figure labels: Protein, gene, DNA (chromosome), RNA (messenger), cell]
     Was there once an “RNA World”?
     6.5 RNA Secondary Structure: Nussinov’s Algorithm

  5. RNA Secondary Structure
     RNA. String $B = b_1 b_2 \ldots b_n$ over alphabet { A, C, G, U }.
     Secondary structure. RNA is usually single-stranded, and tends to loop back and form base pairs with itself. This structure is essential for understanding the behavior of the molecule.
     Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA
     Complementary base pairs: A-U, C-G
     [Figure: folded structure of the example sequence]
     RNA Secondary Structure (somewhat oversimplified)
     Secondary structure. A set of pairs $S = \{ (b_i, b_j) \}$ that satisfy:
     • [Watson-Crick.]
       – S is a matching, i.e. each base pairs with at most one other, and
       – each pair in S is a Watson-Crick pair: A-U, U-A, C-G, or G-C.
     • [No sharp turns.] The ends of each pair are separated by at least 4 intervening bases. If $(b_i, b_j) \in S$, then $i < j - 4$.
     • [Non-crossing.] If $(b_i, b_j)$ and $(b_k, b_l)$ are two pairs in S, then we cannot have $i < k < j < l$. (Violation of this is called a pseudoknot.)
     Free energy. Usual hypothesis is that an RNA molecule will form the secondary structure with the optimum total free energy (approximate by number of base pairs).
     Goal. Given an RNA molecule $B = b_1 b_2 \ldots b_n$, find a secondary structure S that maximizes the number of base pairs.
     RNA Secondary Structure: Examples
     [Figure: three example foldings with base pairs marked, labeled “ok”, “sharp turn” (≤ 4 bases between the paired ends), and “crossing”]
     RNA Secondary Structure: Subproblems
     First attempt. OPT[j] = maximum number of base pairs in a secondary structure of the substring $b_1 b_2 \ldots b_j$.
     [Figure: positions 1, t, j on the string; match $b_t$ and $b_j$]
     Difficulty. Results in two sub-problems.
     • Finding secondary structure in $b_1 b_2 \ldots b_{t-1}$: this is OPT(t-1).
     • Finding secondary structure in $b_{t+1} b_{t+2} \ldots b_{j-1}$: not OPT of anything; need more sub-problems.
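The three conditions on S (Watson-Crick matching, no sharp turns, non-crossing) can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_structure` and the representation of S as a list of 1-based `(i, j)` index pairs are assumptions made for illustration.

```python
def is_valid_structure(seq, pairs):
    """Return True if `pairs` (1-based (i, j) tuples, an assumed encoding of S)
    satisfies the three secondary-structure conditions for the string `seq`."""
    watson_crick = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    used = set()
    for i, j in pairs:
        # Indices must lie inside the string, with i to the left of j.
        if not (1 <= i < j <= len(seq)):
            return False
        # Matching: each base pairs with at most one other base.
        if i in used or j in used:
            return False
        used.update((i, j))
        # Each pair must be A-U, U-A, C-G, or G-C.
        if (seq[i - 1], seq[j - 1]) not in watson_crick:
            return False
        # No sharp turns: i < j - 4.
        if not i < j - 4:
            return False
    # Non-crossing: no two pairs with i < k < j < l (a pseudoknot).
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True
```

For example, on the string GGAAAACC the nested pairs (1,8), (2,7) pass all three checks, while (1,7), (2,8) fail the non-crossing check.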

  6. Dynamic Programming Over Intervals (R. Nussinov’s algorithm)
     Notation. OPT[i, j] = maximum number of base pairs in a secondary structure of the substring $b_i b_{i+1} \ldots b_j$.
     • Case 1. If $i \ge j - 4$.
       – OPT[i, j] = 0 by the no-sharp-turns condition.
     • Case 2. Base $b_j$ is not involved in a pair.
       – OPT[i, j] = OPT[i, j-1]
     • Case 3. Base $b_j$ pairs with $b_t$ for some $i \le t < j - 4$.
       – the non-crossing constraint decouples the resulting sub-problems
       – OPT[i, j] = 1 + max_t { OPT[i, t-1] + OPT[t+1, j-1] }, taking the max over t such that $i \le t < j - 4$ and $b_t$ and $b_j$ are Watson-Crick complements
     Key point: Either the last base is unpaired (case 1, 2) or paired (case 3).
     Remark. Same core idea in CKY algorithm to parse context-free grammars.
     Bottom Up Dynamic Programming Over Intervals
     Q. What order to solve the sub-problems?
     A. Do shortest intervals first.
         RNA(b_1, …, b_n) {
            for k = 5, 6, …, n-1
               for i = 1, 2, …, n-k
                  j = i + k
                  Compute OPT[i, j] using recurrence
            return OPT[1, n]
         }
     Running time. $O(n^3)$.
     [Figure: upper-triangular OPT table with rows i, columns j, filled along diagonals of increasing k]
     Computing one cell: OPT[2,18] = ?
     Example with n = 20: GGGAAAACCCAAAGGGGUUU, folded as (((....)))(((....)))
     • Case 1: 2 ≥ 18-4? no.
     • Case 2: $b_{18}$ unpaired? Always a possibility; then OPT[2,18] ≥ 3, e.g.
         GGAAAACCCAAAGGGGU
         ((....))(....)...
     Example with n = 16: CUCCGGUUGCAAUGUC, folded as ((.(....).)..)..
     • E.g.: OPT[1,6] = 1:
         CUCCGG
         (....)
     • E.g.: OPT[6,16] = 2:
         GUUGCAAUGUC
         ((....)...)
     [Figure: filled-in OPT tables for the two example sequences; entries omitted]
     $$\mathrm{OPT}[i,j] = \begin{cases} 0 & \text{if } i \ge j-4 \\ \max\left\{ \mathrm{OPT}[i,j-1],\ 1 + \max_t \left( \mathrm{OPT}[i,t-1] + \mathrm{OPT}[t+1,j-1] \right) \right\} & \text{otherwise} \end{cases}$$
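The bottom-up procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the course's own code: the function name `max_base_pairs` and the padded 1-based table are choices made here, and only the count of pairs is returned, not the structure itself.

```python
def max_base_pairs(seq):
    """Maximum number of base pairs under the slide's rules (Watson-Crick pairs,
    i < j - 4, non-crossing), computed bottom-up over interval length."""
    n = len(seq)
    watson_crick = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    # opt[i][j] uses 1-based i, j. The table is padded so that out-of-range
    # reads such as opt[i][i-1] exist and equal 0 (Case 1 and empty intervals).
    opt = [[0] * (n + 2) for _ in range(n + 2)]
    for k in range(5, n):                # interval length offset, shortest first
        for i in range(1, n - k + 1):
            j = i + k
            best = opt[i][j - 1]         # Case 2: b_j is unpaired
            for t in range(i, j - 4):    # Case 3: i <= t < j - 4
                if (seq[t - 1], seq[j - 1]) in watson_crick:
                    best = max(best, 1 + opt[i][t - 1] + opt[t + 1][j - 1])
            opt[i][j] = best
    return opt[1][n]
```

On the slide's examples, `max_base_pairs("CUCCGG")` gives 1 and `max_base_pairs("GUUGCAAUGUC")` gives 2, matching OPT[1,6] = 1 and OPT[6,16] = 2. The three nested loops over k, i, and t are where the $O(n^3)$ running time comes from.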
