RNA Secondary Structure
CSE 417, W.L. Ruzzo

The Double Helix
[Figure: the double helix. Credit: Los Alamos Science]

The “Central Dogma” of Molecular Biology
DNA → RNA → Protein
[Figure: cell diagram with labels: protein, gene, DNA (chromosome), RNA (messenger), cell]

Non-coding RNA
• Messenger RNA - codes for proteins
• Non-coding RNA - all the rest
  – Before, say, mid 1990’s, 1-2 dozen known (critically important, but narrow roles: e.g. ribosomal and transfer RNA, splicing, SRP)
• Since mid 90’s dramatic discoveries
• Regulation, transport, stability/degradation
• E.g. “microRNA”: hundreds in humans
• E.g. “riboswitches”: thousands in bacteria
DNA structure: dull
…ACCGCTAGATG…
…TGGCGATCTAC…

RNA Structure: Rich
• RNAs fold, and function
• Nature uses what works

Q: What’s so hard?
[Figure: RNA secondary structure drawings]
A: Structure often more important than sequence

Why is structure Important?
• For protein-coding, similarity in sequence is a powerful tool for finding related sequences
  – e.g. “hemoglobin” is easily recognized in all vertebrates
• For non-coding RNA, many different sequences have the same structure, and structure is most important for function.
  – So, using structure plus sequence, can find related sequences at much greater evolutionary distances
6S mimics an open promoter
[Figure: structure drawings with labels: Chloroflexi, Chloroflexus aurantiacus, δ-Proteobacteria, Geobacter metallireducens, Geobacter sulfurreducens, Symbiobacterium thermophilum, E. coli; annotations: “Used by CMfinder”, “Found by scan”]
Barrick et al. RNA 2005; Trotochaud et al. NSMB 2005; Willkomm et al. NAR 2005

“Central Dogma” = “Central Chicken & Egg”?
DNA → RNA → Protein
[Figure: cell diagram with labels: protein, gene, DNA (chromosome), RNA (messenger), cell]
Was there once an “RNA World”?
6.5 RNA Secondary Structure

RNA Secondary Structure
RNA. String B = b_1 b_2 … b_n over alphabet { A, C, G, U }.
Secondary structure. RNA is single-stranded so it tends to loop back and form base pairs with itself. This structure is essential for understanding behavior of molecule.
Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA
[Figure: folded drawing of the example sequence; complementary base pairs: A-U, C-G]

RNA Secondary Structure
Secondary structure. A set of pairs S = { (b_i, b_j) } that satisfy:
– [Watson-Crick.] S is a matching and each pair in S is a Watson-Crick pair: A-U, U-A, C-G, or G-C.
– [No sharp turns.] The ends of each pair are separated by at least 4 intervening bases. If (b_i, b_j) ∈ S, then i < j - 4.
– [Non-crossing.] If (b_i, b_j) and (b_k, b_l) are two pairs in S, then we cannot have i < k < j < l.
Free energy. Usual hypothesis is that an RNA molecule will form the secondary structure with the optimum total free energy (approximate by number of base pairs).
Goal. Given an RNA molecule B = b_1 b_2 … b_n, find a secondary structure S that maximizes the number of base pairs.

RNA Secondary Structure: Examples
Examples.
[Figure: three example structures with base pairs marked, labeled “ok”, “sharp turn” (annotated “≤ 4”), and “crossing”]
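The three conditions above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_structure`, the 1-based `(i, j)` tuple representation of pairs, and the short test sequence are all assumptions made for illustration.

```python
# Watson-Crick complements as ordered (left base, right base) tuples.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def is_valid_structure(rna, pairs):
    """Return True if `pairs` is a secondary structure of `rna`.

    `pairs` is a list of 1-based index pairs (i, j) with i < j
    (a hypothetical representation chosen for this sketch).
    """
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(rna)):
            return False
        # Matching: each base takes part in at most one pair.
        if i in used or j in used:
            return False
        used.update((i, j))
        # Watson-Crick: A-U, U-A, C-G, or G-C only.
        if (rna[i - 1], rna[j - 1]) not in WATSON_CRICK:
            return False
        # No sharp turns: at least 4 intervening bases, i.e. i < j - 4.
        if not i < j - 4:
            return False
    # Non-crossing: no two pairs (i, j), (k, l) with i < k < j < l.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True


rna = "CUCCGGUUGCAAUGUC"
print(is_valid_structure(rna, [(1, 14), (2, 11), (4, 9)]))  # nested pairs
print(is_valid_structure(rna, [(8, 12)]))                   # sharp turn
print(is_valid_structure(rna, [(1, 6), (5, 10)]))           # crossing
```

The quadratic non-crossing loop keeps the sketch close to the definition; a stack-based scan would be faster but less direct.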
RNA Secondary Structure: Subproblems
First attempt. OPT(j) = maximum number of base pairs in a secondary structure of the substring b_1 b_2 … b_j.
[Figure: the sequence from 1 to n, with b_t matched to b_n]
Difficulty. Results in two sub-problems.
– Finding secondary structure in: b_1 b_2 … b_{t-1}. This is OPT(t-1).
– Finding secondary structure in: b_{t+1} b_{t+2} … b_{n-1}. Need more sub-problems.

Dynamic Programming Over Intervals
Notation. OPT(i, j) = maximum number of base pairs in a secondary structure of the substring b_i b_{i+1} … b_j.
Case 1. If i ≥ j - 4.
– OPT(i, j) = 0 by no-sharp turns condition.
Case 2. Base b_j is not involved in a pair.
– OPT(i, j) = OPT(i, j-1)
Case 3. Base b_j pairs with b_t for some i ≤ t < j - 4.
– non-crossing constraint decouples resulting sub-problems
– OPT(i, j) = 1 + max_t { OPT(i, t-1) + OPT(t+1, j-1) }
  (take max over t such that i ≤ t < j-4 and b_t and b_j are Watson-Crick complements)
Remark. Same core idea in CKY algorithm to parse context-free grammars.

Bottom Up Dynamic Programming Over Intervals
Q. What order to solve the sub-problems?
A. Do shortest intervals first.

RNA(b_1,…,b_n) {
   for k = 5, 6, …, n-1
      for i = 1, 2, …, n-k
         j = i + k
         Compute M[i, j] using recurrence
   return M[1, n]
}

[Figure: table M with rows i = 1 … 4 and columns j = 6 … 9]
Running time. O(n^3).

Example: CUCCGGUUGCAAUGUC, n = 16
Optimal structure: ((.(....).)..)..
E.g.: OPT(1,6) = 1:
  CUCCGG
  (....)
E.g.: OPT(6,16) = 2:
  GUUGCAAUGUC
  ((....)...)
[Table: filled-in OPT(i, j) matrix for this sequence (upper triangle). Row i = 1, for j = 1 … 16: 0 0 0 0 0 1 1 1 1 1 2 2 2 3 3 3]
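The recurrence and the shortest-intervals-first fill order can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the course's reference code: the names `max_base_pairs`, `traceback`, and `dot_bracket` are hypothetical, the table is padded so that 1-based indices match the slides, and ties between equally good structures are broken arbitrarily (Case 2 is preferred, then the smallest t).

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_base_pairs(rna):
    """Fill M[i][j] = OPT(i, j) bottom-up, shortest intervals first."""
    n = len(rna)
    # Padded table: M[i][i-1] and all intervals with i >= j - 4 stay 0 (Case 1).
    M = [[0] * (n + 2) for _ in range(n + 2)]
    choice = {}  # (i, j) -> t paired with j, or None if b_j is unpaired
    for k in range(5, n):              # k = j - i = 5, 6, ..., n-1
        for i in range(1, n - k + 1):  # i = 1, 2, ..., n-k
            j = i + k
            best, best_t = M[i][j - 1], None  # Case 2: b_j unpaired
            for t in range(i, j - 4):         # Case 3: i <= t < j - 4
                if (rna[t - 1], rna[j - 1]) in WATSON_CRICK:
                    cand = 1 + M[i][t - 1] + M[t + 1][j - 1]
                    if cand > best:
                        best, best_t = cand, t
            M[i][j] = best
            choice[(i, j)] = best_t
    return M, choice


def traceback(choice, n):
    """Recover one optimal set of 1-based pairs from the recorded choices."""
    pairs, stack = [], [(1, n)]
    while stack:
        i, j = stack.pop()
        if i >= j - 4:
            continue  # Case 1: interval too short to hold a pair
        t = choice[(i, j)]
        if t is None:
            stack.append((i, j - 1))
        else:
            pairs.append((t, j))
            stack.append((i, t - 1))
            stack.append((t + 1, j - 1))
    return sorted(pairs)


def dot_bracket(n, pairs):
    """Render pairs as a dot-bracket string."""
    s = ["."] * n
    for i, j in pairs:
        s[i - 1], s[j - 1] = "(", ")"
    return "".join(s)


rna = "CUCCGGUUGCAAUGUC"
M, choice = max_base_pairs(rna)
pairs = traceback(choice, len(rna))
print(M[1][len(rna)], dot_bracket(len(rna), pairs))
```

The three nested loops (over k, i, and t) give the O(n^3) running time. Several optimal structures can exist for one sequence, so the structure printed by the traceback may differ from the one drawn on the slide while having the same number of pairs.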