LinearFold: Linear-Time RNA Folding
x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
[Figure: secondary-structure drawing of the example RNA, nucleotides 1-76]
Liang Huang, Baidu Research USA & Oregon State University
Joint work with Dezhong Deng (Oregon State / Baidu) and Kai Zhao (Oregon State / Google) and David Hendrix (Oregon State) and David Mathews (Rochester)
Stanford University School of Medicine, July 2018
A Bit About Myself…
Ph.D., 2008; Research Scientist, 2009; Assistant Professor, 2015-; Principal Scientist, 2018-
• my main area is computational linguistics (aka natural language processing)
• where I develop faster (linear-time) algorithms to understand/translate languages
• but I also apply these algorithms to computational structural biology…
RNA Structure Prediction and Design
RNA sequence: GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
[Figure: arrows labeled "structure prediction" and "design" connecting the RNA sequence, the RNA secondary structure, and the RNA 3D structure; example images: CRISPR/Cas9 (gene editing), M. tuberculosis]
RNA Structure Prediction (Folding)
• allowed pairs: G-C, A-U, G-U
• assume no crossing pairs
• example: transfer RNA (tRNA)
x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
[Figure: the tRNA secondary structure (5’ to 3’, nucleotides 1-76) next to the corresponding parse tree]
• challenge: existing structure prediction algorithms are way too slow: O(n^3)
• solution: borrow linear-time algorithms from natural language parsing
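The pairing rules on this slide can be checked mechanically. The sketch below is only an illustration, not code from LinearFold: the function name `pairs_from_dot_bracket` is made up here, and it assumes the plain dot-bracket notation shown above, where a single bracket type can only express nested (non-crossing) pairs.

```python
# Allowed pairs from the slide: G-C, A-U, G-U (in either orientation).
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}

def pairs_from_dot_bracket(seq, struct):
    """Return the 0-based (i, j) pairs encoded by a dot-bracket string.

    Raises ValueError if the brackets are unbalanced or a pair is not allowed.
    (Hypothetical helper, written for illustration only.)
    """
    if len(seq) != len(struct):
        raise ValueError("sequence and structure differ in length")
    stack, pairs = [], []
    for j, symbol in enumerate(struct):
        if symbol == "(":
            stack.append(j)              # remember the unpaired opening
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            i = stack.pop()              # most recent opening: pairs stay nested
            if seq[i] + seq[j] not in ALLOWED:
                raise ValueError(f"pair {seq[i]}-{seq[j]} is not allowed")
            pairs.append((i, j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unmatched '('")
    return pairs
```

Applied to the tRNA sequence x and structure y above, this check should accept the structure and recover its 21 pairs.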
Our Linear-Time Prediction is Much Faster…
[Figure: running time per sequence (sec) vs. sequence length (nt), one panel for 0-3000 nt and one log-scale panel for 10^3-10^5 nt]
• empirical scaling: Vienna RNAfold ~n^2.6; CONTRAfold MFE ~n^2.6; LinearFold b=100 ~n^1.0; LinearFold b=50 ~n^1.0
• timings marked at 10,000 nt (~HIV): 4min, 7s
• timings marked at 244,296 nt (longest in RNAcentral): ~200hrs, 2 hrs, 120s
• with even slightly better prediction accuracy!!
Computational Linguistics => Computational Biology
(timeline across linguistics, computer science, and biology)
• 1953 Watson & Crick: DNA double-helix
• 1955 Chomsky: context-free grammars
• 1958 Backus & Naur: CFGs in programming lang.
• 1964 Cocke / 1965 Kasami / 1967 Younger: CKY Parsing: O(n^3)
• 1965 Knuth: LR Parsing: O(n)
• 1970 Joshi: tree-adjoining grammars
• 1980s: O(n^3) CKY for RNA structures
• 1985: CKY-style TAG parsing in O(n^6)
• 1985 Shieber: non-CF languages
• 1986 Tomita: Generalized LR Parsing
• ~1990: linear-time greedy parsing
• 1999: TAGs for RNA pseudoknots
• 2010: linear-time DP parsing (Huang & Sagae)
• 2018: LinearFold: O(n) RNA structure prediction
Current Structure Prediction Method: O(n^3)
• Dynamic Programming: O(n^3)
• bottom-up CKY parsing
• example: maximize # of pairs (A-U, G-C, G-U)
[Figure: CKY chart for the sequence ACAGU, with two deduction patterns (pairing i with j around the span i+1..j-1, and splitting the span i..j at k); the top cell holds ((.))]
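A minimal sketch of the bottom-up dynamic program for the toy objective on this slide (maximize the number of pairs). It is a textbook-style cubic recurrence written for illustration, not the scoring model of any real folding tool; the name `max_pairs` is an assumption, and no minimum hairpin length is enforced because the slide's chart allows "()".

```python
# Allowed pairs from the slide: A-U, G-C, G-U (in either orientation).
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}

def max_pairs(seq):
    """Maximum number of nested pairs, filled bottom-up over span length."""
    n = len(seq)
    # best[i][j] = max number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]

    def inside(i, j):
        # Empty spans (i > j) contribute no pairs.
        return best[i][j] if i <= j else 0

    for span in range(1, n):                 # span = j - i, shortest first
        for i in range(n - span):
            j = i + span
            score = 0
            if seq[i] + seq[j] in ALLOWED:   # pair i with j around i+1..j-1
                score = inside(i + 1, j - 1) + 1
            for k in range(i, j):            # split i..j into i..k and k+1..j
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

The three nested loops over `span`, `i`, and `k` are where the O(n^3) running time comes from. For the chart's example sequence, `max_pairs("ACAGU")` gives 2, matching the structure ((.)).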
How to Fold RNAs in Linear-Time?
5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
• idea 0: tag each nucleotide from left to right
• maintain a stack: push “(”, pop “)”, skip “.”
• exhaustive: O(3^n)
(Huang and Sagae, 2010)
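Idea 0 can be sketched as a recursive enumeration: at each nucleotide, try skip, push, and (when the stack top can pair with it) pop. This is an illustrative toy that scores a structure by its number of pairs; the function names are invented here and nothing below is taken from the LinearFold code.

```python
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}

def enumerate_structures(seq):
    """Yield every well-formed dot-bracket structure for seq, left to right."""
    n = len(seq)

    def extend(j, stack, prefix):
        if j == n:
            if not stack:                    # every opening must be closed
                yield prefix
            return
        # skip: leave nucleotide j unpaired
        yield from extend(j + 1, stack, prefix + ".")
        # push: open a bracket at j
        yield from extend(j + 1, stack + (j,), prefix + "(")
        # pop: close the most recent opening if the two nucleotides can pair
        if stack and seq[stack[-1]] + seq[j] in ALLOWED:
            yield from extend(j + 1, stack[:-1], prefix + ")")

    return extend(0, (), "")

def best_by_exhaustive_search(seq):
    """Return a structure with the most pairs (toy objective)."""
    return max(enumerate_structures(seq), key=lambda s: s.count("("))
```

With up to three choices per nucleotide, the search tree has at most 3^n leaves, which is why this version is only usable on very short sequences such as CCAGG.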
How to Fold RNAs in Linear-Time?
• idea 1: DP by merging “equivalent states”
• maintain graph-structured stacks
• DP: O(n^3)
(Huang and Sagae, 2010)
How to Fold RNAs in Linear-Time?
• idea 2: approximate search: beam pruning
• keep only top b states per step
• DP+beam: O(n)
• each DP state corresponds to exponentially many non-DP states
• graph-structured stack (GSS) (Tomita, 1986)
(Huang and Sagae, 2010)
Another View: Left-to-Right CKY
• many variants of CKY ~ various topological orderings
• bottom-up, left-to-right, right-to-left: each reaches the goal item (S, 0, n)
• all O(n^3), but the incremental ones can apply beam search to run in O(n)
On to details...
An Example Path
push, push, skip, pop, pop
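The example path maps directly onto dot-bracket notation: each push writes "(", each skip writes ".", and each pop writes ")". A small sketch, where the helper name `actions_to_structure` is made up for this illustration:

```python
def actions_to_structure(actions):
    """Convert a sequence of push/skip/pop actions into a dot-bracket string."""
    symbol = {"push": "(", "skip": ".", "pop": ")"}
    depth = 0                        # current stack height
    out = []
    for action in actions:
        if action not in symbol:
            raise ValueError(f"unknown action {action!r}")
        if action == "push":
            depth += 1
        elif action == "pop":
            if depth == 0:
                raise ValueError("pop on an empty stack")
            depth -= 1
        out.append(symbol[action])
    if depth != 0:
        raise ValueError("path ends with unclosed openings")
    return "".join(out)

print(actions_to_structure(["push", "push", "skip", "pop", "pop"]))  # ((.))
```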
Version 1: Exhaustive Search: O(3^n)
Idea 1a: Merge Identical Stacks
Merge states with the same full stack (unpaired openings): “Equivalent States”
Version 2: Merge by Full Stack: O(2^n)
• merge states with identical stacks
[Figure: the exhaustive search graph next to the graph after full-stack merge, O(2^n)]
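A sketch of the full-stack merge under the toy objective (number of pairs): all paths that reach the same step with the same stack of unpaired openings are collapsed into one state that keeps only the best-scoring prefix. The function name and the choice to store the prefix string in each state are assumptions made for illustration.

```python
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}

def fold_full_stack_merge(seq):
    """Left-to-right search that merges states with identical stacks.

    Returns (number of pairs, dot-bracket structure).
    """
    # states: stack of unpaired opening positions -> (score, best prefix)
    states = {(): (0, "")}
    for j, nucleotide in enumerate(seq):
        successors = {}
        for stack, (score, prefix) in states.items():
            candidates = [
                (stack, score, prefix + "."),             # skip
                (stack + (j,), score, prefix + "("),      # push
            ]
            if stack and seq[stack[-1]] + nucleotide in ALLOWED:
                candidates.append((stack[:-1], score + 1, prefix + ")"))  # pop
            for new_stack, new_score, new_prefix in candidates:
                # merge: keep only the best path per distinct stack
                if new_stack not in successors or new_score > successors[new_stack][0]:
                    successors[new_stack] = (new_score, new_prefix)
        states = successors
    return states[()]                # the empty stack means every opening was closed
```

Merging is safe here because the future of a path depends only on its stack. Since a stack is a subset of the positions read so far, there can still be up to 2^n distinct states, which matches the O(2^n) bound on the slide.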
Idea 1b: Merge “Temporary Equivalents”
Merge states with the same top of the stack (last unpaired opening): “Temporarily Equivalent States”
[Figure: the full-stack-merge graph, O(2^n)]
Version 3: Merge by Stack Top: O(n^3)
• packing: temporarily equivalent states are merged
• unpacking
[Figure: the full-stack-merge graph, O(2^n), next to the graph after stack-top merge, with packing and unpacking steps marked]
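One way to sketch the stack-top merge for the toy objective: a state is identified only by the step j and the position of the last unpaired opening, and it stores the best number of pairs formed after that opening. On a pop, the state is combined with every state that was alive just before that opening was pushed, which is this sketch's stand-in for following the graph-structured stack. The names and the table layout are assumptions, and for brevity the function returns only the pair count, not the structure.

```python
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}

def max_pairs_left_to_right(seq):
    """Left-to-right DP that merges states sharing the same stack top."""
    n = len(seq)
    # best[j][top] = max pairs among positions top+1 .. j-1, where `top` is the
    # last unpaired opening (-1 stands for the empty stack)
    best = [dict() for _ in range(n + 1)]
    best[0][-1] = 0

    def update(step, top, score):
        if score > best[step].get(top, -1):
            best[step][top] = score

    for j in range(n):
        for top, inner in list(best[j].items()):
            update(j + 1, top, inner)        # skip: j stays unpaired
            update(j + 1, j, 0)              # push: j becomes the new stack top
            if top >= 0 and seq[top] + seq[j] in ALLOWED:
                # pop: pair `top` with j, then combine with each state that
                # existed right before `top` was pushed
                for prev_top, outer in best[top].items():
                    update(j + 1, prev_top, outer + inner + 1)
    return best[n][-1]
```

There are O(n^2) (step, top) states and each pop scans up to O(n) earlier states, which gives the O(n^3) total on the slide.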
Close Up Look at Two Paths
Idea 3: Beam Pruning
[Figure: the full-stack-merge graph, O(2^n), next to the stack-top-merge graph]
Version 4: DP with Beam Search: O(n)
• stack-top merge + beam pruning
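A hedged sketch of stack-top merge plus beam pruning for the toy objective. The slides only say to keep the top b states per step; ranking states by a prefix score (pairs formed before the stack top was pushed, plus pairs formed after it) is an assumption of this sketch, as are the names `beam_fold_pairs`, `beam`, and `entry`. The result is approximate: with a small beam it can miss the optimum.

```python
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}

def beam_fold_pairs(seq, beam=100):
    """Approximate max-pairs folding: stack-top merge with beam pruning."""
    if beam < 1:
        raise ValueError("beam must be at least 1")
    n = len(seq)
    # best[j][top] = pairs formed after the stack top `top` (-1: empty stack)
    best = [dict() for _ in range(n + 1)]
    best[0][-1] = 0
    # entry[t] = best prefix score at the moment position t was pushed
    # (assumed ranking criterion, not stated on the slides)
    entry = {-1: 0}

    def update(step, top, score):
        if score > best[step].get(top, -1):
            best[step][top] = score

    for j in range(n):
        # beam pruning: keep only the `beam` states with the best prefix score
        ranked = sorted(best[j].items(),
                        key=lambda item: entry[item[0]] + item[1],
                        reverse=True)
        best[j] = dict(ranked[:beam])
        entry[j] = max(entry[top] + inner for top, inner in best[j].items())
        for top, inner in best[j].items():
            update(j + 1, top, inner)        # skip
            update(j + 1, j, 0)              # push
            if top >= 0 and seq[top] + seq[j] in ALLOWED:
                for prev_top, outer in best[top].items():   # pop
                    update(j + 1, prev_top, outer + inner + 1)
    # If no fully closed state survived pruning, fall back to the
    # all-unpaired structure, which has 0 pairs.
    return best[n].get(-1, 0)
```

After pruning, each step holds at most `beam` states and each pop scans at most `beam` earlier states, so the work per step is bounded by a function of `beam` alone and the total grows linearly in n for a fixed beam. When the beam is large enough that nothing is pruned, the result is exact: `beam_fold_pairs("CCAGG", beam=100)` gives 2.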
Recap: O(3^n) to O(n^3) to O(n)
• 5 search algorithms
  • DP: bottom-up CKY: O(n^3)
  • left-to-right (exhaustive): O(3^n)
  • DP: merge by full stack: O(2^n)
  • DP: merge by stack top: O(n^3)
  • approx. DP via beam search: O(n)
• this is a simple illustration in which we just maximize the number of pairs
• our real systems work with complicated feature templates
[Figure: search graphs over the prefixes C, CC, CCA, CCAG, CCAGG for four settings: no DP, O(3^n); DP + full stack merge, O(2^n); DP+GSS, O(n^3); LinearFold (+beam, approx. DP), O(n)]