Truly Subcubic Algorithms for Language Edit Distance and RNA Folding

Truly Subcubic Algorithms for Language Edit Distance and RNA Folding via Fast Bounded-Difference Min-Plus Product
Karl Bringmann, Fabrizio Grandoni, Barna Saha, Virginia Vassilevska Williams
June 11, 2017

Bounded Differences (BD) Matrices
An integer matrix $M$ has BD if for all $i, j$:
$|M[i,j] - M[i,j+1]| \le 1$ and $|M[i,j] - M[i+1,j]| \le 1$
Example:
2 2 3 2
1 1 2 3
2 1 2 3
1 0 1 2
More generally: $W$-BD when differences are at most $W$.
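The definition can be checked mechanically. The following is a minimal sketch, not taken from the slides: the function name `is_bounded_difference` is hypothetical, and the 4×4 matrix is the slide's example with its rows read in the order they were recovered.

```python
def is_bounded_difference(M, W=1):
    """Return True if horizontally and vertically adjacent entries differ by at most W."""
    rows, cols = len(M), len(M[0])
    for i in range(rows):
        for j in range(cols):
            # Compare with the right neighbour, if there is one.
            if j + 1 < cols and abs(M[i][j] - M[i][j + 1]) > W:
                return False
            # Compare with the neighbour below, if there is one.
            if i + 1 < rows and abs(M[i][j] - M[i + 1][j]) > W:
                return False
    return True


# Example matrix from the slide (row order is an assumption of this sketch).
example = [
    [2, 2, 3, 2],
    [1, 1, 2, 3],
    [2, 1, 2, 3],
    [1, 0, 1, 2],
]
print(is_bounded_difference(example))
```

Passing `W=2` instead checks the more general $W$-BD condition.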

(min,+) Product
For $n \times n$ matrices $A, B$, their (min,+) product $C = A * B$ is defined by
$C[i,j] = \min_k \left( A[i,k] + B[k,j] \right)$
The (min,+) product is equivalent to All Pairs Shortest Paths [Fischer, Meyer '71].
Trivial algorithm: $O(n^3)$
Best known algorithm: $n^3 / 2^{\Theta(\sqrt{\log n})}$ [Williams '14]
Standard matrix multiplication: $C[i,j] = \sum_k A[i,k] \cdot B[k,j]$, in time $O(n^\omega)$ where $\omega \le 2.373$.
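As a point of reference, the trivial cubic algorithm follows the definition directly. This is a minimal sketch; the name `min_plus_product` and the sample weight matrix `W` are illustrative and not from the slides.

```python
INF = float("inf")


def min_plus_product(A, B):
    """Trivial O(n^3) (min,+) product: C[i][j] = min over k of A[i][k] + B[k][j]."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Edge weights of a small directed graph (INF = no edge, 0 on the diagonal).
W = [
    [0, 3, INF],
    [INF, 0, 1],
    [2, INF, 0],
]
# Squaring W under (min,+) gives shortest paths that use at most two edges.
print(min_plus_product(W, W))
```

Squaring the weight matrix repeatedly is one direction of the link to All Pairs Shortest Paths: after about $\log n$ squarings the entries are shortest-path distances.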

(min,+) Product
Big Open Problem: Is the (min,+) product computable in time $O(n^{3-\varepsilon})$ for some $\varepsilon > 0$?
Study special cases!

(min,+) Product for Structured Matrices
Matrices with small entries [Alon, Galil, Margalit '97]:
If $A, B$ have entries in $\{-T, \ldots, T\} \cup \{\infty\}$, then $A * B$ can be computed in time $\tilde{O}(T n^\omega)$.
Sketch:
$A'[i,j] = x^{A[i,j]}$ (and likewise $B'$)
$C'[i,j] = \sum_k A'[i,k] \cdot B'[k,j]$
$C[i,j] = \min_k \left( A[i,k] + B[k,j] \right)$ = degree of the lowest-degree monomial in $C'[i,j]$
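The encoding idea can be tried out with ordinary big integers. This is a minimal sketch under stated assumptions: the indeterminate $x$ is replaced by the number $n+1$, exponents are shifted by $T$ so they are non-negative, and the (+,·) product is computed naively here, where a real implementation would call a fast matrix multiplication routine. The name `min_plus_small_entries` is hypothetical.

```python
INF = float("inf")


def min_plus_small_entries(A, B, T):
    """(min,+) product for entries in {-T..T} or INF via one ordinary (+,*) product."""
    n = len(A)
    X = n + 1  # base exceeds the number of summands, so base-X digits never carry

    def encode(M):
        # INF becomes the zero polynomial; a finite v becomes X^(v + T).
        return [[0 if v == INF else X ** (v + T) for v in row] for row in M]

    A1, B1 = encode(A), encode(B)
    C = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = sum(A1[i][k] * B1[k][j] for k in range(n))  # entry of the (+,*) product
            if s == 0:
                continue  # no finite sum exists, the entry stays INF
            e = 0
            while s % X == 0:  # position of the lowest non-zero base-X digit
                s //= X
                e += 1
            C[i][j] = e - 2 * T  # undo the two shifts by T
    return C


A = [[0, -1], [2, INF]]
B = [[1, INF], [-2, 0]]
print(min_plus_small_entries(A, B, T=2))
```

The numbers involved have about $T \log n$ bits, which is where the factor $T$ in the running time comes from.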

(min,+) Product for Structured Matrices
Matrices with few distinct entries [Yuster '09]:
If each row of $A$ has a small number of distinct entries, then for arbitrary $B$ we can compute $A * B$ in truly subcubic time.
Question: Is the (min,+) product computable in time $O(n^{3-\varepsilon})$ for BD matrices?
Why care about BD matrices?

1st Application: Language Edit Distance (LED)
For simplicity: $|G| = O(1)$.
CFG Parsing: Given a context-free grammar $G$ and a string $s$ of length $n$, is $s$ in $L(G)$?
... is in time $\tilde{O}(n^\omega)$ [L. Valiant '75]
Language Edit Distance ("error-correcting CFG parsing"): Given a CFG $G$ and a string $s$, compute the minimum edit distance (insertions, deletions, substitutions) of $s$ to any string in $L(G)$.
... is in time $O(n^3)$ [Aho, Peterson '72]
We show using Valiant's approach (~8 page proof): If the (min,+) product on BD matrices is in time $O(n^\alpha)$, then LED is in time $\tilde{O}(n^\alpha)$.
Intuitive reason for BD: LED($s$) and LED($sc$) differ by $\le 1$ for any symbol $c$.

2nd Application: RNA Folding
RNA can be seen as a sequence of symbols from {A, C, G, U}.
Biologists want to predict the secondary structure of RNA: A can pair with U, and C can pair with G.
Given an RNA sequence, find the largest set of matching pairs such that no two pairs intersect.
(Figure: two pairings of AUUGCAG, one with crossing pairs, which is not allowed, and one without crossings, which is okay.)
... is in time $O(n^3)$ [Nussinov, Jacobson '80]
... can be cast as a LED problem (without substitutions)
If the (min,+) product on BD matrices is in time $O(n^\alpha)$, then RNA Folding is in time $\tilde{O}(n^\alpha)$.
Disclaimer: No author of this paper is a biologist.
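The cubic-time baseline is a standard interval dynamic program. The sketch below is an illustration of that idea for the problem exactly as stated on the slide (A pairs with U, C pairs with G, no two pairs cross); it is not claimed to be the formulation of [Nussinov, Jacobson '80], and the name `max_noncrossing_pairs` is hypothetical.

```python
def max_noncrossing_pairs(rna):
    """Largest number of non-crossing A-U / C-G pairs, by O(n^3) interval DP."""
    allowed = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = optimum for the substring rna[i..j] (inclusive); short intervals are 0.
    best = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            value = best[i + 1][j]  # option 1: position i stays unpaired
            for k in range(i + 1, j + 1):  # option 2: pair position i with position k
                if (rna[i], rna[k]) in allowed:
                    inside = best[i + 1][k - 1] if k - 1 >= i + 1 else 0
                    outside = best[k + 1][j] if k + 1 <= j else 0
                    value = max(value, 1 + inside + outside)
            best[i][j] = value
    return best[0][n - 1]


print(max_noncrossing_pairs("AUUGCAG"))
```

For the slide's string AUUGCAG the optimum pairs positions (1,2), (3,6) and (4,5), counting from 1.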

3rd Application: Optimal Stack Generation
For simplicity: $|\Sigma| = O(1)$.
Optimal Stack Generation: Given a string $s$ over alphabet $\Sigma$, determine the shortest sequence of stack operations push(.), emit, pop such that performing these operations starting from an empty stack will emit $s$ and end with an empty stack.
Example: $s$ = bab is emitted by push(b), emit, push(a), emit, pop, emit, pop.
... is in time $O(n^3)$ (dynamic programming) [Tarjan '05]
We show: If the (min,+) product on $O(1)$-BD matrices is in time $O(n^\alpha)$, then Optimal Stack Generation is in time $\tilde{O}(n^\alpha)$.
Intuitive reason for BD: OSG($s$) and OSG($sc$) differ by $\le 3$ for any $c \in \Sigma$.
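To make the operations concrete, here is a small simulator. It is a sketch under one assumption read off the slide's example: emit outputs the symbol currently on top of the stack without removing it. The name `run_stack_program` and the encoding of operations as strings and tuples are hypothetical.

```python
def run_stack_program(ops):
    """Run push/emit/pop operations from an empty stack and return the emitted string."""
    stack, emitted = [], []
    for op in ops:
        if op == "emit":
            emitted.append(stack[-1])  # assumed semantics: output the top, keep it on the stack
        elif op == "pop":
            stack.pop()
        else:
            _, symbol = op  # a push is written as ("push", symbol)
            stack.append(symbol)
    if stack:
        raise ValueError("program must end with an empty stack")
    return "".join(emitted)


program = [("push", "b"), "emit", ("push", "a"), "emit", "pop", "emit", "pop"]
print(run_stack_program(program), len(program))
```

The seven operations above emit bab, while pushing, emitting and popping each symbol separately would take nine.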

Main Result
So we have seen that the (min,+) product of BD matrices is well motivated.
Main Result: We can compute the (min,+) product of BD matrices in randomized time $O(n^{2.83})$ and deterministic time $O(n^{2.87})$ (here: $O(n^{2.9})$).
Generalization: For a $W$-BD matrix $A$ with $W \ll n^{3-\omega} \approx n^{0.626}$ and arbitrary $B$, we can compute their (min,+) product in randomized truly subcubic time.

Algorithm Sketch
Input: BD matrices $A, B$. Want: $C[i,j] = \min_k \left( A[i,k] + B[k,j] \right)$
1) Compute approximation $D[i,j] = C[i,j] \pm O(n^{0.2})$, in time $O(n^{2.6})$:
compute $C[i,j]$ exactly for all $i, j$ that are multiples of $n^{0.2}$;
set $D[i,j]$ to some $C[i',j']$ by rounding $i, j$.
If $A, B$ are BD, then their (min,+) product is also BD.
(Figure: a point $(i,j)$ and its rounded grid point $(i',j')$, at distance at most $n^{0.2}$.)
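Step 1 can be sketched as follows, assuming indices are rounded down to the nearest multiple of a stride `g` (which plays the role of $n^{0.2}$). The name `approximate_min_plus` and the test matrices are illustrative only. Because the product of 1-BD matrices is again 1-BD, each entry of `D` is off by at most $2(g-1)$.

```python
def approximate_min_plus(A, B, g):
    """Additive approximation of the (min,+) product of BD matrices A, B.

    Exact values are computed only where both indices are multiples of g;
    every other entry copies the value of the grid point it rounds down to.
    """
    n = len(A)
    grid = {}
    for i in range(0, n, g):
        for j in range(0, n, g):
            grid[i, j] = min(A[i][k] + B[k][j] for k in range(n))
    return [[grid[i - i % g, j - j % g] for j in range(n)] for i in range(n)]


n = 7
A = [[abs(i - j) for j in range(n)] for i in range(n)]   # a 1-BD matrix
B = [[(i + j) // 2 for j in range(n)] for i in range(n)]  # another 1-BD matrix
D = approximate_min_plus(A, B, g=3)
print(D[0])
```

Only $(n/g)^2$ entries are computed exactly, each in time $O(n)$, which gives $n^{1.6} \cdot n = n^{2.6}$ for $g = n^{0.2}$.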

Algorithm Sketch
Input: BD matrices $A, B$. Want: $C[i,j] = \min_k \left( A[i,k] + B[k,j] \right)$
1) Compute approximation $D[i,j] = C[i,j] \pm O(n^{0.2})$
$A[i,k] + B[k,j] = C[i,j]$ implies $|A[i,k] + B[k,j] - D[i,j]| \le O(n^{0.2})$.
Call these triples $(i,k,j)$ relevant.
Then $C[i,j] = \min_{k : (i,k,j) \text{ relevant}} \left( A[i,k] + B[k,j] \right)$.

Algorithm Sketch
$(i,k,j)$ relevant: $|A[i,k] + B[k,j] - D[i,j]| \le O(n^{0.2})$
2) Cover most relevant triples: fix $i^*, j^*$, and define matrices $A^*, B^*$:
$A^*[i,k] := A[i,k] + B[k,j^*] - D[i,j^*] - \left( A[i^*,k] + B[k,j^*] - D[i^*,j^*] \right)$
$B^*[k,j] := A[i^*,k] + B[k,j] - D[i^*,j]$
(min,+) product $C^*$ of $A^*, B^*$:
$C^*[i,j] = \min_k \left( A^*[i,k] + B^*[k,j] \right) = C[i,j] - D[i,j^*] + D[i^*,j^*] - D[i^*,j]$
The three $D$ terms can be cancelled afterwards.
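The cancellation can be verified term by term; the derivation below uses only the two definitions from this slide. The terms $B[k,j^*]$ and $A[i^*,k]$ each appear once with a plus and once with a minus sign, and everything that remains besides $A[i,k] + B[k,j]$ is independent of $k$.

```latex
\begin{align*}
A^*[i,k] + B^*[k,j]
  &= \bigl(A[i,k] + B[k,j^*] - D[i,j^*]\bigr)
     - \bigl(A[i^*,k] + B[k,j^*] - D[i^*,j^*]\bigr) \\
  &\quad + \bigl(A[i^*,k] + B[k,j] - D[i^*,j]\bigr) \\
  &= A[i,k] + B[k,j] - D[i,j^*] + D[i^*,j^*] - D[i^*,j], \\[4pt]
\min_k \bigl(A^*[i,k] + B^*[k,j]\bigr)
  &= C[i,j] - D[i,j^*] + D[i^*,j^*] - D[i^*,j].
\end{align*}
```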

Algorithm Sketch
$(i,k,j)$ relevant: $|A[i,k] + B[k,j] - D[i,j]| \le O(n^{0.2})$
2) Cover most relevant triples, with $A^*, B^*$ defined from the fixed $i^*, j^*$:
If $(i,k,j^*)$, $(i^*,k,j^*)$ and $(i^*,k,j)$ are all relevant, then $A^*[i,k], B^*[k,j] = O(n^{0.2})$.
Set all $\Omega(n^{0.2})$-entries of $A^*, B^*$ to $\infty$.
Then the (min,+) product of $A^*$ and $B^*$ can be computed in time $\tilde{O}(n^{\omega+0.2})$, using the algorithm for matrices with small entries with $T = O(n^{0.2})$.
$(i,k,j)$ is "covered" if $A^*[i,k]$ and $B^*[k,j]$ are $O(n^{0.2})$, i.e., not set to $\infty$.

Algorithm Sketch
$(i,k,j)$ relevant: $|A[i,k] + B[k,j] - D[i,j]| \le O(n^{0.2})$
$(i,k,j)$ is "covered" if $A^*[i,k]$ and $B^*[k,j]$ are $O(n^{0.2})$ in some round.
Input: BD matrices $A, B$. Want: $C[i,j] = \min_k \left( A[i,k] + B[k,j] \right)$
1) Compute approximation $D[i,j] = C[i,j] \pm O(n^{0.2})$
2) Cover most relevant triples:
initialize $\hat{C}[i,j] := \infty$
repeat for $O(n^{0.3} \log n)$ rounds (that is, $\tilde{O}(n^{0.3})$ iterations):
  pick $i^*, j^*$ randomly
  $A^*[i,k] := A[i,k] + B[k,j^*] - D[i,j^*] - \left( A[i^*,k] + B[k,j^*] - D[i^*,j^*] \right)$
  $B^*[k,j] := A[i^*,k] + B[k,j] - D[i^*,j]$
  set all $\Omega(n^{0.2})$-entries of $A^*, B^*$ to $\infty$
  compute the (min,+) product $C^* = A^* * B^*$ (time $\tilde{O}(n^{\omega+0.2}) = O(n^{2.6})$)
  $\hat{C}[i,j] := \min\{ \hat{C}[i,j],\; C^*[i,j] + D[i,j^*] - D[i^*,j^*] + D[i^*,j] \}$
Total time = $O(n^{2.9})$
Lemma: After $O(n^\rho \log n)$ rounds there are $O(n^{3-\rho/3} + n^{2.5})$ uncovered relevant triples w.h.p.; here this is $\tilde{O}(n^{2.9})$.
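The exponent 2.9 can be read as the point where the cost of the rounds and the number of triples left over balance. The short calculation below assumes that $\rho = 0.3$ is the value plugged into the lemma, which is how the $n^{0.3}$ rounds on this slide are interpreted here; it uses the slide's bound $O(n^{2.6})$ per round.

```latex
\begin{align*}
\text{rounds} \times \text{time per round}
  &= n^{\rho} \cdot n^{2.6} = n^{2.6+\rho}
  && \text{(up to logarithmic factors)} \\
\text{uncovered relevant triples}
  &= n^{3-\rho/3} + n^{2.5} \\
2.6 + \rho = 3 - \tfrac{\rho}{3}
  &\iff \rho = 0.3,
  && \text{so both terms are } n^{2.9}.
\end{align*}
```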

Algorithm Sketch
$(i,k,j)$ relevant: $|A[i,k] + B[k,j] - D[i,j]| \le O(n^{0.2})$
$(i,k,j)$ is "covered" if $A^*[i,k]$ and $B^*[k,j]$ are $O(n^{0.2})$ in some round.
Input: BD matrices $A, B$. Want: $C[i,j] = \min_k \left( A[i,k] + B[k,j] \right)$
1) Compute approximation $D[i,j] = C[i,j] \pm O(n^{0.2})$
2) Cover most relevant triples
3) Enumerate uncovered relevant triples:
"for each uncovered relevant $(i,k,j)$:" $\hat{C}[i,j] := \min\{ \hat{C}[i,j],\; A[i,k] + B[k,j] \}$
Now $\hat{C}$ is the correct output.
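The three steps can be put together in a small reference implementation. This is only a sketch of the structure: every inner step is done naively in cubic time, covered triples are stored in an explicit set, and the parameters `g` (grid stride, standing in for $n^{0.2}$), `rounds` and `seed`, as well as the name `bd_min_plus_sketch`, are illustrative choices rather than anything fixed by the slides. It assumes 1-BD integer inputs.

```python
import random

INF = float("inf")


def bd_min_plus_sketch(A, B, g=2, rounds=3, seed=0):
    """Exact (min,+) product of 1-BD integer matrices, following steps 1-3 naively."""
    n = len(A)
    rng = random.Random(seed)

    # Step 1: exact values on a grid of stride g, copied elsewhere by rounding down.
    grid = {(i, j): min(A[i][k] + B[k][j] for k in range(n))
            for i in range(0, n, g) for j in range(0, n, g)}
    D = [[grid[i - i % g, j - j % g] for j in range(n)] for i in range(n)]
    delta = 2 * g  # |C - D| <= 2(g - 1) <= delta, since C is again 1-BD

    # Step 2: random rounds; a triple is covered if neither entry would be set to infinity.
    C_hat = [[INF] * n for _ in range(n)]
    covered = set()
    for _ in range(rounds):
        i0, j0 = rng.randrange(n), rng.randrange(n)
        A_s = [[A[i][k] + B[k][j0] - D[i][j0] - (A[i0][k] + B[k][j0] - D[i0][j0])
                for k in range(n)] for i in range(n)]
        B_s = [[A[i0][k] + B[k][j] - D[i0][j] for j in range(n)] for k in range(n)]
        for i in range(n):
            for k in range(n):
                if abs(A_s[i][k]) > 2 * delta:
                    continue  # large entry of A*, treated as infinity
                for j in range(n):
                    if abs(B_s[k][j]) > delta:
                        continue  # large entry of B*, treated as infinity
                    covered.add((i, k, j))
                    # Add back the three D terms that were subtracted.
                    value = A_s[i][k] + B_s[k][j] + D[i][j0] - D[i0][j0] + D[i0][j]
                    C_hat[i][j] = min(C_hat[i][j], value)

    # Step 3: handle every relevant triple that no round covered.
    for i in range(n):
        for k in range(n):
            for j in range(n):
                s = A[i][k] + B[k][j]
                if (i, k, j) not in covered and abs(s - D[i][j]) <= delta:
                    C_hat[i][j] = min(C_hat[i][j], s)
    return C_hat


n = 6
A = [[abs(i - j) for j in range(n)] for i in range(n)]
B = [[(i + j) // 2 for j in range(n)] for i in range(n)]
print(bd_min_plus_sketch(A, B)[0])
```

The output is exact for any number of rounds, because every triple that attains the minimum is relevant and is therefore either covered in step 2 or enumerated in step 3; the number of rounds only shifts work between the two steps.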
