Edit Distance: Sketching, Streaming and Document Exchange
Djamal Belazzougui (CERIST, Algeria), Qin Zhang (IU Bloomington)
FOCS 2016, Oct. 9, 2016
Edit Distance
Definition: Given two strings s, t ∈ Σ^n, ed(s, t) = minimum number of character operations (insertion/deletion/substitution) that transform s to t.
Example: ed(banana, ananas) = 2.
Applications: numerous. E.g., bioinformatics (measuring similarity between DNA sequences), automatic spelling correction.
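To make the definition concrete, here is a minimal sketch of the textbook dynamic program for edit distance. It is a quadratic-time reference only, not one of the algorithms from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s: str, t: str) -> int:
    """Textbook O(|s| * |t|) dynamic program, keeping only two rows."""
    # prev[j] = edit distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("banana", "ananas"))  # the example from the slide
```

For the slide's example, one optimal edit script deletes the leading "b" and appends "s".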
Problems
The threshold version of ED: Given two strings s, t ∈ {0,1}^n and a threshold K, output all the edits if ed(s, t) ≤ K, output "Error" otherwise.
Models/Problems:
• Document exchange: the party holding s sends a message sk(s) to the party holding t. App: remote file sync; file transmission through a noisy channel.
• Sketching: sketches sk(s) and sk(t) are computed separately from s and t. App: distributed similarity join.
• Streaming: s and t are read as a stream by a CPU with limited RAM.
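The input/output contract of the threshold problem can be sketched with a plain dynamic program plus backtracking. This is only a centralized, quadratic-time baseline that sees both strings in full, so it does not address the document exchange, sketching, or streaming models. The function name `threshold_edits` and the edit-tuple format are assumptions made for this illustration.

```python
def threshold_edits(s, t, K):
    """Return a list of edits turning s into t if ed(s, t) <= K, else "Error".

    Each edit is (operation, position in s, character); this format is a
    hypothetical choice for the example.
    """
    n, m = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    if d[n][m] > K:
        return "Error"

    # Walk back from (n, m) to (0, 0), recording one optimal edit script.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i - 1] == t[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match, no edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            edits.append(("substitute", i - 1, t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits.append(("delete", i - 1, s[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, t[j - 1]))
            j -= 1
    edits.reverse()
    return edits


print(threshold_edits("banana", "ananas", 2))
print(threshold_edits("banana", "ananas", 1))
```

The number of returned edits equals ed(s, t) whenever the threshold is met, since each recorded step lowers the table value by exactly one.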
Previous and our results
K: distance threshold; n: input size. For simplicity, assume K < n^0.1.
[Comparison of previous results not recovered from the slide; it includes the IMS scheme.]
• Information-theoretically optimal communication for K ≤ 2^√(log n), under almost linear encoding/decoding time, for doc-exchange.
• First sketching/streaming algorithm with poly(K, log n) size/space. Note: Ω(n) LB for linear sketches (Andoni, Goldberger, McGregor, Porat, STOC'13).
Main Tool: CGK Embedding
Our main tool: CGK embedding
The CGK embedding (Chakraborty, Goldenberg, Koucky, STOC'16; similar idea by Saha, FOCS'14)
f : s ∈ {0,1}^n → s′ ∈ {0,1}^{3n}.
Two counters i and j, both initialized to 1. For j = 1, 2, ... steps:
1. s′[j] ← s[i].
2. Flip a coin; if head, then i ← i + 1. Stop when i = n + 1.
3. j ← j + 1.
[Figure: example strings s and s′ with the positions of counters i and j marked.]
Property: If ed(s, t) = k, then k/2 ≤ ham(f(s), f(t)) ≤ O(k^2) w.pr. 0.99.
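The three steps above can be sketched in a few lines. Two details here are assumptions that the slide does not spell out: the coin at step j is shared between all embedded strings and depends on the bit being copied (one coin for bit 0, one for bit 1), and once s is exhausted the output is padded with zeros up to length 3n. The names `shared_coins`, `cgk_embed`, and `hamming` are illustrative.

```python
import random


def shared_coins(length, seed=0):
    """One pair of coin flips per output position j.

    coins[j][b] == 1 means "advance the read pointer at step j when the
    copied bit is b". Sharing the coins across strings is an assumption.
    """
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(length)]


def cgk_embed(s, coins):
    """Embed a list of bits s of length n into a list of 3n bits."""
    n = len(s)
    out = []
    i = 0  # 0-based read pointer into s (the slide counts from 1)
    for j in range(3 * n):
        if i < n:
            out.append(s[i])      # step 1: copy the current bit
            i += coins[j][s[i]]   # step 2: advance on "head"
        else:
            out.append(0)         # assumed zero padding after s is consumed
    return out


def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


s = [1, 0, 1, 1, 0, 1, 1, 1]
t = [0, 1, 1, 0, 1, 1, 1, 1]  # s with its first bit deleted and a 1 appended
coins = shared_coins(3 * len(s), seed=2016)
print(hamming(cgk_embed(s, coins), cgk_embed(t, coins)))
```

The printed Hamming distance varies with the seed; the property on the slide is a probabilistic statement over the coin flips, not a guarantee for any single choice of coins.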
CGK as a random walk
CGK → a random walk on two strings.
[Figure: CGK maps s to s′ and t to t′; at output position j, pointer p is in s and pointer q is in t.]
The shift (p − q) is a random walk on the line.
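A small simulation can make the walk visible. The sketch below assumes, as an illustration, that both strings use the same coins and that the coin at each step depends on the bit under the pointer, so the pointers move in lockstep while s[p] = t[q] and the shift can only change when the bits differ. The name `shift_trace` is hypothetical.

```python
import random


def shift_trace(s, t, coins):
    """Return the sequence of shifts p - q while both pointers are in range.

    coins[j][b] == 1 means "advance at step j if the current bit is b".
    """
    p = q = 0
    trace = []
    for c in coins:
        if p >= len(s) or q >= len(t):
            break
        # Both updates use the pointer values from before this step.
        p, q = p + c[s[p]], q + c[t[q]]
        trace.append(p - q)
    return trace


rng = random.Random(7)
coins = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(24)]
s = [1, 0, 1, 1, 0, 1, 1, 1]
t = [0, 1, 1, 0, 1, 1, 1, 1]
print(shift_trace(s, t, coins))
```

On identical strings the trace stays at 0 throughout, because both pointers always read the same bit and therefore see the same coin.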
Document Exchange
[Figure: the party holding s sends sk(s) to the party holding t.]
App: remote file sync; file transmission through a noisy channel
Warning: I will cheat in multiple places
Technique overview: document exchange
Main idea: If we can find ≤ K pairs of blocks in s and t, each of size K^99, such that they contain all the edits, then IMS gives O(K log^2 K). (Recall IMS gives O(K log n log(n/K)).)
Question: if they exist, how to identify these pairs?
CGK (edit-space → ham-space) + random partition to blocks + error-correcting code for Ham w.r.t. blocks + reverse mapping
Challenge: the O(K^2) errors after CGK embedding can possibly be distributed into O(K^2) pairs of blocks. This may introduce a factor of K^2 in the communication of the error-correcting step, which is too much.
• Can reduce O(K^2) pairs to O(K) by removing long common periodic substrings.
• Not easy: everything has to be done using one-way comm.!
Technique overview: document exchange (cont.)
[Figure: CGK maps s to s′ and t to t′; pointers p in s and q in t at output position j.]
Call a walk step from state (p, q) a progress step if s[p] ≠ t[q] and one of these cases happens [cases shown in the slide figure, not recovered].
Call a seq. of walk steps from a state (p, q) where the next progress step happens, to the first state (p′, q′) where ed(s[p′...n], t[q′...n]) = ed(s[p...n], t[q...n]) − 1, a progress phase.
• a progress phase ⇔ a pair of mismatching blocks
• ≤ K progress phases ⇒ ≤ K pairs of mismatching blocks
• # random walk steps in a progress phase ⇔ size of the mismatching block
Technique overview: document exchange (cont.)
Call a seq. of walk steps from a state (p, q) where a (the next) progress step happens, to the first state (p′, q′) where ed(s[p′...n], t[q′...n]) = ed(s[p...n], t[q...n]) − 1, a progress phase.
• ≤ K progress phases ⇒ ≤ K pairs of mismatching blocks
• # random walk steps in a progress phase ⇔ size of the mismatching block
Whp, a progress phase "consumes" ≤ K^10 progress steps.
Can show that after properly removing long common periods, we get a progress step in ≤ K^50 random walk steps.