Time-Space Trade-Offs for the Longest Common Substring Problem

Tatiana Starikovskaya (1) and Hjalte Wedel Vildhøj (2)
(1) Moscow State University, Department of Mechanics and Mathematics, tat.starikovskaya@gmail.com
(2) Technical University of Denmark, DTU Compute, hwv@hwv.dk

CPM 2013, Bad Herrenalb, Germany, June 17, 2013
The Longest Common Substring Problem

Definition
Problem: Given strings T_1, T_2, ..., T_m of total length n and a parameter d with 2 ≤ d ≤ m, compute the longest substring that appears in at least d of the strings.

Example (positions 1 to 13)
T_1 = aggctagctacct
T_2 = acacctaccctag
T_3 = actagtaatgcat

d = 3 ⇒ LCS = ctag
d = 2 ⇒ LCS = ctacc
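The definition can be checked directly on the example with a brute-force sketch. It only illustrates the problem statement and is not the method of the talk; the function name `longest_common_substring` is hypothetical, and its running time is far from the bounds discussed later.

```python
def longest_common_substring(strings, d):
    """Return a longest substring occurring in at least d of the strings.

    Brute force: try lengths from longest to shortest and count, for each
    candidate substring, how many distinct input strings contain it.
    """
    longest = max(len(s) for s in strings)
    for length in range(longest, 0, -1):
        counts = {}
        for s in strings:
            # A set ensures each string contributes at most once per substring.
            for sub in {s[i:i + length] for i in range(len(s) - length + 1)}:
                counts[sub] = counts.get(sub, 0) + 1
        hits = sorted(sub for sub, c in counts.items() if c >= d)
        if hits:
            return hits[0]  # ties are broken lexicographically
    return ""


# The three strings from the example slide.
T1 = "aggctagctacct"
T2 = "acacctaccctag"
T3 = "actagtaatgcat"
```

Calling `longest_common_substring([T1, T2, T3], 3)` and `longest_common_substring([T1, T2, T3], 2)` should give the two answers stated on the slide.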
The Longest Common Substring Problem

A patented solution
A Textbook Solution

1. Build the generalized suffix tree of T_1 = aggctagctacct and T_2 = acacctaccctag.

[Figure: generalized suffix tree of T_1 and T_2]

Space: Θ(n)
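A full suffix tree is too long to show here, so the sketch below swaps in a plain suffix array over T_1 and T_2 joined with separators, and takes the best longest common prefix among adjacent suffixes that come from different strings. This is a stand-in for the textbook approach, not the slide's construction: the name `lcs_two_strings` and the separator characters are assumptions, and the naive sorting is quadratic rather than linear.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcs_two_strings(t1, t2):
    """Longest common substring of t1 and t2 via a naive suffix array."""
    sep1, sep2 = "\x00", "\x01"  # assumed not to occur in t1 or t2
    t = t1 + sep1 + t2 + sep2
    boundary = len(t1)  # 0-based index of sep1; smaller indices belong to t1

    # Naive suffix sorting (illustration only; quadratic time and space).
    sa = sorted(range(len(t)), key=lambda i: t[i:])

    best = ""
    for a, b in zip(sa, sa[1:]):
        if (a < boundary) == (b < boundary):
            continue  # both suffixes start in the same string
        k = lcp_len(t[a:], t[b:])
        if k > len(best):
            best = t[a:a + k]
    return best
```

Because the separators are unique, a common prefix of two suffixes from different strings can never run across a separator, so every reported match is a genuine common substring.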
Our Results

Question
Can the LCS problem be solved (deterministically) in O(n^{1−ε}) space and O(n^{1+ε}) time for 0 ≤ ε ≤ 1?

Our Answer
Yes if 0 ≤ ε ≤ 1/3. More precisely:

For two strings (d = m = 2), the problem can be solved in
  Time:  O(n^{1+ε})
  Space: O(n^{1−ε})
for any 0 < ε ≤ 1/3.

In the general case (2 ≤ d ≤ m), the problem can be solved in
  Time:  O(n^{1+ε} log^2 n (d log^2 n + d^2))
  Space: O(n^{1−ε})
for any 0 ≤ ε < 1/3.
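To get a feel for the two-string bounds, here is a worked instance with illustrative values ε = 1/3 and n = 10^6. Both values are chosen here for the example, and the constants hidden by the O-notation are ignored.

```latex
\[
  n^{1-\varepsilon} = \left(10^{6}\right)^{2/3} = 10^{4},
  \qquad
  n^{1+\varepsilon} = \left(10^{6}\right)^{4/3} = 10^{8}.
\]
```

At this extreme of the trade-off, the space bound drops from the order of 10^6 (the Θ(n) of the suffix tree) to the order of 10^4, at the cost of a time bound of the order of 10^8.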
A Solution for Two Strings
When the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = T_1 $_1 T_2 $_2 = aggctagctacct $_1 acacctaccctag $_2
(positions 1 to 28: T_1 occupies 1 to 13, $_1 is at 14, T_2 occupies 15 to 27, $_2 is at 28)

The sample is given by a difference cover DC_τ repeated in every block of τ positions.

Difference Covers
A difference cover modulo τ is a set of integers DC_τ ⊆ {0, 1, ..., τ − 1} such that for any distance d ∈ {0, 1, ..., τ − 1}, DC_τ contains two elements separated by distance d modulo τ.

Ex: The set DC_τ = {1, 2, 4} is a difference cover modulo 5.

d      0     1     2     3     4
i, j   1,1   2,1   1,4   4,1   1,2

(each pair satisfies i − j ≡ d mod 5)
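The difference cover property is easy to check mechanically. The sketch below uses hypothetical helper names and assumes that a position p is sampled when p mod τ lies in DC_τ; that rule is an assumption, although it agrees with the positions listed in SA_τ later in the talk.

```python
def difference_cover_witnesses(dc, tau):
    """Map each distance d in 0..tau-1 to a pair (i, j) from dc with
    (i - j) % tau == d. Return None if some distance is not covered."""
    witnesses = {}
    for i in sorted(dc):
        for j in sorted(dc):
            # Keep the first pair found for each distance.
            witnesses.setdefault((i - j) % tau, (i, j))
    if len(witnesses) < tau:
        return None
    return witnesses


def sampled_positions(n, dc, tau):
    """1-indexed positions p in 1..n whose residue p % tau is in dc
    (assumed sampling rule)."""
    return [p for p in range(1, n + 1) if p % tau in dc]
```

Several pairs can realize the same distance, so the witnesses returned for {1, 2, 4} modulo 5 need not coincide with the pairs in the table; any pair with the right difference is valid.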
- Number of sampled suffixes: O((n/τ) · |DC_τ|) = O(n/√τ).
- The LCS is the LCP of two suffixes, one from each string.
- If |LCS| ≥ τ, one of the first τ characters of the LCS is sampled in both strings.
- Hence the LCS corresponds to a pair (p_1^*, p_2^*) maximizing

    lcp(RB(p_1), RB(p_2)) + lcp(T[p_1..], T[p_2..]) − 1,

  where RB(p) is the reversed block of τ characters ending at position p. The −1 accounts for the character at the sampled position, which both terms count.

  Ex: RB(11) = (gctac)^R = catcg.
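The maximization over sampled pairs can be sketched by simply trying every pair, which is the quadratic baseline that the talk then sets out to beat. The function names, the separator characters standing in for $_1 and $_2, and the sampling rule p mod τ ∈ DC_τ are assumptions made for this illustration. The result is only guaranteed to be the LCS when |LCS| ≥ τ.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reversed_block(t, p, tau):
    """RB(p): the (up to) tau characters ending at 1-indexed position p,
    reversed. The block is truncated at the start of t."""
    return t[max(0, p - tau):p][::-1]


def lcs_via_samples(t1, t2, dc, tau):
    """Try all pairs of sampled positions, one in each string, and maximize
    lcp(RB(p1), RB(p2)) + lcp(T[p1..], T[p2..]) - 1."""
    t = t1 + "\x00" + t2 + "\x01"  # stand-ins for $_1 and $_2
    in_t1 = [p for p in range(1, len(t1) + 1) if p % tau in dc]
    in_t2 = [p for p in range(len(t1) + 2, len(t1) + len(t2) + 2)
             if p % tau in dc]

    best = ""
    for p1 in in_t1:
        for p2 in in_t2:
            back = lcp(reversed_block(t, p1, tau), reversed_block(t, p2, tau))
            fwd = lcp(t[p1 - 1:], t[p2 - 1:])
            length = back + fwd - 1  # T[p] is counted by both terms
            if length > len(best):
                start = p1 - back  # 0-indexed start of the match in t
                best = t[start:start + length]
    return best
```

On the running example with τ = 5 and DC_τ = {1, 2, 4}, the maximum is attained at p_1 = 11 and p_2 = 22: the reversed blocks catcg and catcc share 4 characters, the suffixes share 2, and 4 + 2 − 1 = 5 = |ctacc|.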
A Solution for Two Strings
When the LCS is long

How to compute the pair (p_1^*, p_2^*) faster than O(n^2/τ)?

T = aggctagctacct $_1 acacctaccctag $_2   (positions 1 to 28)

SA_τ    = [14, 21, 17, 26, 6, 1, 16, 22, 11, 12, 19, 24, 4, 27, 7, 2, 9]
LCP_τ   = [0, 3, 1, 2, 2, 0, 1, 2, 1, 2, 3, 4, 0, 1, 1, 0]
SA^R_τ  = [14, 1, 17, 21, 26, 6, 16, 22, 11, 19, 12, 24, 4, 2, 27, 7, 9]
LCP^R_τ = [0, 1, 1, 4, 3, 0, 2, 4, 1, 3, 2, 1, 0, 2, 4, 0]

Main observation: lcp(T[p_1^*..], T[p_2^*..]) ∈ [ℓ_max − τ + 1; ℓ_max], so we can ignore all pairs with lcp values smaller than ℓ_max − τ + 1.
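The arrays SA_τ and LCP_τ for the running example can be reproduced with a naive construction. The sketch below assumes τ = 5, DC_τ = {1, 2, 4}, the sampling rule p mod τ ∈ DC_τ, and separator characters that sort before every letter; the function name is hypothetical. It sorts whole suffixes directly, so it does not achieve the space bounds of the talk.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def sparse_suffix_array(t, dc, tau):
    """Sort the sampled suffixes of t (1-indexed start positions) and return
    (sa, lcps), where lcps[i] is the LCP of the suffixes at sa[i], sa[i+1]."""
    positions = [p for p in range(1, len(t) + 1) if p % tau in dc]
    sa = sorted(positions, key=lambda p: t[p - 1:])  # naive suffix sorting
    lcps = [lcp(t[a - 1:], t[b - 1:]) for a, b in zip(sa, sa[1:])]
    return sa, lcps


# "\x00" and "\x01" stand in for $_1 and $_2.
T = "aggctagctacct" + "\x00" + "acacctaccctag" + "\x01"
SA_TAU, LCP_TAU = sparse_suffix_array(T, {1, 2, 4}, 5)
```

The reversed arrays SA^R_τ and LCP^R_τ follow the same pattern with the reversed blocks RB(p) as sort keys; equal blocks would need a tie-breaking rule, which this sketch does not fix.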