Faster Longest Common Extension Queries in Strings over General Alphabets
Paweł Gawrychowski (1, 2), Tomasz Kociumaka (1), Wojciech Rytter (1), Tomasz Waleń (1)
(1) University of Warsaw, Poland, [gawry,kociumaka,rytter,walen]@mimuw.edu.pl
(2) University of Haifa, Israel
CPM 2016, Tel Aviv, Israel, 2016-06-27
Introduction: LCE problem
We consider the Longest Common Extension (LCE) problem over a general ordered alphabet, where the only operation available on characters is comparison.
Preprocess a given word $w$ of length $n$ for queries $\mathrm{LCE}(i, j)$: the length of the longest common prefix of the suffixes of $w$ starting at positions $i$ and $j$.
Example
w = a b a b b a b b a b a a (positions 1 to 12)
$\mathrm{LCE}(2, 8) = 3$: both suffixes start with "bab", and then $w[5] = b$ differs from $w[11] = a$.
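The definition can be checked directly on the example word. The following is a minimal sketch of a brute-force query with no preprocessing, using 1-based positions as on the slide; the name `naive_lce` is an illustrative choice, not something from the talk.

```python
def naive_lce(w: str, i: int, j: int) -> int:
    """Length of the longest common prefix of w[i..] and w[j..] (1-based)."""
    n = len(w)
    length = 0
    # Extend the match one character at a time, using only equality tests.
    while (i + length <= n and j + length <= n
           and w[i + length - 1] == w[j + length - 1]):
        length += 1
    return length


w = "ababbabbabaa"
print(naive_lce(w, 2, 8))  # 3
```

A single such query can take time linear in $n$, which is what the preprocessing in the rest of the talk is meant to avoid.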
Results
Naive solution: answering $n$ LCE queries can be done in $O(n \log n)$ time (reduce the alphabet to $[1..n]$ via sorting).
Previous results, Kosolobov (IPL, 2016): answering $n$ LCE queries can be done in $O(n \log^{2/3} n)$ time, with a conjecture that $O(n)$ time is possible.
Motivation: efficient computation of runs (Bannai et al., 2015).
Our result: answering $n$ LCE queries can be done in $O(n \log\log n)$ time, using $O(n)$ symbol comparisons.
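The alphabet reduction behind the naive solution can be sketched as follows. This is only an illustration under assumptions: the names `reduce_alphabet` and `compare` are hypothetical, and the three-way comparator stands in for the comparison oracle of a general ordered alphabet. After this step any standard integer-alphabet LCE structure can be applied.

```python
from functools import cmp_to_key


def compare(x, y):
    """Three-way comparison: the only operation assumed on characters."""
    return (x > y) - (x < y)


def reduce_alphabet(w, compare=compare):
    """Replace each character by its rank in [1..n], using O(n log n) comparisons."""
    order = sorted(range(len(w)),
                   key=cmp_to_key(lambda a, b: compare(w[a], w[b])))
    ranks = [0] * len(w)
    rank = 0
    for idx, pos in enumerate(order):
        # A new rank starts whenever the character differs from its predecessor.
        if idx == 0 or compare(w[order[idx - 1]], w[pos]) != 0:
            rank += 1
        ranks[pos] = rank
    return ranks


print(reduce_alphabet("banana"))  # [2, 1, 3, 1, 3, 1]
```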
Difference cover
t-cover (difference cover): a set $S(t) \subseteq [1..n]$ is called a $t$-cover of $[1..n]$ if:
◮ $S(t)$ is $t$-periodic: for each $i \in [1..n-t]$, $i \in S(t) \Leftrightarrow i + t \in S(t)$;
◮ there is a constant-time computable function $h$ such that for $1 \le i, j \le n - t$: $0 \le h(i, j) \le t$ and $i + h(i, j),\ j + h(i, j) \in S(t)$.
Lemma. For each $t \le n$ there is a $t$-cover $S(t)$ of size $O(n/\sqrt{t})$ which can be constructed in $O(n/\sqrt{t})$ time.
t-cover, example
$S(6) = \{2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23\}$ on positions 1 to 24; the pattern repeats with period 6.
For $i = 3$, $j = 10$ we have $h(3, 10) = 5$, since $3 + 5 = 8$ and $10 + 5 = 15$ both belong to $S(6)$.
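The example can be reproduced with a few lines of code. This is a sketch under assumptions: the cover is described by its residues modulo $t$ (here $\{2, 3, 5\}$ modulo 6, read off the slide), and `h` is found by scanning at most $t$ shifts, whereas the definition asks for a constant-time function (for instance via a precomputed table). The helper names are illustrative.

```python
def t_cover(residues, t, n):
    """All positions p in [1..n] whose residue p mod t lies in `residues`."""
    return [p for p in range(1, n + 1) if p % t in residues]


def h(i, j, residues, t):
    """Smallest shift d in [0, t) such that i + d and j + d are both in the cover."""
    for d in range(t):
        if (i + d) % t in residues and (j + d) % t in residues:
            return d
    raise ValueError("residues do not form a difference cover modulo t")


RESIDUES_6 = {2, 3, 5}
print(t_cover(RESIDUES_6, 6, 24))
print(h(3, 10, RESIDUES_6, 6))  # 5
```

The scan always succeeds here because every value modulo 6 is a difference of two elements of $\{2, 3, 5\}$.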
ShortLCE_t vs CoarseLCE_t
ShortLCE_t: $\mathrm{ShortLCE}_t(i, j) = \min(\mathrm{LCE}(i, j), t)$. It finds the length of the LCE, but only up to a maximal length of $t$.
CoarseLCE_t: $\mathrm{CoarseLCE}_t(i, j) = \lfloor \mathrm{LCE}(i, j) / t \rfloor$ if $i, j \in S(t)$, and $\bot$ otherwise. It finds the length of the LCE, but only with a precision of $t$ characters.
Generic algorithm
Algorithm 1: GenericLCE_t(i, j)
    ℓ1 = ShortLCE_t(i, j)
    if ℓ1 < t then return ℓ1
    ∆ = h_t(i, j)                ⊲ i + ∆, j + ∆ ∈ S(t)
    ℓ2 = t · CoarseLCE_t(i + ∆, j + ∆)
    ℓ3 = ShortLCE_t(i + ∆ + ℓ2, j + ∆ + ℓ2)
    return ∆ + ℓ2 + ℓ3
[Figure: the suffixes starting at i and j. ShortLCE covers the first ℓ1 characters, the shift ∆ reaches positions in S(t), CoarseLCE covers ℓ2 characters in blocks of length t, and a final ShortLCE covers the remaining ℓ3 characters.]
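Algorithm 1 can be exercised end to end by plugging in brute-force oracles. The sketch below assumes naive stand-ins for ShortLCE, CoarseLCE and `h` (none of them has the running time claimed in the talk) and the residue set $\{2, 3, 5\}$ modulo 6 from the earlier example; only the control flow of `generic_lce` mirrors the slide.

```python
def naive_lce(w, i, j):
    """Brute-force LCE with 1-based positions; returns 0 past the end of w."""
    n, length = len(w), 0
    while (i + length <= n and j + length <= n
           and w[i + length - 1] == w[j + length - 1]):
        length += 1
    return length


def short_lce(w, i, j, t):
    return min(naive_lce(w, i, j), t)


def coarse_lce(w, i, j, t, residues):
    if i % t not in residues or j % t not in residues:
        return None  # stands for the undefined value on the slide
    return naive_lce(w, i, j) // t


def h(i, j, residues, t):
    """Smallest shift d in [0, t) moving both positions into the cover."""
    for d in range(t):
        if (i + d) % t in residues and (j + d) % t in residues:
            return d
    raise ValueError("residues do not form a difference cover modulo t")


def generic_lce(w, i, j, t, residues):
    l1 = short_lce(w, i, j, t)
    if l1 < t:
        return l1                      # short answer, nothing more to do
    delta = h(i, j, residues, t)       # i + delta, j + delta are in S(t)
    l2 = t * coarse_lce(w, i + delta, j + delta, t, residues)
    l3 = short_lce(w, i + delta + l2, j + delta + l2, t)
    return delta + l2 + l3


w = "abaababaabaab" * 4
print(generic_lce(w, 1, 14, 6, {2, 3, 5}) == naive_lce(w, 1, 14))  # True
```

The shift is safe because `delta` is smaller than $t$, and the first step has already established that at least $t$ characters match.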
CoarseLCE_t
CoarseLCE_t algorithm for $t = \Omega(\log^2 n)$:
◮ reduce the word $w$ to a new word $\mathrm{code}(w)$ that is:
    ◮ shorter (of length $O(n/\sqrt{t})$),
    ◮ over the small alphabet $[1..n]$;
◮ use the naive solution (with suffix arrays) on $\mathrm{code}(w)$.
CoarseLCE_t
CoarseLCE_t algorithm for $t = \Omega(\log^2 n)$:
1. sort all $t$-blocks starting in $S(t)$ and remove duplicates,
2. encode every $t$-block with its rank on the sorted list,
3. construct a new string $\mathrm{code}(w)$ of length $O(n/\log n)$ over the alphabet $[1..n]$, such that any $\mathrm{CoarseLCE}_t$ query can be reduced to an LCE query on $\mathrm{code}(w)$,
4. preprocess $\mathrm{code}(w)$ for LCE queries.
Example ($t = 6$, $S(6) = \{2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23\}$; blocks running past the end of $w$ are padded with *):
w = b a a b b a a b b a a a b b a a b b a a a b b b
α (blocks at 2, 8, 14, 20): 1 8 6 2
β (blocks at 3, 9, 15, 21): 3 5 1 4
γ (blocks at 5, 11, 17, 23): 6 1 8 7
code(w) = 1 8 6 2 $ 3 5 1 4 # 6 1 8 7 &
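The construction of code(w) can be replayed on the slide's example. The sketch below makes several assumptions that are not in the talk: blocks are sorted by plain string comparison rather than with ShortLCE queries, the padding symbol `*` is assumed smaller than every letter, the separators $, #, & are represented by the negative integers -1, -2, -3, and the LCE query on code(w) is brute force. All function names are hypothetical.

```python
def build_code(w, t, residues, pad="*"):
    """Return (code, index): the encoded string and the slot of each block start."""
    n = len(w)
    padded = w + pad * t
    starts = [p for p in range(1, n + 1) if p % t in residues]
    block = {p: padded[p - 1:p - 1 + t] for p in starts}
    # Steps 1 and 2: sort distinct blocks and name each block by its rank.
    rank = {b: r for r, b in enumerate(sorted(set(block.values())), start=1)}
    code, index = [], {}
    # Step 3: one run per residue class, each closed by a unique separator.
    for k, r in enumerate(sorted(residues), start=1):
        for p in starts:
            if p % t == r:
                index[p] = len(code)
                code.append(rank[block[p]])
        code.append(-k)
    return code, index


def coarse_lce(code, index, i, j):
    """Number of equal consecutive t-blocks from i and j (i != j, both in S(t))."""
    a, b, length = index[i], index[j], 0
    while (a + length < len(code) and b + length < len(code)
           and code[a + length] == code[b + length]):
        length += 1
    return length


w = "baabbaabbaaabbaabbaaabbb"
code, index = build_code(w, 6, {2, 3, 5})
print(code)  # [1, 8, 6, 2, -1, 3, 5, 1, 4, -2, 6, 1, 8, 7, -3]
print(coarse_lce(code, index, 2, 11))  # 2
```

Within one residue class consecutive block starts are exactly $t$ apart, so a run of equal codes corresponds to a run of equal, non-overlapping $t$-blocks in $w$.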
CoarseLCE_t, continued
Lemma. For $t = \Omega(\log^2 n)$ we can lexicographically sort all $t$-blocks of $w$ starting in $S(t)$ using $O(n)$ $\mathrm{ShortLCE}_t$ queries and $O(n)$ additional time.
Lemma. For $t = \Omega(\log^2 n)$, if we can answer $O(n)$ $\mathrm{ShortLCE}_t$ queries in $T(n)$ time (e.g. $O(n \log t)$), then we can preprocess $w$ in $O(T(n) + n)$ time (resp. $O(n \log t)$), so that any $\mathrm{CoarseLCE}_t$ query can be answered in constant time.
ShortLCE_t
$\mathrm{ShortLCE}_t$ is computed recursively, for $t = 2^k$:
◮ we have $k$ levels (level $h$ handles queries up to length $2^h$),
◮ each level has its own separate Union-Find structure,
◮ if at level $h$ we find out that two positions $i$ and $j$ have $\mathrm{LCE}(i, j) \ge 2^h$, then we union those positions,
◮ so if $\mathrm{Find}_h(i) = \mathrm{Find}_h(j)$ then $\mathrm{LCE}(i, j) \ge 2^h$; otherwise we have no information about $\mathrm{LCE}(i, j)$.
ShortLCE_t
Algorithm 2: ShortLCE_{2^k}(i, j): compute LCE(i, j) up to length 2^k
    if Find_k(i) = Find_k(j) then return 2^k
    if k = 0 then
        if w[i] = w[j] then ℓ = 1 else ℓ = 0
    else
        ℓ = ShortLCE_{2^(k−1)}(i, j)
        if ℓ = 2^(k−1) then
            ℓ = 2^(k−1) + ShortLCE_{2^(k−1)}(i + 2^(k−1), j + 2^(k−1))
    if ℓ = 2^k then Union_k(i, j)
    return ℓ
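Algorithm 2 translates almost line by line into code. The sketch below is an illustration under assumptions: the class names are hypothetical, the Union-Find uses union by size with path halving, and two guards not shown on the slide are added, one for positions past the end of $w$ and one for the case $i = j$.

```python
class UnionFind:
    def __init__(self, size):
        self.parent = list(range(size))
        self.size = [1] * size

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]


class ShortLCE:
    def __init__(self, w, k):
        self.w, self.n, self.k = w, len(w), k
        # One Union-Find structure per level 0..k, over positions 1..n.
        self.uf = [UnionFind(self.n + 1) for _ in range(k + 1)]

    def query(self, i, j, k=None):
        """min(LCE(i, j), 2^k) for 1-based positions i, j."""
        if k is None:
            k = self.k
        if i > self.n or j > self.n:   # guard (assumption): past the end
            return 0
        if i == j:                     # guard (assumption): identical suffixes
            return min(self.n - i + 1, 1 << k)
        if self.uf[k].find(i) == self.uf[k].find(j):
            return 1 << k
        if k == 0:
            length = 1 if self.w[i - 1] == self.w[j - 1] else 0
        else:
            half = 1 << (k - 1)
            length = self.query(i, j, k - 1)
            if length == half:
                length = half + self.query(i + half, j + half, k - 1)
        if length == 1 << k:
            self.uf[k].union(i, j)     # remember the full match at this level
        return length


s = ShortLCE("ababbabbabaa", 2)
print(s.query(2, 8))  # 3
```

Repeating a query that previously returned $2^k$ is answered by a single Find comparison at the top level.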
ShortLCE_t, continued
Lemma. For $t = 2^k$, a sequence of $q$ $\mathrm{ShortLCE}_t(i, j)$ queries can be executed on-line in total time $O((q + n) \cdot k \cdot \alpha(n)) = O((q + n) \cdot \log t \cdot \alpha(n))$.
Proof. We inductively bound the number of recursive calls triggered by $\mathrm{ShortLCE}_{2^k}(i, j)$, where #union is the number of Union operations performed during the call:
◮ $2k + 1 + 2 \cdot \#\mathrm{union}$ if $w[i..i + 2^k - 1] \ne w[j..j + 2^k - 1]$,
◮ $1 + 2 \cdot \#\mathrm{union}$ if $w[i..i + 2^k - 1] = w[j..j + 2^k - 1]$.
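One way to pass from the per-call bound to the lemma is sketched below. It is a reading of the proof under two assumptions that the slide leaves implicit: $k \ge 1$, and each of the $k + 1$ Union-Find structures performs at most $n - 1$ effective unions, with every recursive call issuing $O(1)$ Find or Union operations at amortized cost $O(\alpha(n))$.

```latex
\begin{align*}
\#\text{calls}
  &\le \sum_{\text{queries}} \bigl(2k + 1 + 2 \cdot \#\mathrm{union}\bigr)
   \le q(2k + 1) + 2(k + 1)(n - 1)
   = O\bigl((q + n)\,k\bigr), \\
\text{time}
  &= O\bigl(\#\text{calls} \cdot \alpha(n)\bigr)
   = O\bigl((q + n) \cdot k \cdot \alpha(n)\bigr)
   = O\bigl((q + n) \cdot \log t \cdot \alpha(n)\bigr).
\end{align*}
```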
Where are we now?
Combining these results (with $t = \Theta(\log^2 n)$, so that $\log t = O(\log\log n)$), we currently have:
Current result: answering $n$ LCE queries can be done in $O(n \log\log n \cdot \alpha(n))$ time, using $O(n \log\log n \cdot \alpha(n))$ symbol comparisons.
How can we improve it?
Faster ShortLCE_t queries
We introduce a new difference cover $S(t')$ with $t' \ll t$.
Sparse version of ShortLCE queries (queries restricted to positions from $S(t')$):
$\mathrm{SparseShortLCE}_{t, t'}(i, j) = \mathrm{ShortLCE}_t(i, j)$ if $i, j \in S(t')$, and $\bot$ otherwise.
SparseShortLCE_{t, t'}
Algorithm 3: SparseShortLCE_{2^k, 2^k'}(i, j)        ⊲ i, j ∈ S(2^k')
    if Find_k(i) = Find_k(j) then return 2^k
    if k = k' then
        compute naively ℓ = ShortLCE_{2^k'}(i, j)
    else
        ℓ = SparseShortLCE_{2^(k−1), 2^k'}(i, j)
        if ℓ = 2^(k−1) then
            ℓ = 2^(k−1) + SparseShortLCE_{2^(k−1), 2^k'}(i + 2^(k−1), j + 2^(k−1))
    if ℓ = 2^k then Union_k(i, j)
    return ℓ
Faster ShortLCE_t queries
Lemma. A sequence of $q$ $\mathrm{SparseShortLCE}_{2^k, 2^{k'}}$ queries can be executed on-line in total time $O\bigl(q(k + 2^{k'}) + n\sqrt{2^{k'}} + \frac{nk}{2^{k'/2}} \log^* n\bigr)$.
Lemma. For $t = 2^k$, a sequence of $q$ $\mathrm{ShortLCE}_t$ queries can be executed on-line in total time $O(qk + n\sqrt{k}\,\log^* n) = O(q \log t + n\sqrt{\log t}\,\log^* n)$.
Proof. Pick $t' = 2^{k'} = \Theta(\log t)$. For a query $(i, j)$:
◮ compute naively $\ell = \mathrm{ShortLCE}_{2^{k'}}(i, j)$;
◮ if $\ell = 2^{k'}$, shift $(i, j)$ by $h_{2^{k'}}(i, j)$ and use $\mathrm{SparseShortLCE}_{2^k, 2^{k'}}$.
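The two-step query from the proof, together with Algorithm 3, can be sketched as follows. Everything here is illustrative and rests on stated assumptions: the cover is given by residues modulo $2^{k'}$ (the test values $\{0, 1, 2, 5\}$ modulo 8 are an example choice, not from the talk), `_h` scans shifts instead of running in constant time, the Union-Find structures are allocated over all positions rather than only over $S(2^{k'})$, and the result after shifting is capped at $2^k$. The sketch shows the control flow only, not the running time of the lemma.

```python
class UnionFind:
    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def naive_short(w, i, j, t):
    """min(LCE(i, j), t) by direct comparison, 1-based positions."""
    n, length = len(w), 0
    while (length < t and i + length <= n and j + length <= n
           and w[i + length - 1] == w[j + length - 1]):
        length += 1
    return length


class FastShortLCE:
    def __init__(self, w, k, k_small, residues):
        assert k >= k_small
        self.w, self.n, self.k, self.k_small = w, len(w), k, k_small
        self.residues = set(residues)  # cover residues modulo 2^k_small
        self.uf = {lvl: UnionFind(self.n + 1) for lvl in range(k_small, k + 1)}

    def _h(self, i, j):
        t_small = 1 << self.k_small
        for d in range(t_small):
            if ((i + d) % t_small in self.residues
                    and (j + d) % t_small in self.residues):
                return d
        raise ValueError("residues do not form a difference cover")

    def _sparse(self, i, j, k):
        """Algorithm 3 for cover positions i, j, with end-of-word guards added."""
        if i > self.n or j > self.n:
            return 0
        if i == j:
            return min(self.n - i + 1, 1 << k)
        uf = self.uf[k]
        if uf.find(i) == uf.find(j):
            return 1 << k
        if k == self.k_small:
            length = naive_short(self.w, i, j, 1 << k)
        else:
            half = 1 << (k - 1)  # a multiple of 2^k_small, so we stay in the cover
            length = self._sparse(i, j, k - 1)
            if length == half:
                length = half + self._sparse(i + half, j + half, k - 1)
        if length == 1 << k:
            uf.union(i, j)
        return length

    def query(self, i, j):
        """min(LCE(i, j), 2^k) for arbitrary 1-based positions."""
        t_small = 1 << self.k_small
        length = naive_short(self.w, i, j, t_small)
        if length < t_small:
            return length
        d = self._h(i, j)  # d < 2^k_small <= number of matched characters
        return min(d + self._sparse(i + d, j + d, self.k), 1 << self.k)
```

For example, `FastShortLCE("ab" * 40, k=5, k_small=3, residues={0, 1, 2, 5})` answers queries capped at 32 while the recursion only ever touches positions from the cover.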