
Faster Longest Common Extension Queries in Strings over General Alphabets

Paweł Gawrychowski (1, 2), Tomasz Kociumaka (1), Wojciech Rytter (1), Tomasz Waleń (1)
(1) University of Warsaw, Poland, [gawry,kociumaka,rytter,walen]@mimuw.edu.pl
(2) University of Haifa, Israel

CPM 2016, Tel Aviv, Israel, 2016–06–27

Introduction: LCE problem

We consider the Longest Common Extension (LCE) problem for a general ordered alphabet, where characters can only be compared.

Preprocess a given word $w$ of length $n$ for queries:
$\mathrm{LCE}(i, j)$: the length of the longest factor of $w$ that starts at both position $i$ and position $j$.

Example (positions 1 to 12):

w = a b a b b a b b a b a a

$\mathrm{LCE}(2, 8) = 3$: the suffixes starting at positions 2 and 8 both begin with "bab", and then continue with "b" and "a" respectively.
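The definition can be checked directly with a character-by-character scan. This is only a minimal sketch of the query itself (no preprocessing), using 1-based positions as on the slide; the function name `lce` is chosen here for illustration.

```python
def lce(w: str, i: int, j: int) -> int:
    """Naive LCE for 1-based positions i and j, using only equality tests."""
    n = len(w)
    length = 0
    # Extend while both positions stay inside w and the characters agree.
    while (i + length <= n and j + length <= n
           and w[i + length - 1] == w[j + length - 1]):
        length += 1
    return length


w = "ababbabbabaa"
print(lce(w, 2, 8))  # 3
```

A single query may cost up to $n$ comparisons this way, which is what the preprocessing on the following slides avoids.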

Results

Naive solution: answering $n$ LCE queries can be done in $O(n \log n)$ time (reduce the alphabet to $[1..n]$ via sorting).

Previous result, Kosolobov (IPL, 2016): answering $n$ LCE queries can be done in $O(n \log^{2/3} n)$ time. Conjectured that $O(n)$ time is possible. Motivation: efficient computation of runs (Bannai et al., 2015).

Our result: answering $n$ LCE queries can be done in $O(n \log\log n)$ time, using $O(n)$ symbol comparisons.

Difference cover

t-cover (difference cover): a set $S(t) \subseteq [1..n]$ is called a $t$-cover of $[1..n]$ if:
◮ $S(t)$ is $t$-periodic: for each $i \in [1..n-t]$, $i \in S(t) \Leftrightarrow i + t \in S(t)$;
◮ there is a constant-time computable function $h$ such that for $1 \le i, j \le n - t$: $0 \le h(i, j) \le t$ and $i + h(i, j),\ j + h(i, j) \in S(t)$.

Lemma. For each $t \le n$ there is a $t$-cover $S(t)$ of size $O(n/\sqrt{t})$, which can be constructed in $O(n/\sqrt{t})$ time.

t-cover, example

$S(6) = \{2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23\}$.

[Figure: positions 1 to 24 split into four periods of length 6, with the elements of $S(6)$ marked and the shift $h(3, 10) = 5$ drawn from positions 3 and 10.]

For $i = 3$, $j = 10$ we have $h(3, 10) = 5$, since $3 + 5,\ 10 + 5 \in S(6)$.
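The example can be reproduced with a small sketch. It assumes the cover is given by the residues {2, 3, 5} modulo 6 (read off from the listed set) and answers $h$ from a $t \times t$ lookup table holding the smallest valid shift. The table and the names `RESIDUES`, `H`, `cover` are illustrative choices, not the construction behind the lemma.

```python
T = 6
RESIDUES = {2, 3, 5}  # residues mod 6 of the elements of S(6) listed above


def cover(n: int) -> list[int]:
    """Elements of the periodic set S(6) inside [1..n]."""
    return [p for p in range(1, n + 1) if p % T in RESIDUES]


# H[a][b]: smallest shift d in [0, T) with a + d and b + d both in the cover.
# Such a d exists for every pair because each difference mod 6 occurs
# between two elements of RESIDUES.
H = [[next(d for d in range(T)
           if (a + d) % T in RESIDUES and (b + d) % T in RESIDUES)
      for b in range(T)]
     for a in range(T)]


def h(i: int, j: int) -> int:
    """Constant-time lookup of a shift that moves i and j into the cover."""
    return H[i % T][j % T]


print(cover(24))  # [2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23]
print(h(3, 10))   # 5
```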

ShortLCE_t vs CoarseLCE_t

ShortLCE_t:
$$\mathrm{ShortLCE}_t(i, j) = \min(\mathrm{LCE}(i, j),\ t).$$
It finds the length of the LCE, but only up to a maximal length $t$.

CoarseLCE_t:
$$\mathrm{CoarseLCE}_t(i, j) = \begin{cases} \lfloor \mathrm{LCE}(i, j) / t \rfloor & \text{if } i, j \in S(t), \\ \bot & \text{otherwise.} \end{cases}$$
It finds the length of the LCE, but only with a precision of $t$ characters.

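Generic algorithm

Algorithm 1: $\mathrm{GenericLCE}_t(i, j)$
    $\ell_1 = \mathrm{ShortLCE}_t(i, j)$
    if $\ell_1 < t$ then return $\ell_1$
    $\Delta = h_t(i, j)$    ⊲ $i + \Delta,\ j + \Delta \in S(t)$
    $\ell_2 = t \cdot \mathrm{CoarseLCE}_t(i + \Delta, j + \Delta)$
    $\ell_3 = \mathrm{ShortLCE}_t(i + \Delta + \ell_2, j + \Delta + \ell_2)$
    return $\Delta + \ell_2 + \ell_3$

[Figure: the common extension from $i$ and $j$ split into the shift $\Delta$, a CoarseLCE part of length $\ell_2$ made of whole $t$-blocks, and a final ShortLCE part $\ell_3$; the initial ShortLCE probe $\ell_1$ covers the first $t$ characters.]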
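Algorithm 1 can be exercised end to end with naive stand-ins for its three ingredients. In the sketch below, ShortLCE and CoarseLCE are computed by direct scanning and $h$ by a linear search over shifts, so it illustrates only the control flow and the correctness of the decomposition, not the running time. The cover is assumed to be the residues {2, 3, 5} modulo 6 from the earlier example, and all function names are illustrative.

```python
T = 6
RESIDUES = {2, 3, 5}  # assumed 6-cover: positions p with p % 6 in RESIDUES


def naive_lce(w: str, i: int, j: int) -> int:
    """Character-by-character LCE for 1-based positions."""
    n, length = len(w), 0
    while (i + length <= n and j + length <= n
           and w[i + length - 1] == w[j + length - 1]):
        length += 1
    return length


def short_lce(w: str, i: int, j: int) -> int:
    return min(naive_lce(w, i, j), T)


def coarse_lce(w: str, i: int, j: int):
    """floor(LCE / T) on cover positions, None (for the bottom symbol) elsewhere."""
    if i % T not in RESIDUES or j % T not in RESIDUES:
        return None
    return naive_lce(w, i, j) // T


def h(i: int, j: int) -> int:
    """Smallest shift putting both positions in the cover (linear search here)."""
    return next(d for d in range(T)
                if (i + d) % T in RESIDUES and (j + d) % T in RESIDUES)


def generic_lce(w: str, i: int, j: int) -> int:
    l1 = short_lce(w, i, j)
    if l1 < T:
        return l1                      # short answer, found directly
    delta = h(i, j)                    # delta < T <= LCE(i, j)
    l2 = T * coarse_lce(w, i + delta, j + delta)
    l3 = short_lce(w, i + delta + l2, j + delta + l2)
    return delta + l2 + l3


w = "baabbaabbaaabbaabbaaabbb"
print(generic_lce(w, 1, 13), naive_lce(w, 1, 13))
```

The two printed values agree; the same holds for every pair of positions in this word.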

CoarseLCE_t

CoarseLCE_t algorithm for $t = \Omega(\log^2 n)$:
◮ reduce the word $w$ to a new word code(w) that is:
    ◮ shorter (of length $O(n/\sqrt{t})$),
    ◮ over a small alphabet $[1..n]$;
◮ use the naive solution (with suffix arrays).

CoarseLCE_t

CoarseLCE_t algorithm for $t = \Omega(\log^2 n)$:
1. sort all $t$-blocks starting in $S(t)$ and remove duplicates,
2. encode every $t$-block with its rank on the sorted list,
3. construct a new string code(w) of length $O(n/\log n)$ over the alphabet $[1..n]$, such that any CoarseLCE_t query can be reduced to an LCE query on code(w),
4. preprocess code(w) for LCE queries.

Example ($t = 6$; blocks start at positions 2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23; blocks running past the end of $w$ are padded with *):

w = b a a b b a a b b a a a b b a a b b a a a b b b
α = 1 8 6 2 (ranks of the blocks starting at 2, 8, 14, 20)
β = 3 5 1 4 (ranks of the blocks starting at 3, 9, 15, 21)
γ = 6 1 8 7 (ranks of the blocks starting at 5, 11, 17, 23)
code(w) = 1 8 6 2 $ 3 5 1 4 # 6 1 8 7 &
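Steps 1 to 3 can be replayed on the example word. The sketch below sorts the blocks with Python's built-in string comparison rather than with $O(n)$ ShortLCE queries, so it shows only what code(w) contains, not how the slides compute it efficiently. It assumes the residues {2, 3, 5} modulo 6 for $S(6)$ and a padding symbol `*` that sorts before every letter; the separators are taken from the figure.

```python
T = 6
RESIDUES = (2, 3, 5)      # residue classes of S(6), in the order alpha, beta, gamma
PAD = "*"                 # assumed to compare smaller than every letter of w
SEPARATORS = ("$", "#", "&")


def code(w: str) -> list:
    n = len(w)
    padded = w + PAD * T
    starts = [p for p in range(1, n + 1) if p % T in RESIDUES]
    # Steps 1 and 2: sort the distinct T-blocks and rank them from 1.
    block = {p: padded[p - 1:p - 1 + T] for p in starts}
    rank = {b: r for r, b in enumerate(sorted(set(block.values())), start=1)}
    # Step 3: one run of ranks per residue class, each closed by a separator.
    out = []
    for residue, sep in zip(RESIDUES, SEPARATORS):
        out.extend(rank[block[p]] for p in starts if p % T == residue)
        out.append(sep)
    return out


w = "baabbaabbaaabbaabbaaabbb"
print(code(w))
# [1, 8, 6, 2, '$', 3, 5, 1, 4, '#', 6, 1, 8, 7, '&']
```

Two cover positions in the same residue class that are $t$ apart become adjacent symbols of code(w), which is why the length of a common run of equal ranks corresponds to a number of whole matching $t$-blocks.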

CoarseLCE_t, continued

Lemma. For $t = \Omega(\log^2 n)$ we can lexicographically sort all $t$-blocks of $w$ starting in $S(t)$ using $O(n)$ ShortLCE_t queries and $O(n)$ additional time.

Lemma. For $t = \Omega(\log^2 n)$, if we can answer $O(n)$ ShortLCE_t queries in $T(n)$ time (e.g. $O(n \log t)$), then we can preprocess $w$ in $O(T(n) + n)$ time (resp. $O(n \log t)$), so that any CoarseLCE_t query can be answered in constant time.

ShortLCE_t

ShortLCE_t is computed recursively, for $t = 2^k$:
◮ we have $k$ levels (level $h$ handles queries up to length $2^h$),
◮ each level has its own Union-Find structure,
◮ if at level $h$ we find out that two positions $i$ and $j$ have $\mathrm{LCE}(i, j) \ge 2^h$, then we union those positions,
◮ so if $\mathrm{Find}_h(i) = \mathrm{Find}_h(j)$ then $\mathrm{LCE}(i, j) \ge 2^h$; otherwise we have no information about $\mathrm{LCE}(i, j)$.

ShortLCE_t

Algorithm 2: $\mathrm{ShortLCE}_{2^k}(i, j)$: compute $\mathrm{LCE}(i, j)$ up to length $2^k$
    if $\mathrm{Find}_k(i) = \mathrm{Find}_k(j)$ then return $2^k$
    if $k = 0$ then
        if $w[i] = w[j]$ then $\ell = 1$ else $\ell = 0$
    else
        $\ell = \mathrm{ShortLCE}_{2^{k-1}}(i, j)$
        if $\ell = 2^{k-1}$ then
            $\ell = 2^{k-1} + \mathrm{ShortLCE}_{2^{k-1}}(i + 2^{k-1}, j + 2^{k-1})$
    if $\ell = 2^k$ then $\mathrm{Union}_k(i, j)$
    return $\ell$

ShortLCE_t, continued

Lemma. For $t = 2^k$, a sequence of $q$ $\mathrm{ShortLCE}_t(i, j)$ queries can be executed on-line in total time $O((q + n) k \cdot \alpha(n)) = O((q + n) \cdot \log t \cdot \alpha(n))$.

Proof. We inductively bound the number of recursive calls triggered by $\mathrm{ShortLCE}_{2^k}(i, j)$ by
$$\begin{cases} 2k + 1 + 2\,\#\mathrm{union} & \text{if } w[i..i + 2^k - 1] \ne w[j..j + 2^k - 1], \\ 1 + 2\,\#\mathrm{union} & \text{if } w[i..i + 2^k - 1] = w[j..j + 2^k - 1], \end{cases}$$
where $\#\mathrm{union}$ is the number of Union operations performed during the call.
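The inductive step behind these two bounds can be spelled out as follows. This is a reconstruction from Algorithm 2 rather than text from the slides; $c_k$ denotes the number of calls (including the top one) triggered by a level-$k$ query, and $u'$, $u''$ the numbers of unions performed inside its first and second recursive call.

```latex
% Base case k = 0, or an immediate return via Find_k: c_k = 1.
\begin{align*}
\text{blocks equal:}\quad
  c_k &\le 1 + (1 + 2u') + (1 + 2u'')
       = 1 + 2(u' + u'' + 1)
       = 1 + 2\,\#\mathrm{union},\\
\text{first halves equal, second halves differ:}\quad
  c_k &\le 1 + (1 + 2u') + \bigl(2(k-1) + 1 + 2u''\bigr)
       = 2k + 1 + 2\,\#\mathrm{union},\\
\text{first halves differ:}\quad
  c_k &\le 1 + \bigl(2(k-1) + 1 + 2u'\bigr)
       = 2k + 2\,\#\mathrm{union}
       \le 2k + 1 + 2\,\#\mathrm{union}.
\end{align*}
```

In the first case the extra $+1$ inside the parentheses is the union performed at level $k$ itself. Since each level performs fewer than $n$ successful unions, summing over $q$ queries gives $O((q + n)k)$ calls in total.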

Where are we now?

Current result: with those results, answering $n$ LCE queries can be done in $O(n \log\log n \cdot \alpha(n))$ time, using $O(n \log\log n \cdot \alpha(n))$ symbol comparisons.

How can we improve it?

Faster ShortLCE_t queries

We introduce a new difference cover $S(t')$ with $t' \ll t$.

Sparse version of ShortLCE queries (queries restricted to positions from $S(t')$):
$$\mathrm{SparseShortLCE}_{t, t'}(i, j) = \begin{cases} \mathrm{ShortLCE}_t(i, j) & \text{if } i, j \in S(t'), \\ \bot & \text{otherwise.} \end{cases}$$

SparseShortLCE_{t, t'}

Algorithm 3: $\mathrm{SparseShortLCE}_{2^k, 2^{k'}}(i, j)$    ⊲ $i, j \in S(2^{k'})$
    if $\mathrm{Find}_k(i) = \mathrm{Find}_k(j)$ then return $2^k$
    if $k = k'$ then
        compute naively $\ell = \mathrm{ShortLCE}_{2^{k'}}(i, j)$
    else
        $\ell = \mathrm{SparseShortLCE}_{2^{k-1}, 2^{k'}}(i, j)$
        if $\ell = 2^{k-1}$ then
            $\ell = 2^{k-1} + \mathrm{SparseShortLCE}_{2^{k-1}, 2^{k'}}(i + 2^{k-1}, j + 2^{k-1})$
    if $\ell = 2^k$ then $\mathrm{Union}_k(i, j)$
    return $\ell$

Faster ShortLCE_t queries

Lemma. A sequence of $q$ $\mathrm{SparseShortLCE}_{2^k, 2^{k'}}$ queries can be executed on-line in total time
$$O\!\left(q\,(k + 2^{k'}) + n\sqrt{2^{k'}} + \frac{nk}{2^{k'/2}} \log^* n\right).$$

Lemma. For $t = 2^k$, a sequence of $q$ $\mathrm{ShortLCE}_t$ queries can be executed on-line in total time
$$O\!\left(qk + n\sqrt{k}\,\log^* n\right) = O\!\left(q \log t + n\sqrt{\log t}\,\log^* n\right).$$

Proof. Pick $t' = \Theta(\log t) = 2^{k'}$. For a query $i, j$:
◮ compute naively $\ell = \mathrm{ShortLCE}_{2^{k'}}(i, j)$,
◮ if $\ell = 2^{k'}$, shift $(i, j)$ by $h_{2^{k'}}(i, j)$ and use $\mathrm{SparseShortLCE}_{2^k, 2^{k'}}$.
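The step from the first lemma to the second is a substitution. The worked form below assumes the first bound reads $O(q(k + 2^{k'}) + n\sqrt{2^{k'}} + (nk/2^{k'/2})\log^* n)$, which is how the flattened formula on the slide is interpreted here, and plugs in the choice $2^{k'} = \Theta(k)$ from the proof.

```latex
\begin{align*}
q\,(k + 2^{k'}) &= O(qk), \\
n\sqrt{2^{k'}} &= O\bigl(n\sqrt{k}\bigr), \\
\frac{nk}{2^{k'/2}}\,\log^* n &= O\!\left(\frac{nk}{\sqrt{k}}\,\log^* n\right)
  = O\bigl(n\sqrt{k}\,\log^* n\bigr), \\
\text{naive first step, per query:}\quad O(2^{k'}) &= O(k)
  \;\Rightarrow\; O(qk) \text{ over all queries.}
\end{align*}
```

Adding the four terms gives $O(qk + n\sqrt{k}\,\log^* n)$, and with $k = \log t$ this is $O(q \log t + n\sqrt{\log t}\,\log^* n)$.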
