PSC 2015 Faster Longest Common Extension on Compressed Strings and Applications Shunsuke Inenaga Kyushu University, Japan
Collaborators This work is a collaboration with Hideo Bannai, Takaaki Nishimoto, Tomohiro I, and Masayuki Takeda.
Longest common extension (LCE)
The longest common extension (LCE) problem on a string T is the following task: given two positions p and q, compute the length of the longest common prefix of the two suffixes of T starting at positions p and q.
Example (1-based positions): T = "I argue string algorithms at Prague stringology", p = 6, q = 34. Both suffixes begin with "ue string" and differ at the next character, so LCE(6, 34) = 9.
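The definition above can be checked directly on the uncompressed string. The following is a minimal sketch of the naive character-by-character comparison, not the method proposed in this talk; the function name `lce_naive` is chosen here for illustration, and positions are assumed to be 1-based as in the slide's example.

```python
def lce_naive(text: str, p: int, q: int) -> int:
    """Length of the longest common prefix of text[p..] and text[q..] (1-based)."""
    i, j = p - 1, q - 1  # convert 1-based positions to 0-based indices
    length = 0
    # Advance while both positions are inside the text and the characters agree.
    while i < len(text) and j < len(text) and text[i] == text[j]:
        i += 1
        j += 1
        length += 1
    return length


T = "I argue string algorithms at Prague stringology"
print(lce_naive(T, 6, 34))  # 9, the common extension is "ue string"
```

Each query costs time proportional to the answer, and the whole text must be stored, which is what the compressed setting tries to avoid.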
Background & Motivation LCE has numerous applications, e.g., approximate pattern matching, computing palindromes, and computing approximate repeats. A string T of length u can be preprocessed in O(u) time and space so that each LCE query can be answered in O(1) time [Demaine et al.]. However, the O(u) cost can be prohibitive for large-scale text. To save preprocessing time and space, we consider LCE on grammar-compressed text.
Straight Line Program (SLP)
Definition: An SLP is a sequence of n productions X_1 → expr_1, X_2 → expr_2, ..., X_n → expr_n, where each expr_i has one of two forms:
• expr_i = a (a ∈ Σ)
• expr_i = X_l X_r (l, r < i)
An SLP is a CFG in Chomsky normal form that derives a single string. SLPs model the outputs of grammar-based compression algorithms (e.g., Re-pair, LZ78, LZDF, OLCA).
Straight Line Program (SLP)
• n: size (number of productions) of a given SLP S
• h: height of the derivation tree of S
• u: length of the uncompressed string T represented by SLP S
Example of SLP
SLP S:
X_1 → a
X_2 → b
X_3 → X_1 X_1
X_4 → X_1 X_2
X_5 → X_3 X_4
X_6 → X_5 X_4
X_7 → X_5 X_6
[Figure: derivation tree of SLP S, rooted at X_7, whose leaves spell T = aaabaaabab. The figure marks n (number of productions), h (height of the tree), and u (number of leaves).]
log_2 u ≤ h ≤ n always holds. u can be exponential in n (e.g., consider the string a^u). Hence, O(poly(n)) solutions are of significance.
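The quantities n, h, and u of the example can be computed bottom-up from the productions alone, without building the derivation tree. The sketch below makes some assumptions for illustration: the dictionary encoding of the SLP, the helper names, and the convention that a terminal production has height 0 (height counted in edges) are not from the slides.

```python
from math import log2

# Example SLP: a one-character string is a terminal production,
# a pair (l, r) stands for the production X_i -> X_l X_r.
SLP = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}


def lengths_and_heights(slp):
    """Derived length and derivation-tree height of every variable, in O(n) time."""
    length, height = {}, {}
    for i in sorted(slp):  # children always have smaller indices than i
        rule = slp[i]
        if isinstance(rule, str):
            length[i], height[i] = 1, 0  # assumption: a leaf has height 0
        else:
            l, r = rule
            length[i] = length[l] + length[r]
            height[i] = 1 + max(height[l], height[r])
    return length, height


def expand(slp, i):
    """Fully expand X_i. For illustration only: the output can be exponentially long."""
    rule = slp[i]
    return rule if isinstance(rule, str) else expand(slp, rule[0]) + expand(slp, rule[1])


length, height = lengths_and_heights(SLP)
n, u, h = len(SLP), length[7], height[7]
print(n, u, h, expand(SLP, 7))  # 7 10 4 aaabaaabab
print(log2(u) <= h <= n)        # True

# A doubling SLP with 21 productions derives the string a^(2^20).
doubling = {1: "a", **{i: (i - 1, i - 1) for i in range(2, 22)}}
doubling_length, _ = lengths_and_heights(doubling)
print(doubling_length[21])      # 1048576
```

The doubling SLP shows why u can be exponential in n: every added production doubles the derived length.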
Important Remarks Derivation trees are only imaginary (used only for explanations) and are never constructed explicitly.
Longest Common Extension on SLP
Problem 1 (grammar-compressed LCE): Preprocess an input SLP S = {X_i → expr_i}_{i=1}^{n} so that subsequent longest common extension queries LCE(X_j, X_k, p, q) can be answered quickly.
Example: X_j derives abbabbabca and X_k derives acbbabcbbbac. In the slide's figure, p points at the fifth character of X_j and q at the third character of X_k; the common extension is bbabc, so the query output is the LCE length 5.
What is the difficulty? We are not allowed to expand the SLP (compressed text), since this takes O(2^n) time in the worst case. But we want to know the length of the longest common extension!
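One simple way to answer a query without expanding the SLP is to fetch characters one at a time by descending from the queried variable, guided by precomputed lengths. The sketch below is only a baseline under stated assumptions, not the algorithm of this talk: the SLP encoding and the names `char_at` and `lce_slp` are hypothetical, positions are 1-based, and each query costs O(hL) time for an answer of length L.

```python
# Example SLP: a string is a terminal production, a pair (l, r) means X_i -> X_l X_r.
SLP = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}


def derived_lengths(slp):
    """Length of the string derived by each variable, computed in O(n) time."""
    length = {}
    for i in sorted(slp):
        rule = slp[i]
        length[i] = 1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]]
    return length


def char_at(slp, length, i, pos):
    """Character at 1-based position pos of the string derived by X_i, in O(h) time."""
    while not isinstance(slp[i], str):
        l, r = slp[i]
        if pos <= length[l]:
            i = l                # the position falls inside the left child
        else:
            pos -= length[l]     # skip the left child and continue in the right one
            i = r
    return slp[i]


def lce_slp(slp, length, j, k, p, q):
    """LCE(X_j, X_k, p, q) by comparing one character at a time."""
    matched = 0
    while (p + matched <= length[j] and q + matched <= length[k]
           and char_at(slp, length, j, p + matched) == char_at(slp, length, k, q + matched)):
        matched += 1
    return matched


LENGTH = derived_lengths(SLP)
print(lce_slp(SLP, LENGTH, 7, 7, 1, 5))  # 5: "aaabaaabab" vs "aaabab"
print(lce_slp(SLP, LENGTH, 5, 6, 1, 1))  # 4: "aaab" vs "aaabab"
```

Only the O(n)-size table of lengths is stored, so the compressed text is never expanded, but the query time grows linearly with the answer.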
LCE algorithms on SLPs
Algorithm | Query time | Preprocessing time | Space
-------------------------------------------------------------------------------------------------------------
Folklore | O(hL) | O(n) | O(n)
Miyazaki et al. ’97 (extended) | O(hn^2) | O(n^4) | O(n^2)
Lifshits ’07 (extended) | O(hn^2) | O(hn^2) | O(n^2)
I et al. ’15 | O(h log u) | O(hn^2) | O(n^2)
Bille et al. ’15 (randomized) | O(log u + log^2 L) | N/A | O(n)
This work | O(log u + log* u log L) | O(n loglog n log* u log u) | O(n + z log* u log u)
Notation:
• n: size of the SLP
• u: length of the uncompressed string T
• h: height of the SLP derivation tree
• L: LCE length (output)
• z: size of the LZ77 factorization of T
Relations: log u ≤ h ≤ n; L = O(u); log* u = o(log u); z ≤ n (due to Rytter ’03).
Logstar (iterated logarithm) Definition: The logstar of a positive integer u, denoted log* u, is the number of times the logarithm function needs to be iteratively applied to u until the result becomes less than or equal to 1. The logstar is a very slowly growing function, e.g., log* 2^65536 = 5.
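The definition translates almost literally into code. This is a small sketch assuming base-2 logarithms, which is the base consistent with the example log* 2^65536 = 5; the function name `log_star` is chosen here for illustration.

```python
from math import log2


def log_star(u: int) -> int:
    """Number of times log2 must be applied to u until the result is <= 1."""
    count = 0
    x = u
    while x > 1:
        x = log2(x)  # math.log2 accepts arbitrarily large Python integers
        count += 1
    return count


print(log_star(2 ** 65536))  # 5: 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1
print(log_star(65536))       # 4
```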
LCE algorithms on SLPs (highlights) Compared with the previous algorithms listed above, this work offers the fastest deterministic queries, the fastest preprocessing, and the smallest space in many cases.
Our strategy All previous algorithms work on the SLP derivation trees of the two query non-terminals. Our new algorithm does NOT work on the SLP derivation trees. Instead, we construct a different tree of logarithmic height, based on locally consistent parsing and signature encoding.
Locally consistent parsing
Lemma 1 [Mehlhorn et al., Alstrup et al.]: For any integer string Y ∈ {1..m}* in which no adjacent elements are equal (i.e., Y[i] ≠ Y[i+1]), there is a bit string d of length |Y| such that
1. no 1's appear consecutively;
2. at most three 0's appear consecutively;
3. each d[i] is determined locally, i.e., by Y[i − Δ_L .. i − 1] and Y[i .. i + Δ_R], where Δ_L ≤ log* m + 6 and Δ_R ≤ 4;
4. d can be computed in O(|Y|) time.
Locally consistent parsing (example)
Y = 1,2,3,5,2,3,4,2,5,1,2,3,5,2,3,4,2,5
d = 1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0
Each bit d[i] depends only on the Δ_L elements to its left and the Δ_R elements to its right, where Δ_L ≤ log* m + 6 and Δ_R ≤ 4.
Using the bit string d, any such integer string Y can be uniquely decomposed in linear time into blocks of length 2 to 4, with a new block starting at every position i where d[i] = 1.
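Given Y and d, the block decomposition is a single left-to-right scan. The sketch below takes d as input and does not compute it (computing d is the part covered by Lemma 1); it assumes that a block begins at every position whose bit is 1 and that the first bit is 1, and the function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(Y, d):
    """Cut Y into blocks, opening a new block at every position whose bit in d is 1."""
    if len(Y) != len(d):
        raise ValueError("Y and d must have the same length")
    if d and d[0] != 1:
        raise ValueError("this sketch assumes the first bit of d is 1")
    blocks = []
    for value, bit in zip(Y, d):
        if bit == 1:
            blocks.append([value])    # a 1 bit opens a new block
        else:
            blocks[-1].append(value)  # a 0 bit extends the current block
    return blocks


Y = [1, 2, 3, 5, 2, 3, 4, 2, 5, 1, 2, 3, 5, 2, 3, 4, 2, 5]
d = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
blocks = split_into_blocks(Y, d)
print(blocks)
# [[1, 2], [3, 5, 2], [3, 4, 2, 5], [1, 2], [3, 5], [2, 3], [4, 2, 5]]
```

Because no two 1's are adjacent and at most three 0's are consecutive, every block in this example has length between 2 and 4.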
Signature encoding [Mehlhorn et al. ’97]
Iteratively apply locally consistent parsing to the input string T until a single integer is obtained.
T = a b c a c a b b c a b a c c c a
1. Each character is assigned a unique integer called a signature: 1 2 3 1 3 1 2 2 3 1 2 1 3 3 3 1.
2. Each maximal run of the same signature is assigned a new signature: 1 2 3 1 3 1 4 3 1 2 1 5 1 (the run 2 2 becomes 4 and the run 3 3 3 becomes 5).
3. Apply locally consistent parsing to this string. Each block is assigned a new signature (6 to 9 in the slide's figure).
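The first two steps of the example, character signatures and run compression, can be sketched as follows. This is an illustration under assumptions rather than the full encoding: it stops before the locally consistent parsing step, the class and function names are hypothetical, and signatures are handed out as consecutive integers in order of first appearance, which happens to reproduce the numbers on the slide.

```python
from itertools import groupby


class SignatureTable:
    """Hands out a fresh integer for each new key and reuses it for repeated keys."""

    def __init__(self):
        self.table = {}

    def signature(self, key):
        if key not in self.table:
            self.table[key] = len(self.table) + 1  # next unused integer
        return self.table[key]


def encode_characters(text, sigs):
    """Step 1: replace every character by its signature."""
    return [sigs.signature(("char", c)) for c in text]


def compress_runs(seq, sigs):
    """Step 2: replace every maximal run of length >= 2 by a new signature."""
    out = []
    for value, group in groupby(seq):
        run_length = len(list(group))
        if run_length == 1:
            out.append(value)  # single occurrences are kept as they are
        else:
            out.append(sigs.signature(("run", value, run_length)))
    return out


sigs = SignatureTable()
level0 = encode_characters("abcacabbcabaccca", sigs)
level1 = compress_runs(level0, sigs)
print(level0)  # [1, 2, 3, 1, 3, 1, 2, 2, 3, 1, 2, 1, 3, 3, 3, 1]
print(level1)  # [1, 2, 3, 1, 3, 1, 4, 3, 1, 2, 1, 5, 1]
```

After run compression no two adjacent elements are equal, which is the precondition of Lemma 1 for the parsing step that follows.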