Compressed Strings and Applications Shunsuke Inenaga Kyushu - PowerPoint PPT Presentation

PSC 2015 Faster Longest Common Extension on Compressed Strings and Applications Shunsuke Inenaga Kyushu University, Japan

Collaborators This work is a collaboration with: Hideo Bannai, Takaaki Nishimoto, Tomohiro I, Masayuki Takeda

Longest common extension (LCE)
The longest common extension (LCE) problem on a string T is the following task: given two positions p and q, compute the length of the longest string that occurs in T starting at both position p and position q.
Example: T = "I argue string algorithms at Prague stringology", p = 6, q = 34. Both positions are followed by "ue string", after which the characters differ, so LCE(6, 34) = 9.
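The definition can be checked directly on the uncompressed string with a character-by-character scan. The sketch below is only an illustration of the definition, not the method of the talk; the function name `naive_lce` and the 1-indexed positions are assumptions chosen to match the example slide.

```python
def naive_lce(text: str, p: int, q: int) -> int:
    """Length of the longest common extension of 1-indexed positions p and q.

    Hypothetical helper: scans both suffixes in parallel, so it takes
    time proportional to the answer and needs the uncompressed text.
    """
    i, j = p - 1, q - 1  # convert to 0-indexed offsets
    length = 0
    while i < len(text) and j < len(text) and text[i] == text[j]:
        i += 1
        j += 1
        length += 1
    return length


T = "I argue string algorithms at Prague stringology"
print(naive_lce(T, 6, 34))  # the slide's query LCE(6, 34)
```

This scan is the baseline that the compressed-string algorithms in the rest of the talk try to avoid, since it requires the full text of length u.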

Background & Motivation
• LCE has numerous applications, e.g., approximate pattern matching, computing palindromes, and computing approximate repeats.
• A string T of length u can be preprocessed in O(u) time and space so that each LCE query can be answered in O(1) time [Demaine et al.].
• However, the O(u) complexity can be prohibitive for large-scale text.
• To save preprocessing time and space, we consider LCE on grammar-compressed text.

Straight Line Program (SLP)
Definition An SLP is a sequence of n productions X_1 → expr_1, X_2 → expr_2, ···, X_n → expr_n, where each expr_i has one of two forms:
• expr_i = a (a ∈ Σ)
• expr_i = X_l X_r (l, r < i)
An SLP is a CFG in Chomsky normal form which derives a single string.
SLPs model the outputs of grammar-based compression algorithms (e.g., Re-pair, LZ78, LZDF, OLCA).

Straight Line Program (SLP)
n : size (# of productions) of a given SLP S
h : height of the derivation tree of S
u : length of the uncompressed string T represented by SLP S

Example of SLP
SLP S:
X_1 → a
X_2 → b
X_3 → X_1 X_1
X_4 → X_1 X_2
X_5 → X_3 X_4
X_6 → X_5 X_4
X_7 → X_5 X_6
[Figure: derivation tree of SLP S, rooted at X_7, with n (number of productions), h (tree height), and u (number of leaves) marked.]
Derived string: T = a a a b a a a b a b
• log_2 u ≤ h ≤ n always holds.
• u can be exponential in n (e.g. consider the string a^u).
• Hence, O(poly(n)) solutions are of significance.
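The example SLP can be written down as a small table and checked mechanically. The sketch below is an illustration under assumptions: the dictionary layout, the function names, and the convention that a terminal production has height 0 are all hypothetical choices, not taken from the talk. The explicit expansion is done only because this example is tiny.

```python
import math

# Productions of the example SLP: either a terminal character,
# or a pair (l, r) meaning X_i -> X_l X_r with l, r < i.
RULES = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}


def lengths_and_heights(rules):
    """Derived length and derivation-tree height of every variable.

    Processes variables in increasing index order, so both children
    are already known. Assumption: terminal productions have height 0.
    """
    length, height = {}, {}
    for i in sorted(rules):
        rhs = rules[i]
        if isinstance(rhs, str):
            length[i], height[i] = 1, 0
        else:
            left, right = rhs
            length[i] = length[left] + length[right]
            height[i] = 1 + max(height[left], height[right])
    return length, height


def expand(rules, i):
    """Fully expand X_i (for tiny examples only; exponential in general)."""
    rhs = rules[i]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


length, height = lengths_and_heights(RULES)
n, u, h = len(RULES), length[7], height[7]
print(expand(RULES, 7), n, u, h)
print(math.log2(u) <= h <= n)
```

Computing `length` needs only O(n) arithmetic operations, which is why lengths can be stored even when u is far too large to expand.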

Important Remarks
[Figure: derivation tree of X_6, which derives a a a b a b.]
• Derivation trees are only imaginary (used only for explanations) and are never constructed explicitly.

Longest Common Extension on SLP
Problem 1 (grammar compressed LCE) Preprocess an input SLP S = {X_i → expr_i}_{i=1}^{n} so that subsequent longest common extension queries LCE(X_j, X_k, p, q) can be answered quickly.
Example: if the query positions p and q split the two derived strings as abba|bbabca and ac|bbabcbbbac, the parts to the right of the bars share the prefix bbabc, so the query output is LCE length 5.

What is the difficulty?
• We are not allowed to expand the SLP (compressed text), since this takes O(2^n) time in the worst case.
• But we want to know the length of the longest common extension!

LCE algorithms on SLPs
Algorithm: query time; preprocessing time; space
• Folklore: O(hL); O(n); O(n)
• Miyazaki et al. ’97 (extended): O(hn^2); O(n^4); O(n^2)
• Lifshits ’07 (extended): O(hn^2); O(hn^2); O(n^2)
• I et al. ’15: O(h log u); O(hn^2); O(n^2)
• Bille et al. ’15 (randomized): O(log u + log^2 L); N/A; O(n)
• This work: O(log u + log* u log L); O(n loglog n log* u log u); O(n + z log* u log u)
Notation:
n : size of SLP
u : length of uncompressed string T
h : height of SLP derivation tree
L : LCE length (output)
z : size of LZ77 factorization of T
Relations: log u ≤ h ≤ n; L = O(u); log* u = o(log u); z ≤ n (due to Rytter ’03).
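The folklore bound of O(hL) query time after O(n) preprocessing can be read as follows: store the derived length of every variable, fetch a single character by walking down the derivation in O(h) steps, and compare characters one pair at a time. The sketch below follows that reading; it is an assumption about what "folklore" refers to, and the names `derived_lengths`, `char_at`, and `folklore_lce` are hypothetical. It reuses the example SLP from the earlier slide.

```python
# Example SLP: terminal character, or (l, r) meaning X_i -> X_l X_r.
RULES = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4), 6: (5, 4), 7: (5, 6)}


def derived_lengths(rules):
    """O(n) preprocessing: length of the string derived by each variable."""
    length = {}
    for i in sorted(rules):
        rhs = rules[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length


def char_at(rules, length, i, pos):
    """Character at 1-indexed position pos of the string derived by X_i.

    Walks down the (never materialized) derivation tree: O(h) steps.
    """
    while not isinstance(rules[i], str):
        left, right = rules[i]
        if pos <= length[left]:
            i = left
        else:
            pos -= length[left]
            i = right
    return rules[i]


def folklore_lce(rules, length, j, k, p, q):
    """LCE(X_j, X_k, p, q) by comparing one character pair per step."""
    matched = 0
    while (p + matched <= length[j] and q + matched <= length[k]
           and char_at(rules, length, j, p + matched)
           == char_at(rules, length, k, q + matched)):
        matched += 1
    return matched


LENGTH = derived_lengths(RULES)
print(folklore_lce(RULES, LENGTH, 7, 7, 1, 5))  # X_7 derives aaabaaabab
```

Each matched character costs one descent of depth at most h, so a query with output L costs O(hL), which is slow precisely when the answer is long.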

Logstar (iterated logarithm)
Definition The logstar of a positive integer u, denoted log* u, is the number of times the logarithm function needs to be iteratively applied to u until the result becomes less than or equal to 1.
• The logstar is a very slowly growing function, e.g., log* 2^65536 = 5.
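The definition translates into a short loop. The sketch below assumes base-2 logarithms, which matches the slide's example value; the function name `log_star` is hypothetical.

```python
import math


def log_star(u: int) -> int:
    """Number of times log2 must be applied to u until the result is <= 1."""
    if u < 1:
        raise ValueError("u must be a positive integer")
    count = 0
    x = u
    while x > 1:
        x = math.log2(x)  # math.log2 accepts arbitrarily large Python ints
        count += 1
    return count


print(log_star(2 ** 65536))
```

For 2^65536 the successive values are 65536, 16, 4, 2, 1, so five applications are needed.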

LCE algorithms on SLPs
Compared with the previous algorithms, this work offers the fastest queries, the fastest deterministic preprocessing, and the smallest space in many cases.

Our strategy
• All previous algorithms work on the SLP derivation trees of the two query non-terminals.
• Our new algorithm does NOT work on the SLP derivation trees.
• Instead, we construct a different tree of logarithmic height, based on locally consistent parsing and signature encoding.

Locally consistent parsing
Lemma 1 [Mehlhorn et al., Alstrup et al.] For any integer string Y ∈ {1..m}* in which no adjacent elements are equal (i.e. Y[i] ≠ Y[i+1]), there is a bit string d of length |Y| such that
1. no 1’s appear consecutively;
2. at most three 0’s appear consecutively;
3. each d[i] is determined locally, i.e., by Y[i − Δ_L … i − 1] and Y[i … i + Δ_R], where Δ_L ≤ log* m + 6 and Δ_R ≤ 4;
4. d can be computed in O(|Y|) time.

Locally consistent parsing
Y = 1,2,3,5,2,3,4,2,5,1,2,3,5,2,3,4,2,5
d = 1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0
• Each bit d[i] depends only on a window of Y that extends Δ_L ≤ log* m + 6 positions to the left of i and Δ_R ≤ 4 positions to the right.
• Using the bit string d, any integer string Y can be uniquely decomposed in linear time into blocks of length 2-4.
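Given d, the decomposition into blocks is a single left-to-right pass. The sketch below assumes that each 1 in d marks the first element of a new block, which is consistent with the two constraints on d (no adjacent 1's gives length at least 2, at most three consecutive 0's gives length at most 4); the function name `split_blocks` is hypothetical, and computing d itself is not shown.

```python
def split_blocks(Y, d):
    """Cut Y into blocks, starting a new block wherever d has a 1.

    Assumption: d[0] == 1, so the first element opens the first block.
    """
    if len(Y) != len(d):
        raise ValueError("Y and d must have the same length")
    blocks = []
    for value, bit in zip(Y, d):
        if bit == 1 or not blocks:
            blocks.append([value])    # a 1 opens a new block
        else:
            blocks[-1].append(value)  # a 0 extends the current block
    return blocks


Y = [1, 2, 3, 5, 2, 3, 4, 2, 5, 1, 2, 3, 5, 2, 3, 4, 2, 5]
d = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
for block in split_blocks(Y, d):
    print(block)
```

On the slide's example this yields seven blocks, each of length between 2 and 4.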

Signature encoding [Mehlhorn et al. ’97]
• Iteratively apply locally consistent parsing to input string T until a single integer is obtained.
T = a b c a c a b b c a b a c c c a
1. Each character is assigned a unique integer called a signature (here a = 1, b = 2, c = 3): 1 2 3 1 3 1 2 2 3 1 2 1 3 3 3 1
2. Each maximal run of the same signature is assigned a new signature (here 2 2 becomes 4 and 3 3 3 becomes 5): 1 2 3 1 3 1 4 3 1 2 1 5 1
3. Locally consistent parsing is applied to this string, and each resulting block is assigned a new signature (6 to 9 on the slide).
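The first two steps of the slide's example (character signatures, then run signatures) can be reproduced with a dictionary that hands out fresh integers. The sketch below is an illustration under assumptions: signatures are numbered in order of first appearance, runs of length 1 keep their signature as on the slide, and the names `assign`, `encode_chars`, and `collapse_runs` are hypothetical. The block step is omitted because it needs the bit string d from locally consistent parsing.

```python
def assign(table, key):
    """Return the signature for key, creating the next free integer if new."""
    if key not in table:
        table[key] = len(table) + 1
    return table[key]


def encode_chars(text, table):
    """Step 1: one signature per distinct character."""
    return [assign(table, ("char", c)) for c in text]


def collapse_runs(seq, table):
    """Step 2: replace each maximal run of length >= 2 by a new signature."""
    out = []
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        run = j - i
        if run == 1:
            out.append(seq[i])  # single elements are kept unchanged
        else:
            out.append(assign(table, ("run", seq[i], run)))
        i = j
    return out


table = {}
level0 = encode_chars("abcacabbcabaccca", table)
level1 = collapse_runs(level0, table)
print(level0)
print(level1)
```

Because equal keys always map to the same integer, equal substrings that are parsed the same way receive equal signatures, which is what lets signatures stand in for substrings during comparison.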
