Three strategies for the dead-zone string matching algorithm

J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson
SeqBio 2018, 19-20 November 2018, Rouen, France

Outline
1 Introduction
2 Right-to-left
3 Right-to-left with memory
4 Alternating searching: right – left
Daykin et al. Dead-zone SeqBio'18

Notations
- finite alphabet $\Sigma$
- string $x[0..m-1]$ on $\Sigma^*$, of length $|x| = m$
- $\tilde{x}$ is the reverse of $x$: $\tilde{x} = x[m-1]\,x[m-2] \cdots x[1]\,x[0]$
- $x[i..j]$ is a factor (substring) of $x$ from position $i$ to position $j$ (both inclusive)
- $x[0..i]$ is a prefix
- $x[i..m-1]$ is a suffix
- $u$ is a border of $x$ if $u$ is both a prefix and a suffix of $x$
- $\mathit{Border}(x)$ is the longest border of $x$
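These notations map directly onto ordinary string operations. The sketch below is illustrative only: the helper names `reverse`, `borders` and `longest_border` are hypothetical, and borders are assumed to be proper (u different from x), which the slide does not state explicitly.

```python
def reverse(x: str) -> str:
    """Return the reverse of x, written x~ on the slide."""
    return x[::-1]


def borders(x: str) -> list[str]:
    """Return every proper border of x, longest first.

    A border is both a prefix and a suffix of x. The empty string is
    a border of any non-empty x (assumption: borders are proper).
    """
    return [x[:k] for k in range(len(x) - 1, -1, -1) if x.endswith(x[:k])]


def longest_border(x: str) -> str:
    """Return Border(x), or the empty string when x has no border."""
    found = borders(x)
    return found[0] if found else ""
```

For example, the borders of "abaab" are "ab" and the empty string, so Border("abaab") is "ab".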

Exact String Matching Problem
Searching for all exact occurrences of a pattern $x$ ($|x| = m$) in a text $y$ ($|y| = n$)
2 variants:
- on-line (preprocessing of the pattern)
- off-line (preprocessing of the text)

Exact On-Line String Matching
- Simone Faro and Thierry Lecroq. The Exact Online String Matching Problem: a Review of the Most Recent Results. ACM Computing Surveys 45 (2) (2013) 13. https://smart-tool.github.io/smart/
- Christian Charras and Thierry Lecroq. Handbook of exact string matching algorithms. King's College Publications, 2004. http://www-igm.univ-mlv.fr/~lecroq/string/

Sliding window
Classical solutions (KMP, BM, ...): preprocessing of the pattern and use of a sliding window

Sliding window
[Figure: the pattern x (length m) shown at successive window positions along the text y (length n).]

Sliding window
An on-line exact string matching algorithm can then be viewed as a succession of:
- attempts (comparison of the window content and the pattern);
- shifts (of the window to the right).
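The attempt/shift view can be illustrated with the simplest possible instance, where every shift has length 1. This is a minimal sketch, not one of the algorithms from the talk; the function name `naive_search` is hypothetical.

```python
def naive_search(x: str, y: str) -> list[int]:
    """Report every start position of pattern x in text y.

    Each loop iteration is one attempt (compare the window y[j..j+m-1]
    with x) followed by one shift (move the window right by 1).
    """
    m, n = len(x), len(y)
    occurrences = []
    j = 0  # left end of the current window
    while j + m <= n:
        # Attempt: scan the window left to right until a mismatch.
        i = 0
        while i < m and x[i] == y[j + i]:
            i += 1
        if i == m:
            occurrences.append(j)
        # Shift: the naive method always moves by one position.
        j += 1
    return occurrences
```

Algorithms such as KMP and BM keep the same loop structure but replace the constant shift with a precomputed, usually longer, one.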

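Knuth-Morris-Pratt algorithm (1977)
[Figure: at position j of the text y, a prefix u of the pattern x has been compared successfully and is followed by b in the text and by a ≠ b in the pattern; the shifted pattern shows z followed by c ≠ a.]
$k = \min\{\ell \mid x[|\mathit{Border}^{\ell}(u)|] \neq a\}$ and $z = \mathit{Border}^{k}(u)$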
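The shift rule above (iterate Border on u until the symbol that follows the border differs from a) is what the usual KMP failure table precomputes. The sketch below follows the textbook formulation; the names `kmp_next` and `kmp_search` are illustrative and are not taken from the slides.

```python
def kmp_next(x: str) -> list[int]:
    """Build the KMP table: nxt[i] is the length of the longest border z
    of x[0..i-1] such that x[len(z)] != x[i], or -1 if none exists."""
    m = len(x)
    nxt = [-1] * (m + 1)
    i, j = 0, -1
    while i < m:
        while j > -1 and x[i] != x[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and x[i] == x[j]:
            # Same symbol after the border: reuse the shorter border.
            nxt[i] = nxt[j]
        else:
            nxt[i] = j
    return nxt


def kmp_search(x: str, y: str) -> list[int]:
    """Report every start position of x in y, scanning y left to right."""
    m = len(x)
    if m == 0:
        return []
    nxt = kmp_next(x)
    occurrences = []
    i = 0  # number of pattern symbols currently matched
    for j, symbol in enumerate(y):
        while i > -1 and x[i] != symbol:
            i = nxt[i]
        i += 1
        if i == m:
            occurrences.append(j - m + 1)
            i = nxt[m]
    return occurrences
```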

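Boyer-Moore algorithm (1977)
[Figure: a suffix v of the pattern x matches the text y and a mismatch occurs between text symbol b and pattern symbol a; the shifted pattern aligns another occurrence of v in x, next to a symbol c.]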

Dead Zone strategy
Bruce W. Watson and Richard E. Watson. A New Family of String Pattern Matching Algorithms. In: Jan Holub, editor, Proceedings of the Prague Stringology Club Workshop 1997, Prague, Czech Republic, July 7, 1997, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, 12-23.

Our contributions
Three strategies:
- Right-to-left
- Right-to-left with memory
- Alternating searching: right – left

Right-to-left
[Figure: the pattern x with symbol a at position i followed by the matched suffix v; the pattern after rshift, showing v and b; the pattern after lshift, showing c and z.]
A suffix $v$ of the pattern matches in the text and a mismatch occurs with $a$ at position $i$ in the pattern.
- The right shift (rshift) consists of finding a re-occurrence of $v$ in the pattern preceded by a symbol $b$ different from $a$.
- The left shift (lshift) consists of finding the longest suffix $z$ of the pattern preceded by a symbol $c$ different from $a$.

Right-to-left
- right shift: similar to the Boyer-Moore algorithm
- left shift: similar to the Knuth-Morris-Pratt algorithm (but from right to left)
- preprocessing phase linear in time and space
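How a left shift and a right shift together carve out a dead zone can be sketched as follows. This is a rough illustration that assumes the general scheme attributed to Watson and Watson (probe a window in the middle of a live zone of candidate positions, then continue on both sides). The name `dead_zone_search` is hypothetical, and both shifts are fixed to 1 here instead of being read from the preprocessed tables described on the slides.

```python
def dead_zone_search(x: str, y: str) -> list[int]:
    """Report every start position of x in y using a dead-zone style scan."""
    m, n = len(x), len(y)
    occurrences = []
    if m == 0 or m > n:
        return occurrences
    # A live zone is a half-open range [lo, hi) of untested start positions.
    live_zones = [(0, n - m + 1)]
    while live_zones:
        lo, hi = live_zones.pop()
        if lo >= hi:
            continue  # nothing left alive in this zone
        j = (lo + hi) // 2  # probe the middle of the live zone
        # Right-to-left attempt at alignment j.
        i = m - 1
        while i >= 0 and x[i] == y[j + i]:
            i -= 1
        if i < 0:
            occurrences.append(j)
        # Placeholder shifts (assumption): real tables would be indexed by i.
        lshift = rshift = 1
        # Positions j-lshift+1 .. j+rshift-1 are now dead; keep the rest.
        live_zones.append((lo, j - lshift + 1))
        live_zones.append((j + rshift, hi))
    return sorted(occurrences)
```

With longer shifts, more positions around each probe die without ever being tested, which is where the savings come from.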

Right-to-left with memory
[Figure: the pattern x aligned at position j of the text y; a suffix z matches, with pattern symbol a at position i facing text symbol b; k marks a position inside the matched part.]
When $x[i] \neq y[j+i]$ and $x[i+1..m-1] = y[j+i+1..j+m-1]$, then $\mathit{skip}_1[j+k] = k$ and $\mathit{skip}_2[j+k] = k - i$ for $i+1 \le k \le m-1$.

Right-to-left with memory
[Figure: the pattern x at the current alignment (position i) and at an earlier alignment (position k), both facing the same factor of length ℓ of the text y.]
If $k = \mathit{skip}_2[i+j] > 0$, it means that $x[k-\ell+1..k] = y[i+j-\ell+1..i+j]$ with $\ell = \mathit{skip}_1[i+j]$, and furthermore $x[k-\ell] \neq y[i+j-\ell]$ if $k \ge \ell$.
We need to know whether $y[i+j-\ell+1..i+j] = x[i-\ell+1..i]$, and thus we need to know whether $x[k-\ell+1..k] = x[i-\ell+1..i]$.

Right-to-left with memory
$x[k-\ell+1..k] \stackrel{?}{=} x[i-\ell+1..i]$
- Longest common prefix of the suffixes of $\tilde{x}$ starting at positions $m-1-k$ and $m-1-i$
- Can be answered in constant time after linear preprocessing: RMQ on the LCP of $\tilde{x}$
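The reduction from a factor comparison in x to a longest-common-prefix query on the reversed pattern can be checked with a short sketch. For readability it computes the LCP by direct scanning instead of the constant-time RMQ structure mentioned above. The names `lcp_length` and `factors_equal` are hypothetical, and ℓ is assumed to be at most min(i, k) + 1.

```python
def lcp_length(s: str, p: int, q: int) -> int:
    """Length of the longest common prefix of the suffixes s[p:] and s[q:]."""
    length = 0
    while p + length < len(s) and q + length < len(s) and s[p + length] == s[q + length]:
        length += 1
    return length


def factors_equal(x: str, k: int, i: int, ell: int) -> bool:
    """Decide whether x[k-ell+1..k] == x[i-ell+1..i] (inclusive bounds).

    Reading both factors right to left in x is the same as reading the
    suffixes of reversed x that start at m-1-k and m-1-i left to right.
    """
    m = len(x)
    reversed_x = x[::-1]
    return lcp_length(reversed_x, m - 1 - k, m - 1 - i) >= ell
```

A production version would replace `lcp_length` with a precomputed LCP array and a range-minimum structure, leaving `factors_equal` unchanged.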

Right-to-left with memory
- $\mathit{skip}_1$ and $\mathit{skip}_2$ need a stack: the mismatch position is not known for all the matching positions
- $O(n)$ space

Alternating searching: right – left
Order of comparisons: $x[m-1], x[0], x[m-2], x[1], x[m-3], \ldots$
4 shift arrays:
- right shift after a right mismatch, stored in array rsrm
- left shift after a right mismatch, stored in array lsrm
- right shift after a left mismatch, stored in array rslm
- left shift after a left mismatch, stored in array lslm
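The comparison order can be generated with two indices moving towards each other. A small sketch, where the function name `alternating_order` is hypothetical:

```python
def alternating_order(m: int) -> list[int]:
    """Return pattern positions in the order m-1, 0, m-2, 1, m-3, ..."""
    order = []
    left, right = 0, m - 1
    while left <= right:
        order.append(right)  # next position from the right end
        right -= 1
        if left <= right:
            order.append(left)  # next position from the left end
            left += 1
    return order
```

For m = 5 this yields 4, 0, 3, 1, 2: positions 4, 3 and 2 are reached from the right, positions 0 and 1 from the left.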

Alternating searching: right – left
2 conditions:
$\mathit{occCond}'_x(i, d) = (0 < d \le i \text{ and } x[i-d] \neq x[i]) \text{ or } (i < d)$
$\mathit{suffCond}'_x(i, d) =$
  $(0 < d \le m-2-i$ and $x[d..m-2-i]$ is a prefix of $x$ and $x[i-d+1..m-d-1]$ is a suffix of $x)$
  or $(m-2-i < d \le i+1$ and $x[i-d+1..m-d-1]$ is a suffix of $x)$
  or $(i+1 < d$ and $x[0..m-d-1]$ is a suffix of $x)$
$\mathit{rsrm}[i] = \min\{d \mid \mathit{occCond}'_x(i, d) \text{ and } \mathit{suffCond}'_x(i, d) \text{ are satisfied}\}$
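The definition of rsrm can be evaluated directly, which is useful for checking small cases by hand. The sketch below is a brute-force transcription of the two conditions, not the preprocessing used by the authors. The name `rsrm_bruteforce` is hypothetical, and i is assumed to be a right mismatch position, taken here to mean i ≥ ⌊m/2⌋.

```python
def rsrm_bruteforce(x: str, i: int) -> int:
    """Smallest d >= 1 satisfying occCond'(i, d) and suffCond'(i, d)."""
    m = len(x)
    if not (m // 2 <= i < m):
        raise ValueError("i must be a right mismatch position (assumption)")

    def occ_cond(d: int) -> bool:
        return (0 < d <= i and x[i - d] != x[i]) or i < d

    def suff_cond(d: int) -> bool:
        if 0 < d <= m - 2 - i:
            # x[d..m-2-i] is a prefix and x[i-d+1..m-d-1] is a suffix.
            return x.startswith(x[d:m - 1 - i]) and x.endswith(x[i - d + 1:m - d])
        if m - 2 - i < d <= i + 1:
            # Only x[i-d+1..m-d-1] has to be a suffix.
            return x.endswith(x[i - d + 1:m - d])
        # Remaining case i + 1 < d: x[0..m-d-1] has to be a suffix.
        return x.endswith(x[:m - d])

    # d = m always qualifies, so the minimum exists.
    return min(d for d in range(1, m + 1) if occ_cond(d) and suff_cond(d))
```

For x = "aabaa" and i = 2, the prefix "aa" and the suffix "aa" have matched and the mismatch is on "b"; shifts of 1 and 2 fail the suffix condition, and the sketch returns 3.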

Right shift after a right mismatch
A prefix $u$ and a suffix $v$ (of the same length) of $x$ match the text and a mismatch occurs with symbol $a$ at position $i$ of $x$. For a shift of length $d$, three cases arise:
- the suffix $v$ of $x$ reoccurs preceded by a symbol $b$ different from $a$, and a prefix $u'$ of $x$ matches a suffix of $u$;
- only a suffix $v$ of $x$ reoccurs preceded by a symbol $b$ different from $a$;
- only a prefix $v'$ of $x$ matches a suffix of $v$.
[Figures: one diagram per case, showing x with u, a (at position i) and v above the copy of x shifted by d.]
