Three strategies for the dead-zone string matching algorithm. J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson. SeqBio 2018, 19-20 November 2018, Rouen, France.


  1. Three strategies for the dead-zone string matching algorithm. J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, L. Mouchard, É. Prieur-Gaston and B. Watson. SeqBio 2018, 19-20 November 2018, Rouen, France.

  2. Outline: (1) Introduction; (2) Right-to-left; (3) Right-to-left with memory; (4) Alternating searching: right – left. Daykin et al Dead-zone SeqBio’18 2 / 35

  3. Outline: (1) Introduction; (2) Right-to-left; (3) Right-to-left with memory; (4) Alternating searching: right – left.

  4. Notations: finite alphabet $\Sigma$; string $x[0\,..\,m-1]$ on $\Sigma^*$, of length $|x| = m$; $\tilde{x}$ is the reverse of $x$, that is $x[m-1]\,x[m-2] \cdots x[1]\,x[0]$; $x[i\,..\,j]$ is a factor (substring) of $x$ from position $i$ to position $j$ (both inclusive); $x[0\,..\,i]$ is a prefix; $x[i\,..\,m-1]$ is a suffix; $u$ is a border of $x$ if $u$ is both a prefix and a suffix of $x$; $Border(x)$ is the longest border of $x$.
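
The notation above can be tried out with a minimal sketch. It assumes the usual convention that a border is proper (strictly shorter than the string itself); the helper name `longest_border` is hypothetical, and the quadratic scan is chosen for readability rather than efficiency.

```python
def longest_border(x):
    """Return Border(x): the longest proper prefix of x that is also a suffix."""
    # Try candidate lengths from the longest proper one down to 1.
    for k in range(len(x) - 1, 0, -1):
        if x[:k] == x[-k:]:
            return x[:k]
    return ""  # only the empty border exists


x = "abaab"
reverse_x = x[::-1]          # the reverse of x, written x-tilde on the slide
border = longest_border(x)   # longest string that is both prefix and suffix
```

For `x = "abaab"` the prefix `ab` is also a suffix, and no longer proper prefix has that property.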

  5. Exact String Matching Problem. Searching for all exact occurrences of a pattern $x$ ($|x| = m$) in a text $y$ ($|y| = n$). Two variants: on-line (preprocessing of the pattern); off-line (preprocessing of the text).

  6. Exact On-Line String Matching. https://smart-tool.github.io/smart/ : Simone Faro and Thierry Lecroq, The Exact Online String Matching Problem: a Review of the Most Recent Results, ACM Computing Surveys 45 (2) (2013) 13. http://www-igm.univ-mlv.fr/~lecroq/string/ : Christian Charras and Thierry Lecroq, Handbook of exact string matching algorithms, King’s College Publications, 2004.

  7. Sliding window. Classical solutions (KMP, BM, ...): preprocessing of the pattern and use of a sliding window.

  8. Sliding window. [Figure: the pattern x, of length m, aligned at successive positions of the text y, of length n.]

  9. Sliding window. An on-line exact string matching algorithm can then be viewed as a succession of: attempts (comparison of the window content and the pattern); shifts (of the window to the right).
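
The attempt/shift view can be sketched with the simplest possible instance: a window that always shifts by one position. This is only an illustration of the framework, not one of the algorithms of the talk; the function name `naive_search` is hypothetical.

```python
def naive_search(x, y):
    """Report all start positions of pattern x in text y (shift of 1 each time)."""
    m, n = len(x), len(y)
    occurrences = []
    j = 0  # left end of the window on the text
    while j <= n - m:
        # Attempt: compare the window content y[j .. j+m-1] with the pattern.
        i = 0
        while i < m and x[i] == y[j + i]:
            i += 1
        if i == m:
            occurrences.append(j)
        # Shift: move the window to the right (here always by one position).
        j += 1
    return occurrences
```

Classical algorithms keep this loop and replace the unit shift by a longer one that is computed from the preprocessed pattern.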

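  10. Knuth-Morris-Pratt algorithm (1977). [Figure: comparisons at position j of the text y; a prefix u of x matches and the next pattern symbol a differs from the text symbol b; after the shift, a border z of u is aligned, followed by a symbol c different from a.] $k = \min\{\ell \mid x[\,|Border^{\ell}(u)|\,] \neq a\}$ and $z = Border^{k}(u)$.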
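
The formula can be read as "iterate the border function on u until the symbol following the border differs from a". A minimal sketch of that reading follows, assuming the standard Knuth-Morris-Pratt convention that the shift is |u| − |z|, or |u| + 1 when no such border exists. The names `border_table` and `kmp_shift` are hypothetical.

```python
def border_table(x):
    """b[i] = length of the longest proper border of x[0..i-1]; b[0] = -1."""
    m = len(x)
    b = [0] * (m + 1)
    b[0] = -1
    k = -1
    for i in range(m):
        while k >= 0 and x[k] != x[i]:
            k = b[k]
        k += 1
        b[i + 1] = k
    return b


def kmp_shift(x, i):
    """Shift after a mismatch at pattern position i (u = x[0..i-1], a = x[i])."""
    b = border_table(x)
    t = b[i]  # |Border(u)|
    # Follow Border, Border^2, ... while the symbol after the border equals a.
    while t >= 0 and x[t] == x[i]:
        t = b[t]
    # t = |z| when a suitable border exists, -1 otherwise.
    return i - t
```

For `x = "abab"` and a mismatch at position 3, the border `a` of `u = "aba"` is followed by `b`, the same symbol that just failed, so the iteration continues to the empty border and the shift is 3.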

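  11. Boyer-Moore algorithm (1977). [Figure: comparisons from right to left; a suffix v of x matches the text y and the pattern symbol a differs from the text symbol b; after the shift, another occurrence of v in x, preceded by a symbol c, is aligned with the text.]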

  12. Dead Zone strategy. Bruce W. Watson and Richard E. Watson, A New Family of String Pattern Matching Algorithms. In: Jan Holub, editor, Proceedings of the Prague Stringology Club Workshop 1997, Prague, Czech Republic, July 7, 1997, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, 12–23.

  13-19. Dead Zone strategy. [Figure shown in seven animation steps; no text is recoverable from the slides.]

  20. Our contributions. Three strategies: right-to-left; right-to-left with memory; alternating searching: right – left.

  21. Outline: (1) Introduction; (2) Right-to-left; (3) Right-to-left with memory; (4) Alternating searching: right – left.

  22. Right-to-left. [Figure: the pattern x with symbol a at position i followed by the suffix v, and two shifted copies of x, one for rshift (showing b and v) and one for lshift (showing c and z).] A suffix v of the pattern matches in the text and a mismatch occurs with a at position i in the pattern. The right shift (rshift) consists in finding a re-occurrence of v in the pattern preceded by a symbol b different from a. The left shift (lshift) consists in finding the longest suffix z of the pattern preceded by a symbol c different from a.

  23. Right-to-left. Right shift: similar to the shift of the Boyer-Moore algorithm. Left shift: similar to the shift of the Knuth-Morris-Pratt algorithm (but from right to left). The preprocessing phase is linear in time and space.
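
The right-to-left strategy can be sketched end to end under several assumptions that are not stated on the slides: the live zone is kept as a stack of half-open ranges of candidate window positions, the next attempt is made in the middle of a range, and both shift tables are computed by brute force from the compatibility conditions (the slides state that linear preprocessing is possible, which this sketch does not attempt). All names are hypothetical.

```python
def shift_tables(x):
    """rshift[i+1], lshift[i+1] for a mismatch at position i (i = -1: full match)."""
    m = len(x)
    rshift, lshift = [m] * (m + 1), [m] * (m + 1)
    for i in range(-1, m):
        for d in range(1, m + 1):
            # Right shift d: the matched suffix x[i+1..m-1] must reoccur d
            # positions to the left, preceded by a symbol different from x[i].
            fits = all(x[k - d] == x[k] for k in range(max(i + 1, d), m))
            differs = i < 0 or i - d < 0 or x[i - d] != x[i]
            if fits and differs:
                rshift[i + 1] = d
                break
        for d in range(1, m + 1):
            # Left shift d: a suffix of x must match a prefix of the matched
            # part, and x[i+d] (if it exists) must differ from x[i].
            fits = all(x[k + d] == x[k] for k in range(i + 1, m - d))
            differs = i < 0 or i + d >= m or x[i + d] != x[i]
            if fits and differs:
                lshift[i + 1] = d
                break
    return rshift, lshift


def dead_zone_search(x, y):
    """Return the sorted start positions of x in y."""
    m, n = len(x), len(y)
    if m == 0 or n < m:
        return []
    rshift, lshift = shift_tables(x)
    occurrences = []
    live = [(0, n - m + 1)]  # half-open ranges of window positions still alive
    while live:
        lo, hi = live.pop()
        if lo >= hi:
            continue
        j = (lo + hi) // 2  # assumption: attempt in the middle of the range
        i = m - 1
        while i >= 0 and x[i] == y[j + i]:  # right-to-left comparisons
            i -= 1
        if i < 0:
            occurrences.append(j)
        # Positions strictly between j - lshift and j + rshift become dead.
        live.append((lo, j - lshift[i + 1] + 1))
        live.append((j + rshift[i + 1], hi))
    return sorted(occurrences)
```

Each attempt kills the position it examined plus every position ruled out by the two shifts, so the live ranges shrink from both sides of the attempt rather than from the left only.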

  24. Outline: (1) Introduction; (2) Right-to-left; (3) Right-to-left with memory; (4) Alternating searching: right – left.

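  25. Right-to-left with memory. [Figure: comparisons with the pattern x aligned at position j of the text y; the pattern symbol a at position i differs from the text symbol b and the suffix z matches; k marks a position inside z.] When $x[i] \neq y[j+i]$ and $x[i+1\,..\,m-1] = y[j+i+1\,..\,j+m-1]$, then $skip_1[j+k] = k - i$ and $skip_2[j+k] = k$ for $i+1 \le k \le m-1$, so that $skip_2$ records the pattern position and $skip_1$ the length of the matched factor ending there.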

  26-29. Right-to-left with memory. [Figure built up over four animation steps: the pattern x aligned at position j of the text y with current position i, and a second copy of x showing the factor of length ℓ ending at position k.] If $k = skip_2[i+j] > 0$, it means that $x[k-\ell+1\,..\,k] = y[i+j-\ell+1\,..\,i+j]$ with $\ell = skip_1[i+j]$, and furthermore $x[k-\ell] \neq y[i+j-\ell]$ if $k \ge \ell$. We need to know whether $y[i+j-\ell+1\,..\,i+j] = x[i-\ell+1\,..\,i]$, and thus we need to know whether $x[k-\ell+1\,..\,k] = x[i-\ell+1\,..\,i]$.

  30. Right-to-left with memory. Is $x[k-\ell+1\,..\,k] = x[i-\ell+1\,..\,i]$? This amounts to computing the longest common prefix of the suffixes of $\tilde{x}$ starting at positions $m-1-k$ and $m-1-i$. It can be answered in constant time after linear preprocessing: RMQ on the LCP of $\tilde{x}$.
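
The reduction to a longest-common-prefix query can be illustrated with a small sketch. It swaps the constant-time RMQ-on-LCP query for a direct character scan, so it only shows which two suffixes of the reversed pattern are compared; `lcp` and `same_factor` are hypothetical helper names.

```python
def lcp(s, p, q):
    """Length of the longest common prefix of the suffixes s[p:] and s[q:]."""
    length = 0
    while p + length < len(s) and q + length < len(s) and s[p + length] == s[q + length]:
        length += 1
    return length


def same_factor(x, k, i, ell):
    """Test whether x[k-ell+1 .. k] equals x[i-ell+1 .. i]."""
    m = len(x)
    reversed_x = x[::-1]
    # Position p of x corresponds to position m-1-p of the reversed pattern,
    # so factors ending at k and i become prefixes of two suffixes.
    return lcp(reversed_x, m - 1 - k, m - 1 - i) >= ell
```

With `x = "abcab"`, the factors of length 2 ending at positions 4 and 1 are both `ab`, so `same_factor(x, 4, 1, 2)` holds.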

  31. Right-to-left with memory. $skip_1$ and $skip_2$ need a stack: the mismatch position is not known for all the matching positions. $O(n)$ space.

  32. Outline: (1) Introduction; (2) Right-to-left; (3) Right-to-left with memory; (4) Alternating searching: right – left.

  33. Alternating searching: right – left. Order of comparisons: $x[m-1], x[0], x[m-2], x[1], x[m-3], \ldots$ Four shift arrays: right shift after a right mismatch, stored in array rsrm; left shift after a right mismatch, stored in array lsrm; right shift after a left mismatch, stored in array rslm; left shift after a left mismatch, stored in array lslm.
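
The order of comparisons can be generated with a short sketch; the function name `alternating_order` is hypothetical.

```python
def alternating_order(m):
    """Pattern positions in the order m-1, 0, m-2, 1, m-3, ..."""
    lo, hi = 0, m - 1
    order = []
    while lo <= hi:
        order.append(hi)      # next position from the right end
        hi -= 1
        if lo <= hi:
            order.append(lo)  # next position from the left end
            lo += 1
    return order
```

For `m = 5` this yields the positions 4, 0, 3, 1, 2: positions taken from the right end can produce a "right mismatch" and those taken from the left end a "left mismatch".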

  34. Alternating searching: right – left. Two conditions. $occCond'_x(i,d) = (0 < d \le i \text{ and } x[i-d] \neq x[i]) \text{ or } (i < d)$. $suffCond'_x(i,d) = (0 < d \le m-2-i \text{ and } x[d\,..\,m-2-i] \text{ is a prefix of } x \text{ and } x[i-d+1\,..\,m-d-1] \text{ is a suffix of } x) \text{ or } (m-2-i < d \le i+1 \text{ and } x[i-d+1\,..\,m-d-1] \text{ is a suffix of } x) \text{ or } (i+1 < d \text{ and } x[0\,..\,m-d-1] \text{ is a suffix of } x)$. Then $rsrm[i] = \min\{d \mid occCond'_x(i,d) \text{ and } suffCond'_x(i,d) \text{ are satisfied}\}$.
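
The definition of rsrm can be checked on small patterns with a brute-force sketch that transcribes the two conditions literally. It assumes that a right mismatch can only occur at positions i ≥ ⌊m/2⌋ (the positions reached from the right end in the alternating order) and that candidate shifts are limited to 1 ≤ d ≤ m; the function names are hypothetical and no attempt is made at efficient preprocessing.

```python
def occ_cond(x, i, d):
    # The symbol aligned with the mismatch position must differ from x[i],
    # unless the shift moves the start of the pattern past position i.
    return (0 < d <= i and x[i - d] != x[i]) or i < d


def suff_cond(x, i, d):
    m = len(x)
    if 0 < d <= m - 2 - i:
        # Both the matched prefix and the matched suffix are re-aligned.
        return x.startswith(x[d:m - 1 - i]) and x.endswith(x[i - d + 1:m - d])
    if m - 2 - i < d <= i + 1:
        # Only the matched suffix is re-aligned.
        return x.endswith(x[i - d + 1:m - d])
    if i + 1 < d:
        # Only a prefix of x overlaps the matched suffix.
        return x.endswith(x[0:m - d])
    return False


def rsrm_table(x):
    """rsrm[i] for every position i where a right mismatch can occur."""
    m = len(x)
    return {
        i: min(d for d in range(1, m + 1)
               if occ_cond(x, i, d) and suff_cond(x, i, d))
        for i in range(m // 2, m)
    }
```

For `x = "abab"` a right mismatch at position 2 leaves no compatible alignment shorter than the whole pattern, so the sketch gives a shift of 4 there, while a mismatch at position 3 only allows a shift of 1.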

  35. Right shift after a right mismatch. A prefix u and a suffix v (of the same length) of x match the text and a mismatch occurs with symbol a at position i of x. [Figure: x = u a v and its copy shifted by d.] Either the suffix v of x reoccurs preceded by a symbol b different from a, and a prefix u′ of x matches a suffix of u; [Figure: x = u a v and its copy shifted by d.] or only the suffix v of x reoccurs, preceded by a symbol b different from a;

  36. Right shift after a right mismatch. [Figure: x = u a v and its copy shifted by d.] Remaining case: only a prefix v′ of x matches a suffix of v.
