

1. Practical and Optimal String Matching. Kimmo Fredriksson, Department of Computer Science, University of Joensuu, Finland. Szymon Grabowski, Technical University of Łódź, Computer Engineering Department. SPIRE'05.

2. Problem setting. The classic string matching problem: given a text T = t_0 t_1 ... t_{n-1} and a pattern P = p_0 p_1 ... p_{m-1} over some finite alphabet Σ of size σ, find the occurrences of P in T. We focus on the case where m is relatively small ⇒ bit-parallelism.

3. Previous work. A vast number of algorithms exist. Some of the most well-known (classics): Knuth-Morris-Pratt: the first worst-case O(n) time algorithm. Boyer-Moore(-Horspool) family: numerous variants, sublinear on average. Shift-or (bit-parallel): O(n) time for m ≤ w (Baeza-Yates & Gonnet, 1992). BNDM family: optimal on average for m ≤ w; variants include SBNDM (Navarro, 2001; Peltola & Tarhio, 2003), LNDM (He & Fang, 2004), FNDM (Holub & Durian, 2005).

4. Previous work. In practice, the best current algorithms for short patterns are the BNDM family of algorithms (Navarro & Raffinot, 2000).

5. This work. We develop a novel pattern partitioning technique that allows us to use shift-or while skipping text characters. The algorithm has optimal average-case running time O(n log_σ(m)/m) if m = O(w). Very simple to implement, simple inner loop (comparable to plain shift-or) ⇒ very efficient in practice. O(nm) worst case, but this can be improved to O(n) without destroying the simplicity of the search algorithm.

6. Our algorithm: the idea. The algorithm is based on the preprocessing / filtering / verification paradigm. The preprocessing phase generates q different alignments of the pattern, each containing only every q-th pattern character, i.e. we partition the pattern into q pieces. The filtering phase searches all the pieces in parallel using the shift-or algorithm, reading only every q-th text character. If any of the pieces match, then we invoke a verification algorithm.

7. Preprocessing. Given a pattern P = p_0 p_1 ... p_{m-1}, generate a set of q patterns P^0, ..., P^{q-1} as follows: P^j = p_j p_{j+q} p_{j+2q} ..., i.e. we generate q different alignments of the original pattern P, each alignment containing only every q-th character. Each new pattern has length ⌊m/q⌋, so the total length of the patterns is q⌊m/q⌋ ≤ m. For example, if P = abcdef and q = 3, then P^0 = ad, P^1 = be and P^2 = cf.
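A minimal C sketch of this preprocessing step (illustrative only; the function name and buffer sizes are not from the paper):

    #include <stdio.h>

    /* Split pattern P of length m into q alignments P^0 .. P^(q-1), where P^j
       takes every q-th character of P starting at offset j.  Assumes q divides m
       and that each piece fits into out[j]. */
    void build_alignments(const char *P, int m, int q, char out[][64])
    {
        for (int j = 0; j < q; j++) {
            int len = 0;
            for (int k = j; k < m; k += q)
                out[j][len++] = P[k];
            out[j][len] = '\0';
        }
    }

    int main(void)
    {
        char pieces[8][64];
        build_alignments("abcdef", 6, 3, pieces);
        for (int j = 0; j < 3; j++)
            printf("P^%d = %s\n", j, pieces[j]);   /* prints ad, be, cf */
        return 0;
    }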

8. Preprocessing: the rationale. Assume that P occurs at t_i ... t_{i+m-1}. Then one of the pieces (which one depends only on i mod q) occurs in the subsampled text, i.e. in the sequence formed by every q-th character of T. Hence (1) we can use the set {P^0, ..., P^{q-1}} as a filter for the pattern P, and (2) the filter needs to scan only every q-th character of T.

9. Preprocessing: the rationale. [Figure: the pattern P = abcdef is split with q = 3 into the pieces P^0 = ad, P^1 = be and P^2 = cf; an occurrence of P in the text T is caught through the piece whose characters fall on the sampled (every q-th) text positions, and the pieces are searched together as the concatenation P' = adbecf.]

10. Prelude to filtering: the shift-or algorithm. The algorithm is based on a non-deterministic automaton. The automaton for P = abcdef is: [figure: states 1..7 in a row, transitions labeled a, b, c, d, e, f, and a Σ self-loop on the initial state]. The transitions are encoded in a table B of bit masks: for c ∈ Σ, the mask B[c] has the i-th bit set to 0 iff p_i = c. The bit vector D has one bit per state of the automaton; the i-th bit of D is set to 0 iff the corresponding state is active (initially all bits are 1). It can be shown that the automaton can be simulated as D ← (D << 1) | B[t_i].

11. Prelude to filtering: the shift-or algorithm. If after the simulation step for t_i the last bit (bit m-1) of D is zero, then P occurs at t_{i-m+1} ... t_i. This can be detected as D & E ≠ E, where E has only bit m-1 set. Clearly each step of the automaton is simulated in O(1) time (for m ≤ w), which leads to O(n) total time.
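For reference, plain shift-or in full as a minimal C sketch (assumes m ≤ 64 and a 64-bit word; not taken from the paper's code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Report every text position where P (|P| = m <= 64) ends in T. */
    void shift_or(const char *T, size_t n, const char *P, size_t m)
    {
        uint64_t B[256];
        for (int c = 0; c < 256; c++)
            B[c] = ~0ULL;                            /* all bits 1                */
        for (size_t k = 0; k < m; k++)
            B[(unsigned char)P[k]] &= ~(1ULL << k);  /* clear bit k for p_k       */

        uint64_t D = ~0ULL;                          /* all states inactive       */
        uint64_t E = 1ULL << (m - 1);                /* the accepting-state bit   */
        for (size_t i = 0; i < n; i++) {
            D = (D << 1) | B[(unsigned char)T[i]];   /* one automaton step        */
            if ((D & E) != E)                        /* accepting state is active */
                printf("occurrence ending at position %zu\n", i);
        }
    }

    int main(void)
    {
        const char *T = "xabcdefxxabcdef";
        shift_or(T, strlen(T), "abcdef", 6);         /* ends at positions 6, 14   */
        return 0;
    }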

12. Filtering. The whole set of patterns can be searched simultaneously using the shift-or algorithm (Baeza-Yates & Gonnet, 1992). All the patterns are preprocessed together, as if they were concatenated: for j = 0, ..., q-1, we effectively preprocess the pattern P' = P^0 P^1 ... P^{q-1}. If the piece P^j matches, then bit (j+1)⌊m/q⌋ - 1 of D is zero. This can be detected as D & E ≠ E, where E has every ⌊m/q⌋-th bit set to 1.

13. Filtering: the simplicity illustrated.
Plain shift-or search:
    D ← all ones; i ← 0
    while i < n do
        D ← (D << 1) | B[t_i]
        if D & E ≠ E then report match
        i ← i + 1
Our shift-or search:
    D ← all ones; i ← 0
    while i < n do
        D ← (D << 1) | B[t_i]
        if D & E ≠ E then Verify
        i ← i + q
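A C sketch of the filtering loop (one possible realization, not the paper's exact code): the q pieces of length L = m/q are packed contiguously into one 64-bit word. One detail the pseudocode above glosses over is that each piece's automaton must be allowed to (re)start at every sampled position; the sketch handles this with one extra AND against a restart mask, which keeps the loop essentially as simple. Zeros that leak across piece boundaries can only cause extra verifications, never missed occurrences. The verify() routine is sketched after the next slide.

    #include <stdint.h>
    #include <stddef.h>

    void verify(const char *T, size_t n, const char *P, size_t m, int q, size_t i);

    /* Search the pieces P^0 .. P^(q-1) in parallel, reading every q-th character.
       Assumes q divides m and q*(m/q) <= 64. */
    void filter_search(const char *T, size_t n, const char *P, size_t m, int q)
    {
        size_t L = m / (size_t)q;
        uint64_t B[256];
        for (int c = 0; c < 256; c++)
            B[c] = ~0ULL;
        for (size_t j = 0; j < (size_t)q; j++)       /* piece j -> bits jL .. jL+L-1  */
            for (size_t k = 0; k < L; k++)
                B[(unsigned char)P[j + k * (size_t)q]] &= ~(1ULL << (j * L + k));

        uint64_t E = 0, RESTART = ~0ULL;
        for (size_t j = 0; j < (size_t)q; j++) {
            E |= 1ULL << (j * L + L - 1);            /* the q piece-match bits        */
            RESTART &= ~(1ULL << (j * L));           /* force a fresh start per piece */
        }

        uint64_t D = ~0ULL;
        for (size_t i = 0; i < n; i += (size_t)q) {  /* skip q-1 characters per step  */
            D = ((D << 1) & RESTART) | B[(unsigned char)T[i]];
            if ((D & E) != E)                        /* some piece may have matched   */
                verify(T, n, P, m, q, i);
        }
    }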

14. Verification. If any of the pattern pieces in {P^0, ..., P^{q-1}} matches, we verify whether the original pattern P matches (with the corresponding alignment). This can be done with a brute-force algorithm, at O(m) worst-case cost per checked alignment.
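A matching brute-force verify(), plus a toy driver; together with the filtering sketch above this forms a complete illustrative program (again, not the paper's code). When the filter fires after reading T[i], each piece j yields exactly one candidate starting position, i - (m/q - 1)q - j, which is checked character by character.

    #include <stdio.h>
    #include <stddef.h>

    void filter_search(const char *T, size_t n, const char *P, size_t m, int q);

    /* Check, for every piece j, the one alignment of P that would put the last
       sampled character of P^j at text position i. */
    void verify(const char *T, size_t n, const char *P, size_t m, int q, size_t i)
    {
        size_t L = m / (size_t)q;
        for (size_t j = 0; j < (size_t)q; j++) {
            size_t span = (L - 1) * (size_t)q + j;   /* distance from start to T[i] */
            if (i < span || i - span + m > n)
                continue;
            size_t start = i - span, k = 0;
            while (k < m && T[start + k] == P[k])
                k++;
            if (k == m)
                printf("occurrence starting at position %zu\n", start);
        }
    }

    int main(void)
    {
        const char *T = "xxabcdefxxxabcdefx";       /* occurrences at 2 and 11 */
        filter_search(T, 18, "abcdef", 6, 3);
        return 0;
    }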

15. Complexity. The filtering time is O(n/q). Assuming that each character occurs with probability 1/σ, the probability that a piece P^j occurs at a given text position is (1/σ)^⌊m/q⌋. The verification cost is therefore on average at most O(n m σ^{-⌊m/q⌋}). We select q so that the verification cost does not dominate the filtering cost, i.e. q = Θ(m / log_σ m). The total average time is then O(n log_σ(m)/m), which is optimal.
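Spelling the calculation out as a short derivation (the specific constant 2 below is an illustrative choice, not quoted from the paper):

    \[
      \underbrace{O\!\left(\tfrac{n}{q}\right)}_{\text{filtering}}
      \;+\;
      \underbrace{\tfrac{n}{q}\cdot q\cdot O(m)\cdot \sigma^{-\lfloor m/q\rfloor}}_{\text{expected verification cost}}
    \]
    Choosing $\lfloor m/q\rfloor = 2\log_\sigma m$, i.e.\ $q = \Theta(m/\log_\sigma m)$,
    gives $\sigma^{-\lfloor m/q\rfloor} = m^{-2}$, so the verification term is
    $O(n/m) = o\bigl(n\log_\sigma(m)/m\bigr)$ and the total is
    $O(n/q) = O\bigl(n\log_\sigma(m)/m\bigr)$, matching the known average-case
    lower bound for exact string matching.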

16. Long patterns. If m > w, we must use several computer words, and the asymptotic running time grows by a factor of O(⌈m/w⌉). The trick in (Peltola & Tarhio, 2003) to make BNDM work with m > w can be applied to our algorithm too. Omitting the details, we obtain an average time that depends on w rather than on m, which is not optimal anymore.

17. Linear worst case time. The worst-case running time with brute-force verification is O(nm). Use any linear worst-case time algorithm for the verifications, and do the verifications incrementally, saving the search state of the worst-case algorithm after each verification. With this 'standard trick' the worst case becomes O(n). Not a real problem in practice: if verification time is a problem, then the filter does not work well, and one can use the linear-time algorithm alone instead.
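One concrete way to realize the 'standard trick' (a sketch; the slide only requires some linear worst-case algorithm, and KMP is used here as one possibility): the verifier keeps its text position and matched-prefix length between calls, so it never re-reads a text character and its total work over all verifications is O(n).

    #include <stdio.h>
    #include <stddef.h>

    /* fail[k] = length of the longest proper border of P[0..k] (KMP failure links). */
    void build_fail(const char *P, size_t m, size_t *fail)
    {
        fail[0] = 0;
        for (size_t k = 1, len = 0; k < m; k++) {
            while (len > 0 && P[k] != P[len])
                len = fail[len - 1];
            if (P[k] == P[len])
                len++;
            fail[k] = len;
        }
    }

    /* Persistent verifier state: next unread text position, current matched length. */
    static size_t kmp_i = 0, kmp_j = 0;

    /* Resume KMP over T up to and including position 'upto'; no character is read twice. */
    void verify_incremental(const char *T, size_t n, const char *P, size_t m,
                            const size_t *fail, size_t upto)
    {
        while (kmp_i <= upto && kmp_i < n) {
            while (kmp_j > 0 && T[kmp_i] != P[kmp_j])
                kmp_j = fail[kmp_j - 1];
            if (T[kmp_i] == P[kmp_j])
                kmp_j++;
            if (kmp_j == m) {
                printf("occurrence ending at position %zu\n", kmp_i);
                kmp_j = fail[kmp_j - 1];
            }
            kmp_i++;
        }
    }

    int main(void)                                   /* toy driver */
    {
        size_t fail[3];
        build_fail("aba", 3, fail);
        verify_incremental("abababa", 7, "aba", 3, fail, 6);  /* ends at 2, 4, 6 */
        return 0;
    }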

18. Implementation. In modern pipelined CPUs branching is costly ⇒ unroll the inner loop, i.e. repeat the shift-or step inline several times between match tests. The bit positions indicating the occurrences will then overflow ⇒ reserve a few extra bits per pattern piece to avoid interference (q⌊m/q⌋ bits plus the reserved bits in total). Verification is then done only every few steps, and only for those (at most q) alignments that could match. Much faster in practice.

19. Experimental results. Implementation in C, compiled using icc 8.1 with full optimizations, run on a 2.4 GHz Pentium 4 (w = 32) with 512 MB of RAM, running Linux 2.4.20-8. 100 patterns were randomly extracted from the text; each pattern was then searched for separately. We report the average speed in megabytes per second. Our data: real DNA and protein data, English natural language, and random ASCII text.
