Fast and Linear-Time String Matching Algorithms Based on the Distances of q-Gram Occurrences
Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
Graduate School of Information Sciences, Tohoku University, Japan
String matching problem
Input: text $T$, pattern $P$
Output: all positions $i$ in $T$ such that $T[i : i + |P| - 1] = P$
Example:
position: 1 2 3 4 5 6 7 8 9 10 11 12 13
T: a b a a b a b b a b b a b
P: a b b a
Output: 6, 9
Naive solution: $O(nm)$, where $n = |T|$ and $m = |P|$
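The naive solution mentioned above can be sketched as follows. This is a minimal illustration, not code from the slides; the function name `naive_search` is an assumption, and positions are reported 1-based to match the example.

```python
def naive_search(T, P):
    """Return all 1-based positions i with T[i : i + |P| - 1] = P.

    Each of the n - m + 1 alignments is compared in O(m) time,
    giving the O(nm) bound of the naive solution.
    """
    n, m = len(T), len(P)
    return [i + 1 for i in range(n - m + 1) if T[i:i + m] == P]


# The example from the slide.
print(naive_search("abaababbabbab", "abba"))
```

On the slide's example this reports positions 6 and 9.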
String matching algorithms
($n$: text length, $m$: pattern length, $\sigma$: alphabet size)
• Knuth-Morris-Pratt (KMP) algorithm [Knuth+, 1977]
  • Preprocessing time: $O(m)$
  • Searching time: $O(n)$
• Boyer-Moore algorithm [Boyer & Moore, 1977]
  • Preprocessing time: $O(m + \sigma)$
  • Searching time: $O(nm)$
  • Runs fast in practice
Our contributions
($n = |T|$, $m = |P|$, $\sigma = |\Sigma|$: alphabet size, $\omega$: word length, $q$: $q$-gram length)
• Propose two string matching algorithms based on the distances of the $q$-gram occurrences
• Both algorithms work in linear time in the input string size
[Figure: map of the fastest algorithm for each dataset (English text, genome sequence, Fibonacci string) over pattern lengths $m$ = 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
Comparing 15 powerful algorithms announced from 1977 to 2019 with the proposed algorithms:
• BNDMq [Navarro & Raffinot, 1998]: preprocessing $O(m + \sigma)$, searching $O(nm \lceil m/\omega \rceil)$
• SBNDMq [Holub & Durian, 2005]: preprocessing $O(m + \sigma)$, searching $O(nm \lceil m/\omega \rceil)$
• FJS [Franek+, 2005]: preprocessing $O(m + \sigma)$, searching $O(n)$
• HASHq [Lecroq, 2007]: preprocessing $O(mq)$, searching $O(n(m + q))$
• BSDMq [Faro & Lecroq, 2012]: preprocessing $O(m)$, searching $O(nm)$
• WFRq [Cantone+, 2017]: preprocessing $O(m)$, searching $O(nm)$
• LWFRq [Cantone+, 2019]: preprocessing $O(m)$, searching $O(n)$
• DISTq (new): preprocessing $O(mq)$, searching $O(nq)$
• LDISTq (new): preprocessing $O(m)$, searching $O(n)$
Naive solution: $O(nm)$
Existing algorithms
KMP algorithm [Knuth+, 1977]
($n$: text length, $m$: pattern length)
Example (the slide marks each character as match, mismatch, or match without comparison):
T: a b a b a b b b c a a c a c
P: a b a b c
       a b a b c    (after shifting by KMP_Shift[5] = 2)
Strong_Bord(j)
Input: a mismatch position $j$ in the pattern
Output: the maximum value $k$ $(0 \le k < j)$ that satisfies $P[1 : k] = P[j - k : j - 1]$ and $P[k + 1] \ne P[j]$ ($-1$ if no such $k$ exists)
KMP_Shift[j]: the shift amount when there is a mismatch at the $j$-th position of the pattern
$\mathrm{KMP\_Shift}[j] = j - \mathrm{Strong\_Bord}(j) - 1$
j:         1 2 3 4 5 6
P[j]:      a b a b c
KMP_Shift: 1 1 3 3 2 5
Preprocessing time: $O(m)$
Searching time: $O(n)$
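The definitions of Strong_Bord and KMP_Shift can be checked with a small sketch. It evaluates Strong_Bord directly from its definition, so its preprocessing is far slower than the $O(m)$ stated on the slide; it is meant only to reproduce the table. The names `strong_bord`, `kmp_shift_table`, and `kmp_search` are assumptions. One further assumption: for $j = m + 1$ (a shift after a full match), $P[j]$ does not exist, so the mismatch condition is treated as always true and $k$ is restricted to $k < m$.

```python
def strong_bord(P, j):
    """Strong_Bord(j) for a 1-based position j in 1..m+1, by brute force."""
    m = len(P)
    for k in range(min(j - 1, m - 1), -1, -1):
        # P[1:k] = P[j-k : j-1] in the slide's 1-based inclusive notation.
        if P[:k] != P[j - 1 - k:j - 1]:
            continue
        # P[k+1] != P[j]; assumed to hold trivially when j = m + 1.
        if j == m + 1 or P[k] != P[j - 1]:
            return k
    return -1


def kmp_shift_table(P):
    """KMP_Shift[j] = j - Strong_Bord(j) - 1 for j = 1..m+1 (list index j-1)."""
    return [j - strong_bord(P, j) - 1 for j in range(1, len(P) + 2)]


def kmp_search(T, P):
    """Return all 1-based occurrence positions of P in T."""
    n, m = len(T), len(P)
    shift = kmp_shift_table(P)
    out = []
    i = 0  # 0-based start of the current alignment
    j = 0  # number of leading pattern characters already known to match
    while i + m <= n:
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:
            out.append(i + 1)
        s = shift[j]        # mismatch (or end of pattern) at 1-based position j + 1
        i += s
        j = max(j - s, 0)   # characters matched without comparison
    return out


print(kmp_shift_table("ababc"))
print(kmp_search("abaababbabbab", "abba"))
```

For the pattern `ababc` the table agrees with the slide (1, 1, 3, 3, 2, 5), and the search reports positions 6 and 9 on the earlier example.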
HASHq algorithm [Lecroq, 2007]
($n$: text length, $m$: pattern length, $\sigma$: alphabet size)
• Determines the equivalence of $q$-grams using the hash value of $q$-grams
$h(x) = (2^{q-1} \cdot x[1] + 2^{q-2} \cdot x[2] + \cdots + 2 \cdot x[q-1] + x[q]) \bmod 2^8$
(Treat characters as the ASCII code)
$\mathrm{shift}[h(x)] = m - \max(\{\, j \mid h(P[j - q + 1 : j]) = h(x),\ q \le j \le m \,\} \cup \{\, q - 1 \,\})$
Example with P = a b a a b b a b:
x       h(x)   shift[h(x)]
aba     681    5
baa     683    4
aab     680    3
abb     682    2
bba     685    1
bab     684    0
Others  -      6  (= m - q + 1)
T: a a a b a b a a b b a b a b b
P: a b a a b b a b
           a b a a b b a b    (after shifting by shift[h(baa)] = 4)
Preprocessing time: $O(mq)$
Searching time: $O(n(m + q))$
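The shift table above can be reproduced with a short sketch. It is an illustration under assumptions rather than Lecroq's implementation: the names `h`, `build_shift`, and `hashq_search` are hypothetical, $q = 3$ is inferred from the 3-grams in the table, and after a candidate position (shift 0) is verified the window simply advances by one, which is safe but simpler than the original algorithm. Note that the $h(x)$ column in the table lists values before the $\bmod\ 2^8$ reduction (for example 681 for `aba`), while the code stores the reduced value.

```python
def h(x):
    """Hash of a q-gram: (2^(q-1) x[1] + ... + 2 x[q-1] + x[q]) mod 2^8."""
    v = 0
    for c in x:
        v = 2 * v + ord(c)  # characters are treated as ASCII codes
    return v % 256


def build_shift(P, q):
    """shift[h(x)] = m - (largest end position j of a q-gram of P hashing to h(x))."""
    m = len(P)
    shift = [m - q + 1] * 256       # default for hashes that never occur in P
    for j in range(q, m + 1):       # j is the 1-based end position of a q-gram
        shift[h(P[j - q:j])] = m - j  # later j overwrites earlier: keeps the max
    return shift


def hashq_search(T, P, q=3):
    """Return all 1-based occurrence positions of P in T (requires m >= q)."""
    n, m = len(T), len(P)
    if m < q:
        raise ValueError("pattern must be at least q characters long")
    shift = build_shift(P, q)
    out = []
    e = m  # 1-based end position of the current window in T
    while e <= n:
        s = shift[h(T[e - q:e])]
        if s == 0:
            # The last q-gram hashes like the pattern's last q-gram: verify.
            if T[e - m:e] == P:
                out.append(e - m + 1)
            s = 1  # simplification: advance by one after a candidate
        e += s
    return out


shift = build_shift("abaabbab", 3)
print([shift[h(g)] for g in ("aba", "baa", "aab", "abb", "bba", "bab")])
print(hashq_search("aaababaabbababb", "abaabbab"))
```

On the slide's example the shifts for the six 3-grams come out as 5, 4, 3, 2, 1, 0, and the single occurrence is found at position 5 after one shift of 4.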