
String Matching

String matching problem: given a string T (text) and a string P (pattern) over an alphabet Σ, with |T| = n and |P| = m, report all starting positions of occurrences of P in T. Slides by Inge Li Gørtz.


  1. String Matching (Inge Li Gørtz; CLRS 32)
     Outline:
     • Knuth-Morris-Pratt (KMP)
     • Finite automaton
     String matching problem:
     • String T (text) and string P (pattern) over an alphabet Σ.
     • |T| = n, |P| = m.
     • Report all starting positions of occurrences of P in T.
     • Example: P = ababaca, T = bacbababababacab.
     Strings:
     • ε: empty string.
     • Prefix/suffix: let v = xy.
       • x is a prefix of v; if y ≠ ε, x is a proper prefix of v.
       • y is a suffix of v; if x ≠ ε, y is a proper suffix of v.
     • Example: S = aabca.
       • The suffixes of S are: aabca, abca, bca, ca and a.
       • The strings abca, bca, ca and a are proper suffixes of S.
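The suffix example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `suffixes` and `proper_suffixes` are made up here, and the empty string ε is left out so that the output matches the list on the slide.

```python
def suffixes(s):
    """Return all non-empty suffixes of s, longest first (ε is omitted)."""
    return [s[i:] for i in range(len(s))]


def proper_suffixes(s):
    """Return the non-empty suffixes of s other than s itself."""
    return suffixes(s)[1:]


S = "aabca"
print(suffixes(S))         # ['aabca', 'abca', 'bca', 'ca', 'a']
print(proper_suffixes(S))  # ['abca', 'bca', 'ca', 'a']
```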

  2. A naive string matching algorithm
     • Example: P = ababaca, T = bacbababababacab.
     • [Figure: P aligned under T at each of the 10 possible starting positions in turn.]
     Improving the naive algorithm
     • P = aaababa.
     • [Figure: successive alignments of P under T = aaabaaabababacabb and under T = aaabaaaababaacabb.]
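A minimal sketch of the naive algorithm, assuming it simply tries every alignment of P under T as in the figure. The function name `naive_match` and the choice to report 1-indexed positions are assumptions made for this example. In the worst case it compares O(nm) characters.

```python
def naive_match(text, pattern):
    """Report all 1-indexed starting positions of pattern in text."""
    n, m = len(text), len(pattern)
    hits = []
    # Try every alignment s of the pattern under the text.
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            hits.append(s + 1)  # convert to a 1-indexed position
    return hits


print(naive_match("bacbababababacab", "ababaca"))  # [9]
```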

  3. Improving the naive algorithm
     • P = aaababa, T = aaabaaaababaacabb.
     • If we matched 5 characters of P and then fail: compare the failed character to the 2nd character in P.
     • If we matched 3 characters of P and then fail: compare the failed character to the 3rd character in P.
     • If we matched all characters of P: compare the next character to the 2nd character in P.
     • Table entries shown so far ("-" means not yet filled in):

       #matched              0  1  2  3  4  5  6  7
       if fail, compare to   -  -  -  3  -  2  -  2

  4. Improving the naive algorithm
     • P = aaababa. For each number of matched characters, the position in P to compare the failed character to:

       #matched              0  1  2  3  4  5  6  7
       if fail, compare to   1  1  2  3  1  2  1  2

     KMP
     • KMP: P = aaababa. Can be seen as a finite automaton with failure links (state 0 is the starting state, state 7 the accepting state):

       #matched              0  1  2  3  4  5  6  7
       if fail, go to        0  0  1  2  0  1  0  1

     • Failure link: longest prefix of P that is a proper suffix of what we have matched until now.
     • [Figure: a string S overlapping P, marking the longest prefix of P that is a suffix of S.]
     • In state i after reading T[j]: P[1..i] is the longest prefix of P that is a suffix of T[1..j].
     • Can follow several failure links when matching one character, e.g. P = ababaca, T = ababaa.
     • Matching: T = aaabaaababaa.
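The matching procedure can be sketched as follows, with the "if fail, go to" row of the table passed in as a list. The function name `kmp_match`, the 0-indexed Python strings, and the 1-indexed reported positions are choices made for this illustration. The pattern is assumed to be non-empty.

```python
def kmp_match(text, pattern, fail):
    """Report 1-indexed occurrences of pattern in text.

    fail[i] is the failure link of state i (i characters matched),
    for i = 0..len(pattern). The pattern is assumed non-empty.
    """
    m = len(pattern)
    hits = []
    state = 0  # number of pattern characters currently matched
    for j, ch in enumerate(text, start=1):
        # Follow failure links until ch extends the match or we reach state 0.
        while state > 0 and pattern[state] != ch:
            state = fail[state]
        if pattern[state] == ch:
            state += 1  # forward edge
        if state == m:
            hits.append(j - m + 1)
            state = fail[state]  # keep going to find later occurrences
    return hits


# Failure links for P = aaababa, copied from the table above.
fail = [0, 0, 1, 2, 0, 1, 0, 1]
print(kmp_match("aaabaaababaa", "aaababa", fail))  # [5]
```

The inner `while` loop is where several failure links may be followed for a single character of T.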

  5. KMP Analysis
     • Analysis. |T| = n, |P| = m.
     • Lemma. The running time of KMP matching is O(n).
     • How many times can we follow a forward edge? Each time we follow a forward edge we read a new character of T, so at most n times.
     • How many backward edges can we follow (compared to forward edges)? #backward edges followed ≤ #forward edges followed ≤ n.
     • What else do we use time for? If we are in the start state and the character read in T does not match the forward edge, we stay there.
     • Total time = #non-matched characters in start state + #forward edges followed + #backward edges followed ≤ 2n. Every character of T is consumed either by a forward edge or by a mismatch in the start state, so the first two terms sum to at most n, and the third is at most n.
     Computation of failure links
     • Failure link: longest prefix of P that is a proper suffix of what we have matched until now.
     • Computing failure links: use the KMP matching algorithm.
     • Example (P = ababaca): the longest prefix of P that is a proper suffix of 'abab' is the longest prefix of P that is a suffix of 'bab'.

  6. Computation of failure links
     • Failure link: longest prefix of P that is a proper suffix of what we have matched until now.
     • Computing failure links: as in the KMP matching algorithm (only need failure links that are already computed).
     • Example (P = ababaca): the longest prefix of P that is a suffix of 'bab' can be found by using KMP to match 'bab'.
     • Need to match: a, ab, aba, abab, ababa, ababac.
     KMP
     • Computing π (the failure-link values): as in the KMP matching algorithm (only need π values that are already computed).
     • Running time: O(n + m).
     • Lemma. Total number of comparisons of characters in KMP is at most 2n.
     • Corollary. Total number of comparisons of characters in the preprocessing of KMP is at most 2m.
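One way to sketch the preprocessing is to run the same matching loop on P itself, so that each new failure link only depends on links that are already computed. The name `failure_links` and the array layout (`fail[i]` for state i, with `fail[0] = fail[1] = 0`) are assumptions made for this example.

```python
def failure_links(pattern):
    """Return fail[0..m], where fail[i] is the length of the longest
    prefix of pattern that is a proper suffix of pattern[:i]."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0  # length of the current longest prefix/proper-suffix match
    for i in range(2, m + 1):
        ch = pattern[i - 1]  # the i-th character of the pattern
        # Same loop as in matching: follow already-computed links on a mismatch.
        while k > 0 and pattern[k] != ch:
            k = fail[k]
        if pattern[k] == ch:
            k += 1
        fail[i] = k
    return fail


print(failure_links("aaababa"))  # [0, 0, 1, 2, 0, 1, 0, 1]
print(failure_links("ababaca"))  # [0, 0, 0, 1, 2, 3, 0, 1]
```

The first output agrees with the "if fail, go to" row given for P = aaababa.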

  7. Finite Automaton
     • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca.
     • [Figure: automaton for P with a starting state, an accepting state, forward edges labeled a b a b a c a, and backward edges labeled a and b.]
     • After matching 'a', read 'a'? Longest prefix of P that is a proper suffix of 'aa' = 'a'.
     • After matching 'aba', read 'a'? Longest prefix of P that is a proper suffix of 'abaa' = 'a'.

  8. Finite Automaton
     • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca.
     • After matching 'a', read 'c'? Longest prefix of P that is a proper suffix of 'ac' = ε.
     • After matching 'ab', read 'b'? Longest prefix of P that is a proper suffix of 'abb' = ε.
     • After matching 'ab', read 'c'? Longest prefix of P that is a proper suffix of 'abc' = ε.
     • After matching 'aba', read 'a'? Longest prefix of P that is a proper suffix of 'abaa' = 'a'.

  9. Finite Automaton
     • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca.
     • After matching 'aba', read 'c'? Longest prefix of P that is a proper suffix of 'abac' = ε.
     • Example run on T = bacbababababacab.
     Finite Automaton
     • Finite automaton:
       • Q: finite set of states
       • q₀ ∈ Q: start state
       • A ⊆ Q: set of accepting states
       • Σ: finite input alphabet
       • δ: transition function
     • Matching time: O(n).
     • Preprocessing time: O(m³|Σ|). Can be done in O(m|Σ|) using KMP.
     • Total time: O(n + m|Σ|).
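The automaton can be sketched directly from its definition. The code below is the straightforward O(m³|Σ|) construction, not the faster KMP-based one. The names `build_automaton` and `automaton_match` are made up for this example, and the text is assumed to contain only characters from the given alphabet.

```python
def build_automaton(pattern, alphabet):
    """Return delta, where delta[q][c] is the length of the longest prefix
    of pattern that is a suffix of pattern[:q] + c."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            seen = pattern[:q] + c
            k = min(m, q + 1)
            # Shrink k until pattern[:k] is a suffix of what has been seen.
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            delta[q][c] = k
    return delta


def automaton_match(text, pattern, alphabet):
    """Report 1-indexed occurrences using one transition per text character."""
    delta = build_automaton(pattern, alphabet)
    m = len(pattern)
    q = 0  # start state
    hits = []
    for j, c in enumerate(text, start=1):
        q = delta[q][c]
        if q == m:  # accepting state
            hits.append(j - m + 1)
    return hits


delta = build_automaton("ababaca", "abc")
print(delta[1]["a"], delta[3]["a"], delta[3]["c"])          # 1 1 0
print(automaton_match("bacbababababacab", "ababaca", "abc"))  # [9]
```

The three printed transitions correspond to the 'aa', 'abaa' and 'abac' examples from the slides. Once the table is built, matching does constant work per character of T, which gives the O(n) matching time.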
