
String Matching, Inge Li Gørtz (CLRS 32)



  1. String Matching Inge Li Gørtz CLRS 32

  2. String Matching • String matching problem: • string T (text) and string P (pattern) over an alphabet Σ. • |T| = n, |P| = m. • Report all starting positions of occurrences of P in T. • Example: P = a b a b a c a, T = b a c b a b a b a b a b a c a b

  3. Strings • ε: empty string • prefix/suffix: v = xy: • x is a prefix of v; if y ≠ ε, x is a proper prefix of v. • y is a suffix of v; if x ≠ ε, y is a proper suffix of v. • Example: S = aabca • The non-empty suffixes of S are: aabca, abca, bca, ca and a. • The strings abca, bca, ca and a are the non-empty proper suffixes of S.
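
As a quick illustration of these definitions, here is a minimal Python sketch. The helper names `prefixes`, `suffixes` and `proper_suffixes` are not from the slides; they are made up for this example. Following the definition v = xy, the empty string ε counts as a suffix (take x = v, y = ε), so it shows up in the output even though the example for S lists only the non-empty ones.

```python
def prefixes(v):
    """All prefixes of v, from the empty string up to v itself."""
    return [v[:i] for i in range(len(v) + 1)]


def suffixes(v):
    """All suffixes of v, from v itself down to the empty string."""
    return [v[i:] for i in range(len(v) + 1)]


def proper_suffixes(v):
    """Suffixes of v other than v itself (the empty string is included)."""
    return suffixes(v)[1:]


S = "aabca"
print(suffixes(S))         # ['aabca', 'abca', 'bca', 'ca', 'a', '']
print(proper_suffixes(S))  # ['abca', 'bca', 'ca', 'a', '']
```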

  4. String Matching • Finite automaton • Knuth-Morris-Pratt (KMP)

  5. A naive string matching algorithm • T = b a c b a b a b a b a b a c a b, P = a b a b a c a • [Figure: P aligned under T at each of the 10 possible starting positions in turn]
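
The naive algorithm can be sketched in a few lines of Python. This is an illustrative version, not code from the slides: the function name `naive_match` is an assumption, and positions are reported 1-indexed to agree with the indexing used elsewhere in the slides. It tries every alignment of P against T and compares character by character, so the worst case is O(nm) comparisons.

```python
def naive_match(text, pattern):
    """Return the 1-indexed starting positions of pattern in text."""
    n, m = len(text), len(pattern)
    occurrences = []
    for s in range(n - m + 1):          # s = 0-indexed alignment of the pattern
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1                      # extend the match one character at a time
        if j == m:                      # all m characters matched
            occurrences.append(s + 1)   # convert to a 1-indexed position
    return occurrences


print(naive_match("bacbababababacab", "ababaca"))  # [9]
```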

  6. Improving the naive algorithm • P = a a a b a b a • T = a a a b a a a b a b a b a c a b b • [Figure: P aligned under the start of T]

  7. Improving the naive algorithm • P = a a a b a b a • T = a a a b a a a b a b a b a c a b b • [Figure: successive alignments of P under T]

  8. Exploiting what we know from pattern • P = a b a b a c a • T = a b a b a a: What character in the pattern should we check next? • T = a b a b a b: What character in the pattern should we check next? • T = a b a b a c: What character in the pattern should we check next?

  9. Exploiting what we know from pattern • P = a b a b a c a • T = a b a b a a x: What character in the pattern should we compare x to? 2nd • T = a b a b a b x: What character in the pattern should we compare x to? 5th • T = a b a b a c x: What character in the pattern should we compare x to? 7th
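
The three answers above can be checked by brute force. The sketch below is an assumption-laden illustration (the names `longest_prefix_suffix` and `next_pattern_index` are hypothetical): it finds the longest prefix of P that is a suffix of the text read so far, and the next pattern character to compare is the one just after that prefix.

```python
def longest_prefix_suffix(pattern, seen):
    """Length of the longest prefix of pattern that is a suffix of seen."""
    for k in range(min(len(pattern), len(seen)), -1, -1):
        # pattern[:0] is the empty string, which is a suffix of everything,
        # so the loop always returns by k = 0 at the latest.
        if seen.endswith(pattern[:k]):
            return k


def next_pattern_index(pattern, seen):
    """1-indexed position in pattern to compare the next text character to."""
    return longest_prefix_suffix(pattern, seen) + 1


P = "ababaca"
print(next_pattern_index(P, "ababaa"))  # 2
print(next_pattern_index(P, "ababab"))  # 5
print(next_pattern_index(P, "ababac"))  # 7
```

If the whole pattern has just been matched, the returned index is m + 1, which is not a position in P; a full matcher would report the occurrence at that point.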

  10. Finite Automaton

  11. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • [Figure: automaton for P with the starting state and the accepting state marked]

  12. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a a • P: a b a b a c a • Transition target: longest prefix of P that is a proper suffix of ‘abaa’.

  13. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • T = b a c b a b a b a b a b a c a b • [Figure: the automaton run on T]

  14. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a a • P: a b a b a c a • Read ‘a’? Longest prefix of P that is a proper suffix of ‘aa’ = ‘a’.

  15. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ac’ = ε.

  16. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b b • P: a b a b a c a • Read ‘b’? Longest prefix of P that is a proper suffix of ‘abb’ = ε.

  17. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘abc’ = ε.

  18. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a a • P: a b a b a c a • Read ‘a’? Longest prefix of P that is a proper suffix of ‘abaa’ = ‘a’.

  19. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘abac’ = ε.

  20. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b b • P: a b a b a c a • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababb’ = ε.

  21. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b c • P: a b a b a c a • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ababc’ = ε.

  22. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b a a • P: a b a b a c a • Read ‘a’? Longest prefix of P that is a proper suffix of ‘ababaa’ = ‘a’.

  23. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Matched until now: a b a b a b • P: a b a b a c a • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababab’ = ‘abab’.

  24. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababacb’ = ε.

  25. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ababacc’ = ε.

  26. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘a’? Longest prefix of P that is a proper suffix of ‘ababacaa’ = ‘a’.

  27. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘b’? Longest prefix of P that is a proper suffix of ‘ababacab’ = ‘ab’.

  28. Finite Automaton • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • Read ‘c’? Longest prefix of P that is a proper suffix of ‘ababacac’ = ε.

  29. Finite Automaton • Finite automaton: • Q: finite set of states • q₀ ∈ Q: start state • A ⊆ Q: set of accepting states • Σ: finite input alphabet • δ: transition function • Matching time: O(n) • Preprocessing time: O(m³|Σ|). Can be done in O(m|Σ|). • Total time: O(n + m|Σ|)
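
A minimal Python sketch of the automaton approach follows, using the straightforward O(m³|Σ|) preprocessing: for each of the m + 1 states and each character, try candidate prefix lengths from longest to shortest. The names `build_transitions` and `automaton_match` are assumptions made for this example, states are numbered 0 to m by the number of pattern characters matched, and reported positions are 1-indexed. The O(m|Σ|) construction mentioned above is not shown.

```python
def build_transitions(pattern, alphabet):
    """delta[q][ch] = length of the longest prefix of pattern that is a
    suffix of pattern[:q] + ch. Straightforward O(m^3 * |alphabet|) version."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:q] + ch
            k = min(m, q + 1)                 # the new state is at most q + 1
            while not seen.endswith(pattern[:k]):
                k -= 1                        # stops at k = 0 (empty prefix) at the latest
            row[ch] = k
        delta.append(row)
    return delta


def automaton_match(text, pattern, alphabet):
    """Return 1-indexed starting positions of pattern in text in O(n) steps."""
    delta = build_transitions(pattern, alphabet)
    m = len(pattern)
    q = 0                                     # start state
    occurrences = []
    for i, ch in enumerate(text):
        # A character outside the alphabet cannot extend any prefix of the pattern.
        q = delta[q].get(ch, 0)
        if q == m:                            # accepting state reached
            occurrences.append(i - m + 2)     # i is the 0-indexed end of the match
    return occurrences


print(automaton_match("bacbababababacab", "ababaca", "abc"))  # [9]
```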

  30. KMP

  31. KMP • Finite automaton: alphabet Σ = {a,b,c}. P = ababaca. • KMP: Can be seen as finite automaton with failure links. • [Figure: states 1 to 6 of the automaton for P, each with a failure link]

  32. KMP • KMP: Can be seen as finite automaton with failure links: • longest prefix of P that is a proper suffix of what we have matched until now (ignore the mismatched character). • Example: longest prefix of P that is a proper suffix of ‘aba’.

  33. KMP matching • KMP: Can be seen as finite automaton with failure links: • longest prefix of P that is a proper suffix of what we have matched until now. • T = b a c b a b a b a b a b a c a b

  34. KMP • KMP: Can be seen as finite automaton with failure links: • longest prefix of P that is a proper suffix of what we have matched until now. • can follow several failure links when matching one character: T = a b a b a a
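
KMP matching with failure links can be sketched as follows. This is an illustrative version under stated assumptions: the function name `kmp_match` is made up, the pattern is assumed non-empty, and the failure links are passed in as a list `pi` where `pi[q]` is the link of state q and `pi[0]` is unused padding. The values in `PI_ABABACA` are the failure links for P = ababaca.

```python
def kmp_match(text, pattern, pi):
    """Return 1-indexed starting positions of pattern in text.

    pi[q] is the failure link of state q (q = 1..m); pi[0] is unused padding.
    Assumes a non-empty pattern.
    """
    m = len(pattern)
    q = 0                                   # state = number of characters matched
    occurrences = []
    for i, ch in enumerate(text):
        # Follow failure links until the forward edge matches or we are in the start state.
        while q > 0 and pattern[q] != ch:
            q = pi[q]
        if pattern[q] == ch:
            q += 1                          # follow the forward edge
        if q == m:
            occurrences.append(i - m + 2)   # i is the 0-indexed end of the match
            q = pi[q]                       # continue, allowing overlapping matches
    return occurrences


PI_ABABACA = [0, 0, 0, 1, 2, 3, 0, 1]       # index 0 is padding
print(kmp_match("bacbababababacab", "ababaca", PI_ABABACA))  # [9]
```

On T = a b a b a a, the last character is read in state 5 and the inner loop follows the links 5 → 3 → 1 → 0 before the forward edge on ‘a’ matches, which is the case of several failure links for one character.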

  35. KMP Analysis • Lemma. The running time of KMP matching is O(n). • Each time we follow a forward edge we read a new character of T. • #backward edges followed ≤ #forward edges followed ≤ n. • If in the start state and the character read in T does not match the forward edge, we stay there. • Total time = #non-matched characters in start state + #forward edges followed + #backward edges followed ≤ 2n.
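
The counting argument can be checked empirically with an instrumented matcher. The sketch below is an assumption made for illustration (the name `kmp_edge_counts` and the `pi` list layout, with `pi[0]` as padding, are not from the slides). It counts forward edges, backward edges (failure links, including the one taken after a full match) and mismatches in the start state. Every text character is consumed either by a forward edge or by a mismatch in the start state, so those two counts sum to n, and the number of backward edges never exceeds the number of forward edges.

```python
def kmp_edge_counts(text, pattern, pi):
    """Return (forward, backward, stayed) edge counts for one KMP run.

    pi[q] is the failure link of state q; pi[0] is unused padding.
    Assumes a non-empty pattern.
    """
    m = len(pattern)
    q = 0
    forward = backward = stayed = 0
    for ch in text:
        while q > 0 and pattern[q] != ch:
            q = pi[q]
            backward += 1        # failure link followed
        if pattern[q] == ch:
            q += 1
            forward += 1         # forward edge: a new text character is consumed
        else:
            stayed += 1          # mismatch in the start state
        if q == m:
            q = pi[q]
            backward += 1        # failure link out of the accepting state
    return forward, backward, stayed


T = "bacbababababacab"
f, b, s = kmp_edge_counts(T, "ababaca", [0, 0, 0, 1, 2, 3, 0, 1])
assert f + s == len(T)           # each character is consumed exactly once
assert b <= f                    # each backward edge undoes at least one forward edge
assert f + b + s <= 2 * len(T)
```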

  36. Computation of failure links • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • Computing failure links: Use KMP matching algorithm. • Example: the longest prefix of P that is a proper suffix of ‘abab’ can be found by using KMP to match ‘bab’.

  37. Computation of failure links • Computing failure links: As KMP matching algorithm (only need failure links that are already computed). • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • P = a b a b a c a (positions 1 to 7)

  38. Computation of failure links • Computing failure links: As KMP matching algorithm (only need failure links that are already computed). • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • P = a b c a a b c (positions 1 to 7)

  39. Computation of failure links • Computing failure links: As KMP matching algorithm (only need failure links that are already computed). • Failure link: longest prefix of P that is a proper suffix of what we have matched until now. • P = a b c a a b c (positions 1 to 7)
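
The computation of failure links can be sketched in Python as below. The function name `failure_links` and the list layout are assumptions for this example: `pi[q]` holds the failure link of state q for q = 1..m, and `pi[0]` is padding so that indices line up with 1-indexed positions. The loop is KMP matching of the pattern against itself, and it only ever reads links that have already been computed.

```python
def failure_links(pattern):
    """pi[q] = length of the longest prefix of pattern that is a proper
    suffix of pattern[:q], for q = 1..m. pi[0] is unused padding."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0                                    # failure link of the previous state
    for q in range(2, m + 1):
        ch = pattern[q - 1]                  # character that leads into state q
        # Shorten the candidate prefix via already computed links until it can be extended.
        while k > 0 and pattern[k] != ch:
            k = pi[k]
        if pattern[k] == ch:
            k += 1
        pi[q] = k
    return pi


print(failure_links("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
print(failure_links("abcaabc")[1:])  # [0, 0, 0, 1, 1, 2, 3]
```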

  40. KMP • Computing π: As KMP matching algorithm (only need π values that are already computed). • Running time: O(n + m). • Lemma. Total number of comparisons of characters in KMP is at most 2n. • Corollary. Total number of comparisons of characters in the preprocessing of KMP is at most 2m.

  41. KMP: the π array • π array: A representation of the failure links. • Takes up less space than pointers. • P = a b a b a c a • i: 1 2 3 4 5 6 7 • π[i]: 0 0 1 2 3 0 1
