Pattern Matching

Outline and Reading
Strings (§11.1)
Pattern matching algorithms
- Brute-force algorithm (§11.2.1)
- Boyer-Moore algorithm (§11.2.2)
- Knuth-Morris-Pratt algorithm (§11.2.3)

Strings
A string is a sequence of characters.
Examples of strings:
- C++ program
- HTML document
- DNA sequence
- Digitized image
An alphabet Σ is the set of possible characters for a family of strings.
Examples of alphabets:
- ASCII (used by C and C++)
- Unicode (used by Java)
- {0, 1}
- {A, C, G, T}
Let P be a string of size m:
- A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j
- A prefix of P is a substring of the type P[0..i]
- A suffix of P is a substring of the type P[i..m − 1]
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
Applications:
- Text editors
- Search engines
- Biological research
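
The inclusive range notation P[i..j] can be tried out directly. The following is a minimal sketch in Python; the helper name substring is an assumption made for illustration and does not come from the slides.

```python
# Illustrative only: Python slices exclude the end index, so the
# inclusive range P[i..j] from the slide is written p[i:j + 1].
def substring(p, i, j):
    """Return P[i..j], the characters with ranks i through j inclusive."""
    return p[i:j + 1]

P = "abacab"
m = len(P)

prefix = substring(P, 0, 2)      # P[0..2], a prefix of P
suffix = substring(P, 3, m - 1)  # P[3..m-1], a suffix of P
print(prefix, suffix)
```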

Brute-Force Algorithm
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
- a match is found, or
- all placements of the pattern have been tried
Brute-force pattern matching runs in time O(nm).
Example of worst case:
- T = aaa … ah
- P = aaah
- may occur in images and DNA sequences
- unlikely in English text

Algorithm BruteForceMatch(T, P)
    Input text T of size n and pattern P of size m
    Output starting index of a substring of T equal to P, or −1 if no such substring exists
    for i ← 0 to n − m
        { test shift i of the pattern }
        j ← 0
        while j < m ∧ T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i { match at i }
        { otherwise mismatch: try the next shift }
    return −1 { no match anywhere }
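
BruteForceMatch translates almost line for line into Python. This is a sketch under the slide's conventions (0-based indices, −1 for no match); the function name brute_force_match and the sample strings are chosen here for illustration.

```python
def brute_force_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # try every shift i of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # one more character matched
        if j == m:
            return i                    # match at i
        # otherwise there was a mismatch: move on to the next shift
    return -1                           # no match anywhere

print(brute_force_match("abacaabadcabacabaabb", "abacab"))
```

On a text of the worst-case shape, such as a run of a's ending in h searched for aaah, every shift compares almost the whole pattern before failing, which is where the O(nm) bound comes from.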

Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics.
Looking-glass heuristic: compare P with a subsequence of T moving backwards.
Character-jump heuristic: when a mismatch occurs at T[i] = c
- If P contains c, shift P to align the last occurrence of c in P with T[i]
- Else, shift P to align P[0] with T[i + 1]
Example: T = "a pattern matching algorithm", P = "rithm"
[Figure: successive placements of P under T, with the character comparisons numbered 1 to 11.]

Last-Occurrence Function
Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
- the largest index i such that P[i] = c, or
- −1 if no such index exists
Example:
- Σ = {a, b, c, d}
- P = abacab

    c      a   b   c   d
    L(c)   4   5   3   −1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
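
The table above can be reproduced with a short sketch. It assumes a dictionary keyed by character in place of the array indexed by character codes; the name last_occurrence is illustrative.

```python
def last_occurrence(pattern, alphabet):
    """Map each character c to the largest index i with pattern[i] == c, or -1."""
    table = {c: -1 for c in alphabet}       # O(s): every character starts at -1
    for index, c in enumerate(pattern):     # O(m): later indices overwrite earlier ones
        table[c] = index
    return table

print(last_occurrence("abacab", "abcd"))
```

Scanning the pattern left to right and overwriting is what keeps the cost at O(m + s): no character is searched for more than once.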

The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurrenceFunction(P, Σ)
    i ← m − 1
    j ← m − 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i { match at i }
            else
                i ← i − 1
                j ← j − 1
        else
            { character-jump }
            l ← L[T[i]]
            i ← i + m − min(j, 1 + l)
            j ← m − 1
    until i > n − 1
    return −1 { no match }

Case 1: j ≤ 1 + l. The last occurrence of T[i] in P is not to the left of the mismatch position, so i advances by m − j.
Case 2: 1 + l ≤ j. The last occurrence of T[i] in P is aligned with T[i], so i advances by m − (1 + l).
[Figure: for each case, the pattern before and after the shift, with the distances j, l, m − j, and m − (1 + l) marked.]
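
A minimal Python sketch of BoyerMooreMatch follows. It is the simplified variant on this slide (looking-glass plus character-jump only), and it makes two assumptions not in the pseudocode: the last-occurrence table is built as a dictionary from the pattern alone, so Σ is not passed in, and the repeat/until loop becomes a while loop so that a pattern longer than the text is rejected without indexing past the end.

```python
def boyer_moore_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Last-occurrence table; characters absent from the pattern default to -1.
    last = {c: k for k, c in enumerate(pattern)}
    i = j = m - 1                       # start at the right end of the pattern
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i                # match at i
            i -= 1                      # looking-glass: keep comparing leftwards
            j -= 1
        else:
            last_index = last.get(text[i], -1)
            i += m - min(j, 1 + last_index)   # character-jump
            j = m - 1
    return -1                           # no match

print(boyer_moore_match("a pattern matching algorithm", "rithm"))
```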

Example
T = abacaabadcabacabaabb, P = abacab
[Figure: successive placements of P under T, with the character comparisons numbered 1 to 13; the final placement matches at index 10.]

Analysis
Boyer-Moore's algorithm runs in time O(nm + s).
Example of worst case:
- T = aaa … a
- P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text.
Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text.
[Figure: T = aaaaaaaaa and P = baaaaa; each of the four placements of P uses six comparisons, numbered 1 to 24.]
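
The count in the figure can be checked by hand. Assuming the figure's instance (a text of n = 9 a's and the pattern baaaaa with m = 6), every placement matches m − 1 characters right to left and fails on the leading b, and each mismatch shifts the pattern by only one position:

```latex
\[
  \text{comparisons} = (n - m + 1) \cdot m = (9 - 6 + 1) \cdot 6 = 24
\]
```

This product is the nm term of the O(nm + s) bound; the s term accounts for building the last-occurrence function.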

The KMP Algorithm - Motivation
The Knuth-Morris-Pratt algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
[Figure: the pattern abaaba is compared with the text . . a b a a b x . . . and mismatches at index j. After the shift, the start of the pattern is labeled "No need to repeat these comparisons" and the next position is labeled "Resume comparing here".]

KMP Failure Function
The Knuth-Morris-Pratt algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].

    j      0   1   2   3   4   5
    P[j]   a   b   a   a   b   a
    F(j)   0   0   1   1   2   3

The Knuth-Morris-Pratt algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1).
[Figure: the pattern abaaba under the text . . a b a a b x . . . with the mismatch at index j; after the shift, comparison resumes at pattern index F(j − 1).]
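
The definition of F(j) can be checked by brute force. The sketch below applies the definition literally and is deliberately inefficient; it is not the O(m) construction, and the name failure_by_definition is an assumption for illustration.

```python
def failure_by_definition(pattern):
    """F(j) = size of the largest prefix of P[0..j] that is also a suffix of P[1..j]."""
    table = []
    for j in range(len(pattern)):
        best = 0
        # P[1..j] has j characters, so candidate sizes run from 1 to j.
        for size in range(1, j + 1):
            prefix = pattern[:size]                  # first `size` characters of P[0..j]
            suffix = pattern[j - size + 1:j + 1]     # last `size` characters of P[1..j]
            if prefix == suffix:
                best = size
        table.append(best)
    return table

print(failure_by_definition("abaaba"))
```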

The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time.
At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2n iterations of the while-loop.
Thus, KMP's algorithm runs in optimal time O(m + n).

Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m − 1
                return i − j { match }
            else
                i ← i + 1
                j ← j + 1
        else
            if j > 0
                j ← F[j − 1]
            else
                i ← i + 1
    return −1 { no match }
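
KMPMatch can be sketched in Python as follows. To keep the example focused on the matching loop, it assumes the caller supplies the failure table as a list; the parameter name failure and the function name kmp_match are illustrative.

```python
def kmp_match(text, pattern, failure):
    """Return the first index where pattern occurs in text, or -1.

    `failure` is assumed to be the precomputed failure table F for `pattern`.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j            # match
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]          # shift the pattern, keep i in place
        else:
            i += 1                      # nothing matched at this position
    return -1                           # no match

# Failure table for abacab: F = [0, 0, 1, 0, 1, 2]
print(kmp_match("abacaabaccabacabaabb", "abacab", [0, 0, 1, 0, 1, 2]))
```

Note that i never decreases: the text is read strictly left to right, which is what the 2n bound on loop iterations relies on.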

Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time.
The construction is similar to the KMP algorithm itself.
At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 chars }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use failure function to shift P }
            j ← F[j − 1]
        else
            F[i] ← 0 { no match }
            i ← i + 1
    return F
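
The construction above carries over to Python directly. This is a sketch that returns the table as a list; the name failure_function mirrors the pseudocode, and the empty-pattern behaviour (an empty list) is an assumption the slides do not address.

```python
def failure_function(pattern):
    """Compute the KMP failure table for pattern in O(m) time."""
    m = len(pattern)
    failure = [0] * m                   # F[0] = 0; other entries are filled in below
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            failure[i] = j + 1          # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = failure[j - 1]          # use the failure function to shift P
        else:
            failure[i] = 0              # no match
            i += 1
    return failure

print(failure_function("abacab"))
```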

Example
T = abacaabaccabacabaabb, P = abacab

    j      0   1   2   3   4   5
    P[j]   a   b   a   c   a   b
    F(j)   0   0   1   0   1   2

[Figure: successive placements of P under T, with the character comparisons numbered 1 to 19.]