Chapter 9: Text Processing
Outline and Reading
Strings and Pattern Matching (§9.1)
Tries (§9.2)
Text Compression (§9.3)
Optional: Text Similarity (§9.4). No slides.
Texts & Pattern Matching
(Title figure: the pattern a b a c a b being compared against a text beginning a b a c a a b.)
Strings
A string is a sequence of characters
Examples of strings: Java program, HTML document, DNA sequence, digitized image
An alphabet Σ is the set of possible characters for a family of strings
Examples of alphabets: ASCII, Unicode, {0, 1}, {A, C, G, T}
Let P be a string of size m
  A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j
  A prefix of P is a substring of the type P[0 .. i]
  A suffix of P is a substring of the type P[i .. m − 1]
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
Applications: text editors, search engines, biological research
Brute-Force Algorithm
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either a match is found, or all placements of the pattern have been tried
Brute-force pattern matching runs in time O(nm)
Example of worst case: T = aaa … ah, P = aaah; this may occur in images and DNA sequences, but is unlikely in English text

Algorithm BruteForceMatch(T, P)
  Input text T of size n and pattern P of size m
  Output starting index of a substring of T equal to P, or −1 if no such substring exists
  for i ← 0 to n − m
    { test shift i of the pattern }
    j ← 0
    while j < m ∧ T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i { match at i }
    { otherwise the while-loop stopped at a mismatch; try the next shift }
  return −1 { no match anywhere }
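As a concrete illustration, here is one possible Java rendering of the pseudocode above (a sketch; the method name bruteForceMatch is illustrative, not from any library):

public static int bruteForceMatch(String T, String P) {
    int n = T.length(), m = P.length();
    for (int i = 0; i <= n - m; i++) {          // test shift i of the pattern
        int j = 0;
        while (j < m && T.charAt(i + j) == P.charAt(j))
            j++;
        if (j == m)
            return i;                           // match at i
    }
    return -1;                                  // no match anywhere
}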
Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics
Looking-glass heuristic: compare P with a subsequence of T moving backwards
Character-jump heuristic: when a mismatch occurs at T[i] = c
  If P contains c, shift P to align the last occurrence of c in P with T[i]
  Else, shift P to align P[0] with T[i + 1]
(Example figure: the pattern r i t h m being shifted along the text "a pattern matching algorithm"; the match is found after 11 character comparisons.)
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
  L ← lastOccurrenceFunction(P, Σ)
  i ← m − 1
  j ← m − 1
  repeat
    if T[i] = P[j]
      if j = 0
        return i { match at i }
      else
        i ← i − 1
        j ← j − 1
    else
      { character jump }
      l ← L[T[i]]
      i ← i + m − min(j, 1 + l)
      j ← m − 1
  until i > n − 1
  return −1 { no match }
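A possible Java sketch of the same algorithm, with the last-occurrence function built as a plain array indexed by character code (the method name and the assumption that the alphabet fits in a Java char are illustrative):

public static int boyerMooreMatch(String T, String P) {
    int n = T.length(), m = P.length();
    if (m == 0)
        return 0;                               // degenerate case: empty pattern
    // last-occurrence function: L[c] = index of the last occurrence of c in P, or -1
    int[] L = new int[Character.MAX_VALUE + 1];
    java.util.Arrays.fill(L, -1);
    for (int k = 0; k < m; k++)
        L[P.charAt(k)] = k;
    int i = m - 1, j = m - 1;
    while (i <= n - 1) {
        if (T.charAt(i) == P.charAt(j)) {
            if (j == 0)
                return i;                       // match at i
            i--;
            j--;
        } else {                                // character jump
            int l = L[T.charAt(i)];
            i += m - Math.min(j, 1 + l);
            j = m - 1;
        }
    }
    return -1;                                  // no match
}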
Example
(Figure: Boyer-Moore matching of the pattern a b a c a b against the text a b a c a a b a d c a b a c a b a a b b; the match is found after 13 character comparisons.)
Analysis
Boyer-Moore's algorithm runs in time O(nm + s), where s is the size of the alphabet
Example of worst case: T = aaa … a, P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text
Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text
(Figure: the worst-case behavior, in which every shift of a pattern of the form baa…a against a text of all a's is compared in full before the mismatch at its first character.)
The KMP Algorithm - Motivation
Knuth-Morris-Pratt's algorithm compares the pattern to the text left to right, but shifts the pattern more intelligently than the brute-force algorithm
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0 .. j] that is a suffix of P[1 .. j]
(Figure: after a mismatch of the pattern a b a a b a against a text beginning a b a a b x, the pattern is shifted so that its prefix a b lines up with the already-matched suffix a b; there is no need to repeat those comparisons, and comparing resumes at the mismatched text character.)
KMP Failure Function
Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
The failure function F(j) is defined as the size of the largest prefix of P[0 .. j] that is also a suffix of P[1 .. j]
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1)
Example, for P = abaaba:
  j      0  1  2  3  4  5
  P[j]   a  b  a  a  b  a
  F(j)   0  0  1  1  2  3
The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time
At each iteration of the while-loop, either
  i increases by one, or
  the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2n iterations of the while-loop
Thus, KMP's algorithm runs in optimal time O(m + n)

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m − 1
        return i − j { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j − 1]
      else
        i ← i + 1
  return −1 { no match }
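In Java, the search phase might be written as follows (a sketch; failureFunction is the routine described on the next slide, with a matching sketch given there):

public static int kmpMatch(String T, String P) {
    int n = T.length(), m = P.length();
    if (m == 0)
        return 0;                               // degenerate case: empty pattern
    int[] F = failureFunction(P);               // computed in O(m) time
    int i = 0, j = 0;
    while (i < n) {
        if (T.charAt(i) == P.charAt(j)) {
            if (j == m - 1)
                return i - j;                   // match
            i++;
            j++;
        } else if (j > 0) {
            j = F[j - 1];                       // reuse the already-matched prefix
        } else {
            i++;
        }
    }
    return -1;                                  // no match
}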
Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time
The construction is similar to the KMP algorithm itself
At each iteration of the while-loop, either
  i increases by one, or
  the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2m iterations of the while-loop

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 chars }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      { use failure function to shift P }
      j ← F[j − 1]
    else
      F[i] ← 0 { no match }
      i ← i + 1
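A matching Java sketch of the failure-function computation (Java arrays are zero-initialized, so F[0] = 0 holds without an explicit assignment):

public static int[] failureFunction(String P) {
    int m = P.length();
    int[] F = new int[m];                       // F[0] = 0 by default
    int i = 1, j = 0;
    while (i < m) {
        if (P.charAt(i) == P.charAt(j)) {
            F[i] = j + 1;                       // we have matched j + 1 chars
            i++;
            j++;
        } else if (j > 0) {
            j = F[j - 1];                       // use the failure function to shift P
        } else {
            F[i] = 0;                           // no match
            i++;
        }
    }
    return F;
}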
Example
(Figure: KMP matching of the pattern a b a c a b against the text a b a c a a b a c c a b a c a b a a b b; the match is found after 19 character comparisons.)
Failure function of the pattern:
  j      0  1  2  3  4  5
  P[j]   a  b  a  c  a  b
  F(j)   0  0  1  0  1  2
Tries
(Title figure: a compressed trie of the suffixes of the string "minimize", with edge labels such as e, i, mi, nimize, ze, mize.)
Preprocessing Strings
Preprocessing the pattern speeds up pattern matching queries
After preprocessing the pattern, KMP's algorithm performs pattern matching in time proportional to the text size
If the text is large, immutable and searched for often (e.g., works by Shakespeare), we may want to preprocess the text instead of the pattern
A trie is a compact data structure for representing a set of strings, such as all the words in a text
A trie supports pattern matching queries in time proportional to the pattern size
Standard Trie (1)
The standard trie for a set of strings S is an ordered tree such that:
  Each node but the root is labeled with a character
  The children of a node are alphabetically ordered
  The paths from the root to the external nodes yield the strings of S
Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }
(Figure: the trie, whose root has children b and s and whose root-to-leaf paths spell the words of S.)
Standard Trie (2)
A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
  n = total size of the strings in S
  m = size of the string parameter of the operation
  d = size of the alphabet
(Figure: the same trie for S = { bear, bell, bid, bull, buy, sell, stock, stop }.)
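A minimal Java sketch of a standard trie supporting insertion and exact-word search; the class and field names are illustrative, an end-of-word flag is used instead of the external-node convention of the slide, and the TreeMap keeps children alphabetically ordered (each step then costs O(log d) rather than the O(d) child scan assumed in the bound above):

import java.util.TreeMap;

class StandardTrie {
    private static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();  // alphabetically ordered
        boolean isWord = false;                 // marks the end of an inserted string
    }

    private final Node root = new Node();

    // Insert a word character by character, creating nodes as needed.
    public void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Node());
        node.isWord = true;
    }

    // Search for an exact word by following the corresponding root-to-leaf path.
    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null)
                return false;
        }
        return node.isWord;
    }
}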
Word Matching with a Trie
We insert the words of the text into a trie
Each leaf stores the occurrences of the associated word in the text
(Figure: the text "see a bear? sell stock! see a bull? buy stock! bid stock! bid stock! hear the bell? stop!", with each character labeled by its index, and a trie of its words in which each leaf lists the starting indices of that word's occurrences; for example, the leaf for "stock" stores 17, 40, 51, 62.)
Compressed Trie
A compressed trie has internal nodes of degree at least two
It is obtained from a standard trie by compressing chains of "redundant" nodes
(Figure: the standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop } and its compressed version, in which single-child chains are merged into edges labeled with substrings such as ar, ell, id, to, ck, p.)