string matching ii

String Matching II. Algorithm: Design & Analysis [19]



  1. String Matching II. Algorithm: Design & Analysis [19]

  2. In the last class…
     • Simple String Matching
     • KMP Flowchart Construction
     • Jump at Fail
     • KMP Scan

  3. String Matching II
     • Boyer-Moore's heuristics
     • Skipping unnecessary comparisons
     • Combining failed-match knowledge into the jump
     • Horspool Algorithm
     • Boyer-Moore Algorithm

  4. Skipping over Characters in Text
     • A longer pattern contains more information about impossible positions in the text.
     • For example, we may know that the pattern doesn't contain a specific character.
     • Examining the characters of the text one by one, moving forward, does not make the best use of this information.

  5. An Example
     P = "must", T = "If you wish to understand others you must …"
     • The characters in P are checked in reverse order.
     • Some characters of T are just passed by, without being compared.
     • The copy of P begins at t_38.
     • Matching is achieved in 18 comparisons.
     [Figure: successive alignments of "must" under T, each labeled as a match or a mismatch.]

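  6. Distance of Jumping Forward
     • With the knowledge of P, the distance the pointer into T jumps forward is determined by the text character itself, independent of its location in T.
     • Suppose a mismatch p_s ≠ t_j occurs with t_j = 'A', and the rightmost 'A' in P is at location p_k. Then charJump['A'] = m - k.
     [Figure: P = p_1 … A … A … p_m under T = t_1 … t_j … t_n, before and after the jump; the rightmost 'A' of P lines up with t_j, and the next scan starts at the new j, m - k positions beyond the current j.]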

  7. Computing the Jump: Algorithm
     Input: Pattern string P; m, the length of P; alphabet size alpha = |Σ|
     Output: Array charJump, indexed 0, …, alpha-1, storing the jumping offsets for each character in the alphabet.

         void computeJumps(char[] P, int m, int alpha, int[] charJump)
             char ch;
             int k;
             for (ch = 0; ch < alpha; ch++)
                 charJump[ch] = m;          // for every character not in P, jump by m
             for (k = 1; k ≤ m; k++)
                 charJump[p_k] = m - k;

     Running time: Θ(|Σ| + m).
     The increasing order of k ensures that, for symbols duplicated in P, the jump is computed according to the rightmost occurrence.
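The charJump table can be sketched in Java as below. This is a minimal sketch under stated assumptions, not the lecture's own code: the class name `CharJumpSketch`, the fixed alphabet size `ALPHA = 256`, and the use of 0-based `String` indexing are choices made for illustration. The `simpleScan` helper is also an assumption. It scans right to left using charJump alone and, on a mismatch, advances j by the larger of `charJump[t_j]` and the number of characters needed to move past the old alignment, a rule the slides do not spell out.

```java
import java.util.Arrays;

final class CharJumpSketch {
    static final int ALPHA = 256; // assumed alphabet size (8-bit characters)

    // charJump[c] = m - k, where p_k is the rightmost occurrence of c in P; m if c is not in P.
    static int[] computeJumps(String p) {
        int m = p.length();
        int[] charJump = new int[ALPHA];
        Arrays.fill(charJump, m);
        for (int k = 1; k <= m; k++) {
            charJump[p.charAt(k - 1)] = m - k; // increasing k: the rightmost occurrence wins
        }
        return charJump;
    }

    // Hypothetical helper: right-to-left scan with charJump only.
    // Returns {0-based start of the match or -1, number of character comparisons}.
    static int[] simpleScan(String p, String t) {
        int m = p.length(), n = t.length();
        int[] charJump = computeJumps(p);
        int comparisons = 0;
        int j = m - 1, k = m - 1; // 0-based positions in T and P
        while (j < n) {
            if (k < 0) {
                return new int[] {j + 1, comparisons};
            }
            comparisons++;
            if (t.charAt(j) == p.charAt(k)) {
                j--;
                k--;
            } else {
                // m - k (0-based k) moves j just past the old alignment, so P never moves backward.
                j += Math.max(charJump[t.charAt(j)], m - k);
                k = m - 1;
            }
        }
        return new int[] {-1, comparisons};
    }

    public static void main(String[] args) {
        int[] r = simpleScan("must", "If you wish to understand others you must");
        System.out.println("start=" + r[0] + " comparisons=" + r[1]);
    }
}
```

Under these assumptions, the "must" example from slide 5 is found at 0-based index 37 (that is, t_38) after 18 comparisons.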

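  8. Scan by CharJump: Horspool's Algorithm

         int horspoolScan(char[] P, char[] T, int m, int[] charjump)
             int j = m-1, k, match = -1;
             while (endText(T, j) == false)                  // up to n loops
                 k = 0;
                 while (k < m && P[m-k-1] == T[j-k])         // up to m loops
                     k++;
                 if (k == m)
                     match = j-m+1;                          // index in T where the copy of P begins
                     break;
                 else
                     j = j + charjump[T[j]];
             return match;

     An example: search 'aaaa……aa' for 'baaaa'. Note that charjump['a'] = 1, so in the worst case the scan takes Θ(mn) comparisons.
     For this scan, charjump must be computed from p_1 … p_{m-1} only. If the last character p_m were included, charjump[p_m] would be 0 and j could not advance after a mismatch.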
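The Horspool scan and its worst case can be sketched in Java as follows. This is an illustrative sketch, not the lecture's code: the class name `HorspoolSketch`, the alphabet size `ALPHA = 256`, and the comparison counter are assumptions. The shift table is built from the first m-1 pattern characters, which is the usual Horspool convention and agrees with the slide's note that charjump['a'] = 1 for 'baaaa'.

```java
import java.util.Arrays;

final class HorspoolSketch {
    static final int ALPHA = 256; // assumed alphabet size (8-bit characters)

    // Shift table over p_1 .. p_{m-1}; the last pattern character is deliberately excluded.
    static int[] shiftTable(String p) {
        int m = p.length();
        int[] shift = new int[ALPHA];
        Arrays.fill(shift, m);
        for (int k = 0; k < m - 1; k++) {
            shift[p.charAt(k)] = m - 1 - k;
        }
        return shift;
    }

    // Returns {0-based start of the first match or -1, number of character comparisons}.
    static int[] scan(String p, String t) {
        int m = p.length(), n = t.length();
        int[] shift = shiftTable(p);
        int comparisons = 0;
        int j = m - 1; // text index aligned with the last pattern character
        while (j < n) {
            int k = 0;
            while (k < m) {
                comparisons++;
                if (p.charAt(m - 1 - k) != t.charAt(j - k)) {
                    break;
                }
                k++;
            }
            if (k == m) {
                return new int[] {j - m + 1, comparisons};
            }
            j += shift[t.charAt(j)]; // shift depends only on the text character under p_m
        }
        return new int[] {-1, comparisons};
    }

    public static void main(String[] args) {
        char[] a = new char[20];
        Arrays.fill(a, 'a');
        int[] r = scan("baaaa", new String(a));
        System.out.println("match=" + r[0] + " comparisons=" + r[1]);
    }
}
```

With a text of 20 'a' characters and the pattern 'baaaa', every one of the 16 alignments costs 5 comparisons and the pattern advances by 1 each time, giving 80 comparisons and no match, which illustrates the Θ(mn) behavior.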

  9. Partially Matched Substring
     P: b a t s a n d c a t s
     T: …… d a t s ……
     • The suffix 'ats' is matched; the mismatch is 'c' in P against 'd' in T, at the current j.
     • Remember that charJump['d'] = 4, so P moves only 1 char. Then 'cat' will be over 'ats', and a mismatch is expected.
     • Remembering the matched suffix, we can get a better jump: P moves 7 chars, so that its earlier 'ats' lines up with the 'ats' in T.

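  10. Basic Idea
      • T, the text, is scanned backward; the mismatch occurs at p_k against t_j, with the suffix of P to the right of p_k matched.
      • Slide[k] is the distance P moves; Matchjump[k] is the distance j moves to start a new cycle of scanning.
      • The difference between Matchjump[k] and Slide[k] is the length of the matched suffix.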

  11. Forward to Match the Suffix
      • The suffix p_{k+1} … p_m is matched, and the mismatch is p_k ≠ t_j.
      • Case: a substring p_{r+1} … p_{r+m-k}, the same as the matched suffix, occurs in P.
      • P slides forward by slide[k], so that this substring lines up with t_{j+1} … ; j jumps by matchJump[k], from the old j to the new j.
      [Figure: P under T = t_1 … t_j t_{j+1} … t_n, before and after the slide.]

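  12. Partial Match for the Suffix
      • The suffix p_{k+1} … p_m is matched, and the mismatch is p_k ≠ t_j.
      • Case: no entire substring that is the same as the matched suffix occurs in P.
      • P slides forward by slide[k], so that a prefix p_1 … p_q of P lines up with the end of the matched text. The prefix may be empty.
      • j jumps by matchJump[k], from the old j to the new j.
      [Figure: P under T = t_1 … t_j t_{j+1} … t_n, before and after the slide.]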

  13. matchjump and slide
      • slide[k]: the distance P slides forward after a mismatch at p_k, with m-k chars matched to the right.
      • matchjump[k]: the distance that j, the pointer into T, jumps; that is, matchjump[k] = slide[k] + m - k.
      [Figure: the frame p_{r+1} … p_{r+m-k}, of length m-k, lines up with the matched text after P slides by slide[k]; j moves from the old j to the new j.]

  14. Determining the slide
      • Let r (r < k) be the largest index such that p_{r+1} … p_{r+m-k} matches the matched suffix p_{k+1} … p_m of P, and p_r ≠ p_k. Then slide[k] = k - r.
      • If no such r is found, the longest prefix of P, of length q, that matches the end of the matched suffix is lined up. Then slide[k] = m - q.
      • Allowing p_r = p_k is senseless, since p_k has just been a mismatch.

  15. Computing matchJump: Example
      P = "wowwow" (m = 6)
      • k = 6: the matched suffix is empty (m - k = 0). Slide[6] = 1, so matchJump[6] = 1.
      • k = 5: the matched suffix is "w" (m - k = 1). With r = 3, Slide[5] = 5 - 3 = 2, so matchJump[5] = 3.

  16. Computing matchJump: Example
      P = "wowwow" (m = 6)
      • k = 4: the matched suffix is "ow" (m - k = 2). The earlier "ow" is not lined up, because the character before it equals p_4. No r is found, but there is a prefix of length 1, so Slide[4] = m - 1 = 5 and matchJump[4] = 7.
      • k = 3: the matched suffix is "wow" (m - k = 3). With r = 0, Slide[3] = 3 - 0 = 3, so matchJump[3] = 6.

  17. Computing matchJump: Example
      P = "wowwow" (m = 6)
      • k = 2: the matched suffix is "wwow" (m - k = 4). No r is found, but there is a prefix of length 3, so Slide[2] = m - 3 = 3 and matchJump[2] = 7.
      • k = 1: the matched suffix is "owwow" (m - k = 5). No r is found, but there is a prefix of length 3, so Slide[1] = m - 3 = 3 and matchJump[1] = 8.
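The values in this example can be checked by applying the definition of slide[k] literally. The sketch below does that by brute force; it is meant only as a cross-check, not as the efficient algorithm of the lecture. The class name `MatchJumpByDefinition`, the 0-based `String` handling, and the convention that r = 0 is accepted without comparing p_r (there is no p_0) are assumptions made for illustration.

```java
final class MatchJumpByDefinition {
    // slide[k] for 1 <= k <= m, straight from the definition (slow; for checking small examples).
    static int slide(String p, int k) {
        int m = p.length();
        int len = m - k;                    // length of the matched suffix p_{k+1} .. p_m
        String suffix = p.substring(k);     // 0-based characters k .. m-1
        for (int r = k - 1; r >= 0; r--) {  // try the largest r first
            boolean same = p.regionMatches(r, suffix, 0, len);                // p_{r+1..r+len} == suffix
            boolean differs = (r == 0) || p.charAt(r - 1) != p.charAt(k - 1); // p_r != p_k
            if (same && differs) {
                return k - r;
            }
        }
        for (int q = len; q >= 0; q--) {    // longest prefix of P matching the end of the suffix
            if (p.startsWith(p.substring(m - q))) {
                return m - q;
            }
        }
        return m; // not reached: q = 0 always matches
    }

    // matchJump[k] = slide[k] + (m - k); index 0 is unused.
    static int[] matchJump(String p) {
        int m = p.length();
        int[] mj = new int[m + 1];
        for (int k = 1; k <= m; k++) {
            mj[k] = slide(p, k) + (m - k);
        }
        return mj;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(matchJump("wowwow")));
    }
}
```

For "wowwow" this yields matchJump[1..6] = 8, 7, 6, 7, 3, 1, in agreement with the three example slides, and for "batsandcats" it gives slide[8] = 7, the 7-char move of slide 9.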

  18. Finding r by Recursion
      Suppose sufx[k+1] = s.
      • Case 1: p_{k+1} = p_s. Then sufx[k] = sufx[k+1] - 1.
      • Case 2: p_{k+1} ≠ p_s. Continue recursively with sufx[s].
      [Figure: P = p_1 … p_k p_{k+1} p_{k+2} … p_s p_{s+1} …, showing the two cases.]

  19. Computing the slides: the Algorithm

          // initialized as impossible values
          for (k = 1; k ≤ m; k++)
              matchjump[k] = m+1;
          sufx[m] = m+1;
          for (k = m-1; k ≥ 0; k--)
              s = sufx[k+1];
              while (s ≤ m)
                  if (p_{k+1} == p_s)
                      break;
                  matchjump[s] = min(matchjump[s], s-(k+1));
                  s = sufx[s];
              sufx[k] = s-1;

      Remember: slide[k] = k - r. Here, k is s, and r is k+1.

  20. Computing the matchjump: Whole Procedure

          void computeMatchjumps(char[] P, int m, int[] matchjump)
              int k, r, s, low, shift;
              int[] sufx = new int[m+1];
              <computing slides for the suffix: as in the procedure in the previous frame>
              low = 1; shift = sufx[0];
              while (shift ≤ m)                          // slides for a matched shorter prefix
                  for (k = low; k ≤ shift; k++)
                      matchjump[k] = min(matchjump[k], shift);
                  low = shift+1;
                  shift = sufx[shift];
              for (k = 1; k ≤ m; k++)                    // turn into matchjump by adding m-k
                  matchjump[k] += (m-k);
              return;
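Putting the two frames together, the whole procedure might look like this in Java. It is a sketch that follows the pseudocode step by step; the class name `MatchJumpSketch`, the `String` parameter, and returning the array instead of filling a caller-supplied one are assumptions made for illustration. Arrays keep the slides' 1-based indexing for `matchJump`, while `charAt` is 0-based, so p_k is written `p.charAt(k - 1)`.

```java
import java.util.Arrays;

final class MatchJumpSketch {
    // Returns matchJump[1..m]; index 0 is unused.
    static int[] computeMatchJumps(String p) {
        int m = p.length();
        int[] matchJump = new int[m + 1];
        int[] sufx = new int[m + 1];

        // Phase 1: slides for a reoccurrence of the matched suffix inside P.
        Arrays.fill(matchJump, m + 1);   // impossible value, larger than any slide
        sufx[m] = m + 1;
        for (int k = m - 1; k >= 0; k--) {
            int s = sufx[k + 1];
            while (s <= m) {
                if (p.charAt(k) == p.charAt(s - 1)) {   // p_{k+1} == p_s
                    break;
                }
                matchJump[s] = Math.min(matchJump[s], s - (k + 1));
                s = sufx[s];
            }
            sufx[k] = s - 1;
        }

        // Phase 2: slides that line up a shorter prefix of P.
        int low = 1;
        int shift = sufx[0];
        while (shift <= m) {
            for (int k = low; k <= shift; k++) {
                matchJump[k] = Math.min(matchJump[k], shift);
            }
            low = shift + 1;
            shift = sufx[shift];
        }

        // Phase 3: turn slides into jumps of j by adding the m - k matched characters.
        for (int k = 1; k <= m; k++) {
            matchJump[k] += m - k;
        }
        return matchJump;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeMatchJumps("wowwow")));
    }
}
```

Tracing the sketch by hand on "wowwow" gives sufx[0..6] = 3, 4, 5, 5, 6, 6, 7 and matchJump[1..6] = 8, 7, 6, 7, 3, 1, the same values as the worked example.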

  21. Boyer-Moore Scan Algorithm

          int boyerMooreScan(char[] P, char[] T, int m, int[] charjump, int[] matchjump)
              int match, j, k;
              match = -1;
              j = m; k = m;                 // first comparison location
              while (endText(T, j) == false)
                  if (k < 1)
                      match = j+1;          // success
                      break;
                  if (t_j == p_k)           // scan from right to left
                      j--; k--;
                  else                      // take the better of the two heuristics
                      j += max(charjump[t_j], matchjump[k]);
                      k = m;
              return match;
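A compact Java sketch of the complete scan is given below, with both tables computed inside the same class so that it stands on its own. It is an illustration under assumptions rather than the lecture's code: the class name `BoyerMooreSketch`, the alphabet size `ALPHA = 256`, the `String` parameters, and the 0-based return value are all choices made here. The variables j and k keep the slides' 1-based meaning, so t_j is `t.charAt(j - 1)`.

```java
import java.util.Arrays;

final class BoyerMooreSketch {
    static final int ALPHA = 256; // assumed alphabet size (8-bit characters)

    static int[] computeJumps(String p) {
        int m = p.length();
        int[] charJump = new int[ALPHA];
        Arrays.fill(charJump, m);
        for (int k = 1; k <= m; k++) {
            charJump[p.charAt(k - 1)] = m - k;   // rightmost occurrence wins
        }
        return charJump;
    }

    static int[] computeMatchJumps(String p) {
        int m = p.length();
        int[] matchJump = new int[m + 1];        // 1-based, index 0 unused
        int[] sufx = new int[m + 1];
        Arrays.fill(matchJump, m + 1);
        sufx[m] = m + 1;
        for (int k = m - 1; k >= 0; k--) {
            int s = sufx[k + 1];
            while (s <= m && p.charAt(k) != p.charAt(s - 1)) {
                matchJump[s] = Math.min(matchJump[s], s - (k + 1));
                s = sufx[s];
            }
            sufx[k] = s - 1;
        }
        for (int low = 1, shift = sufx[0]; shift <= m; low = shift + 1, shift = sufx[shift]) {
            for (int k = low; k <= shift; k++) {
                matchJump[k] = Math.min(matchJump[k], shift);
            }
        }
        for (int k = 1; k <= m; k++) {
            matchJump[k] += m - k;
        }
        return matchJump;
    }

    // Returns the 0-based index of the first occurrence of p in t, or -1.
    static int search(String p, String t) {
        int m = p.length(), n = t.length();
        int[] charJump = computeJumps(p);
        int[] matchJump = computeMatchJumps(p);
        int j = m, k = m;                        // 1-based first comparison location
        while (j <= n) {
            if (k < 1) {
                return j;                        // 1-based start j+1 is 0-based index j
            }
            if (t.charAt(j - 1) == p.charAt(k - 1)) {
                j--;
                k--;
            } else {
                // Take the better of the two heuristics.
                j += Math.max(charJump[t.charAt(j - 1)], matchJump[k]);
                k = m;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("must", "If you wish to understand others you must"));
    }
}
```

Because matchJump[k] is always at least m - k + 1, the pattern moves forward on every mismatch even when charJump of the text character is 0.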

  22. Home Assignment
      • pp.508-
      • 11.16
      • 11.19
      • 11.20
      • 11.25
