11 Text search
Robert Elsässer
Summer Term 2010

Text search

Different scenarios:

Dynamic texts
• Text editors
• Symbol manipulators

Static texts
• Literature databases
• Library systems
• Gene databases
• World Wide Web

19.05.2010 Theory 1 - Text search

Text search

Data type string:
• array of character
• file of character
• list of character

Operations (let T, P be of type string):
• Length: length(T)
• i-th character: T[i]
• Concatenation: cat(T, P), written T.P

Problem definition

Input:
• Text $t_1 t_2 \ldots t_n \in \Sigma^n$
• Pattern $p_1 p_2 \ldots p_m \in \Sigma^m$

Goal: Find one or all occurrences of the pattern in the text, i.e. shifts $i$ ($0 \le i \le n - m$) such that

$p_1 = t_{i+1},\; p_2 = t_{i+2},\; \ldots,\; p_m = t_{i+m}$
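The occurrence condition can be checked directly for a single shift. The following is a minimal Python sketch, not part of the original slides; the function name `occurs_at` is an illustrative assumption, and Python's 0-based slicing replaces the 1-based indices used above.

```python
def occurs_at(text, pattern, i):
    """Return True if pattern occurs in text at shift i (0 <= i <= n - m)."""
    n, m = len(text), len(pattern)
    if not 0 <= i <= n - m:
        return False
    # p_1 = t_{i+1}, ..., p_m = t_{i+m}; with 0-based strings the
    # characters t_{i+1} ... t_{i+m} are exactly the slice text[i:i+m].
    return text[i:i + m] == pattern


# All valid shifts of "abra" in "abracadabra" (n = 11, m = 4, so i = 0..7).
print([i for i in range(8) if occurs_at("abracadabra", "abra", i)])  # [0, 7]
```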

Problem definition

Text: $t_1\, t_2 \ldots t_{i+1} \ldots t_{i+m} \ldots t_n$
Pattern (at shift $i$): $p_1 \ldots p_m$

Estimation of cost (time):
1. Number of possible shifts: $n - m + 1$; number of pattern positions: $m$ ⇒ $O(n \cdot m)$
2. At least 1 comparison per $m$ consecutive text positions ⇒ $\Omega(m + n/m)$

Naïve approach

For each possible shift $0 \le i \le n - m$, check at most $m$ pairs of characters. Whenever a mismatch occurs, start with the next shift.

textsearchbf := proc (T :: string, P :: string)
# Input: text T and pattern P
# Output: list L of shifts i at which P occurs in T
  n := length (T); m := length (P);
  L := [];
  for i from 0 to n-m do
    j := 1;
    while j ≤ m and T[i+j] = P[j]
      do j := j+1 od;
    if j = m+1 then L := [L[], i] fi;
  od;
  RETURN (L)
end;
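For readers who want to run the procedure, here is a hedged Python transcription. It is a sketch under the assumption that 0-based string indexing is acceptable; the name `text_search_bf` is chosen for illustration and is not from the slides.

```python
def text_search_bf(text, pattern):
    """Return all shifts i (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for i in range(n - m + 1):          # every possible shift 0 <= i <= n - m
        j = 0
        # Compare at most m character pairs; stop at the first mismatch.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            shifts.append(i)
    return shifts


print(text_search_bf("abracadabra", "abra"))  # [0, 7]
```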

Naïve approach

Cost estimation (time):

Text:                  0 0 ... 0 ... 0 ... 0 0 ...
Pattern (at shift i):  0 ... 0 ... 0 1

Worst case: $\Omega(m \cdot n)$
In practice: a mismatch often occurs very early ⇒ running time ~ $c \cdot n$
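The worst case can be made concrete by counting character comparisons on a text of zeros and a pattern of zeros ending in a single 1. This is an illustrative Python sketch; the helper `count_comparisons_bf` and the sizes n = 20, m = 5 are assumptions chosen for the example.

```python
def count_comparisons_bf(text, pattern):
    """Count the character comparisons made by the naive search."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                   # mismatch: move on to the next shift
    return comparisons


n, m = 20, 5
# Text 0^n, pattern 0^(m-1) 1: every shift needs all m comparisons.
worst = count_comparisons_bf("0" * n, "0" * (m - 1) + "1")
print(worst, (n - m + 1) * m)  # 80 80
```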

Method of Knuth-Morris-Pratt (KMP)

Let $t_i$ and $p_{j+1}$ be the characters to be compared:

t_1  t_2  ...  ...  ...  ...  t_i      ...
          =    =    =    =    ≠
          p_1  ...  ...  p_j  p_{j+1}  ...  p_m

If, at a shift, the first mismatch occurs at $t_i$ and $p_{j+1}$, then:
• The last $j$ characters inspected in T equal the first $j$ characters in P.
• $t_i \ne p_{j+1}$

Method of Knuth-Morris-Pratt (KMP)

Idea: Determine $j' = next[j] < j$ such that $t_i$ can then be compared with $p_{j'+1}$.

Determine the largest $j' < j$ such that $P_{1 \ldots j'} = P_{j-j'+1 \ldots j}$, i.e. find the longest prefix of P that is a proper suffix of $P_{1 \ldots j}$.

t_1  t_2  ...  ...  ...  ...  t_i      ...
          =    =    =    =    ≠
          p_1  ...  ...  p_j  p_{j+1}  ...  p_m

Method of Knuth-Morris-Pratt (KMP)

Example for determining next[j]:

[Figure: a text t_1 t_2 ... over {0, 1} with the pattern aligned beneath it at successive shifts]

next[j] = length of the longest prefix of P that is a proper suffix of $P_{1 \ldots j}$.

Method of Knuth-Morris-Pratt (KMP)

⇒ for P = 0101101011, next = [0,0,1,2,0,1,2,3,4,5]:

j        1  2  3  4  5  6  7  8  9  10
P[j]     0  1  0  1  1  0  1  0  1  1
next[j]  0  0  1  2  0  1  2  3  4  5
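The values can be checked directly against the definition of next[j]. The following Python sketch is a deliberately slow brute-force check, not the efficient computation; the name `next_by_definition` is an assumption, and the returned list is 0-based, so entry j-1 holds next[j].

```python
def next_by_definition(pattern):
    """For each j = 1..m, the length of the longest prefix of the pattern
    that is a proper suffix of its first j characters (brute force)."""
    result = []
    for j in range(1, len(pattern) + 1):
        head = pattern[:j]              # P_1 ... P_j
        best = 0
        for k in range(1, j):           # proper suffix: k < j
            if head[:k] == head[j - k:]:
                best = k                # k increases, so the last hit is longest
        result.append(best)
    return result


print(next_by_definition("0101101011"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```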

Method of Knuth-Morris-Pratt (KMP)

KMP := proc (T :: string, P :: string)
# Input: text T and pattern P
# Output: list L of shifts i at which P occurs in T
  n := length (T); m := length (P);
  L := []; next := KMPnext(P);
  j := 0;
  for i from 1 to n do
    while j > 0 and T[i] <> P[j+1]
      do j := next[j] od;
    if T[i] = P[j+1] then j := j+1 fi;
    if j = m then L := [L[], i-m];
                  j := next[j]
    fi;
  od;
  RETURN (L);
end;
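A runnable counterpart may help when experimenting with the procedure. The Python sketch below mirrors the loop structure of KMP; the names `kmp` and `kmp_next` and the guard for an empty pattern are assumptions added for illustration. The next-array is kept 1-based (index 0 unused) so that it lines up with the pseudocode, while the strings themselves are 0-based, so P[j+1] becomes `pattern[j]`.

```python
def kmp_next(pattern):
    """1-based next-array: nxt[i] is the length of the longest prefix of
    the pattern that is a proper suffix of its first i characters."""
    m = len(pattern)
    nxt = [0] * (m + 1)                 # nxt[0] is unused
    j = 0
    for i in range(2, m + 1):
        while j > 0 and pattern[i - 1] != pattern[j]:
            j = nxt[j]
        if pattern[i - 1] == pattern[j]:
            j += 1
        nxt[i] = j
    return nxt


def kmp(text, pattern):
    """Return all shifts (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:                          # assumption: empty pattern matches everywhere
        return list(range(n + 1))
    nxt = kmp_next(pattern)
    shifts = []
    j = 0                               # number of pattern characters matched so far
    for i in range(1, n + 1):           # i is the 1-based text position
        while j > 0 and text[i - 1] != pattern[j]:
            j = nxt[j]                  # shift the pattern, keep the known prefix
        if text[i - 1] == pattern[j]:
            j += 1
        if j == m:                      # complete match ending at position i
            shifts.append(i - m)
            j = nxt[j]
    return shifts


print(kmp("abracadabrabrababrac", "abrac"))  # [0, 15]
```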

Method of Knuth-Morris-Pratt (KMP)

Pattern: abracadabra, next = [0,0,0,1,0,1,0,1,2,3,4]
(| marks a character comparison, - a position already known to match)

a b r a c a d a b r a b r a b a b r a c ...
| | | | | | | | | | |
a b r a c a d a b r a                          next[11] = 4

a b r a c a d a b r a b r a b a b r a c ...
              - - - - |
              a b r a c                        next[4] = 1

Method of Knuth-Morris-Pratt (KMP)

a b r a c a d a b r a b r a b a b r a c ...
                    - | | | |
                    a b r a c                  next[4] = 1

a b r a c a d a b r a b r a b a b r a c ...
                          - | |
                          a b r a c            next[2] = 0

a b r a c a d a b r a b r a b a b r a c ...
                              | | | | |
                              a b r a c

Method of Knuth-Morris-Pratt (KMP)

Correctness:

t_1  t_2  ...  ...  ...  ...  t_i      ...
          =    =    =    =    ≠
          p_1  ...  ...  p_j  p_{j+1}  ...  p_m

Situation at start of the for-loop: $P_{1 \ldots j} = T_{i-j \ldots i-1}$ and $j \ne m$
• if $j = 0$: we are at the first character of P
• if $j \ne 0$: P can be shifted while $j > 0$ and $t_i \ne p_{j+1}$

Method of Knuth-Morris-Pratt (KMP)

If T[i] = P[j+1], j and i can be increased (at the end of the loop).

When P has been compared completely (j = m), a position was found, and we can shift.

Method of Knuth-Morris-Pratt (KMP)

Time complexity:
• Text pointer i is never reset
• Text pointer i and pattern pointer j are always incremented together
• Always: next[j] < j; j can be decreased only as many times as it has been increased.

The KMP algorithm can be carried out in time $O(n)$, if the next-array is known.
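The counting argument can be observed by instrumenting the search loop. The Python sketch below is illustrative only; the function `kmp_pointer_moves` and the test input are assumptions. It reports how often j is incremented and how often it is decreased, and the argument above predicts decreases ≤ increments ≤ n.

```python
def kmp_pointer_moves(text, pattern):
    """Run the KMP search loop and return (increments, decreases) of j.
    Assumes a non-empty pattern."""
    n, m = len(text), len(pattern)
    # 1-based next-array, computed with the same fallback scheme.
    nxt = [0] * (m + 1)
    j = 0
    for i in range(2, m + 1):
        while j > 0 and pattern[i - 1] != pattern[j]:
            j = nxt[j]
        if pattern[i - 1] == pattern[j]:
            j += 1
        nxt[i] = j

    increments = decreases = 0
    j = 0
    for i in range(1, n + 1):           # the text pointer i is never reset
        while j > 0 and text[i - 1] != pattern[j]:
            j = nxt[j]                  # next[j] < j, so j strictly decreases
            decreases += 1
        if text[i - 1] == pattern[j]:
            j += 1                      # at most one increment per text position
            increments += 1
        if j == m:
            j = nxt[j]
            decreases += 1
    return increments, decreases


inc, dec = kmp_pointer_moves("0" * 20, "00001")
print(inc, dec, dec <= inc <= 20)
```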

Computing the next-array

next[i] = length of the longest prefix of P that is a proper suffix of $P_{1 \ldots i}$.

next[1] = 0

Let next[i-1] = j:

p_1  p_2  ...  ...  ...  ...  p_i      ...
          =    =    =    =    ≠ / =
          p_1  ...  ...  p_j  p_{j+1}  ...  p_m

Computing the next-array

Consider two cases:
1) $p_i = p_{j+1}$ ⇒ next[i] = j + 1
2) $p_i \ne p_{j+1}$ ⇒ replace j by next[j], until $p_i = p_{j+1}$ or j = 0.
   If $p_i = p_{j+1}$, we can set next[i] = j + 1, otherwise next[i] = 0.
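The two cases can be followed step by step by recording which values of j are tried for each position i. The Python sketch below is an illustration under assumed names (`kmp_next_traced`, `tried`); it returns the next-array as a 0-based list together with the trace.

```python
def kmp_next_traced(pattern):
    """Compute the next-array and record, for each i, the values of j
    that were tested against p_i (case 2 produces more than one)."""
    m = len(pattern)
    nxt = [0] * (m + 1)                 # 1-based; nxt[0] is unused
    tried = {}
    j = 0
    for i in range(2, m + 1):
        tried[i] = [j]                  # start from j = next[i-1]
        while j > 0 and pattern[i - 1] != pattern[j]:
            j = nxt[j]                  # case 2: replace j by next[j]
            tried[i].append(j)
        if pattern[i - 1] == pattern[j]:
            j += 1                      # case 1: p_i = p_{j+1}
        nxt[i] = j                      # j + 1 on a match, otherwise 0
    return nxt[1:], tried


nxt, tried = kmp_next_traced("0101101011")
print(nxt)        # [0, 0, 1, 2, 0, 1, 2, 3, 4, 5]
print(tried[5])   # [2, 0]
```

For i = 5 the trace shows the fallback from j = 2 to j = next[2] = 0, after which no match is found and next[5] = 0.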

Computing the next-array

KMPnext := proc (P :: string)
# Input: pattern P
# Output: next-array for P
  m := length (P);
  next := array (1..m);
  next[1] := 0;
  j := 0;
  for i from 2 to m do
    while j > 0 and P[i] <> P[j+1]
      do j := next[j] od;
    if P[i] = P[j+1] then j := j+1 fi;
    next[i] := j
  od;
  RETURN (next);
end;

Running time of KMP

The KMP algorithm can be carried out in time $O(n + m)$.

Can text search be even faster?
