NPFL103: Information Retrieval (1)
Introduction, Boolean retrieval, Inverted index, Text processing
Pavel Pecina
pecina@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University
Original slides are courtesy of Hinrich Schütze, University of Stuttgart.
Contents
Introduction
Boolean retrieval
Inverted index
Boolean queries
Text processing
Phrase queries
Proximity search
Introduction
Definition of Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Boolean retrieval
Boolean retrieval
▶ The Boolean model is arguably the simplest model to base an information retrieval system on.
▶ Queries are Boolean expressions, e.g., Caesar AND Brutus
▶ The search engine returns all documents that satisfy the Boolean expression.
Does Google use the Boolean model?
Does Google use the Boolean model?
▶ On Google, the default interpretation of a query [w_1 w_2 … w_n] is w_1 AND w_2 AND … AND w_n
▶ Cases where you get hits that do not contain one of the w_i:
  ▶ anchor text
  ▶ page contains a variant of w_i (morphology, spelling, synonymy)
  ▶ long queries (n large)
  ▶ Boolean expression generates very few hits
▶ Simple Boolean vs. ranking of result set
  ▶ Simple Boolean retrieval returns documents in no particular order.
  ▶ Google (and most well-designed Boolean engines) rank the result set: they rank good hits higher than bad hits (according to some estimator of relevance).
Inverted index
Unstructured data in 1650: Plays of William Shakespeare
Unstructured data in 1650
▶ Which plays of Shakespeare contain the words Brutus AND Caesar, but NOT Calpurnia?
▶ One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
▶ Why is grep not the solution?
  ▶ Slow (for large collections)
  ▶ grep is line-oriented, IR is document-oriented
  ▶ “NOT Calpurnia” is non-trivial
  ▶ Other operations (e.g., search for Romans near country) are infeasible
Term-document incidence matrix

            Antony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra    Caesar   Tempest
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
Calpurnia   0            1        0         0        0         0
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0
…

Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if term doesn’t occur. Example: Calpurnia doesn’t occur in The Tempest.
Incidence vectors
▶ So we have a 0/1 vector for each term.
▶ To answer the query Brutus AND Caesar AND NOT Calpurnia:
  1. Take the vectors for Brutus, Caesar, and Calpurnia
  2. Complement the vector of Calpurnia
  3. Do a (bitwise) AND on the three vectors: 110100 AND 110111 AND 101111 = 100100
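The three steps above can be sketched in Python using integers as bit vectors. This is only an illustration: the names PLAYS, vectors and MASK are assumptions introduced here, and the leftmost bit is taken to stand for the first play in the matrix.

```python
# Plays in matrix column order; the leftmost bit of each vector is the first play.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Step 1: the 0/1 incidence vectors for the three query terms (from the slide).
vectors = {
    "brutus": 0b110100,
    "caesar": 0b110111,
    "calpurnia": 0b010000,
}

# Python integers are unbounded, so the complement is masked to 6 bits.
MASK = (1 << len(PLAYS)) - 1

# Step 2: complement the vector of Calpurnia.
not_calpurnia = ~vectors["calpurnia"] & MASK

# Step 3: bitwise AND of the three vectors.
result = vectors["brutus"] & vectors["caesar"] & not_calpurnia

bits = format(result, "06b")
matching = [play for play, bit in zip(PLAYS, bits) if bit == "1"]
print(bits, matching)
```

The set bits of the result pick out the plays that satisfy the query.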
0/1 vector for Brutus

            Antony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra    Caesar   Tempest
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
Calpurnia   0            1        0         0        0         0
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0
…
result:     1            0        0         1        0         0
Answers to query

Antony and Cleopatra, Act III, Scene ii:
Agrippa [Aside to Domitius Enobarbus]:
  Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii:
Lord Polonius:
  I did enact Julius Caesar: I was killed i’ the
  Capitol; Brutus killed me.
Bigger collections
▶ Consider N = 10^6 documents, each with about 1000 tokens
  ⇒ total of 10^9 tokens
▶ On average 6 bytes per token, including spaces and punctuation
  ⇒ size of document collection is about 6 · 10^9 bytes = 6 GB
▶ Assume there are M = 500,000 distinct terms in the collection
  ⇒ the term-document matrix has M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
▶ But the matrix has no more than one billion 1s.
  ⇒ Matrix is extremely sparse.
▶ What is a better representation?
  ⇒ We only record the 1s.
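The arithmetic behind these estimates can be checked with a few lines of Python. The variable names are assumptions chosen for this sketch; the numbers are the ones given on the slide.

```python
N = 10**6               # documents in the collection
tokens_per_doc = 1000   # approximate tokens per document
bytes_per_token = 6     # average, including spaces and punctuation
M = 500_000             # distinct terms

total_tokens = N * tokens_per_doc                   # 10^9 tokens
collection_bytes = total_tokens * bytes_per_token   # 6 * 10^9 bytes, about 6 GB
matrix_cells = M * N                                # 5 * 10^11 cells (half a trillion)

# Each token can switch on at most one cell, so the number of 1s
# is bounded by the number of tokens.
max_ones = total_tokens
max_density = max_ones / matrix_cells               # upper bound on the fraction of 1s

print(total_tokens, collection_bytes, matrix_cells, max_density)
```

At most about 0.2% of the cells can be 1, which is why storing only the 1s pays off.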
Inverted Index
For each term t, we store a list of all documents that contain t.

dictionary        postings
Brutus      →     1  2  4  11  31  45  173  174
Caesar      →     1  2  4  5  6  16  57  132  …
Calpurnia   →     2  31  54  101
…
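One simple way to picture this structure is a Python dict whose keys form the dictionary and whose values are the sorted postings lists. This is a minimal in-memory sketch, not how a production index is stored; the names index and postings are assumptions made for the example.

```python
# Dictionary (keys) and postings lists (values), copied from the slide.
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],  # the slide's list continues past 132
    "Calpurnia": [2, 31, 54, 101],
}

def postings(term):
    """Return the docIDs of all documents containing term (empty if unknown)."""
    return index.get(term, [])

print(postings("Calpurnia"))
print(postings("Cleopatra"))  # a term that is not in this toy dictionary
```

Only the documents that contain a term are stored, so the space used is proportional to the number of 1s in the incidence matrix.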
Inverted index construction
1. Collect the documents to be indexed:
   Friends, Romans, countrymen. So let it be with Caesar …
2. Tokenize the text, turning each document into a list of tokens:
   Friends Romans countrymen So …
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
   friend roman countryman so …
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
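The four steps can be sketched end to end in Python. Several things here are assumptions made for illustration: the slide's text is split into two toy documents, tokenization is a simple regular expression, and LEMMAS is a hypothetical hand-written table that stands in for real linguistic preprocessing.

```python
import re
from collections import defaultdict

# Hypothetical toy lemma table covering only the slide's example words.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Step 2: split a document into tokens (runs of letters and apostrophes)."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    """Step 3: lowercase the token and map it to its lemma if the table has one."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)

def build_index(docs):
    """Step 4: map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in sorted(terms):
            index[term].append(doc_id)
    return dict(index)

# Step 1: collect the documents (the two-document split is an assumption).
docs = {1: "Friends, Romans, countrymen.", 2: "So let it be with Caesar"}
index = build_index(docs)
print(index)
```

Visiting the documents in ascending docID order keeps every postings list sorted without a separate sorting pass.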
Tokenization and preprocessing
Doc 1. I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
⇒
Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious
Generate postings, sort, create lists, determine document frequency
Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

1. Generate postings (term, docID) in order of occurrence:
   i 1, did 1, enact 1, julius 1, caesar 1, i 1, was 1, killed 1, i’ 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
⇒
2. Sort by term, then by docID:
   ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, i 1, i 1, i’ 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
⇒
3. Create postings lists and determine document frequency:

term        doc. freq.   →   postings list
ambitious   1            →   2
be          1            →   2
brutus      2            →   1 → 2
caesar      2            →   1 → 2
capitol     1            →   1
did         1            →   1
enact       1            →   1
hath        1            →   2
i           1            →   1
i’          1            →   1
it          1            →   2
julius      1            →   1
killed      1            →   1
let         1            →   2
me          1            →   1
noble       1            →   2
so          1            →   2
the         2            →   1 → 2
told        1            →   2
was         2            →   1 → 2
with        1            →   2
you         1            →   2
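The same sequence (generate postings, sort, create lists, determine document frequency) can be sketched in Python on the two preprocessed documents. Whitespace splitting and the names pairs and index are assumptions made for this example.

```python
from itertools import groupby

# The two preprocessed documents from the slide.
docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

# Generate postings: one (term, docID) pair per token, in order of occurrence.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Sort by term, then by docID.
pairs.sort()

# Create postings lists: merge duplicate pairs and record the document frequency.
index = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings = sorted({doc_id for _, doc_id in group})
    index[term] = (len(postings), postings)

for term, (doc_freq, postings) in index.items():
    print(term, doc_freq, postings)
```

Duplicate pairs such as the two occurrences of caesar in Doc 2 collapse into a single posting, so the document frequency counts documents rather than occurrences.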
Split the result into dictionary and postings file

dictionary        postings file
Brutus      →     1  2  4  11  31  45  173  174
Caesar      →     1  2  4  5  6  16  57  132  …
Calpurnia   →     2  31  54  101
…
Boolean queries