
Introduction to Natural Language Processing - PowerPoint PPT Presentation



  1. Introduction to Natural Language Processing
     A course taught as B4M36NLP at Open Informatics by members of the Institute of Formal and Applied Linguistics.
     Week 7, lecture
     Today's teacher: Pavel Pecina
     E-mail: pecina@ufal.mff.cuni.cz
     WWW: http://ufal.mff.cuni.cz/~pecina/
     Today's topic: Boolean and Vector Space Models for Information Retrieval
     Today: Boolean retrieval, Text processing, Ranked retrieval, Term weighting, Vector space model, Evaluation

  2. Contents
     Introduction
     Boolean retrieval
     Text processing
     Ranked retrieval
     Term weighting
     Vector space model
     Evaluation

  3. Introduction

  4. Definition of Information Retrieval
     Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

  5. Boolean retrieval

  6. Boolean retrieval
     ▶ The Boolean model is arguably the simplest model to base an information retrieval system on.
     ▶ Queries are Boolean expressions, e.g., Caesar AND Brutus.
     ▶ The search engine returns all documents that satisfy the Boolean expression.

  7. Term-document incidence matrix

                   Anthony &  Julius  The      Hamlet  Othello  Macbeth  ...
                   Cleopatra  Caesar  Tempest
     Anthony           1        1       0        0       0        1
     Brutus            1        1       0        1       0        0
     Caesar            1        1       0        1       1        1
     Calpurnia         0        1       0        0       0        0
     Cleopatra         1        0       0        0       0        0
     mercy             1        0       1        1       1        1
     worser            1        0       1        1       1        0
     ...

     Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar.
     Entry is 0 if term doesn't occur. Example: Calpurnia doesn't occur in The Tempest.

  8. Incidence vectors
     ▶ So we have a 0/1 vector for each term.
     ▶ To answer the query Brutus AND Caesar AND NOT Calpurnia:
       1. Take the vectors for Brutus, Caesar, and Calpurnia
       2. Complement the vector of Calpurnia
       3. Do a (bitwise) AND on the three vectors:
          110100 AND 110111 AND 101111 = 100100
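The bitwise evaluation above is easy to mirror in code. Below is a minimal sketch in Python (not part of the original slides) that builds the three 0/1 vectors from the incidence matrix and answers the query; the helper names vec_and and vec_not are made up for illustration.

```python
# A minimal sketch of Boolean retrieval with 0/1 incidence vectors,
# using the play data from the matrix above.
docs = ["Anthony & Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

# 0/1 incidence vectors for each query term (one bit per document).
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def vec_and(a, b):
    """Bitwise AND of two 0/1 vectors."""
    return [x & y for x, y in zip(a, b)]

def vec_not(a):
    """Bitwise complement of a 0/1 vector."""
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = vec_and(vec_and(incidence["Brutus"], incidence["Caesar"]),
                 vec_not(incidence["Calpurnia"]))

print(result)                                      # [1, 0, 0, 1, 0, 0] = 100100
print([d for d, bit in zip(docs, result) if bit])  # ['Anthony & Cleopatra', 'Hamlet']
```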

  9. 0/1 vectors and result of the bitwise operations

                   Anthony &  Julius  The      Hamlet  Othello  Macbeth  ...
                   Cleopatra  Caesar  Tempest
     Anthony           1        1       0        0       0        1
     Brutus            1        1       0        1       0        0
     Caesar            1        1       0        1       1        1
     Calpurnia         0        1       0        0       0        0
     Cleopatra         1        0       0        0       0        0
     mercy             1        0       1        1       1        1
     worser            1        0       1        1       1        0
     ...
     result:           1        0       0        1       0        0

  10. Answers to query
      Anthony and Cleopatra, Act III, Scene ii:
        Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
          When Antony found Julius Caesar dead,
          He cried almost to roaring; and he wept
          When at Philippi he found Brutus slain.
      Hamlet, Act III, Scene ii:
        Lord Polonius: I did enact Julius Caesar: I was killed i' the
          Capitol; Brutus killed me.

  11. Bigger collections
      ▶ Consider N = 10^6 documents, each with about 1000 tokens
        ⇒ total of 10^9 tokens
      ▶ On average 6 bytes per token, including spaces and punctuation
        ⇒ size of document collection is about 6 · 10^9 bytes = 6 GB
      ▶ Assume there are M = 500,000 distinct terms in the collection
        ⇒ the incidence matrix has M × N = 500,000 × 10^6 = half a trillion 0s and 1s
      ▶ But the matrix has no more than one billion 1s.
        ⇒ The matrix is extremely sparse.
      ▶ What is a better representation?
        ⇒ We only record the 1s.
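These size estimates can be checked with plain arithmetic; the short sketch below (not from the slides) just recomputes them and the resulting density of the matrix.

```python
# Back-of-the-envelope check of the collection-size estimates above.
N = 10**6             # documents
tokens_per_doc = 1000
bytes_per_token = 6
M = 500_000           # distinct terms

total_tokens = N * tokens_per_doc                   # 1e9 tokens
collection_bytes = total_tokens * bytes_per_token   # 6e9 bytes, roughly 6 GB
matrix_cells = M * N                                # 5e11 cells (half a trillion)
max_ones = total_tokens                             # at most one 1 per token

print(f"{total_tokens=:.1e} {collection_bytes=:.1e} {matrix_cells=:.1e}")
print(f"fraction of 1s <= {max_ones / matrix_cells:.2%}")   # <= 0.20%
```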

  12. Inverted index
      For each term t, we store a list of all documents that contain t.

      Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
      Caesar    → 1, 2, 4, 5, 6, 16, 57, 132, ...
      Calpurnia → 2, 31, 54, 101

      (dictionary on the left, postings lists on the right)
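As a complement to the bit-vector query on slide 8, the following sketch (an assumed implementation, not shown on this slide) answers the same query Brutus AND Caesar AND NOT Calpurnia over these sorted postings lists; the intersect helper is illustrative.

```python
# Answering a Boolean query over sorted postings lists (docIDs from the slide).
postings = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Brutus AND Caesar, then drop documents that contain Calpurnia (the NOT part).
both = intersect(postings["Brutus"], postings["Caesar"])
calpurnia = set(postings["Calpurnia"])
print([d for d in both if d not in calpurnia])   # [1, 4]
```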

  13. Inverted index construction
      1. Collect the documents to be indexed:
         Friends, Romans, countrymen. | So let it be with Caesar ... | ...
      2. Tokenize the text, turning each document into a list of tokens:
         Friends | Romans | countrymen | So | ...
      3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
         friend | roman | countryman | so | ...
      4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

  14. Tokenization and preprocessing
      Doc 1. I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
      Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
      ⇒
      Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
      Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

  15. Generate postings, sort, create postings lists, determine document frequency
      Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
      Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

      Generate one (term, docID) pair per token, sort the pairs alphabetically by term,
      then group the pairs for each term into a postings list and record the document
      frequency:

      term        doc. freq.   postings list
      ambitious   1            → 2
      be          1            → 2
      brutus      2            → 1 → 2
      capitol     1            → 1
      caesar      2            → 1 → 2
      did         1            → 1
      enact       1            → 1
      hath        1            → 2
      i           1            → 1
      i'          1            → 1
      it          1            → 2
      julius      1            → 1
      killed      1            → 1
      let         1            → 2
      me          1            → 1
      noble       1            → 2
      so          1            → 2
      the         2            → 1 → 2
      told        1            → 2
      was         2            → 1 → 2
      with        1            → 2
      you         1            → 2
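The construction steps above condense into a few lines of code. The sketch below (an assumed implementation, not the lecture's code) generates (term, docID) pairs for the two example documents, sorts them, groups them into postings lists, and reads off document frequencies.

```python
# Inverted index construction: generate pairs, sort, group, count.
docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

# 1. Generate one (term, docID) pair per token (the text is already normalized).
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# 2. Sort by term, then by docID.
pairs.sort()

# 3. Group into postings lists; repeated occurrences within a document collapse.
postings = {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:
        plist.append(doc_id)

# 4. Document frequency is the length of each postings list.
for term, plist in postings.items():
    print(f"{term:10s} df={len(plist)}  ->  {plist}")
# e.g.  brutus     df=2  ->  [1, 2]
#       caesar     df=2  ->  [1, 2]
```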

  16. Split the result into dictionary and postings file
      For each term t, we store a list of all documents that contain t.

      Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
      Caesar    → 1, 2, 4, 5, 6, 16, 57, 132, ...
      Calpurnia → 2, 31, 54, 101

      The dictionary is the data structure for storing the term vocabulary.

  17. Dictionary as array of fixed-width entries
      ▶ For each term, we need to store a couple of items:
        ▶ document frequency
        ▶ pointer to postings list
        ▶ ...
      ▶ Assume for the time being that we can store this information in a fixed-length entry.
      ▶ Assume that we store these entries in an array.

  18. Dictionary as array of fixed-width entries

      term      document frequency   pointer to postings list
      a         656,265              →
      aachen    65                   →
      ...       ...                  ...
      zulu      221                  →

      Space needed: 20 bytes (term) + 4 bytes (doc. freq.) + 4 bytes (pointer)

      1. How do we look up a query term q_i in this array at query time?
      2. Which data structure do we use to locate the entry (row) in the array where q_i is stored?
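To make the fixed-width layout concrete, here is a small sketch (an assumed layout, not from the lecture) that packs entries with the field widths quoted on the slide: 20 bytes for the term, 4 bytes for the document frequency, and 4 bytes for the pointer, represented here as a byte offset into a postings file. The offsets are made-up example values.

```python
# Fixed-width dictionary entries packed into one flat byte array.
import struct

ENTRY = struct.Struct("20s I I")   # term (padded to 20 bytes), doc. freq., postings offset

def pack_entry(term, df, offset):
    return ENTRY.pack(term.encode("utf-8")[:20], df, offset)

def unpack_entry(buf):
    term, df, offset = ENTRY.unpack(buf)
    return term.rstrip(b"\x00").decode("utf-8"), df, offset

# The dictionary is one contiguous array of fixed-width records,
# so entry i starts at byte offset i * ENTRY.size.
array = b"".join([
    pack_entry("a", 656_265, 0),        # offsets are illustrative, not real
    pack_entry("aachen", 65, 4096),
    pack_entry("zulu", 221, 8192),
])
print(unpack_entry(array[ENTRY.size:2 * ENTRY.size]))   # ('aachen', 65, 4096)
```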

  19. Data structures for looking up terms
      ▶ Two main classes of data structures: hashes and trees.
      ▶ Some IR systems use hashes, some use trees.
      ▶ Criteria for when to use hashes vs. trees:
        1. Is there a fixed number of terms or will it keep growing?
        2. What are the frequencies with which various keys will be accessed?
        3. How many terms are we likely to have?
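A small illustration of the trade-off (not from the lecture): a hash table gives constant-time exact-match lookup, while a sorted array, standing in here for a search tree, gives logarithmic lookup but also supports ordered operations such as prefix ranges.

```python
# Hash vs. ordered lookup of dictionary terms.
import bisect

terms = sorted(["a", "aachen", "brutus", "caesar", "calpurnia", "zulu"])
hash_dict = {t: i for i, t in enumerate(terms)}     # term -> entry index

# Hash lookup: O(1) on average, exact matches only.
print(hash_dict["caesar"])                          # 3

# Sorted-array lookup: binary search, and prefix queries come for free.
i = bisect.bisect_left(terms, "caesar")
print(i, terms[i])                                  # 3 caesar
lo, hi = bisect.bisect_left(terms, "ca"), bisect.bisect_left(terms, "cb")
print(terms[lo:hi])                                 # ['caesar', 'calpurnia']
```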
