INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 1: Boolean Retrieval Paul Ginsparg Cornell University, Ithaca, NY 25 Aug 2011 1 / 43
Plan for today Course overview Administrativa Boolean retrieval 2 / 43
Overview “After change, things are different . . . ” 3 / 43
“Plan” Search full text: basic concepts Web search Probabilistic Retrieval Interfaces Metadata / Semantics IR ⇔ NLP ⇔ ML Prereqs: Introductory courses in data structures and algorithms, in linear algebra (eigenvalues), and in probability theory (Bayes' theorem) 4 / 43
Administrativa (tentative) Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/ Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11 Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452 Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment Teaching Assistant: Saeed Abdullah, use cs4300-l@lists.cs.cornell.edu Course text at: http://informationretrieval.org/ Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze; see also Information Retrieval, S. Büttcher, C. Clarke, G. Cormack http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307 5 / 43
Tentative Assignment and Exam Schedules During this course there will be four assignments which require programming Assignment 1 due Sun 18 Sep Assignment 2 due Sat 8 Oct Assignment 3 due Sun 6 Nov Assignment 4 due Fri 2 Dec and two examinations: Midterm on Thu 13 Oct Final exam on Wed 14 Dec (7:00 PM – 9:30 PM) The course grade will be based on course assignments, examinations, and subjective measures (e.g., class participation) with rough weightings: Assignments 50%, Examinations 50%, with subjective adjustments (as much as 20%) 6 / 43
Outline 1 Introduction 2 Inverted index 3 Processing Boolean queries 4 Discussion Section (next week) 7 / 43
Definition of information retrieval Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). IR used to be the province of reference librarians, paralegals, and similar professionals; now hundreds of millions of people (billions?) engage in information retrieval every day when they use a web search engine or search their email. Three scales: web, enterprise/institutional/domain-specific, and personal. 8 / 43
Clustering and Classification IR also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents. Clustering: find a good grouping of the documents based on their contents (cf. arranging books on a bookshelf according to their topic). Classification: given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), decide which class(es), if any, each of a set of documents belongs to. 9 / 43
Structured vs Unstructured “Unstructured data”: no clear, semantically overt (easy-for-a-computer) structure. Structured data: e.g., a relational database (product inventories and personnel records). But no data is truly “unstructured”: text has latent linguistic structure and, in addition, headings, paragraphs, and footnotes with explicit markup. IR facilitates “semistructured” search: e.g., find documents whose title contains Java and whose body contains threading. 10 / 43
!"#$%&'()%"*#%*!"+%$,-)%"*./#$0/1-2* !"#$%&'$&%()*+$(,$-*.#/*#$%&'$&%()* +)0$010#(-*)0$0*2"*3445* #!!" '&!" '%!" '$!" '#!" 678*2.9*.20:" '!!" ;*2.9*.20:" &!" %!" $!" #!" !" ()*)"+,-./0" 1)230*"4)5" 6*
Boolean retrieval The Boolean model is among the simplest models on which to base an information retrieval system. Queries are Boolean expressions, e.g., Caesar AND Brutus. The search engine returns all documents that satisfy the Boolean expression. Does Google use the Boolean model? 14 / 43
Outline 1 Introduction 2 Inverted index 3 Processing Boolean queries 4 Discussion Section (next week) 15 / 43
Unstructured data in 1650: Shakespeare Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia? One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is grep not the solution? Slow (for large collections); grep is line-oriented, IR is document-oriented; “not Calpurnia” is non-trivial; other operations (e.g., find the word Romans near countryman) not feasible. Ranked retrieval (best documents to return) — later in course. 16 / 43
Term-document incidence matrix

             Anthony    Julius   The       Hamlet   Othello   Macbeth   ...
             and        Caesar   Tempest
             Cleopatra
Anthony      1          1        0         0        0         1
Brutus       1          1        0         1        0         0
Caesar       1          1        0         1        1         1
Calpurnia    0          1        0         0        0         0
Cleopatra    1          0        0         0        0         0
mercy        1          0        1         1        1         1
worser       1          0        1         1        1         0
...

Entry is 1 if the term occurs. Example: Calpurnia occurs in Julius Caesar. Entry is 0 if the term doesn't occur. Example: Calpurnia doesn't occur in The Tempest. (Shakespeare used about 32,000 different words) 17 / 43
Binary-valued vector for Brutus

             Anthony    Julius   The       Hamlet   Othello   Macbeth   ...
             and        Caesar   Tempest
             Cleopatra
Anthony      1          1        0         0        0         1
Brutus       1          1        0         1        0         0
Caesar       1          1        0         1        1         1
Calpurnia    0          1        0         0        0         0
Cleopatra    1          0        0         0        0         0
mercy        1          0        1         1        1         1
worser       1          0        1         1        1         0
...

The row for Brutus, (1 1 0 1 0 0), is the binary-valued vector for that term. 18 / 43
Incidence vectors So we have a binary-valued vector for each term. To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors for Brutus, Caesar, and Calpurnia; complement the vector of Calpurnia; do a (bitwise) AND on the three vectors: 110100 AND 110111 AND 101111 = 100100. 19 / 43
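A minimal sketch of this bitwise query in Python, assuming each matrix row is stored as an integer with one bit per play, in the column order of the matrix above (the variable names and bit layout are illustrative, not from the slides):

```python
# Incidence rows as bit vectors; leftmost column (Anthony and Cleopatra)
# is the most significant bit, rightmost column (Macbeth) the least.
N_DOCS = 6
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia
mask = (1 << N_DOCS) - 1                      # keep only the 6 document bits
answer = brutus & caesar & (~calpurnia & mask)

print(format(answer, f"0{N_DOCS}b"))          # 100100: Anthony and Cleopatra, Hamlet
```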
Answers to query Anthony and Cleopatra, Act III, Scene ii Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me. 20 / 43
Ad hoc retrieval Provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. Information need (topic about which the user desires to know more) ≠ query (what the user conveys to the computer). To assess the effectiveness of an IR system, two key statistics: Precision: what fraction of the returned results are relevant to the information need? P = (# relevant results returned) / (# results returned). Recall: what fraction of the relevant documents in the collection were returned by the system? R = (# relevant results returned) / (total # relevant documents). Example: from a 100-document collection containing 20 documents relevant to query Q, an IR system returns 10 documents, of which 9 are relevant: P = 9/10 = 0.9, R = 9/20 = 0.45. 21 / 43
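A tiny sketch of the two measures applied to the example above (the function names are illustrative):

```python
def precision(relevant_returned, returned):
    """Fraction of returned results that are relevant."""
    return relevant_returned / returned

def recall(relevant_returned, relevant_in_collection):
    """Fraction of all relevant documents that were returned."""
    return relevant_returned / relevant_in_collection

# 100-document collection, 20 relevant to Q, system returns 10, 9 of them relevant
print(precision(9, 10))   # 0.9
print(recall(9, 20))      # 0.45
```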
Bigger collections Consider N = 10^6 documents, each with about 1000 tokens. On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 GB. Assume there are M = 500,000 distinct terms in the collection. (Notice that we are making a term/token distinction.) 22 / 43
Can’t build the incidence matrix The matrix has M × N = 500,000 × 10^6 = half a trillion 0s and 1s. But it contains no more than one billion 1s, so the matrix is extremely sparse (10^9 / (5 · 10^11) = 0.2%). What is a better representation? We only record the 1s. 23 / 43
Inverted Index For each term t, we store a list of all documents that contain t.

Brutus    →  1  2  4  11  31  45  173  174
Caesar    →  1  2  4  5  6  16  57  132  ...
Calpurnia →  2  31  54  101  ...

The terms on the left make up the dictionary; each sorted list of docIDs on the right is that term's postings list. 24 / 43
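Because each postings list is kept sorted by docID, a query such as Brutus AND Calpurnia can be answered by a linear-time merge of the two lists. A minimal sketch, using the postings from the slide (the function name is illustrative):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus    = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))   # [2, 31]
```

This kind of postings merge is at the core of processing Boolean queries, the next part of the outline.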
Inverted index construction
1 Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar . . .
2 Tokenize the text, turning each document into a list of tokens: Friends Romans countrymen So . . .
3 Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: friend roman countryman so . . .
4 Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
25 / 43
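A toy sketch of the four steps, assuming a crude regex tokenizer and lowercasing as the only normalization (no stemming, unlike the friend/roman/countryman example); build_index and the sample docs are illustrative names, not from the slides:

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: dict of docID -> raw text. Returns dict of term -> sorted postings list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():                  # 1. collect documents
        tokens = re.findall(r"[a-z']+", text.lower())  # 2. tokenize, 3. normalize (lowercase only)
        for term in tokens:
            index[term].add(doc_id)                    # 4. record which docs each term occurs in
    return {term: sorted(postings) for term, postings in index.items()}

docs = {1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:"}
index = build_index(docs)
print(index["caesar"])   # [1, 2]
print(index["killed"])   # [1]
```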
Tokenization and preprocessing

Doc 1. I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

⇒

Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

26 / 43
Generate postings

Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

⇒

term       docID
i          1
did        1
enact      1
julius     1
caesar     1
i          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

27 / 43
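The term-docID pairs above are then typically sorted (by term, then by docID) and duplicates merged into per-term postings lists. A minimal sketch of that sort-and-group step, assuming the pairs fit in memory (variable names are illustrative):

```python
from itertools import groupby

pairs = [("i", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
         ("i", 1), ("was", 1), ("killed", 1), ("i'", 1), ("the", 1),
         ("capitol", 1), ("brutus", 1), ("killed", 1), ("me", 1),
         ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("with", 2),
         ("caesar", 2), ("the", 2), ("noble", 2), ("brutus", 2), ("hath", 2),
         ("told", 2), ("you", 2), ("caesar", 2), ("was", 2), ("ambitious", 2)]

pairs.sort()                                                # sort by term, then docID
postings = {term: sorted({doc_id for _, doc_id in group})   # merge duplicate pairs
            for term, group in groupby(pairs, key=lambda p: p[0])}

print(postings["caesar"])   # [1, 2]
print(postings["killed"])   # [1]
```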