overview introduction to information retrieval

Overview Introduction to Information Retrieval Recap - PowerPoint PPT Presentation



  1. Introduction to Information Retrieval
       http://informationretrieval.org
       IIR 2: The term vocabulary and postings lists
       Hinrich Schütze, Institute for Natural Language Processing, Universität Stuttgart
       2008.04.28

     Overview
       1 Recap
       2 The term vocabulary
       3 Skip pointers
       4 Phrase queries

     Inverted index
       For each term t, we store a list of all documents that contain t.
       Brutus → 1 2 4 11 31 45 173 174
       Caesar → 1 2 4 5 6 16 57 132 ...
       Calpurnia → 2 31 54 101
       ...
       The terms on the left form the dictionary; the docID lists on the right are the postings.
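The inverted index above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the three toy documents and the name `build_inverted_index` are made up for the example, and tokens are obtained by lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization plus lowercasing is good enough here.
        for token in text.lower().split():
            index[token].add(doc_id)
    # Postings are kept sorted by docID so that lists can be merged later.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical toy collection (not from the slides).
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar and Calpurnia met Brutus",
    4: "Brutus spoke",
}
index = build_inverted_index(docs)
print(index["brutus"])     # postings list for the term "brutus"
print(index["calpurnia"])  # postings list for the term "calpurnia"
```

The dictionary keys play the role of the dictionary on the slide, and each value is one postings list.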

  2. Intersecting two postings lists
       Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
       Calpurnia → 2 → 31 → 54 → 101
       Intersection ⇒ 2 → 31
       Linear in the length of the postings lists.

     Constructing the inverted index: Sort postings
       Unsorted (term docID): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
       ⇒ Sorted (term docID): ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2

     Westlaw: Example queries
       Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company
       Query: “trade secret” /s disclos! /s prevent /s employe!
       Information need: Requirements for disabled people to be able to access a workplace
       Query: disab! /p access! /s work-site work-place (employment /3 place)
       Information need: Cases about a host’s responsibility for drunk guests
       Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
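A minimal sketch of the linear-time intersection of two sorted postings lists, using the Brutus and Calpurnia lists from the slide. The function name `intersect` is chosen for the example; the two-pointer merge assumes both lists are sorted by docID.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.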

  3. Terms and documents
       Last lecture: Simple Boolean retrieval system
       Our assumptions were:
         We know what a document is.
         We know what a term is.
       Both issues can be complex in reality.
       We’ll look a little bit at what a document is.
       But mostly at terms: How do we define and process the vocabulary of terms of a collection?

     Parsing a document
       Before we can even start worrying about terms, we need to deal with the format and language of each document.
       What format is it in? pdf, word, excel, html etc.
       What language is it in?
       What character set is in use?
       Each of these is a classification problem, which we will study later in this course (IIR 13).
       Alternative: use heuristics

     Format/Language: Complications
       A single index usually contains terms of several languages.
       Sometimes a document or its components contain multiple languages/formats.
         French email with Spanish pdf attachment
       What is the document unit for indexing?
         A file?
         An email?
         An email with 5 attachments?
         A group of files (ppt or latex in HTML)?

     Terms

  4. Definitions
       Word: A delimited string of characters as it appears in the text.
       Term: A “normalized” word (case, morphology, spelling etc); an equivalence class of words.
       Token: An instance of a word or term occurring in a document.
       Type: The same as a term in most cases: an equivalence class of tokens.

     Type/token distinction: Example
       In June, the dog likes to chase the cat in the barn.
       How many tokens? How many types?

     Recall: Inverted index construction
       Input: Friends, Romans, countrymen. So let it be with Caesar ...
       Output: friend roman countryman so ...
       Each token is a candidate for a postings entry.
       What are valid tokens to emit?

     Why tokenization is difficult, even in English
       Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
       Tokenize this sentence
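One way to work through the type/token example sentence is sketched below. The counts depend on choices the slide leaves open: this sketch assumes tokens are maximal runs of letters (punctuation dropped) and shows the number of types both with and without case folding.

```python
import re

sentence = "In June, the dog likes to chase the cat in the barn."

# Assumption: a token is a maximal run of ASCII letters.
tokens = re.findall(r"[A-Za-z]+", sentence)

# Types as distinct strings, first as written, then after lowercasing.
types_as_written = set(tokens)
types_case_folded = {token.lower() for token in tokens}

print(len(tokens))             # 12 tokens
print(len(types_as_written))   # 10 types ("In" and "in" are distinct)
print(len(types_case_folded))  # 9 types ("In" and "in" are merged)
```

The word "the" occurs three times and therefore contributes three tokens but only one type.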

  5. One word or two? (or several)
       Hewlett-Packard
       State-of-the-art
       co-education
       the hold-him-back-and-drag-him-away maneuver
       data base
       San Francisco
       Los Angeles-based company
       cheap San Francisco-Los Angeles fares
       York University vs. New York University

     Numbers
       3/12/91
       12/3/91
       Mar 12, 1991
       B-52
       100.2.86.144
       (800) 234-2333
       800.234.2333
       Older IR systems may not index numbers, but generally it’s a useful feature.

     Chinese: No whitespace
       莎拉波娃现在居住在美国东南部的佛罗里达。今年4月9日，莎拉波娃在美国第一大城市纽约度过了18岁生日。生日派对上，莎拉波娃露出了甜美的微笑。
       (English gloss: Sharapova now lives in Florida, in the southeastern United States. On April 9 this year she celebrated her 18th birthday in New York, the largest city in the US. At the birthday party, Sharapova showed a sweet smile.)

     Ambiguous segmentation in Chinese
       和尚
       The two characters can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.
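The hyphen cases above can be made concrete with two simple tokenization policies. This is a sketch, not a recommended tokenizer: the two regular expressions and the names `keep_hyphens` and `split_hyphens` are assumptions made for the illustration.

```python
import re


def keep_hyphens(text):
    """Treat hyphen-joined letter runs as a single token."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)


def split_hyphens(text):
    """Treat every run of letters as its own token."""
    return re.findall(r"[A-Za-z]+", text)


for example in ["Hewlett-Packard", "cheap San Francisco-Los Angeles fares"]:
    print(keep_hyphens(example))
    print(split_hyphens(example))
```

Keeping hyphens works for "Hewlett-Packard" but produces the odd token "Francisco-Los" in the second example; splitting on hyphens avoids that but breaks "Hewlett-Packard" in two. Neither policy recovers "San Francisco" or "Los Angeles" as a unit.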

  6. Other cases of “no whitespace”
       Compounds in Dutch and German
       Computerlinguistik → Computer + Linguistik
       Lebensversicherungsgesellschaftsangestellter → leben + versicherung + gesellschaft + angestellter
       Inuit: tusaatsiarunnanngittualuujunga (I can’t hear very well.)
       Swedish, Finnish, Greek, Urdu, many other languages

     Japanese
       [Japanese sample text lost in extraction]
       4 different “alphabets”: Chinese characters, hiragana syllabary for inflectional endings and function words, katakana syllabary for transcription of foreign words and other uses, and latin.
       No spaces (as in Chinese).
       End user can express query entirely in hiragana!

     Arabic script
       Separate letters and diacritics (read right to left): ك ِ ت ا ب ٌ (k i t ā b un)
       Connected form: كِتَابٌ
       /kitābun/ ‘a book’

     Arabic script: Bidirectionality
       [Arabic example sentence lost in extraction; its reading direction alternates ← → ← → ← from START, because the numbers 1962 and 132 are read left to right]
       ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
       Bidirectionality is not a problem if text is coded in Unicode.

  7. Back to English

     Normalization
       Need to “normalize” terms in indexed text as well as query terms into the same form.
       Example: We want to match U.S.A. and USA
       We most commonly implicitly define equivalence classes of terms.
       Alternatively: do asymmetric expansion
         window → window, windows
         windows → Windows, windows
         Windows (no expansion)
       More powerful, but less efficient
       Why don’t you want to put window, Window, windows, and Windows in the same equivalence class?

     Normalization: Other languages
       Accents: résumé vs. resume (simple omission of accent)
       Umlauts: Universität vs. Universitaet (substitution with special letter sequence “ae”)
       Most important criterion: How are users likely to write their queries for these words?
       Even in languages that standardly have accents, users often do not type them. (Polish?)
       Normalization and language detection interact.
       PETER WILL NICHT MIT. → MIT = mit
       He got his PhD from MIT. → MIT ≠ mit

     Case folding
       Reduce all letters to lower case
       Possible exceptions: capitalized words in mid-sentence
         MIT vs. mit
         Fed vs. fed
       It’s often best to lowercase everything since users will use lowercase regardless of correct capitalization.
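The two approaches above, implicit equivalence classes and asymmetric query expansion, can be sketched as follows. Everything here is an assumption made for illustration: `normalize` drops periods, lowercases, and strips combining accents via Unicode decomposition, and `EXPANSIONS` simply hard-codes the three window/windows/Windows rules from the slide.

```python
import unicodedata


def normalize(token):
    """Map a token to a representative of its equivalence class."""
    token = token.replace(".", "")  # U.S.A. -> USA
    token = token.lower()           # case folding
    # Decompose accented letters, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Asymmetric expansion: the query term decides which index terms are searched.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],
}


def expand(query_term):
    """Return the index terms to look up for a query term."""
    return EXPANSIONS.get(query_term, [query_term])


print(normalize("U.S.A."), normalize("USA"))
print(normalize("résumé"), normalize("Universität"))
print(normalize("MIT"), normalize("mit"))
print(expand("windows"), expand("Windows"))
```

Note the limits of this sketch: accent stripping turns Universität into "universitat", not the "ae" spelling Universitaet, and case folding merges MIT with the German word mit, which is exactly the problem raised on the slides.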

  8. Stop words
       stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
       Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
       Stop word elimination used to be standard in older IR systems.
       But you need stop words for phrase queries, e.g. “King of Denmark”
       Most web search engines index stop words.

     More equivalence classing
       Soundex: IIR 3 (phonetic equivalence, Tchebyshev = Chebysheff)
       Thesauri: IIR 9 (semantic equivalence, car = automobile)

     What does Google do?
       Stop words
       Normalization
       Tokenization
       Lowercasing
       Stemming
       Non-latin alphabets
       Umlauts
       Compounds
       Numbers

     Lemmatization
       Reduce inflectional/variant forms to base form
       Example: am, are, is → be
       Example: car, cars, car’s, cars’ → car
       Example: the boy’s cars are different colors → the boy car be different color
       Lemmatization implies doing “proper” reduction to dictionary headword form (the lemma).
       Inflectional morphology (cutting → cut) vs. derivational morphology (destruction → destroy)
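A small sketch of stop word removal and of lemmatization by table lookup, using the examples from the slides. The stop list is the one given above; the `LEMMAS` table is a hand-made toy that covers only the example sentence, whereas a real lemmatizer would need a dictionary and morphological analysis.

```python
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "that", "the", "to", "was",
    "were", "will", "with",
}

# Toy lookup table (assumption): just enough entries for the slide examples.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}


def remove_stop_words(tokens):
    """Drop tokens whose lowercased form is on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def lemmatize(tokens):
    """Replace each token by its lemma if the table knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


print(remove_stop_words("King of Denmark".split()))
print(" ".join(lemmatize("the boy's cars are different colors".split())))
```

After stop word removal the phrase "King of Denmark" is reduced to King and Denmark, so the index can no longer tell it apart from other texts that merely contain both words. That is the reason the slide gives for keeping stop words when phrase queries must be supported.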
