Information Retrieval Introducing Information Retrieval and - PowerPoint PPT Presentation

Introduction to Information Retrieval
Introducing Information Retrieval and Web Search

Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  – These days we frequently think first of web search, but there are many other cases:
    • E-mail search
    • Searching your laptop
    • Corporate knowledge bases
    • Legal information retrieval

Unstructured (text) vs. structured (database) data in the mid-nineties

Unstructured (text) vs. structured (database) data today

Sec. 1.1 Basic assumptions of Information Retrieval
• Collection: A set of documents
  – Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

The classic search model
• User task: Get rid of mice in a politically correct way
  – Misconception?
• Info need: Info about removing mice without killing them
  – Misformulation?
• Query: how trap mice alive
• Search engine (over the Collection) → Results
• Query refinement: results feed back into a revised query

Sec. 1.1 How good are the retrieved docs?
§ Precision: Fraction of retrieved docs that are relevant to the user’s information need
§ Recall: Fraction of relevant docs in collection that are retrieved
§ More precise definitions and measurements to follow later
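The two fractions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the docID sets are made up for the example, and it assumes relevance judgments are available as a plain set of docIDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved docs that are also relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 docs retrieved, 5 docs relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)
```

Here precision is 2/4 and recall is 2/5: half of what was returned is useful, but most of the relevant documents were missed.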

Introduction to Information Retrieval
Term-document incidence matrices

Sec. 1.1 Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
  – Slow (for large corpora)
  – NOT Calpurnia is non-trivial
  – Other operations (e.g., find the word Romans near countrymen) not feasible
  – Ranked retrieval (best documents to return)
    • Later lectures

Sec. 1.1 Term-document incidence matrices
Query: Brutus AND Caesar BUT NOT Calpurnia

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                   1              0           0        0         1
Brutus               1                   1              0           1        0         0
Caesar               1                   1              0           1        1         1
Calpurnia            0                   1              0           0        0         0
Cleopatra            1                   0              0           0        0         0
mercy                1                   0              1           1        1         1
worser               1                   0              1           1        1         0

1 if play contains word, 0 otherwise

Sec. 1.1 Incidence vectors
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND.
  – 110100 AND (Brutus)
  – 110111 AND (Caesar)
  – 101111 = (Calpurnia, complemented)
  – 100100
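A minimal sketch of this query in Python, assuming each incidence vector is stored as a 6-bit integer whose leftmost bit is the first play in the table. The names `PLAYS`, `INCIDENCE`, `brutus_caesar_not_calpurnia`, and `matching_plays` are illustrative only.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 vector per term, leftmost bit = first play in PLAYS.
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

# Python integers are unbounded, so the complement is masked to 6 bits.
MASK = (1 << len(PLAYS)) - 1


def brutus_caesar_not_calpurnia():
    not_calpurnia = ~INCIDENCE["Calpurnia"] & MASK  # 101111
    return INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia


def matching_plays(bits):
    """Translate a result vector back into play titles."""
    n = len(PLAYS)
    return [play for i, play in enumerate(PLAYS) if (bits >> (n - 1 - i)) & 1]


result = brutus_caesar_not_calpurnia()
print(format(result, "06b"), matching_plays(result))
```

The result vector is 100100, which corresponds to Antony and Cleopatra and Hamlet.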

Sec. 1.1 Answers to query
• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i’ the
  Capitol; Brutus killed me.

Sec. 1.1 Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
  – 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.

Sec. 1.1 Can’t build the matrix
• 500K x 1M matrix has half-a-trillion 0’s and 1’s.
• But it has no more than one billion 1’s. Why?
  – There are only 1M x 1000 = one billion word occurrences, and each can set at most one cell to 1.
  – matrix is extremely sparse.
• What’s a better representation?
  – We only record the 1 positions.
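The back-of-the-envelope numbers for this collection can be checked directly. The constant names below are illustrative; the values are the ones assumed on the slides.

```python
N_DOCS = 1_000_000        # N: documents in the collection
WORDS_PER_DOC = 1_000     # about 1000 words each
BYTES_PER_WORD = 6        # average, including spaces/punctuation
M_TERMS = 500_000         # M: distinct terms

corpus_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD  # raw text size
matrix_cells = M_TERMS * N_DOCS                         # cells in the 0/1 matrix
max_ones = N_DOCS * WORDS_PER_DOC   # each word occurrence sets at most one cell
max_density = max_ones / matrix_cells                   # upper bound on fraction of 1's

print(f"{corpus_bytes:,} bytes of text")
print(f"{matrix_cells:,} matrix cells, at most {max_ones:,} ones")
print(f"at most {max_density:.1%} of the cells are 1")
```

At most 0.2% of the cells can be 1, which is why recording only the 1 positions is the better representation.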

Introduction to Information Retrieval
The Inverted Index
The key data structure underlying modern IR

Sec. 1.2 Inverted index
• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?

Brutus     → 1 2 4 11 31 45 173 174
Caesar     → 1 2 4 5 6 16 57 132
Calpurnia  → 2 31 54 101

What happens if the word Caesar is added to document 14?

Sec. 1.2 Inverted index
• We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable length arrays
    • Some tradeoffs in size/ease of insertion

Dictionary   Postings (each docID entry is a posting)
Brutus     → 1 2 4 11 31 45 173 174
Caesar     → 1 2 4 5 6 16 57 132
Calpurnia  → 2 31 54 101

Sorted by docID (more later on why).
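One possible in-memory layout, sketched in Python: a dict plays the role of the dictionary and each postings list is a variable-length array (a Python list) kept sorted by docID. The helper name `add_posting` is hypothetical, and the example answers the earlier question about adding Caesar to document 14.

```python
import bisect

# Dictionary -> postings, using the example lists from the slides.
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings list, keeping it sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)  # list grows as needed


add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

With a variable-length list the new posting simply slots in between 6 and 16. A fixed-size array would have had no room for it, which is the tradeoff the slide points at.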

Sec. 1.2 Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends  Romans  Countrymen
  ↓ Linguistic modules
Modified tokens: friend  roman  countryman
  ↓ Indexer
Inverted index:
friend      → 2 4
roman       → 1 2
countryman  → 13 16

Initial stages of text processing
• Tokenization
  – Cut character sequence into word tokens
    • Deal with “John’s”, a state-of-the-art solution
• Normalization
  – Map text and query term to same form
    • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
    • authorize, authorization
• Stop words
  – We may omit very common words (or not)
    • the, a, to, of
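A deliberately simple sketch of three of these stages in Python, under loud assumptions: tokenization is just whitespace splitting plus trimming of surrounding punctuation, normalization is lowercasing plus deleting periods, and stemming is left out entirely because it needs a real stemmer. The function names and the stop list are illustrative.

```python
import string

STOP_WORDS = frozenset({"the", "a", "to", "of"})  # toy stop list from the slide


def tokenize(text):
    """Cut a character sequence into word tokens (naive whitespace split)."""
    tokens = []
    for chunk in text.split():
        token = chunk.strip(string.punctuation)  # trim leading/trailing punctuation
        if token:
            tokens.append(token)
    return tokens


def normalize(token):
    """Map variants to one form, so that U.S.A. and USA match."""
    return token.replace(".", "").lower()


def preprocess(text, drop_stop_words=True):
    terms = [normalize(t) for t in tokenize(text)]
    if drop_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    return terms


print(preprocess("Friends, Romans, countrymen."))
print(preprocess("The U.S.A. and USA"))
```

Internal apostrophes and hyphens (as in John's or state-of-the-art) are kept inside a single token here. That is one arbitrary choice among several, which is exactly the kind of decision the tokenization bullet refers to.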

Sec. 1.2 Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sec. 1.2 Indexer steps: Sort
• Sort by terms
  – And then docID
Core indexing step

Sec. 1.2 Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.
Why frequency? Will discuss later.
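The three indexer steps (token sequence, sort, dictionary and postings) can be sketched as follows for the two example documents. This is an illustration, not the slides' implementation: the names `DOCS`, `token_pairs`, and `build_index` are made up, tokens are only lowercased and trimmed of a few punctuation marks, and each dictionary entry is stored as a `(document frequency, postings list)` tuple.

```python
from itertools import groupby

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def token_pairs(docs):
    """Step 1: sequence of (modified token, docID) pairs."""
    pairs = []
    for doc_id, text in docs.items():
        for chunk in text.split():
            term = chunk.strip(".,;:!?").lower()
            if term:
                pairs.append((term, doc_id))
    return pairs


def build_index(docs):
    # Step 2: sort by term, then by docID (tuples sort in exactly that order).
    pairs = sorted(token_pairs(docs))
    # Step 3: merge repeated entries and split into dictionary and postings.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:  # merge duplicates
                postings.append(doc_id)
        index[term] = (len(postings), postings)  # (doc. frequency, postings)
    return index


index = build_index(DOCS)
for term in ("brutus", "caesar", "killed"):
    print(term, index[term])
```

Because the pairs are sorted first, duplicates sit next to each other and every postings list comes out ordered by docID with no extra work.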

Sec. 1.2 Where do we pay in storage?
• Terms and counts
• Pointers
• Lists of docIDs
IR system implementation
• How do we index efficiently?
• How much storage do we need?

Introduction to Information Retrieval
Query processing with an inverted index

Sec. 1.3 The index we just built
• How do we process a query? (Our focus)
  – Later: what kinds of queries can we process?

Sec. 1.3 Query processing: AND
• Consider processing the query: Brutus AND Caesar
  – Locate Brutus in the Dictionary;
    • Retrieve its postings.
  – Locate Caesar in the Dictionary;
    • Retrieve its postings.
  – “Merge” the two postings (intersect the document sets):

Brutus  → 2 4 8 16 32 64 128
Caesar  → 1 2 3 5 8 13 21 34

Sec. 1.3 The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus  → 2 4 8 16 32 64 128
Caesar  → 1 2 3 5 8 13 21 34

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

Intersecting two postings lists (a “merge” algorithm)
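A sketch of the merge in Python, assuming both postings lists are plain lists of docIDs sorted in increasing order. The function name `intersect` and the variable names are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

For Brutus AND Caesar this yields docIDs 2 and 8. Each step advances at least one pointer, so the loop runs at most x + y times, and that bound holds only because both lists are sorted by docID.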
