• 1 Information Retrieval Models
  Chapter 2. In R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval, 1999. Addison Wesley.
  Jon Atle Gulla / Terje Brasethvik / Geir Solskinnsbakk
• 2 Outline
  • Introduction to information retrieval
  • Logical view of documents
    – Document representations
    – The “bag-of-words” approach
  • The Classic IR Models
    – Boolean
    – Vector
    – Probabilistic
• 3 Information retrieval
  • Information retrieval = information access
  • (Document retrieval / Text retrieval / Search)
  • Retrieve documents that satisfy the user’s information need from a document collection
    – Query interpretation
    – Document representation and indexing
    – Ranking of retrieved documents
    – Linguistics, arithmetic and statistics
• 4 AllTheWeb
  • AllTheWeb: FAST’s showcase (www.alltheweb.com), 2002
  • [Figure labels: Query, Retrieved documents]
  • (www.alltheweb.com is part of Yahoo today)
• 5 IR vs. IE vs. TDM
  • Information retrieval
    – “Finding documents that are similar to the query”
      • Fulfilling an information need
    – Document retrieval / text retrieval
      • Give me information about Trondheim?
  • Information extraction
    – Extracting data
      • Extract today’s car-sales advertisements from adressa.no
  • Text mining
    – Discovering new knowledge from text
      • Ex: Pubgene (http://www.pubgene.org): discovery of genome relations through retrieval of MEDLINE articles
• 6 Document retrieval
  • Give me information about Apple Computer?
    – Article? Web site? Web store? Prices?
  • Is this flower poisonous?
    – Image? Fact sheet? Medical/biological encyclopedia?
  • How much does a ticket cost from Trondheim to Paris?
    – Airline price table? Web travel agency?
• 7 Text Retrieval vs. Database Queries?
  • Well-defined schema vs. no schema
  • Structured data vs. plain unformatted data
  • Identity of records vs. “fuzzy” similarity measures
  • Well-defined query languages and operations vs. “natural language” queries and lexical and mathematical query transformations
• 8 Document Retrieval problems
  • What is the definition of “CSCW”?
    – Finds no documents about “CSCW”
    – Finds 1M+ documents about “CSCW”
    – Finds no documents that actually define CSCW
    – Finds 50 different definitions of CSCW
• 9 Retrieval Models
  • A retrieval model is an idealization or abstraction of an actual retrieval process
  • Approximation of the retrieval situation
  • A retrieval model is not the same as a retrieval implementation
• 10 Components of a retrieval model
  • User
    – Search expert (e.g. librarian) vs. non-expert
    – Background (knowledge of topic)
    – In-depth searching vs. “just-wanna-get-an-idea” searching
  • Documents
    – Different languages
    – Semi-structured (e.g. HTML or XML) vs. plain
• 11 Retrieving vs. Browsing?
  • Open web directories
    – Yahoo, …
  • Domain specific
    – Medline, Lexis-Nexis, Jussnett, Dialog, …
  • Libraries
    – Bibsys, ACM/IEEE Diglib
  • Company intranets
    – Project workspaces
    – General information
  • WWW
    – Google, alltheweb, askJeeves, …
• 12 Taxonomy of retrieval models
  • Retrieval: ad hoc, filtering
  • Browsing
  • Classic models: Boolean, vector, probabilistic
  • Set theoretic: fuzzy sets, extended Boolean
  • Algebraic: generalized vector, latent semantic indexing, neural networks
  • Probabilistic: inference networks, belief networks
  • Structured models: non-overlapping lists, proximal nodes
  • Browsing models: flat, structure guided, hypertext
• 13 Information Retrieval Model
  • An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where
    – D is a set composed of logical views for the documents in the collection
    – Q is a set composed of logical views for the user information needs (queries)
    – F is a framework for modeling document representations, queries, and their relationships
    – R(qi, dj) is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D. Such a ranking defines an ordering among the documents with regard to the query qi.
• 14 The retrieval cycle
  • Query transformation
    – Normalization
    – Query expansion
    – Phrasing / anti-phrasing
  • Result presentation
    – Ranking
    – Clustering
    – Classification
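The quadruple on slide 13 can be made concrete with a small sketch. The code is an illustration under stated assumptions, not the model from the book: logical views are taken to be plain sets of index terms, the framework F is reduced to set operations, and the ranking function R is assumed to be the Jaccard overlap. The names `logical_view`, `rank_score` and `rank`, and the three toy documents, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def logical_view(text: str) -> Set[str]:
    """Assumed logical view: the set of lower-cased, whitespace-separated terms."""
    return set(text.lower().split())


def rank_score(q: Set[str], d: Set[str]) -> float:
    """Assumed R(qi, dj): Jaccard overlap, a real number in [0, 1]."""
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def rank(query: str, collection: Dict[str, str]) -> List[Tuple[str, float]]:
    """Order the documents by decreasing R(q, d) for a single query."""
    q = logical_view(query)
    scored = [(doc_id, rank_score(q, logical_view(text)))
              for doc_id, text in collection.items()]
    # sorted() is stable, so tied documents keep their collection order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "trondheim city guide",
    "d2": "paris travel guide",
    "d3": "flights from trondheim to paris",
}
ranking = rank("trondheim paris", docs)
```

Here `d3` shares both query terms and is ranked first. Swapping `rank_score` for another function changes the model without touching the rest of the cycle, which is the sense in which a retrieval model abstracts from an implementation.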
• 15 About document representations
  • Document meta-information
    – (author, title, date, URI, …)
  • Index term selection?
    – Automated indexing: bag of words
    – User-selected words: keywords
    – Controlled vocabularies
  • Document structure
  • Document type
• 16 Index term selection
  • [Figure: steps leading from a document to index term selection. Box labels: language detection, encoding, transliteration, phrasing, stemming, meta-data extraction, document type recognition, structure recognition, word analysis, document categorization]
• 17 Bag-of-words approach
  • A document is an unordered list of words/tokens
    – Grammatical information is lost
  • Tokenization: What is a word?
    – Is “White House” one or two words?
  • Case folding
    – “President Bush” becomes “president”, “bush”
  • Stemming or lemmatization
    – Morphological information is thrown away: “agreements” becomes “agreement” (lemmatization) or even “agree” (stemming)
• 18 Some repetition
  • IR = retrieval of documents that seem to be similar to the user’s information need
  • Information retrieval models
    – Users -> Query
    – Documents -> Document representation
    – Similarity function -> sim(q, di)
  • Document representations
    – (logical views of documents)
    – Index term selection
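The preprocessing steps of slide 17 and the similarity function sim(q, di) of slide 18 can be sketched together. This is a minimal illustration under assumptions: tokens are runs of word characters, the stemmer is a toy suffix stripper (not Porter's algorithm or a real lemmatizer), and sim is assumed to be the cosine of raw term-count vectors, which is only one possible choice. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Split into runs of word characters, so "White House" yields two tokens."""
    return re.findall(r"\w+", text)


def case_fold(tokens: List[str]) -> List[str]:
    """Lower-case every token."""
    return [t.lower() for t in tokens]


def toy_stem(token: str) -> str:
    """Toy suffix stripping, for illustration only: drop a final "s", then "ment"."""
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    if token.endswith("ment") and len(token) > 6:
        token = token[:-4]
    return token


def bag_of_words(text: str) -> Counter:
    """Unordered term counts; word order and grammar are discarded."""
    return Counter(toy_stem(t) for t in case_fold(tokenize(text)))


def sim(q: Counter, d: Counter) -> float:
    """Assumed similarity: cosine of the two term-count vectors."""
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


query = bag_of_words("agreement")
document = bag_of_words("President Bush signs agreements")
score = sim(query, document)
```

Because both “agreement” and “agreements” are reduced to “agree”, the query matches the document even though the surface forms differ. The same crude rules also damage other words (for example “mars” becomes “mar”), which is why real systems use carefully designed stemmers or lemmatizers.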
• 19 Example “bag of words”
  Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.
  a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2x), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2x), planet, possible, red, scientists, that, the, to
• 20 What is this about?
  Bag of words from a Norwegian text:
  allmennviteskapelige, at (2x), av, bredt, datateknikk (2x), de (2x), doktorgradsstudier, Dr.ing., dr.scient., emner, en, et, etter-, fagtilbud, fleste, grunn-, har, hoveddel, hovedfagsstudier, i (3x), Instituttet (2x), informasjonsvitenskap., informatikk, innen, innenfor, kurs, leverer (2x), mellom-, NTNU, NTNUs (2x), og (5x), også, områder, samt (2x), selvsagt, sivilingeniørstudium, Som, studiene, til, tilbyr (3x), undervisning, undervisningen, universitetsinstitutt, ved (2x), vi (2x), videre., videreutdanningstilbud
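The bags on slides 19 and 20 list each distinct term once, in alphabetical order, with a count in parentheses when it occurs more than once. A minimal sketch of how such a listing could be produced, assuming tokens are lower-cased runs of letters; `bag_listing` is a hypothetical helper name and the input is the sentence from slide 19:

```python
import re
from collections import Counter


def bag_listing(text: str) -> str:
    """Return the distinct terms of a text, sorted, as 'term' or 'term (Nx)'."""
    # [^\W\d_]+ matches runs of letters only, including non-ASCII letters.
    counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    parts = [f"{term} ({n}x)" if n > 1 else term
             for term, n in sorted(counts.items())]
    return ", ".join(parts)


sentence = (
    "Scientists have found compelling new evidence of possible ancient "
    "microscopic life on Mars, derived from magnetic crystals in a meteorite "
    "that fell to Earth from the red planet, NASA announced on Monday."
)
listing = bag_listing(sentence)
print(listing)
```

The listing no longer records which word followed which, so the sentence cannot be rebuilt from it; only the vocabulary and the term frequencies survive.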
• 21 What is this about?
  Source text of the bag of words on the previous slide (Norwegian):
  Instituttet har et bredt fagtilbud og tilbyr undervisning i emner innenfor de fleste områder innen datateknikk og informasjonsvitenskap. Instituttet leverer en hoveddel av undervisningen ved NTNUs sivilingeniørstudium i datateknikk, samt at vi tilbyr grunn-, mellom- og hovedfagsstudier i informatikk ved de allmennviteskapelige studiene. Som universitetsinstitutt tilbyr vi selvsagt også doktorgradsstudier (dr.ing. og dr.scient.), samt at vi leverer kurs til NTNUs etter- og videreutdanningstilbud - NTNU videre.
  English translation:
  The department has a broad range of courses and offers teaching in subjects within most areas of computer engineering and information science. The department delivers a main part of the teaching in NTNU's engineering degree programme (sivilingeniørstudium) in computer engineering, and we also offer foundation, intermediate and graduate-level studies in informatics within the general science programmes. As a university department we of course also offer doctoral studies (dr.ing. and dr.scient.), and we deliver courses to NTNU's continuing and further education programme, NTNU videre.
• 22 “The language problem”
  • [Figure: a query Q, marked “?”, set against several document representations (D rep)]