Information Retrieval Session 12 LBSC 671 Creating Information Infrastructures
Agenda • The search process • Information retrieval • Recommender systems • Evaluation
The Memex Machine
Information Hierarchy [pyramid, from most refined and abstract at the top to least at the bottom: Wisdom → Knowledge → Information → Data]
Databases vs. IR
• What we're retrieving – Databases: structured data; clear semantics based on a formal model – IR: mostly unstructured; free text with some metadata
• Queries we're posing – Databases: formally (mathematically) defined queries; unambiguous – IR: vague, imprecise information needs (often expressed in natural language)
• Results we get – Databases: exact; always correct in a formal sense – IR: sometimes relevant, often not
• Interaction with system – Databases: one-shot queries – IR: interaction is important
• Other issues – Databases: concurrency, recovery, atomicity are critical – IR: effectiveness and usability are critical
Information “Retrieval” • Find something that you want – The information need may or may not be explicit • Known item search – Find the class home page • Answer seeking – Is Lexington or Louisville the capital of Kentucky? • Directed exploration – Who makes videoconferencing systems?
The Big Picture • The four components of the information retrieval environment: – User (user needs) – Process – System – Data (User and process: what people care about! System and data: what geeks care about!)
Information Retrieval Paradigm [diagram relating Query, Search, Browse, Select, Examine, Document, and Delivery]
Supporting the Search Process [diagram, user side: Source Selection (predict, nominate, choose) → Query Formulation → Query → Search (IR System) → Ranked List → Selection → Document → Examination → Delivery, with loops back for Query Reformulation and Relevance Feedback, and for Source Reselection]
Supporting the Search Process [diagram, system side: the same user steps (Source Selection, Query Formulation, Search, Selection, Examination, Delivery), backed by the IR System pipeline Acquisition → Collection → Indexing → Index, which Search runs against]
Human-Machine Synergy • Machines are good at: – Doing simple things accurately and quickly – Scaling to larger collections in sublinear time • People are better at: – Accurately recognizing what they are looking for – Evaluating intangibles such as “quality” • Both are pretty bad at: – Mapping consistently between words and concepts
Search Component Model [diagram: a human Information Need and Documents are judged for Utility; Query Formulation and Document Processing feed Query Processing and a Representation Function; the resulting Query Representation and Document Representation are compared by a Comparison Function to produce a Retrieval Status Value]
Ways of Finding Text • Searching metadata – Using controlled or uncontrolled vocabularies • Searching content – Characterize documents by the words they contain • Searching behavior – User-Item: Find similar users – Item-Item: Find items that cause similar reactions
Two Ways of Searching
• Free text (content-based) – Author: writes the document using terms that convey meaning → document terms – Searcher: constructs a query from terms that may appear in documents → query terms – Content-based query-document matching → retrieval status value
• Controlled vocabulary (metadata-based) – Indexer: chooses appropriate concept descriptors → document descriptors – Searcher: constructs a query from the available concept descriptors → query descriptors – Metadata-based query-document matching → retrieval status value
“Exact Match” Retrieval • Find all documents with some characteristic – Indexed as “Presidents -- United States” – Containing the words “Clinton” and “Peso” – Read by my boss • A set of documents is returned – Hopefully, not too many or too few – Usually listed in date or alphabetical order
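As a concrete illustration, here is a minimal Python sketch of exact-match (Boolean AND) retrieval over a toy inverted index; the documents, function names, and index structure are made up for the example, not taken from any particular system.

```python
# Minimal sketch of "exact match" (Boolean AND) retrieval over a toy
# inverted index. Documents and helper names are illustrative only.

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, terms):
    """Return the set of documents containing every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Clinton defends the peso rescue plan",
    2: "Peso falls against the dollar",
    3: "Clinton visits Kentucky",
}
index = build_inverted_index(docs)
print(sorted(boolean_and(index, ["clinton", "peso"])))   # -> [1]
```

Note that the result is an unordered set, which is why exact-match systems typically fall back on date or alphabetical ordering, as the slide points out.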
The Perfect Query Paradox • Every information need has a perfect document set – Finding that set is the goal of search • Every document set has a perfect query – AND every word in a document to get a query for that document – Repeat for each document in the set – OR those document queries together to get the set query • The problem isn't the system … it's the query!
Queries on the Web (1999) • Low query construction effort – 2.35 (often imprecise) terms per query – 20% use operators – 22% are subsequently modified • Low browsing effort – Only 15% view more than one page – Most look only “above the fold” • One study showed that 10% don’t know how to scroll!
Types of User Needs • Informational (30-40% of queries) – What is a quark? • Navigational – Find the home page of United Airlines • Transactional – Data: What is the weather in Paris? – Shopping: Who sells a Vaio Z505RX? – Proprietary: Obtain a journal article
Ranked Retrieval • Put most useful documents near top of a list – Possibly useful documents go lower in the list • Users can read down as far as they like – Based on what they read, time available, ... • Provides useful results from weak queries – Untrained users find exact match harder to use
Similarity-Based Retrieval • Assume “most useful” = most similar to query • Weight terms based on two criteria: – Repeated words are good cues to meaning – Rarely used words make searches more selective • Compare weights with query – Add up the weights for each query term – Put the documents with the highest total first
Simple Example: Counting Words
Query: recall and fallout measures for information retrieval
Documents:
1: Nuclear fallout contaminated Texas.
2: Information retrieval is interesting.
3: Information retrieval is complicated.
Term counts (Query / Doc 1 / Doc 2 / Doc 3):
complicated   0 0 0 1
contaminated  0 1 0 0
fallout       1 1 0 0
information   1 0 1 1
interesting   0 0 1 0
nuclear       0 1 0 0
retrieval     1 0 1 1
Texas         0 1 0 0
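The same computation as a small Python sketch; the tokenizer and scoring function here are a simplified illustration of the counting-words idea, not a production implementation.

```python
# Score each document by summing, over the query terms, how often
# each term occurs in it (the "counting words" example above).
from collections import Counter

query = "recall and fallout measures for information retrieval"
docs = {
    1: "Nuclear fallout contaminated Texas.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}

def tokens(text):
    return text.lower().replace(".", "").split()

def score(query, doc_text):
    counts = Counter(tokens(doc_text))
    return sum(counts[t] for t in tokens(query))

for doc_id, text in docs.items():
    print(doc_id, score(query, text))
# Doc 1 scores 1 (fallout); docs 2 and 3 each score 2 (information, retrieval),
# so they would be ranked above doc 1.
```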
Discussion Point: Which Terms to Emphasize? • Major factors – Uncommon terms are more selective – Repeated terms provide evidence of meaning • Adjustments – Give more weight to terms in certain positions • Title, first paragraph, etc. – Give less weight to each term in longer documents – Ignore documents that try to “spam” the index • Invisible text, excessive use of the “meta” field, …
“Okapi” Term Weights

$$w_{i,j} = \underbrace{\frac{TF_{i,j}}{0.5 + 1.5\,\frac{L_j}{\overline{L}} + TF_{i,j}}}_{\text{TF component}} \cdot \underbrace{\log\frac{N - DF_i + 0.5}{DF_i + 0.5}}_{\text{IDF component}}$$

[plots: the TF component vs. raw TF for $L_j/\overline{L}$ = 0.5, 1.0, 2.0, saturating as TF grows; the Okapi IDF component vs. raw DF, compared with the classic IDF]
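As a quick illustration, here is a small Python sketch of the weight above. The function name and the toy numbers are mine; TF, DF, N, L_j, and L̄ are taken to mean per-document term frequency, document frequency, collection size, document length, and average document length, as on the slide.

```python
# Sketch of the Okapi-style term weight shown above, assuming:
#   tf      = occurrences of term i in document j
#   df      = number of documents containing term i
#   n       = number of documents in the collection
#   doc_len / avg_len = length of document j relative to the average
import math

def okapi_weight(tf, df, n, doc_len, avg_len):
    tf_part = tf / (0.5 + 1.5 * (doc_len / avg_len) + tf)
    idf_part = math.log((n - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

# A rare term (df=2) repeated 3 times in an average-length document,
# in a collection of 1000 documents:
print(okapi_weight(tf=3, df=2, n=1000, doc_len=100, avg_len=100))
```

The TF component saturates as raw TF grows and the L_j/L̄ ratio penalizes long documents, which is what the plotted curves show.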
Index Quality • Crawl quality – Comprehensiveness, dead links, duplicate detection • Document analysis – Frames, metadata, imperfect HTML, … • Document extension – Anchor text, source authority, category, language, … • Document restriction (ephemeral text suppression) – Banner ads, keyword spam, …
Other Web Search Quality Factors • Spam suppression – “Adversarial information retrieval” – Every source of evidence has been spammed • Text, queries, links, access patterns, … • “Family filter” accuracy – Link analysis can be helpful
Indexing Anchor Text • A type of “document expansion” – Terms near links describe content of the target • Works even when you can’t index content – Image retrieval, uncrawled links, …
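A tiny illustrative sketch of the idea in Python; the link tuples, file names, and dictionary structure are hypothetical, not any real crawler's API.

```python
# Illustrative sketch of indexing anchor text ("document expansion"):
# terms near a link are credited to the *target* page, so even an
# uncrawled or non-textual target (e.g. an image) becomes searchable.
# The link data below is made up for the example.

links = [
    # (source_page, anchor_text, target_url)
    ("pageA.html", "campus map of College Park", "map.png"),
    ("pageB.html", "directions and campus map", "map.png"),
]

expanded_terms = {}
for _source, anchor_text, target in links:
    expanded_terms.setdefault(target, []).extend(anchor_text.lower().split())

print(expanded_terms["map.png"])
# The image itself has no indexable text, but it can now be found
# with queries like "campus map".
```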
Information Retrieval Types Source: Ayse Goker
Expanding the Search Space [figure: scanned document example; handwritten identity “Harriet”, excerpt “… Later, I learned that John had not heard …”]
Page Layer Segmentation • Document image generation model – A document consists of many layers, such as handwriting, machine-printed text, background patterns, tables, figures, noise, etc.
Searching Other Languages [diagram: Query Formulation → Query → Query Translation (supported by English definitions) → Translated Query → Search → Ranked List (translated “headlines”) → Selection → Examination (full document via MT) → Document Use, with a Query Reformulation loop]
Speech Retrieval Architecture [diagram: Speech Recognition, Boundary Tagging, and Content Tagging prepare the collection; Query Formulation, Automatic Search, and Interactive Selection serve the user]
High Payoff Investments [chart: OCR, MT, handwriting, and speech plotted by searchable fraction vs. transducer capabilities (accurately recognized words, words produced)]
http://www.ctr.columbia.edu/webseek/
Color Histogram Example
Rating-Based Recommendation • Use ratings to describe objects – Personal recommendations, peer review, … • Beyond topicality: – Accuracy, coherence, depth, novelty, style, … • Has been applied to many modalities – Books, Usenet news, movies, music, jokes, beer, …
Using Positive Information [rating matrix: users Joe, Ellen, Mickey, Goofy, John, Ben, and Nathan give letter grades to rides (Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear); Joe has rated Small World D, Space Mtn A, Mad Tea Pty B, Dumbo D, and his Speedway and Cntry Bear ratings (?) are to be predicted from users whose ratings agree with his]
Using Negative Information [the same rating matrix as the previous slide; here the evidence comes from users whose ratings disagree with Joe’s]
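To make the idea concrete, here is a minimal user-based collaborative filtering sketch in Python. The ratings dictionary is only a stand-in for the matrix above (letter grades mapped to numbers, with cell placements simplified), and the similarity and prediction functions are one common choice among many, not the method the slides prescribe.

```python
# Minimal user-based collaborative filtering sketch. Ratings are a stand-in
# for the slide's matrix: letter grades mapped to numbers (A=4 ... F=0),
# with illustrative (not exact) cell placements.

ratings = {
    "Joe":    {"Small World": 1, "Space Mtn": 4, "Mad Tea Pty": 3, "Dumbo": 1},
    "Ellen":  {"Small World": 4, "Space Mtn": 0, "Mad Tea Pty": 1, "Dumbo": 0},
    "Mickey": {"Small World": 4, "Space Mtn": 4, "Mad Tea Pty": 4, "Dumbo": 4,
               "Speedway": 4, "Cntry Bear": 4},
    "John":   {"Small World": 4, "Mad Tea Pty": 2, "Dumbo": 4,
               "Speedway": 2, "Cntry Bear": 4},
}

def similarity(a, b):
    """Agreement between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    avg_gap = sum(abs(a[i] - b[i]) for i in common) / len(common)
    return 1.0 / (1.0 + avg_gap)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        w = similarity(ratings[user], other_ratings)
        num += w * other_ratings[item]
        den += w
    return num / den if den else None

print(predict("Joe", "Speedway"))    # leans toward users who rate like Joe
print(predict("Joe", "Cntry Bear"))
```

Negative information fits the same framework: a user who consistently disagrees with Joe is simply a low-similarity (or, with a signed similarity measure, negatively weighted) neighbor.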
Problems with Explicit Ratings • Cognitive load on users -- people don’t like to provide ratings • Rating sparsity -- a sufficient number of raters is needed before recommendations can be made • Cold start -- no way to recommend new items that have not yet been rated by any user
Putting It All Together [table relating evidence sources (free text, behavior, metadata) to selection criteria (topicality, quality, reliability, cost, flexibility)]