

  1. Applications (1 of 2): Information Retrieval. Kenneth Church, Kenneth.Church@jhu.edu. Dec 2, 2009

  2. Pattern Recognition Problems in Computational Linguistics • Information Retrieval: Is this doc more like relevant docs or irrelevant docs? • Author Identification: Is this doc more like author A’s docs or author B’s docs? • Word Sense Disambiguation: Is the context of this use of bank more like sense 1’s contexts or like sense 2’s contexts? • Machine Translation: Is the context of this use of drug more like those that were translated as drogue or those that were translated as médicament?

  3. Applications of Naïve Bayes
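The Naïve Bayes framing from slide 2 (is this doc more like relevant docs or irrelevant docs?) can be sketched in a few lines. This is a minimal illustration, not the deck's implementation; the training documents and the `rel`/`irrel` labels are hypothetical.

```python
import math
from collections import Counter

def train(docs, labels):
    """Per-class word counts, class document counts, and the vocabulary."""
    counts = {c: Counter() for c in set(labels)}
    priors = Counter(labels)
    for doc, c in zip(docs, labels):
        counts[c].update(doc.lower().split())
    vocab = {w for cnt in counts.values() for w in cnt}
    return counts, priors, vocab

def classify(doc, counts, priors, vocab):
    """Pick the class maximizing log Pr(class) + sum of log Pr(word | class),
    with add-one smoothing so unseen words don't zero out a score."""
    n = sum(priors.values())
    def log_score(c):
        total = sum(counts[c].values())
        s = math.log(priors[c] / n)
        for w in doc.lower().split():
            s += math.log((counts[c][w] + 1) / (total + len(vocab)))
        return s
    return max(priors, key=log_score)

# Hypothetical 'relevant' vs. 'irrelevant' training documents
docs = ["stock market trading report",
        "market prices fall sharply",
        "football match final score",
        "tennis match played today"]
labels = ["rel", "rel", "irrel", "irrel"]
model = train(docs, labels)
print(classify("stock prices report", *model))  # → rel
```

The same machinery covers the other problems on slide 2: swap the labels for author A vs. author B, sense 1 vs. sense 2, or drogue vs. médicament.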

  4. Classical Information Retrieval (IR) • Boolean Combinations of Keywords – Dominated the market (before the web) – Popular with intermediaries (librarians) • Rank Retrieval (Google) – Sort a collection of documents (e.g., scientific papers, abstracts, paragraphs) by how much they ‘‘match’’ a query – The query can be a (short) sequence of keywords or arbitrary text (e.g., one of the documents)

  5. Motivation for Information Retrieval (circa 1990, about 5 years before the web) • Text is available like never before • Currently, N ≈ 100 million words, and projections run as high as 10^15 bytes by 2000! • What can we do with it all? – It is better to do something simple than nothing at all • IR vs. Natural Language Understanding – Revival of 1950s-style empiricism

  6. How Large is Very Large? From a keynote to the EMNLP conference, formerly the Workshop on Very Large Corpora.
     Year  | Source                | Size (words)
     1788  | Federalist Papers     | 1/5 million
     1982  | Brown Corpus          | 1 million
     1987  | Birmingham Corpus     | 20 million
     1988- | Associated Press (AP) | 50 million (per year)
     1993  | MUC, TREC, Tipster    |

  7. Rising Tide of Data Lifts All Boats: If you have a lot of data, then you don’t need a lot of methodology • 1985: “There is no data like more data” – Fighting words uttered by radical fringe elements (Mercer at Arden House) • 1993: Workshop on Very Large Corpora – Perfect timing: just before the web – Couldn’t help but succeed – Fate • 1995: The web changes everything • All you need is data (magic sauce) – No linguistics – No artificial intelligence (representation) – No machine learning – No statistics – No error analysis

  8. “It never pays to think until you’ve run out of data” – Eric Brill • Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001) – Moore’s Law constant: data collection rates → improvement rates – No consistently best learner – More data is better data! – Quoted out of context: “Fire everybody and spend the money on data”

  9. Borrowed slide: Jelinek (LREC), Benefit of Data • LIMSI: Lamel (2002) – Broadcast News [Figure: WER vs. hours of training data] – Supervised: transcripts – Lightly supervised: closed captions

  10. The rising tide of data will lift all boats! TREC Question Answering & Google: What is the highest point on Earth?

  11. The rising tide of data will lift all boats! Acquiring lexical resources from data: dictionaries, ontologies, WordNets, language models, etc. http://labs1.google.com/sets
      England   | Japan      | Cat       |
      France    | China      | Dog       |
      Germany   | India      | Horse     | ls
      Italy     | Indonesia  | Fish      | rm
      Ireland   | Malaysia   | Bird      | mv
      Spain     | Korea      | Rabbit    | cd
      Scotland  | Taiwan     | Cattle    | cp
      Belgium   | Thailand   | Rat       | mkdir
      Canada    | Singapore  | Livestock | man
      Austria   | Australia  | Mouse     | tail
      Australia | Bangladesh | Human     | pwd

  12. Rising Tide of Data Lifts All Boats: If you have a lot of data, then you don’t need a lot of methodology • More data → better results – TREC Question Answering • Remarkable performance: Google and not much else – Norvig (ACL-02) – AskMSR (SIGIR-02) – Lexical Acquisition • Google Sets – We tried similar things, but with tiny corpora, which we called large

  13. Don’t worry; be happy • Applications: What good is word sense disambiguation (WSD)? – Information Retrieval (IR) • Salton: Tried hard to find ways to use NLP to help IR, but failed to find much (if anything) • Croft: WSD doesn’t help because IR is already using those methods • Sanderson (next two slides) – Machine Translation (MT) • Original motivation for much of the work on WSD • But the IR arguments may apply just as well to MT • What good is POS tagging? Parsing? NLP? Speech? • Commercial Applications of Natural Language Processing, CACM 1995 – $100M opportunity (worthy of government/industry’s attention) 1. Search (Lexis-Nexis) 2. Word Processing (Microsoft) • Warning: premature commercialization is risky (ALPAC)

  14. Sanderson (SIGIR-94) http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf • Could WSD help IR? • Answer: no, not much – Introducing ambiguity by pseudo-words doesn’t hurt (much) [Figure: retrieval effectiveness vs. query length (words)] • Short queries matter most, but are hardest for WSD

  15. Sanderson (SIGIR-94) http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf Soft WSD? • Resolving ambiguity badly is worse than not resolving at all – 75% accurate WSD degrades performance – 90% accurate WSD: breakeven point [Figure: retrieval effectiveness vs. query length (words)]

  16. IR Models • Keywords (and Boolean combinations thereof) • Vector-Space ‘‘Model’’ (Salton, chap 10.1) – Represent the query and the documents as V-dimensional vectors – Sort vectors by sim(x, y) = cos(x, y) = (∑ᵢ xᵢ · yᵢ) / (|x| · |y|) • Probabilistic Retrieval Model (Salton, chap 10.3) – Sort documents by score(d) = ∏_{w ∈ d} Pr(w | rel) / Pr(w | ¬rel)
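The vector-space ranking above can be sketched directly: represent query and documents as term-count vectors and sort by cosine. This is a minimal illustration (real systems weight terms, e.g. with tf-idf); the example documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def cosine(x, y):
    """sim(x, y) = (sum_i x_i * y_i) / (|x| * |y|)."""
    dot = sum(cnt * y[w] for w, cnt in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def rank(query, docs):
    """Sort documents by cosine similarity to the query, best first."""
    q = vectorize(query)
    return sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)

docs = ["human machine interface for computer applications",
        "the intersection graph of paths in trees"]
print(rank("human computer interaction", docs)[0])
```

Note that the query is just another vector, which is why, as the slide says, it can be a few keywords or an entire document.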

  17. Information Retrieval and Web Search: Alternative IR models. Instructor: Rada Mihalcea. Some of the slides were adapted from a course taught at Cornell University by William Y. Arms

  18. Latent Semantic Indexing • Objective: Replace indexes that use sets of index terms by indexes that use concepts • Approach: Map the term vector space into a lower-dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.

  19. Deficiencies with Conventional Automatic Indexing • Synonymy: Various words and phrases refer to the same concept (lowers recall) • Polysemy: Individual words have more than one meaning (lowers precision) • Independence: No significance is given to two terms that frequently appear together • Latent semantic indexing addresses the first of these (synonymy) and the third (dependence)

  20. Bellcore’s Example http://en.wikipedia.org/wiki/Latent_semantic_analysis
      c1: Human machine interface for Lab ABC computer applications
      c2: A survey of user opinion of computer system response time
      c3: The EPS user interface management system
      c4: System and human system engineering testing of EPS
      c5: Relation of user-perceived response time to error measurement
      m1: The generation of random, binary, unordered trees
      m2: The intersection graph of paths in trees
      m3: Graph minors IV: Widths of trees and well-quasi-ordering
      m4: Graph minors: A survey

  21. Term by Document Matrix

  22. Query Expansion • Query: Find documents relevant to human computer interaction • Simple term matching: Matches c1, c2, and c4; misses c3 and c5
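The term-matching result above can be reproduced directly from the nine Bellcore titles on slide 20. A minimal sketch (the tokenization, lowercasing plus punctuation stripping, is an assumption of this illustration):

```python
titles = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

def terms(text):
    """Lowercased word set, with surrounding punctuation stripped."""
    return {w.strip(".,:").lower() for w in text.split()}

def matches(query, titles):
    """Documents sharing at least one term with the query."""
    q = terms(query)
    return sorted(d for d, t in titles.items() if q & terms(t))

print(matches("human computer interaction", titles))  # → ['c1', 'c2', 'c4']
```

c3 and c5 are missed because they talk about the "user interface" and "user-perceived" response, sharing no literal term with the query: exactly the synonymy problem LSI is meant to address.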

  23. Large Correlations

  24. Correlations: Too Large to Ignore

  25. Correcting for Large Correlations

  26. Thesaurus

  27. Term by Doc Matrix: Before & After Thesaurus

  28. Singular Value Decomposition (SVD) X = U D Vᵀ, where X is t × d, U is t × m, D is m × m, and Vᵀ is m × d • m is the rank of X ≤ min(t, d) • D is diagonal – D² are the eigenvalues (sorted in descending order) • Uᵀ U = I and Vᵀ V = I – Columns of U are eigenvectors of X Xᵀ – Columns of V are eigenvectors of Xᵀ X
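The decomposition above, and the LSI truncation it enables, can be checked numerically with NumPy (assumed available here; the toy term-by-document counts are made up for illustration):

```python
import numpy as np

# Toy term-by-document matrix X (t = 5 terms, d = 4 docs); counts are hypothetical
X = np.array([
    [1, 1, 0, 0],   # human
    [1, 0, 0, 0],   # interface
    [1, 1, 1, 0],   # computer
    [0, 1, 1, 1],   # system
    [0, 0, 1, 1],   # trees
], dtype=float)

# Thin SVD: X = U D V^T, with orthonormal columns in U and V
U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X, U @ np.diag(s) @ Vt)           # exact reconstruction
assert np.allclose(U.T @ U, np.eye(U.shape[1]))      # U^T U = I
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))   # V^T V = I

# LSI: keep only the k largest singular values (the k latent concepts)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.matrix_rank(X_k))  # → 2
```

Documents (and queries, folded in the same way) are then compared in the k-dimensional concept space rather than the original term space.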
