SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 12: Latent Semantic Indexing and Relevance Feedback

Paul Ginsparg

Cornell University, Ithaca, NY

6 Oct 2009

SLIDE 2

Overview

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 3

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 4

Term–term Comparison

To compare two terms, take the dot product between two rows of $C$, which measures the extent to which they have a similar pattern of occurrence across the full set of documents.

The $(i,j)$ entry of $CC^T$ is equal to the dot product between rows $i$ and $j$ of $C$. Since $CC^T = U\Sigma V^T V\Sigma U^T = U\Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$, the $(i,j)$ entry is also the dot product between rows $i$ and $j$ of $U\Sigma$. Hence the rows of $U\Sigma$ can be considered as coordinates for terms, whose dot products give comparisons between terms. ($\Sigma$ just rescales the coordinates.)
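
As a concrete sanity check, here is a minimal numpy sketch; the toy term–document matrix $C$ and its values are invented purely for illustration.

    import numpy as np

    # Invented toy term-document matrix C (terms x documents), full rank.
    C = np.array([[2., 1., 0., 0.],
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # C = U @ diag(s) @ Vt

    # Rows of U @ diag(s) serve as term coordinates: their pairwise dot
    # products reproduce the entries of C @ C.T.
    term_coords = U * s  # scales column i of U by singular value s[i]
    print(np.allclose(term_coords @ term_coords.T, C @ C.T))  # True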

SLIDE 5

Document–document Comparison

To compare two documents, take the dot product between two columns of $C$, which measures the extent to which two documents have a similar profile of terms. The $(i,j)$ entry of $C^T C$ is equal to the dot product between columns $i$ and $j$ of $C$. Since $C^T C = V\Sigma U^T U\Sigma V^T = V\Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$, the $(i,j)$ entry is also the dot product between rows $i$ and $j$ of $V\Sigma$. Hence the rows of $V\Sigma$ can be considered as coordinates for documents, whose dot products give comparisons between documents. ($\Sigma$ again just rescales the coordinates.)
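
The analogous numpy check for documents, on the same invented toy matrix: the rows of $V\Sigma$ reproduce $C^T C$.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # same invented toy matrix as above
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Rows of V @ diag(s) serve as document coordinates: their pairwise
    # dot products reproduce the entries of C.T @ C.
    doc_coords = Vt.T * s
    print(np.allclose(doc_coords @ doc_coords.T, C.T @ C))  # True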

SLIDE 6

Term–document Comparison

To compare a term and a document, use directly the value of the $(i,j)$ entry of $C = U\Sigma V^T$. This is the dot product between the $i$th row of $U\Sigma^{1/2}$ and the $j$th row of $V\Sigma^{1/2}$, so use $U\Sigma^{1/2}$ and $V\Sigma^{1/2}$ as coordinates. Recall $U\Sigma$ for term–term and $V\Sigma$ for document–document comparisons: we can’t use a single set of coordinates to make both between-term-and-document and within-term-or-document comparisons, but the difference is only a $\Sigma^{1/2}$ stretch.
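
The corresponding check for mixed term–document comparisons, again on the invented toy matrix: with the $\Sigma^{1/2}$ stretch, the two coordinate sets reproduce $C$ itself.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # same invented toy matrix as above
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Term coordinates U @ diag(sqrt(s)) and document coordinates
    # V @ diag(sqrt(s)): their cross dot products reproduce C itself.
    term_coords = U * np.sqrt(s)
    doc_coords = Vt.T * np.sqrt(s)
    print(np.allclose(term_coords @ doc_coords.T, C))  # True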

SLIDE 7

Pseudo-document – document Comparison

How do we represent “pseudo-documents”, and how do we compute comparisons? E.g., given a novel query, find its location in concept space, and find its cosine w.r.t. existing documents, or other documents not in the original analysis (SVD).

A query $\vec q$ is a vector of terms, like the columns of $C$, hence considered a pseudo-document. Derive a representation for any term vector $\vec q$ to be used in document comparison formulas (like a row of $V$, as earlier). Constraint: for a real document $\vec q = \vec d^{(j)}$ (the $j$th column of $C$), and before truncation (i.e., for $C_k = C$), it should give the corresponding row of $V$. Use $q^{(s)} = \vec q\, U\Sigma^{-1}$ for comparing pseudo-docs to docs.

SLIDE 8

Pseudo-document – document Comparison: $q^{(s)} = \vec q\, U\Sigma^{-1}$

Consider the $(j,i)$ component of $C^T U\Sigma^{-1} = (V\Sigma U^T)U\Sigma^{-1} = V$. By inspection, the $j$th row of the l.h.s. corresponds to the case $\vec q = \vec d^{(j)}$:

$\bigl(C^T U\Sigma^{-1}\bigr)_{ji} = \bigl(\vec d^{(j)} U\Sigma^{-1}\bigr)_i ,$

and on the r.h.s. $V_{ji}$ is the $j$th row of $V$, as desired for comparing docs. So use $q^{(s)} = \vec q\, U\Sigma^{-1}$, which sums the corresponding rows of $U\Sigma$ and hence corresponds to placing the pseudo-document at the centroid of its term points (up to rescaling of rows by $\Sigma$). (Just as a row of $V$ scaled by $\Sigma^{1/2}$ or $\Sigma$ can be used in semantic space for making term–doc or doc–doc comparisons.) Note: all of the above applies after any preprocessing used to construct $C$.
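
A minimal numpy sketch of the $q^{(s)} = \vec q\,U\Sigma^{-1}$ mapping (untruncated case, invented full-rank toy matrix): feeding in an actual document column of $C$ recovers the corresponding row of $V$, as the slide derives.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented full-rank toy matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    def to_doc_space(q):
        """Map a term vector (pseudo-document) via q @ U @ Sigma^{-1}."""
        return (q @ U) / s

    # Check the constraint: a real document column of C maps to the
    # corresponding row of V (= column j of Vt).
    j = 2
    print(np.allclose(to_doc_space(C[:, j]), Vt[:, j]))  # True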

SLIDE 9

Selection of singular values

$C = U\Sigma V^T$, with shapes $t\times d = (t\times m)\,(m\times m)\,(m\times d)$

$C_k = U_k \Sigma_k V_k^T$, with shapes $t\times d = (t\times k)\,(k\times k)\,(k\times d)$

$m$ is the original rank of $C$. $k$ is the number of singular values chosen to represent the concepts in the set of documents. Usually, $k \ll m$. $\Sigma_k^{-1}$ is defined only on the $k$-dimensional subspace.
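
A shape-level numpy sketch of the truncation; the sizes and the stand-in data are invented for illustration.

    import numpy as np

    t, d, k = 5, 4, 2  # invented sizes: 5 terms, 4 documents, keep k = 2
    rng = np.random.default_rng(0)
    C = rng.poisson(1.0, size=(t, d)).astype(float)  # stand-in counts

    # With full_matrices=False the shapes are (t,m), (m,), (m,d),
    # where here m = min(t, d).
    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Rank-k truncation: keep only the k largest singular values.
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Ck = Uk @ np.diag(sk) @ Vtk
    print(Uk.shape, sk.shape, Vtk.shape, Ck.shape)
    # (5, 2) (2,) (2, 4) (5, 4)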

SLIDE 10

More on query document comparison

query = vector $\vec q$ in term space, with components $q_i = 1$ if term $i$ is in the query and $0$ otherwise; any query terms not in the original term vector space are ignored. In the VSM, the similarity between query $\vec q$ and the $j$th document $\vec d^{(j)}$ is given by the “cosine measure” $\vec q\cdot\vec d^{(j)}/(|\vec q|\,|\vec d^{(j)}|)$. Using the term–document matrix $C_{ij}$, this dot product is given by the $j$th component of $\vec q\cdot C$: $\vec d^{(j)} = C\vec e^{(j)}$ ($\vec e^{(j)} = j$th basis vector, a single 1 in the $j$th position, 0 elsewhere). Hence

$\mathrm{Similarity}(\vec q, \vec d^{(j)}) = \cos\theta = \frac{\vec q\cdot\vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q\cdot C\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|}. \quad (1)$
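
A minimal numpy sketch of equation (1); the toy matrix and query are invented.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented toy term-document matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])
    q = np.array([1., 0., 1., 0.])    # query containing terms 0 and 2

    # Equation (1): cosine of q against every document column of C.
    sims = (q @ C) / (np.linalg.norm(q) * np.linalg.norm(C, axis=0))
    print(sims.argsort()[::-1])       # document indices, best match first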

SLIDE 11

Now approximate C → Ck

In the LSI approximation, use $C_k$ (the rank-$k$ approximation to $C$), so the similarity measure between query and document becomes

$\frac{\vec q\cdot C\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|} \;\Longrightarrow\; \frac{\vec q\cdot C_k\vec e^{(j)}}{|\vec q|\,|C_k\vec e^{(j)}|} = \frac{\vec q\cdot \vec d^{*(j)}}{|\vec q|\,|\vec d^{*(j)}|}, \quad (2)$

where $\vec d^{*(j)} = C_k\vec e^{(j)} = U_k\Sigma_k V_k^T\vec e^{(j)}$ is the LSI representation of the $j$th document vector in the original term–document space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the $j = 1,\ldots,N$ documents and returning the best matches.
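
The same ranking computed in the rank-$k$ approximation, a sketch of equation (2) on the invented toy data from above.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented toy term-document matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])
    q = np.array([1., 0., 1., 0.])    # invented query vector
    k = 2                             # number of concepts kept

    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

    # Equation (2): cosine of q against each LSI document vector
    # d*_(j) = Ck e_(j), i.e. against the columns of Ck.
    sims = (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))
    print(sims.argsort()[::-1])       # ranking, best match first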

SLIDE 12

Pseudo-document

To see that this agrees with the prescription given in the course text (and the original LSI article), recall that the $j$th column of $V_k^T$ represents document $j$ in “concept space”: $\hat d^{(j)} = V_k^T\vec e^{(j)}$. The query $\vec q$ is considered a “pseudo-document” in this space. The LSI document vector in term space was given above as

$\vec d^{*(j)} = C_k\vec e^{(j)} = U_k\Sigma_k V_k^T\vec e^{(j)} = U_k\Sigma_k\hat d^{(j)},$

so it follows that $\hat d^{(j)} = \Sigma_k^{-1}U_k^T\vec d^{*(j)}$. The “pseudo-document” query vector $\vec q$ is translated into concept space using the same transformation: $\hat q = \Sigma_k^{-1}U_k^T\vec q$.

SLIDE 13

Compare documents in concept space

Recall the $(i,j)$ entry of $C^T C$ is the dot product between columns $i$ and $j$ of $C$ (the term vectors for documents $i$ and $j$). In the truncated space,

$C_k^T C_k = (U_k\Sigma_k V_k^T)^T(U_k\Sigma_k V_k^T) = V_k\Sigma_k U_k^T U_k\Sigma_k V_k^T = (V_k\Sigma_k)(V_k\Sigma_k)^T.$

Thus the $(i,j)$ entry is the dot product between columns $i$ and $j$ of $(V_k\Sigma_k)^T = \Sigma_k V_k^T$. In concept space, the comparison between pseudo-document $\hat q$ and document $\hat d^{(j)}$ is thus given by the cosine between $\Sigma_k\hat q$ and $\Sigma_k\hat d^{(j)}$:

$\frac{(\Sigma_k\hat q)\cdot(\Sigma_k\hat d^{(j)})}{|\Sigma_k\hat q|\,|\Sigma_k\hat d^{(j)}|} = \frac{(\vec q^{\,T}U_k\Sigma_k^{-1}\Sigma_k)\,(\Sigma_k\Sigma_k^{-1}U_k^T\vec d^{*(j)})}{|U_k^T\vec q|\,|U_k^T\vec d^{*(j)}|} = \frac{\vec q\cdot\vec d^{*(j)}}{|U_k^T\vec q|\,|\vec d^{*(j)}|}, \quad (3)$

in agreement with (2), up to an overall $\vec q$-dependent normalization which doesn’t affect similarity rankings.
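
A numpy sketch verifying the agreement between (2) and (3) on the invented toy data: the concept-space cosines differ from the term-space ones only by the constant factor $|U_k^T\vec q|/|\vec q|$.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented toy term-document matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])
    q = np.array([1., 0., 1., 0.])    # invented query vector
    k = 2

    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Ck = Uk @ np.diag(sk) @ Vtk

    # Equation (2): cosines in the original term space.
    sims2 = (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))

    # Equation (3): cosines in concept space between Sigma_k q_hat and
    # Sigma_k d_hat^(j), with q_hat = Sigma_k^{-1} Uk^T q, d_hat^(j) = Vk^T e_(j).
    q_hat = (Uk.T @ q) / sk             # Sigma_k^{-1} Uk^T q
    q_vec = sk * q_hat                  # Sigma_k q_hat (= Uk^T q)
    d_vecs = sk[:, None] * Vtk          # Sigma_k d_hat^(j) as columns
    sims3 = (q_vec @ d_vecs) / (np.linalg.norm(q_vec)
                                * np.linalg.norm(d_vecs, axis=0))

    # Identical up to an overall q-dependent normalization:
    ratio = np.linalg.norm(Uk.T @ q) / np.linalg.norm(q)
    print(np.allclose(sims3 * ratio, sims2))  # True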

SLIDE 15

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 16

How can we improve recall in search?

Main topic today: two ways of improving recall: relevance feedback and query expansion.

Example: query $q$: [aircraft]. Document $d$ contains “plane” but doesn’t contain “aircraft”. A simple IR system will not return $d$ for $q$, even if $d$ is the most relevant document for $q$!

Options for improving recall:
- Local: do a “local”, on-demand analysis for a user query. The main local method is relevance feedback.
- Global: do a global analysis once (e.g., of the collection) to produce a thesaurus; use the thesaurus for query expansion.

SLIDE 17

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 18

Relevance feedback: Basic idea

1. The user issues a (short, simple) query.
2. The search engine returns a set of documents.
3. The user marks some docs as relevant, some as nonrelevant.
4. The search engine computes a new representation of the information need, which should be better than the initial query.
5. The search engine runs the new query and returns new results.

The new results have (hopefully) better recall.

SLIDE 19

Relevance feedback

We can iterate this: several rounds of relevance feedback. We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback. We will now look at three different examples of relevance feedback that highlight different aspects of the process.

SLIDE 20

Relevance Feedback: Example 1

SLIDE 21

Results for initial query

SLIDE 22

User feedback: Select what is relevant

SLIDE 23

Results after relevance feedback

SLIDE 24

Vector space example: query “canine” (1)

source: Fernando Díaz

SLIDE 25

Similarity of docs to query “canine”

source: Fernando Díaz

SLIDE 26

User feedback: Select relevant documents

source: Fernando Díaz

SLIDE 27

Results after relevance feedback

source: Fernando Díaz

SLIDE 28

Example 3: A real (non-image) example

Initial query: New space satellite applications

Results for initial query (r = rank; “+” marks the documents the user judged relevant):

+ 1 0.539 NASA Hasn’t Scrapped Imaging Spectrometer
+ 2 0.533 NASA Scratches Environment Gear From Satellite Plan
  3 0.528 Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4 0.526 A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5 0.525 Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6 0.524 Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7 0.516 Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8 0.509 Telecommunications Tale of Two Companies

The user then marks relevant documents with “+”.

SLIDE 29

Expanded query after relevance feedback

 2.074 new           15.106 space
30.816 satellite      5.660 application
 5.991 nasa           5.196 eos
 4.196 launch         3.972 aster
 3.516 instrument     3.446 arianespace
 3.004 bundespost     2.806 ss
 2.790 rocket         2.053 scientist
 2.003 broadcast      1.172 earth
 0.836 oil            0.646 measure

SLIDE 30

Results for expanded query

(r = rank; “*” marks the documents the user had marked relevant in the feedback round)

* 1 0.513 NASA Scratches Environment Gear From Satellite Plan
* 2 0.500 NASA Hasn’t Scrapped Imaging Spectrometer
  3 0.493 When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4 0.493 NASA Uses ‘Warm’ Superconductors For Fast Circuit
* 5 0.492 Telecommunications Tale of Two Companies
  6 0.491 Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7 0.490 Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8 0.490 Rescue of Satellite By Space Agency To Cost $90 Million

SLIDE 31

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 32

Key concept for relevance feedback: Centroid

The centroid is the center of mass of a set of points. Recall that we represent documents as points in a high-dimensional space. Thus we can compute centroids of documents. Definition:

$\vec\mu(D) = \frac{1}{|D|}\sum_{d\in D}\vec v(d)$

where $D$ is a set of documents and $\vec v(d) = \vec d$ is the vector we use to represent document $d$.
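
A minimal numpy sketch of the centroid computation; the document vectors are invented.

    import numpy as np

    # Three invented document vectors in a 4-term space.
    D = np.array([[1., 0., 2., 0.],
                  [0., 1., 2., 1.],
                  [2., 2., 2., 2.]])

    centroid = D.mean(axis=0)  # mu(D) = (1/|D|) * sum of document vectors
    print(centroid)            # [1. 1. 2. 1.]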

SLIDE 33

Centroid: Examples

[figure: two example point sets (x’s and diamonds) with their centroids]

SLIDE 34

Rocchio algorithm

The Rocchio algorithm implements relevance feedback in the vector space model. Rocchio chooses the query $\vec q_{opt}$ that maximizes

$\vec q_{opt} = \arg\max_{\vec q}\,[\,\mathrm{sim}(\vec q, \mu(D_r)) - \mathrm{sim}(\vec q, \mu(D_{nr}))\,]$

This is closely related to maximum separation between relevant and nonrelevant docs. Making some additional assumptions, we can rewrite $\vec q_{opt}$ as:

$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})]$

$D_r$: set of relevant docs; $D_{nr}$: set of nonrelevant docs

SLIDE 35

Rocchio algorithm

The optimal query vector is:

$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})] = \frac{1}{|D_r|}\sum_{\vec d_j\in D_r}\vec d_j + \left[\frac{1}{|D_r|}\sum_{\vec d_j\in D_r}\vec d_j - \frac{1}{|D_{nr}|}\sum_{\vec d_j\in D_{nr}}\vec d_j\right]$

We move the centroid of the relevant documents by the difference between the two centroids.
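
A minimal numpy sketch of $\vec q_{opt}$ on invented two-dimensional vectors:

    import numpy as np

    Dr  = np.array([[3., 1.], [5., 1.]])   # invented relevant documents
    Dnr = np.array([[1., 4.], [1., 6.]])   # invented nonrelevant documents

    mu_r, mu_nr = Dr.mean(axis=0), Dnr.mean(axis=0)
    q_opt = mu_r + (mu_r - mu_nr)          # move mu_r away from mu_nr
    print(q_opt)                           # [ 7. -3.]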

SLIDE 36

Exercise: Compute Rocchio vector

[figure omitted] Circles: relevant documents; x’s: nonrelevant documents.

SLIDE 37

Rocchio illustrated

[figure omitted]

$\vec\mu_R$: centroid of relevant documents
$\vec\mu_{NR}$: centroid of nonrelevant documents
$\vec\mu_R - \vec\mu_{NR}$: difference vector
Add the difference vector to $\vec\mu_R$ to get $\vec q_{opt}$.
$\vec q_{opt}$ separates relevant/nonrelevant perfectly.

SLIDE 38

Rocchio 1971 algorithm (SMART)

Used in practice:

$\vec q_m = \alpha\vec q_0 + \beta\,\mu(D_r) - \gamma\,\mu(D_{nr}) = \alpha\vec q_0 + \beta\frac{1}{|D_r|}\sum_{\vec d_j\in D_r}\vec d_j - \gamma\frac{1}{|D_{nr}|}\sum_{\vec d_j\in D_{nr}}\vec d_j$

$\vec q_m$: modified query vector; $\vec q_0$: original query vector; $D_r$ and $D_{nr}$: sets of known relevant and nonrelevant documents, respectively; $\alpha$, $\beta$, $\gamma$: weights attached to each term. The new query moves towards relevant documents and away from nonrelevant documents. Tradeoff $\alpha$ vs. $\beta/\gamma$: if we have a lot of judged documents, we want a higher $\beta/\gamma$. Set negative term weights to 0; a “negative weight” for a term doesn’t make sense in the vector space model.
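
A sketch of this practical Rocchio update on invented toy vectors; the function name and data are illustrative, not taken from SMART itself.

    import numpy as np

    def rocchio(q0, Dr, Dnr, alpha=1.0, beta=0.75, gamma=0.25):
        """Rocchio (1971) query update; negative weights clipped to 0."""
        qm = alpha * q0
        if len(Dr):                   # centroid of known relevant docs
            qm = qm + beta * Dr.mean(axis=0)
        if len(Dnr):                  # centroid of known nonrelevant docs
            qm = qm - gamma * Dnr.mean(axis=0)
        return np.maximum(qm, 0.0)    # no negative weights in the VSM

    # Invented toy example in a 3-term space.
    q0  = np.array([1., 0., 0.])
    Dr  = np.array([[1., 1., 0.]])    # one relevant document
    Dnr = np.array([[0., 0., 1.]])    # one nonrelevant document
    print(rocchio(q0, Dr, Dnr))       # [1.75 0.75 0.  ]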

SLIDE 39

Positive vs. negative relevance feedback

Positive feedback is more valuable than negative feedback. For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback. Many systems only allow positive feedback.
