SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 12: Latent Semantic Indexing and Relevance Feedback

Paul Ginsparg

Cornell University, Ithaca, NY

6 Oct 2009

SLIDE 2

Overview

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 3

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 4

Term–term Comparison

To compare two terms, take the dot product between two rows of $C$, which measures the extent to which they have a similar pattern of occurrence across the full set of documents.

The $(i,j)$ entry of $CC^T$ is equal to the dot product between rows $i$ and $j$ of $C$. Since $CC^T = U\Sigma V^T V\Sigma U^T = U\Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$, the $(i,j)$ entry is also the dot product between rows $i$ and $j$ of $U\Sigma$. Hence the rows of $U\Sigma$ can be considered as coordinates for terms, whose dot products give comparisons between terms. ($\Sigma$ just rescales the coordinates.)
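
As a concrete sanity check, here is a minimal numpy sketch; the toy term–document matrix $C$ and its values are invented purely for illustration.

    import numpy as np

    # Invented toy term-document matrix C (terms x documents), full rank.
    C = np.array([[2., 1., 0., 0.],
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # C = U @ diag(s) @ Vt

    # Rows of U @ diag(s) serve as term coordinates: their pairwise dot
    # products reproduce the entries of C @ C.T.
    term_coords = U * s  # scales column i of U by singular value s[i]
    print(np.allclose(term_coords @ term_coords.T, C @ C.T))  # True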

SLIDE 5

Document–document Comparison

To compare two documents, take the dot product between two columns of $C$, which measures the extent to which two documents have a similar profile of terms. The $(i,j)$ entry of $C^T C$ is equal to the dot product between columns $i$ and $j$ of $C$. Since $C^T C = V\Sigma U^T U\Sigma V^T = V\Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$, the $(i,j)$ entry is also the dot product between rows $i$ and $j$ of $V\Sigma$. Hence the rows of $V\Sigma$ can be considered as coordinates for documents, whose dot products give comparisons between documents. ($\Sigma$ again just rescales the coordinates.)
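
The analogous numpy check for documents, on the same invented toy matrix: the rows of $V\Sigma$ reproduce $C^T C$.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # same invented toy matrix as above
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Rows of V @ diag(s) serve as document coordinates: their pairwise
    # dot products reproduce the entries of C.T @ C.
    doc_coords = Vt.T * s
    print(np.allclose(doc_coords @ doc_coords.T, C.T @ C))  # True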

SLIDE 6

Term–document Comparison

To compare a term and a document, use directly the value of the $(i,j)$ entry of $C = U\Sigma V^T$. This is the dot product between the $i$th row of $U\Sigma^{1/2}$ and the $j$th row of $V\Sigma^{1/2}$, so use $U\Sigma^{1/2}$ and $V\Sigma^{1/2}$ as coordinates. Recall $U\Sigma$ for term–term and $V\Sigma$ for document–document comparisons: we can’t use a single set of coordinates to make both between-term-and-document and within-term-or-document comparisons, but the difference is only a $\Sigma^{1/2}$ stretch.
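
The corresponding check for mixed term–document comparisons, again on the invented toy matrix: with the $\Sigma^{1/2}$ stretch, the two coordinate sets reproduce $C$ itself.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # same invented toy matrix as above
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Term coordinates U @ diag(sqrt(s)) and document coordinates
    # V @ diag(sqrt(s)): their cross dot products reproduce C itself.
    term_coords = U * np.sqrt(s)
    doc_coords = Vt.T * np.sqrt(s)
    print(np.allclose(term_coords @ doc_coords.T, C))  # True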

SLIDE 7

Pseudo-document – document Comparison

How do we represent “pseudo-documents”, and how do we compute comparisons? E.g., given a novel query, find its location in concept space, and find its cosine w.r.t. existing documents, or other documents not in the original analysis (SVD).

A query $\vec q$ is a vector of terms, like the columns of $C$, hence considered a pseudo-document. Derive a representation for any term vector $\vec q$ to be used in document comparison formulas (like a row of $V$, as earlier). Constraint: for a real document $\vec q = \vec d^{(j)}$ (the $j$th column of $C$), and before truncation (i.e., for $C_k = C$), it should give the corresponding row of $V$. Use $q^{(s)} = \vec q\, U\Sigma^{-1}$ for comparing pseudo-docs to docs.

SLIDE 8

Pseudo-document – document Comparison: $q^{(s)} = \vec q\, U\Sigma^{-1}$

Consider the $(j,i)$ component of $C^T U\Sigma^{-1} = (V\Sigma U^T)U\Sigma^{-1} = V$. By inspection, the $j$th row of the l.h.s. corresponds to the case $\vec q = \vec d^{(j)}$:

$\bigl(C^T U\Sigma^{-1}\bigr)_{ji} = \bigl(\vec d^{(j)} U\Sigma^{-1}\bigr)_i ,$

and on the r.h.s. $V_{ji}$ is the $j$th row of $V$, as desired for comparing docs. So use $q^{(s)} = \vec q\, U\Sigma^{-1}$, which sums the corresponding rows of $U\Sigma$ and hence corresponds to placing the pseudo-document at the centroid of its term points (up to rescaling of rows by $\Sigma$). (Just as a row of $V$ scaled by $\Sigma^{1/2}$ or $\Sigma$ can be used in semantic space for making term–doc or doc–doc comparisons.) Note: all of the above applies after any preprocessing used to construct $C$.
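
A minimal numpy sketch of the $q^{(s)} = \vec q\,U\Sigma^{-1}$ mapping (untruncated case, invented full-rank toy matrix): feeding in an actual document column of $C$ recovers the corresponding row of $V$, as the slide derives.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented full-rank toy matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    def to_doc_space(q):
        """Map a term vector (pseudo-document) via q @ U @ Sigma^{-1}."""
        return (q @ U) / s

    # Check the constraint: a real document column of C maps to the
    # corresponding row of V (= column j of Vt).
    j = 2
    print(np.allclose(to_doc_space(C[:, j]), Vt[:, j]))  # True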

SLIDE 9

Selection of singular values

$C = U\Sigma V^T$, with shapes $t\times d = (t\times m)\,(m\times m)\,(m\times d)$

$C_k = U_k \Sigma_k V_k^T$, with shapes $t\times d = (t\times k)\,(k\times k)\,(k\times d)$

$m$ is the original rank of $C$. $k$ is the number of singular values chosen to represent the concepts in the set of documents. Usually, $k \ll m$. $\Sigma_k^{-1}$ is defined only on the $k$-dimensional subspace.
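
A shape-level numpy sketch of the truncation; the sizes and the stand-in data are invented for illustration.

    import numpy as np

    t, d, k = 5, 4, 2  # invented sizes: 5 terms, 4 documents, keep k = 2
    rng = np.random.default_rng(0)
    C = rng.poisson(1.0, size=(t, d)).astype(float)  # stand-in counts

    # With full_matrices=False the shapes are (t,m), (m,), (m,d),
    # where here m = min(t, d).
    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Rank-k truncation: keep only the k largest singular values.
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Ck = Uk @ np.diag(sk) @ Vtk
    print(Uk.shape, sk.shape, Vtk.shape, Ck.shape)
    # (5, 2) (2,) (2, 4) (5, 4)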

SLIDE 10

More on query document comparison

query = vector $\vec q$ in term space, with components $q_i = 1$ if term $i$ is in the query and $0$ otherwise; any query terms not in the original term vector space are ignored. In the VSM, the similarity between query $\vec q$ and the $j$th document $\vec d^{(j)}$ is given by the “cosine measure” $\vec q\cdot\vec d^{(j)}/(|\vec q|\,|\vec d^{(j)}|)$. Using the term–document matrix $C_{ij}$, this dot product is given by the $j$th component of $\vec q\cdot C$: $\vec d^{(j)} = C\vec e^{(j)}$ ($\vec e^{(j)} = j$th basis vector, a single 1 in the $j$th position, 0 elsewhere). Hence

$\mathrm{Similarity}(\vec q, \vec d^{(j)}) = \cos\theta = \frac{\vec q\cdot\vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q\cdot C\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|}. \quad (1)$
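
A minimal numpy sketch of equation (1); the toy matrix and query are invented.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented toy term-document matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])
    q = np.array([1., 0., 1., 0.])    # query containing terms 0 and 2

    # Equation (1): cosine of q against every document column of C.
    sims = (q @ C) / (np.linalg.norm(q) * np.linalg.norm(C, axis=0))
    print(sims.argsort()[::-1])       # document indices, best match first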

SLIDE 11

Now approximate C → Ck

In the LSI approximation, use $C_k$ (the rank-$k$ approximation to $C$), so the similarity measure between query and document becomes

$\frac{\vec q\cdot C\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|} \;\Longrightarrow\; \frac{\vec q\cdot C_k\vec e^{(j)}}{|\vec q|\,|C_k\vec e^{(j)}|} = \frac{\vec q\cdot \vec d^{*(j)}}{|\vec q|\,|\vec d^{*(j)}|}, \quad (2)$

where $\vec d^{*(j)} = C_k\vec e^{(j)} = U_k\Sigma_k V_k^T\vec e^{(j)}$ is the LSI representation of the $j$th document vector in the original term–document space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the $j = 1,\ldots,N$ documents and returning the best matches.
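
The same ranking computed in the rank-$k$ approximation, a sketch of equation (2) on the invented toy data from above.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented toy term-document matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])
    q = np.array([1., 0., 1., 0.])    # invented query vector
    k = 2                             # number of concepts kept

    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

    # Equation (2): cosine of q against each LSI document vector
    # d*_(j) = Ck e_(j), i.e. against the columns of Ck.
    sims = (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))
    print(sims.argsort()[::-1])       # ranking, best match first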

SLIDE 12

Pseudo-document

To see that this agrees with the prescription given in the course text (and the original LSI article), recall that the $j$th column of $V_k^T$ represents document $j$ in “concept space”: $\hat d^{(j)} = V_k^T\vec e^{(j)}$. The query $\vec q$ is considered a “pseudo-document” in this space. The LSI document vector in term space was given above as

$\vec d^{*(j)} = C_k\vec e^{(j)} = U_k\Sigma_k V_k^T\vec e^{(j)} = U_k\Sigma_k\hat d^{(j)},$

so it follows that $\hat d^{(j)} = \Sigma_k^{-1}U_k^T\vec d^{*(j)}$. The “pseudo-document” query vector $\vec q$ is translated into concept space using the same transformation: $\hat q = \Sigma_k^{-1}U_k^T\vec q$.

SLIDE 13

Compare documents in concept space

Recall the $(i,j)$ entry of $C^T C$ is the dot product between columns $i$ and $j$ of $C$ (the term vectors for documents $i$ and $j$). In the truncated space,

$C_k^T C_k = (U_k\Sigma_k V_k^T)^T(U_k\Sigma_k V_k^T) = V_k\Sigma_k U_k^T U_k\Sigma_k V_k^T = (V_k\Sigma_k)(V_k\Sigma_k)^T.$

Thus the $(i,j)$ entry is the dot product between columns $i$ and $j$ of $(V_k\Sigma_k)^T = \Sigma_k V_k^T$. In concept space, the comparison between pseudo-document $\hat q$ and document $\hat d^{(j)}$ is thus given by the cosine between $\Sigma_k\hat q$ and $\Sigma_k\hat d^{(j)}$:

$\frac{(\Sigma_k\hat q)\cdot(\Sigma_k\hat d^{(j)})}{|\Sigma_k\hat q|\,|\Sigma_k\hat d^{(j)}|} = \frac{(\vec q^{\,T}U_k\Sigma_k^{-1}\Sigma_k)\,(\Sigma_k\Sigma_k^{-1}U_k^T\vec d^{*(j)})}{|U_k^T\vec q|\,|U_k^T\vec d^{*(j)}|} = \frac{\vec q\cdot\vec d^{*(j)}}{|U_k^T\vec q|\,|\vec d^{*(j)}|}, \quad (3)$

in agreement with (2), up to an overall $\vec q$-dependent normalization which doesn’t affect similarity rankings.
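
A numpy sketch verifying the agreement between (2) and (3) on the invented toy data: the concept-space cosines differ from the term-space ones only by the constant factor $|U_k^T\vec q|/|\vec q|$.

    import numpy as np

    C = np.array([[2., 1., 0., 0.],   # invented toy term-document matrix
                  [0., 2., 1., 0.],
                  [1., 0., 2., 1.],
                  [0., 0., 1., 2.]])
    q = np.array([1., 0., 1., 0.])    # invented query vector
    k = 2

    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Ck = Uk @ np.diag(sk) @ Vtk

    # Equation (2): cosines in the original term space.
    sims2 = (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))

    # Equation (3): cosines in concept space between Sigma_k q_hat and
    # Sigma_k d_hat^(j), with q_hat = Sigma_k^{-1} Uk^T q, d_hat^(j) = Vk^T e_(j).
    q_hat = (Uk.T @ q) / sk             # Sigma_k^{-1} Uk^T q
    q_vec = sk * q_hat                  # Sigma_k q_hat (= Uk^T q)
    d_vecs = sk[:, None] * Vtk          # Sigma_k d_hat^(j) as columns
    sims3 = (q_vec @ d_vecs) / (np.linalg.norm(q_vec)
                                * np.linalg.norm(d_vecs, axis=0))

    # Identical up to an overall q-dependent normalization:
    ratio = np.linalg.norm(Uk.T @ q) / np.linalg.norm(q)
    print(np.allclose(sims3 * ratio, sims2))  # True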

SLIDE 15

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 16

How can we improve recall in search?

Main topic today: two ways of improving recall: relevance feedback and query expansion.

Example: query $q$: [aircraft]. Document $d$ contains “plane” but doesn’t contain “aircraft”. A simple IR system will not return $d$ for $q$, even if $d$ is the most relevant document for $q$!

Options for improving recall:
- Local: do a “local”, on-demand analysis for a user query. The main local method is relevance feedback.
- Global: do a global analysis once (e.g., of the collection) to produce a thesaurus; use the thesaurus for query expansion.

SLIDE 17

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 18

Relevance feedback: Basic idea

1. The user issues a (short, simple) query.
2. The search engine returns a set of documents.
3. The user marks some docs as relevant, some as nonrelevant.
4. The search engine computes a new representation of the information need, which should be better than the initial query.
5. The search engine runs the new query and returns new results.

The new results have (hopefully) better recall.

SLIDE 19

Relevance feedback

We can iterate this: several rounds of relevance feedback. We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback. We will now look at three different examples of relevance feedback that highlight different aspects of the process.

SLIDE 20

Relevance Feedback: Example 1

SLIDE 21

Results for initial query

SLIDE 22

User feedback: Select what is relevant

SLIDE 23

Results after relevance feedback

SLIDE 24

Vector space example: query “canine” (1)

source: Fernando Díaz

SLIDE 25

Similarity of docs to query “canine”

source: Fernando Díaz

SLIDE 26

User feedback: Select relevant documents

source: Fernando Díaz

SLIDE 27

Results after relevance feedback

source: Fernando Díaz

SLIDE 28

Example 3: A real (non-image) example

Initial query: New space satellite applications

Results for initial query (r = rank; “+” marks the documents the user judged relevant):

+ 1 0.539 NASA Hasn’t Scrapped Imaging Spectrometer
+ 2 0.533 NASA Scratches Environment Gear From Satellite Plan
  3 0.528 Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4 0.526 A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5 0.525 Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6 0.524 Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7 0.516 Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8 0.509 Telecommunications Tale of Two Companies

The user then marks relevant documents with “+”.

SLIDE 29

Expanded query after relevance feedback

 2.074 new           15.106 space
30.816 satellite      5.660 application
 5.991 nasa           5.196 eos
 4.196 launch         3.972 aster
 3.516 instrument     3.446 arianespace
 3.004 bundespost     2.806 ss
 2.790 rocket         2.053 scientist
 2.003 broadcast      1.172 earth
 0.836 oil            0.646 measure

SLIDE 30

Results for expanded query

(r = rank; “*” marks the documents the user had marked relevant in the feedback round)

* 1 0.513 NASA Scratches Environment Gear From Satellite Plan
* 2 0.500 NASA Hasn’t Scrapped Imaging Spectrometer
  3 0.493 When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4 0.493 NASA Uses ‘Warm’ Superconductors For Fast Circuit
* 5 0.492 Telecommunications Tale of Two Companies
  6 0.491 Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7 0.490 Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8 0.490 Rescue of Satellite By Space Agency To Cost $90 Million

SLIDE 31

Outline

1. Recap
2. Motivation for query expansion
3. Relevance feedback: Basics
4. Relevance feedback: Details

SLIDE 32

Key concept for relevance feedback: Centroid

The centroid is the center of mass of a set of points. Recall that we represent documents as points in a high-dimensional space. Thus we can compute centroids of documents. Definition:

$\vec\mu(D) = \frac{1}{|D|}\sum_{d\in D}\vec v(d)$

where $D$ is a set of documents and $\vec v(d) = \vec d$ is the vector we use to represent document $d$.
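
A minimal numpy sketch of the centroid computation; the document vectors are invented.

    import numpy as np

    # Three invented document vectors in a 4-term space.
    D = np.array([[1., 0., 2., 0.],
                  [0., 1., 2., 1.],
                  [2., 2., 2., 2.]])

    centroid = D.mean(axis=0)  # mu(D) = (1/|D|) * sum of document vectors
    print(centroid)            # [1. 1. 2. 1.]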

SLIDE 33

Centroid: Examples

[figure: two example point sets (x’s and diamonds) with their centroids]

SLIDE 34

Rocchio algorithm

The Rocchio algorithm implements relevance feedback in the vector space model. Rocchio chooses the query $\vec q_{opt}$ that maximizes

$\vec q_{opt} = \arg\max_{\vec q}\,[\,\mathrm{sim}(\vec q, \mu(D_r)) - \mathrm{sim}(\vec q, \mu(D_{nr}))\,]$

This is closely related to maximum separation between relevant and nonrelevant docs. Making some additional assumptions, we can rewrite $\vec q_{opt}$ as:

$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})]$

$D_r$: set of relevant docs; $D_{nr}$: set of nonrelevant docs

SLIDE 35

Rocchio algorithm

The optimal query vector is:

$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})] = \frac{1}{|D_r|}\sum_{\vec d_j\in D_r}\vec d_j + \left[\frac{1}{|D_r|}\sum_{\vec d_j\in D_r}\vec d_j - \frac{1}{|D_{nr}|}\sum_{\vec d_j\in D_{nr}}\vec d_j\right]$

We move the centroid of the relevant documents by the difference between the two centroids.
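
A minimal numpy sketch of $\vec q_{opt}$ on invented two-dimensional vectors:

    import numpy as np

    Dr  = np.array([[3., 1.], [5., 1.]])   # invented relevant documents
    Dnr = np.array([[1., 4.], [1., 6.]])   # invented nonrelevant documents

    mu_r, mu_nr = Dr.mean(axis=0), Dnr.mean(axis=0)
    q_opt = mu_r + (mu_r - mu_nr)          # move mu_r away from mu_nr
    print(q_opt)                           # [ 7. -3.]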

SLIDE 36

Exercise: Compute Rocchio vector

[figure omitted] Circles: relevant documents; x’s: nonrelevant documents.

SLIDE 37

Rocchio illustrated

[figure omitted]

$\vec\mu_R$: centroid of relevant documents
$\vec\mu_{NR}$: centroid of nonrelevant documents
$\vec\mu_R - \vec\mu_{NR}$: difference vector
Add the difference vector to $\vec\mu_R$ to get $\vec q_{opt}$.
$\vec q_{opt}$ separates relevant/nonrelevant perfectly.

SLIDE 38

Rocchio 1971 algorithm (SMART)

Used in practice:

$\vec q_m = \alpha\vec q_0 + \beta\,\mu(D_r) - \gamma\,\mu(D_{nr}) = \alpha\vec q_0 + \beta\frac{1}{|D_r|}\sum_{\vec d_j\in D_r}\vec d_j - \gamma\frac{1}{|D_{nr}|}\sum_{\vec d_j\in D_{nr}}\vec d_j$

$\vec q_m$: modified query vector; $\vec q_0$: original query vector; $D_r$ and $D_{nr}$: sets of known relevant and nonrelevant documents, respectively; $\alpha$, $\beta$, $\gamma$: weights attached to each term. The new query moves towards relevant documents and away from nonrelevant documents. Tradeoff $\alpha$ vs. $\beta/\gamma$: if we have a lot of judged documents, we want a higher $\beta/\gamma$. Set negative term weights to 0; a “negative weight” for a term doesn’t make sense in the vector space model.
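
A sketch of this practical Rocchio update on invented toy vectors; the function name and data are illustrative, not taken from SMART itself.

    import numpy as np

    def rocchio(q0, Dr, Dnr, alpha=1.0, beta=0.75, gamma=0.25):
        """Rocchio (1971) query update; negative weights clipped to 0."""
        qm = alpha * q0
        if len(Dr):                   # centroid of known relevant docs
            qm = qm + beta * Dr.mean(axis=0)
        if len(Dnr):                  # centroid of known nonrelevant docs
            qm = qm - gamma * Dnr.mean(axis=0)
        return np.maximum(qm, 0.0)    # no negative weights in the VSM

    # Invented toy example in a 3-term space.
    q0  = np.array([1., 0., 0.])
    Dr  = np.array([[1., 1., 0.]])    # one relevant document
    Dnr = np.array([[0., 0., 1.]])    # one nonrelevant document
    print(rocchio(q0, Dr, Dnr))       # [1.75 0.75 0.  ]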

SLIDE 39

Positive vs. negative relevance feedback

Positive feedback is more valuable than negative feedback. For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback. Many systems only allow positive feedback.
