Vector Space Scoring
Introduction to Information Retrieval INF 141 Donald J. Patterson
Content adapted from Hinrich Schütze http://www.informationretrieval.org
Querying
the entire corpus
the term in the corpus
Word        Collection Frequency   Document Frequency
insurance   10440                  3997
try         10422                  8760
documents
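The difference between the two statistics in the table can be sketched in a few lines of Python. This is a minimal illustration, assuming collection frequency counts every occurrence of a term in the corpus and document frequency counts the documents that contain it at least once. The function name `collection_and_document_frequency` and the three-document `toy_corpus` are hypothetical and are not taken from the slides.

```python
from collections import Counter

def collection_and_document_frequency(docs):
    """Return (cf, df) Counters for a list of tokenized documents."""
    cf = Counter()  # total occurrences of each term across the corpus
    df = Counter()  # number of documents that contain each term
    for tokens in docs:
        cf.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # each document counts at most once per term
    return cf, df

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    "try to try again".split(),
    "insurance insurance insurance".split(),
    "try the insurance".split(),
]

cf, df = collection_and_document_frequency(toy_corpus)
print(cf["insurance"], df["insurance"])  # 4 2
print(cf["try"], df["try"])              # 3 2
```

In the toy corpus "insurance" occurs more often than "try" but appears in the same number of documents, the same kind of gap between the two statistics that the table shows.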
$\mathrm{idf}_t = \log \frac{|\mathrm{corpus}|}{\mathrm{df}_t}$
$\mathrm{idf}_t = \log_{10} \frac{1{,}000{,}000}{\mathrm{df}_t}$
term        df_t        idf_t
calpurnia   1           6
animal      10          5
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
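The idf column can be checked mechanically. The sketch below assumes the base-10 logarithm and the corpus size of 1,000,000 documents used in the formula above; the function name `idf` and the dictionary `document_frequencies` are illustrative names, not part of the slides.

```python
import math

def idf(df_t, corpus_size):
    """Inverse document frequency: log10(|corpus| / df_t)."""
    return math.log10(corpus_size / df_t)

CORPUS_SIZE = 1_000_000  # corpus size assumed in the slide's formula

# Document frequencies as listed in the table.
document_frequencies = {
    "calpurnia": 1,
    "animal": 10,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

for term, df_t in document_frequencies.items():
    print(f"{term:<10} {df_t:>9,} {idf(df_t, CORPUS_SIZE):.0f}")
```

A term that occurs in every document, such as "the", gets an idf of 0 and so contributes nothing to a score, while a term that occurs in a single document gets the largest value.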
$\text{tf-idf}(t, d) = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{|\mathrm{corpus}|}{\mathrm{df}_t}$
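As a rough sketch, the weight can be computed directly from the two counts. The slide does not state the base of the logarithm, so base 10 is an assumption here (it matches the earlier idf example). The function name `tf_idf` and the convention of returning 0 for a term that does not occur are also assumptions made for illustration.

```python
import math

def tf_idf(tf_td, df_t, corpus_size):
    """(1 + log10 tf) * log10(|corpus| / df) for one term in one document.

    Returns 0.0 when the term is absent from the document or the corpus
    (an assumed convention, since log(0) is undefined).
    """
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(corpus_size / df_t)

# A term occurring 10 times in a document and found in 1,000 of
# 1,000,000 documents: (1 + 1) * 3 = 6.
print(tf_idf(10, 1_000, 1_000_000))
```

The weight grows with the term's frequency in the document and shrinks as the term becomes more common across the corpus.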
term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      13.1                   11.4            0.0           0.0      0.0       0.0
Brutus      3.0                    8.3             0.0           1.0      0.0       0.0
Caesar      2.3                    2.3             0.0           0.5      0.3       0.3
Calpurnia   0.0                    11.2            0.0           0.0      0.0       0.0
Cleopatra   17.7                   0.0             0.0           0.0      0.0       0.0
mercy       0.5                    0.0             0.7           0.9      0.9       0.3
worser      1.2                    0.0             0.6           0.6      0.6       0.0
Vector Space Scoring
common vector space.
dimensional space, with axes related to terms.
$= (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{|\mathrm{corpus}|}{\mathrm{df}_t}$
[Figure: the plays Antony and Cleopatra, Julius Caesar, Tempest, Hamlet, Othello, and Macbeth plotted on the axes "Antony" and "Brutus"]
[Figure: the plays Hamlet, Antony and Cleopatra, Julius Caesar, Tempest, Othello, and Macbeth plotted on the axes "mercy" and "worser"]
[Figure: the plays plotted on the axes "mercy" and "worser", with the query added as a vector]
content, different in length can have large differences in magnitude.
angle between them is 0.
the same, the documents will be scored as equal.
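The point about magnitude versus angle can be illustrated with a small sketch. It assumes, for simplicity, that term weights scale linearly with document length (as raw term counts would), so a document appended to itself has every weight doubled. The vector `d` and the helper names are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d = [3.0, 1.0, 0.0, 2.0]       # hypothetical term weights for a document
d_twice = [2 * w for w in d]   # the same content, twice as long

print(euclidean_distance(d, d_twice))  # clearly non-zero
print(cosine(d, d_twice))              # 1.0 up to rounding, i.e. angle 0
```

The two vectors are far apart by distance but point in the same direction, which is why the angle is the more useful measure when documents differ in length.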
equivalent
a function of angle over (0 ... 180)
document similarity
work with
their vectors to the query (also a vector)
$\mathrm{sim}(q, d_i) = \frac{\vec{V}(q) \cdot \vec{V}(d_i)}{|\vec{V}(q)|\,|\vec{V}(d_i)|}$
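Ranking by this similarity might look like the following sketch. Vectors are stored as sparse dictionaries from term to weight; the document names, the weights, and the functions `cosine_similarity` and `rank` are all hypothetical and chosen only to make the example self-contained.

```python
import math

def cosine_similarity(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_q * norm_d)

def rank(query, docs):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)

# Hypothetical document vectors and query, for illustration only.
docs = {
    "doc_a": {"mercy": 0.9, "worser": 0.6},
    "doc_b": {"mercy": 0.3},
    "doc_c": {"antony": 13.1, "worser": 1.2},
}
query = {"worser": 1.0}

ranking = rank(query, docs)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Note that `doc_c` has the larger raw weight for "worser" yet ranks below `doc_a`, because its vector is dominated by another term and so points further away from the query.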
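$\vec{V}(d_1) \cdot \vec{V}(d_2) = \cos(\theta) \cdot |\vec{V}(d_1)|\,|\vec{V}(d_2)|$
$\cos(\theta) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}$
$\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}$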
$\vec{V}(d_1) \cdot \vec{V}(d_2) = \sum_{i=t_1}^{t_n} \vec{V}(d_1)_i \, \vec{V}(d_2)_i$
$\vec{V}(d_1) \cdot \vec{V}(d_2) = (13.1 \times 11.4) + (3.0 \times 8.3) + (2.3 \times 2.3) + (0.0 \times 11.2) + (17.7 \times 0.0) + (0.5 \times 0.0) + (1.2 \times 0.0) = 179.53$
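$|\vec{V}(d_1)| = \sqrt{\sum_{i=t_1}^{t_n} \vec{V}(d_1)_i \, \vec{V}(d_1)_i}$
$|\vec{V}(d_1)| = \sqrt{13.1^2 + 3.0^2 + 2.3^2 + 0.0^2 + 17.7^2 + 0.5^2 + 1.2^2} = 22.38$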
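$|\vec{V}(d_2)| = \sqrt{\sum_{i=t_1}^{t_n} \vec{V}(d_2)_i \, \vec{V}(d_2)_i}$
$|\vec{V}(d_2)| = \sqrt{11.4^2 + 8.3^2 + 2.3^2 + 11.2^2 + 0.0^2 + 0.0^2 + 0.0^2} = 18.15$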
$\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|} = \frac{179.53}{22.38 \times 18.15} = 0.442$
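The worked example can be reproduced in a few lines, taking d1 to be the Antony and Cleopatra column and d2 the Julius Caesar column of the weight table. The variable names are illustrative; the weights are copied from the table, in the term order Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser.

```python
import math

# Weights from the table, one entry per term in the order given above.
d1 = [13.1, 3.0, 2.3, 0.0, 17.7, 0.5, 1.2]   # Antony and Cleopatra
d2 = [11.4, 8.3, 2.3, 11.2, 0.0, 0.0, 0.0]   # Julius Caesar

dot = sum(a * b for a, b in zip(d1, d2))     # numerator: dot product
norm1 = math.sqrt(sum(a * a for a in d1))    # |V(d1)|
norm2 = math.sqrt(sum(b * b for b in d2))    # |V(d2)|
sim = dot / (norm1 * norm2)                  # cosine similarity

print(f"{dot:.2f} {norm1:.2f} {norm2:.2f} {sim:.3f}")  # 179.53 22.38 18.15 0.442
```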
companies, bots are shaping documents to maximize scores
world:
and “dark roast” rows of our index
can’t effectively discriminate results.
results: “rising interest” “interest rates”
separate terms.
well together
cosine similarity
and intersections
[Figure: angle θ between two vectors, shown alongside the set intersection X ∩ Y]
matching set of dictionary terms?
idfs to deal with
words queries
query operators
query language
query” section of interface
the relevance
repeated terms
WTF(t, d)
  if tf_{t,d} = 0
    then return 0
    else return 1 + log(tf_{t,d})
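A direct Python rendering of the pseudocode above is sketched here. The pseudocode does not give the base of the logarithm, so base 10 is an assumption. The function also takes the term frequency itself rather than the pair (t, d), which is a simplification made for this example.

```python
import math

def wtf(tf_td):
    """Sublinear (log-scaled) term-frequency weight.

    tf_td is the number of times term t occurs in document d.
    """
    if tf_td == 0:
        return 0.0
    return 1 + math.log10(tf_td)

# Each tenfold increase in the raw count adds only 1 to the weight.
for count in (0, 1, 10, 100, 1000):
    print(count, wtf(count))
```

With this scaling a term repeated 1,000 times weighs four times as much as a term that occurs once, not a thousand times as much, so repeated terms raise the weight without dominating it.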