Computing Relevance, Similarity: The Vector Space Model
Chapter 27, Part B
Based on Larson and Hearst's slides at UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/

Document Vectors

■ Documents are represented as "bags of words"
■ Represented as vectors when used computationally
  • A vector is like an array of floating-point numbers
  • Has direction and magnitude
  • Each vector holds a place for every term in the collection
  • Therefore, most vectors are sparse

Document Vectors: One location for each word

         nova  galaxy  heat  h’wood  film  role  diet  fur
    A     10      5      3
    B      5     10
    C                   10      8      7
    D                    9     10      5
    E                          10     10
    F                           9     10
    G      5      7                          9
    H      6     10      2      8
    I                                  7     5     1     3

"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, and "Heat" occurs 3 times in text A. (A blank means 0 occurrences.)
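To make the bag-of-words idea concrete, here is a minimal Python sketch (not part of the original slides; the two-document mini-corpus and the whitespace tokenizer are illustrative assumptions) that builds term-count vectors with one slot per vocabulary term:

    from collections import Counter

    # Illustrative mini-corpus in the style of the table above.
    docs = {
        "A": "nova nova galaxy heat nova galaxy nova",
        "B": "nova galaxy galaxy",
    }

    # Shared vocabulary: each vector holds a place for every term in the collection.
    vocab = sorted({term for text in docs.values() for term in text.split()})

    def to_vector(text):
        """Bag of words: word order is discarded, only per-term counts remain."""
        counts = Counter(text.split())
        return [counts.get(term, 0) for term in vocab]  # mostly zeros => sparse

    vectors = {doc_id: to_vector(text) for doc_id, text in docs.items()}
    print(vocab)         # ['galaxy', 'heat', 'nova']
    print(vectors["A"])  # [2, 1, 4]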
Document Vectors

Document ids run down the left-hand side; terms run across the top:

         nova  galaxy  heat  h’wood  film  role  diet  fur
    A     10      5      3
    B      5     10
    C                   10      8      7
    D                    9     10      5
    E                          10     10
    F                           9     10
    G      5      7                          9
    H      6     10      2      8
    I                                  7     5     1     3

We Can Plot the Vectors

[Figure: documents plotted in a two-dimensional term space with axes "Star" and "Diet"; the documents about movie stars, about astronomy, and about mammal behavior each fall in a different region of the plane.]

Assumption: Documents that are "close" in space are similar. (A small plotting sketch follows the next slide.)

Vector Space Model

■ Documents are represented as vectors in term space
  • Terms are usually stems
  • Documents represented by binary vectors of terms
■ Queries are represented the same way as documents
■ A vector distance measure between the query and documents is used to rank retrieved documents
  • Query–document similarity is based on the length and direction of their vectors
  • Vector operations capture boolean query conditions
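Referring back to "We Can Plot the Vectors": a minimal matplotlib sketch of how such a plot might be drawn, assuming invented two-term coordinates for three hypothetical documents:

    import matplotlib.pyplot as plt

    # Invented 2-D term space: x = count of "star", y = count of "diet".
    docs = {
        "movie-star doc": (9, 1),
        "astronomy doc": (8, 0),
        "mammal-behavior doc": (1, 8),
    }

    for label, (star, diet) in docs.items():
        plt.scatter(star, diet)
        plt.annotate(label, (star, diet))

    plt.xlabel("star")
    plt.ylabel("diet")
    plt.title("Documents that are close in term space are assumed similar")
    plt.show()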
Vector Space Documents and Queries

    docs   t1   t2   t3   RSV = Q·Di
    D1      1    0    1    4
    D2      1    0    0    1
    D3      0    1    1    5
    D4      1    0    0    1
    D5      1    1    1    6
    D6      1    1    0    3
    D7      0    1    0    2
    D8      0    1    0    2
    D9      0    0    1    3
    D10     0    1    1    5
    D11     1    0    1    4
    Q       1    2    3        (q1, q2, q3)

Boolean term combinations: Q is a query – also represented as a vector.

[Figure: the eleven documents scattered in the t1–t2–t3 term space.]

Assigning Weights to Terms

■ Binary weights
■ Raw term frequency
■ tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are frequent in relevant documents, BUT infrequent in the collection as a whole

Binary Weights

■ Only the presence (1) or absence (0) of a term is included in the vector

    docs   t1   t2   t3
    D1      1    0    1
    D2      1    0    0
    D3      0    1    1
    D4      1    0    0
    D5      1    1    1
    D6      1    1    0
    D7      0    1    0
    D8      0    1    0
    D9      0    0    1
    D10     0    1    1
    D11     1    0    1
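A small sketch (mine, not the slides') that recomputes the RSV column above as the inner product Q · Di and ranks the documents by it:

    # Binary document vectors over terms (t1, t2, t3), as in the table above.
    docs = {
        "D1":  (1, 0, 1), "D2":  (1, 0, 0), "D3":  (0, 1, 1), "D4": (1, 0, 0),
        "D5":  (1, 1, 1), "D6":  (1, 1, 0), "D7":  (0, 1, 0), "D8": (0, 1, 0),
        "D9":  (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
    }
    Q = (1, 2, 3)  # query term weights q1, q2, q3

    def rsv(q, d):
        """Retrieval status value: the inner product of query and document."""
        return sum(qi * di for qi, di in zip(q, d))

    # Rank documents by descending RSV; D5 comes first with RSV = 6.
    for doc_id, vec in sorted(docs.items(), key=lambda kv: -rsv(Q, kv[1])):
        print(doc_id, rsv(Q, vec))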
Raw Term Weights

■ The frequency of occurrence of the term in each document is included in the vector

    docs   t1   t2   t3
    D1      2    0    3
    D2      1    0    0
    D3      0    4    7
    D4      3    0    0
    D5      1    6    3
    D6      3    5    0
    D7      0    8    0
    D8      0   10    0
    D9      0    0    1
    D10     0    3    5
    D11     4    0    1

TF x IDF Weights

■ tf x idf measure:
  • Term Frequency (tf)
  • Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution
■ Goal: Assign a tf x idf weight to each term in each document

TF x IDF Calculation

    w_{ik} = \mathrm{tf}_{ik} \cdot \log(N / n_k)

where
    T_k              = term k in document D_i
    \mathrm{tf}_{ik} = frequency of term T_k in document D_i
    \mathrm{idf}_k   = inverse document frequency of term T_k in collection C
                     = \log(N / n_k)
    N                = total number of documents in the collection C
    n_k              = the number of documents in C that contain T_k
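A hedged sketch of the calculation above, applied to the raw-term-weight table; the base-10 logarithm is an assumption that matches the worked IDF examples on the next slide:

    import math

    # Raw term frequencies from the table above: rows D1..D11, columns t1..t3.
    tf = {
        "D1":  (2, 0, 3), "D2":  (1, 0, 0), "D3":  (0, 4, 7), "D4": (3, 0, 0),
        "D5":  (1, 6, 3), "D6":  (3, 5, 0), "D7":  (0, 8, 0), "D8": (0, 10, 0),
        "D9":  (0, 0, 1), "D10": (0, 3, 5), "D11": (4, 0, 1),
    }

    N = len(tf)  # total number of documents in the collection
    # n_k: the number of documents that contain term k
    n = [sum(1 for row in tf.values() if row[k] > 0) for k in range(3)]
    # idf_k = log(N / n_k); base 10 assumed, matching the next slide's examples
    idf = [math.log10(N / n_k) for n_k in n]

    # w_ik = tf_ik * idf_k
    weights = {d: [tf_ik * idf_k for tf_ik, idf_k in zip(row, idf)]
               for d, row in tf.items()}
    print([round(w, 3) for w in weights["D1"]])  # [0.526, 0.0, 0.79]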
Inverse Document Frequency

■ IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

    \log(10000 / 10000) = 0
    \log(10000 / 5000)  = 0.301
    \log(10000 / 20)    = 2.699
    \log(10000 / 1)     = 4

TF x IDF Normalization

■ Normalize the term weights (so longer documents are not unfairly given more weight)
  • The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document.

    w_{ik} = \frac{\mathrm{tf}_{ik} \, \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (\mathrm{tf}_{ik})^2 \, [\log(N / n_k)]^2}}

Pair-wise Document Similarity

         nova  galaxy  heat  h’wood  film  role  diet  fur
    A      1      3      1
    B      5      2
    C                           2      1     5
    D                           4      1

How to compute document similarity?
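A minimal sketch of the length normalization above: divide each weight vector by its Euclidean norm, so that a long document and a short document pointing in the same direction receive identical normalized vectors (the two weight vectors are invented for illustration):

    import math

    def normalize(weights):
        """Divide a tf*idf vector by its Euclidean length (cosine normalization)."""
        norm = math.sqrt(sum(w * w for w in weights))
        return [w / norm for w in weights] if norm else weights

    # Hypothetical tf*idf weights: the long document is 4x the short one.
    long_doc  = [12.0, 4.0, 8.0]
    short_doc = [3.0, 1.0, 2.0]
    print(normalize(long_doc))   # same direction ...
    print(normalize(short_doc))  # ... so identical normalized vectors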
Pair-wise Document Similarity

    D_1 = (w_{11}, w_{12}, \ldots, w_{1t})
    D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

    \mathrm{sim}(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}

For the table below:

    sim(A, B) = (1 × 5) + (2 × 3) = 11
    sim(A, C) = 0
    sim(A, D) = 0
    sim(B, C) = 0
    sim(B, D) = 0
    sim(C, D) = (2 × 4) + (1 × 1) = 9

         nova  galaxy  heat  h’wood  film  role  diet  fur
    A      1      3      1
    B      5      2
    C                           2      1     5
    D                           4      1

Pair-wise Document Similarity (cosine normalization)

    D_1 = (w_{11}, w_{12}, \ldots, w_{1t})
    D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

    unnormalized:       \mathrm{sim}(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}

    cosine normalized:  \mathrm{sim}(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i} \, w_{2i}}{\sqrt{\sum_{i=1}^{t} (w_{1i})^2 \; \sum_{i=1}^{t} (w_{2i})^2}}

Vector Space "Relevance" Measure

    D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})
    Q   = (w_{q1}, w_{q2}, \ldots, w_{qt})        (w = 0 if a term is absent)

If term weights are normalized:

    \mathrm{sim}(Q, D_i) = \sum_{j=1}^{t} w_{qj} \cdot w_{d_{ij}}

Otherwise, normalize in the similarity comparison:

    \mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2 \; \sum_{j=1}^{t} (w_{d_{ij}})^2}}
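Both variants above, expressed as a short sketch of mine over documents A–D from the table:

    import math

    docs = {  # term weights over (nova, galaxy, heat, h'wood, film, role, diet, fur)
        "A": [1, 3, 1, 0, 0, 0, 0, 0],
        "B": [5, 2, 0, 0, 0, 0, 0, 0],
        "C": [0, 0, 0, 2, 1, 5, 0, 0],
        "D": [0, 0, 0, 4, 1, 0, 0, 0],
    }

    def sim(d1, d2):
        """Unnormalized similarity: the inner product of the two vectors."""
        return sum(a * b for a, b in zip(d1, d2))

    def cosine(d1, d2):
        """Cosine-normalized similarity: inner product over product of lengths."""
        return sim(d1, d2) / math.sqrt(sim(d1, d1) * sim(d2, d2))

    print(sim(docs["A"], docs["B"]))  # 11, as on the slide
    print(sim(docs["C"], docs["D"]))  # 9
    print(round(cosine(docs["A"], docs["B"]), 3))  # 0.616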
Computing Relevance Scores

Say we have query vector Q = (0.4, 0.8), and also document D_2 = (0.2, 0.7). What does their similarity comparison yield?

    \mathrm{sim}(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98

Vector Space with Term Weights and Cosine Matching

    D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})
    Q   = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})

    \mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2 \; \sum_{j=1}^{t} (w_{d_{ij}})^2}}

With Q = (0.4, 0.8), D1 = (0.8, 0.3), and D2 = (0.2, 0.7):

    \mathrm{sim}(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98

    \mathrm{sim}(Q, D_1) = \frac{0.56}{\sqrt{0.58}} \approx 0.74

[Figure: Q, D1, and D2 plotted against axes Term A and Term B; the angle between Q and D2 is smaller than the angle between Q and D1.]

Text Clustering

■ Finds overall similarities among groups of documents
■ Finds overall similarities among groups of tokens
■ Picks out some themes, ignores others
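A quick check of the worked example (my code, not the slides'):

    import math

    def cosine(q, d):
        """Cosine similarity between a query vector and a document vector."""
        dot = sum(a * b for a, b in zip(q, d))
        return dot / math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))

    Q  = (0.4, 0.8)
    D1 = (0.8, 0.3)
    D2 = (0.2, 0.7)

    print(round(cosine(Q, D2), 2))  # 0.98 -- D2 ranks first
    print(round(cosine(Q, D1), 2))  # 0.73 (the slide's 0.74 reflects intermediate rounding)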
Text Clustering

Clustering is "the art of finding groups in data."
    -- Kaufman and Rousseeuw

[Figure: points in a Term 1–Term 2 plane falling into visually separate clusters.]

A small clustering sketch follows the last slide below.

Problems with Vector Space

■ There is no real theoretical basis for the assumption of a term space
  • It is more for visualization than having any real basis
  • Most similarity measures work about the same
■ Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms; remember our discussion of correlated terms in text

Probabilistic Models

■ A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
■ Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle)
■ Relies on accurate estimates of probabilities
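As promised above, a hedged clustering sketch (not from the slides): single-link grouping of invented two-term document vectors by a cosine-similarity threshold, standing in for a full clustering algorithm:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

    # Toy vectors over (star, diet): two "movie/astronomy" docs, two "mammal" docs.
    docs = {"A": (9.0, 1.0), "B": (8.0, 0.5), "C": (1.0, 8.0), "D": (0.5, 9.0)}

    # Single-link grouping: a document joins a cluster whenever its cosine
    # similarity to any member exceeds the threshold; otherwise it starts one.
    clusters = []
    for doc_id, vec in docs.items():
        for cluster in clusters:
            if any(cosine(vec, docs[other]) > 0.9 for other in cluster):
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])

    print(clusters)  # [['A', 'B'], ['C', 'D']]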