
Ranking and Preference in Database Search
Kevin Chen-Chuan Chang


  1. Ranking and Preference in Database Search: a) Similarity and Relevance
     (Kevin Chen-Chuan Chang).
     Ranking: ordering according to the degree of some fuzzy notion:
     - Similarity (or dissimilarity)
     - Relevance
     - Preference
     A query Q is answered with a ranking of objects.
     Similarity: are they similar? E.g., two images.

  2. Similarity: are they similar? E.g., two strings.
     So, similarity is not a Boolean notion: it is relative, i.e., a ranking.
     Ranking by similarity (similarity-based ranking): rank by a "distance"
     function (or "dissimilarity") d(Q, O_i) between the query Q and each
     object O_i.
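The idea of similarity-based ranking by a distance function d(Q, O_i) can be sketched in a few lines. This is a minimal illustration; the function names and the toy string "distance" are mine, not from the slides:

```python
# Illustrative sketch: rank a collection of objects by their distance
# d(Q, O_i) to a query object Q, smallest (most similar) first.

def rank_by_distance(query, objects, d):
    """Return objects sorted by increasing distance to the query."""
    return sorted(objects, key=lambda o: d(query, o))

# A crude string "dissimilarity" for demonstration only (NOT edit
# distance): count differing positions plus the length difference.
def toy_string_distance(a, b):
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs + abs(len(a) - len(b))

print(rank_by_distance("cat", ["dog", "cart", "cut", "cat"],
                       toy_string_distance))
# -> ['cat', 'cut', 'cart', 'dog']
```

Any dissimilarity function can be plugged in for `d`; the rest of the section is about which functions make sense and how to evaluate such rankings efficiently.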

  3. The "space": defined by the objects and their distances.
     - Object representation: vector or not?
     - Distance function: metric or not?

     Vector space: what is a vector space? (S, d) is a vector space if:
     - each object in S is a k-dimensional vector, x = (x_1, ..., x_k),
       y = (y_1, ..., y_k), and
     - the distance d(x, y) between any x and y is a metric.

     Vector space distance functions: the L_p distance functions.
     The general form:
         L_p(x: (x_1, ..., x_k), y: (y_1, ..., y_k))
             = ( sum_{i=1}^{k} |x_i - y_i|^p )^(1/p)
     AKA the p-norm distance or Minkowski distance.

     L_1, the Manhattan distance: let p = 1 in L_p:
         L_1(x, y) = sum_{i=1}^{k} |x_i - y_i|
     the Manhattan or "block" distance. Does this look familiar?
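The L_p family above translates directly into code. A minimal sketch (function names are mine):

```python
# The general L_p ("Minkowski") distance from the slide, plus the
# p = 1 (Manhattan) special case.

def lp_distance(x, y, p):
    """L_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p)."""
    assert len(x) == len(y)
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L_1: sum of absolute coordinate differences ("block" distance)."""
    return lp_distance(x, y, 1)

print(manhattan((0, 0), (3, 4)))        # |3| + |4| = 7.0
print(lp_distance((0, 0), (3, 4), 2))   # Euclidean: sqrt(9 + 16) = 5.0
```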

  4. L_2, the Euclidean distance: let p = 2 in L_p:
         L_2(x, y) = ( sum_{i=1}^{k} (x_i - y_i)^2 )^(1/2)
     the shortest distance between two points.

     The cosine measure:
         sim(x, y) = cos(theta)
             = ( sum_i x_i * y_i )
               / ( sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2) )

     How to evaluate vector-space queries? Consider an L_p measure, say
     L_2, as the ranking function: given query object Q, find objects O_i
     in increasing order of d(Q, O_i). How to evaluate this query? What
     index structure? As nearest-neighbor queries, using multidimensional
     or spatial indexes, e.g., the R-tree [Guttman, 1984].

     Sounds abstract? That's actually how Web search engines (like Google)
     work: vector space modeling with the cosine measure, or the "TF-IDF"
     model. E.g., for Q: "apple computer", with Q = (x_1, ..., x_k) and
     D = (y_1, ..., y_k), Sim(Q, D) = sum_i x_i * y_i.
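The cosine measure above is easy to state in code. A minimal sketch, assuming dense vectors of equal length:

```python
import math

# Cosine measure from the slide:
#   sim(x, y) = (sum x_i*y_i) / (sqrt(sum x_i^2) * sqrt(sum y_i^2))

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine((1, 0), (0, 1)))   # orthogonal vectors -> 0.0
print(cosine((1, 2), (2, 4)))   # same direction -> approximately 1.0
```

Note that cosine is a similarity (larger is better), unlike the L_p functions, which are dissimilarities (smaller is better).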

  5. How to evaluate vector-space queries? Consider the cosine measure,
     Sim(Q, D) = sum_i x_i * y_i. How to evaluate this query? What index
     structure? The computation itself is simple (multiply and sum up);
     use an inverted index to find the documents with non-zero weights for
     the query terms.

     Is vector space always possible? Can you always express objects as
     k-dimensional vectors, so that the distance function compares only
     corresponding dimensions? Counterexamples?

     How about comparing two strings? Is it natural to consider strings in
     a vector space?

     Metric space: what is a metric space? A set S of objects with a
     global distance function d (the "metric"), such that for every two
     points x, y in S:
     - Positiveness: d(x, y) >= 0
     - Symmetry: d(x, y) = d(y, x)
     - Reflexivity: d(x, x) = 0
     - Triangle inequality: d(x, y) <= d(x, z) + d(z, y)
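The inverted-index evaluation sketched above can be made concrete. In this minimal sketch the documents, weights, and function names are all made up for illustration:

```python
from collections import defaultdict

# Toy document vectors: doc id -> {term: weight}.
docs = {
    "d1": {"apple": 0.8, "computer": 0.5},
    "d2": {"apple": 0.3, "pie": 0.9},
    "d3": {"banana": 1.0},
}

# Build the inverted index: term -> list of (doc id, weight) postings.
index = defaultdict(list)
for doc_id, weights in docs.items():
    for term, w in weights.items():
        index[term].append((doc_id, w))

def evaluate(query_weights):
    """Accumulate Sim(Q, D) = sum x_i * y_i, touching only the postings
    of query terms, i.e., docs with non-zero weights for those terms."""
    scores = defaultdict(float)
    for term, qw in query_weights.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(evaluate({"apple": 1.0, "computer": 1.0}))
# d1 scores highest; d3 is never touched at all.
```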

  6. Vector space is a special case of metric space. E.g., consider L_2
     (let p = 2 in L_p):
         L_2(x, y) = ( sum_{i=1}^{k} (x_i - y_i)^2 )^(1/2)
     the shortest distance.

     Another example: edit distance, the smallest number of edit
     operations (insertions, deletions, and substitutions) required to
     transform one string into another. E.g.:
         Virginia -> Verginia -> Verminia -> Vermonta -> Vermont
     (See http://urchin.earth.li/~twic/edit-distance.html.)

     Is edit distance a metric? Can you show that it is symmetric, i.e.,
     that d(Virginia, Vermont) = d(Vermont, Virginia)? Check the other
     properties as well.

     How to evaluate metric-space ranking queries? [Chávez et al., 2001]
     Can we still use the R-tree? What property of metric space can we
     leverage to "prune" the search space for finding near objects?
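Edit distance as defined above is standardly computed by dynamic programming (the Levenshtein algorithm). A minimal sketch:

```python
# Levenshtein edit distance: the smallest number of insertions,
# deletions, and substitutions transforming one string into another.

def edit_distance(a, b):
    m, n = len(a), len(b)
    prev = list(range(n + 1))        # row for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n]

# Symmetry, as asked on the slide:
print(edit_distance("Virginia", "Vermont"))
print(edit_distance("Vermont", "Virginia"))  # same value in both orders
```

Since each edit operation is reversible at the same cost (insert vs. delete, substitution vs. the reverse substitution), any edit script from x to y can be inverted into one of equal length from y to x, which is why symmetry holds.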

  7. Metric-space indexing: the index stores precomputed distances from
     objects to reference points (the slide's figure shows a query Q,
     an index, and stored distances such as 2, 3, 5, 6, with an unknown
     distance u). What is the range of u? How does this help in focusing
     our search?

     Relevance-based ranking, for text retrieval: what is being
     "relevant"? There are many different ways of modeling relevance:
     - Similarity: how similar is D to Q?
     - Probability: how likely is D relevant to Q?
     - Inference: how likely can D infer Q?

     Similarity-based relevance: we just talked about this, "vector-space
     modeling" [Salton et al., 1975], using the cosine measure, or the
     "TF-IDF" model, with TF-IDF for term weights in the vectors:
     - TF (term frequency, in this document): the more occurrences of the
       term in this document, the better.
     - IDF (inverse document frequency, in the entire DB): the fewer
       documents contain this term, the better.

     Probabilistic relevance: the view is probability of relevance, per
     the "probabilistic ranking principle" [Robertson, 1977]: "If a
     retrieval system's response to each request is a ranking of the
     documents in the collections in order of decreasing probability of
     usefulness to the user who submitted the request, where the
     probabilities are estimated as accurately as possible on the basis of
     whatever data have been made available to the system for this
     purpose, then the overall effectiveness of the system to its users
     will be the best that is obtainable on the basis of that data." The
     initial idea was proposed in [Maron and Kuhns, 1960]; many models
     followed.
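The pruning property the slide is asking about is the triangle inequality: with a pivot p, |d(Q, p) - d(p, o)| <= d(Q, o) <= d(Q, p) + d(p, o), so if the lower bound already exceeds the search radius, o can be discarded without ever computing d(Q, o). A minimal pivot-based sketch (names and the 1-D example are mine):

```python
# Pivot-based pruning for a metric-space range query d(Q, o) <= radius.
# Distances d(pivot, o) are assumed precomputed by the index; the
# triangle inequality turns them into a cheap lower bound on d(Q, o).

def range_query(query, objects, d, pivot, radius):
    dqp = d(query, pivot)
    results, distance_calls = [], 1           # one call for d(Q, pivot)
    for o in objects:
        dpo = d(pivot, o)                     # looked up, not recomputed
        if abs(dqp - dpo) > radius:           # lower bound too large: prune
            continue
        distance_calls += 1                   # only now pay for d(Q, o)
        if d(query, o) <= radius:
            results.append(o)
    return results, distance_calls

# Real numbers with d(x, y) = |x - y| form a simple metric space.
objs = [1, 4, 7, 10, 40, 50]
hits, calls = range_query(5, objs, lambda x, y: abs(x - y),
                          pivot=0, radius=3)
print(hits, calls)   # [4, 7] 3 -- only 2 of 6 objects needed a real call
```

This works in any metric space (including edit distance over strings), which is exactly why the metric axioms, rather than a vector representation, are what the index needs.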

  8. Probabilistic models (e.g., [Croft and Harper, 1979]): estimate and
     rank by P(R | Q, D), or equivalently by
         log [ P(R | Q, D) / P(R' | Q, D) ],
     where R' denotes non-relevance.

     This is how we derive the ranking function. By Bayes' rule,
         P(R | Q, D) = P(Q, D | R) P(R) / P(Q, D)
     and likewise for R'. Writing p_i = P(t_i | R) and q_i = P(t_i | R'),
     and assuming term independence,
         P(Q, D | R)  = prod_{t_i in Q,D} p_i
                        * prod_{t_j in Q, not in D} (1 - p_j)
         P(Q, D | R') = prod_{t_i in Q,D} q_i
                        * prod_{t_j in Q, not in D} (1 - q_j)
     so that, up to document-independent factors,
         log [ P(R | Q, D) / P(R' | Q, D) ]
             ∝ sum_{t_i in Q,D} log [ (p_i / q_i) * ((1 - q_i) / (1 - p_i)) ]
     Now assume:
     - p_i is the same for all query terms (e.g., p_i = 0.5, so the
       factor p_i / (1 - p_i) contributes nothing to the ranking), and
     - q_i = n_i / N, where n_i is the number of documents containing t_i
       and N is the DB size (i.e., "all" docs are treated as
       non-relevant).
     Then
         log [ P(R | Q, D) / P(R' | Q, D) ]
             ∝ sum_{t_i in Q,D} log [ (1 - q_i) / q_i ]
             = sum_{t_i in Q,D} log [ (N - n_i) / n_i ]
             ≈ sum_{t_i in Q,D} log ( N / n_i )
     which is similar to using "IDF". Intuition: e.g., for the query
     "apple computer" in a computer DB, "computer" occurs in nearly every
     document and so carries little weight.

     Inference-based relevance: given the document as evidence, prove
     that the information need is satisfied. Motivation: is there any
     "objective" way of defining relevance? A hint comes from a logic view
     of database querying: retrieve all objects such that O -> Q. E.g.,
     O = (john, cs, 3.5) implies "gpa > 3.0 AND dept = cs". What about
     "retrieve D iff we can prove D -> Q"? Challenges [van Rijsbergen,
     1986]:
     - Representation of documents and queries.
     - Uncertainty in inference: quantify it as P(D -> Q) = P(Q | D).

     Inference network [Turtle and Croft, 1990]: inference based on
     Bayesian belief networks. The document network links documents
     d_1, d_2, ..., d_n (evidence: "doc d_n observed") to their document
     representations t_1, t_2, ..., t_n and document concepts
     r_1, r_2, r_3, ..., r_k; the query network links query concepts
     c_1, c_2, ..., c_m through query representations q_1, q_2 to Q, the
     query or "information need".
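The ranking function derived above, sum over matched terms of log((N - n_i)/n_i), is a few lines of code. A minimal sketch; the corpus statistics below are made up for illustration:

```python
import math

N = 1000                                   # documents in the DB
df = {"apple": 50, "computer": 400, "the": 990}   # doc frequencies n_t

def rsj_score(query_terms, doc_terms):
    """Rank value: sum over t in Q intersect D of log((N - n_t)/n_t),
    i.e., the slide's IDF-like weight under p_i constant, q_i = n_i/N."""
    score = 0.0
    for t in set(query_terms) & set(doc_terms):
        n = df[t]
        score += math.log((N - n) / n)
    return score

# The rare term "apple" contributes log(950/50) = log 19, far more than
# the common "computer" at log(600/400) = log 1.5; a term in nearly all
# documents, like "the", would even get a negative weight.
print(rsj_score(["apple", "computer"], ["apple", "computer", "pie"]))
```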

  9. Using and constructing the network.
     Using the network (suppose all probabilities are known):
     - The document network can be pre-computed.
     - For any given query, the query network can be evaluated.
     - P(Q | D) can be computed for each document.
     - Documents can then be ranked according to P(Q | D).
     Constructing the network (assigning probabilities):
     - Subjective probabilities.
     - Heuristics, e.g., TF-IDF weighting.
     - Statistical estimation, which needs "training"/relevance data.

     Ranking and Preference in Database Search: b) Preference Modeling
     (Kevin Chen-Chuan Chang).
     Ranking: ordering according to the degree of some fuzzy notion:
     - Similarity (or dissimilarity)
     - Relevance
     - Preference
     What do you prefer? For a job.
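The "using the network" steps can be illustrated with a toy stand-in: estimate a belief P(Q | D) per document from per-term beliefs and rank by it. This is only a sketch of the control flow, not the actual Turtle-Croft network; all probabilities and names here are invented:

```python
# Toy stand-in for network evaluation: each document carries per-term
# beliefs (in reality derived from the pre-computed document network);
# P(Q|D) is approximated as a naive AND over independent term beliefs.

docs = {
    "d1": {"apple": 0.9, "computer": 0.8},
    "d2": {"apple": 0.4},
    "d3": {"computer": 0.7},
}

def belief(query_terms, term_probs):
    """Crude P(Q|D): product of term beliefs, with a small default
    belief for terms absent from the document."""
    p = 1.0
    for t in query_terms:
        p *= term_probs.get(t, 0.01)
    return p

query = ["apple", "computer"]
ranked = sorted(docs, key=lambda d: belief(query, docs[d]), reverse=True)
print(ranked)   # d1, matching both terms strongly, ranks first
```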
