
Ranking and Preference in Database Search
Kevin Chen-Chuan Chang


  1. Ranking and Preference in Database Search: a) Similarity and Relevance
     (Kevin Chen-Chuan Chang).
     Ranking: ordering according to the degree of some fuzzy notion:
     - Similarity (or dissimilarity)
     - Relevance
     - Preference
     A query Q is answered with a ranking of objects.
     Similarity: are they similar? E.g., two images.

  2. Similarity: are they similar? E.g., two strings.
     So, similarity is not a Boolean notion: it is relative, i.e., a ranking.
     Ranking by similarity (similarity-based ranking): rank by a "distance"
     function (or "dissimilarity") d(Q, O_i) between the query Q and each
     object O_i.
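The idea of similarity-based ranking by a distance function d(Q, O_i) can be sketched in a few lines. This is a minimal illustration; the function names and the toy string "distance" are mine, not from the slides:

```python
# Illustrative sketch: rank a collection of objects by their distance
# d(Q, O_i) to a query object Q, smallest (most similar) first.

def rank_by_distance(query, objects, d):
    """Return objects sorted by increasing distance to the query."""
    return sorted(objects, key=lambda o: d(query, o))

# A crude string "dissimilarity" for demonstration only (NOT edit
# distance): count differing positions plus the length difference.
def toy_string_distance(a, b):
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs + abs(len(a) - len(b))

print(rank_by_distance("cat", ["dog", "cart", "cut", "cat"],
                       toy_string_distance))
# -> ['cat', 'cut', 'cart', 'dog']
```

Any dissimilarity function can be plugged in for `d`; the rest of the section is about which functions make sense and how to evaluate such rankings efficiently.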

  3. The "space": defined by the objects and their distances.
     - Object representation: vector or not?
     - Distance function: metric or not?

     Vector space: what is a vector space? (S, d) is a vector space if:
     - each object in S is a k-dimensional vector, x = (x_1, ..., x_k),
       y = (y_1, ..., y_k), and
     - the distance d(x, y) between any x and y is a metric.

     Vector space distance functions: the L_p distance functions.
     The general form:
         L_p(x: (x_1, ..., x_k), y: (y_1, ..., y_k))
             = ( sum_{i=1}^{k} |x_i - y_i|^p )^(1/p)
     AKA the p-norm distance or Minkowski distance.

     L_1, the Manhattan distance: let p = 1 in L_p:
         L_1(x, y) = sum_{i=1}^{k} |x_i - y_i|
     the Manhattan or "block" distance. Does this look familiar?
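The L_p family above translates directly into code. A minimal sketch (function names are mine):

```python
# The general L_p ("Minkowski") distance from the slide, plus the
# p = 1 (Manhattan) special case.

def lp_distance(x, y, p):
    """L_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p)."""
    assert len(x) == len(y)
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L_1: sum of absolute coordinate differences ("block" distance)."""
    return lp_distance(x, y, 1)

print(manhattan((0, 0), (3, 4)))        # |3| + |4| = 7.0
print(lp_distance((0, 0), (3, 4), 2))   # Euclidean: sqrt(9 + 16) = 5.0
```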

  4. L_2, the Euclidean distance: let p = 2 in L_p:
         L_2(x, y) = ( sum_{i=1}^{k} (x_i - y_i)^2 )^(1/2)
     the shortest distance between two points.

     The cosine measure:
         sim(x, y) = cos(theta)
             = ( sum_i x_i * y_i )
               / ( sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2) )

     How to evaluate vector-space queries? Consider an L_p measure, say
     L_2, as the ranking function: given query object Q, find objects O_i
     in increasing order of d(Q, O_i). How to evaluate this query? What
     index structure? As nearest-neighbor queries, using multidimensional
     or spatial indexes, e.g., the R-tree [Guttman, 1984].

     Sounds abstract? That's actually how Web search engines (like Google)
     work: vector space modeling with the cosine measure, or the "TF-IDF"
     model. E.g., for Q: "apple computer", with Q = (x_1, ..., x_k) and
     D = (y_1, ..., y_k), Sim(Q, D) = sum_i x_i * y_i.
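The cosine measure above is easy to state in code. A minimal sketch, assuming dense vectors of equal length:

```python
import math

# Cosine measure from the slide:
#   sim(x, y) = (sum x_i*y_i) / (sqrt(sum x_i^2) * sqrt(sum y_i^2))

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine((1, 0), (0, 1)))   # orthogonal vectors -> 0.0
print(cosine((1, 2), (2, 4)))   # same direction -> approximately 1.0
```

Note that cosine is a similarity (larger is better), unlike the L_p functions, which are dissimilarities (smaller is better).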

  5. How to evaluate vector-space queries? Consider the cosine measure,
     Sim(Q, D) = sum_i x_i * y_i. How to evaluate this query? What index
     structure? The computation itself is simple (multiply and sum up);
     use an inverted index to find the documents with non-zero weights for
     the query terms.

     Is vector space always possible? Can you always express objects as
     k-dimensional vectors, so that the distance function compares only
     corresponding dimensions? Counterexamples?

     How about comparing two strings? Is it natural to consider strings in
     a vector space?

     Metric space: what is a metric space? A set S of objects with a
     global distance function d (the "metric"), such that for every two
     points x, y in S:
     - Positiveness: d(x, y) >= 0
     - Symmetry: d(x, y) = d(y, x)
     - Reflexivity: d(x, x) = 0
     - Triangle inequality: d(x, y) <= d(x, z) + d(z, y)
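The inverted-index evaluation sketched above can be made concrete. In this minimal sketch the documents, weights, and function names are all made up for illustration:

```python
from collections import defaultdict

# Toy document vectors: doc id -> {term: weight}.
docs = {
    "d1": {"apple": 0.8, "computer": 0.5},
    "d2": {"apple": 0.3, "pie": 0.9},
    "d3": {"banana": 1.0},
}

# Build the inverted index: term -> list of (doc id, weight) postings.
index = defaultdict(list)
for doc_id, weights in docs.items():
    for term, w in weights.items():
        index[term].append((doc_id, w))

def evaluate(query_weights):
    """Accumulate Sim(Q, D) = sum x_i * y_i, touching only the postings
    of query terms, i.e., docs with non-zero weights for those terms."""
    scores = defaultdict(float)
    for term, qw in query_weights.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(evaluate({"apple": 1.0, "computer": 1.0}))
# d1 scores highest; d3 is never touched at all.
```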

  6. Vector space is a special case of metric space. E.g., consider L_2
     (let p = 2 in L_p):
         L_2(x, y) = ( sum_{i=1}^{k} (x_i - y_i)^2 )^(1/2)
     the shortest distance.

     Another example: edit distance, the smallest number of edit
     operations (insertions, deletions, and substitutions) required to
     transform one string into another. E.g.:
         Virginia -> Verginia -> Verminia -> Vermonta -> Vermont
     (See http://urchin.earth.li/~twic/edit-distance.html.)

     Is edit distance a metric? Can you show that it is symmetric, i.e.,
     that d(Virginia, Vermont) = d(Vermont, Virginia)? Check the other
     properties as well.

     How to evaluate metric-space ranking queries? [Chávez et al., 2001]
     Can we still use the R-tree? What property of metric space can we
     leverage to "prune" the search space for finding near objects?
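Edit distance as defined above is standardly computed by dynamic programming (the Levenshtein algorithm). A minimal sketch:

```python
# Levenshtein edit distance: the smallest number of insertions,
# deletions, and substitutions transforming one string into another.

def edit_distance(a, b):
    m, n = len(a), len(b)
    prev = list(range(n + 1))        # row for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n]

# Symmetry, as asked on the slide:
print(edit_distance("Virginia", "Vermont"))
print(edit_distance("Vermont", "Virginia"))  # same value in both orders
```

Since each edit operation is reversible at the same cost (insert vs. delete, substitution vs. the reverse substitution), any edit script from x to y can be inverted into one of equal length from y to x, which is why symmetry holds.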

  7. Metric-space indexing: the index stores precomputed distances from
     objects to reference points (the slide's figure shows a query Q,
     an index, and stored distances such as 2, 3, 5, 6, with an unknown
     distance u). What is the range of u? How does this help in focusing
     our search?

     Relevance-based ranking, for text retrieval: what is being
     "relevant"? There are many different ways of modeling relevance:
     - Similarity: how similar is D to Q?
     - Probability: how likely is D relevant to Q?
     - Inference: how likely can D infer Q?

     Similarity-based relevance: we just talked about this, "vector-space
     modeling" [Salton et al., 1975], using the cosine measure, or the
     "TF-IDF" model, with TF-IDF for term weights in the vectors:
     - TF (term frequency, in this document): the more occurrences of the
       term in this document, the better.
     - IDF (inverse document frequency, in the entire DB): the fewer
       documents contain this term, the better.

     Probabilistic relevance: the view is probability of relevance, per
     the "probabilistic ranking principle" [Robertson, 1977]: "If a
     retrieval system's response to each request is a ranking of the
     documents in the collections in order of decreasing probability of
     usefulness to the user who submitted the request, where the
     probabilities are estimated as accurately as possible on the basis of
     whatever data have been made available to the system for this
     purpose, then the overall effectiveness of the system to its users
     will be the best that is obtainable on the basis of that data." The
     initial idea was proposed in [Maron and Kuhns, 1960]; many models
     followed.
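The pruning property the slide is asking about is the triangle inequality: with a pivot p, |d(Q, p) - d(p, o)| <= d(Q, o) <= d(Q, p) + d(p, o), so if the lower bound already exceeds the search radius, o can be discarded without ever computing d(Q, o). A minimal pivot-based sketch (names and the 1-D example are mine):

```python
# Pivot-based pruning for a metric-space range query d(Q, o) <= radius.
# Distances d(pivot, o) are assumed precomputed by the index; the
# triangle inequality turns them into a cheap lower bound on d(Q, o).

def range_query(query, objects, d, pivot, radius):
    dqp = d(query, pivot)
    results, distance_calls = [], 1           # one call for d(Q, pivot)
    for o in objects:
        dpo = d(pivot, o)                     # looked up, not recomputed
        if abs(dqp - dpo) > radius:           # lower bound too large: prune
            continue
        distance_calls += 1                   # only now pay for d(Q, o)
        if d(query, o) <= radius:
            results.append(o)
    return results, distance_calls

# Real numbers with d(x, y) = |x - y| form a simple metric space.
objs = [1, 4, 7, 10, 40, 50]
hits, calls = range_query(5, objs, lambda x, y: abs(x - y),
                          pivot=0, radius=3)
print(hits, calls)   # [4, 7] 3 -- only 2 of 6 objects needed a real call
```

This works in any metric space (including edit distance over strings), which is exactly why the metric axioms, rather than a vector representation, are what the index needs.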

  8. Probabilistic models (e.g., [Croft and Harper, 1979]): estimate and
     rank by P(R | Q, D), or equivalently by
         log [ P(R | Q, D) / P(R' | Q, D) ],
     where R' denotes non-relevance.

     This is how we derive the ranking function. By Bayes' rule,
         P(R | Q, D) = P(Q, D | R) P(R) / P(Q, D)
     and likewise for R'. Writing p_i = P(t_i | R) and q_i = P(t_i | R'),
     and assuming term independence,
         P(Q, D | R)  = prod_{t_i in Q,D} p_i
                        * prod_{t_j in Q, not in D} (1 - p_j)
         P(Q, D | R') = prod_{t_i in Q,D} q_i
                        * prod_{t_j in Q, not in D} (1 - q_j)
     so that, up to document-independent factors,
         log [ P(R | Q, D) / P(R' | Q, D) ]
             ∝ sum_{t_i in Q,D} log [ (p_i / q_i) * ((1 - q_i) / (1 - p_i)) ]
     Now assume:
     - p_i is the same for all query terms (e.g., p_i = 0.5, so the
       factor p_i / (1 - p_i) contributes nothing to the ranking), and
     - q_i = n_i / N, where n_i is the number of documents containing t_i
       and N is the DB size (i.e., "all" docs are treated as
       non-relevant).
     Then
         log [ P(R | Q, D) / P(R' | Q, D) ]
             ∝ sum_{t_i in Q,D} log [ (1 - q_i) / q_i ]
             = sum_{t_i in Q,D} log [ (N - n_i) / n_i ]
             ≈ sum_{t_i in Q,D} log ( N / n_i )
     which is similar to using "IDF". Intuition: e.g., for the query
     "apple computer" in a computer DB, "computer" occurs in nearly every
     document and so carries little weight.

     Inference-based relevance: given the document as evidence, prove
     that the information need is satisfied. Motivation: is there any
     "objective" way of defining relevance? A hint comes from a logic view
     of database querying: retrieve all objects such that O -> Q. E.g.,
     O = (john, cs, 3.5) implies "gpa > 3.0 AND dept = cs". What about
     "retrieve D iff we can prove D -> Q"? Challenges [van Rijsbergen,
     1986]:
     - Representation of documents and queries.
     - Uncertainty in inference: quantify it as P(D -> Q) = P(Q | D).

     Inference network [Turtle and Croft, 1990]: inference based on
     Bayesian belief networks. The document network links documents
     d_1, d_2, ..., d_n (evidence: "doc d_n observed") to their document
     representations t_1, t_2, ..., t_n and document concepts
     r_1, r_2, r_3, ..., r_k; the query network links query concepts
     c_1, c_2, ..., c_m through query representations q_1, q_2 to Q, the
     query or "information need".
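The ranking function derived above, sum over matched terms of log((N - n_i)/n_i), is a few lines of code. A minimal sketch; the corpus statistics below are made up for illustration:

```python
import math

N = 1000                                   # documents in the DB
df = {"apple": 50, "computer": 400, "the": 990}   # doc frequencies n_t

def rsj_score(query_terms, doc_terms):
    """Rank value: sum over t in Q intersect D of log((N - n_t)/n_t),
    i.e., the slide's IDF-like weight under p_i constant, q_i = n_i/N."""
    score = 0.0
    for t in set(query_terms) & set(doc_terms):
        n = df[t]
        score += math.log((N - n) / n)
    return score

# The rare term "apple" contributes log(950/50) = log 19, far more than
# the common "computer" at log(600/400) = log 1.5; a term in nearly all
# documents, like "the", would even get a negative weight.
print(rsj_score(["apple", "computer"], ["apple", "computer", "pie"]))
```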

  9. Using and constructing the network.
     Using the network (suppose all probabilities are known):
     - The document network can be pre-computed.
     - For any given query, the query network can be evaluated.
     - P(Q | D) can be computed for each document.
     - Documents can then be ranked according to P(Q | D).
     Constructing the network (assigning probabilities):
     - Subjective probabilities.
     - Heuristics, e.g., TF-IDF weighting.
     - Statistical estimation, which needs "training"/relevance data.

     Ranking and Preference in Database Search: b) Preference Modeling
     (Kevin Chen-Chuan Chang).
     Ranking: ordering according to the degree of some fuzzy notion:
     - Similarity (or dissimilarity)
     - Relevance
     - Preference
     What do you prefer? For a job.
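The "using the network" steps can be illustrated with a toy stand-in: estimate a belief P(Q | D) per document from per-term beliefs and rank by it. This is only a sketch of the control flow, not the actual Turtle-Croft network; all probabilities and names here are invented:

```python
# Toy stand-in for network evaluation: each document carries per-term
# beliefs (in reality derived from the pre-computed document network);
# P(Q|D) is approximated as a naive AND over independent term beliefs.

docs = {
    "d1": {"apple": 0.9, "computer": 0.8},
    "d2": {"apple": 0.4},
    "d3": {"computer": 0.7},
}

def belief(query_terms, term_probs):
    """Crude P(Q|D): product of term beliefs, with a small default
    belief for terms absent from the document."""
    p = 1.0
    for t in query_terms:
        p *= term_probs.get(t, 0.01)
    return p

query = ["apple", "computer"]
ranked = sorted(docs, key=lambda d: belief(query, docs[d]), reverse=True)
print(ranked)   # d1, matching both terms strongly, ranks first
```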
