
Matching Scores TVM, Session 4 CS6200: Information Retrieval

  1. Matching Scores TVM, Session 4 CS6200: Information Retrieval Slides by: Jesse Anderton

  2. Finding Similar Vectors
  • Imagine that we have perfect term scores: our vectors exactly capture the document’s (or query’s) meaning.
  • How can we compare two of these vectors so we can rank the documents?
  • Let’s try a similarity function based on the Euclidean distance between the vectors:
      dist(q, d) := sqrt( Σ_t (q_t − d_t)² )
      sim(q, d) := 1 / (1 + dist(q, d))
  • What’s wrong?
    ‣ In the query’s term vector, TF = 1.
    ‣ Documents with TF > 1 are further from the query, so have lower similarity.

  Play                    TF   Distance   Similarity
  Henry VI, part 2          1     0         1.0
  Hamlet                    1     0         1.0
  Antony and Cleopatra      4     4.59      0.179
  Coriolanus              109   165.40      0.006
  Julius Caesar           379   578.9       0.002
  Plays for query “brutus” using TF-IDF term scores
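The Euclidean-based similarity above can be sketched in a few lines. A minimal sketch: the term vectors are dicts of term scores, and the specific values (`query`, `exact_match`, `heavy_match`) are hypothetical, not the slide's actual TF-IDF scores.

```python
import math

def euclidean_dist(q, d):
    """dist(q, d) = sqrt(sum over terms of (q_t - d_t)^2)."""
    terms = set(q) | set(d)
    return math.sqrt(sum((q.get(t, 0.0) - d.get(t, 0.0)) ** 2 for t in terms))

def euclidean_sim(q, d):
    """sim(q, d) = 1 / (1 + dist(q, d)); equals 1.0 for identical vectors."""
    return 1.0 / (1.0 + euclidean_dist(q, d))

# Illustrates the slide's problem: a document matching the query's term score
# exactly gets 1.0, but a document that repeats the term scores *lower*.
query = {"brutus": 1.0}
exact_match = {"brutus": 1.0}
heavy_match = {"brutus": 4.0}  # higher TF pushes the vector away from q
```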

  3. Dot Product Similarity
  • We used the dot product in module 1. How does that work?
      sim(q, d) := q · d
  • For many documents, it gives the results we want.
  • However, imagine building a document by repeating the contents of some other document.
  • Should two copies of Julius Caesar really match better than a single copy?
  • Should “The Complete Plays of Shakespeare” match better than the individual plays it contains?

  Play                    TF   Similarity
  Henry VI, part 2          1      2.34
  Hamlet                    1      2.34
  Antony and Cleopatra      4      9.38
  Coriolanus              109    255.65
  Julius Caesar           379    888.91
  Julius Caesar x 2       758   1777.83
  Julius Caesar x 3      1137   2666.74
  Plays for query “brutus” using TF-IDF term scores
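The repetition flaw is easy to demonstrate in code. A sketch with hypothetical scores (the 2.34 here is illustrative, not the slide's full-vector value): concatenating a document with itself doubles every term score, so its dot-product score doubles too.

```python
def dot_sim(q, d):
    """sim(q, d) = q . d, with sparse vectors stored as dicts."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

query = {"brutus": 1.0}
julius_caesar = {"brutus": 2.34}                 # illustrative term score
julius_caesar_x2 = {t: 2 * w for t, w in julius_caesar.items()}  # two copies
```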

  4. Cosine Similarity
  • Cosine Similarity solves the problems of both Euclidean-based similarity and the dot product:
      sim(q, d) := (q · d) / (‖q‖ · ‖d‖)
                 = (q · d) / ( sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²) )
    ‣ Instead of using distance between the vectors, we should use the angle between them.
    ‣ Instead of using the dot product, we should use a length-normalized dot product. That is, convert to unit vectors and take their dot product.

  Play                    TF   Similarity
  Henry VI, part 2          1     0.002
  Antony and Cleopatra      4     0.004
  Coriolanus              109     0.122
  Julius Caesar           379     0.550
  Julius Caesar x 2       758     0.550
  Plays for query “brutus” using TF-IDF term scores
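A minimal sketch of cosine similarity over sparse vectors (the example vectors are hypothetical). Doubling a document scales both the dot product and ‖d‖ by 2, so the score is unchanged, which is exactly why Julius Caesar and Julius Caesar x 2 tie in the table above.

```python
import math

def norm(v):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine_sim(q, d):
    """sim(q, d) = (q . d) / (||q|| * ||d||)."""
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    return num / (norm(q) * norm(d))

query = {"brutus": 1.0, "caesar": 1.0}
doc = {"brutus": 3.0, "caesar": 1.0}
doc_x2 = {t: 2 * w for t, w in doc.items()}  # the document repeated twice
```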

  5. Approximating Cosine Similarity
  • The normalization term for cosine similarity can’t be calculated in advance, if it depends on df_t or cf_t.
  • For faster querying, we sometimes approximate it using the number of terms in the document:
      sim(q, d) ≈ (q / len(q)) · (d / len(d))
  • This preserves some information about relative document length, which can sometimes be helpful.

  Play                    TF   Similarity
  Henry VI, part 2          1     0.014
  Antony and Cleopatra      4     0.056
  Coriolanus              109     1.478
  Julius Caesar           379     6.109
  Julius Caesar x 2       758     8.639
  Plays for query “brutus” using TF-IDF term scores
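The approximation can be sketched by passing the term counts in explicitly; in a real engine they would come from precomputed index statistics. All names and values below are hypothetical.

```python
def approx_cosine_sim(q, d, q_len, d_len):
    """sim(q, d) ~ (q / len(q)) . (d / len(d)), where len() is a term count
    known at index time, unlike the exact cosine norm over TF-IDF weights."""
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    return num / (q_len * d_len)

query = {"brutus": 1.0}   # hypothetical term scores
doc = {"brutus": 4.0}
```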

  6. Pivoted Normalized Document Length
  • Some long documents have many short sections, each relevant to a different query.
  • These are hurt by Cosine Similarity because they contain many more distinct terms than average.
  • If we normalize by a number less than the length for short documents, and more than the length for long documents, we can give a slight boost to longer documents.
  • This comes in both exact and approximate forms:
      Exact:       sim(q, d) := (q · d) / ( ‖q‖ · (a‖d‖ + (1 − a)·piv) )
      Approximate: sim(q, d) ≈ (q · d) / ( ‖q‖ · (a·u_d + (1 − a)·piv) )
    where 0 < a < 1, piv is determined empirically*, and u_d is the number of unique terms in d.

  * See: http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
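Both forms can be sketched directly from the formulas; the vectors and the values of `a` and `piv` below are hypothetical (in practice `piv` is tuned empirically). With a = 1 the exact form reduces to plain cosine similarity, which makes a handy sanity check.

```python
import math

def norm(v):
    return math.sqrt(sum(w * w for w in v.values()))

def pivoted_sim(q, d, a, piv):
    """Exact form: cosine with ||d|| replaced by a*||d|| + (1 - a)*piv."""
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    return num / (norm(q) * (a * norm(d) + (1 - a) * piv))

def pivoted_sim_approx(q, d, a, piv, u_d):
    """Approximate form: u_d (unique-term count) stands in for ||d||."""
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    return num / (norm(q) * (a * u_d + (1 - a) * piv))
```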

  7. SMART Notation
  • VSM weights can be denoted as ddd.qqq, where ddd indicates the scheme for document weights and qqq the scheme for query weights. Each triple gives the term frequency, document frequency, and normalization components.
  • A common choice is lnc.ltc: document vectors use log term frequency and cosine normalization, and query vectors use log term frequency, IDF, and cosine normalization.

  Image from: http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
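The lnc.ltc scheme can be sketched as below. This is a minimal illustration: the natural log base and the toy IDF values are assumptions, and a real system would compute IDF from collection statistics.

```python
import math

def cosine_normalize(w):
    """'c': divide each weight by the vector's Euclidean length."""
    n = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / n for t, x in w.items()}

def lnc(tf):
    """Document side: log tf ('l'), no df weighting ('n'), cosine norm ('c')."""
    return cosine_normalize({t: 1 + math.log(f) for t, f in tf.items() if f > 0})

def ltc(tf, idf):
    """Query side: log tf ('l'), idf ('t'), cosine norm ('c')."""
    return cosine_normalize(
        {t: (1 + math.log(f)) * idf[t] for t, f in tf.items() if f > 0})

def score(q_vec, d_vec):
    """Dot product of the two weighted, normalized vectors."""
    return sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
```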

  8. Wrapping Up • Ultimately, the choice of a scoring system to use depends on a balance between accuracy and performance. • Ignoring document length entirely with cosine similarity is a big improvement over the simple dot product, but it turns out that there are subtle cases when document length information is helpful. • Next, we’ll look at ways to efficiently calculate these scores at query time.
