Term Scores: VSM, Session 3. CS6200: Information Retrieval Slides (PowerPoint PPT Presentation)



SLIDE 1

Term Scores

VSM, Session 3

CS6200: Information Retrieval

Slides by: Jesse Anderton

SLIDE 2
Flaws of TF

  • In the first module, we introduced term frequency tf_{t,d} as a way to measure how much term t contributes to document d’s meaning.
  • One obvious flaw with this scheme is that some of the most common words, such as “and” or “the,” contribute no meaning to a document.
  • We’d like a term score which assigns high weight to terms that are both distinctive to the document and frequent within the document.

  Term      Min TF   Max TF   # of Plays
  and          426    1,001           37
  the          403    1,143           37
  die            3       40           37
  love          12      171           37
  betray         1        6           24
  rome           1      110           16
  fairy          1       28           10
  brutus         1      379            8
  verona         5       13            3
  romeo        312      312            1

TF of selected terms in Shakespeare’s plays
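The flaw shown in the table can be reproduced with a plain word count. The miniature document below is hypothetical, used only to illustrate that raw TF puts "syntactic glue" on top:

```python
from collections import Counter

# Hypothetical miniature document; with raw TF, function words dominate.
doc = "the king and the fool and the storm"
tf = Counter(doc.split())

print(tf.most_common(2))  # -> [('the', 3), ('and', 2)]
```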

SLIDE 3
Zipf’s Law

  • If you sort words by their frequency, the frequency f times the rank r is roughly constant:

    freq(t) · rank(t) ≈ k, or Pr(t) · rank(t) ≈ c, for language-dependent constants k and c

  • Word 2 appears 1/2 as often as word 1.
  • Word 10 appears 1/10 as often as word 1, and 1/2 as often as word 5.
  • Very common words have much higher TF values, but most are “syntactic glue.”
  • In English, c ≈ 0.1.
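The rank/frequency claims above can be sketched numerically. The function below is illustrative (the name is mine, not from the slides): it predicts a word's frequency from the top word's frequency using freq(t) · rank(t) ≈ k.

```python
def zipf_predicted_freq(f1: float, rank: int) -> float:
    """Predicted frequency of the rank-th word, given that the
    top-ranked word occurs f1 times: freq(t) * rank(t) ~ k = f1 * 1."""
    return f1 / rank

# If word 1 appears 1,000 times, word 10 should appear ~100 times,
# half as often as word 5 (~200 times).
print(zipf_predicted_freq(1000, 10))  # -> 100.0
print(zipf_predicted_freq(1000, 5))   # -> 200.0
```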

SLIDE 4

Zipf’s Law on AP89 News Corpus

Top 50 words from AP89: 84,679 documents; 39,749,179 tokens; 198,763 words in vocabulary. Log-log plot of AP89 vs. Zipf’s Law prediction; note deviations at the highest and lowest probabilities.

SLIDE 5
How do we fix it?

  • We want term scores to be proportional to TF, but we want to discount scores for too-common terms.
  • Two common ways to discount:
    • A term’s cumulative frequency cf_t is the total number of occurrences of term t in the collection.
    • The term’s document frequency df_t is the number of documents in the collection which contain term t.
  • The most common way to discount is to multiply by log(D/df_t), where D is the number of documents in the collection. This is called IDF, for inverse document frequency, and leads to TF-IDF scores.

    tf-idf_{t,d} := tf_{t,d} · log(D / df_t)

  Term     Doc                tf_{t,d}   df_t     cf_t   tf_{t,d}/df_t   tf_{t,d}/cf_t   tf-idf_{t,d}
  and      King Lear               737     37   25,932           19.92           0.028              0
  love     Romeo and Juliet        150     37    2,019            4.05           0.074              0
  rome     Hamlet                    2     16      332           0.125           0.006           1.68
  rome     Julius Caesar            42     16      332           2.625           0.127          35.21
  romeo    Romeo and Juliet        312      1      312             312               1        1126.61

Various term score functions. Why is TF-IDF 0 for “love” in Romeo and Juliet?
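The formula can be checked directly against the table. This sketch assumes a natural logarithm, which matches the table's numbers, with D = 37 Shakespeare plays:

```python
import math

def tf_idf(tf: int, df: int, D: int) -> float:
    """tf-idf_{t,d} = tf_{t,d} * log(D / df_t), natural log assumed."""
    return tf * math.log(D / df)

print(round(tf_idf(312, 1, 37), 2))  # "romeo" in Romeo and Juliet -> 1126.61
print(tf_idf(150, 37, 37))           # "love": df_t = D, so log(1) = 0 -> 0.0
```

This also answers the question under the table: "love" occurs in all 37 plays, so its IDF factor log(37/37) is zero no matter how large its TF is.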

SLIDE 6
Nonlinear TF Scaling

  • Another problem with TF is that repeated use of a term doesn’t necessarily imply more information about it.
  • A document with TF=20 seems unlikely to have exactly ten times the information of a document with TF=2.
  • We want term repetition to give a document diminishing returns. A common solution is to use the logarithm instead of a linear function.

    wf_{t,d} := 1 + log(tf_{t,d})   if tf_{t,d} > 0
                0                   otherwise

    wf-idf_{t,d} := wf_{t,d} · idf_t

*wf stands for “weight function” and is not an official name.
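A minimal sketch of the piecewise definition above, assuming a natural logarithm (the function names follow the slide's informal wf notation):

```python
import math

def wf(tf: int) -> float:
    """Sublinear TF scaling: 1 + log(tf) if tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def wf_idf(tf: int, df: int, D: int) -> float:
    """wf-idf_{t,d} = wf_{t,d} * idf_t, with idf_t = log(D / df_t)."""
    return wf(tf) * math.log(D / df)

# Diminishing returns: going from tf=2 to tf=20 adds log(10) ~ 2.3
# to the weight, rather than multiplying it by ten.
print(wf(2))   # ~1.69
print(wf(20))  # ~4.00
```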

SLIDE 7
Normalized TF Scores

  • As an alternative to nonlinear scaling, we could consider normalizing the TF scores by the TF of the most frequent term in the document.
  • The role of a in this formula is to produce scores that change more gradually as tf_{t,d} changes.
  • a is commonly set to 0.4.

    ntf_{t,d} := a + (1 − a) · tf_{t,d} / max-tf_d,   0 < a < 1

    max-tf_d := max_{τ∈d} tf_{τ,d}
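The normalized score can be sketched directly from the formula, with a defaulting to 0.4 as the slide suggests (the function name is mine):

```python
def ntf(tf: int, max_tf: int, a: float = 0.4) -> float:
    """Normalized TF: a + (1 - a) * tf / max_tf, for 0 < a < 1."""
    return a + (1 - a) * tf / max_tf

# The document's most frequent term always scores 1.0, and every score
# is floored at a, so scores vary only over [a, 1].
print(round(ntf(100, 100), 3))  # -> 1.0
print(round(ntf(1, 100), 3))    # -> 0.406
```

Compressing scores into [a, 1] is exactly the "change more gradually" effect the bullet describes: a tenfold change in tf_{t,d} moves the score far less than tenfold.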

SLIDE 8
Wrapping Up

  • The term score functions used for VSMs are often heuristics based on some insight into a problem with previously used functions.
  • When we discuss Language Models, we will implement these insights in a more principled way.
  • We showed solutions for several problems with raw TF scores, but the resulting scores are far from perfect:
    • We don’t separate word senses: “love” can be used to mean many things.
    • A popular word can nevertheless have pages devoted to it, and be the topic of some query. Can we effectively search for Shakespearean plays that tell love stories using TF-IDF scores?