Vector Space Scoring: Introduction to Information Retrieval, INF 141
SLIDE 1

Vector Space Scoring

Introduction to Information Retrieval INF 141 Donald J. Patterson

Content adapted from Hinrich Schütze http://www.informationretrieval.org

SLIDE 4

Corpus-wide statistics

Querying

  • Collection Frequency, cf
  • Define: The total number of occurrences of the term in the entire corpus
  • Document Frequency, df
  • Define: The total number of documents in the corpus which contain the term
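The two statistics can be sketched in a few lines of Python; the toy corpus and variable names here are illustrative, not part of the slides.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative data only).
corpus = [
    ["try", "insurance", "try"],
    ["try", "try"],
    ["insurance"],
]

cf = Counter()  # collection frequency: total occurrences over the whole corpus
df = Counter()  # document frequency: number of documents containing the term
for doc in corpus:
    cf.update(doc)       # every occurrence counts
    df.update(set(doc))  # each term counts at most once per document

print(cf["try"], df["try"])  # cf=4 but df=2: "try" clusters in few documents
```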

SLIDE 7

Corpus-wide statistics

Querying

  • This suggests that df is better at discriminating between documents
  • How do we use df?

  Word        Collection Frequency    Document Frequency
  insurance   10440                   3997
  try         10422                   8760


SLIDE 18

Corpus-wide statistics

Querying

  • Term-Frequency, Inverse Document Frequency Weights
  • “tf-idf”
  • tf = term frequency
  • some measure of term density in a document
  • idf = inverse document frequency
  • a measure of the informativeness of a term
  • its rarity across the corpus
  • could be just a count of documents with the term
  • more commonly it is:

idf_t = log( |corpus| / df_t )

SLIDE 19

TF-IDF Examples

Querying

idf_t = log( |corpus| / df_t )

  • With |corpus| = 1,000,000 documents and base-10 logs: idf_t = log10( 1,000,000 / df_t )

  term          df_t        idf_t
  calpurnia             1       6
  animal              100       4
  sunday            1,000       3
  fly              10,000       2
  under           100,000       1
  the           1,000,000       0
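Assuming base-10 logs as on this slide, the idf column can be checked directly:

```python
import math

# Recompute the slide's idf values for a corpus of 1,000,000 documents.
N = 1_000_000
table = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
         "fly": 10_000, "under": 100_000, "the": 1_000_000}

idf = {term: math.log10(N / df) for term, df in table.items()}
print(idf["calpurnia"], idf["the"])  # rarest term scores highest, "the" scores 0
```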

SLIDE 20

TF-IDF Summary

Querying

  • Assign tf-idf weight for each term t in a document d:
  • Increases with number of occurrences of term in a doc.
  • Increases with rarity of term across entire corpus
  • Three different metrics
  • term frequency
  • document frequency
  • collection/corpus frequency

tf-idf(t, d) = (1 + log(tf_t,d)) * log( |corpus| / df_t )
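As a sketch (assuming base-10 logs, matching the idf examples), the weight on this slide is:

```python
import math

def tfidf(tf, df, corpus_size):
    """tf-idf weight: (1 + log10(tf)) * log10(corpus_size / df).
    Zero when the term does not occur in the document (tf == 0)."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(corpus_size / df)

# A term occurring 10 times in a doc, and in 1,000 of 1,000,000 docs:
print(tfidf(10, 1_000, 1_000_000))  # (1 + 1) * 3 = 6.0
```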

SLIDE 21

Now, real-valued term-document matrices

Querying

  • Bag of words model
  • Each element of matrix is tf-idf value

  term        Antony and   Julius    The        Hamlet   Othello   Macbeth
              Cleopatra    Caesar    Tempest
  Antony         13.1        11.4      0.0        0.0      0.0       0.0
  Brutus          3.0         8.3      0.0        1.0      0.0       0.0
  Caesar          2.3         2.3      0.0        0.5      0.3       0.3
  Calpurnia       0.0        11.2      0.0        0.0      0.0       0.0
  Cleopatra      17.7         0.0      0.0        0.0      0.0       0.0
  mercy           0.5         0.0      0.7        0.9      0.9       0.3
  worser          1.2         0.0      0.6        0.6      0.6       0.0

SLIDE 22

Vector Space Scoring

Querying

  • That is a nice matrix, but
  • How does it relate to scoring?
  • Next, vector space scoring
SLIDE 23

Vector Space Model

Vector Space Scoring

  • Define: Vector Space Model
  • Representing a set of documents as vectors in a common vector space.

  • It is fundamental to many operations
  • (query,document) pair scoring
  • document classification
  • document clustering
  • Queries are represented as a document
  • A short one, but mathematically equivalent
SLIDE 24

Vector Space Model

Vector Space Scoring

  • Define: Vector Space Model
  • A document, d, is defined as a vector: V(d)
  • One component for each term in the dictionary: V(d)_t
  • Assume each component is the term’s tf-idf score
  • A corpus is many vectors together.
  • A document can be thought of as a point in a multi-dimensional space, with axes related to terms.

V(d)_t = (1 + log(tf_t,d)) * log( |corpus| / df_t )
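Building document vectors over a shared dictionary might look like the following sketch; the toy documents and helper names are mine, and base-10 logs are assumed:

```python
import math
from collections import Counter

# Toy corpus: each document becomes one vector with a tf-idf component
# per dictionary term, as defined on the slide.
docs = [["caesar", "brutus", "caesar"],
        ["brutus", "calpurnia"],
        ["mercy", "worser"]]
dictionary = sorted({t for d in docs for t in d})
N = len(docs)
df = Counter(t for d in docs for t in set(d))

def vectorize(doc):
    tf = Counter(doc)
    return [(1 + math.log10(tf[t])) * math.log10(N / df[t]) if tf[t] else 0.0
            for t in dictionary]

V = [vectorize(d) for d in docs]  # each row is a point in term space
```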

SLIDE 25

Vector Space Model

Vector Space Scoring

  • Recall our Shakespeare Example:

  term        Antony and   Julius    The        Hamlet   Othello   Macbeth
              Cleopatra    Caesar    Tempest
  Antony         13.1        11.4      0.0        0.0      0.0       0.0
  Brutus          3.0         8.3      0.0        1.0      0.0       0.0
  Caesar          2.3         2.3      0.0        0.5      0.3       0.3
  Calpurnia       0.0        11.2      0.0        0.0      0.0       0.0
  Cleopatra      17.7         0.0      0.0        0.0      0.0       0.0
  mercy           0.5         0.0      0.7        0.9      0.9       0.3
  worser          1.2         0.0      0.6        0.6      0.6       0.0

  • Each column is a document vector: V(d1) is the Antony and Cleopatra column, V(d2) the Julius Caesar column, and so on through V(d6), the Macbeth column; V(d6)_7 is its “worser” component.
SLIDE 30

Vector Space Model

Vector Space Scoring

  • Recall our Shakespeare Example:

[Figure: the six plays plotted as points in a 2-D term space with axes “Antony” and “Brutus”]

SLIDE 32

Vector Space Model

Vector Space Scoring

  • Recall our Shakespeare Example:

[Figure: the six plays plotted as points in a 2-D term space with axes “mercy” and “worser”]

SLIDE 33

Query as a vector

Vector Space Scoring

  • So a query can also be plotted in the same space
  • “worser mercy”
  • To score, we ask:
  • How similar are two points?
  • How to answer?

[Figure: the query “worser mercy” plotted as a point alongside the six plays in the mercy/worser term space]

SLIDE 34

Score by magnitude

Vector Space Scoring

  • How to answer?
  • Similarity of magnitude?
  • But two documents that are similar in content but different in length can have large differences in magnitude.

[Figure: document vectors V(d1) through V(d5) of different magnitudes in term space]
SLIDE 35

Score by angle

Vector Space Scoring

  • How to answer?
  • Similarity of relative positions, or difference in angle
  • Two documents are similar if the angle between them is 0.
  • As long as the ratios of the axes are the same, the documents will be scored as equal.
  • This is measured by the dot product.

[Figure: document vectors V(d1) through V(d5), with angle θ between two of them]

SLIDE 36

Score by angle

Vector Space Scoring

  • Rather than use the angle
  • use the cosine of the angle
  • When sorting, cosine and angle are equivalent
  • Cosine is monotonically decreasing as a function of angle over (0 ... 180)

[Figure: document vectors V(d1) through V(d5), with angle θ between two of them]

SLIDE 37

Big picture

Vector Space Scoring

  • Why are we turning documents and queries into vectors?
  • Getting away from Boolean retrieval
  • Developing ranked retrieval methods
  • Developing scores for ranked retrieval
  • Term weighting allows us to compute scores for document similarity
  • The vector space model is a clean mathematical model to work with

SLIDE 38

Big picture

Vector Space Scoring

  • Cosine similarity measure
  • Gives us a symmetric score
  • if d_1 is close to d_2, d_2 is close to d_1
  • Gives us transitivity
  • if d_1 is close to d_2, and d_2 is close to d_3, then d_1 is also close to d_3
  • No document is closer to d_1 than itself
  • If vectors are normalized (length = 1) then
  • The similarity score is just the dot product (fast)
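These properties can be sanity-checked in a few lines; the helper names below are mine, not the slides’:

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])
# On unit-length vectors the cosine similarity is just the dot product:
print(round(dot(u, u), 12))       # 1.0: nothing is closer to u than u itself
print(dot(u, v) == dot(v, u))     # True: the score is symmetric
```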
SLIDE 39

Queries in the vector space model

Vector Space Scoring

  • Central idea: the query is a vector
  • We regard the query as a short document
  • We return the documents ranked by the closeness of their vectors to the query (also a vector)
  • Note that q is very sparse!

sim(q, d_i) = ( V(q) · V(d_i) ) / ( |V(q)| |V(d_i)| )

SLIDE 40

Cosine Similarity Score

Vector Space Scoring

[Figure: document vectors V(d1) through V(d5), with angle θ between V(d1) and V(d2)]

V(d1) · V(d2) = cos(θ) · |V(d1)| |V(d2)|

cos(θ) = ( V(d1) · V(d2) ) / ( |V(d1)| |V(d2)| )

sim(d1, d2) = ( V(d1) · V(d2) ) / ( |V(d1)| |V(d2)| )

SLIDE 41

Cosine Similarity Score

Vector Space Scoring

  • Define: dot product

V(d1) · V(d2) = Σ (i = t1 .. tn) V(d1)_i · V(d2)_i

(components taken from the Shakespeare tf-idf matrix above)

V(d1) · V(d2) = (13.1 * 11.4) + (3.0 * 8.3) + (2.3 * 2.3) + (0.0 * 11.2) + (17.7 * 0.0) + (0.5 * 0.0) + (1.2 * 0.0) = 179.53

SLIDE 42

Cosine Similarity Score

Vector Space Scoring

  • Define: Euclidean Length

|V(d1)| = sqrt( Σ (i = t1 .. tn) V(d1)_i · V(d1)_i )

(nonzero components taken from the Shakespeare tf-idf matrix above)

|V(d1)| = sqrt( (13.1 * 13.1) + (3.0 * 3.0) + (2.3 * 2.3) + (17.7 * 17.7) + (0.5 * 0.5) + (1.2 * 1.2) ) = 22.38

SLIDE 43

Cosine Similarity Score

Vector Space Scoring

  • Define: Euclidean Length

|V(d2)| = sqrt( Σ (i = t1 .. tn) V(d2)_i · V(d2)_i )

(nonzero components taken from the Shakespeare tf-idf matrix above)

|V(d2)| = sqrt( (11.4 * 11.4) + (8.3 * 8.3) + (2.3 * 2.3) + (11.2 * 11.2) ) = 18.15

SLIDE 44

Cosine Similarity Score

Vector Space Scoring

  • Example

sim(d1, d2) = ( V(d1) · V(d2) ) / ( |V(d1)| |V(d2)| ) = 179.53 / (22.38 * 18.15) = 0.442
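The arithmetic on this slide can be reproduced directly from the two matrix columns:

```python
import math

# Antony-and-Cleopatra (d1) and Julius-Caesar (d2) columns of the matrix.
d1 = [13.1, 3.0, 2.3, 0.0, 17.7, 0.5, 1.2]
d2 = [11.4, 8.3, 2.3, 11.2, 0.0, 0.0, 0.0]

dot = sum(a * b for a, b in zip(d1, d2))   # dot product: 179.53
len1 = math.sqrt(sum(a * a for a in d1))   # Euclidean length of d1: 22.38
len2 = math.sqrt(sum(b * b for b in d2))   # Euclidean length of d2: 18.15
sim = dot / (len1 * len2)
print(round(sim, 3))  # 0.442
```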

SLIDE 45

Exercise

Vector Space Scoring

  • Rank the following by decreasing cosine similarity.
  • Assume tf-idf weighting:
  • Two docs that have only frequent words in common
  • (the, a, an, of)
  • Two docs that have no words in common
  • Two docs that have many rare words in common
  • (mocha, volatile, organic, shade-grown)
SLIDE 46

Spamming indices

Vector Space Scoring

  • This was invented before spam
  • Consider:
  • Indexing a sensible, passive document collection
  • vs.
  • Indexing an active document collection, where people, companies, and bots are shaping documents to maximize scores
  • Vector space scoring may not be as useful in this context.
SLIDE 47

Interaction: vectors and phrases

Vector Space Scoring

  • Scoring phrases doesn’t naturally fit into the vector space world:
  • How do we get beyond the “bag of words”?
  • “dark roast” and “pot roast”
  • There is no information on “dark roast” as a phrase in our indices.
  • A biword index can treat some phrases as terms
  • postings for phrases
  • document-wide statistics for phrases
SLIDE 48

Interaction: vectors and phrases

Vector Space Scoring

  • Theoretical problem:
  • Axes of our term space are now correlated
  • There is a lot of shared information in the “light roast” and “dark roast” rows of our index
  • End-user problem:
  • A user doesn’t know which phrases are indexed and can’t effectively discriminate results.

SLIDE 49

Multiple queries for phrases and vectors

Vector Space Scoring

  • Query: “rising interest rates”
  • Iterative refinement:
  • Run the phrase query with the 3 words as one term.
  • If not enough results, run two-word phrase queries and fold them into the results: “rising interest”, “interest rates”
  • If still not enough results, run the query with the three words as separate terms.
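One possible shape for this fallback loop, using an invented toy `search` helper that does naive substring phrase matching (a real engine would consult positional or biword postings):

```python
# Toy documents; purely illustrative, not part of any real engine.
docs = {
    1: "rising interest rates worry markets",
    2: "interest rates fell sharply",
    3: "rising tide lifts boats",
}

def search(phrase):
    """Ids of documents containing `phrase` as a contiguous phrase."""
    return {i for i, text in docs.items() if phrase in text}

def phrase_query(words, min_results=2):
    hits = search(" ".join(words))              # 1. whole phrase as one term
    if len(hits) < min_results:                 # 2. fold in two-word phrases
        for i in range(len(words) - 1):
            hits |= search(" ".join(words[i:i + 2]))
    if len(hits) < min_results:                 # 3. fall back to single terms
        for w in words:
            hits |= search(w)
    return hits

print(phrase_query(["rising", "interest", "rates"]))  # {1, 2}
```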

SLIDE 50

Vectors and Boolean queries

Vector Space Scoring

  • Ranked queries and Boolean queries don’t work very well together
  • In term space
  • ranked queries select based on sector containment (cosine similarity)
  • Boolean queries select based on rectangle unions and intersections

[Figure: left, document vectors V(d1) through V(d5) with angle θ; right, documents inside the Boolean region X ∩ Y]

SLIDE 56

Vectors and wild cards

Vector Space Scoring

  • How could we work with the query, “quick* print*”?
  • Can we view this as a bag of words?
  • What about expanding each wild-card into the matching set of dictionary terms?
  • Danger: unlike the Boolean case, we now have tfs and idfs to deal with
  • Overall, not a great idea
SLIDE 57

Vectors and other operators

Vector Space Scoring

  • Vector space queries are good for no-syntax, bag-of-words queries
  • Nice mathematical formalism
  • Clear metaphor for similar-document queries
  • Doesn’t work well with Boolean, wild-card, or positional query operators
  • But ...
SLIDE 58

Query language vs. Scoring

Vector Space Scoring

  • Interfaces to the rescue
  • Free text queries are often separated from the operator query language
  • Default is a free text query
  • Advanced query operators are available in the “advanced query” section of the interface
  • Or embedded in the free text query with special syntax
  • e.g. -term -”terma termb”
SLIDE 59

Alternatives to tf-idf

Vector Space Scoring

  • Sublinear tf scaling
  • 20 occurrences of “mole” do not indicate 20 times the relevance
  • This motivated the WTF score.
  • There are other variants for reducing the impact of repeated terms

WTF(t, d)
1  if tf_t,d = 0
2    then return 0
3    else return 1 + log(tf_t,d)
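In Python the pseudocode above is a one-liner; I drop the (t, d) indexing and pass the raw term frequency, and assume base-10 logs as in the tf-idf examples:

```python
import math

def wtf(tf):
    """Sublinear tf scaling: 0 if tf == 0, else 1 + log10(tf)."""
    if tf == 0:
        return 0.0
    return 1.0 + math.log10(tf)

# 20 occurrences score only about 2.3x one occurrence, not 20x:
print(wtf(1), round(wtf(20), 2))  # 1.0 2.3
```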
