vector space scoring

Vector Space Scoring Introduction to Information Retrieval INF 141 - PowerPoint PPT Presentation

  1. Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org

  2. Querying Corpus-wide statistics

  4. Querying Corpus-wide statistics • Collection Frequency, cf • Define: The total number of occurrences of the term in the entire corpus • Document Frequency, df • Define: The total number of documents in the corpus that contain the term
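
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the toy corpus and the function names `collection_frequency` and `document_frequency` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of already-tokenized words.
corpus = [
    ["insurance", "claim", "insurance", "policy"],
    ["try", "to", "file", "a", "claim"],
    ["try", "again", "and", "try", "harder"],
]


def collection_frequency(docs):
    """cf: total number of occurrences of each term across the whole corpus."""
    cf = Counter()
    for doc in docs:
        cf.update(doc)  # every occurrence counts
    return cf


def document_frequency(docs):
    """df: number of documents that contain each term at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts at most once per term
    return df


cf = collection_frequency(corpus)
df = document_frequency(corpus)
print(cf["insurance"], df["insurance"])  # 2 1
print(cf["try"], df["try"])              # 3 2
```

The only difference between the two counts is the `set(doc)` call, which collapses repeated occurrences inside one document.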

  7. Querying Corpus-wide statistics

     | Word | Collection Frequency | Document Frequency |
     |---|---|---|
     | insurance | 10440 | 3997 |
     | try | 10422 | 8760 |

     • The two words have nearly the same cf but very different df, which suggests that df is better at discriminating between documents • How do we use df?

  18. Querying Corpus-wide statistics • Term-Frequency, Inverse Document Frequency Weights • “tf-idf” • tf = term frequency • some measure of term density in a document • idf = inverse document frequency • a measure of the informativeness of a term • its rarity across the corpus • could be just a count of documents with the term • more commonly it is: $\mathrm{idf}_t = \log\left(\frac{|\mathrm{corpus}|}{\mathrm{df}_t}\right)$

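  19. Querying TF-IDF Examples • $\mathrm{idf}_t = \log\left(\frac{|\mathrm{corpus}|}{\mathrm{df}_t}\right) = \log_{10}\left(\frac{1{,}000{,}000}{\mathrm{df}_t}\right)$

     | term | $\mathrm{df}_t$ | $\mathrm{idf}_t$ |
     |---|---|---|
     | calpurnia | 1 | 6 |
     | animal | 100 | 4 |
     | sunday | 1,000 | 3 |
     | fly | 10,000 | 2 |
     | under | 100,000 | 1 |
     | the | 1,000,000 | 0 |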
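
The idf values on this slide can be checked with a short script. This is a sketch under the slide's stated assumptions (a corpus of 1,000,000 documents and a base-10 logarithm); the function name `idf` is chosen for the example, and only a few of the rows are recomputed.

```python
import math


def idf(corpus_size, df):
    """Inverse document frequency: log10(|corpus| / df_t)."""
    return math.log10(corpus_size / df)


CORPUS_SIZE = 1_000_000  # corpus size used on the slide

# (term, document frequency) pairs taken from the slide
rows = [
    ("calpurnia", 1),
    ("sunday", 1_000),
    ("fly", 10_000),
    ("under", 100_000),
    ("the", 1_000_000),
]

for term, df in rows:
    # Rounded for display; the rarest term gets the largest weight.
    print(f"{term:10s} df={df:>9,d} idf={round(idf(CORPUS_SIZE, df), 2)}")
```

A term that occurs in every document, such as "the", gets an idf of 0 and therefore contributes nothing to a score.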

  20. Querying TF-IDF Summary • Assign a tf-idf weight to each term t in a document d: $\mathrm{tfidf}(t,d) = (1 + \log(\mathrm{tf}_{t,d})) \cdot \log\left(\frac{|\mathrm{corpus}|}{\mathrm{df}_t}\right)$ • Increases with number of occurrences of term in a doc. • Increases with rarity of term across entire corpus • Three different metrics • term frequency • document frequency • collection/corpus frequency
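
The weight above can be written as a small function. This is a sketch, not the course's reference implementation. Two points are assumptions: both logarithms are taken base 10 (the slides only show base 10 for idf), and a term that does not occur in the document gets weight 0, since the logarithm of 0 is undefined.

```python
import math


def tf_idf(tf, df, corpus_size):
    """Weight of a term in a document: (1 + log10(tf)) * log10(corpus_size / df).

    tf          -- occurrences of the term in the document
    df          -- number of documents that contain the term
    corpus_size -- total number of documents
    """
    if tf == 0:
        return 0.0  # assumption: absent terms get zero weight
    return (1 + math.log10(tf)) * math.log10(corpus_size / df)


N = 1_000_000
once = tf_idf(tf=1, df=1_000, corpus_size=N)       # (1 + 0) * 3
ten_times = tf_idf(tf=10, df=1_000, corpus_size=N)  # (1 + 1) * 3
rarer = tf_idf(tf=1, df=100, corpus_size=N)         # (1 + 0) * 4
print(round(once, 2), round(ten_times, 2), round(rarer, 2))
```

The three calls illustrate the two bullet points: the weight grows with the number of occurrences in the document and with the rarity of the term in the corpus.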

  21. Querying Now, real-valued term-document matrices • Bag of words model • Each element of matrix is tf-idf value

     | | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
     |---|---|---|---|---|---|---|
     | Antony | 13.1 | 11.4 | 0.0 | 0.0 | 0.0 | 0.0 |
     | Brutus | 3.0 | 8.3 | 0.0 | 1.0 | 0.0 | 0.0 |
     | Caesar | 2.3 | 2.3 | 0.0 | 0.5 | 0.3 | 0.3 |
     | Calpurnia | 0.0 | 11.2 | 0.0 | 0.0 | 0.0 | 0.0 |
     | Cleopatra | 17.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
     | mercy | 0.5 | 0.0 | 0.7 | 0.9 | 0.9 | 0.3 |
     | worser | 1.2 | 0.0 | 0.6 | 0.6 | 0.6 | 0.0 |
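
The "bag of words" idea means that only the counts of terms matter, not their order. A minimal sketch follows; the helper name `bag_of_words`, the whitespace tokenizer, and the two example sentences are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to term counts, discarding word order.

    Tokenization here is a simple lowercase whitespace split, which is
    an assumption for this sketch; real systems tokenize more carefully.
    """
    return Counter(text.lower().split())


a = bag_of_words("Mary is quicker than John")
b = bag_of_words("John is quicker than Mary")

# Different sentences, identical bags: order is lost.
print(a == b)  # True
```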

  22. Querying Vector Space Scoring • That is a nice matrix, but • How does it relate to scoring? • Next, vector space scoring

  23. Vector Space Scoring Vector Space Model • Define: Vector Space Model • Representing a set of documents as vectors in a common vector space. • It is fundamental to many operations • (query,document) pair scoring • document classification • document clustering • Queries are represented as a document • A short one, but mathematically equivalent

  24. Vector Space Scoring Vector Space Model • Define: Vector Space Model • A document, d, is defined as a vector: $\vec{V}(d)$ • One component for each term in the dictionary • Assume each component is the tf-idf score of its term: $\vec{V}(d)_t = (1 + \log(\mathrm{tf}_{t,d})) \cdot \log\left(\frac{|\mathrm{corpus}|}{\mathrm{df}_t}\right)$ • A corpus is many vectors together. • A document can be thought of as a point in a multi-dimensional space, with axes related to terms.
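
The definition above can be sketched as code that turns each document into a vector with one tf-idf component per dictionary term. The three toy documents, the names `weight` and `vectorize`, the base-10 logarithms, and the zero weight for absent terms are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized documents.
docs = {
    "d1": "antony brutus caesar antony".split(),
    "d2": "brutus caesar calpurnia".split(),
    "d3": "caesar mercy mercy".split(),
}

# The dictionary fixes the order of the vector components.
dictionary = sorted({t for tokens in docs.values() for t in tokens})
# Document frequency: each document counts once per term.
df = Counter(t for tokens in docs.values() for t in set(tokens))
n_docs = len(docs)


def weight(tf, df_t):
    """tf-idf component; absent terms get 0 (assumption)."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df_t)


def vectorize(tokens):
    """Return the vector for a token list; terms outside the dictionary are ignored."""
    counts = Counter(tokens)
    return [weight(counts[t], df[t]) for t in dictionary]


vectors = {name: vectorize(tokens) for name, tokens in docs.items()}
for name, vec in vectors.items():
    print(name, [round(x, 3) for x in vec])
```

In this toy corpus "caesar" occurs in every document, so its component is 0 in every vector. A query can be passed through the same `vectorize` function, which matches the earlier point that a query is just a short document.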

  28. Vector Space Scoring Vector Space Model • Recall our Shakespeare Example:

     | | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
     |---|---|---|---|---|---|---|
     | Antony | 13.1 | 11.4 | 0.0 | 0.0 | 0.0 | 0.0 |
     | Brutus | 3.0 | 8.3 | 0.0 | 1.0 | 0.0 | 0.0 |
     | Caesar | 2.3 | 2.3 | 0.0 | 0.5 | 0.3 | 0.3 |
     | Calpurnia | 0.0 | 11.2 | 0.0 | 0.0 | 0.0 | 0.0 |
     | Cleopatra | 17.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
     | mercy | 0.5 | 0.0 | 0.7 | 0.9 | 0.9 | 0.3 |
     | worser | 1.2 | 0.0 | 0.6 | 0.6 | 0.6 | 0.0 |

     • Each column is a document vector: $\vec{V}(d_1)$, $\vec{V}(d_2)$, …, $\vec{V}(d_6)$ • A single cell is one component of a vector, e.g. $\vec{V}(d_6)_7$
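
The Shakespeare matrix can be held as a list of rows, with each document vector read off as a column. This is a sketch using the values from the slide. The assumption that $d_1$ through $d_6$ number the columns from left to right, and the helper name `document_vector`, are not stated in the source.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
terms = ["Antony", "Brutus", "Caesar", "Calpurnia", "Cleopatra",
         "mercy", "worser"]

# Rows are terms, columns are plays (tf-idf values from the slide).
matrix = [
    [13.1, 11.4, 0.0, 0.0, 0.0, 0.0],  # Antony
    [3.0,  8.3,  0.0, 1.0, 0.0, 0.0],  # Brutus
    [2.3,  2.3,  0.0, 0.5, 0.3, 0.3],  # Caesar
    [0.0,  11.2, 0.0, 0.0, 0.0, 0.0],  # Calpurnia
    [17.7, 0.0,  0.0, 0.0, 0.0, 0.0],  # Cleopatra
    [0.5,  0.0,  0.7, 0.9, 0.9, 0.3],  # mercy
    [1.2,  0.0,  0.6, 0.6, 0.6, 0.0],  # worser
]


def document_vector(j):
    """Return column j (0-based) of the matrix as a list, one entry per term."""
    return [row[j] for row in matrix]


v_d1 = document_vector(0)  # Antony and Cleopatra (assumed to be d1)
v_d6 = document_vector(5)  # Macbeth (assumed to be d6)

print(v_d1)     # [13.1, 3.0, 2.3, 0.0, 17.7, 0.5, 1.2]
print(v_d6[6])  # 0.0, the seventh component (worser) of the Macbeth vector
```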

  30. Vector Space Scoring Vector Space Model • Recall our Shakespeare Example: [Figure: Julius Caesar, Antony and Cleopatra, Hamlet, Tempest, Othello, and MacBeth plotted as points on axes labeled Brutus and Antony]

  32. Vector Space Scoring Vector Space Model • Recall our Shakespeare Example: [Figure: Antony and Cleopatra, Tempest, Hamlet, Othello, MacBeth, and Julius Caesar plotted as points on axes labeled worser and mercy]

  33. Vector Space Scoring Query as a vector • So a query can also be plotted in the same space • “worser mercy” • To score, we ask: • How similar are two points? • How to answer? [Figure: the query plotted alongside Antony and Cleopatra, Tempest, Hamlet, Othello, MacBeth, and Julius Caesar on axes labeled worser and mercy]
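
This excerpt leaves the question open. As a sketch, here are two candidate ways to compare the query with each play in the (mercy, worser) plane: cosine similarity and Euclidean distance. Neither is named in the slides above, so both are assumptions, as is the query vector (1.0, 1.0) standing in for "worser mercy". The play coordinates are the mercy and worser rows of the Shakespeare matrix.

```python
import math

# (mercy, worser) components of each play, taken from the tf-idf matrix.
points = {
    "Antony and Cleopatra": (0.5, 1.2),
    "Julius Caesar": (0.0, 0.0),
    "The Tempest": (0.7, 0.6),
    "Hamlet": (0.9, 0.6),
    "Othello": (0.9, 0.6),
    "Macbeth": (0.3, 0.0),
}

# Assumption: equal weight on both query terms.
query = (1.0, 1.0)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


by_cosine = sorted(points, key=lambda p: cosine_similarity(query, points[p]),
                   reverse=True)
by_distance = sorted(points, key=lambda p: euclidean_distance(query, points[p]))

for name in by_cosine:
    print(f"{name:22s} cos={cosine_similarity(query, points[name]):.3f} "
          f"dist={euclidean_distance(query, points[name]):.3f}")
```

On this data the two measures disagree about the best match: cosine similarity ranks The Tempest first, while Euclidean distance puts Hamlet and Othello closest to the query. Which notion of "similar" to use is a design choice.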
