Information Retrieval Tutorial 4: Vector Space Model

  1. Information Retrieval Tutorial 4: Vector Space Model. Professor: Michel Schellekens. TA: Ang Gao. University College Cork, 2012-11-15. Vector Space Model 1 / 21

  2. Outline: Review

  3. Simple Boolean vs. ranking of result set: Simple Boolean retrieval returns matching documents in no particular order. Google (and most well-designed Boolean engines) ranks the result set: good hits (according to some estimator of relevance) are ranked higher than bad hits.

  13. Ranked retrieval: Thus far, our queries have been Boolean: documents either match or don’t. This is good for expert users with a precise understanding of their needs and of the collection. It is also good for applications, which can easily consume 1000s of results. It is not good for the majority of users: most users are not capable of writing Boolean queries (or they are, but they think it’s too much work), and most users don’t want to wade through 1000s of results. This is particularly true of web search.

  16. Problem with Boolean search: feast or famine. Boolean queries often result in either too few (= 0) or too many (1000s) results. In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few; OR gives too many.
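
The feast-or-famine effect can be reproduced on a toy collection. The following is a minimal sketch, not part of the original slides: the four documents, the query, and the helper names `tokens`, `boolean_and`, and `boolean_or` are all made up for illustration, and tokenization is naive whitespace splitting.

```python
# Toy collection; documents and query are invented for illustration.
docs = {
    1: "cork city tourism guide",
    2: "university college cork",
    3: "college football results",
    4: "university rankings 2012",
}

def tokens(text):
    """Lower-case and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())

def boolean_and(query, docs):
    """IDs of documents that contain every query term."""
    q = tokens(query)
    return sorted(doc_id for doc_id, text in docs.items() if q <= tokens(text))

def boolean_or(query, docs):
    """IDs of documents that contain at least one query term."""
    q = tokens(query)
    return sorted(doc_id for doc_id, text in docs.items() if q & tokens(text))

query = "university college cork tourism"
and_hits = boolean_and(query, docs)
or_hits = boolean_or(query, docs)
print(and_hits)  # [] : no document contains all four terms
print(or_hits)   # [1, 2, 3, 4] : every document contains at least one term
```

With AND the user gets nothing; with OR the user gets the whole collection, in no useful order.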

  21. Scoring as the basis of ranked retrieval: We wish to rank documents that are more relevant higher than documents that are less relevant. How can we accomplish such a ranking of the documents in the collection with respect to a query? Assign a score to each query-document pair, say in [0, 1]. This score measures how well document and query “match”.
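
The idea of scoring each query-document pair and sorting by score can be sketched as follows. This is only an illustration: `overlap_score` is a placeholder scoring function chosen for simplicity (the fraction of distinct query terms that occur in the document), not a measure taken from the slides, and the documents are invented.

```python
def overlap_score(query, document):
    """Placeholder score in [0, 1]: fraction of distinct query terms found in the document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rank(query, documents, score=overlap_score):
    """Return (score, document) pairs, highest score first."""
    scored = [(score(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "cork city tourism guide",
    "university college cork",
    "college football results",
]
ranking = rank("university college cork", documents)
for s, doc in ranking:
    print(f"{s:.2f}  {doc}")
```

Any function that maps a query-document pair to a number in [0, 1] can be passed as `score`; the ranking step itself does not change.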

  29. Take 1: Jaccard coefficient. A commonly used measure of overlap of two sets. Let $A$ and $B$ be two sets. Jaccard coefficient: $\mathrm{jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$ (for $A \neq \emptyset$ or $B \neq \emptyset$). $\mathrm{jaccard}(A, A) = 1$. $\mathrm{jaccard}(A, B) = 0$ if $A \cap B = \emptyset$. $A$ and $B$ don’t have to be the same size. The coefficient is always a number between 0 and 1.
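
The definition translates almost directly into code. The sketch below is one possible rendering; raising `ValueError` when both sets are empty is a choice made here for the case the definition excludes, and the example sets are invented.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets.

    The coefficient is undefined when both sets are empty;
    this sketch raises ValueError in that case.
    """
    a, b = set(a), set(b)
    if not a and not b:
        raise ValueError("jaccard is undefined for two empty sets")
    return len(a & b) / len(a | b)

A = {"university", "college", "cork"}
B = {"city", "tourism"}
print(jaccard(A, A))  # 1.0 : identical sets
print(jaccard(A, B))  # 0.0 : disjoint sets
print(jaccard(A, {"cork"}))  # sets of different sizes are fine
```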

  33. Jaccard coefficient: Example. Problem 1: What is the query-document match score that the Jaccard coefficient computes for the query “University College Cork” and the document “Cork City Tourism guide”? Answer: $\mathrm{jaccard}(q, d) = 1/6$.
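
The value 1/6 can be checked by writing out the two term sets, assuming the query and the document are split into words and each word is treated as one set element:

```latex
\begin{align*}
q &= \{\text{University}, \text{College}, \text{Cork}\} \\
d &= \{\text{Cork}, \text{City}, \text{Tourism}, \text{guide}\} \\
q \cap d &= \{\text{Cork}\}, \qquad |q \cap d| = 1 \\
|q \cup d| &= |q| + |d| - |q \cap d| = 3 + 4 - 1 = 6 \\
\mathrm{jaccard}(q, d) &= \frac{|q \cap d|}{|q \cup d|} = \frac{1}{6}
\end{align*}
```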

  37. What’s wrong with Jaccard? It doesn’t consider term frequency (tf), i.e. how many occurrences a term has. Rare terms are more informative than frequent terms; Jaccard does not consider this information (idf). We also need a more sophisticated way of normalizing for the length of a document.
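
The first two shortcomings are easy to see on small inputs. The following sketch uses invented queries and documents; it only demonstrates that a set-based measure cannot tell repeated terms from single occurrences, or a match on a very common word from a match on a rare one.

```python
def jaccard(a, b):
    """Jaccard coefficient of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# 1) Term frequency is ignored: repeating a term does not change the score.
query = "cork tourism".split()
doc_once = "cork tourism guide".split()
doc_many = "cork cork cork tourism tourism guide".split()
score_once = jaccard(query, doc_once)
score_many = jaccard(query, doc_many)
print(score_once == score_many)  # True

# 2) Rarity is ignored: matching only "the" scores the same as matching only "cork".
query2 = "the cork".split()
doc_common = "the city".split()   # shares only the very common word "the"
doc_rare = "cork city".split()    # shares only the more informative word "cork"
score_common = jaccard(query2, doc_common)
score_rare = jaccard(query2, doc_rare)
print(score_common == score_rare)  # True
```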

  39. tf-idf weighting: The tf-idf weight of a term is the product of its tf weight and its idf weight.
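
The slide states only that the weight is a product of a tf weight and an idf weight; it does not say how each factor is computed. The sketch below assumes the common log-scaled variants, $1 + \log_{10}(\mathrm{tf})$ for the tf weight and $\log_{10}(N / \mathrm{df})$ for the idf weight, which may differ from what the course uses. The function names and the four-document collection are hypothetical.

```python
import math
from collections import Counter

def tf_weight(tf):
    """Log-scaled term frequency (assumed variant): 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(df, n_docs):
    """Inverse document frequency (assumed variant): log10(N / df)."""
    return math.log10(n_docs / df)

def tf_idf(term, doc_tokens, collection):
    """tf-idf weight of `term` in one tokenized document of `collection`."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for tokens in collection if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf_weight(tf) * idf_weight(df, len(collection))

# Invented toy collection of tokenized documents.
collection = [
    "cork city tourism guide".split(),
    "university college cork".split(),
    "cork cork harbour guide".split(),
    "dublin city guide".split(),
]
doc = collection[2]
w_cork = tf_idf("cork", doc, collection)
w_harbour = tf_idf("harbour", doc, collection)
print(w_cork, w_harbour)
```

In this toy collection "cork" occurs twice in the chosen document but appears in three of the four documents, while "harbour" occurs once and appears in only one document, so under these assumed formulas "harbour" receives the higher weight.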
