Introduction to Information Retrieval - PowerPoint PPT Presentation

Boolean model and Inverted index Processing Boolean queries Why ranked retrieval? Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural Language Processing,


  1. Term-document incidence matrix

                 Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
                 Cleopatra     Caesar   Tempest
     Anthony     1             1        0         0        0         1
     Brutus      1             1        0         1        0         0
     Caesar      1             1        0         1        1         1
     Calpurnia   0             1        0         0        0         0
     Cleopatra   1             0        0         0        0         0
     mercy       1             0        1         1        1         1
     worser      1             0        1         1        1         0
     ...

     Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar.
     Entry is 0 if term doesn’t occur. Example: Calpurnia doesn’t occur in The Tempest.
     We will return to this matrix many times in this class.
     Schütze: Boolean retrieval 9 / 30
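
A minimal sketch of how the matrix above could be queried, assuming each row is held as a list of 0/1 bits. The query Brutus AND Caesar AND NOT Calpurnia and the helper names (`row_and`, `row_not`, `plays_matching`) are illustrative assumptions, not taken from the slide.

```python
# Columns of the incidence matrix, in the order shown on the slide.
PLAYS = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One row per term: 1 if the term occurs in the play, 0 otherwise.
INCIDENCE = {
    "Anthony":   [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}


def row_and(a, b):
    """Bitwise AND of two rows (hypothetical helper)."""
    return [x & y for x, y in zip(a, b)]


def row_not(a):
    """Bitwise complement of a row (hypothetical helper)."""
    return [1 - x for x in a]


def plays_matching(row):
    """Names of the plays whose bit is set in the row."""
    return [play for play, bit in zip(PLAYS, row) if bit]


# Illustrative query: Brutus AND Caesar AND NOT Calpurnia
result = row_and(row_and(INCIDENCE["Brutus"], INCIDENCE["Caesar"]),
                 row_not(INCIDENCE["Calpurnia"]))
print(plays_matching(result))
```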

  5. We can’t build the incidence matrix for large collections
     - Size of incidence matrix: number of documents times number of terms → too large for large collections
     - But the matrix is very sparse: mostly 0s, few 1s.
     - Inverted index: We only record the 1s.
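
To make the size argument concrete, here is a back-of-the-envelope calculation. The collection figures (one million documents, 500,000 distinct terms, about 1,000 tokens per document) are assumed purely for illustration and do not come from the slide.

```python
# Assumed, illustrative collection statistics (not from the slide).
num_docs = 1_000_000        # documents in the collection
num_terms = 500_000         # distinct terms in the vocabulary
tokens_per_doc = 1_000      # average document length in tokens

# The full matrix has one cell per (term, document) pair.
matrix_cells = num_docs * num_terms

# A document can switch on at most one cell per token it contains,
# so the number of 1s is bounded by the total token count.
max_ones = num_docs * tokens_per_doc

fraction = max_ones / matrix_cells
print(f"{matrix_cells:,} cells, at most {max_ones:,} ones ({fraction:.2%})")
```

Under these assumptions at least 99.8% of the cells are 0, which is why storing only the 1s pays off.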

  7. Inverted Index
     For each term t, we store a list of all documents that contain t.
     Equivalently: for each term t, we store the 1s in its row in the incidence matrix.

     Brutus     → 1 2 4 11 31 45 173 174
     Caesar     → 1 2 4 5 6 16 57 132 ...
     Calpurnia  → 2 31 54 101
     ...
     (left column: dictionary; right column: postings)
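
A small sketch of building such an index, assuming documents are plain strings keyed by an integer ID and that lowercasing plus whitespace splitting is an acceptable stand-in for real tokenization. The toy documents and the function name `build_inverted_index` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Simplified tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted postings lists are what later makes linear-time merging possible.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar married Calpurnia",
    4: "Brutus was ambitious",
}

index = build_inverted_index(docs)
print(index["brutus"], index["caesar"], index["calpurnia"])
```

Only the (term, document) pairs that actually occur are stored, which corresponds to recording just the 1s of the incidence matrix.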

  8. Outline
     1 Boolean model and Inverted index
     2 Processing Boolean queries
     3 Why ranked retrieval?

  17. Simple conjunctive query (two terms)
      Consider the query: Brutus AND Calpurnia
      To find all matching documents using inverted index:
      1. Locate Brutus in the dictionary
      2. Retrieve its postings list from the postings file
      3. Locate Calpurnia in the dictionary
      4. Retrieve its postings list from the postings file
      5. Intersect the two postings lists
      6. Return intersection to user
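
The six steps can be sketched as follows, assuming the dictionary and postings file are modeled together as one in-memory Python dict. The postings lists are the ones shown on the slides. The function name `conjunctive_query` is hypothetical, and the intersection here uses a set lookup for brevity rather than the linear merge.

```python
# Dictionary and postings modeled as a single in-memory mapping (an assumption).
INDEX = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Calpurnia": [2, 31, 54, 101],
}


def conjunctive_query(index, term1, term2):
    """Answer `term1 AND term2` against the index."""
    # Steps 1-2: locate the first term and retrieve its postings list.
    postings1 = index.get(term1, [])
    # Steps 3-4: locate the second term and retrieve its postings list.
    postings2 = index.get(term2, [])
    # Step 5: intersect the two lists, keeping the order of the first.
    in_second = set(postings2)
    # Step 6: return the intersection to the caller.
    return [doc_id for doc_id in postings1 if doc_id in in_second]


print(conjunctive_query(INDEX, "Brutus", "Calpurnia"))
```

A term that is missing from the dictionary yields an empty postings list, so the conjunction is empty as well.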

  30. Intersecting two postings lists
      Brutus        → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
      Calpurnia     → 2 → 31 → 54 → 101
      Intersection  ⇒ 2 → 31
      This is linear in the length of the postings lists.
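
A sketch of the merge-style intersection, assuming both postings lists are sorted by document ID. Each loop iteration advances at least one of the two pointers, which is where the linear bound comes from. The function name `intersect` is chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted in ascending order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document appears in both lists: keep it, advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # The smaller ID cannot occur later in the other list: skip it.
            i += 1
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

The loop runs at most len(p1) + len(p2) times, so the cost grows with the combined length of the two lists.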

  40. Boolean queries
      - The example was a simple conjunctive query ...
      - ... the Boolean retrieval model can answer any query that is a Boolean expression.
      - Boolean queries are queries that use AND, OR and NOT to join query terms.
      - Views each document as a set of terms.
      - Is precise: Document matches condition or not.
      - Primary commercial retrieval tool for 3 decades
      - Many professional searchers (e.g., lawyers) still like Boolean queries.
      - You know exactly what you are getting.
      - Many search systems you use are also Boolean: search system on your laptop, in your email reader, on the intranet, etc.
      - So are we done?
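
One way to picture the three operators is as set operations over postings, assuming each postings list is held as a Python set and the full set of document IDs is known (NOT needs it). The toy postings, the eight-document collection and the helper names are all hypothetical.

```python
# Hypothetical collection of eight documents with IDs 1..8.
ALL_DOCS = set(range(1, 9))

# Hypothetical postings, stored as sets of document IDs.
POSTINGS = {
    "brutus": {1, 2, 4},
    "caesar": {1, 2, 4, 5, 6},
    "calpurnia": {2},
}


def op_and(a, b):
    """Documents matching both operands."""
    return a & b


def op_or(a, b):
    """Documents matching at least one operand."""
    return a | b


def op_not(a):
    """Documents in the collection that do not match the operand."""
    return ALL_DOCS - a


# Brutus AND Caesar AND NOT Calpurnia
q1 = op_and(op_and(POSTINGS["brutus"], POSTINGS["caesar"]),
            op_not(POSTINGS["calpurnia"]))
# Brutus OR Calpurnia
q2 = op_or(POSTINGS["brutus"], POSTINGS["calpurnia"])

print(sorted(q1), sorted(q2))
```

Every document is either in the result set or not, which is the "precise" property the slide refers to.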

  41. Outline
      1 Boolean model and Inverted index
      2 Processing Boolean queries
      3 Why ranked retrieval?

  50. The Boolean model: Pros and Cons
      - Key property: Documents either match or don’t.
      - Good for expert users with precise understanding of their needs and of the collection.
      - Also good for applications: Applications can easily consume 1000s of results.
      - Not good for the majority of users
      - Most users are not capable of writing Boolean queries ...
      - ... or they are, but they think it’s too much work.
      - Most users don’t want to wade through 1000s of results.
      - This is particularly true of web search.

  57. Problem with Boolean search: Feast or famine
      - Boolean queries often result in either too few (=0) or too many (1000s) results.
      - Query 1 (Boolean conjunction): [standard user dlink 650] → 200,000 hits (feast)
      - Query 2 (Boolean conjunction): [standard user dlink 650 no card found] → 0 hits (famine)
      - In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits.

  61. Feast or famine: No problem in ranked retrieval
      - With ranking, large result sets are not an issue.
      - Just show the top 10 results and the user won’t be overwhelmed.
      - Premise: the ranking algorithm works: More relevant results are ranked higher than less relevant results.
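
A minimal sketch of the "show the top 10" idea, assuming some ranking function has already assigned every matching document a relevance score. The scores below are made up, and `top_k` is a hypothetical helper built on the standard library's `heapq.nlargest`.

```python
import heapq


def top_k(scores, k=10):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Made-up relevance scores for 25 matching documents (all distinct).
scores = {doc_id: (doc_id * 7) % 25 / 25 for doc_id in range(1, 26)}

for doc_id, score in top_k(scores):
    print(doc_id, score)
```

However many documents match, the user sees only `k` of them, so the size of the full result set no longer matters as long as the scores reflect relevance.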

  71. Empirical investigation of the effect of ranking
      - How can we measure how important ranking is?
      - Observe what searchers do when they are searching in a controlled setting
      - Videotape them
      - Ask them to “think aloud”
      - Interview them
      - Eye-track them
      - Time them
      - Record and count their clicks
      - The following slides are from Dan Russell’s 2007 JCDL talk
      - Dan Russell was the “Über Tech Lead for Search Quality & User Happiness” at Google.
