Introduction to Information Retrieval - PowerPoint PPT Presentation

Topics: Probabilistic Approach to IR; Binary independence model; Okapi BM25. Introduction to Information Retrieval (http://informationretrieval.org), IIR 11: Probabilistic Information Retrieval. Hinrich Schütze, Institute for Natural Language Processing,


  4. Probabilistic IR and ranking
     Ranked retrieval setup: the user issues a query, and a ranked list of documents is returned.
     How can we rank probabilistically?
     Let $R_{d,q}$ be a random dichotomous variable such that $R_{d,q} = 1$ if document $d$ is relevant w.r.t. query $q$, and $R_{d,q} = 0$ otherwise. (This is a binary notion of relevance.)
     Probabilistic ranking orders documents decreasingly by their estimated probability of relevance w.r.t. the query: $P(R = 1 \mid d, q)$.
     How can we justify this way of proceeding?

  6. Probability Ranking Principle (PRP)
     If the retrieved documents are ranked decreasingly on their probability of relevance (w.r.t. a query), then the effectiveness of the system will be the best that is obtainable.
     Fundamental assumption: the relevance of each document is independent of the relevance of other documents.

  7. Outline
     1. Probabilistic Approach to IR
     2. Binary independence model
     3. Okapi BM25

  9. Binary Independence Model (BIM)
     Binary: documents and queries are represented as binary term incidence vectors.
     Independence: terms are independent of each other (not true, but it works in practice; this is the naive assumption of Naive Bayes models).

  10. Binary incidence matrix

      Term        Anthony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth   ...
      Anthony     1                       1               0             0        0         1
      Brutus      1                       1               0             1        0         0
      Caesar      1                       1               0             1        1         1
      Calpurnia   0                       1               0             0        0         0
      Cleopatra   1                       0               0             0        0         0
      mercy       1                       0               1             1        1         1
      worser      1                       0               1             1        1         0
      ...

      Each document is represented as a binary vector $\in \{0, 1\}^{|V|}$.
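
The incidence-vector representation can be sketched in a few lines of Python. This is only an illustration under simple assumptions: whitespace tokenization, lowercasing, and a tiny hypothetical vocabulary and document set that are not taken from the slides; the helper name `incidence_vector` is also made up for this example.

```python
def incidence_vector(text, vocabulary):
    """Return a 0/1 list: 1 if the vocabulary term occurs in the text, else 0."""
    tokens = set(text.lower().split())  # naive whitespace tokenization (assumption)
    return [1 if term in tokens else 0 for term in vocabulary]

# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["anthony", "brutus", "caesar", "mercy"]
docs = {
    "doc_a": "Anthony praised Caesar and Brutus",
    "doc_b": "mercy for Caesar",
}

# One binary vector per document, in vocabulary order.
vectors = {name: incidence_vector(text, vocabulary) for name, text in docs.items()}
print(vectors)
```

Term frequency and word order are discarded on purpose: the model only records whether a term is present.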

  15. Bayes' rule
      $$P(R = 1 \mid \vec{x}, \vec{q}) = \frac{P(\vec{x} \mid R = 1, \vec{q})\, P(R = 1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}$$
      $$P(R = 0 \mid \vec{x}, \vec{q}) = \frac{P(\vec{x} \mid R = 0, \vec{q})\, P(R = 0 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}$$
      (Recall that document and query are modeled as term incidence vectors: $\vec{x}$ and $\vec{q}$.)
      $P(\vec{x} \mid R = 1, \vec{q})$ and $P(\vec{x} \mid R = 0, \vec{q})$: the probability that if a relevant or nonrelevant document is retrieved, then that document's representation is $\vec{x}$.
      Use statistics about the document collection to estimate these probabilities.

  18. Priors
      $P(R \mid d, q)$ is modeled using term incidence vectors as $P(R \mid \vec{x}, \vec{q})$:
      $$P(R = 1 \mid \vec{x}, \vec{q}) = \frac{P(\vec{x} \mid R = 1, \vec{q})\, P(R = 1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}$$
      $$P(R = 0 \mid \vec{x}, \vec{q}) = \frac{P(\vec{x} \mid R = 0, \vec{q})\, P(R = 0 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}$$
      $P(R = 1 \mid \vec{q})$ and $P(R = 0 \mid \vec{q})$: the prior probability of retrieving a relevant or nonrelevant document for a query $\vec{q}$.
      Estimate $P(R = 1 \mid \vec{q})$ and $P(R = 0 \mid \vec{q})$ from the percentage of relevant documents in the collection.

  21. Ranking according to odds
      We said that we're going to rank documents according to $P(R = 1 \mid \vec{x}, \vec{q})$.
      Easier: rank documents by their odds of relevance (this gives the same ranking):
      $$O(R \mid \vec{x}, \vec{q}) = \frac{P(R = 1 \mid \vec{x}, \vec{q})}{P(R = 0 \mid \vec{x}, \vec{q})} = \frac{\dfrac{P(R = 1 \mid \vec{q})\, P(\vec{x} \mid R = 1, \vec{q})}{P(\vec{x} \mid \vec{q})}}{\dfrac{P(R = 0 \mid \vec{q})\, P(\vec{x} \mid R = 0, \vec{q})}{P(\vec{x} \mid \vec{q})}} = \frac{P(R = 1 \mid \vec{q})}{P(R = 0 \mid \vec{q})} \cdot \frac{P(\vec{x} \mid R = 1, \vec{q})}{P(\vec{x} \mid R = 0, \vec{q})}$$
      $\frac{P(R = 1 \mid \vec{q})}{P(R = 0 \mid \vec{q})}$ is a constant for a given query and can be ignored.
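
That ranking by odds gives the same order as ranking by probability follows from $p / (1 - p)$ being strictly increasing on $[0, 1)$. A small Python check, using made-up relevance probabilities for four hypothetical documents:

```python
def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

# Hypothetical estimates of P(R=1 | x, q); not taken from the slides.
probs = {"d1": 0.20, "d2": 0.85, "d3": 0.05, "d4": 0.60}

# Rank once by probability and once by odds.
by_prob = sorted(probs, key=probs.get, reverse=True)
by_odds = sorted(probs, key=lambda d: odds(probs[d]), reverse=True)

print(by_prob)
print(by_odds)
```

Both lists come out in the same order, which is why the derivation can work with odds and drop query-constant factors.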

  23. Naive Bayes conditional independence assumption
      Now we make the Naive Bayes conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query):
      $$\frac{P(\vec{x} \mid R = 1, \vec{q})}{P(\vec{x} \mid R = 0, \vec{q})} = \frac{\prod_{t=1}^{M} P(x_t \mid R = 1, \vec{q})}{\prod_{t=1}^{M} P(x_t \mid R = 0, \vec{q})}$$
      So:
      $$O(R \mid \vec{x}, \vec{q}) \propto \prod_{t=1}^{M} \frac{P(x_t \mid R = 1, \vec{q})}{P(x_t \mid R = 0, \vec{q})}$$

  25. Separating terms in the document vs. not
      Since each $x_t$ is either 0 or 1, we can separate the terms:
      $$O(R \mid \vec{x}, \vec{q}) \propto \prod_{t: x_t = 1} \frac{P(x_t = 1 \mid R = 1, \vec{q})}{P(x_t = 1 \mid R = 0, \vec{q})} \cdot \prod_{t: x_t = 0} \frac{P(x_t = 0 \mid R = 1, \vec{q})}{P(x_t = 0 \mid R = 0, \vec{q})}$$

  29. Definition of $p_t$ and $u_t$
      Let $p_t = P(x_t = 1 \mid R = 1, \vec{q})$ be the probability of a term appearing in a relevant document.
      Let $u_t = P(x_t = 1 \mid R = 0, \vec{q})$ be the probability of a term appearing in a nonrelevant document.
      This can be displayed as a contingency table:

                                 R = 1       R = 0
      term present   x_t = 1     p_t         u_t
      term absent    x_t = 0     1 - p_t     1 - u_t

      $$O(R \mid \vec{x}, \vec{q}) \propto \prod_{t: x_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: x_t = 0} \frac{1 - p_t}{1 - u_t}$$
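
A minimal Python sketch of this product, assuming hypothetical values of $p_t$ and $u_t$ for a three-term vocabulary (the numbers and the function name `odds_score` are illustrative and do not come from the slides):

```python
def odds_score(x, p, u):
    """Multiply p_t/u_t for present terms and (1-p_t)/(1-u_t) for absent terms."""
    score = 1.0
    for x_t, p_t, u_t in zip(x, p, u):
        if x_t == 1:
            score *= p_t / u_t
        else:
            score *= (1.0 - p_t) / (1.0 - u_t)
    return score

# Hypothetical per-term probabilities; terms 2 and 3 have p_t == u_t.
p = [0.8, 0.5, 0.1]
u = [0.2, 0.5, 0.1]

score_present = odds_score([1, 0, 0], p, u)  # first term present
score_absent = odds_score([0, 0, 0], p, u)   # no term present
print(score_present, score_absent)
```

Terms with $p_t = u_t$ contribute a factor of 1 whether they are present or absent, which is what makes it possible to drop non-query terms in the next step.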

  35. Dropping terms that don't occur in the query
      Additional simplifying assumption: if $q_t = 0$, then $p_t = u_t$.
      A term not occurring in the query is equally likely to occur in relevant and nonrelevant documents.
      Now we need only consider terms in the products that appear in the query:
      $$O(R \mid \vec{x}, \vec{q}) \propto \prod_{t: x_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: x_t = 0} \frac{1 - p_t}{1 - u_t} \approx \prod_{t: x_t = q_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: x_t = 0,\, q_t = 1} \frac{1 - p_t}{1 - u_t}$$

  40. BIM retrieval status value
      Including the query terms found in the document into the right product, but simultaneously dividing by them in the left product, gives:
      $$O(R \mid \vec{x}, \vec{q}) \propto \prod_{t: x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \cdot \prod_{t: q_t = 1} \frac{1 - p_t}{1 - u_t}$$
      The right product is now over all query terms, hence constant for a particular query, and can be ignored.
      → The only quantity that needs to be estimated to rank documents w.r.t. a query is the left product.
      Hence the Retrieval Status Value (RSV) in this model:
      $$RSV_d = \log \prod_{t: x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} = \sum_{t: x_t = q_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$
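
The RSV sum can be sketched directly from the formula. The query vector, document vectors, and the $p_t$, $u_t$ values below are hypothetical, chosen only to show that a document matching more query terms scores higher when those terms have $p_t > u_t$; `rsv` is an illustrative name.

```python
import math

def rsv(x, q, p, u):
    """Sum log[p_t(1-u_t) / (u_t(1-p_t))] over terms with x_t = q_t = 1."""
    return sum(
        math.log(p_t * (1.0 - u_t) / (u_t * (1.0 - p_t)))
        for x_t, q_t, p_t, u_t in zip(x, q, p, u)
        if x_t == 1 and q_t == 1
    )

q = [1, 1, 0]            # the query contains terms 1 and 2
p = [0.8, 0.6, 0.3]      # hypothetical p_t
u = [0.2, 0.3, 0.3]      # hypothetical u_t

rsv_two_matches = rsv([1, 1, 1], q, p, u)  # matches both query terms
rsv_one_match = rsv([1, 0, 1], q, p, u)    # matches one query term
rsv_no_match = rsv([0, 0, 1], q, p, u)     # matches no query term
print(rsv_two_matches, rsv_one_match, rsv_no_match)
```

A document that shares no term with the query gets an RSV of 0, because the sum is empty.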

  45. BIM retrieval status value (2)
      Equivalent: rank documents using the log odds ratios $c_t$ for the terms in the query:
      $$c_t = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} = \log \frac{p_t}{1 - p_t} - \log \frac{u_t}{1 - u_t}$$
      The odds ratio is the ratio of two odds: (i) the odds of the term appearing if the document is relevant ($p_t / (1 - p_t)$), and (ii) the odds of the term appearing if the document is nonrelevant ($u_t / (1 - u_t)$).
      $c_t = 0$: the term has equal odds of appearing in relevant and nonrelevant documents.
      $c_t$ positive: higher odds of appearing in relevant documents.
      $c_t$ negative: higher odds of appearing in nonrelevant documents.
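
The three cases for the sign of $c_t$ can be checked numerically. The probabilities below are arbitrary illustrative values, and `term_weight` is a hypothetical helper name:

```python
import math

def term_weight(p_t, u_t):
    """c_t = log(p_t / (1 - p_t)) - log(u_t / (1 - u_t))."""
    return math.log(p_t / (1.0 - p_t)) - math.log(u_t / (1.0 - u_t))

neutral = term_weight(0.3, 0.3)   # equal odds in relevant and nonrelevant docs
positive = term_weight(0.8, 0.2)  # more likely in relevant docs
negative = term_weight(0.2, 0.8)  # more likely in nonrelevant docs
print(neutral, positive, negative)
```

Swapping $p_t$ and $u_t$ flips the sign of the weight, as the difference-of-logs form suggests.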

  49. Term weight $c_t$ in BIM
      $c_t = \log \frac{p_t}{1 - p_t} - \log \frac{u_t}{1 - u_t}$ functions as a term weight.
      Retrieval status value for document $d$: $RSV_d = \sum_{x_t = q_t = 1} c_t$.
      So BIM and the vector space model are similar on an operational level.
      In particular, we can use the same data structures (inverted index etc.) for the two models.
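
Because $RSV_d$ is a sum of per-term weights, it can be accumulated term by term over postings lists, much like vector space scoring. The sketch below assumes a toy in-memory inverted index (a dict from term to a list of document ids) and made-up weights; neither the data nor the function name comes from the slides.

```python
from collections import defaultdict

def score_with_inverted_index(query_terms, postings, weights):
    """Accumulate RSV_d = sum of c_t over query terms t that occur in document d."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id in postings.get(term, []):  # only docs containing the term
            scores[doc_id] += weights[term]
    return dict(scores)

# Hypothetical postings lists and term weights c_t.
postings = {"brutus": [1, 2, 4], "caesar": [1, 2, 4, 5], "mercy": [1, 3]}
weights = {"brutus": 1.5, "caesar": 0.5, "mercy": 1.0}

scores = score_with_inverted_index(["brutus", "caesar"], postings, weights)
print(scores)
```

Documents that contain no query term never appear in the accumulator, which corresponds to an RSV of 0.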

  50. Computing term weights $c_t$
      For each term $t$ in a query, estimate $c_t$ in the whole collection using a contingency table of counts of documents in the collection, where $df_t$ is the number of documents that contain term $t$:

      documents                  relevant    nonrelevant                Total
      Term present   x_t = 1     s           df_t - s                   df_t
      Term absent    x_t = 0     S - s       (N - df_t) - (S - s)       N - df_t
      Total                      S           N - S                      N

      $$p_t = s / S$$
      $$u_t = (df_t - s) / (N - S)$$
      $$c_t = K(N, df_t, S, s) = \log \frac{s / (S - s)}{(df_t - s) / ((N - df_t) - (S - s))}$$
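
A direct transcription of $K(N, df_t, S, s)$ into Python, as a sketch. Reading the table, $N$ is the collection size, $S$ the number of relevant documents, and $s$ the number of relevant documents containing the term; the counts in the example are invented, and the function name is hypothetical. No smoothing is applied here, so zero counts would fail.

```python
import math

def term_weight_from_counts(N, df_t, S, s):
    """Unsmoothed c_t = log([s/(S-s)] / [(df_t-s)/((N-df_t)-(S-s))])."""
    relevant_odds = s / (S - s)
    nonrelevant_odds = (df_t - s) / ((N - df_t) - (S - s))
    return math.log(relevant_odds / nonrelevant_odds)

# Hypothetical counts: 1000 docs, term in 100 of them, 20 relevant docs,
# 10 of which contain the term.
N, df_t, S, s = 1000, 100, 20, 10
c_t = term_weight_from_counts(N, df_t, S, s)

# Cross-check against the p_t / u_t form of the same weight.
p_t = s / S
u_t = (df_t - s) / (N - S)
c_t_from_probs = math.log(p_t / (1 - p_t)) - math.log(u_t / (1 - u_t))
print(c_t, c_t_from_probs)
```

The two values agree up to floating-point rounding, since the count-based formula is the probability-based one with $p_t$ and $u_t$ substituted.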

  54. Avoiding zeros
      If any of the counts is zero, then the term weight is not well-defined.
      Maximum likelihood estimates do not work for rare events.
      To avoid zeros: add 0.5 to each count (expected likelihood estimation = ELE) or use a different type of smoothing.
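
A sketch of the add-0.5 idea in Python, assuming the 0.5 is added to each of the four cell counts of the relevant/nonrelevant by present/absent table. The counts and the function name are hypothetical; the example uses $s = 0$, a case where the unsmoothed weight would require $\log 0$.

```python
import math

def smoothed_term_weight(N, df_t, S, s, k=0.5):
    """c_t with k added to each of the four contingency-table cell counts."""
    relevant_odds = (s + k) / (S - s + k)
    nonrelevant_odds = (df_t - s + k) / ((N - df_t) - (S - s) + k)
    return math.log(relevant_odds / nonrelevant_odds)

# Hypothetical counts where no relevant document contains the term (s = 0).
weight_zero_count = smoothed_term_weight(N=1000, df_t=100, S=20, s=0)

# With comfortable counts, smoothing changes the weight only slightly.
weight_smoothed = smoothed_term_weight(N=1000, df_t=100, S=20, s=10)
weight_unsmoothed = math.log((10 / 10) / (90 / 890))
print(weight_zero_count, weight_smoothed, weight_unsmoothed)
```

The smoothed weight stays finite for the zero count and comes out negative, reflecting a term that so far has only been seen in nonrelevant documents.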

  58. More simplifying assumptions
      Assume that relevant documents are a very small percentage of the collection ...
      ... then we can approximate statistics for nonrelevant documents by statistics from the whole collection:
      $$\log[(1 - u_t) / u_t] = \log[(N - df_t) / df_t] \approx \log N / df_t$$
      This should look familiar to you ...
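
How good the final approximation is depends on how rare the term is. The gap between $\log(N / df_t)$ and $\log[(N - df_t) / df_t]$ equals $-\log(1 - df_t / N)$, which a short Python check makes concrete. The collection size and document frequencies are made-up values, and `idf_gap` is an illustrative name.

```python
import math

def idf_gap(N, df_t):
    """Difference log(N/df_t) - log((N - df_t)/df_t)."""
    return math.log(N / df_t) - math.log((N - df_t) / df_t)

N = 1_000_000  # hypothetical collection size
gap_rare = idf_gap(N, 10)         # term in 10 documents
gap_common = idf_gap(N, 500_000)  # term in half of the collection
print(gap_rare, gap_common)
```

For rare terms the gap is negligible; for a term that occurs in half the collection it grows to $\log 2$, so the approximation is best for low-$df_t$ terms.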

  60. Probability estimates in relevance feedback
      For relevance feedback, we can directly compute term weights $c_t$ based on the contingency table (using an appropriate smoothing method like ELE).

  61. Computing term weights $c_t$ for relevance feedback
      For each term $t$ in a query, estimate $c_t$ in the whole collection using a contingency table of counts of documents in the collection, where $df_t$ is the number of documents that contain term $t$:

      documents                  relevant    nonrelevant                Total
      Term present   x_t = 1     s           df_t - s                   df_t
      Term absent    x_t = 0     S - s       (N - df_t) - (S - s)       N - df_t
      Total                      S           N - S                      N

      $$p_t = s / S$$
      $$u_t = (df_t - s) / (N - S)$$
      $$c_t = K(N, df_t, S, s) = \log \frac{s / (S - s)}{(df_t - s) / ((N - df_t) - (S - s))}$$

  68. Probability estimates in ad hoc retrieval
      Ad hoc retrieval: no user-supplied relevance judgments are available.
      In this case: assume constant $p_t = 0.5$ for all terms $x_t$ in the query.
      Each query term is then equally likely to occur or not occur in a relevant document, and so the $p_t$ and $(1 - p_t)$ factors cancel out in the expression for RSV.
      This is a weak estimate, but it doesn't disagree violently with the expectation that query terms appear in many but not all relevant documents.
      Weight $c_t$ in this case:
      $$c_t = \log \frac{p_t}{1 - p_t} - \log \frac{u_t}{1 - u_t} \approx \log N / df_t$$
      For short documents (titles or abstracts), this simple version of BIM works well.
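
With $p_t = 0.5$ and $c_t \approx \log N / df_t$, ad hoc BIM ranking reduces to summing idf-like weights over the query terms a document contains. A minimal Python sketch, assuming hypothetical document frequencies and documents given as term sets (all names and numbers here are invented for illustration):

```python
import math

def adhoc_rsv(query_terms, doc_terms, df, N):
    """RSV_d with p_t = 0.5 and c_t approximated by log(N / df_t)."""
    return sum(
        math.log(N / df[t])
        for t in query_terms
        if t in doc_terms and t in df  # only query terms present in the doc
    )

N = 1000                                        # hypothetical collection size
df = {"okapi": 5, "retrieval": 200, "the": 990}  # hypothetical document frequencies
docs = {
    "d1": {"okapi", "retrieval", "the"},
    "d2": {"retrieval", "the"},
    "d3": {"the"},
}
query = ["okapi", "retrieval"]

ranking = sorted(docs, key=lambda d: adhoc_rsv(query, docs[d], df, N), reverse=True)
print(ranking)
```

The rare query term dominates the score, so the document containing it ranks first, while a document with no query term scores 0.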

  69. Outline
      1. Probabilistic Approach to IR
      2. Binary independence model
      3. Okapi BM25

  71. Okapi BM25: Overview
      Okapi BM25 is a probabilistic model that incorporates term frequency (i.e., it's nonbinary) and length normalization.
      BIM was originally designed for short catalog records of fairly consistent length, and it works reasonably in these contexts.
