TF-IDF and Okapi BM25 (LM, session 3) CS6200: Information Retrieval


  1. TF-IDF and Okapi BM25 (LM, session 3). CS6200: Information Retrieval. Slides by: Jesse Anderton

  2. Binary Independence Models

In Bayesian classification, we rank documents by their likelihood ratio, calculated from some probabilistic model:

$$\frac{P(D \mid R=1)}{P(D \mid R=0)}$$

The model predicts the features that a relevant or non-relevant document is likely to have. Our first model is a unigram language model, which independently estimates the probability of each term appearing in a relevant or non-relevant document:

$$\frac{\prod_{i=1}^{|F|} P(f_i \mid R=1)}{\prod_{i=1}^{|F|} P(f_i \mid R=0)}$$

Any model like this, based on independent binary features $f_i \in F$, is called a binary independence model.

  3. Ranking with B.I. Models

Simplifying the binary independence model leads to a ranking score which allows us to ignore terms not found in the document. This is important for efficient queries.

Let $p_i := P(f_i \mid R=1)$, $q_i := P(f_i \mid R=0)$, and $d_i \in \{0, 1\}$ be the value of $f_i$ in document $D$. Then:

$$\begin{aligned}
\frac{P(D \mid R=1)}{P(D \mid R=0)}
&= \prod_{i: d_i=1} \frac{p_i}{q_i} \cdot \prod_{i: d_i=0} \frac{1-p_i}{1-q_i} \\
&= \prod_{i: d_i=1} \frac{p_i}{q_i} \cdot \prod_{i: d_i=1} \frac{1-q_i}{1-p_i} \cdot \prod_{i: d_i=1} \frac{1-p_i}{1-q_i} \cdot \prod_{i: d_i=0} \frac{1-p_i}{1-q_i} \\
&= \prod_{i: d_i=1} \frac{p_i(1-q_i)}{q_i(1-p_i)} \cdot \prod_{i=1}^{|F|} \frac{1-p_i}{1-q_i} \\
&\overset{rank}{=} \prod_{i: d_i=1} \frac{p_i(1-q_i)}{q_i(1-p_i)}
\overset{rank}{=} \sum_{i: d_i=1} \log \frac{p_i(1-q_i)}{q_i(1-p_i)} \qquad \text{(Ranking Score)}
\end{aligned}$$

The second product ranges over all features, so it is constant across documents and can be dropped for ranking purposes.
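The final ranking score is straightforward to compute. A minimal sketch in Python (the probability tables `p` and `q` here are illustrative inputs, not part of the slides):

```python
import math

def bi_ranking_score(doc_terms, p, q):
    """Binary independence ranking score: the sum, over terms present in
    the document, of log[p_i(1 - q_i) / (q_i(1 - p_i))].

    p[t] estimates P(f_t | R=1) and q[t] estimates P(f_t | R=0).
    Terms absent from the document contribute nothing, which is what
    makes the score cheap to evaluate at query time."""
    return sum(
        math.log(p[t] * (1 - q[t]) / (q[t] * (1 - p[t])))
        for t in doc_terms
        if t in p and t in q
    )
```

For example, a term with $p_i = 0.5$ and $q_i = 0.25$ contributes $\log \frac{0.5 \cdot 0.75}{0.25 \cdot 0.5} = \log 3$ to the score.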

  4. Relationship to IDF

Under certain assumptions, the ranking score $\log \frac{p_i(1-q_i)}{q_i(1-p_i)}$ is approximately just IDF:

1. All words have a fixed uniform probability of appearing in a relevant document: $p_i = 1/2$.
2. Most documents containing the term are non-relevant, so $q_i \approx df_i / D$.
3. Most documents do not contain the term, so $D - df_i \approx D$.

$$\begin{aligned}
\log \frac{p_i(1-q_i)}{q_i(1-p_i)}
&\approx \log \frac{0.5 \left(1 - \frac{df_i}{D}\right)}{\frac{df_i}{D} (1 - 0.5)} \\
&= \log \frac{1 - \frac{df_i}{D}}{\frac{df_i}{D}} \\
&= \log \left( \frac{D - df_i}{D} \cdot \frac{D}{df_i} \right) \\
&= \log \frac{D - df_i}{df_i} \\
&\approx \log \frac{D}{df_i} \qquad \text{(becomes IDF)}
\end{aligned}$$
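A quick numeric check of this approximation; the collection size and document frequency below are arbitrary illustrative values:

```python
import math

def rsj_weight(p, q):
    # log[p(1 - q) / (q(1 - p))] -- the exact ranking-score weight
    return math.log(p * (1 - q) / (q * (1 - p)))

D, df = 1_000_000, 100      # illustrative collection size and document frequency
p = 0.5                     # assumption 1: p_i = 1/2
q = df / D                  # assumption 2: q_i ~= df_i / D
exact = rsj_weight(p, q)    # equals log((D - df) / df)
idf = math.log(D / df)      # the IDF approximation (assumption 3)
```

For a rare term like this one, `exact` and `idf` agree to roughly four decimal places, since dropping $df_i$ from $D - df_i$ barely changes the ratio.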

  5. Improving on IDF

It turns out that we can do better than IDF. To get there, we'll start by considering the contingency table of all combinations of $d_i$ and $R$. We will estimate $p_i$ and $q_i$ using this table and a technique called "add-α smoothing," with α = 0.5.

              R = 1       R = 0                   Total
  d_i = 1     r_i         df_i − r_i              df_i
  d_i = 0     R − r_i     D − R − df_i + r_i      D − df_i
  Total       R           D − R                   D

$$p_i = \frac{r_i + 0.5}{R + 1}; \qquad q_i = \frac{df_i - r_i + 0.5}{D - R + 1}$$

This leads to a slightly different ranking score:

$$\begin{aligned}
\sum_{i: d_i=1} \log \frac{p_i(1-q_i)}{q_i(1-p_i)}
&= \sum_{i: d_i=1} \log \frac{(\mathrm{num}(d_i{=}1, R{=}1) + 0.5) / (\mathrm{num}(d_i{=}0, R{=}1) + 0.5)}{(\mathrm{num}(d_i{=}1, R{=}0) + 0.5) / (\mathrm{num}(d_i{=}0, R{=}0) + 0.5)} \\
&= \sum_{i: d_i=1} \log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}
\end{aligned}$$
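The smoothed term weight can be written directly from the contingency table. A sketch (the function name and argument order are mine):

```python
import math

def rsj_term_weight(r, R, df, D):
    """Smoothed term weight from the contingency table:
    log of [(r + 0.5)/(R - r + 0.5)] over [(df - r + 0.5)/(D - R - df + r + 0.5)].

    With no relevance information (r = R = 0) this reduces to
    log[(D - df + 0.5) / (df + 0.5)], an IDF-like quantity."""
    return math.log((r + 0.5) / (R - r + 0.5)) - \
           math.log((df - r + 0.5) / (D - R - df + r + 0.5))
```

As with IDF, rarer terms get larger weights when no relevance judgments are available; relevance information ($r_i > 0$) can then raise a term's weight even if it is common.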

  6. Is it better?

Let's unpack this formula to understand it better:

$$\log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)}$$

The numerator is a ratio of counts of relevant documents the term does and does not appear in. It's a likelihood ratio giving the amount of "evidence of relevance" the term provides.

The denominator is the same ratio, for non-relevant documents. It gives the amount of "evidence of non-relevance" for the term.

This makes it a better IDF: if the term is in many documents, but most of them are relevant, it doesn't discount the term as IDF would.

  7. Okapi BM25

Okapi BM25 is one of the strongest "simple" scoring functions, and has proven a useful baseline for experiments and a useful feature for ranking. It combines:

- the IDF-like ranking score from the last slide,
- the document term frequency $tf_{i,d}$, normalized by the ratio of the document's length $dl$ to the average length $avg(dl)$, and
- the query term frequency $tf_{i,q}$.

$$\sum_{i \in Q} \log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(df_i - r_i + 0.5) / (D - R - df_i + r_i + 0.5)} \cdot \frac{(k_1 + 1)\, tf_{i,d}}{K + tf_{i,d}} \cdot \frac{(k_2 + 1)\, tf_{i,q}}{k_2 + tf_{i,q}}$$

$$\text{where } K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avg(dl)} \right)$$

$k_1$, $k_2$, and $b$ are empirically-set parameters. Typical values at TREC are:

$$k_1 = 1.2 \qquad 0 \le k_2 \le 1000 \qquad b = 0.75$$

  8. Example: BM25

Example query: "president lincoln"

- $tf_{president,q} = tf_{lincoln,q} = 1$
- No relevance information: $R = r_i = 0$
- "president" is in 40,000 documents in the collection: $df_{president} = 40{,}000$
- "lincoln" is in 300 documents in the collection: $df_{lincoln} = 300$
- The document length is 90% of the average length: $dl / avg(dl) = 0.9$
- We pick $k_1 = 1.2$, $k_2 = 100$, $b = 0.75$

  tf_president   tf_lincoln   BM25
  15             25           20.66
  15             1            12.74
  15             0            5.00
  1              25           18.2
  0              25           15.66

The low-df term plays a bigger role.
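The scores above can be reproduced with a direct implementation of the formula from the previous slide. One detail is missing from the slide: the collection size $D$. Assuming $D = 500{,}000$, the computed scores match the table to within rounding:

```python
import math

def bm25_term(tf_d, tf_q, df, D, dl_ratio, r=0, R=0, k1=1.2, k2=100, b=0.75):
    """One term's BM25 contribution; dl_ratio is dl / avg(dl)."""
    # IDF-like part: the smoothed binary-independence term weight
    idf_part = math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((df - r + 0.5) / (D - R - df + r + 0.5)))
    # Length-normalized document term frequency
    K = k1 * ((1 - b) + b * dl_ratio)
    tf_part = (k1 + 1) * tf_d / (K + tf_d)
    # Query term frequency (equals 1 here, since tf_q = 1)
    qf_part = (k2 + 1) * tf_q / (k2 + tf_q)
    return idf_part * tf_part * qf_part

D = 500_000  # assumed collection size -- not stated on the slide
score = (bm25_term(15, 1, 40_000, D, 0.9) +   # "president", tf = 15
         bm25_term(25, 1, 300, D, 0.9))       # "lincoln",   tf = 25
# score comes out close to the 20.66 in the first table row
```

Note how "lincoln" contributes roughly three times as much as "president" despite a lower term frequency: the low-df term dominates.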

  9. Wrapping Up

Binary Independence Models are a principled, general way to combine evidence from many binary features (not just unigrams!).

The version of BM25 shown here is one of many in a family of scoring functions. Modern alternatives can take additional evidence, such as anchor text, into account.

Next, we'll generalize what we've learned so far into the fundamental topics of machine learning.
