

  1. Probabilistic Models in IR
     Debapriyo Majumdar
     Information Retrieval – Spring 2015
     Indian Statistical Institute Kolkata
     Using the majority of the slides from Chris Manning, Pandu Nayak and Prabhakar Raghavan

  2. Why probabilities?
     § Document representation is uncertain
     § Understanding of user need is uncertain
     § How to match? The user needs some information; assumption: the required information is present somewhere
     § Traditional IR matching: by a semantically imprecise space of terms
     § Probabilities: a principled foundation for uncertain reasoning
     § Goal: use probabilities to quantify uncertainties

  3. Probabilistic IR topics
     § Classical probabilistic retrieval model
       – Probability ranking principle, etc.
       – Binary independence model (≈ Naïve Bayes text categorization)
       – (Okapi) BM25
     § Bayesian networks for text retrieval
     § Language model approach to IR
       – An important emphasis in recent work
     § Timeline: old, as well as currently hot in IR
       – Traditionally: neat ideas, but didn't win on performance
       – It may be different now

  4. The document ranking problem
     § Collection D = {d_1, …, d_N}
     § Query: q
     § Ranking: return a list of documents, in the order of relevance
     Probabilistic idea of relevance:
     § Given a document d and a query q, is d relevant for q?
       – Random variable R = 0 (not relevant) or 1 (relevant)
       – Denote the probability of relevance by P(R = 1 | d, q)
     § Ranking: rank documents in the order of P(R = 1 | d_i, q)

  5. Probability review: Bayes' Theorem
     § For events A and B:
       P(A, B) = P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
     § Posterior from prior:
       P(A | B) = P(B | A) P(A) / P(B)
                = P(B | A) P(A) / [ P(B | A) P(A) + P(B | Ā) P(Ā) ]
     § Odds:
       O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))
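The Bayes and odds formulas above can be sketched in a few lines of code; the numbers plugged in below are hypothetical, not from the slides.

```python
# Minimal sketch of Bayes' Theorem and odds (toy values).

def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|~A)P(~A)]."""
    num = p_b_given_a * p_a
    return num / (num + p_b_given_not_a * (1 - p_a))

def odds(p):
    """O(A) = P(A) / (1 - P(A))."""
    return p / (1 - p)

# Example with made-up probabilities:
p = posterior(0.8, 0.3, 0.2)   # = 0.24 / (0.24 + 0.14)
```

Note that the denominator is just the law of total probability expanding P(B) over A and its complement, exactly as in the slide.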

  6. The Probability Ranking Principle (PRP)
     [1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)
     § Goal: overall effectiveness to be the best obtainable on the basis of the available data
     § Approach: rank the documents in the collection in order of decreasing probability of relevance to the user who submitted the request
       – Assumption: the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system

  7. Probability Ranking Principle (PRP)
     Goal: for every document d, estimate P[d is relevant to q], written P(R = 1 | d, q), or simply P(R = 1 | d)
     By Bayes' rule:
       P(R = 1 | d) = P(d | R = 1) P(R = 1) / P(d)
       P(R = 0 | d) = P(d | R = 0) P(R = 0) / P(d)
     where
       – P(R = 1), P(R = 0): prior probability of retrieving a relevant or non-relevant document
       – P(d | R = 1), P(d | R = 0): probability that if a relevant (non-relevant) document is retrieved, it is d
     Note: P(R = 0 | d) + P(R = 1 | d) = 1

  8. Probability Ranking Principle (PRP)
     § Simple case: no selection costs or other utility concerns that would differentially weight errors
     § PRP: rank all documents by P(R = 1 | x)
     § The 1/0 loss:
       – Lose a point if you return a non-relevant document
       – Gain a point if you return a relevant document
     § Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss
       – Provable if all probabilities are correct, etc. [e.g., Ripley 1996]

  9. Probability Ranking Principle (PRP)
     § More complex case: retrieval costs
       – Let d be a document
       – C: cost of not retrieving a relevant document
       – C′: cost of retrieving a non-relevant document
     § Probability Ranking Principle: if
       C′ · P(R = 0 | d) − C · P(R = 1 | d) ≤ C′ · P(R = 0 | d′) − C · P(R = 1 | d′)
       for all d′ not yet retrieved, then d is the next document to be retrieved
     § We won't further consider cost/utility
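The cost-based selection rule can be sketched as follows: pick the candidate minimizing the expected cost C′·P(R=0|d) − C·P(R=1|d). The function name and the document probabilities are hypothetical, introduced only for illustration.

```python
def next_document(candidates, cost_miss, cost_false):
    """Return the document minimizing C' * P(R=0|d) - C * P(R=1|d).

    candidates: dict mapping doc id -> estimated P(R=1|d)  (toy values)
    cost_miss:  C, cost of not retrieving a relevant document
    cost_false: C', cost of retrieving a non-relevant document
    """
    return min(
        candidates,
        key=lambda d: cost_false * (1 - candidates[d]) - cost_miss * candidates[d],
    )

# With equal costs, this reduces to picking the highest P(R=1|d):
pick = next_document({"d1": 0.9, "d2": 0.2}, cost_miss=1.0, cost_false=1.0)
```

With equal costs the rule is exactly the plain PRP ranking by probability of relevance, which is why the slides drop cost/utility from this point on.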

  10. Probabilistic Retrieval Strategy
     § Estimating the probabilities: Binary Independence Model (BIM) – the simplest model
     § Questionable assumptions
       – "Relevance" of each document is independent of the relevance of other documents
         • It is bad to keep on returning duplicates
       – Boolean model of relevance
     § Estimate how terms contribute to relevance
       – How do tf, df, document length, etc. influence document relevance?
     § Combine to find the document relevance probability
     § Order documents by decreasing probability

  11. Probabilistic Ranking
     Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically." – Van Rijsbergen

  12. Binary Independence Model
     § Traditionally used in conjunction with PRP
     § "Binary" = Boolean: documents are represented as binary incidence vectors of terms
       – x = (x_1, …, x_n)
       – x_i = 1 if term i is present in document x, 0 otherwise
     § "Independence": terms occur in documents independently
     § Different documents can be modeled as the same vector

  13. Binary Independence Model
     § Query: binary term incidence vector q
     § Given query q,
       – For each document d, need to compute P(R | q, d)
       – Replace with computing P(R | q, x), where x is the binary term incidence vector representing d
       – Interested only in ranking
     § Use odds and Bayes' Rule:
       O(R | q, x) = P(R = 1 | q, x) / P(R = 0 | q, x)
                   = [ P(R = 1 | q) P(x | R = 1, q) / P(x | q) ] / [ P(R = 0 | q) P(x | R = 0, q) / P(x | q) ]
                   = P(R = 1 | q) / P(R = 0 | q) · P(x | R = 1, q) / P(x | R = 0, q)
       – The first factor is constant for a given query; the second needs estimation

  14. Binary Independence Model
     Using the independence assumption:
       P(x | R = 1, q) / P(x | R = 0, q) = ∏_{i=1}^{n} P(x_i | R = 1, q) / P(x_i | R = 0, q)
     so
       O(R | q, x) = O(R | q) · ∏_{i=1}^{n} P(x_i | R = 1, q) / P(x_i | R = 0, q)
     Since x_i is either 0 or 1:
       O(R | q, x) = O(R | q) · ∏_{x_i = 1} P(x_i = 1 | R = 1, q) / P(x_i = 1 | R = 0, q)
                              · ∏_{x_i = 0} P(x_i = 0 | R = 1, q) / P(x_i = 0 | R = 0, q)
     Let p_i = P(x_i = 1 | R = 1, q) and r_i = P(x_i = 1 | R = 0, q); assuming p_i = r_i for terms not occurring in the query, those factors cancel and
       O(R | q, x) = O(R | q) · ∏_{x_i = 1, q_i = 1} p_i / r_i · ∏_{x_i = 0, q_i = 1} (1 − p_i) / (1 − r_i)
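The final odds expression above can be computed directly; here is a minimal sketch, with toy incidence vectors and made-up p_i, r_i values (the function name is mine, not from the slides).

```python
from math import prod  # Python 3.8+

def bim_odds(o_prior, x, q, p, r):
    """O(R|q,x) = O(R|q) * prod_{q_i=1, x_i=1} p_i/r_i
                         * prod_{q_i=1, x_i=0} (1-p_i)/(1-r_i).

    x, q: binary incidence vectors for the document and query
    p, r: per-term probabilities p_i, r_i  (toy values below)
    """
    present = prod(p[i] / r[i] for i in range(len(q)) if q[i] and x[i])
    absent = prod((1 - p[i]) / (1 - r[i]) for i in range(len(q)) if q[i] and not x[i])
    return o_prior * present * absent

# Two query terms; the document contains the first but not the second:
o = bim_odds(1.0, x=[1, 0], q=[1, 1], p=[0.8, 0.4], r=[0.2, 0.5])
```

Only query terms (q_i = 1) contribute, matching the restriction of both products to q_i = 1 in the derivation.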

  15. What it means
       O(R | q, x) = O(R | q) · ∏_{x_i = 1, q_i = 1} p_i / r_i · ∏_{x_i = 0, q_i = 1} (1 − p_i) / (1 − r_i)

       in document            relevant (R = 1)   not relevant (R = 0)
       term present, x_i = 1  p_i                r_i
       term absent,  x_i = 0  1 − p_i            1 − r_i

  16. Binary Independence Model
       O(R | q, x) = O(R | q) · ∏_{x_i = q_i = 1} p_i / r_i · ∏_{x_i = 0, q_i = 1} (1 − p_i) / (1 − r_i)
     The first product is over all matching query terms; the second is over the non-matching query terms: too many!
     Multiply and divide the matching-term product by (1 − r_i) / (1 − p_i):
       O(R | q, x) = O(R | q) · ∏_{x_i = q_i = 1} [ p_i (1 − r_i) ] / [ r_i (1 − p_i) ] · ∏_{q_i = 1} (1 − p_i) / (1 − r_i)
     Now the first product is over all matching terms, and the second over all query terms.

  17. Binary Independence Model
       O(R | q, x) = O(R | q) · ∏_{x_i = q_i = 1} [ p_i (1 − r_i) ] / [ r_i (1 − p_i) ] · ∏_{q_i = 1} (1 − p_i) / (1 − r_i)
     The prior O(R | q) and the product over all query terms are constant for each query; the only quantity to be estimated for ranking is the product over matching terms.
     Retrieval Status Value (taking log):
       RSV = log ∏_{x_i = q_i = 1} [ p_i (1 − r_i) ] / [ r_i (1 − p_i) ]
           = ∑_{x_i = q_i = 1} log [ p_i (1 − r_i) ] / [ r_i (1 − p_i) ]

  18. Binary Independence Model
     Only need to compute RSV:
       RSV = ∑_{x_i = q_i = 1} c_i,   where   c_i = log [ p_i (1 − r_i) ] / [ r_i (1 − p_i) ]
     The c_i are log odds ratios; they function as the term weights in this model.
     How to compute the c_i's from the data?
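The RSV as a sum of log odds ratios is easy to sketch; the (p_i, r_i) pairs below are illustrative toy values, not estimates from real data.

```python
from math import log

def c_i(p, r):
    """Log odds ratio c_i = log [p(1-r)] / [r(1-p)] -- the term weight."""
    return log(p * (1 - r) / (r * (1 - p)))

def rsv(matched_terms):
    """RSV = sum of c_i over query terms present in the document.

    matched_terms: list of (p_i, r_i) pairs for the matching terms (toy values).
    """
    return sum(c_i(p, r) for p, r in matched_terms)
```

A term with p_i = r_i contributes weight 0, i.e. a term equally likely in relevant and non-relevant documents carries no ranking information.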

  19. Binary Independence Model
     Estimating the RSV coefficients: for each term i, look at this table of document counts:

       Documents   Relevant   Non-Relevant    Total
       x_i = 1     s          n − s           n
       x_i = 0     S − s      N − n − S + s   N − n
       Total       S          N − S           N

     Estimates (assuming no zero counts):
       p_i ≈ s / S
       r_i ≈ (n − s) / (N − S)
       c_i ≈ K(N, n, S, s) = log [ s / (S − s) ] / [ (n − s) / (N − n − S + s) ]
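The count-based estimate K(N, n, S, s) translates directly to code; the counts in the example are made up, and, as on the slide, zero counts are assumed away (in practice one would smooth).

```python
from math import log

def estimate_ci(N, n, S, s):
    """c_i ≈ K(N, n, S, s) = log [ s/(S-s) ] / [ (n-s)/(N-n-S+s) ].

    N: total documents, n: documents containing term i,
    S: relevant documents, s: relevant documents containing term i.
    Assumes no zero counts (no smoothing, as on the slide).
    """
    return log((s / (S - s)) / ((n - s) / (N - n - S + s)))

# Hypothetical counts: 100 docs, term in 10 of them, 5 relevant, 3 of those contain it:
w = estimate_ci(N=100, n=10, S=5, s=3)
```

The weight grows as the term concentrates in the relevant set (s up, n − s down), which is exactly the relevance-weighting intuition from the Van Rijsbergen quote.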

  20. Estimation
     § If non-relevant documents are approximated by the whole collection, then r_i (the probability of occurrence in non-relevant documents for the query) is n/N, and
       log (1 − r_i) / r_i = log (N − n − S + s) / (n − s) ≈ log (N − n) / n ≈ log N / n = IDF!
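The chain of approximations can be checked numerically: with r_i = n/N and a term that is rare relative to the collection, log((1 − r_i)/r_i) is very close to log(N/n). The collection sizes below are hypothetical.

```python
from math import log

def exact_weight(N, n):
    """log((1 - r_i) / r_i) with r_i = n / N."""
    r = n / N
    return log((1 - r) / r)

def idf_approx(N, n):
    """The idf approximation log(N / n)."""
    return log(N / n)

# Rare term in a hypothetical 100,000-document collection:
e = exact_weight(100_000, 100)
a = idf_approx(100_000, 100)
```

For rare terms the two agree to within a fraction of a percent; the approximation only degrades for terms occurring in a large share of the collection.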

  21. Estimation – key challenge
     § Estimating p_i (the probability of occurrence in relevant documents) is a little difficult
     § p_i can be estimated in various ways:
       – from relevant documents, if we know some
         • Relevance weighting can be used in a feedback loop
       – as a constant (Croft and Harper combination match) – then we just get idf weighting of terms (with p_i = 0.5):
           RSV = ∑_{x_i = q_i = 1} log N / n_i
       – as proportional to the probability of occurrence in the collection
         • Greiff (SIGIR 1998) argues for p_i = 1/3 + 2/3 · df_i / N
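The p_i = 0.5 special case above reduces the RSV to a plain idf sum over matched query terms, which can be sketched as follows (the document, query, and document frequencies are toy data):

```python
from math import log

def rsv_idf(N, doc_terms, query_terms, df):
    """With p_i = 0.5 and r_i ≈ n_i/N, the RSV reduces to
    sum of log(N / n_i) over matched query terms -- plain idf weighting.

    N: collection size, doc_terms: set of terms in the document,
    query_terms: query term list, df: term -> document frequency n_i.
    """
    return sum(log(N / df[t]) for t in query_terms if t in doc_terms)

# Toy collection of 1000 documents:
score = rsv_idf(1000, {"fast", "retrieval"}, ["fast", "retrieval"],
                {"fast": 100, "retrieval": 10})
```

Note how the rarer term ("retrieval", df = 10) contributes more than the common one, with no relevance information needed at all.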
