Probabilistic Information Retrieval
CE-324: Modern Information Retrieval, Sharif University of Technology
M. Soleymani, Fall 2017
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Why probabilities in IR?
[Diagram: the user's information need is represented as a query and each document by its index terms; our understanding of the user need is uncertain, the query-document match is uncertain, and whether a doc has relevant content is an uncertain guess.]
In traditional IR systems, matching between each doc and query is attempted in a semantically imprecise space of index terms.
Probabilities provide a principled foundation for uncertain reasoning.
Can we use probabilities to quantify our uncertainties?
Probabilistic IR
Probabilistic methods are among the oldest, but also among the currently hottest, topics in IR.
Traditionally: neat ideas, but they didn't win on performance.
It may be different now.
Probabilistic IR topics
Classical probabilistic retrieval model
  Probability Ranking Principle
  Binary Independence Model (we will see that it is ≈ the Naïve Bayes model used for text categorization)
  (Okapi) BM25
Language model approach to IR
  An important emphasis on this approach in recent work
The document ranking problem
Problem specification:
  We have a collection of docs
  The user issues a query
  A list of docs needs to be returned
The ranking method is the core of an IR system: in what order do we present documents to the user?
Idea: rank by the probability of relevance of each doc w.r.t. the information need, $P(R = 1 \mid doc_i, query)$
Probability Ranking Principle (PRP)
"If a reference retrieval system's response to each request is a ranking of the docs in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data." [1960s/1970s]
S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)
Recall a few probability basics
Product rule: $p(a, b) = p(a \mid b)\, p(b)$
Sum rule: $p(a) = \sum_{b} p(a, b)$
Bayes' rule (prior and posterior):
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)} = \frac{p(b \mid a)\, p(a)}{p(b \mid a)\, p(a) + p(b \mid \bar{a})\, p(\bar{a})}$$
Odds:
$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$
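A minimal numeric sketch of these identities in Python (the prior and likelihood values are made-up numbers, purely for illustration):

```python
# Small numeric check of Bayes' rule and odds (illustrative numbers only).

def posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    """p(a|b) via Bayes' rule, with the sum rule in the denominator."""
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * (1 - prior_a)
    return lik_b_given_a * prior_a / evidence

def odds(p):
    """O(a) = p(a) / (1 - p(a))."""
    return p / (1 - p)

p_a = 0.3                # prior p(a)
p_b_given_a = 0.8        # likelihood p(b|a)
p_b_given_not_a = 0.2    # likelihood p(b|~a)

post = posterior(p_a, p_b_given_a, p_b_given_not_a)
print(post)              # posterior p(a|b) ~ 0.632
print(odds(p_a))         # prior odds      ~ 0.429
print(odds(post))        # posterior odds  ~ 1.714
```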
Probability Ranking Principle (PRP)
$d$: doc
$q$: query
$R$: relevance of a doc w.r.t. a given (fixed) query
  $R = 1$: relevant
  $R = 0$: not relevant
Need to find the probability that a doc $d$ is relevant to a query $q$: $p(R = 1 \mid d, q)$
$p(R = 0 \mid d, q) = 1 - p(R = 1 \mid d, q)$
Probability Ranking Principle (PRP)
$$p(R = 1 \mid d, q) = \frac{p(d \mid R = 1, q)\, p(R = 1 \mid q)}{p(d \mid q)}$$
$$p(R = 0 \mid d, q) = \frac{p(d \mid R = 0, q)\, p(R = 0 \mid q)}{p(d \mid q)}$$
$p(d \mid R = 1, q)$: probability of $d$ in the class of docs relevant to the query $q$.
$p(d \mid R = 0, q)$: probability of $d$ in the class of docs non-relevant to the query $q$.
Probability Ranking Principle (PRP)
How do we compute all those probabilities?
We do not know the exact probabilities, so we have to use estimates.
The Binary Independence Model (BIM), which we discuss next, is the simplest such model.
Probabilistic Retrieval Strategy
Estimate how terms contribute to relevance
  How do things like tf, df, and doc length influence your judgments about doc relevance? A more nuanced answer is the Okapi formula (Spärck Jones / Robertson).
Combine these estimated values to find the doc's probability of relevance
Order docs by decreasing probability
Probabilistic Ranking
Basic concept: "For a given query, if we know some docs that are relevant, terms that occur in those docs should be given greater weighting in searching for other relevant docs. By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically."
Van Rijsbergen
Binary Independence Model
Traditionally used in conjunction with the PRP
"Binary" = Boolean: docs are represented as binary incidence vectors of terms
  $\vec{x} = [x_1, x_2, \ldots, x_n]$
  $x_i = 1$ iff term $i$ is present in doc $x$.
"Independence": terms occur in docs independently
Equivalent to the multivariate Bernoulli Naïve Bayes model sometimes used for text categorization [we will see this in the next lectures]
Binary Independence Model
Will use odds and Bayes' rule:
$$O(R \mid q, \vec{x}) = \frac{P(R = 1 \mid q, \vec{x})}{P(R = 0 \mid q, \vec{x})} = \frac{\dfrac{P(R = 1 \mid q)\, P(\vec{x} \mid R = 1, q)}{P(\vec{x} \mid q)}}{\dfrac{P(R = 0 \mid q)\, P(\vec{x} \mid R = 0, q)}{P(\vec{x} \mid q)}}$$
Binary Independence Model
$$O(R \mid q, \vec{x}) = \frac{P(R = 1 \mid q, \vec{x})}{P(R = 0 \mid q, \vec{x})} = \underbrace{\frac{P(R = 1 \mid q)}{P(R = 0 \mid q)}}_{\text{constant for a given query}} \cdot \underbrace{\frac{P(\vec{x} \mid R = 1, q)}{P(\vec{x} \mid R = 0, q)}}_{\text{needs estimation}}$$
Using the independence assumption:
$$\frac{P(\vec{x} \mid R = 1, q)}{P(\vec{x} \mid R = 0, q)} = \prod_{i=1}^{n} \frac{P(x_i \mid R = 1, q)}{P(x_i \mid R = 0, q)}$$
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{P(x_i \mid R = 1, q)}{P(x_i \mid R = 0, q)}$$
Binary Independence Model
Since $x_i$ is either 0 or 1:
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = 1} \frac{P(x_i = 1 \mid R = 1, q)}{P(x_i = 1 \mid R = 0, q)} \cdot \prod_{x_i = 0} \frac{P(x_i = 0 \mid R = 1, q)}{P(x_i = 0 \mid R = 0, q)}$$
Let $p_i = P(x_i = 1 \mid R = 1, q)$ and $u_i = P(x_i = 1 \mid R = 0, q)$.
Assume, for all terms not occurring in the query ($q_i = 0$), that $p_i = u_i$.
  This can be changed (e.g., in relevance feedback).
Probabilities
                            relevant (R=1)    not relevant (R=0)
  term present (x_i = 1)         p_i                 u_i
  term absent  (x_i = 0)       1 - p_i             1 - u_i
Then...
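A small sketch (illustrative, not from the slides) of how these per-term probabilities combine, under the independence assumption, into the likelihood ratio that multiplies $O(R \mid q)$:

```python
# Sketch: likelihood ratio of a binary doc vector x under BIM independence,
# given per-term p_i and u_i (all values below are made up for illustration).

def likelihood_ratio(x, p, u):
    """P(x | R=1, q) / P(x | R=0, q) for a binary term-incidence vector x."""
    ratio = 1.0
    for x_i, p_i, u_i in zip(x, p, u):
        if x_i == 1:
            ratio *= p_i / u_i                # term present: p_i vs u_i
        else:
            ratio *= (1 - p_i) / (1 - u_i)    # term absent: (1-p_i) vs (1-u_i)
    return ratio

x = [1, 0, 1]          # doc contains terms 1 and 3
p = [0.6, 0.4, 0.5]    # P(term present | relevant), made-up
u = [0.2, 0.4, 0.3]    # P(term present | non-relevant), made-up
print(likelihood_ratio(x, p, u))   # multiply by O(R|q) to get O(R|q,x)
```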
Binary Independence Model
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i}{u_i}}_{\text{matching query terms}} \cdot \underbrace{\prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - u_i}}_{\text{non-matching query terms}}$$
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i (1 - u_i)}{u_i (1 - p_i)}}_{\text{all matching terms}} \cdot \underbrace{\prod_{q_i = 1} \frac{1 - p_i}{1 - u_i}}_{\text{all query terms}}$$
Binary Independence Model
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i (1 - u_i)}{u_i (1 - p_i)}}_{\text{only quantity to be estimated for ranking}} \cdot \underbrace{\prod_{q_i = 1} \frac{1 - p_i}{1 - u_i}}_{\text{constant for each query}}$$
Retrieval Status Value:
$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - u_i)}{u_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)}$$
Binary Independence Model
It all boils down to computing the RSV:
$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - u_i)}{u_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)}$$
$$RSV = \sum_{x_i = q_i = 1} c_i \,, \qquad c_i = \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)}$$
The $c_i$ function as the term weights in this model.
So, how do we compute the $c_i$'s from our data?
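A minimal sketch (not from the slides) of ranking by RSV once the per-term weights $c_i$ have been estimated; the terms, weights, and docs below are hypothetical:

```python
# Sketch: score docs by RSV = sum of c_i over query terms present in the doc,
# where c[i] = log(p_i*(1-u_i) / (u_i*(1-p_i))) is assumed precomputed.

def rsv(doc_terms, query_terms, c):
    """Sum the weights c_i of query terms that occur in the doc."""
    return sum(c[t] for t in query_terms if t in doc_terms)

c = {"gold": 1.2, "silver": 0.8, "truck": 0.3}   # made-up term weights
query = {"gold", "truck"}
docs = {
    "d1": {"gold", "shipment", "truck"},
    "d2": {"silver", "truck"},
    "d3": {"gold", "silver"},
}
ranking = sorted(docs, key=lambda d: rsv(docs[d], query, c), reverse=True)
print(ranking)   # d1 (1.5), d3 (1.2), d2 (0.3)
```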
BIM: example
$q = \{x_1, x_2\}$
Relevance judgments from 20 docs, together with the distribution of $x_1, x_2$ within these docs [table of document counts for each combination $(x_1, x_2) \in \{(1,1), (1,0), (0,1), (0,0)\}$ omitted], give:
  $p_1 = 8/12$, $u_1 = 3/8$
  $p_2 = 7/12$, $u_2 = 4/8$
  $c_1 = \log(10/3)$
  $c_2 = \log(7/5)$
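A quick numeric check of these weights, using the $p_i$ and $u_i$ values from the slide above:

```python
import math

def c_weight(p, u):
    """c_i = log( p_i*(1-u_i) / (u_i*(1-p_i)) )"""
    return math.log(p * (1 - u) / (u * (1 - p)))

p1, u1 = 8/12, 3/8
p2, u2 = 7/12, 4/8
print(c_weight(p1, u1), math.log(10/3))   # both ~1.204
print(c_weight(p2, u2), math.log(7/5))    # both ~0.336
```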
Binary Independence Model
Estimating RSV coefficients in theory
For each term $i$, look at this table of document counts:
              Relevant      Non-Relevant        Total
  x_i = 1        s             df - s             df
  x_i = 0      S - s       N - df - S + s       N - df
  Total          S             N - S               N
Estimates (for now, assume no zero counts):
$$p_i = \frac{s}{S}, \qquad u_i = \frac{df - s}{N - S}$$
Weight of the $i$-th term:
$$c_i \approx \log \frac{s / (S - s)}{(df - s) / (N - df - S + s)}$$
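A sketch of the same estimate computed directly from the document counts; it follows the slide in assuming no zero cells (a common variant, not shown here, adds 0.5 to each count):

```python
import math

def c_from_counts(s, S, df, N):
    """Term weight from the contingency table: s relevant docs containing the
    term, S relevant docs in total, df docs containing the term, N docs total.
    Assumes no zero cells."""
    p = s / S                  # p_i = s / S
    u = (df - s) / (N - S)     # u_i = (df - s) / (N - S)
    return math.log(p * (1 - u) / (u * (1 - p)))

# Counts implied by term x_1 in the example above: s=8, S=12, df=11, N=20
print(c_from_counts(8, 12, 11, 20))   # log(10/3) ~ 1.204
```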
Estimation – key challenge
If non-relevant docs are approximated by the whole collection: $u_i = df_i / N$ (prob. of occurrence in non-relevant docs for the query), and
$$\log \frac{1 - u_i}{u_i} = \log \frac{N - df_i}{df_i} \approx \log \frac{N}{df_i}, \quad \text{i.e., IDF!}$$
Estimation – key challenge
$p_i$ (probability of occurrence in relevant docs) cannot be approximated as easily as $u_i$.
$p_i$ can be estimated in various ways:
  as a constant (Croft and Harper combination match): then we just get idf weighting of terms
  proportional to the prob. of occurrence in the collection: Greiff (SIGIR 1998) argues for $1/3 + 2/3 \cdot df_i / N$
  from relevant docs, if we know some: relevance weighting can be used in a feedback loop
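A sketch (illustrative, not from the slides) of the first option: with $u_i = df_i / N$ and a constant $p_i$ (0.5 is used below as an assumed value), the constant cancels and the term weight reduces to an idf-like quantity:

```python
import math

def c_weight(p, u):
    return math.log(p * (1 - u) / (u * (1 - p)))

N = 1_000_000
for df in (10, 1_000, 100_000):
    u = df / N                 # non-relevant docs approximated by the collection
    c = c_weight(0.5, u)       # constant p_i = 0.5 cancels out of the ratio
    # c = log((N - df) / df), which is close to the idf value log(N / df)
    print(df, round(c, 2), round(math.log(N / df), 2))
```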