

SLIDE 1

Probabilistic Models in IR

Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata

Using majority of the slides from Chris Manning, Pandu Nayak and Prabhakar Raghavan

SLIDE 2

Why probabilities?

§ User needs some information
§ Assumption: the required information is present somewhere
§ How to match?
  – Document representation is uncertain
  – Understanding of user need is uncertain
§ Traditional IR matching: by a semantically imprecise space of terms
§ Probabilities: a principled foundation for reasoning under uncertainty
§ Goal: use probabilities to quantify these uncertainties

SLIDE 3

Probabilistic IR topics

§ Classical probabilistic retrieval model
  – Probability ranking principle, etc.
  – Binary independence model (≈ Naïve Bayes text categorization)
  – (Okapi) BM25
§ Bayesian networks for text retrieval
§ Language model approach to IR
  – An important emphasis in recent work
§ Timeline: old, as well as currently hot in IR
  – Traditionally: neat ideas, but didn't win on performance
  – It may be different now

SLIDE 4

The document ranking problem

§ Collection D = {d_1, …, d_N}
§ Query: q
§ Ranking: return a list of documents, in order of relevance

Probabilistic idea of relevance:
§ Given a document d and a query q, is d relevant for q?
  – Random variable R = 1 (relevant) or R = 0 (not relevant)
  – Denote the probability of relevance by P(R = 1 | d, q)
§ Ranking: rank documents d_i in decreasing order of P(R = 1 | d_i, q)
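As a minimal sketch of the ranking step, suppose the relevance probabilities P(R = 1 | d_i, q) have already been estimated somehow; the document names and probability values below are made up for illustration.

```python
# Illustrative (assumed) probability estimates P(R = 1 | d, q) per document
prob_relevant = {"d1": 0.12, "d2": 0.87, "d3": 0.45}

# PRP ranking: sort documents by decreasing probability of relevance
ranking = sorted(prob_relevant, key=prob_relevant.get, reverse=True)
print(ranking)  # ['d2', 'd3', 'd1']
```

The hard part, of course, is estimating those probabilities; that is what the rest of the deck develops.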

SLIDE 5

Probability review

Bayes' Theorem. For events A and B:

P(A, B) = P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)

Hence the posterior P(A | B) is obtained from the prior P(A):

P(A | B) = P(B | A) P(A) / P(B) = P(B | A) P(A) / [ P(B | A) P(A) + P(B | ¬A) P(¬A) ]

Odds:

O(A) = P(A) / P(¬A) = P(A) / (1 − P(A))
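A quick numerical check of the identities above; the values of P(A), P(B | A), and P(B | ¬A) are made up for illustration.

```python
# Assumed prior and conditional probabilities
p_a, p_b_given_a, p_b_given_not_a = 0.3, 0.8, 0.1

# Denominator via total probability: P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' theorem: posterior of A
odds_a = p_a / (1 - p_a)                # odds of A

print(p_a_given_b, odds_a)
```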

SLIDE 6

The Probability Ranking Principle (PRP)

[1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)

§ Goal: overall effectiveness to be the best obtainable on the basis of the available data
§ Approach: rank the documents in the collection in order of decreasing probability of relevance to the user who submitted the request
  – Assumption: the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system

SLIDE 7

Probability Ranking Principle (PRP)

Goal: for every document d, estimate P[d is relevant to q], i.e. P(R = 1 | d, q), or simply P(R = 1 | d).

By Bayes' rule:

P(R = 1 | d) = P(d | R = 1) P(R = 1) / P(d)
P(R = 0 | d) = P(d | R = 0) P(R = 0) / P(d)

where
– P(d | R = 1), P(d | R = 0): probability that if a relevant (respectively, non-relevant) document is retrieved, it is d
– P(R = 1), P(R = 0): prior probability of retrieving a relevant or non-relevant document

Note that P(R = 0 | d) + P(R = 1 | d) = 1.

SLIDE 8

Probability Ranking Principle (PRP)

§ Simple case: no selection costs or other utility concerns that would differentially weight errors
§ PRP: rank all documents by P(R = 1 | x)
§ The 1/0 loss:
  – Lose a point if you return a non-relevant document
  – Gain a point if you return a relevant document
§ Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss
  – Provable if all probabilities are correct, etc. [e.g., Ripley 1996]

SLIDE 9

Probability Ranking Principle (PRP)

§ More complex case: retrieval costs
  – Let d be a document
  – C: cost of not retrieving a relevant document
  – C′: cost of retrieving a non-relevant document
§ Probability Ranking Principle: if, for all documents d′ not yet retrieved,

C′ · P(R = 0 | d) − C · P(R = 1 | d) ≤ C′ · P(R = 0 | d′) − C · P(R = 1 | d′)

then d is the next document to be retrieved.
§ We won't further consider cost/utility.

SLIDE 10

Probabilistic Retrieval Strategy

§ Estimating the probabilities: Binary Independence Model (BIM) – the simplest model
§ Questionable assumptions
  – "Relevance" of each document is independent of the relevance of other documents
    • In reality it is bad to keep on returning duplicates
  – Boolean model of relevance
§ Estimate how terms contribute to relevance
  – How do tf, df, document length etc. influence document relevance?
§ Combine to find the document relevance probability
§ Order documents by decreasing probability

SLIDE 11

Probabilistic Ranking

Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically." – van Rijsbergen

SLIDE 12

Binary Independence Model

§ Traditionally used in conjunction with the PRP
§ "Binary" = Boolean: documents are represented as binary incidence vectors of terms

x = (x_1, …, x_n), with x_i = 1 if term i is present in document x (and x_i = 0 otherwise)

§ "Independence": terms occur in documents independently
§ Different documents can be modeled as the same vector

SLIDE 13

Binary Independence Model

§ Query: binary term incidence vector q
§ Given query q:
  – For each document d, need to compute P(R | q, d)
  – Replace with computing P(R | q, x), where x is the binary term incidence vector representing d
  – Interested only in ranking
§ Use odds and Bayes' rule:

O(R | q, x) = P(R = 1 | q, x) / P(R = 0 | q, x)
            = [P(R = 1 | q) P(x | R = 1, q) / P(x | q)] / [P(R = 0 | q) P(x | R = 0, q) / P(x | q)]
            = [P(R = 1 | q) / P(R = 0 | q)] · [P(x | R = 1, q) / P(x | R = 0, q)]

The first factor is constant for a given query; the second needs estimation.

SLIDE 14

Binary Independence Model

Using the independence assumption:

P(x | R = 1, q) / P(x | R = 0, q) = ∏_{i=1..n} P(x_i | R = 1, q) / P(x_i | R = 0, q)

so

O(R | q, x) = O(R | q) · ∏_{i=1..n} P(x_i | R = 1, q) / P(x_i | R = 0, q)

Since each x_i is either 0 or 1, split the product:

O(R | q, x) = O(R | q) · ∏_{x_i=1} P(x_i = 1 | R = 1, q) / P(x_i = 1 | R = 0, q) · ∏_{x_i=0} P(x_i = 0 | R = 1, q) / P(x_i = 0 | R = 0, q)

Let p_i = P(x_i = 1 | R = 1, q) and r_i = P(x_i = 1 | R = 0, q). Assuming p_i = r_i for terms not in the query, only query terms (q_i = 1) remain:

O(R | q, x) = O(R | q) · ∏_{x_i=1, q_i=1} p_i / r_i · ∏_{x_i=0, q_i=1} (1 − p_i) / (1 − r_i)

SLIDE 15

What it means

                          relevant (R = 1)   not relevant (R = 0)
term present (x_i = 1)          p_i                  r_i
term absent  (x_i = 0)        1 − p_i              1 − r_i

O(R | q, x) = O(R | q) · ∏_{x_i=1, q_i=1} p_i / r_i · ∏_{x_i=0, q_i=1} (1 − p_i) / (1 − r_i)

SLIDE 16

Binary Independence Model

O(R | q, x) = O(R | q) · ∏_{x_i=1, q_i=1} p_i / r_i · ∏_{x_i=0, q_i=1} (1 − p_i) / (1 − r_i)

The first product is over all matching query terms; the second is over all non-matching query terms: too many! Multiply and divide the matching-term product by (1 − p_i) / (1 − r_i):

O(R | q, x) = O(R | q) · ∏_{x_i=1, q_i=1} [p_i (1 − r_i)] / [r_i (1 − p_i)] · ∏_{x_i=1, q_i=1} (1 − p_i) / (1 − r_i) · ∏_{x_i=0, q_i=1} (1 − p_i) / (1 − r_i)

The last two products combine into a single product over all query terms:

O(R | q, x) = O(R | q) · ∏_{x_i=q_i=1} [p_i (1 − r_i)] / [r_i (1 − p_i)] · ∏_{q_i=1} (1 − p_i) / (1 − r_i)

SLIDE 17

Binary Independence Model

O(R | q, x) = O(R | q) · ∏_{x_i=q_i=1} [p_i (1 − r_i)] / [r_i (1 − p_i)] · ∏_{q_i=1} (1 − p_i) / (1 − r_i)

The last product is constant for each query; the first product is the only quantity that needs to be estimated for ranking.

Retrieval Status Value (taking the log):

RSV = log ∏_{x_i=q_i=1} [p_i (1 − r_i)] / [r_i (1 − p_i)] = Σ_{x_i=q_i=1} log [p_i (1 − r_i)] / [r_i (1 − p_i)]

SLIDE 18

Binary Independence Model

Only need to compute the RSV:

RSV = Σ_{x_i=q_i=1} log [p_i (1 − r_i)] / [r_i (1 − p_i)] = Σ_{x_i=q_i=1} c_i,   where   c_i = log [p_i (1 − r_i)] / [r_i (1 − p_i)]

How to compute the c_i's from the data? The c_i are log odds ratios; they function as the term weights in this model.

SLIDE 19

Binary Independence Model

Estimating the RSV coefficients: for each term i, look at this table of document counts:

              Relevant    Non-relevant      Total
x_i = 1           s           n − s           n
x_i = 0         S − s     N − n − S + s     N − n
Total             S           N − S           N

Estimates (assuming no zero counts):

p_i ≈ s / S
r_i ≈ (n − s) / (N − S)
c_i ≈ K(N, n, S, s) = log [ s (N − n − S + s) ] / [ (n − s) (S − s) ]
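A small sketch of computing c_i from the count table above; the counts are made up for illustration (the slide assumes no zero counts, and in practice a small constant such as 0.5 is often added to each cell as smoothing).

```python
import math

# Assumed counts for one term i
N, S = 1000, 10   # total docs, relevant docs
n, s = 100, 5     # docs containing term i, relevant docs containing term i

# Estimates from the contingency table
p_i = s / S
r_i = (n - s) / (N - S)
c_i = math.log(p_i * (1 - r_i) / (r_i * (1 - p_i)))

# Same value computed directly from counts: log [ s(N-n-S+s) / ((n-s)(S-s)) ]
c_i_direct = math.log(s * (N - n - S + s) / ((n - s) * (S - s)))
print(c_i, c_i_direct)
```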

SLIDE 20

Estimation

§ If non-relevant documents are approximated by the whole collection, then r_i (probability of occurrence in non-relevant documents for the query) is n/N, and

log (1 − r_i) / r_i = log (N − n − S + s) / (n − s) ≈ log (N − n) / n ≈ log N / n = IDF!
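Illustrative numbers for the approximation chain above: for a rare term (small n) and a small relevance set (small S, s), each step changes the weight only slightly. The collection size and counts below are assumptions.

```python
import math

N, n = 1_000_000, 100   # collection size, document frequency (assumed)
S, s = 10, 5            # relevant docs / relevant docs containing the term

full = math.log((N - n - S + s) / (n - s))  # estimate of log (1 - r_i) / r_i
no_rel = math.log((N - n) / n)              # drop the S, s correction
idf = math.log(N / n)                       # classic IDF weight
print(full, no_rel, idf)
```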

SLIDE 21

Estimation – key challenge

§ Estimating p_i (probability of occurrence in relevant documents) is a little difficult
§ p_i can be estimated in various ways:
  – from relevant documents, if we know some
    • relevance weighting can be used in a feedback loop
  – as a constant (Croft and Harper combination match) – then we just get idf weighting of terms (with p_i = 0.5):

RSV = Σ_{x_i=q_i=1} log N / n_i

  – proportional to the probability of occurrence in the collection
    • Greiff (SIGIR 1998) argues for p_i = 1/3 + 2/3 · df_i/N

SLIDE 22

Probabilistic Relevance Feedback

1. Guess a preliminary probabilistic description of the R = 1 documents and use it to retrieve a first set of documents
2. Interact with the user to refine the description: learn some definite members with R = 1 and R = 0
3. Re-estimate p_i and r_i on the basis of these
   – Or combine the new information with the original guess (using a Bayesian prior):

   p_i^(2) = (|V_i| + κ · p_i^(1)) / (|V| + κ)

   where κ = prior weight, V_i = {documents where x_i occurs}, V = fixed-size set of relevant documents in the model
4. Repeat, thus generating a succession of approximations to the relevant documents
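The Bayesian update in step 3 can be sketched as a one-line function; the values of κ, the previous estimate, and the set sizes below are illustrative assumptions.

```python
def update_p(v_i, v, p_prev, kappa=5.0):
    """Bayesian-prior update: p_i^(2) = (|V_i| + kappa * p_i^(1)) / (|V| + kappa)."""
    return (v_i + kappa * p_prev) / (v + kappa)

# 3 of 10 known-relevant docs contain the term; previous estimate was 0.5
p2 = update_p(v_i=3, v=10, p_prev=0.5)
print(p2)  # pulled between the raw estimate 3/10 and the prior 0.5
```

Larger κ keeps the new estimate closer to the original guess; smaller κ trusts the observed counts more.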

SLIDE 23

Iteratively estimating p_i and r_i (= pseudo-relevance feedback)

1. Assume that p_i is constant over all x_i in the query, and r_i as before
   – p_i = 0.5 (even odds) for any given doc
2. Determine a guess of the relevant document set:
   – V is the fixed-size set of highest-ranked documents on this model
3. We need to improve our guesses for p_i and r_i, so:
   – Use the distribution of x_i in the docs in V; let V_i be the set of documents in V containing x_i
     • p_i = |V_i| / |V|
   – Assume that documents not retrieved are not relevant
     • r_i = (n_i − |V_i|) / (N − |V|)
4. Go to step 2 until convergence, then return the ranking
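The four steps above can be sketched on a toy collection. The documents, query, and |V| = 2 are illustrative assumptions, and the add-0.5 smoothing in step 3 is a common practical tweak not on the slide (it avoids zero probabilities).

```python
import math

docs = [
    {"probabilistic", "retrieval", "model"},
    {"probabilistic", "ranking", "principle"},
    {"cooking", "recipes"},
    {"binary", "independence", "model", "retrieval"},
]
query = {"probabilistic", "retrieval"}
N = len(docs)
n = {t: sum(t in d for d in docs) for t in query}  # document frequencies n_i

def rsv(doc, p, r):
    # RSV = sum of the log odds ratios c_i over matching query terms
    return sum(math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
               for t in query if t in doc)

# Step 1: p_i = 0.5; r_i approximated from the whole collection
p = {t: 0.5 for t in query}
r = {t: n[t] / N for t in query}

V_size, prev_V = 2, None
for _ in range(10):
    # Step 2: V = the |V| highest-ranked documents under the current model
    ranked = sorted(range(N), key=lambda i: rsv(docs[i], p, r), reverse=True)
    V = set(ranked[:V_size])
    if V == prev_V:  # Step 4: stop once the guessed relevant set is stable
        break
    prev_V = V
    # Step 3: re-estimate p_i and r_i from V (with add-0.5 smoothing)
    for t in query:
        Vi = sum(t in docs[i] for i in V)
        p[t] = (Vi + 0.5) / (V_size + 1)
        r[t] = (n[t] - Vi + 0.5) / (N - V_size + 1)

print(ranked)  # final ranking of document indices
```

On this toy data the loop converges after one re-estimation, with the two documents containing the query terms ranked first.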
SLIDE 24

PRP and BIM

§ Getting reasonable approximations of the probabilities is possible
§ Requires restrictive assumptions:
  – Term independence
  – Terms not in the query don't affect the outcome
  – Boolean representation of documents/queries/relevance
  – Document relevance values are independent
§ Some of these assumptions can be removed
§ Problem: we either require partial relevance information or can only derive somewhat inferior term weights

SLIDE 25

Removing term independence

§ In general, index terms aren't independent
§ Dependencies can be complex
§ van Rijsbergen (1979) proposed a model of simple tree dependencies
  – Exactly Friedman and Goldszmidt's Tree Augmented Naive Bayes (AAAI 13, 1996)
§ Each term dependent on one other
§ In the 1970s, estimation problems held back the success of this model

SLIDE 26

Resources

§ IR Book by Manning et al.
§ S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.
§ C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of the math] http://www.dcs.gla.ac.uk/Keith/Preface.html
§ N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal 35(3): 243–255. [Easiest read, with BNs]
§ F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552.
  – http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/