Modelling A User Population for Designing Information Retrieval Metrics



  1. Modelling A User Population for Designing Information Retrieval Metrics
Tetsuya Sakai (NewsWatch, Inc.) tetsuyasakai@acm.org
Stephen Robertson (Microsoft Research Cambridge) ser@microsoft.com
EVIA 2008, December 16, 2008 @ NII, Tokyo

  2. TALK OUTLINE 1. Objectives 2. Normalised Cumulative Precision (NCP) 3. Normalised Cumulative Utility (NCU) 4. Evaluating Evaluation Metrics: Resemblance to Average Precision 5. Evaluating Evaluation Metrics: Discriminative Power 6. Conclusions

  3. Average Precision (AP)
AP = (1/R) Σ_r I(r) · P(r), where I(r) = 1 iff the doc at rank r is relevant, P(r) = (1/r) Σ_{k≤r} I(k) is the precision at rank r, and R is the number of relevant docs.
• Used widely since the advent of TREC
• Mean over topics is referred to as “MAP”
• Cannot handle graded relevance (but many IR researchers just love it)
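A minimal Python sketch of the formula above (not from the slides; the function name and example judgments are hypothetical):

```python
# A minimal sketch of AP, assuming binary relevance judgments.
# rels[i] = 1 iff the doc at rank i+1 is relevant; R = total number of
# relevant docs in the collection for this topic (hypothetical inputs).

def average_precision(rels, R):
    ap, seen = 0.0, 0
    for r, rel in enumerate(rels, start=1):
        if rel:                 # I(r) = 1
            seen += 1
            ap += seen / r      # precision at rank r
    return ap / R

# Example: relevant docs at ranks 1, 2, 4 and 7 (cf. the ranked list on
# slide 8), with R = 4:
print(average_precision([1, 1, 0, 1, 0, 0, 1], R=4))  # ≈ 0.830
```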

  4. Criticisms of (Mean) Average Precision ((M)AP)
• AP may be a poor measure of user performance/satisfaction [Turpin/Scholer SIGIR 06 etc.]
• AP lacks a user model:
  “there is no single user application that directly motivates MAP” [Buckley/Voorhees TREC book]
  “there is no plausible search model that corresponds to MAP, because no user knows in advance the number of relevant answers present in the collection…” [Moffat/Webber/Zobel SIGIR 07]

  5. “AP lacks a user model?” Rubbish! [Robertson SIGIR 08]

  6. Objectives
• Robertson showed that AP is a special case of Normalised Cumulative Precision (NCP), which models a population of users.
• We generalise NCP and introduce Normalised Cumulative Utility (NCU), and show that:
  - AP and Q-measure are in fact reasonable metrics!
  - A version of NCU, which utilises graded relevance in a novel way, has high discriminative power!

  7. [Figure: a user’s information need (“I need the latest information on EVIA!”) expressed as a query]

  8. [Figure: a ranked list with graded relevance levels, top to bottom: L1 (partially relevant), L3 (highly relevant), L0 (not relevant), L2 (relevant), L0 (not relevant), L0 (not relevant), L3 (highly relevant)]

  9. Where do users stop scanning the list?
[Figure: the ranked list above, with different users stopping at the relevant docs: “I stop at L1” (rank 1), “I stop at L3” (rank 2), “I stop at L2” (rank 4), “I stop at L3” (rank 7)]

  10. p_u: Uniform Distribution over Relevant Documents
ASSUMPTIONS:
- Users stop at a relevant doc;
- The stopping probability is uniform across all relevant docs
[Figure: the ranked list L1, L3, L0, L2, L0, L0, L3]
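A sketch of p_u under these assumptions (the function name is hypothetical; relevant docs missing from the ranked list implicitly keep their 1/R mass off-list, so the listed probabilities may sum to less than 1):

```python
# p_u: each rank holding a relevant doc gets stopping probability 1/R,
# non-relevant ranks get 0 (a sketch under the stated assumptions).

def p_uniform(rels, R):
    return [rel / R for rel in rels]

print(p_uniform([1, 1, 0, 1, 0, 0, 1], R=4))
# [0.25, 0.25, 0.0, 0.25, 0.0, 0.0, 0.25]
```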

  11. p_rb: Rank-Biased Distribution over Relevant Docs
ASSUMPTIONS:
- Users stop at a relevant doc;
- Users are more likely to stop near the top than near the bottom
[Figure: the ranked list L1, L3, L0, L2, L0, L0, L3]

  12. p_gu: Graded-Uniform Distribution over Relevant Docs
ASSUMPTIONS:
- Users stop at a relevant doc;
- Users are more likely to stop at a highly relevant doc than at a partially relevant doc
[Figure: the ranked list L1, L3, L0, L2, L0, L0, L3]

  13. TALK OUTLINE 1. Objectives 2. Normalised Cumulative Precision (NCP) 3. Normalised Cumulative Utility (NCU) 4. Evaluating Evaluation Metrics: Resemblance to Average Precision 5. Evaluating Evaluation Metrics: Discriminative Power 6. Conclusions

  14. Robertson’s Normalised Cumulative Precision (NCP)
NCP = Σ_n p_s(n) · U(n)/C(n): an expectation of utility/cost over a user population, where
- p_s(n) is the probability that the user stops at the (relevant) document at rank n;
- cost C(n) = n is the #docs seen so far;
- utility U(n) is the #relevant seen so far, given the stopping point;
- so U(n)/C(n) is precision at n.
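A sketch of NCP as this expectation (hypothetical names; p_stop is any stopping distribution aligned with the ranked list):

```python
# NCP as an expectation over stopping points: sum over ranks n of
# p_stop(n) * precision(n).

def ncp(rels, p_stop):
    score, seen = 0.0, 0
    for r, (rel, p) in enumerate(zip(rels, p_stop), start=1):
        seen += rel                # utility U(n): #relevant seen so far
        score += p * (seen / r)    # p_s(n) * U(n)/C(n), with cost C(n) = n
    return score
```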

  15. p_u: Uniform Distribution over Relevant Documents
ASSUMPTIONS:
- Users stop at a relevant doc;
- The stopping probability is uniform across all relevant docs
Let p_s(n) = p_u(n) = 1/R for every rank n that has a relevant doc. Then NCP reduces to AP (= NCP_u)!
[Figure: the ranked list L1, L3, L0, L2, L0, L0, L3]
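Reusing the hypothetical sketches above, this reduction can be checked directly:

```python
# With the uniform stopping distribution p_u, NCP equals AP, as the
# slide states.
rels, R = [1, 1, 0, 1, 0, 0, 1], 4
assert abs(ncp(rels, p_uniform(rels, R)) - average_precision(rels, R)) < 1e-12
```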

  16. That is, • AP is a special case of NCP. • It is an expectation of utility/cost over a user population whose stopping probability is uniform across all relevant documents. • Hence, AP is in fact a reasonable metric!

  17. TALK OUTLINE 1. Objectives 2. Normalised Cumulative Precision (NCP) 3. Normalised Cumulative Utility (NCU) 4. Evaluating Evaluation Metrics: Resemblance to Average Precision 5. Evaluating Evaluation Metrics: Discriminative Power 6. Conclusions

  18. We generalise NCP in two ways
Stopping probability: p_u (uniform), p_rb (rank-biased), p_gu (graded-uniform)
Normalised Utility: BR(n) (blended ratio), which generalises P(n)

  19. p_u: Uniform Distribution over Relevant Documents
ASSUMPTIONS:
- Users stop at a relevant doc;
- The stopping probability is uniform across all relevant docs
[Figure: the ranked list L1, L3, L0, L2, L0, L0, L3]

  20. p_rb: Rank-Biased Distribution over Relevant Docs
ASSUMPTIONS:
- Users stop at a relevant doc;
- Users are more likely to stop near the top than near the bottom
[Figure: the ranked list L1, L3, L0, L2, L0, L0, L3]

  21. γ: top-heaviness parameter for p_rb
γ = 1 reduces p_rb to p_u.
[Figure: stopping probability (y-axis) over relevant documents found in the ranked list (x-axis)]
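The slides do not spell out the exact form of p_rb; the sketch below assumes one plausible form consistent with the γ = 1 reduction, weighting the j-th retrieved relevant doc by γ^(j−1) and normalising over all R relevant docs:

```python
# A sketch of p_rb (ASSUMED form, not quoted from the paper): the j-th
# retrieved relevant doc gets weight gamma**(j-1), normalised by the
# geometric sum over all R relevant docs, so gamma = 1 recovers 1/R.

def p_rank_biased(rels, R, gamma):
    norm = sum(gamma ** k for k in range(R))
    probs, j = [], 0
    for rel in rels:
        if rel:
            probs.append(gamma ** j / norm)
            j += 1
        else:
            probs.append(0.0)
    return probs

print(p_rank_biased([1, 1, 0, 1, 0, 0, 1], R=4, gamma=0.7))
```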

  22. p_gu: Graded-Uniform Distribution over Relevant Docs
ASSUMPTIONS:
- Users stop at a relevant doc;
- Users are more likely to stop at a highly relevant doc than at a partially relevant doc
Example stopping weights: stop(L3):stop(L2):stop(L1) = 3:2:1 or 10:5:1.
(Flat weights stop(L3):stop(L2):stop(L1) = 1:1:1 reduce p_gu to p_u.)
[Figure: the ranked list L1, L3, L0, L2, L0, L0, L3]
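A sketch of p_gu using the slide’s stopping weights; normalising over the retrieved relevant docs (rather than all R relevant docs) is an assumption here:

```python
# p_gu sketch: each retrieved relevant doc gets the stopping weight of
# its relevance level, normalised over the list (ASSUMED normalisation).

def p_graded_uniform(grades, stop_weights):
    # grades: per-rank relevance level, e.g. 'L0'..'L3';
    # stop_weights: e.g. {'L3': 3, 'L2': 2, 'L1': 1}; L0 gets zero.
    weights = [stop_weights.get(g, 0) for g in grades]
    total = sum(weights)
    return [w / total for w in weights]

grades = ['L1', 'L3', 'L0', 'L2', 'L0', 'L0', 'L3']
print(p_graded_uniform(grades, {'L3': 3, 'L2': 2, 'L1': 1}))
# weights 1,3,0,2,0,0,3 -> total 9 -> [1/9, 3/9, 0, 2/9, 0, 0, 3/9]
```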

  23. Blended Ratio (BR)
BR(n) = (C(n) + β·cg(n)) / (n + β·cg*(n)) blends precision with normalised cumulative gain for handling graded relevance, where C(n) is the #relevant in top n, cg(n) the cumulative gain at n, and cg*(n) the ideal cumulative gain at n.
A large β represents a very persistent user; β = 0 reduces BR to P.
BR is suitable as a utility/cost function because, given the stopping point n, it does NOT matter where the relevant documents are within top n.
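A sketch of BR(n) as defined above; the gain mapping per relevance level (e.g. L3→3, L2→2, L1→1, L0→0) is an assumption:

```python
# Blended Ratio: C(n) = #relevant in top n, cg(n) = cumulative gain at n,
# cg*(n) = ideal cumulative gain at n (gains of the ideal ranking).
# beta = 0 collapses this to precision C(n)/n.

def blended_ratio(gains, ideal_gains, n, beta):
    C = sum(1 for g in gains[:n] if g > 0)
    cg = sum(gains[:n])
    cg_star = sum(ideal_gains[:n])
    return (C + beta * cg) / (n + beta * cg_star)
```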

  24. NCU family
Stopping probability:
- p_rb (rank-biased) with top-heaviness parameter γ (γ = 1 reduces p_rb to p_u)
- p_gu (graded-uniform) with stopping weights (flat weights reduce p_gu to p_u)
Normalised Utility given the stopping point:
- BR(n) (blended ratio) with persistence parameter β (β = 0 reduces BR(n) to P(n))
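Putting the pieces together, a sketch of NCU as the expectation of BR over a stopping distribution, reusing the hypothetical sketches above (the gain mapping remains an assumption):

```python
# NCU = expectation of BR over a stopping distribution.

def ncu(gains, ideal_gains, p_stop, beta):
    return sum(p * blended_ratio(gains, ideal_gains, n, beta)
               for n, p in enumerate(p_stop, start=1) if p > 0)

gains = [1, 3, 0, 2, 0, 0, 3]          # the ranked list from slide 8
ideal = sorted(gains, reverse=True)    # ideal ranking: 3, 3, 2, 1, 0, ...
rels = [1 if g > 0 else 0 for g in gains]
print(ncu(gains, ideal, p_rank_biased(rels, R=4, gamma=0.9), beta=1.0))
```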

  25. TALK OUTLINE 1. Objectives 2. Normalised Cumulative Precision (NCP) 3. Normalised Cumulative Utility (NCU) 4. Evaluating Evaluation Metrics: Resemblance to Average Precision 5. Evaluating Evaluation Metrics: Discriminative Power 6. Conclusions

  26. Comparing a system ranking by Metric M to that by AP
• Kendall’s rank correlation: a monotonic function of the probability that a randomly chosen system pair is ordered identically in the two rankings
• Yilmaz/Aslam/Robertson (YAR) rank correlation [SIGIR 08]: a monotonic function of the probability that a randomly chosen system and one ranked above it are ordered identically in the two rankings
  - Assumes that the top ranks are the most important
  - Not symmetrical, but almost symmetrical in practice
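Sketches of the two correlations (inputs are hypothetical rankings of system ids, best first):

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    # Agreement over all system pairs, rescaled to [-1, 1].
    pos_a = {s: i for i, s in enumerate(ranking_a)}
    pos_b = {s: i for i, s in enumerate(ranking_b)}
    concordant = sum((pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
                     for x, y in combinations(ranking_a, 2))
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return 2 * concordant / n_pairs - 1

def tau_ap(reference, other):
    # YAR/AP correlation: each system is compared only against those
    # ranked above it in the reference ranking, so top ranks weigh more.
    # Note it is not symmetrical in its two arguments.
    pos = {s: i for i, s in enumerate(other)}
    total = 0.0
    for i in range(1, len(reference)):
        agree = sum(1 for s in reference[:i] if pos[s] < pos[reference[i]])
        total += agree / i
    return 2 * total / (len(reference) - 1) - 1

print(kendall_tau(['A', 'B', 'C', 'D'], ['A', 'C', 'B', 'D']))  # ≈ 0.667
print(tau_ap(['A', 'B', 'C', 'D'], ['A', 'C', 'B', 'D']))       # ≈ 0.667
```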

  27. YAR rank correlation with AP (= NCU_u, β=0): NTCIR-6J
[Chart: YAR correlations with the AP ranking for Q and the NCU variants; legend: rb β=0, rb β=1, gu β=0, gu β=1; configurations γ = 1, 0.9, 0.7, 0.5 and stop weights = 3:2:1]
Q, NCU_gu,β=0 and NCU_gu,β=1 produce rankings that are very similar to that by AP. Heavy rank bias produces very unconventional system rankings.

  28. YAR rank correlation with AP (= NCU_u, β=0): TREC03
[Chart: YAR correlations with the AP ranking for Q and the NCU variants; legend: rb β=0, rb β=1, gu β=0, gu β=1; configurations γ = 1, 0.9, 0.7, 0.5 and stop weights = 3:2:1]
Q, NCU_gu,β=0 and NCU_gu,β=1 produce rankings that are very similar to that by AP. Heavy rank bias produces very unconventional system rankings.

  29. TALK OUTLINE 1. Objectives 2. Normalised Cumulative Precision (NCP) 3. Normalised Cumulative Utility (NCU) 4. Evaluating Evaluation Metrics: Resemblance to Average Precision 5. Evaluating Evaluation Metrics: Discriminative Power 6. Conclusions

  30. Measuring discriminative power of metrics [Sakai SIGIR06]
• Given a set of systems and a significance level α (probability of Type I error; α = 0.05 ⇔ 95% confidence), for how many system pairs can a metric detect statistical significance?
• Sakai’s method uses the bootstrap test, and can also estimate the absolute performance difference required to achieve statistical significance (e.g. “a difference of 0.20 is usually statistically significant”)
• Sakai’s method and the Voorhees/Buckley swap method [SIGIR 02] give similar results in practice
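A rough sketch of the bootstrap ingredient, as a simplified shift-method paired bootstrap (not Sakai’s exact studentised procedure; names and B are hypothetical):

```python
import random

# Simplified paired bootstrap test for one system pair: resample the
# centred per-topic score differences and count how often the resampled
# mean is at least as extreme as the observed one.

def bootstrap_p_value(scores_x, scores_y, B=1000, seed=0):
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    observed = sum(diffs) / len(diffs)
    centred = [d - observed for d in diffs]   # impose the null hypothesis
    extreme = 0
    for _ in range(B):
        sample = [rng.choice(centred) for _ in range(len(centred))]
        if abs(sum(sample) / len(sample)) >= abs(observed):
            extreme += 1
    return extreme / B

# Discriminative power of a metric at level alpha = fraction of system
# pairs whose p-value falls below alpha.
```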

  31. Discriminative power at α=0.05: NTCIR-6J
[Chart: discriminative power (%) for AP, Q and the NCU variants; legend: rb β=0, rb β=1, gu β=0, gu β=1; configurations γ = 1, 0.9, 0.7, 0.5 and stop weights = 3:2:1]
AP, Q, NCU_gu,β=0 and NCU_gu,β=1 have high discriminative power. Heavy rank bias hurts discriminative power (by ignoring low-ranked docs).

  32. Discriminative power at α=0.05: TREC03
[Chart: discriminative power (%) for AP, Q and the NCU variants; legend: rb β=0, rb β=1, gu β=0, gu β=1; configurations γ = 1, 0.9, 0.7, 0.5 and stop weights = 3:2:1]
AP, Q, NCU_gu,β=0 and NCU_gu,β=1 have high discriminative power. Heavy rank bias hurts discriminative power (by ignoring low-ranked docs).

  33. Effect of γ on discriminative power: TREC03
[Chart: achieved significance level (ASL) (y-axis) over run pairs sorted by ASL (x-axis)]
Heavy rank bias hurts discriminative power (by ignoring low-ranked docs).

  34. TALK OUTLINE 1. Objectives 2. Normalised Cumulative Precision (NCP) 3. Normalised Cumulative Utility (NCU) 4. Evaluating Evaluation Metrics: Resemblance to Average Precision 5. Evaluating Evaluation Metrics: Discriminative Power 6. Conclusions
