
Improving the Accuracy of System Performance Estimation by Using Shards



  1. Improving the Accuracy of System Performance Estimation by Using Shards. Nicola Ferro & Mark Sanderson

  2. IR evaluation is noisy. [figure: two per-topic score plots, axes 0.00–1.00]

  3. ANOVA: Data = Model + Error. The model is a linear mixture of factors.
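Written out as an equation (the notation here is illustrative, not the paper's):

```latex
\[
\underbrace{y}_{\text{Data}}
  = \underbrace{\mu + \textstyle\sum_{f} \mathrm{effect}_f}_{\text{Model}}
  + \underbrace{\varepsilon}_{\text{Error}}
\]
% \mu is the grand mean; each factor f contributes an additive effect;
% \varepsilon is the residual error left unexplained by the model.
```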

  4. First go: Tague-Sutcliffe and Blustein, 1995. Factors: systems, topics.
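One common way to write this two-factor model (symbol choices are mine): the score of system j on topic i is

```latex
\[
y_{ij} = \mu + \tau_i + \alpha_j + \varepsilon_{ij}
\]
% \mu: grand mean, \tau_i: topic effect, \alpha_j: system effect,
% \varepsilon_{ij}: residual error.
```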

  5. Question: can we do better? Add a topic*system factor?

  6. New system. [figure: three per-topic score plots, axes 0.00–1.00]

  7. Partition collections into shards.
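A minimal sketch of one way to split a collection into random, roughly equal-sized shards; the function name, document IDs, and shard count below are illustrative, not the paper's exact procedure:

```python
import random

def partition_into_shards(doc_ids, n_shards, seed=42):
    """Randomly partition a list of document IDs into n_shards
    disjoint shards whose sizes differ by at most one document."""
    rng = random.Random(seed)
    docs = list(doc_ids)
    rng.shuffle(docs)
    # Deal documents round-robin across the shards.
    return [docs[i::n_shards] for i in range(n_shards)]

# Example: split 10,000 synthetic doc IDs into 5 shards.
shards = partition_into_shards(range(10_000), n_shards=5)
print([len(s) for s in shards])  # [2000, 2000, 2000, 2000, 2000]
```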

  8. Replicates. [figure: two per-topic score plots, axes 0.00–1.00]

  9. Replicates. [figure: two per-topic score plots, axes 0.00–1.00]

  10. Replicates. [figure: two per-topic score plots, axes 0.00–1.00] E. M. Voorhees, D. Samarov, and I. Soboroff. Using Replicates in Information Retrieval Evaluation. ACM Transactions on Information Systems (TOIS), 36(2):12:1–12:21, September 2017.

  11. Past ANOVA factors: topics, systems, topic*system interactions.
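With a single score per (topic, system) cell, the interaction term is confounded with the error term; replicates (index r below) are needed to separate the two, which is what shards provide. In illustrative notation:

```latex
\[
y_{ijr} = \mu + \tau_i + \alpha_j + (\tau\alpha)_{ij} + \varepsilon_{ijr}
\]
% Without replicates (r = 1), (\tau\alpha)_{ij} and \varepsilon_{ij}
% cannot be estimated separately.
```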

  12. Our ANOVA factors: topics, systems, shards, topic*system, system*shard, topic*shard.
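A sketch of a crossed model with these six factors, for topic i, system j, and shard k; the notation is mine, see the paper for the exact MD6 formulation:

```latex
\[
y_{ijk} = \mu + \tau_i + \alpha_j + \beta_k
        + (\tau\alpha)_{ij} + (\alpha\beta)_{jk} + (\tau\beta)_{ik}
        + \varepsilon_{ijk}
\]
```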

  13. Models
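As a concrete illustration, a model of this shape can be fitted with statsmodels; the tiny data frame below is made up (one score per topic/system/shard cell), not data from the paper:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format data: one row per (topic, system, shard) score.
df = pd.DataFrame({
    "topic":  ["t1", "t1", "t2", "t2"] * 2,
    "system": ["s1", "s2", "s1", "s2"] * 2,
    "shard":  ["a"] * 4 + ["b"] * 4,
    "score":  [0.42, 0.55, 0.31, 0.60, 0.40, 0.50, 0.35, 0.58],
})

# Main effects plus all two-way interactions of the three factors.
model = smf.ols(
    "score ~ C(topic) + C(system) + C(shard)"
    " + C(topic):C(system) + C(system):C(shard) + C(topic):C(shard)",
    data=df,
).fit()
print(anova_lm(model))
```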

  14. IR evaluation is noisy. [figure repeated from slide 2: two per-topic score plots, axes 0.00–1.00]

  15. Hard vs easy topics? [figure: per-topic score plots for runs ok8alx and INQ604, axes 0.00–1.00]

  16. Few vs many QRELs

  17. This paper: x. [figure/equation residue; only the symbol x is recoverable]

  18. Proof in the paper: should the topic*shard factor be included? The value of x is not important; we choose x = 0.

  19. Few vs many QRELs? Should topics be 'treated' equally? G. V. Cormack and T. R. Lynam. Statistical Precision of Information Retrieval Evaluation. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 533–540. ACM Press, 2006. S. Robertson. On Document Populations and Measures of IR Effectiveness. In Proc. 1st International Conference on the Theory of Information Retrieval (ICTIR 2007), pages 9–22, 2007.

  20. Compare MD6 factors. [figure: two plots, axes 0.00–1.00]

  21. Experiments: TREC-8 Adhoc (129 runs); TREC-9 Web (104 runs); TREC-27 Common Core (72 runs). Rankings compared against the original runs' rankings (τ).
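The τ here is presumably Kendall's rank correlation between two system rankings; a minimal sketch with scipy, where the two score lists are made-up numbers, not results from the paper:

```python
from scipy.stats import kendalltau

# Mean scores for the same five runs under two conditions (made-up):
# e.g., estimated on the whole collection vs. via the shard model.
original = [0.31, 0.28, 0.45, 0.22, 0.39]
estimated = [0.30, 0.29, 0.44, 0.20, 0.41]

tau, p_value = kendalltau(original, estimated)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```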

  22. [results slide: figure only]

  23. MD6

  24. Other parts of the paper: confidence intervals calculated with Tukey HSD; details of the proof on the zero value for shards and MD6. Code: https://bitbucket.org/frrncl/sigir2019-fs-code/
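A minimal sketch of Tukey HSD pairwise comparisons (with confidence intervals in the printed summary) using statsmodels; the per-topic scores and system labels below are made up:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up per-topic scores for three systems (10 topics each).
rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(0.40, 0.05, 10),  # system A
    rng.normal(0.45, 0.05, 10),  # system B
    rng.normal(0.55, 0.05, 10),  # system C
])
systems = ["A"] * 10 + ["B"] * 10 + ["C"] * 10

# Pairwise mean differences with family-wise error controlled by Tukey HSD;
# the summary lists each pair's difference, confidence bounds, and decision.
print(pairwise_tukeyhsd(scores, systems))
```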

  25. Conclusions: Can we do better than past ANOVA? Yes, with MD6. The topic*shard interaction is strong; its impact has not been observed when measuring performance. Test collections are expensive to build; we can get substantially more signal out of the three collections.

  26. Future work: UQV100, a query test collection; compare to Voorhees, Samarov, and Soboroff (2017), whose metric is predictive power rather than significant differences; create new collections with fewer judgments/topics.
