Improving the Accuracy of System Performance Estimation by Using Shards
Nicola Ferro & Mark Sanderson
IR evaluation is noisy
[figure: per-topic score plots, 0–1 scale]
ANOVA: Data = Model + Error
Model: a linear mixture of factors
First go: Tague-Sutcliffe and Blustein, 1995
Factors: Systems, Topics
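The model on this slide can be written as a simple crossed two-factor design; a minimal LaTeX sketch, with symbol names (mu, tau, alpha, epsilon) assumed here for illustration rather than taken from the slides:

```latex
% Score of system j on topic i: grand mean + topic effect + system effect + error.
% Symbol names are an assumption, not quoted from the paper.
\[
  y_{ij} = \mu + \tau_i + \alpha_j + \varepsilon_{ij}
\]
```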
Question: Can we do better? Add a Topic*System factor?
New system
[figure: three per-topic score plots, 0–1 scale]
Partition collections into shards
Replicates
E. M. Voorhees, D. Samarov, and I. Soboroff. Using Replicates in Information Retrieval Evaluation. ACM Transactions on Information Systems (TOIS), 36(2):12:1–12:21, September 2017.
Past ANOVA factors: Topics, Systems, Topic*System interactions
Our ANOVA factors: Topics, Systems, Shards, Topic*System, System*Shard, Topic*Shard
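With shards as a third factor, the factor list above corresponds to a model with three main effects and the three two-way interactions; a sketch in the same assumed notation (gamma for the shard effect), not quoted from the paper:

```latex
% y_{ijk}: score of system j on topic i restricted to shard k.
% tau = topic, alpha = system, gamma = shard; parenthesised terms are interactions.
% Notation is assumed for illustration.
\[
  y_{ijk} = \mu + \tau_i + \alpha_j + \gamma_k
          + (\tau\alpha)_{ij} + (\alpha\gamma)_{jk} + (\tau\gamma)_{ik}
          + \varepsilon_{ijk}
\]
```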
Models
IR evaluation is noisy
[figure: per-topic score plots, 0–1 scale]
Hard vs Easy Topics?
[figure: per-topic scores for runs ok8alx and INQ604, 0–1 scale]
Few vs Many QRELs
This paper: x
Proof in the paper: should the topic*shard factor be included? The value of x is not important; we choose x = 0
Few vs Many QRELs? Should topics be 'treated' equally?
G. V. Cormack and T. R. Lynam. Statistical Precision of Information Retrieval Evaluation. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 533–540. ACM Press, New York, USA, 2006.
S. Robertson. On Document Populations and Measures of IR Effectiveness. In Proc. 1st International Conference on the Theory of Information Retrieval (ICTIR 2007), Foundation for Information Society, pages 9–22, 2007.
Compare MD6 factors
[figure: factor comparison plots, 0–1 scale]
Experiments
TREC-8, Ad hoc, 129 runs
TREC-9, Web, 104 runs
TREC-27, Common Core, 72 runs
Original run rankings (τ)
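As a rough illustration of how a shard-based ANOVA like this can be fitted (not the authors' code, which is linked later in the deck), here is a minimal Python sketch using statsmodels; the file name and the columns system, topic, shard, ap are assumptions made for the example:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical input: one row per (system, topic, shard) with an AP score.
df = pd.read_csv("per_shard_ap.csv")

# Main effects for system, topic, shard plus all two-way interactions
# (the MD6-style factor list on the earlier slide); the three-way
# interaction is left to the residual.
formula = ("ap ~ C(system) + C(topic) + C(shard)"
           " + C(system):C(topic) + C(system):C(shard) + C(topic):C(shard)")
model = smf.ols(formula, data=df).fit()
print(anova_lm(model, typ=2))
```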
MD6
Other parts of the paper
Confidence intervals calculated with Tukey HSD
Details of the proof on the zero value for shards & MD6
Code: https://bitbucket.org/frrncl/sigir2019-fs-code/
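For the Tukey HSD confidence intervals mentioned above, a much-simplified one-way sketch with statsmodels looks like the following; the paper's procedure works on the fitted ANOVA model, so treat this only as an illustration of the multiple-comparison idea, reusing the hypothetical table from the earlier sketch:

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same hypothetical per-(system, topic, shard) AP table as in the earlier sketch.
df = pd.read_csv("per_shard_ap.csv")

# One-way Tukey HSD over systems, ignoring the topic and shard factors --
# a simplification of the paper's model-based intervals.
tukey = pairwise_tukeyhsd(endog=df["ap"], groups=df["system"], alpha=0.05)
print(tukey.summary())
```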
Conclusions
Can we do better than past ANOVA? Yes: MD6
The topic*shard interaction is strong; its impact has not been observed when measuring performance
Test collections are expensive to build; we can get substantially more signal out of the three collections
Future work
UQV100: query variations test collection
Compare to Voorhees, Samarov, and Soboroff, 2017 — metric: not significant differences but predictive power
Create new collections with fewer judgments/topics