Overview of the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019)
Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, Ze Zhong Wu
Vision
The ultimate candy store for information retrieval researchers!
See a result you like? Click a button to recreate those results!
Really, any result? (not quite… let’s start with batch ad hoc retrieval experiments on standard test collections)
What is this, really?
Sources: saveur.com, Wikipedia (Candy)
Repeatability: you can recreate your own results again (we get this “for free”)
Replicability: others can recreate your results, with your code (our focus)
Reproducibility: others can recreate your results, with code they rewrite (stepping stone…)
ACM Artifact Review and Badging Guidelines
Why is this important?
Good science
Sustained cumulative progress
Armstrong et al. (CIKM 2009): little empirical progress made from 1998 to 2009. Why? Researchers compare against weak baselines.
Yang et al. (SIGIR 2019): researchers still compare against weak baselines.
How do we get there? Open-Source Code!
… a good start, but far from enough
TREC 2015 “Open Runs”: 79 submitted runs…
Voorhees et al. Promoting Repeatability Through Open Runs. EVIA 2016.
Number of runs successfully replicated: 0
Voorhees et al. Promoting Repeatability Through Open Runs. EVIA 2016.
How do we get there? Open-Source Code!
… a good start, but far from enough
Ask developers to show us how!
Open-Source IR Reproducibility Challenge (OSIRRC), SIGIR 2015
Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR)
Participants contributed end-to-end scripts for replicating ad hoc retrieval experiments
Lin et al. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. ECIR 2016.
System Effectiveness (MAP): 7 participating systems, GOV2 collection
System Efficiency (search time in ms): 7 participating systems, GOV2 collection
Effectiveness/Efficiency Tradeoff (search time in ms vs. MAP): 7 participating systems, GOV2 collection; points include Indri, Galago, Terrier, MG4J, ATIRE, Lucene, and JASS configurations
How do we get there? Open-Source Code!
… a good start, but far from enough
Ask developers to show us how!
It worked, but…
What worked well?
We actually pulled it off!
What didn’t work well?
Technical infrastructure was brittle
Replication scripts were too under-constrained
Infrastructure Source: Wikipedia (Burj Khalifa)
VMs: each app runs on its own guest OS inside a VM; VMs run on a hypervisor on top of the physical machine.
Containers: each app runs in its own container; containers share a container engine and the host OS on the physical machine.
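To make the container idea concrete, here is a minimal sketch of launching a containerized retrieval system with Docker from Python; the image name and host paths are hypothetical placeholders, not one of the workshop images.

```python
import subprocess

# A minimal sketch of launching a containerized retrieval system with Docker.
# The image name and host paths below are hypothetical placeholders; they only
# illustrate the container stack described above.
subprocess.run(
    [
        "docker", "run", "--rm",
        # mount the (licensed) test collection read-only into the container
        "-v", "/path/to/collection:/input/collection:ro",
        # mount a host directory where the container can write its output
        "-v", "/path/to/output:/output",
        "example/ir-system:latest",
    ],
    check=True,
)
```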
Infrastructure Source: Wikipedia (Burj Khalifa)
Workshop Goals
1. Develop a common Docker specification for capturing ad hoc retrieval experiments: the “jig”.
2. Build a library of curated images that work with the jig.
3. Take over the world! (encourage adoption, broaden to other tasks, etc.)
The jig and the Docker image:
The user specifies <image>:<tag>; the jig starts the image.
Prepare phase: the jig triggers the init hook, then the index hook, and creates a snapshot <snapshot> of the indexed image.
Search phase: the jig triggers the search hook with the snapshot; the search hook produces run files, which are scored with trec_eval.
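Read as pseudocode, the flow above is a thin driver around the Docker CLI. The sketch below only illustrates that flow and is not the jig’s actual implementation: the hook names (init, index, search) and the trec_eval step come from the slide, while the image tag, file paths, run-file name, and the way hooks are invoked are assumptions.

```python
import subprocess

IMAGE = "example/ir-system:latest"        # hypothetical <image>:<tag>
SNAPSHOT = "example/ir-system:snapshot"   # snapshot committed after indexing
COLLECTION = "/path/to/collection"        # test collection on the host
OUTPUT = "/path/to/output"                # run files are written here
QRELS = "/path/to/qrels.txt"              # relevance judgments for trec_eval

def run(cmd):
    """Print and execute a command, failing loudly on error."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# --- prepare phase: init hook, index hook, then snapshot the indexed image ---
run(["docker", "run", "--name", "prep",
     "-v", f"{COLLECTION}:/input/collection:ro",
     IMAGE, "sh", "-c", "./init && ./index"])   # hooks assumed to be executables in the image
run(["docker", "commit", "prep", SNAPSHOT])     # <snapshot> now contains the built index
run(["docker", "rm", "prep"])

# --- search phase: run the search hook against the snapshot ---
run(["docker", "run", "--rm",
     "-v", f"{OUTPUT}:/output",
     SNAPSHOT, "./search"])                     # writes TREC run files to /output

# --- evaluation: score the run files with trec_eval (assumed on the host PATH) ---
run(["trec_eval", "-m", "map", QRELS, f"{OUTPUT}/run.txt"])
```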
Source: Flickr (https://www.flickr.com/photos/m00k/15789986125/)
17 images from 13 different teams
Focus on newswire collections: Robust04, Core17, Core18
Official runs on Microsoft Azure (thanks Microsoft for free credits!)
Anserini (University of Waterloo)
Anserini-bm25prf (Waseda University)
ATIRE (University of Otago)
Birch (University of Waterloo)
Elastirini (University of Waterloo)
EntityRetrieval (Ryerson University)
Galago (University of Massachusetts)
ielab (University of Queensland)
Indri (TU Delft)
IRC-CENTRE2019 (Technische Hochschule Köln)
JASS (University of Otago)
JASSv2 (University of Otago)
NVSM (University of Padua)
OldDog (Radboud University)
PISA (New York University and RMIT University)
Solrini (University of Waterloo)
Terrier (TU Delft and University of Glasgow)
Robust04: 49 runs from 13 images
Images captured diverse models:
query expansion and relevance feedback
conjunctive and efficiency-oriented query processing
neural ranking models
Core17: 12 runs from 6 images
Core18: 19 runs from 4 images
Robust04: 49 runs from 13 images
Who won? Source: Time Magazine
But it’s not a competition! Source: Washington Post
TREC best: 0.333
TREC median (title): 0.258
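These reference points can be compared against any replicated run. A small sketch, assuming the numbers above are MAP scores on Robust04, that trec_eval is on the PATH, and using placeholder file names:

```python
import subprocess

QRELS = "qrels.robust04.txt"          # placeholder qrels file name
RUN = "run.robust04.example.txt"      # placeholder run file produced by a jig image

# trec_eval prints a line of the form: "map \t all \t 0.2531"
out = subprocess.run(
    ["trec_eval", "-m", "map", QRELS, RUN],
    capture_output=True, text=True, check=True,
).stdout
score = float(out.split()[-1])

print(f"MAP = {score:.4f}")
print("Reference points: TREC median (title) = 0.258, TREC best = 0.333")
```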
Workshop Goals
✓ 1. Develop a common Docker specification for capturing ad hoc retrieval experiments: the “jig”.
✓ 2. Build a library of curated images that work with the jig.
? 3. Take over the world! (encourage adoption, broaden to other tasks, etc.)
What’s next? Source: flickr (https://www.flickr.com/photos/39414578@N03/16042029002)