Tackling the Reproducibility Problem in Systems Research with Declarative Experiment Specifications


  1. Tackling the Reproducibility Problem in Systems Research with Declarative Experiment Specifications. Ivo Jimenez, Carlos Maltzahn (UCSC); Adam Moody, Kathryn Mohror (LLNL); Jay Lofstead (Sandia); Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau (UW-Madison)

  2. The Reproducibility Problem • Network • Magic numbers • Disks • Workload • BIOS • Jitter • OS conf. • etc. [Figure: throughput (MB/s) vs. cluster size (1-13); the original and reproduced runs diverge.] Goal: define a methodology so that we don't end up in this situation.

  3. Outline • Re-execution vs. validation • Declarative Experiment Specification (ESF) • Case Study • Benefits & Challenges

  4. Outline • Re-execution vs. validation • Declarative Experiment Specification (ESF) • Case Study • Benefits & Challenges

  5. Reproducibility Workflow 1. Re-execute experiment – Recreate original setup, re-execute experiments – a technical task 2. Validate results – Compare against original – a subjective task • How do we express objective validation criteria? • What contextual information should be included with results?
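
Since validation criteria must be objective, they should be expressible as executable checks. Below is a minimal sketch of such a check in Python; the 10% tolerance, helper name, and throughput numbers are all illustrative assumptions, not from the talk.

    # Sketch: turning "compare against original" into a scriptable check.
    # The tolerance and the sample numbers are invented for illustration.
    def within_tolerance(original, reproduced, rel_tol=0.10):
        """True if every reproduced point is within rel_tol of the original."""
        return all(abs(r - o) <= rel_tol * o
                   for o, r in zip(original, reproduced))

    original_mbs = [58, 57, 56, 55]      # throughput (MB/s) per cluster size
    reproduced_mbs = [55, 54, 53, 50]
    print("validated" if within_tolerance(original_mbs, reproduced_mbs)
          else "failed")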

  6. Experiment Goal: Show that my algorithm/system/etc. is better than the state-of-the-art. [Diagram: the means of experiment (code, data, OS, libs, workload, hardware) feed the experiment, which produces observations, raw data, and a figure.]
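
One way to read this slide is that the means of experiment should be captured as machine-readable metadata. The sketch below renders the slide's components as a Python record; the field values are invented, and this is not the actual ESF format.

    # Hypothetical record of the "means of experiment" named on the slide;
    # values are invented and this is not the ESF schema itself.
    from dataclasses import dataclass, field

    @dataclass
    class MeansOfExperiment:
        code: str                                  # e.g. a git commit hash
        data: str                                  # input dataset identifier
        os: str                                    # OS / kernel version
        libs: dict = field(default_factory=dict)   # library -> version
        workload: str = ""                         # benchmark description
        hardware: str = ""                         # node and network specs

    means = MeansOfExperiment(code="abc1234", data="input-v1", os="Linux 4.2",
                              libs={"libc": "2.19"}, workload="seq. writes",
                              hardware="2-core nodes, 1 GbE")
    print(means)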

  7. Outline • Re-execution vs. validation • Declarative Experiment Specification (ESF) • Case Study • Benefits & Challenges

  8. Experiment Goal: Show that my algorithm/system/etc. is better than the state-of-the-art. [Diagram: the experiment pipeline again, highlighting the means of experiment.]

  9. Validation Language Syntax

    validation : 'for' condition ('and' condition)* 'expect' result ('and' result)* ;
    condition  : vars ('in' range | ('=' | '<' | '>' | '!=') value) ;
    result     : condition ;
    vars       : var (',' var)* ;
    range      : '[' range_num (',' range_num)* ']' ;
    range_num  : NUMBER '-' NUMBER | '*' ;
    value      : '*' | NUMBER (',' NUMBER)* ;
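
To make the grammar concrete, here is a minimal hand-rolled evaluator for one clause of the form "for <condition> expect <result>". It is a sketch that assumes results arrive as a list of dicts and covers only single-variable comparison conditions; it is not the authors' implementation.

    # Sketch: evaluate "for cluster_size <= 24 expect ceph >= 55" against
    # tabular output. Only the comparison subset of the grammar is handled.
    import operator

    OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
           ">": operator.gt, "<=": operator.le, ">=": operator.ge}

    def holds(row, clause):
        var, op, value = clause
        return OPS[op](row[var], value)

    def validate(rows, condition, result):
        matching = [r for r in rows if holds(r, condition)]
        return all(holds(r, result) for r in matching)

    rows = [{"cluster_size": 2, "ceph": 58},    # invented sample output
            {"cluster_size": 24, "ceph": 56},
            {"cluster_size": 26, "ceph": 40}]
    print(validate(rows, ("cluster_size", "<=", 24), ("ceph", ">=", 55)))  # True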

  10. Outline • Re-execution vs. validation • Declarative Experiment Specification (ESF) • Case Study • Benefits & Challenges

  11. Ceph OSDI '06 • Select scalability experiment – Distributed; makes use of all resources – Main bottlenecks: I/O and network • Why this experiment? – Top conference – 10-year-old experiment – Ideal reproducibility conditions • Access to authors, topic familiarity, same hardware – Even in an ideal scenario, we still struggle • Demonstrates which missing info is captured by an ESF!

  12. Ceph OSDI '06 Scalability Experiment

Validation statement:

    for
      cluster_size <= 24
    expect
      not net_saturated and ceph >= (raw * .90)

    for
      cluster_size = *
    expect
      ceph >= 55 mb/s

Schema of experiment output data:

    "independent_variables": [
      { "type": "cluster_size", "values": "2-28" },
      { "type": "method", "values": ["raw", "ceph"] },
      { "type": "net_saturated", "values": ["true", "false"] }
    ],
    "dependent_variable": {
      "type": "throughput",
      "scale": "mb/s"
    }

[Figure: per-OSD average throughput (MB/s), roughly 30-60, vs. cluster size 2-26.]
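
A sketch of how the clause "ceph >= (raw * .90)" could be checked mechanically, assuming the output data holds one throughput value per (cluster_size, method) pair; the numbers below are invented.

    # Sketch: check "for cluster_size <= 24 expect ceph >= (raw * .90)"
    # over per-(cluster_size, method) throughput in MB/s. Data is invented.
    throughput = {(2, "raw"): 58, (2, "ceph"): 57,
                  (24, "raw"): 57, (24, "ceph"): 54,
                  (26, "raw"): 55, (26, "ceph"): 40}  # past saturation

    sizes = {size for size, _ in throughput}
    ok = all(throughput[(s, "ceph")] >= throughput[(s, "raw")] * 0.90
             for s in sizes if s <= 24)
    print("clause holds" if ok else "clause violated")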

  13. [Figure: normalized per-OSD throughput (0-1.1) vs. OSD cluster size (1-26), original vs. reproduced.]

  14. Benefits & Challenges

  15. Why care about Reproducibility? • Good enough is not an excuse – We can always improve the state of our practice – How do we compare hardware/software in a scientific way? • Experimental Cloud Infrastructure – PRObE / CloudLab / Chameleon – Having reproducible, validated experiments would be a significant step toward making the scientific method a core component of these infrastructures

  16. Benefits of ESF-based methodology • Brings falsifiability to our field – Statements can be proven false • Automates validation – Validation becomes an objective task

  17. Validation Workflow
    1. Obtain/recreate means of experiment.
    2. Re-run and check validation clauses against output. Any validation failed?
       – No: original work findings are corroborated.
       – Yes: any significant differences between original and recreated means of experiment?
         – Yes: update means of experiment and go back to step 2.
         – No: cannot validate original claims.
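
Read as code, the workflow is a loop. In the schematic below every function is an invented stub standing in for the real step; it is not tooling from the talk.

    # Schematic of the validation workflow; all functions are invented stubs.
    def obtain_or_recreate_means():
        return {"net": "10 GbE"}                 # stub: recreated setup

    def rerun_and_validate(means):
        return means["net"] == "1 GbE"           # stub: clauses pass only then

    def differs_from_original(means):
        return means["net"] != "1 GbE"           # stub: compare setups

    def update_means(means):
        return {**means, "net": "1 GbE"}         # stub: fix the difference

    means = obtain_or_recreate_means()
    while True:
        if rerun_and_validate(means):
            print("original findings are corroborated"); break
        if differs_from_original(means):
            means = update_means(means)          # loop back and re-run
        else:
            print("cannot validate original claims"); break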

  18. Benefits of ESF-based methodology • Brings falsifiability to our field – Statements can be proven false • Automates validation – Validation becomes an objective task • Usability – We all do this anyway, albeit in an ad hoc way • Integrates into existing infrastructure

  19. Integration with Existing Infrastructure [Diagram: a CI pipeline pulls pushed changes and runs tests. Today: push code; tests run unit and integration. With ESFs: push code and ESF; tests run unit, integration, and validations.]
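
In a CI pipeline, the validation clauses could run as just another test stage next to unit and integration tests. Below is a hypothetical pytest-style sketch; the output loader is a stub, and the clause checked is the illustrative one used earlier.

    # Hypothetical CI hook: ESF validations as an ordinary test. The loader
    # is a stub; in CI it would read the experiment's actual result file.
    def load_experiment_output():
        return [{"cluster_size": 2, "ceph": 58},
                {"cluster_size": 24, "ceph": 56}]

    def test_esf_validations():
        rows = [r for r in load_experiment_output() if r["cluster_size"] <= 24]
        assert all(r["ceph"] >= 55 for r in rows), "validation clause failed"

    if __name__ == "__main__":
        test_esf_validations()
        print("validations passed")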

  20. Challenges • Reproduce every time – Include sanity checks as part of the experiment – Alternative: corroborate at runtime that the network/disk exhibits the expected behavior • Reproduce everywhere – Example: GCC's flags, 10^806 combinations – Alternative: provide an image of the complete software stack (e.g., Linux containers)
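
The runtime sanity check suggested above could benchmark a resource before the experiment and compare it against the value recorded with the original results. Here is a crude, illustrative sketch; the expected figure and the 50% threshold are assumptions, not values from the talk.

    # Sketch of a pre-run sanity check: probe disk write bandwidth and bail
    # out if it is far below what the original setup recorded.
    import os, time

    def measure_disk_mbs(path="/tmp/sanity.bin", size_mb=64):
        """Crude sequential-write probe; illustrative only."""
        chunk = b"\0" * (1024 * 1024)
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            os.fsync(f.fileno())
        os.remove(path)
        return size_mb / (time.time() - start)

    expected_mbs = 100                  # assumed value recorded in the ESF
    measured = measure_disk_mbs()
    if measured < expected_mbs * 0.5:
        raise SystemExit(f"sanity check failed: disk at {measured:.0f} MB/s")
    print(f"disk sanity check passed ({measured:.0f} MB/s)")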

  21. Conclusion ESFs: • Embody all components of an experiment • Enable automation of result validation • Bring us closer to the scientific method • Our ideal future: – Researchers use ESFs to express a hypothesis – Toolkits for ESFs produce metadata-rich figures – Machine-readable evaluation section https://github.com/systemslab/esf

  22. Thanks!
