Reducing Technical Debt with Reproducible Containers
Tanu Malik, 2019 BSSw Fellow
Assistant Professor, School of Computing, DePaul University, Chicago, IL
IDEAS-ECP Webinar, November 4th, 2020
Who am I
Tanu Malik, Assistant Professor, School of Computing, DePaul University, Chicago, IL
Director, Data Systems and Opt. Lab
https://facsrv.cs.depaul.edu/~tmalik1
Tanu.Malik@depaul.edu
My expertise is:
• Databases and distributed computing
• Data provenance: history and lineage of data and software
• Computational reproducibility: repeating and recreating someone else's work
Systems built: http://sciunit.run
I want to know more about:
• Reproducibility case studies in HPC and how containers are used
Problems I'm currently working on:
• Provenance alignment: using provenance to highlight sources of irreproducibility
• State maintenance in lineage graphs: making Jupyter Notebooks reproducible
Outline
PART 1: How does technical debt affect reproducibility?
PART 2: Do reproducible containers provide a start?
PART 3: Guidance and summary
PART 1: How does technical debt affect reproducibility?
Monetary debt
Monetary debt meets the objective “sooner”
Technical debt [1] is no different
[1] A metaphor introduced by Ward Cunningham in 1992.
Technical debt is no different.
[Figure: productivity versus time. Labels: journal deadline, good scientific software, poor scientific software, technical debt.]
Dimensions of Technical Debt
• Poor quality code
• Poor design
• Environment debt
• Documentation debt
• Testing debt
Consequence of Mismanaged Debt
Monetary debt: REPOSSESSED
Technical debt: IRREPRODUCIBLE
Dimensions of Scientific Technical Debt [1]
• Poor quality code
• Poor design
• Environment debt
• Documentation debt
• Testing debt
[1] E. Tom, A. Aurum, R. Vidgen. An exploration of technical debt. Journal of Systems and Software, Volume 86, Issue 6, 2013, Pages 1498-1516, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2012.12.052.
Dimensions of Scientific Technical Debt
• Poor quality code
• Poor design
✓ Environment debt
✓ Documentation debt
• Testing debt
https://www.newscientist.com/gallery/software-bugs
https://www.nature.com/articles/d41586-020-01685-y
Cost of Scientific Technical Debt
Supercomputing Artifact Description and Evaluation Initiative
https://sc20.supercomputing.org/planning-committee/
Lack of artifacts will get a paper rejected
[Bar chart: Total Number; horizontal axis: Number, 0 to 400]
• Unacceptable AD/AE: 1
• with VG/E AD/AE (Phase 2): 24
• Submissions (Phase 2): 43
• with VG/E AD/AE (Phase 1): 5
• Per reviewer: 80
• Submissions (Phase 1): 380
Technical debt incurs burden

• “Sticks” from reviewers work
• Authors who have not taken the AD/AE process seriously do submit additional work
• Time-consuming task
• No tools to check if everything relevant for the publication is submitted
• No mapping of experiments to content in the paper
• No infrastructure for efficiently verifying claimed results

• Reproducibility is an afterthought
• Identifying files for an application is a challenge
• Missing workflows
• Really, that data/algorithm should be part of the bundle?
PART 2: Do reproducible containers provide a start?
Reproducibility ecosystem
GitHub, sharing images via the cloud, package managers, Zenodo.org, OpenData.gov, Figshare, Docker.com
C. Boettiger. An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 2015.
Docker: Using containers from build to run
https://www.exascaleproject.org/event/conthpc
Containers provide constrained resource isolation
• Filesystem
• Network
• CPU
• Memory
Authors must program a Dockerfile
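To make the manual effort concrete, here is a minimal Python sketch that renders a Dockerfile from dependencies the author has to know and list by hand. The function name `render_dockerfile` is hypothetical, and the base image, package names, and versions in the usage note are illustrative assumptions, not taken from the talk. Only standard Dockerfile instructions (`FROM`, `RUN`, `WORKDIR`, `COPY`, `CMD`) are emitted.

```python
import json
from typing import Mapping, Sequence


def render_dockerfile(
    base_image: str,
    apt_packages: Sequence[str],
    pip_packages: Mapping[str, str],
    entrypoint: Sequence[str],
) -> str:
    """Return Dockerfile text that declares every dependency explicitly."""
    lines = [f"FROM {base_image}"]
    if apt_packages:
        # System libraries must be named one by one by the author.
        names = " ".join(sorted(apt_packages))
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{names} && rm -rf /var/lib/apt/lists/*"
        )
    if pip_packages:
        # Pin exact versions; an unpinned install may resolve differently later.
        pins = " ".join(
            f"{name}=={version}" for name, version in sorted(pip_packages.items())
        )
        lines.append(f"RUN pip install --no-cache-dir {pins}")
    lines.append("WORKDIR /app")
    lines.append("COPY . /app")
    # Exec (JSON array) form of CMD.
    lines.append(f"CMD {json.dumps(list(entrypoint))}")
    return "\n".join(lines) + "\n"
```

For example, `render_dockerfile("python:3.11-slim", ["libhdf5-dev"], {"numpy": "1.26.4"}, ["python", "run.py"])` yields a five-instruction Dockerfile. Every argument is something the author must already know, which is where environment and documentation debt show up.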
Containers do not reduce technical debt
• Declarative encapsulation of dependencies for isolated execution
• E.g. various shell utilities and library versions unknown to user
Automatic Encapsulation of Dependencies: The Sciunit
Key Idea: Identify dependencies during program execution
• Captures application dependencies during executions
• Repeats executions (with guarantees) within isolated environments
Sciunit: Audit
• Audit uses ptrace to observe dependencies and environment variables
• Identifies binaries, libraries, scripts, and environment variables that the application depends on
• Dependencies are copied into a directory in the filesystem
• Inclusion of data files is optional: the user may or may not want to package them, depending on the size of the dataset
D.H. Ton That, G. Fils, Z. Yuan, T. Malik. Sciunits: Reusable Research Objects. In IEEE eScience Conference (eScience), 374-383, 2017
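The audit idea can be approximated in pure Python. The sketch below is not Sciunit's implementation: it swaps ptrace for Python's `sys.addaudithook` (Python 3.8+), so it only sees files opened through the Python interpreter, not those opened by child processes. The names `DependencyAudit`, `run`, and `package` are hypothetical, and the `include` filter stands in for the optional inclusion of data files.

```python
import os
import shutil
import sys


class DependencyAudit:
    """Record files opened, and the environment seen, while a function runs."""

    def __init__(self):
        self.files = set()
        self.environment = {}
        self._active = False
        # Audit hooks cannot be removed, so the hook checks a flag instead.
        sys.addaudithook(self._hook)

    def _hook(self, event, args):
        if not self._active or event != "open":
            return
        path = args[0]
        # open() may also receive an integer file descriptor; skip those.
        if isinstance(path, (str, bytes, os.PathLike)):
            self.files.add(os.path.abspath(os.fsdecode(path)))

    def run(self, func, *args, **kwargs):
        """Run func while recording; snapshot the environment first."""
        self.environment = dict(os.environ)
        self._active = True
        try:
            return func(*args, **kwargs)
        finally:
            self._active = False

    def package(self, dest_root, include=lambda path: True):
        """Copy recorded files under dest_root, mirroring their absolute paths."""
        copied = []
        for path in sorted(self.files):
            if not os.path.isfile(path) or not include(path):
                continue
            # Drop the drive and leading separator so the join stays in dest_root.
            relative = os.path.splitdrive(path)[1].lstrip(os.sep)
            target = os.path.join(dest_root, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(path, target)
            copied.append(target)
        return copied
```

A caller would wrap the computation in `audit.run(job)` and then call `audit.package("pkg")`, optionally passing an `include` predicate that rejects large data files.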
Audits provenance during execution time
• Utilizing Provenance in Reusable Research Objects. In Special Issue on Using Computational Provenance, MDPI Informatics, Vol 5(1), 2018.
• Light-weight Database Virtualization. In IEEE International Conference on Data Engineering, ICDE, 2015.
• Auditing and Maintaining Provenance in Software Packages. In International Provenance and Annotation Workshop (IPAW), 97-109, 2014.
Sciunit: Share as a Zip file or Docker container
[Figure labels: Sciunit; Containment; Provenance Graph; Computational Artifacts; Sciunit Log; Documentation (from websites); Docker File; Identification of Inputs, Outputs, Processes, Dependencies]
Documenting Computing Environments for Reproducible Experiments. In Parallel Computing: Technology Trends, 756-765, 2020
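Sharing a captured package as a Zip file can be sketched with the Python standard library alone. This is an illustration of the idea, not Sciunit's own sharing mechanism; the function names `share_as_zip` and `list_archive` are hypothetical.

```python
import shutil
import zipfile


def share_as_zip(package_dir, archive_base):
    """Bundle package_dir into archive_base + '.zip' and return the archive path."""
    # Keep archive_base outside package_dir so the archive does not include itself.
    return shutil.make_archive(archive_base, "zip", root_dir=package_dir)


def list_archive(archive_path):
    """Return the sorted member names of a Zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(archive.namelist())
```

The recipient unpacks the archive and repeats the execution against its contents; listing the members first is a cheap check that the expected files were bundled.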
Sciunit: Repeat
• Sciunit uses namespace isolation during repeat
• Redirection of each call into the package
Efficient Provenance Alignment in Reproduced Executions. In Theory and Practice of Provenance, 2020.
ScIInc: A Container Runtime for Incremental Recomputation. In IEEE 15th International Conference on eScience (eScience), 291-300, 2019, doi: 10.1109/eScience.2019.00040.
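Redirecting a file access into the package can be pictured as a path rewrite. The sketch below is a simplification under stated assumptions: it does not use namespace isolation or intercept system calls, it only shows the mapping from an original absolute path to its mirrored copy inside a package directory. The names `redirect` and `open_in_package` are hypothetical.

```python
import os


def redirect(path, package_root):
    """Return the packaged copy of path if one exists, else the original path."""
    absolute = os.path.abspath(path)
    # Mirror layout: <package_root>/<absolute path without leading separator>.
    relative = os.path.splitdrive(absolute)[1].lstrip(os.sep)
    candidate = os.path.join(package_root, relative)
    return candidate if os.path.exists(candidate) else absolute


def open_in_package(path, package_root, mode="r"):
    """Open path, served from the package whenever it was captured there."""
    return open(redirect(path, package_root), mode)
```

With this layout, a file that no longer exists on the host can still be read during repeat, as long as it was captured into the package at audit time.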
Sciunit steps and external requirements
1. Create
2. Share
3. Repeat
Network-enabled Sciunit: Audit
[Figure: possible with network-enabled Sciunit. A network-enabled Sciunit spawns task 1 and task 2, each run under its own Sciunit, and the results are merged.]
Note:
1. Identify remote host & copy Sciunit to it
2&3. Configure & run task with Sciunit
4. Retrieve & manually merge
Network-enabled Sciunit: Repeat on single node
[Figure: network-enabled Sciunit runs the application.]
Note:
1. Repeat all computations at root node. No connection.
2. Network system calls are supplied through the content data captured during the original audit.
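Supplying network calls from captured content is a record-and-replay pattern. The sketch below illustrates that pattern at the level of request and response bytes; it is an assumption-laden simplification, not how Sciunit intercepts system calls. The names `request_key`, `RecordingChannel`, and `ReplayChannel` are hypothetical, and `send` stands for whatever function performs the real network exchange during audit.

```python
import hashlib


def request_key(host, port, payload):
    """Digest that identifies one request: destination plus bytes sent."""
    header = f"{host}:{port}:".encode("utf-8")
    return hashlib.sha256(header + payload).hexdigest()


class RecordingChannel:
    """Audit phase: forward each request and log the response it produced."""

    def __init__(self, send):
        self._send = send
        self.log = {}

    def request(self, host, port, payload):
        response = self._send(host, port, payload)
        self.log.setdefault(request_key(host, port, payload), []).append(response)
        return response


class ReplayChannel:
    """Repeat phase: answer from the log only; no connection is opened."""

    def __init__(self, log):
        # Copy the lists so replaying does not consume the original log.
        self._pending = {key: list(responses) for key, responses in log.items()}

    def request(self, host, port, payload):
        queue = self._pending.get(request_key(host, port, payload))
        if not queue:
            raise LookupError("request was not captured during the audit")
        # Responses to identical requests are replayed in recorded order.
        return queue.pop(0)
```

Raising on an unrecorded request, rather than falling back to the network, keeps the repeat honest: any divergence from the audited execution surfaces immediately.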
Network-enabled Sciunit: Repeat on multiple nodes
[Figure: network-enabled Sciunit runs the application; each node holds a network-enabled Sciunit and sub-container, running task 1 and task 2.]
Requirements:
1. Identical number of nodes
2. Descriptions of new hostnames or IP addresses
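The two requirements above can be checked mechanically before a multi-node repeat. A minimal sketch, assuming the original and new nodes are given as ordered lists of hostnames or IP addresses; the names `map_hosts` and `rewrite_endpoint` are hypothetical and not part of Sciunit.

```python
def map_hosts(original_hosts, new_hosts):
    """Pair each audited host with its replacement, enforcing equal node counts."""
    if len(original_hosts) != len(new_hosts):
        raise ValueError(
            f"audit used {len(original_hosts)} nodes, "
            f"but {len(new_hosts)} were provided for repeat"
        )
    return dict(zip(original_hosts, new_hosts))


def rewrite_endpoint(endpoint, host_map):
    """Translate an audited (host, port) pair to the repeat-time host."""
    host, port = endpoint
    # Hosts absent from the map are left unchanged.
    return (host_map.get(host, host), port)
```

For instance, `map_hosts(["node-a", "node-b"], ["10.0.0.5", "10.0.0.6"])` pairs the hosts by position, and `rewrite_endpoint(("node-b", 9000), ...)` then points task 2 at its new address.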