Reproducibility in Data Science



  1. Reproducibility in Data Science Juliana Freire Visualization, Imaging and Data Analysis Center (VIDA) Computer Science & Engineering Center for Data Science (CDS) VISUALIZATION IMAGING AND DATA ANALYSIS CENTER

  2. Data-Driven Exploration • Every scientific domain is moving toward data-driven exploration, which has led to great advances and discoveries • Companies are capitalizing on data • Government agencies use data to operate efficiently, set policies, and make informed decisions • Computing is free • Storage is free • Data are abundant • The bottlenecks lie with people

  3. Data-Driven Exploration: Challenges • Data are vast and produced at unprecedented rates • Sources are broad, varied, and unreliable • Computational processes are required to extract insight • But they are hard to assemble [Figure labels: machine learning, algorithms, statistics, math, data curation, data discovery, data management, data integration, provenance, visualization]

  4. Data-Driven Exploration: Challenges • Exploratory tasks are inherently iterative as one formulates and tests hypotheses [Figure: exploration loop with labels Data, Computation, Data Products, Perception & Cognition, Knowledge, Specification, Exploration. Modified from Van Wijk, Vis 2005]

  5. Many Trials and Errors… [Figure: repeated attempts starting from the same data, with steps Clean data, Cluster data, Create model, Refine model, Re-run model, Update scikitlearn, and Visualize, yielding F-measure=.61, F-measure=.75, and F-measure=.92]

  6. Data-Driven Exploration: Challenges • After many steps… "An analysis has 30 different steps. It is tempting to just do this then that and then this. You have no idea in which ways you are wrong and what data is wrong" [Kandel et al., VAST 2012] • It is easy to get lost and not remember how a result was derived • Processes can break or misbehave in unforeseen ways • Results can be hard to understand, interpret and trust • Need provenance! • Incorrect conclusions can lead to bad decisions! [Figure labels: data, knowledge, decisions]

  7. Computational Provenance "Provenance is the source or origin of an object; its history and pedigree; a record of the ultimate derivation and passage of an item through its various owners." The Oxford English Dictionary • Provenance is a key ingredient for transparency and reproducibility • Computational provenance is a causality graph that models process and data dependencies

  8. Computational Provenance = Graph [Figure: provenance graph showing data dependencies and process dependencies. Data nodes: Data i, Data c, Data 1, Data 2, Data 3, Model 1, Model 2. Process nodes: Clean data, Cluster data, Create Model, Run Model, Visualize. Results: F-measure=.61, F-measure=.75, F-measure=.92]
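
The idea of provenance as a graph of data and process dependencies can be sketched in a few lines of Python. This is only an illustration, not the representation used by any tool mentioned in the talk: the `ProvenanceGraph` class, its methods, and the artifact name "Plot 1" are hypothetical, and the other labels are borrowed from the slide.

```python
class ProvenanceGraph:
    """Toy causality graph: each artifact remembers the process and inputs that produced it."""

    def __init__(self):
        # artifact name -> (process name, tuple of input artifact names)
        self.produced_by = {}

    def record(self, process, inputs, output):
        """Record that `process` consumed `inputs` and produced `output`."""
        self.produced_by[output] = (process, tuple(inputs))

    def lineage(self, artifact):
        """Return the (process, inputs, output) steps behind `artifact`, upstream steps first."""
        steps = []
        seen = set()

        def visit(node):
            # Raw inputs have no recorded producer, so the walk stops there.
            if node in seen or node not in self.produced_by:
                return
            seen.add(node)
            process, inputs = self.produced_by[node]
            for parent in inputs:
                visit(parent)
            steps.append((process, inputs, node))

        visit(artifact)
        return steps


graph = ProvenanceGraph()
graph.record("Clean data", ["Data i"], "Data 1")
graph.record("Create Model", ["Data 1"], "Model 1")
graph.record("Visualize", ["Model 1"], "Plot 1")  # "Plot 1" is a made-up artifact name

for process, inputs, output in graph.lineage("Plot 1"):
    print(f"{output} <- {process}({', '.join(inputs)})")
```

Walking the graph backwards from a result answers "how was this derived?", which is the question the preceding slides say is easy to lose track of.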

  9. Computational Provenance: Benefits • Interpret and reproduce results • Understand the experiment and chain of reasoning that was used in the production of a result • Verify that an experiment was performed according to acceptable procedures • Identify what the inputs to an experiment were and where they came from • Re-run steps, possibly with different settings • Debug • Share, re-use and extend results

  10. Different Flavors of Provenance • Computations are carried out in a controlled environment • It is possible to systematically capture detailed provenance • What to capture? Depends on what you will use provenance for: • Document computational process • Re-execute • Enable others to re-execute • Extend/modify process

  11. Capture the Code • What do you get? • Is this enough? http://tinyurl.com/y3eohbo4

  12. Notebooks and Reproducibility • Recent study of 1,435,373 notebooks collected from 265,143 GitHub repositories • 1,029,279 attempted executions of valid notebooks (i.e., notebooks with defined Python version and execution order) • Only 25.28% executed without errors, and only 4.57% produced the same results • Problems: • No specification of library versions • Hard-coded paths • Out-of-order cells • Hidden states [Pimentel et al., MSR 2019]
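
Two of the problems listed above, out-of-order cells and hidden state, leave traces in the notebook file itself. The following is a minimal sketch, assuming the standard notebook JSON layout (a top-level `cells` list whose code cells carry an `execution_count`). The function name and the wording of the messages are hypothetical, and this is not the method used in the cited study.

```python
import json


def execution_order_issues(notebook_json):
    """Return a list of warnings about the execution order recorded in a notebook."""
    notebook = json.loads(notebook_json)
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    issues = []
    if any(count is None for count in counts):
        issues.append("some code cells were never executed")
    executed = [count for count in counts if count is not None]
    if executed != sorted(executed):
        issues.append("cells were executed out of order")
    # Gaps in the counters suggest cells that were run and later edited or deleted.
    if sorted(executed) != list(range(1, len(executed) + 1)):
        issues.append("execution counts have gaps (possible hidden state)")
    return issues


# A tiny in-memory notebook whose second cell was run before the first.
example = json.dumps({
    "cells": [
        {"cell_type": "code", "execution_count": 2, "source": "print(x)"},
        {"cell_type": "code", "execution_count": 1, "source": "x = 1"},
    ]
})
print(execution_order_issues(example))
```

A check like this only inspects the saved counters; re-running the notebook top to bottom in a clean environment remains the stronger test.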

  13. Notebooks: Best Practices • Use relative paths (or external data repositories) • Re-run the notebook top to bottom before committing • Declare dependencies and library versions • Use a clean environment to test dependencies • Or use https://www.reprozip.org/ [Pimentel et al., MSR 2019]
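
Two of these practices, declaring library versions and using relative paths, can be illustrated with the Python standard library alone. This is a sketch under assumptions: the helper name `pinned_requirements`, the `data` directory, and the package names passed in are placeholders, not something the slides prescribe.

```python
import sys
from importlib import metadata
from pathlib import Path


def pinned_requirements(packages):
    """Return requirements-style lines with the versions installed in this environment."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker instead of silently dropping the dependency.
            lines.append(f"# {name}: not installed in this environment")
    return lines


# A relative path resolves against the working directory,
# so it keeps working when the repository is cloned elsewhere.
DATA_DIR = Path("data")  # hypothetical directory name

print(f"# python {sys.version_info.major}.{sys.version_info.minor}")
print("\n".join(pinned_requirements(["pip", "a-package-that-is-not-installed"])))
```

Writing the printed lines to a requirements file next to the notebook gives others the version information the study found missing.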

  14. ReproZip: Reproducibility in 2 Steps • Packing (Linux): creates a ReproZip package with the data files, libraries, environment variables, etc. required to reproduce the research • Unpacking (Windows, Linux, Mac OS X): open, unpack, and reproduce anywhere, anytime!
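
The two steps correspond to ReproZip's two command-line tools, `reprozip` (trace and pack) and `reprounzip` (set up and run). The sketch below only builds and prints the commands; it executes them only when `dry_run=False` and the tools are installed. The script name `analysis.py`, the package name `experiment.rpz`, the target directory `unpacked`, and the choice of the Docker unpacker are assumptions made for illustration.

```python
import shlex
import subprocess


def reprozip_commands(experiment_cmd, package="experiment.rpz", target="unpacked"):
    """Build the command lines for the packing and unpacking steps."""
    packing = [
        ["reprozip", "trace", *experiment_cmd],  # run the experiment while tracing it
        ["reprozip", "pack", package],           # bundle what the trace found
    ]
    unpacking = [
        ["reprounzip", "docker", "setup", package, target],  # needs the Docker unpacker plugin
        ["reprounzip", "docker", "run", target],
    ]
    return packing, unpacking


def run_all(commands, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


packing, unpacking = reprozip_commands(["python", "analysis.py"])
run_all(packing + unpacking)
```

Packing happens on the machine where the experiment originally ran; the unpacking commands are the ones a reader of the paper would run on their own machine.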

  15. ReproZip: Advantages • Automatically tracks dependencies in an environment and sets them up in a different environment (portability) • Deals with variability in computational environments • Reproducibility in hindsight • Very easy (I will show!)

  16. ReproZip: How does it work? https://www.youtube.com/watch?v=-zLPuwCHXo0 [Chirigati et al., ACM SIGMOD 2013]

  17. Packing a Notebook [Figure: Packing]

  18. Reproducing the Notebook [Figure: Unpacking]

  19. ReproZip Jupyter Extension https://docs.reprozip.org/en/1.0.x/jupyter.html

  20. ReproZip can pack… • Data analysis scripts / software (any language, you name it!) • Graphical tools • Interactive tools • Client-server applications (including databases) • Jupyter notebooks (very soon!) • MPI experiments (setting up the experiment is involved though…) • ... and many more! https://examples.reprozip.org

  21. ReproServer: Unpacking in a Browser • Packing (ReproZip, Linux): creates a package with the data files, libraries, environment variables, etc. required to reproduce the research • Unpacking (ReproServer): upload the package to ReproServer or give it a link, and reproduce!

  22. ReproServer • Runs ReproZip packages in the browser, no local software needed • Allows changing input data, configuration, command-lines • Gives you a URL to include in papers/reports to reproduce your experiment • No lock-in: build on your laptop, pack automatically, reproduce anywhere https://www.youtube.com/watch?v=Ffb-PaVPC58 [Rampin et al., 2018, https://arxiv.org/abs/1808.01406]

  23. Capture the Exploratory Process [Figure: the exploratory process as a graph. Data nodes: Data i, Data c, Data 1, Data 2, Data 3, Model 1, Model 2. Process nodes: Clean data, Cluster data, Create Model, Run Model, Visualize. Results: F-measure=.61, F-measure=.75, F-measure=.92]

  24. Capture the Exploratory Process Automatically http://www.vistrails.org

  25. Provenance Beyond Reproducibility • Support for reflective reasoning • Ability to compare data products: $vt_1 = x_i \circ x_{i-1} \circ \dots \circ x_1 \circ \emptyset$, $vt_2 = x_j \circ x_{j-1} \circ \dots \circ x_1 \circ \emptyset$, $vt_1 - vt_2 = \{x_i, x_{i-1}, \dots, x_1, \emptyset\} - \{x_j, x_{j-1}, \dots, x_1, \emptyset\}$ [Freire et al., IPAW 2006]
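
The formulas above treat a version as a composition of actions applied to the empty workflow, so the difference between two versions is a set difference over their actions. The following is a minimal sketch of that idea; the `VersionTree` class, its method names, and the action strings are hypothetical and do not reproduce any particular system's data model.

```python
class VersionTree:
    """Toy change-based provenance: each version stores only the action that created it."""

    def __init__(self):
        self.parent = {0: None}  # version 0 stands for the empty workflow
        self.action = {0: None}
        self._next_id = 1

    def apply(self, base, action):
        """Create a new version by applying `action` on top of version `base`."""
        version = self._next_id
        self._next_id += 1
        self.parent[version] = base
        self.action[version] = action
        return version

    def actions(self, version):
        """Return the (version id, action) pairs from the empty workflow up to `version`."""
        chain = []
        while version != 0:
            chain.append((version, self.action[version]))
            version = self.parent[version]
        return list(reversed(chain))

    def diff(self, v1, v2):
        """Return the actions that led to v1 but are not part of v2's history."""
        shared = {version for version, _ in self.actions(v2)}
        return [action for version, action in self.actions(v1) if version not in shared]


tree = VersionTree()
cleaned = tree.apply(0, "add module: Clean data")
modeled = tree.apply(cleaned, "add module: Create Model")
first_try = tree.apply(modeled, "set parameter: depth=3")
second_try = tree.apply(modeled, "set parameter: depth=10")

print(tree.diff(first_try, second_try))
print(tree.diff(second_try, first_try))
```

Because the two branches share the history up to `modeled`, each difference contains only the one action that distinguishes that branch.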

  26. Provenance Beyond Reproducibility • Support for reflective reasoning • Ability to compare data products • Explore parameter spaces and compare results: $\mathrm{setParameter}(id_n, value_n) \circ \dots \circ (\mathrm{setParameter}(id_1, value_1) \circ v_t)$ • Also explore alternative computations: $\mathrm{addModule}(id_i, \dots) \circ (\mathrm{deleteModule}(id_i) \circ v_1)$, …, $\mathrm{addModule}(id_i, \dots) \circ (\mathrm{deleteModule}(id_i) \circ v_n)$
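
Parameter-space exploration as described here amounts to composing setParameter actions on top of a base version, once per combination of values. A minimal sketch, assuming a workflow can be modelled as a plain dictionary of parameters: `set_parameter`, `explore`, and the parameter names are hypothetical.

```python
from itertools import product


def set_parameter(workflow, param_id, value):
    """Return a copy of `workflow` with one parameter changed; the base stays untouched."""
    updated = dict(workflow)
    updated[param_id] = value
    return updated


def explore(base, grid):
    """Apply one chain of set_parameter actions to `base` for every combination in `grid`."""
    param_ids = list(grid)
    variants = []
    for values in product(*(grid[param_id] for param_id in param_ids)):
        workflow = base
        for param_id, value in zip(param_ids, values):
            workflow = set_parameter(workflow, param_id, value)
        variants.append(workflow)
    return variants


base = {"n_clusters": 3, "normalize": True}
variants = explore(base, {"n_clusters": [2, 3, 4], "normalize": [True, False]})

print(len(variants))  # 6 combinations: 3 values times 2 values
for workflow in variants:
    print(workflow)
```

Keeping the base version immutable is what lets every variant be traced back to the same starting point and compared against the others.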

  27. Provenance Beyond Reproducibility • Support for reflective reasoning • Ability to compare data products • Explore parameter spaces and compare results • Support for collaboration [Ellkvist et al., IPAW 2008]

  28. Change-Based Provenance: Extensibility • Autodesk Maya • ParaView • VisIt • ImageVis3D [Callahan et al., IPAW 2008]

  29. Provenance Plugin for Autodesk Maya

  30. Vizier: Provenance + Notebooks https://vizierdb.info/ [Glavic et al., ACM SIGMOD 2019]
