
Converting Scripts into Reproducible Workflow Research Objects - PowerPoint PPT Presentation

Converting Scripts into Reproducible Workflow Research Objects. Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros. lucas.carvalho@ic.unicamp.br. Baltimore, Maryland, USA, October 23-26, 2016.


  1. Converting Scripts into Reproducible Workflow Research Objects Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros lucas.carvalho@ic.unicamp.br Baltimore, Maryland, USA October 23-26, 2016

  2. Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data. [Figure label: Papers.]

  3. Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data. ● How to understand, reproduce or reuse data and models of experiments? [Figure label: Papers.]

  4. Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data. ● How to understand, reproduce or reuse data and models of experiments? ● Manual collection and organization of data provenance. [Figure label: Papers.]

  5. Background and Motivation ● Script-based experiments – What are the inputs and outputs? – How to change this local program for a similar web service? – Difficult to understand, to reuse, and to reproduce. [Figure: example of script code.]

  6. Background and Motivation ● Scientific Workflows [Figure: example of a Scientific Workflow Management System.]

  7. Overview ● Understand ● Reuse ● Create ● Reproduce

  8. Overview ● Understand + Reuse ● Create ● Reproduce

  9. Overview ● Understand + Reuse ● Create ● Reproduce ● Methodology: Step 1, Step 2, Step 3, Step 4, Step 5

  10. Related Work ● Script-language specific. ● Workflow-engine specific. ● A new language is needed. ● Outcome is not an executable workflow. ● Do not collect provenance data of the conversion process.

  11. Two Kinds of Experts ● Scientists: – Domain experts who understand the experiment and the script (sometimes called users). ● Curators: – Scientists who are also familiar with workflow and script programming; or – Computer scientists who are familiar enough with the domain to be able to implement our methodology. – Responsible for authoring, documenting and publishing workflows and associated resources.

  12. Requirements ● (1) Produce a workflow-like view of the script. ● (2) Create an executable workflow and compare the execution of workflow and script. ● (3) Modify the workflow resources. ● (4) Record provenance data. ● (5) Aggregate all resources to support reproducibility and reuse.

  13. Requirements ● (1) Produce a workflow-like view of the script. [Figure: a script-based experiment next to an abstract workflow made of activities (Activity 1, Activity 2, ..., Activity n) with ports (Port 1, Port 2, Port 3, ..., Port n).]

  14. Requirements ● (2) Create an executable workflow and compare the execution of workflow and script. [Figure: script-based experiment and executable workflow.]

  15. Requirements ● (3) Modify the workflow resources. [Figure labels: (a) Local; (b) Algorithm A, Algorithm B.]

  16. Requirements ● (4) Record provenance data. [Figure: provenance graph with the nodes Workflow Run, Lucas, Sample, Activity 1, Activity 2, Output 1, Output 2 and the timestamp “2012-06-01”, linked by the relations wasAssociatedWith, used, wasStartedAt and wasGeneratedBy.]

  17. Requirements ● (5) Aggregate all resources to support reproducibility and reuse. [Figure: aggregated resources: authors, data, annotations, provenance, scripts, concrete workflows, abstract workflows, papers and reports.]

  18. Methodology ● Step 1: Generate abstract workflow. ● Step 2: Create an executable workflow. ● Step 3: Refine workflow. ● Step 4: Annotate and check quality. ● Step 5: Bundle resources into a Research Object. [Figure labels: Script, Abstract workflow, Concrete workflow.]

  19. Workflow Research Object (WRO) ● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. ● WROs encapsulate scientific workflows and additional information regarding their context and resources. [Figure: Research Object Model.]

  20. Running Example ● Molecular Dynamics Simulations – Many branches of material sciences, computational engineering, physics and chemistry. – Scripts (shell script), programs (NAMD, VMD, Fortran). – Phases: set up, simulation and analysis of trajectories. – Inputs: protein structure, simulation parameters and force field files. – Output: trajectories and analysis results.

  21. Step 1: Generate Abstract Workflow [Figure: script code.]

  22. Step 1: Generate Abstract Workflow ● Manually annotate the script code. [Figure: script code, annotated script code.]

  23. Step 1: Generate Abstract Workflow ● Manually annotate the script code. ● Create a workflow-like view. [Figure: script code, annotated script code, abstract workflow.]

  24. Step 1: Generate Abstract Workflow ● YesWorkflow (McPhillips et al., 2015) – Code blocks – Input/output – Code comments – Tags: @begin, @end, @desc, @in, @out, ... ● Create a workflow-like view. [Figure: annotated script code, abstract workflow.] Reference: T. McPhillips et al., “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015.

  25. Step 1: Generate Abstract Workflow ● Create a workflow-like view. [Figure: annotated script code, abstract workflow.]
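
As a minimal sketch of what the annotation in Step 1 might look like, the Python script below marks two code blocks with the YesWorkflow comment tags listed on slide 24 (@begin, @end, @desc, @in, @out). The block names, port names and function body are hypothetical stand-ins, not the shell script shown on the slides.

```python
# @begin md_experiment @desc Hypothetical two-phase script (set up, then analysis).
# @in structure @desc Protein structure, modelled here as a list of atom names.
# @out report @desc Analysis result.

def run_experiment(structure):
    # @begin setup @desc Prepare the system for the simulation.
    # @in structure
    # @out prepared
    prepared = [atom.upper() for atom in structure]
    # @end setup

    # @begin analysis @desc Summarise the prepared system.
    # @in prepared
    # @out report
    report = {"atoms": len(prepared), "first": prepared[0]}
    # @end analysis
    return report

# @end md_experiment

if __name__ == "__main__":
    print(run_experiment(["n", "ca", "c"]))
```

Because the tags live in comments, the script still runs unchanged; each @begin/@end pair is a candidate activity and each @in/@out a candidate port of the abstract workflow.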

  26. Step 2: Create an executable workflow [Figure: abstract workflow.]

  27. Step 2: Create an executable workflow ● Create implementation of activities: copy code blocks from the script. [Figure: abstract workflow, executable workflow.]

  28. Step 2: Create an executable workflow ● Create implementation of activities: copy code blocks from the script. [Figure: abstract workflow, executable workflow.]

  29. Step 2: Create an executable workflow ● Create implementation of activities: copy code blocks from the script. [Figure: abstract workflow, executable workflow, script code.]
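
The idea of Step 2 can be sketched as follows, assuming each annotated code block becomes a function (an activity) with named input and output ports. The tiny engine, the `summarize` activity and the port names are hypothetical; `split` borrows an activity name from slide 32, but its body here is invented. This is not the workflow system used by the authors.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def split(sample):
    """Hypothetical activity: cut the sample into two halves."""
    middle = len(sample) // 2
    return sample[:middle], sample[middle:]

def summarize(left, right):
    """Hypothetical activity: combine both halves into one number."""
    return (sum(left) + sum(right),)

# Each activity declares its ports; order of declaration does not matter.
ACTIVITIES = {
    "summarize": {"func": summarize, "inputs": ["left", "right"], "outputs": ["total"]},
    "split": {"func": split, "inputs": ["sample"], "outputs": ["left", "right"]},
}

def run_workflow(activities, inputs):
    """Run activities in dependency order, passing values through ports."""
    values = dict(inputs)
    producer = {port: name for name, spec in activities.items() for port in spec["outputs"]}
    graph = {
        name: {producer[port] for port in spec["inputs"] if port in producer}
        for name, spec in activities.items()
    }
    for name in TopologicalSorter(graph).static_order():
        spec = activities[name]
        results = spec["func"](*(values[port] for port in spec["inputs"]))
        values.update(zip(spec["outputs"], results))
    return values

def original_script(sample):
    """The same logic written as one monolithic script, for comparison."""
    middle = len(sample) // 2
    left, right = sample[:middle], sample[middle:]
    return sum(left) + sum(right)

if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(run_workflow(ACTIVITIES, {"sample": data})["total"] == original_script(data))
```

Comparing the workflow result with the script result on the same input is one simple way to meet the requirement of comparing both executions.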

  30. Step 3: Refine executable workflow ● Modify resources: – Algorithms – Data Sets – Parallelization – Web Services – ... [Figure: executable workflow, new workflow version.]

  31. Step 3: Refine executable workflow ● Create new version. ● Modify resources: – Algorithms – Data Sets – Parallelization – Web Services – ... [Figure: executable workflow, new workflow version.]
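
One way to picture the refinement in Step 3 is as swapping the implementation behind an activity while keeping the old version intact. The sketch below assumes a hypothetical `Workflow` record and two interchangeable implementations; the "service" variant is a local stub and makes no network call.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Workflow:
    """Hypothetical workflow record: a version number plus named activities."""
    version: int
    activities: dict  # activity name -> callable

def refine(workflow, name, new_implementation):
    """Return a new workflow version with one activity replaced."""
    if name not in workflow.activities:
        raise KeyError(f"unknown activity: {name}")
    updated = {**workflow.activities, name: new_implementation}
    return replace(workflow, version=workflow.version + 1, activities=updated)

def mean_local(values):
    """Hypothetical local program."""
    return sum(values) / len(values)

def mean_service_stub(values):
    """Hypothetical stand-in for an equivalent web service."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)

if __name__ == "__main__":
    v1 = Workflow(version=1, activities={"mean": mean_local})
    v2 = refine(v1, "mean", mean_service_stub)
    print(v1.version, v2.version)
    print(v1.activities["mean"]([1, 2, 3]) == v2.activities["mean"]([1, 2, 3]))
```

Keeping both versions makes it possible to record that the new version was derived from the previous one.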

  32. Steps 2 and 3 ● Record provenance data: execution traces (W3C PROV). [Figure: provenance graph for a run of the executable workflow, with the nodes Workflow Run, Lucas, Sample, split, psgen, Output 1, Output 2 and the timestamp “2012-06-01”, linked by the relations wasAssociatedWith, used, hasSpecification, wasStartedAt, wasGeneratedBy and wasEnactedBy.]

  33. Steps 2 and 3 ● Record provenance data: conversion process (W3C PROV). [Figure: script code, executable workflow and new workflow version linked by wasDerivedFrom relations, with wasAssociatedWith relations to the Curator.]
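
A minimal sketch of both kinds of provenance, assuming statements are kept as plain (subject, relation, object) tuples labelled with PROV-O term names. This is only an in-memory illustration, not a serialisation in any PROV format, and the identifiers (`run1`, `script.sh`, `workflow_v1`, and so on) are hypothetical.

```python
from datetime import datetime, timezone

class ProvenanceLog:
    """Hypothetical in-memory store of provenance statements."""

    def __init__(self):
        self.statements = []

    def add(self, subject, relation, obj):
        self.statements.append((subject, relation, obj))

    def relations(self, subject):
        return [relation for s, relation, _ in self.statements if s == subject]

def run_activity(log, run_id, name, func, inputs):
    """Execute one activity and record what it used and generated."""
    activity_id = f"{run_id}/{name}"
    log.add(activity_id, "prov:startedAtTime", datetime.now(timezone.utc).isoformat())
    for port, value in inputs.items():
        log.add(activity_id, "prov:used", f"{port}={value!r}")
    output = func(**inputs)
    log.add(f"{activity_id}/output", "prov:wasGeneratedBy", activity_id)
    return output

def record_conversion(log, script, workflow, new_version, curator):
    """Record how the workflow versions were derived from the script."""
    log.add(workflow, "prov:wasDerivedFrom", script)
    log.add(new_version, "prov:wasDerivedFrom", workflow)
    log.add("conversion", "prov:wasAssociatedWith", curator)

def double(values):
    """Hypothetical activity body."""
    return [2 * v for v in values]

if __name__ == "__main__":
    log = ProvenanceLog()
    run_activity(log, "run1", "double", double, {"values": [1, 2]})
    record_conversion(log, "script.sh", "workflow_v1", "workflow_v2", "curator")
    for statement in log.statements:
        print(statement)
```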

  34. Step 4: Annotate and check quality ● Annotations describing the workflow. ● Use provenance data to check the quality of the conversion process. ● Run checks to verify the soundness of the workflow.

  35. Step 4: Annotate and check quality [Figure: script code, executable workflow.]

  36. Step 4: Annotate and check quality [Figure: initial workflow version, executable workflow.]

  37. Step 4: Annotate and check quality ● Common mistakes during the conversion: – the main logical processing units in the script were not clearly identified; – a mistake was made when migrating script code into the corresponding activity; – the correct input files and parameters were not provided; – the coding of the workflow itself contained errors.
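
One simple quality check, sketched under the assumption that the script and the workflow were both run on the same inputs and that their outputs are available as bytes: compare the outputs by checksum and report anything missing or different. The function names and output names are hypothetical.

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 checksum of an output, as a hex string."""
    return hashlib.sha256(data).hexdigest()

def compare_outputs(script_outputs, workflow_outputs):
    """Return (name, problem) pairs for outputs that do not match."""
    problems = []
    for name, data in script_outputs.items():
        if name not in workflow_outputs:
            problems.append((name, "missing in workflow"))
        elif digest(data) != digest(workflow_outputs[name]):
            problems.append((name, "content differs"))
    return problems

if __name__ == "__main__":
    from_script = {"trajectory": b"frame1\nframe2\n", "analysis": b"rmsd=1.2\n"}
    from_workflow = {"trajectory": b"frame1\nframe2\n", "analysis": b"rmsd=1.3\n"}
    print(compare_outputs(from_script, from_workflow))
```

A mismatch does not say which of the common mistakes occurred, but it narrows the search to the activities that produce the differing output.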

  38. Step 5: Bundle Resources into a Research Object [Figure: bundled resources: provenance data, attributions, annotations, script, concrete workflow(s), abstract workflow, paper.]
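
As a rough sketch of the bundling step, the code below packs resources into a ZIP archive together with a JSON manifest that lists what is aggregated. The manifest layout is loosely modelled on the Research Object Bundle convention of a `.ro/manifest.json` file with an `aggregates` list, but it is simplified and should be read as an assumption rather than a conforming implementation; all file names and the `createdBy` field are hypothetical.

```python
import io
import json
import zipfile

def bundle_resources(resources, authors):
    """Pack resources (path -> text) and a manifest into ZIP bytes."""
    manifest = {
        "createdBy": authors,  # simplified, assumed field
        "aggregates": [{"uri": "/" + path} for path in sorted(resources)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, content in resources.items():
            archive.writestr(path, content)
    return buffer.getvalue()

if __name__ == "__main__":
    data = bundle_resources(
        {
            "script/experiment.sh": "#!/bin/sh\necho run\n",
            "workflow/abstract.txt": "split -> psgen\n",
            "provenance/run1.txt": "run1 used sample\n",
        },
        authors=["curator"],
    )
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        print(archive.namelist())
```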

  39. Contributions ● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WROs. ● This addresses an important issue in the area of script provenance.

  40. Conclusions ● We addressed issues with respect to the understanding, reuse and reproducibility of script-based experiments. ● The methodology was: – elaborated based on requirements; – showcased via a real-world use case from the field of Molecular Dynamics. ● We exploited tools and standards from the scientific community: – Scientific Workflows, YesWorkflow, Research Objects, the W3C PROV recommendations and the Web Annotation Data Model. ● The bundle is available at http://w3id.org/w2share/s2rwro/

  41. Next Steps ● Evaluation using other case studies; ● Evaluation of the cost-effectiveness of our methodology; ● Extension of YesWorkflow to support the semantic annotation of blocks; ● Implementation of tools.

  42. Acknowledgments ● FAPESP (grant # 2014/23861-4) ● CCES/CEPID (grant # 2013/08293-7) – Center for Computational Engineering & Sciences ● LIS (Laboratory of Information Systems) ● Prof. Munir Skaf and his group from the Institute of Chemistry, Unicamp.

