Converting Scripts into Reproducible Workflow Research Objects Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros lucas.carvalho@ic.unicamp.br Baltimore, Maryland, USA October 23-26, 2016
Background and Motivation
● Data-Intensive Experiments
– Collection of scripts, programs and (big) data
● How to understand, reproduce or reuse data and models of experiments?
● Manual collection and organization of data provenance
[Figure: Papers.]
Background and Motivation
● Script-based experiments
– What are the inputs and outputs?
– How to change this local program for a similar web service?
– Difficult to understand, to reuse, and to reproduce.
[Figure: Example of script code.]
Background and Motivation
● Scientific Workflows
[Figure: Example of Scientific Workflow Management System.]
Overview
[Figure: Understand + Reuse, Create, Reproduce; Methodology with Step 1, Step 2, Step 3, Step 4 and Step 5.]
Related Work
● Script-language specific.
● Workflow-engine specific.
● A new language is needed.
● Outcome is not an executable workflow.
● Do not collect provenance data of the conversion process.
Two Kinds of Experts
● Scientists:
– Domain experts who understand the experiment and the script (sometimes called users).
● Curators:
– Scientists who are also familiar with workflow and script programming; or
– Computer scientists who are familiar enough with the domain to be able to implement our methodology.
– Responsible for authoring, documenting and publishing workflows and associated resources.
Requirements
1. Produce workflow-like view of the script.
2. Create an executable workflow and compare execution of workflow and script.
3. Modify the workflow resources.
4. Record provenance data.
5. Aggregate all resources to support Reproducibility and Reuse.
Requirements
1. Produce workflow-like view of the script.
[Figure: a script-based experiment and the corresponding abstract workflow, made of activities (Activity 1, Activity 2, ..., Activity n) with ports (Port 1, Port 2, Port 3, ..., Port n).]
Requirements
2. Create executable workflow and compare execution of workflow and script.
[Figure: script-based experiment and executable workflow.]
Requirements
3. Modify the workflow resources.
[Figure: (a) Local; (b) Algorithm A, Algorithm B.]
Requirements
4. Record provenance data.
[Figure: provenance graph with the nodes Workflow Run, Lucas, Sample, “2012-06-01”, Activity 1, Activity 2, Output 1 and Output 2, linked by the relations wasAssociatedWith, wasStartedAt, used and wasGeneratedBy.]
Requirements
5. Aggregate all resources to support Reproducibility and Reuse.
[Figure: aggregated resources: authors, data, annotations, provenance, scripts, concrete workflows, abstract workflows, papers and reports.]
Methodology
1. Generate abstract workflow
2. Create an executable workflow
3. Refine workflow
4. Annotate and check quality
5. Bundle resources into a Research Object
[Figure: the five steps connecting the script, the abstract workflow and the concrete workflow.]
Workflow Research Object (WRO)
● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.
● WROs encapsulate scientific workflows and additional information regarding their context and resources.
[Figure: Research Object Model.]
Running Example
● Molecular Dynamics Simulations
– Used in many branches of material sciences, computational engineering, physics and chemistry.
– Scripts (shell script), programs (NAMD, VMD, Fortran).
– Phases: set up, simulation and analysis of trajectories.
– Inputs: protein structure, simulation parameters and force field files.
– Outputs: trajectories and analysis results.
Step 1 Generate Abstract Workflow
● Manually annotate the script code to obtain annotated script code.
● Create a workflow-like view of the annotated script code to obtain the abstract workflow.
Step 1 Generate Abstract Workflow
● YesWorkflow (McPhillips et al., 2015)
– Code comments
– Tags: @begin, @end, @desc, @in, @out, ...
● Create workflow-like view: annotated script code to abstract workflow.
[Figure: annotated script code with code blocks and input/output marked, and the resulting abstract workflow.]
Reference: T. McPhillips et al. (2015), “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313.
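The tags on this slide can be illustrated with a small sketch. The code below is not the YesWorkflow tool; it is a minimal, hypothetical parser (`parse_blocks` is an illustrative name) that reads `@begin`, `@in`, `@out` and `@end` comments and recovers the annotated blocks with their ports. The block names `split` and `psgen` appear in the provenance figure of this talk; the outer block `setup` and all port names are assumptions made for the example.

```python
import re

# Hypothetical annotated shell script; port names are illustrative only.
ANNOTATED_SCRIPT = """\
# @begin setup @desc Prepare the simulation inputs
# @in initial_structure
# @out final_structure
# @begin split @desc Split the structure into segments
# @in initial_structure
# @out segments
# @end split
# @begin psgen @desc Build the final structure from the segments
# @in segments
# @out final_structure
# @end psgen
# @end setup
"""

# Matches the first structural tag on a comment line and its name.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def parse_blocks(script):
    """Return {block: {"in": [...], "out": [...], "parent": name or None}}."""
    blocks, stack = {}, []
    for line in script.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            parent = stack[-1] if stack else None
            blocks[name] = {"in": [], "out": [], "parent": parent}
            stack.append(name)
        elif tag == "end":
            if not stack or stack[-1] != name:
                raise ValueError(f"unbalanced @end {name}")
            stack.pop()
        else:
            # @in / @out attach a port to the innermost open block.
            if not stack:
                raise ValueError(f"@{tag} {name} outside any block")
            blocks[stack[-1]][tag].append(name)
    if stack:
        raise ValueError(f"missing @end for {stack[-1]}")
    return blocks


blocks = parse_blocks(ANNOTATED_SCRIPT)
for name, block in blocks.items():
    print(name, block["in"], "->", block["out"])
```

The resulting dictionary is one possible in-memory form of the abstract workflow: each block is an activity, and a data item that is an output of one block and an input of another is an edge between them.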
Step 2 Create an executable workflow
● Create implementation of activities: copy code blocks from the script.
[Figure: abstract workflow, script code and the resulting executable workflow.]
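As a rough illustration of this step, the sketch below copies the code blocks of a toy "script" into separate activities with named ports, wires them into a workflow, and compares the workflow result with the script result. It does not use any particular workflow management system; every name (`script`, `split_activity`, `transform_activity`, `run_workflow`) is hypothetical, and the string processing merely stands in for real simulation code.

```python
# Stand-in for the original script: two code blocks run back to back.
def script(sample):
    parts = sample.split(",")                   # code block 1
    return [p.strip().upper() for p in parts]   # code block 2


# The same code blocks copied into activities with explicit ports.
def split_activity(sample):
    return {"parts": sample.split(",")}


def transform_activity(parts):
    return {"result": [p.strip().upper() for p in parts]}


# Executable workflow: ordered activities plus the wiring of each
# input port to a named data item.
WORKFLOW = [
    (split_activity, {"sample": "sample"}),
    (transform_activity, {"parts": "parts"}),
]


def run_workflow(workflow, inputs):
    """Run the activities in order, passing data items between ports."""
    data = dict(inputs)
    for activity, wiring in workflow:
        kwargs = {port: data[item] for port, item in wiring.items()}
        data.update(activity(**kwargs))
    return data


sample = "ala, gly ,ser"
outputs = run_workflow(WORKFLOW, {"sample": sample})
same = outputs["result"] == script(sample)
print("workflow matches script:", same)
```

Comparing the two results on the same input is the simplest form of the check named in Requirement 2; real experiments would compare output files rather than in-memory values.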
Step 3 Refine executable workflow
● Create new version
● Modify resources:
– Algorithms
– Data Sets
– Parallelization
– Web Services
– ...
[Figure: executable workflow and new workflow version.]
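A refinement such as replacing one algorithm with another can be sketched as creating a new workflow version instead of editing the existing one. The representation below is an assumption made for illustration (a dictionary with a version number and a mapping from activity name to implementation); `algorithm_a`, `algorithm_b` and `new_version` are hypothetical names.

```python
def algorithm_a(values):
    """Stand-in for the original, local implementation."""
    return sorted(values)


def algorithm_b(values):
    """Stand-in for an alternative algorithm or a web service call."""
    return sorted(values, reverse=True)


def new_version(workflow, activity, implementation):
    """Return a new workflow version; the previous one is left untouched."""
    if activity not in workflow["activities"]:
        raise KeyError(f"unknown activity: {activity}")
    activities = dict(workflow["activities"])
    activities[activity] = implementation
    return {
        "version": workflow["version"] + 1,
        "derived_from": workflow["version"],
        "activities": activities,
    }


v1 = {"version": 1, "derived_from": None,
      "activities": {"analysis": algorithm_a}}
v2 = new_version(v1, "analysis", algorithm_b)
print(v2["version"], "derived from", v2["derived_from"])
```

Keeping the earlier version intact means both versions can still be executed and compared, and the `derived_from` field gives a place to hang the derivation link between them.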
Steps 2 and 3
● Record provenance data: execution traces (W3C PROV).
[Figure: provenance graph of a run of the executable workflow, with the nodes Workflow Run, Lucas, Sample, “2012-06-01”, split, psgen, Output 1 and Output 2, linked by the relations wasAssociatedWith, hasSpecification, wasStartedAt, used, wasGeneratedBy and wasEnactedBy.]
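The kind of execution trace shown on this slide can be sketched as a list of (subject, relation, object) statements collected while the activities run. The sketch does not use a PROV library and does not produce a standard PROV serialization; the relation names are taken from the figure, while `record`, `traced` and the toy activity bodies are hypothetical.

```python
from datetime import datetime, timezone

trace = []  # (subject, relation, object) statements


def record(subject, relation, obj):
    trace.append((subject, relation, obj))


def traced(activity_name, func, inputs):
    """Run one activity and record what it used and what it generated."""
    started = datetime.now(timezone.utc).isoformat()
    record(activity_name, "wasStartedAt", started)
    for item in inputs:
        record(activity_name, "used", item)
    outputs = func(**inputs)
    for item in outputs:
        record(item, "wasGeneratedBy", activity_name)
    return outputs


record("workflow_run", "wasAssociatedWith", "Lucas")

# Toy activity bodies; only the recorded relations matter here.
out1 = traced(
    "split",
    lambda sample: {"output_1": sample[:3], "output_2": sample[3:]},
    {"sample": "ABCDEF"},
)
out2 = traced(
    "psgen",
    lambda output_1: {"structure": output_1.lower()},
    {"output_1": out1["output_1"]},
)

for statement in trace:
    print(statement)
```

Because `psgen` uses an item that `split` generated, the data dependency between the two activities can be read back from the trace alone.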
Steps 2 and 3
● Record provenance data: conversion process (W3C PROV).
[Figure: script code, executable workflow and new workflow version linked by wasDerivedFrom relations, with wasAssociatedWith links to the Curator.]
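Provenance of the conversion process can be queried to find out which script a given workflow version came from. The sketch below assumes one plausible reading of the figure (the new version is derived from the executable workflow, which is derived from the script); the resource identifiers and the helper `lineage` are hypothetical.

```python
# Conversion-process statements; identifiers are illustrative only.
DERIVATIONS = [
    ("executable_workflow_v1", "wasDerivedFrom", "script.sh"),
    ("executable_workflow_v2", "wasDerivedFrom", "executable_workflow_v1"),
    ("executable_workflow_v1", "wasAssociatedWith", "curator"),
    ("executable_workflow_v2", "wasAssociatedWith", "curator"),
]


def lineage(resource, statements):
    """Follow wasDerivedFrom links from a resource back to its origin."""
    parents = {s: o for s, rel, o in statements if rel == "wasDerivedFrom"}
    chain = [resource]
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in chain:
            raise ValueError("cycle in derivation statements")
        chain.append(parent)
    return chain


chain = lineage("executable_workflow_v2", DERIVATIONS)
print(" <- ".join(reversed(chain)))
```

This sketch assumes each resource has at most one direct source; a resource derived from several sources would need a graph traversal instead of a single chain.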
Step 4 Annotate and check quality
● Annotations describing the workflow.
● Use provenance data to check the quality of the conversion process.
● Run checks to verify the soundness of the workflow.
Step 4 Annotate and check quality
[Figure: script code and executable workflow.]
Step 4 Annotate and check quality
[Figure: initial executable workflow and workflow version.]
Step 4 Annotate and check quality
● Common mistakes during the conversion:
– the main logical processing units in the script were not clearly identified;
– script code was migrated incorrectly into the corresponding activity;
– the correct input files and parameters were not provided;
– the coding of the workflow itself contained errors.
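Two of the mistakes listed on this slide lend themselves to simple automated checks, sketched below under stated assumptions. `check_wiring` flags activity inputs that no earlier activity produces and that are not workflow inputs (a missing input file or parameter); `same_content` compares an output of the script with the matching output of the workflow by checksum (a migration error). Both function names, the activity description format and the port names are hypothetical.

```python
import hashlib


def check_wiring(activities, workflow_inputs):
    """Return (activity, port) pairs whose input is never provided.

    `activities` is an ordered list of (name, input_ports, output_ports).
    """
    available = set(workflow_inputs)
    problems = []
    for name, inputs, outputs in activities:
        for port in inputs:
            if port not in available:
                problems.append((name, port))
        available.update(outputs)
    return problems


def same_content(script_output: bytes, workflow_output: bytes) -> bool:
    """Compare two outputs by SHA-256 checksum."""
    left = hashlib.sha256(script_output).hexdigest()
    right = hashlib.sha256(workflow_output).hexdigest()
    return left == right


activities = [
    ("split", ["structure"], ["segments"]),
    ("psgen", ["segments", "topology"], ["final_structure"]),
]

# "topology" is neither a workflow input nor produced by "split".
problems = check_wiring(activities, workflow_inputs=["structure"])
print("unprovided inputs:", problems)
```

Checksums only work for deterministic outputs; results that embed timestamps or depend on random seeds would need a tolerance-based or field-by-field comparison instead.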
Step 5 Bundle Resources into a Research Object
[Figure: Research Object aggregating provenance data, attributions, annotations, the script, the paper, the abstract workflow and the concrete workflow(s).]
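The bundling step can be sketched as a ZIP archive that holds the resources plus a manifest listing what is aggregated. The layout below (a top-level `manifest.json` with `aggregates` and `annotations` keys) is a simplified assumption made for this example and does not claim to follow the Research Object specifications; the file paths, contents and the function `bundle` are hypothetical.

```python
import io
import json
import zipfile


def bundle(resources, annotations):
    """Pack resources and a manifest into an in-memory ZIP archive.

    `resources` maps an archive path to a (role, text content) pair.
    """
    manifest = {
        "aggregates": [
            {"path": path, "role": role}
            for path, (role, _) in resources.items()
        ],
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, (_, content) in resources.items():
            archive.writestr(path, content)
    return buffer.getvalue()


# Placeholder contents; a real bundle would hold the actual files.
resources = {
    "script/setup.sh": ("script", "# original script\n"),
    "workflow/abstract.txt": ("abstract workflow", "split -> psgen\n"),
    "workflow/concrete_v1.txt": ("concrete workflow", "version 1\n"),
    "provenance/trace.txt": ("provenance", "split used sample\n"),
}
annotations = [
    {"about": "workflow/concrete_v1.txt", "content": "Converted from script"},
]

data = bundle(resources, annotations)
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    manifest = json.loads(archive.read("manifest.json"))
print(names)
```

Writing the manifest alongside the files keeps the bundle self-describing: a reader can list the aggregated resources and their roles without opening each file.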
Contributions
● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WROs.
● This addresses an important issue in the area of script provenance.
Conclusions
● We addressed issues with respect to understanding, reuse and reproducibility of script-based experiments.
● The methodology was:
– elaborated based on requirements;
– showcased via a real-world use case from the field of Molecular Dynamics.
● We exploited tools and standards from the scientific community:
– Scientific Workflows, YesWorkflow, Research Objects, the W3C PROV recommendations and the Web Annotation Data Model.
● The bundle is available at http://w3id.org/w2share/s2rwro/
Next Steps
● Evaluation using other case studies;
● Evaluation of the cost-effectiveness of our methodology;
● Extension of YesWorkflow to support the semantic annotation of blocks;
● Implementation of tools.
Acknowledgments
● FAPESP (grant # 2014/23861-4)
● CCES/CEPID (grant # 2013/08293-7): Center for Computational Engineering & Sciences
● LIS (Laboratory of Information Systems)
● Prof. Munir Skaf and his group from the Institute of Chemistry, Unicamp.