Prediction of workflow execution time using provenance traces: practical applications in medical data processing
Hugo Hiden, Simon Woodman, Paul Watson
How long will my program take to run?
Part of a bigger picture
• Can I repeat my results?
• What are the implications of errors?
• How long will my program take to run?
• What version of the program ran?
• How was a result generated?
Provenance Research
• Used to answer these questions
• Important in scientific research
• Lots of work done to capture and represent provenance
• Active research area (OPM, PROV)
e-Science Central
• Source of all our provenance data
  – Platform used for many projects
• Repository of code and data
  – Users can add their own code
• Well instrumented and understood
  – Used to collect OPM, now PROV
• Plenty of data sets
  – Diverse projects
  – Large applications
• Workflows for data processing
The workflow model
• Simple workflow implementation
  – Directed acyclic graph
  – Composed of connected “Blocks”
  – Deploys at reasonable scale in clouds
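A workflow of this kind can be sketched as a small DAG of blocks. This is an illustrative Python model (class and field names are assumptions, not e-Science Central's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    # A workflow "block": a named processing step with its own settings.
    block_id: str
    version: str
    settings: dict = field(default_factory=dict)

@dataclass
class Workflow:
    # Directed acyclic graph: edges map a block to its downstream blocks.
    blocks: dict  # block_id -> Block
    edges: dict   # block_id -> list of downstream block_ids

    def topological_order(self):
        # Kahn's algorithm; raises if the graph contains a cycle.
        indegree = {b: 0 for b in self.blocks}
        for dsts in self.edges.values():
            for d in dsts:
                indegree[d] += 1
        ready = [b for b, n in indegree.items() if n == 0]
        order = []
        while ready:
            b = ready.pop()
            order.append(b)
            for d in self.edges.get(b, []):
                indegree[d] -= 1
                if indegree[d] == 0:
                    ready.append(d)
        if len(order) != len(self.blocks):
            raise ValueError("workflow graph contains a cycle")
        return order
```

A topological order is what a per-block performance model needs: each block's predicted outputs are available before its downstream blocks are considered.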
Modelling performance
• Execution time for a single block
  – Workflow is some combination of individual block models
• There should be some predictors:
  – The input data sizes
  – The configuration of the block
  – The machine it is running on
• The issues are:
  – What types of model are most appropriate
  – How accurate they are
Execution time of a block

time = f(input-size, block-code, block-settings, random-factors)

• More data increases execution time, so a model is needed for each block
• Each block has different characteristics; the configuration of the block instance can change behaviour
• Random factors: machine load, network traffic, hardware variations, …

A workflow is a connected pathway of blocks…
Requirements for a “real” system
• Proactively build models
  – In response to more data
  – When more blocks are added
• Select the most appropriate model
  – Pick based on best error
• Aim to always return some estimate
  – Mechanisms to return an estimate if no models are available
Complications
• Gathering data
  – Collect data “non-invasively”
• Model types
  – Different blocks display different characteristics
  – Different algorithms and versions
• Dynamic environment
  – New blocks being added
  – Block behaviour only becomes apparent as data is collected
Data collected via provenance
• Provenance collection already captures:
  – Data sizes
  – Code versions
  – Algorithm settings
• Extra instrumentation for:
  – Block start and end times
  – Number of concurrent workflows
  – CPU / memory usage
e-SC Architecture
[Architecture diagram: external tooling (Maven plugins, file uploader, domain-specific apps/websites) connects via an external API (HTTP/REST) to the core services — user management (friends, groups, projects, quotas), security (ACLs, authentication via OpenID, Shibboleth and external auth), workflow engines fed by a workflow queue, a service/library cache, the e-SC blob store (filesystem, S3, Azure Blob Store, HDFS backends, with a migration queue) and a provenance store (Neo4j, fed by provenance/archive queues). Storage handles versioning and archiving (filesystem, AWS Glacier); processing covers services, workflows and libraries; provenance/audit covers capture, query/search and presentation.]
Data capture architecture
[Diagram: the e-SC architecture annotated with three additions — provenance and performance data capture, data/model storage, and model building/updating.]
Data collected
• Each execution of a block creates a single data point:
  ID, Version, Setting_1, Setting_2, Memory Use, Input_size, Duration, Output_size
• Fields are grouped into identifying data, model X (input) data and model Y (output) data
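One way to picture such a data point is as a flat record per block execution. The field names below mirror the slide; the class itself is illustrative, not the system's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ExecutionRecord:
    # One data point per block execution, assembled from provenance
    # plus the extra instrumentation.
    block_id: str      # identifying data
    version: str       # identifying data
    settings: dict     # model X (input) data: Setting_1, Setting_2, ...
    input_size: int    # model X (input) data, bytes
    output_size: int   # model Y (output) data, bytes
    duration: float    # model Y (output) data, seconds
    memory_use: int    # model Y (output) data, bytes
```

Collections of these records, grouped by (block_id, version), are the training sets for the per-block models.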
Block models
• Blocks may exhibit very different behaviours depending on their implementation details / configuration
[Three plots of execution time vs observed execution data, showing: no relationship, a linear relationship, and a non-linear relationship]
Selecting the most appropriate model
Dynamic model updating
• Impossible (difficult) to know what the best model will be
  – Gathering more data may change our view
• Need to implement model updating
  – Models can be rebuilt and replaced on the fly
• Return best available estimate at a given time
  – This may improve
“Panel of experts” pattern
• Maintain a suite of different models
  – Rebuild them all when new data arrives
  – Use the best one until the next update
• Drug modelling project: Quantitative Structure Activity Relationship (QSAR)
  – Activity ≈ f(molecular structure)
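A minimal sketch of the panel-of-experts idea: fit several candidate models of execution time vs input size and keep whichever has the lowest held-out error. The candidate set (constant, linear, quadratic polynomials) and the 80/20 split are assumptions for illustration, not the paper's actual model suite:

```python
import numpy as np

def fit_candidates(x, y):
    """Fit a small "panel" of candidate models of execution time vs
    input size; return (name, predict_fn, rmse) for the one with the
    lowest RMSE on a held-out validation split."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    split = int(0.8 * len(x))
    xt, yt = x[:split], y[:split]   # training portion
    xv, yv = x[split:], y[split:]   # validation portion
    best = None
    for name, degree in [("constant", 0), ("linear", 1), ("quadratic", 2)]:
        predict = np.poly1d(np.polyfit(xt, yt, degree))
        rmse = float(np.sqrt(np.mean((predict(xv) - yv) ** 2)))
        if best is None or rmse < best[2]:
            best = (name, predict, rmse)
    return best
```

In a live system the same routine would be re-run whenever new execution records arrive, so the winning "expert" can change as more data is collected.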
Model fallbacks
• What happens if there is no model?
  – Still want to return something
• We used the following logic:
  – Use a version-agnostic model
  – Use the average execution time of the block
  – Use the average execution time of all blocks
• This will always return some prediction as long as a single block of any type has executed
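The fallback logic can be sketched as a chain of lookups. The dictionary layout and the `(block_id, None)` convention for a version-agnostic model are assumptions for illustration:

```python
def predict_duration(block_id, version, input_size,
                     models, block_means, global_mean):
    """Fallback chain: 1) version-specific model, 2) version-agnostic
    model, 3) average duration of this block, 4) average over all blocks.
    `models` maps (block_id, version) -> callable(input_size)."""
    model = models.get((block_id, version)) or models.get((block_id, None))
    if model is not None:
        return model(input_size)
    if block_id in block_means:
        return block_means[block_id]
    return global_mean  # always available once any block has executed
```

Each step trades accuracy for coverage, so some estimate is always returned.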
Medical data processing
• Wrist-worn accelerometers measure acceleration in 3 axes
  – Typically 100 Hz
  – Worn for 2 weeks
  – Analyse sleep patterns, general activity levels etc.
• Data collected and analysed
  – Clinicians view results and modify exercise regime
  – Collections of 100k data sets (24 TB)
Results: Physical Activity Classification (PAC1)
[Two scatter plots of predicted vs actual values with prediction, fitted and ideal lines: an output size model (KB, range 0–3500) and a duration model (seconds, range 50–100)]
Results: GGIR GENEActiv processing
[Two scatter plots of predicted vs actual values (prediction RMSE = 34.670, r² = 0.987): an output size model (KB, range 2000–22000) and a duration model (seconds, range 200–1600)]
Not always successful
[Scatter plot of predicted vs actual duration (seconds, range 0–80) with prediction, fitted and ideal lines, showing poor agreement]
Predicting workflow duration
• Modelling is complicated by the connected nature of the workflow
• All data for the first block’s model is readily available… not the case for downstream blocks
• How big are the intermediate data transfers?
Data volume produced by a block

size = f(input-size, block-code, block-settings, random-factors)

• More data increases execution time, so a model is needed for each block
• Each block has different characteristics; the configuration of the block instance can change behaviour
• Random factors: machine load, network traffic, hardware variations, phase of moon
Modelling total execution time
• Execution time = Sum(block predictions)
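One way to realise this sum is to walk the workflow in topological order, feeding each block's predicted output size to its downstream blocks. A sketch, assuming each model is a callable of input size (function and parameter names are illustrative):

```python
def predict_workflow(order, edges, inputs, duration_models, size_models):
    """order: block ids in topological order; edges: block -> downstream
    blocks; inputs: externally supplied input sizes; *_models map each
    block id to a callable f(input_size)."""
    sizes = dict(inputs)  # block_id -> predicted input size
    total = 0.0
    for b in order:
        in_size = sizes.get(b, 0.0)
        total += duration_models[b](in_size)   # predicted block duration
        out_size = size_models[b](in_size)     # predicted output size
        for d in edges.get(b, []):             # propagate downstream
            sizes[d] = sizes.get(d, 0.0) + out_size
    return total
```

Note how size-prediction errors compound: a downstream duration prediction is only as good as every size prediction on the path to it, which is why this works for small workflows but degrades for larger ones.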
Results: Chemical property modelling
• Models built for each individual block
• Prediction generated by propagating size predictions
[Scatter plot of predicted vs actual duration (seconds, range 0–110): training prediction RMSE = 5.008, r² = 0.980; testing prediction RMSE = 4.698, r² = 0.981]
Modelling workflows: caveats
• Much harder to model workflow duration
  – Propagation of errors
• Works for simple workflows
  – Rapidly fails for larger workflows
• Possible solutions
  – More data collection
  – Model groups of blocks
  – Build models of whole workflows
Conclusions
• Extended provenance capture to build predictive models
  – Asynchronous collection of data and model building
• Demonstrated it is possible to model block execution time
• Showed it may be possible to combine predictions to estimate workflow execution time
  – Large workflows / poor block models are issues