Prediction of workflow execution time using provenance traces: practical applications in medical data processing
Hugo Hiden, Simon Woodman, Paul Watson
How long will my program take to run?
Part of a bigger picture
• Can I repeat my results?
• What are the implications of errors?
• How long will my program take to run?
• What version of the program ran?
• How was a result generated?
Provenance Research
• Used to answer these questions
• Important in scientific research
• Lots of work done to capture and represent provenance
• Active research area (OPM, PROV)
e-Science Central
• Source of all our provenance data
  – Platform used for many projects
• Repository of code and data
  – Users can add their own code
• Well instrumented and understood
  – Used to collect OPM, now PROV
• Plenty of data sets
  – Diverse projects
  – Large applications
• Workflows for data processing
The workflow model
• Simple workflow implementation
  – Directed acyclic graph
  – Composed of connected “Blocks”
  – Deploys at reasonable scale in clouds
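A workflow of this kind can be sketched as a small DAG of blocks. This is an illustrative Python model (class and field names are assumptions, not e-Science Central's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    # A workflow "block": a named processing step with its own settings.
    block_id: str
    version: str
    settings: dict = field(default_factory=dict)

@dataclass
class Workflow:
    # Directed acyclic graph: edges map a block to its downstream blocks.
    blocks: dict  # block_id -> Block
    edges: dict   # block_id -> list of downstream block_ids

    def topological_order(self):
        # Kahn's algorithm; raises if the graph contains a cycle.
        indegree = {b: 0 for b in self.blocks}
        for dsts in self.edges.values():
            for d in dsts:
                indegree[d] += 1
        ready = [b for b, n in indegree.items() if n == 0]
        order = []
        while ready:
            b = ready.pop()
            order.append(b)
            for d in self.edges.get(b, []):
                indegree[d] -= 1
                if indegree[d] == 0:
                    ready.append(d)
        if len(order) != len(self.blocks):
            raise ValueError("workflow graph contains a cycle")
        return order
```

A topological order is what a per-block performance model needs: each block's predicted outputs are available before its downstream blocks are considered.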
Modelling performance
• Execution time for a single block
  – Workflow is some combination of individual block models
• There should be some predictors:
  – The input data sizes
  – The configuration of the block
  – The machine it is running on
• The issues are:
  – What types of model are most appropriate
  – How accurate they are
Execution time of a block

time = f(input-size, block-code, block-settings, random-factors)

• More data increases execution time, so a model is needed for each block
• Each block has different characteristics; the configuration of the block instance can change behaviour
• Random factors: machine load, network traffic, hardware variations, …

A workflow is a connected pathway of blocks…
Requirements for a “real” system
• Proactively build models
  – In response to more data
  – When more blocks are added
• Select the most appropriate model
  – Pick based on best error
• Aim to always return some estimate
  – Mechanisms to return an estimate if no models are available
Complications
• Gathering data
  – Collect data “non-invasively”
• Model types
  – Different blocks display different characteristics
  – Different algorithms and versions
• Dynamic environment
  – New blocks being added
  – Block behaviour only becomes apparent as data is collected
Data collected via provenance
• Provenance collection already captures:
  – Data sizes
  – Code versions
  – Algorithm settings
• Extra instrumentation for:
  – Block start and end times
  – Number of concurrent workflows
  – CPU / memory usage
e-SC Architecture
[Architecture diagram: external tooling (Maven plugins, file uploader, domain-specific apps/websites) connects via an external API (HTTP/REST) to the core services — user management (friends, groups, projects, quotas), security (ACLs, authentication via OpenID, Shibboleth and external auth), workflow engines fed by a workflow queue, a service/library cache, the e-SC blob store (filesystem, S3, Azure Blob Store, HDFS backends, with a migration queue) and a provenance store (Neo4j, fed by provenance/archive queues). Storage handles versioning and archiving (filesystem, AWS Glacier); processing covers services, workflows and libraries; provenance/audit covers capture, query/search and presentation.]
Data capture architecture
[Diagram: the e-SC architecture annotated with three additions — provenance and performance data capture, data/model storage, and model building/updating.]
Data collected
• Each execution of a block creates a single data point:
  ID, Version, Setting_1, Setting_2, Memory Use, Input_size, Duration, Output_size
• Fields are grouped into identifying data, model X (input) data and model Y (output) data
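One way to picture such a data point is as a flat record per block execution. The field names below mirror the slide; the class itself is illustrative, not the system's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ExecutionRecord:
    # One data point per block execution, assembled from provenance
    # plus the extra instrumentation.
    block_id: str      # identifying data
    version: str       # identifying data
    settings: dict     # model X (input) data: Setting_1, Setting_2, ...
    input_size: int    # model X (input) data, bytes
    output_size: int   # model Y (output) data, bytes
    duration: float    # model Y (output) data, seconds
    memory_use: int    # model Y (output) data, bytes
```

Collections of these records, grouped by (block_id, version), are the training sets for the per-block models.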
Block models
• Blocks may exhibit very different behaviours depending on their implementation details / configuration
[Three plots of execution time vs observed execution data, showing: no relationship, a linear relationship, and a non-linear relationship]
Selecting the most appropriate model
Dynamic model updating
• Impossible (difficult) to know what the best model will be
  – Gathering more data may change our view
• Need to implement model updating
  – Models can be rebuilt and replaced on the fly
• Return best available estimate at a given time
  – This may improve
“Panel of experts” pattern
• Maintain a suite of different models
  – Rebuild them all when new data arrives
  – Use the best one until the next update
• Drug modelling project: Quantitative Structure Activity Relationship (QSAR)
  – Activity ≈ f(molecular structure)
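A minimal sketch of the panel-of-experts idea: fit several candidate models of execution time vs input size and keep whichever has the lowest held-out error. The candidate set (constant, linear, quadratic polynomials) and the 80/20 split are assumptions for illustration, not the paper's actual model suite:

```python
import numpy as np

def fit_candidates(x, y):
    """Fit a small "panel" of candidate models of execution time vs
    input size; return (name, predict_fn, rmse) for the one with the
    lowest RMSE on a held-out validation split."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    split = int(0.8 * len(x))
    xt, yt = x[:split], y[:split]   # training portion
    xv, yv = x[split:], y[split:]   # validation portion
    best = None
    for name, degree in [("constant", 0), ("linear", 1), ("quadratic", 2)]:
        predict = np.poly1d(np.polyfit(xt, yt, degree))
        rmse = float(np.sqrt(np.mean((predict(xv) - yv) ** 2)))
        if best is None or rmse < best[2]:
            best = (name, predict, rmse)
    return best
```

In a live system the same routine would be re-run whenever new execution records arrive, so the winning "expert" can change as more data is collected.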
Model fallbacks
• What happens if there is no model?
  – Still want to return something
• We used the following logic:
  – Use a version-agnostic model
  – Use the average execution time of the block
  – Use the average execution time of all blocks
• This will always return some prediction as long as a single block of any type has executed
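The fallback logic can be sketched as a chain of lookups. The dictionary layout and the `(block_id, None)` convention for a version-agnostic model are assumptions for illustration:

```python
def predict_duration(block_id, version, input_size,
                     models, block_means, global_mean):
    """Fallback chain: 1) version-specific model, 2) version-agnostic
    model, 3) average duration of this block, 4) average over all blocks.
    `models` maps (block_id, version) -> callable(input_size)."""
    model = models.get((block_id, version)) or models.get((block_id, None))
    if model is not None:
        return model(input_size)
    if block_id in block_means:
        return block_means[block_id]
    return global_mean  # always available once any block has executed
```

Each step trades accuracy for coverage, so some estimate is always returned.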
Medical data processing
• Wrist-worn accelerometers measure acceleration in 3 axes
  – Typically 100 Hz
  – Worn for 2 weeks
  – Analyse sleep patterns, general activity levels etc.
• Data collected and analysed
  – Clinicians view results and modify exercise regime
  – Collections of 100k data sets (24 TB)
Results: Physical Activity Classification (PAC1)
[Two scatter plots of predicted vs actual values with prediction, fitted and ideal lines: an output size model (KB, range 0–3500) and a duration model (seconds, range 50–100)]
Results: GGIR GENEActiv processing
[Two scatter plots of predicted vs actual values (prediction RMSE = 34.670, r² = 0.987): an output size model (KB, range 2000–22000) and a duration model (seconds, range 200–1600)]
Not always successful
[Scatter plot of predicted vs actual duration (seconds, range 0–80) with prediction, fitted and ideal lines, showing poor agreement]
Predicting workflow duration
• Modelling is complicated by the connected nature of the workflow
• All data for the first block’s model is readily available… not the case for downstream blocks
• How big are the intermediate data transfers?
Data volume produced by a block

size = f(input-size, block-code, block-settings, random-factors)

• More data increases execution time, so a model is needed for each block
• Each block has different characteristics; the configuration of the block instance can change behaviour
• Random factors: machine load, network traffic, hardware variations, phase of moon
Modelling total execution time
• Execution time = Sum(block predictions)
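One way to realise this sum is to walk the workflow in topological order, feeding each block's predicted output size to its downstream blocks. A sketch, assuming each model is a callable of input size (function and parameter names are illustrative):

```python
def predict_workflow(order, edges, inputs, duration_models, size_models):
    """order: block ids in topological order; edges: block -> downstream
    blocks; inputs: externally supplied input sizes; *_models map each
    block id to a callable f(input_size)."""
    sizes = dict(inputs)  # block_id -> predicted input size
    total = 0.0
    for b in order:
        in_size = sizes.get(b, 0.0)
        total += duration_models[b](in_size)   # predicted block duration
        out_size = size_models[b](in_size)     # predicted output size
        for d in edges.get(b, []):             # propagate downstream
            sizes[d] = sizes.get(d, 0.0) + out_size
    return total
```

Note how size-prediction errors compound: a downstream duration prediction is only as good as every size prediction on the path to it, which is why this works for small workflows but degrades for larger ones.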
Results: Chemical property modelling
• Models built for each individual block
• Prediction generated by propagating size predictions
[Scatter plot of predicted vs actual duration (seconds, range 0–110): training prediction RMSE = 5.008, r² = 0.980; testing prediction RMSE = 4.698, r² = 0.981]
Modelling workflows: caveats
• Much harder to model workflow duration
  – Propagation of errors
• Works for simple workflows
  – Rapidly fails for larger workflows
• Possible solutions
  – More data collection
  – Model groups of blocks
  – Build models of whole workflows
Conclusions
• Extended provenance capture to build predictive models
  – Asynchronous collection of data and model building
• Demonstrated it is possible to model block execution time
• Showed it may be possible to combine predictions to estimate workflow execution time
  – Large workflows / poor block models are issues