
Computer Sciences 368: Scripting for CHTC
Day 11: Workflows with DAGMan
Suggested reading: Condor 7.7 Manual (http://www.cs.wisc.edu/condor/manual/v7.7/)
  – Section 2.10: DAGMan Applications
  – Chapter 9: condor_submit_dag
2012 Spring, Cartwright


  1. Day 11: Workflows with DAGMan (title slide)

  2. Turn In Homework

  3. Homework Review

  4. Workflows

  5. Introduction to Workflow
     • Series of related steps to complete a complex task
       [Figure: flowchart with the steps "Review slides", "Attend class", "Write code", and "Print code", plus two "OK?" decision points with YES/NO branches]
     • Organize, manage, and make a process reliable
     • Important in science, where repeatability is key

  6. Workflow Components
     • Workflows are essentially algorithmic!
     • Steps
       – Prerequisites and inputs
       – Process (black box / white box)
       – Outputs
     • Connections
       – Sequence
       – Branching
       – Parallelism
     • Metadata: resources, owners, timing, etc.

  7. Workflow Example I
     Bioinformatics @ Yale: C. Mason, S. Sanders, M. State

  8. Workflow Example II
     [Figure: LEAD Weather Forecasting workflow with steps numbered 1 to 15, including terrain and surface data preprocessors, radar and satellite data re-mappers, interpolators, ADAS, ARPS, WRF, an ensemble generator, IDV, ADAM, and an ARPS plotting program. Inputs are marked as static data or real-time data; steps are grouped as Initialization, Forecast, Visualization, Analysis, and Data Mining.]

  9. Automated Workflows
     • Ideally, we want to automate workflows
       – Minimize wait times and (certain kinds of) errors
       – Allow humans to concentrate on design and results
     • Broad objectives:
       – Capture whole workflow
       – Define steps clearly
       – Identify easy automation
         ✦ Copying files
         ✦ Changing data formats
         ✦ Running jobs!
       – Balance costs vs. savings!

  10. Workflows in CHTC

  11. Directed Acyclic Graphs (DAGs)
      • Abstract, formal definition of allowable workflows
      • Terminology
        – Step (typically, a job) = Node
        – Connection is directed: Parent → Child (“… must succeed before running …”)
        – No loops (or cycles, hence acyclic)
        – Each node may have 0–n children
        – Each node may have 0–n parents
      [Figure: a Parent node with an arrow pointing to a Child node]
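The "no cycles" rule can be checked mechanically. Below is a minimal Python sketch, assuming the graph is given as a list of (parent, child) pairs; the function name `topological_order` is made up for this illustration and is not part of Condor. It uses Kahn's algorithm: repeatedly run whatever has no unfinished parents, and report a cycle if some nodes can never run.

```python
from collections import deque

def topological_order(edges):
    """Return nodes so that every parent comes before its children.

    `edges` is a list of (parent, child) pairs. Raises ValueError if the
    graph contains a cycle, i.e. it is not a valid DAG.
    """
    children = {}       # node -> list of its children
    parent_count = {}   # node -> number of parents not yet finished
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
        parent_count.setdefault(parent, 0)
        parent_count[child] = parent_count.get(child, 0) + 1

    # Nodes with no parents are ready to run immediately.
    ready = deque(sorted(n for n, count in parent_count.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            parent_count[child] -= 1
            if parent_count[child] == 0:
                ready.append(child)

    # Any node left over is stuck waiting on a cycle.
    if len(order) != len(children):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

print(topological_order([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]))
```

The returned order is one valid serial schedule; a real workflow manager could run nodes that become ready at the same time in parallel.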

  12. Example DAG Shapes
      • Disconnected: A, B, C, D (no connections)
      • Linear / Serial: A → B → C → D
      • Diamond: A → B, A → C, B → D, C → D
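One way to write these shapes down in a script is as a mapping from each node to its children. The sketch below assumes the linear shape runs A, B, C, D in that order and that the diamond has A at the top and D at the bottom; `SHAPES` and `roots_and_leaves` are illustrative names only.

```python
# Each shape maps a node name to the list of its children.
SHAPES = {
    "disconnected": {"A": [], "B": [], "C": [], "D": []},
    "linear": {"A": ["B"], "B": ["C"], "C": ["D"], "D": []},
    "diamond": {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []},
}

def roots_and_leaves(dag):
    """Return (nodes with no parents, nodes with no children), each sorted."""
    has_parent = {child for kids in dag.values() for child in kids}
    roots = sorted(node for node in dag if node not in has_parent)
    leaves = sorted(node for node, kids in dag.items() if not kids)
    return roots, leaves

for name, dag in SHAPES.items():
    roots, leaves = roots_and_leaves(dag)
    print(f"{name}: can start with {roots}, finishes with {leaves}")
```

In the disconnected shape every node is both a root and a leaf, so all four could run at once; in the linear shape only one node can run at a time.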

  13. A Real Scientific DAG
      [Figure: a DAG from the Laser Interferometer Gravitational-wave Observatory (LIGO) with dozens of nodes named datafind_*, tmplt_*, insp_*, trigbank_*, and inca_*]

  14. Condor DAGMan
      • DAGMan: Directed Acyclic Graph Manager
      • Organize Condor jobs into a DAG
      • Condor handles all details of running workflow
        – Submits individual jobs when appropriate
        – Tracks overall workflow
        – Can retry failed nodes and resume failed workflow
        – Can limit amount of work done at once
      • DAGs up to 1,000,000 nodes have been run!

  19. DAGMan Nodes I
      • Pre-Script: prepare data, check prerequisites, skip node
      • Job (Cluster)
      • Post-Script: clean up files, check success

  20. DAGMan Nodes II
      • Order of execution
        1. Pre-script on submit machine
        2. Job(s) on pool
        3. Post-script on submit machine
      • Failure handling
        – Pre-script exit ≠ 0: Skip job, run post-script (if any)
        – Any job exit ≠ 0: Run post-script (if any)
        – Last exit status determines success/failure of node
      • Make sure scripts exit 0 upon success!
      • Can skip job & post on given pre-script exit status

  21. DAGMan Files
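The failure-handling rules are easier to follow as a small model. The Python sketch below encodes only the rules listed on this slide; it is not DAGMan code, and `run_node` and its parameters are made-up names. It also assumes that a node skipped through the pre-script exit status counts as successful, which is my reading of the Condor manual rather than something the slide states.

```python
def run_node(pre_exit=None, job_exit=0, post_exit=None, pre_skip=None):
    """Model one node; return (steps_run, node_succeeded).

    pre_exit / post_exit: exit status of the script, or None if the node
    has no such script. job_exit: exit status of the job. pre_skip: the
    pre-script exit status that skips both the job and the post-script.
    """
    steps = []
    last = 0
    if pre_exit is not None:
        steps.append("pre")
        last = pre_exit
        if pre_skip is not None and pre_exit == pre_skip:
            # Assumption: a deliberately skipped node counts as a success.
            return steps, True
    if pre_exit is None or pre_exit == 0:
        # The job runs only when there is no pre-script or it exited 0.
        steps.append("job")
        last = job_exit
    if post_exit is not None:
        # The post-script runs even after a failed pre-script or job.
        steps.append("post")
        last = post_exit
    # The last exit status decides whether the node succeeded.
    return steps, last == 0

print(run_node(job_exit=1))               # failed job, no post-script
print(run_node(job_exit=1, post_exit=0))  # post-script exit decides
print(run_node(pre_exit=1, post_exit=0))  # job skipped, post-script runs
```

The second call shows why post-scripts need care: a post-script that always exits 0 hides a failed job from the rest of the workflow.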

  22. Basic DAGMan Submit File

          # Define nodes
          JOB First first.sub
          JOB Analyze1 stats-1.sub
          JOB Analyze2 stats-2.sub
          JOB Sum collate.sub
          SCRIPT PRE Sum verify-all.py 2

          # Define connections
          PARENT First CHILD Analyze1 Analyze2
          PARENT Analyze1 Analyze2 CHILD Sum

      • Lines starting with # are comments
      • JOB: declare node name and its Condor submit file
      • SCRIPT: define pre/post scripts for nodes
      • PARENT … CHILD: define node connections
      • Resulting DAG: First → Analyze1 and Analyze2 → Sum
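Because a DAG file is plain text, a script can generate it. The Python sketch below rebuilds the example file from lists of nodes and connections; `make_dag_text` is a hypothetical helper written for this illustration, and it covers only the JOB, SCRIPT PRE, and PARENT/CHILD lines shown on the slide.

```python
def make_dag_text(jobs, pre_scripts, connections):
    """Build DAG file text.

    jobs: list of (node_name, submit_file)
    pre_scripts: list of (node_name, command_with_arguments)
    connections: list of (parent_names, child_names)
    """
    lines = ["# Define nodes"]
    for name, submit_file in jobs:
        lines.append(f"JOB {name} {submit_file}")
    for name, command in pre_scripts:
        lines.append(f"SCRIPT PRE {name} {command}")
    lines.append("# Define connections")
    for parents, children in connections:
        lines.append(f"PARENT {' '.join(parents)} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"

dag_text = make_dag_text(
    jobs=[
        ("First", "first.sub"),
        ("Analyze1", "stats-1.sub"),
        ("Analyze2", "stats-2.sub"),
        ("Sum", "collate.sub"),
    ],
    pre_scripts=[("Sum", "verify-all.py 2")],
    connections=[
        (["First"], ["Analyze1", "Analyze2"]),
        (["Analyze1", "Analyze2"], ["Sum"]),
    ],
)
print(dag_text)
```

The text could then be saved to a file and handed to condor_submit_dag, the command covered in the suggested reading.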

  28. Define a Job

          JOB name submit-file

      • One per node
      • Defines node’s name, unique within this DAG
      • Associated with a Condor submit-file
      • Job must yield 1 cluster; may have many processes

          JOB Collate collate.sub
          JOB Rjob3 run-r-3.sub

  29. Define Dependencies

          PARENT parent1 p2 … CHILD child1 c2 …

      • Defines the “lines” (dependencies) between nodes
      • Parent and child names are node names (cf. JOB)
      • EACH child depends on ALL parents

          PARENT p1 p2 p3 CHILD c1 c2

      [Figure: parents p1, p2, p3, each connected to both children c1 and c2]
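"EACH child depends on ALL parents" means one line stands for every parent/child pairing. A minimal Python sketch of that expansion follows; `expand_dependencies` is an illustrative name, and the parsing is simplified (it assumes the keywords are written in upper case, exactly as on the slide).

```python
def expand_dependencies(line):
    """Turn one 'PARENT ... CHILD ...' line into (parent, child) pairs."""
    words = line.split()
    if not words or words[0] != "PARENT" or "CHILD" not in words:
        raise ValueError("expected: PARENT <names> CHILD <names>")
    split_at = words.index("CHILD")
    parents = words[1:split_at]
    children = words[split_at + 1:]
    if not parents or not children:
        raise ValueError("need at least one parent and one child")
    # Every parent is paired with every child.
    return [(parent, child) for parent in parents for child in children]

for parent, child in expand_dependencies("PARENT p1 p2 p3 CHILD c1 c2"):
    print(f"{parent} -> {child}")
```

Three parents and two children give six dependencies, so c1 and c2 each wait for p1, p2, and p3.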
