Computer Sciences 368 Scripting for CHTC
Day 11: Workflows with DAGMan
2012 Spring, Cartwright

Suggested reading: Condor 7.7 Manual, http://www.cs.wisc.edu/condor/manual/v7.7/
• Section 2.10: DAGMan Applications
• Chapter 9: condor_submit_dag

Turn In Homework

Homework Review

Workflows

Introduction to Workflow
• Series of related steps to complete a complex task
[Figure: flowchart with steps "Attend class", "Review slides", "Write code", and "Print code", plus two "OK?" decisions with YES/NO branches]
• Organize, manage, and make a process reliable
• Important in science, where repeatability is key

Workflow Components
• Workflows are essentially algorithmic!
• Steps
  – Prerequisites and inputs
  – Process (black box / white box)
  – Outputs
• Connections
  – Sequence
  – Branching
  – Parallelism
• Metadata: resources, owners, timing, etc.

Workflow Example I
[Figure: Bioinformatics @ Yale workflow; credit: C. Mason, S. Sanders, M. State]

Workflow Example II
[Figure: LEAD Weather Forecasting workflow. Numbered steps 1–15 include terrain and surface data preprocessors, interpolators, radar and satellite data re-mappers, ADAS, WRF, the ARPS plotting program, IDV, and ADAM. Inputs are marked as static data or real time data. Stages: Initialization, Forecast, Visualization, Analysis, Data Mining. Notes on the figure: run once per forecast region; repeat periodically for new data; visualization on user's request; data mining looks for a storm signature and triggers if a storm is detected.]

Automated Workflows
• Ideally, we want to automate workflows
  – Minimize wait times and (certain kinds of) errors
  – Allow humans to concentrate on design and results
• Broad objectives:
  – Capture whole workflow
  – Define steps clearly
  – Identify easy automation
    ✦ Copying files
    ✦ Changing data formats
    ✦ Running jobs!
  – Balance costs vs. savings!

Workflows in CHTC

Directed Acyclic Graphs (DAGs)
• Abstract, formal definition of allowable workflows
• Terminology
  – Step (typically, a job) = Node
  – Connection is directed: Parent → Child, read as “… must succeed before running …”
  – No loops (or cycles, hence acyclic)
  – Each node may have 0–n children
  – Each node may have 0–n parents
[Figure: a Parent node with an arrow to a Child node]
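
The "no cycles" rule can be checked mechanically. The following is a minimal sketch, not part of DAGMan itself: it uses Kahn's algorithm to produce an order in which every parent comes before its children, and raises an error when the graph has a cycle. The function name `topological_order` and the node names A–D are hypothetical, chosen only for illustration.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return the nodes in an order where every parent precedes its children.

    nodes: iterable of node names
    edges: iterable of (parent, child) pairs
    Raises ValueError if the graph contains a cycle (so it is not a DAG).
    """
    nodes = list(nodes)
    children = {n: [] for n in nodes}
    unmet_parents = {n: 0 for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        unmet_parents[child] += 1

    # Nodes with no parents are ready to run immediately.
    ready = deque(n for n in nodes if unmet_parents[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # Finishing a node satisfies one parent of each of its children.
        for child in children[node]:
            unmet_parents[child] -= 1
            if unmet_parents[child] == 0:
                ready.append(child)

    # Any node never reached is stuck behind a cycle.
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


# Hypothetical diamond-shaped graph: A feeds B and C, which both feed D.
diamond_order = topological_order(
    ["A", "B", "C", "D"],
    [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
)
```

A node becomes runnable only when its count of unfinished parents reaches zero, which mirrors the "must succeed before running" reading of each connection.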

Example DAG Shapes
• Disconnected: A, B, C, D (no connections)
• Linear / Serial: A → B → C → D
• Diamond: A → B and C; B and C → D

A Real Scientific DAG
[Figure: DAG from the Laser Interferometer Gravitational-wave Observatory (LIGO), with dozens of interconnected nodes in the families datafind_*, tmplt_*, insp_*, trigbank_*, and inca_*]

Condor DAGMan
• DAGMan: Directed Acyclic Graph Manager
• Organize Condor jobs into a DAG
• Condor handles all details of running workflow
  – Submits individual jobs when appropriate
  – Tracks overall workflow
  – Can retry failed nodes and resume failed workflow
  – Can limit amount of work done at once
• DAGs up to 1,000,000 nodes have been run!

DAGMan Nodes I
Node parts: Pre-Script → Job (Cluster) → Post-Script
• Pre-Script: prepare data, check prerequisites, skip node
• Post-Script: clean up files, check success

DAGMan Nodes II
• Order of execution
  1. Pre-script on submit machine
  2. Job(s) on pool
  3. Post-script on submit machine
• Failure handling
  – Pre-script exit ≠ 0: Skip job, run post-script (if any)
  – Any job exit ≠ 0: Run post-script (if any)
  – Last exit status determines success/failure of node
• Make sure scripts exit 0 upon success!
• Can skip job & post on given pre-script exit status
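
The failure-handling rules above can be traced with a small model. This is only a sketch of the rules as stated on the slide, not DAGMan's actual implementation; the function `node_result` and its parameters are hypothetical. It does not model the last bullet (skipping job and post-script on a chosen pre-script exit status).

```python
def node_result(pre_exit=None, job_exits=(), post_exit=None):
    """Model one node's outcome under the slide's rules.

    pre_exit / post_exit: exit status of the script, or None if the node
    has no such script. job_exits: exit statuses of the node's job(s).
    Returns (succeeded, steps_run).
    """
    steps_run = []
    last_status = 0
    run_job = True

    if pre_exit is not None:
        steps_run.append("pre")
        last_status = pre_exit
        if pre_exit != 0:
            run_job = False  # pre-script failed: skip the job

    if run_job:
        steps_run.append("job")
        # Any job exiting non-zero counts as a job failure.
        last_status = next((code for code in job_exits if code != 0), 0)

    if post_exit is not None:
        # The post-script (if any) runs even after a failure.
        steps_run.append("post")
        last_status = post_exit

    # The last exit status decides success or failure of the node.
    return last_status == 0, steps_run


# A failed job "rescued" by a post-script that exits 0.
rescued = node_result(job_exits=[0, 3], post_exit=0)
```

Under this model a post-script that always exits 0 hides job failures, which is why the slide stresses that scripts should exit 0 only upon success.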

DAGMan Files

Basic DAGMan Submit File

    # Define nodes
    JOB First first.sub
    JOB Analyze1 stats-1.sub
    JOB Analyze2 stats-2.sub
    JOB Sum collate.sub
    SCRIPT PRE Sum verify-all.py 2

    # Define connections
    PARENT First CHILD Analyze1 Analyze2
    PARENT Analyze1 Analyze2 CHILD Sum

• Lines starting with # are comments
• JOB: declare node name and its Condor submit file
• SCRIPT: define pre/post scripts for nodes
• PARENT … CHILD: define node connections
• Resulting DAG: First → Analyze1 and Analyze2 → Sum
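
Since this course is about scripting, a file like this can also be generated rather than typed by hand. The sketch below builds the same text from Python data structures; the function `dag_text` and its argument layout are hypothetical, and only the JOB, SCRIPT PRE, and PARENT … CHILD line formats come from the slide.

```python
def dag_text(jobs, pre_scripts, dependencies):
    """Build the text of a DAGMan submit file.

    jobs: list of (node_name, submit_file) pairs
    pre_scripts: list of (node_name, command_line) pairs
    dependencies: list of (parent_names, child_names) pairs
    """
    lines = ["# Define nodes"]
    for name, submit_file in jobs:
        lines.append(f"JOB {name} {submit_file}")
    for name, command in pre_scripts:
        lines.append(f"SCRIPT PRE {name} {command}")

    lines.append("# Define connections")
    for parents, children in dependencies:
        parent_list = " ".join(parents)
        child_list = " ".join(children)
        lines.append(f"PARENT {parent_list} CHILD {child_list}")
    return "\n".join(lines) + "\n"


# Recreate the example from the slide.
text = dag_text(
    jobs=[
        ("First", "first.sub"),
        ("Analyze1", "stats-1.sub"),
        ("Analyze2", "stats-2.sub"),
        ("Sum", "collate.sub"),
    ],
    pre_scripts=[("Sum", "verify-all.py 2")],
    dependencies=[
        (["First"], ["Analyze1", "Analyze2"]),
        (["Analyze1", "Analyze2"], ["Sum"]),
    ],
)
```

The resulting text could be written to a file and handed to condor_submit_dag; generating it this way pays off when the number of nodes depends on the input data.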

Define a Job

    JOB name submit-file

• One per node
• Defines node's name, unique within this DAG
• Associated with a Condor submit-file
• Job must yield 1 cluster; may have many processes

Examples:

    JOB Collate collate.sub
    JOB Rjob3 run-r-3.sub

Define Dependencies

    PARENT parent1 p2 … CHILD child1 c2 …

• Defines the "lines" (dependencies) between nodes
• Parent and child names are node names (cf. JOB)
• EACH child depends on ALL parents

Example:

    PARENT p1 p2 p3 CHILD c1 c2

[Figure: p1, p2, and p3 each connected to both c1 and c2]
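
To make "EACH child depends on ALL parents" concrete, the sketch below expands one PARENT … CHILD line into its individual parent-to-child connections. The helper `expand_dependency` is hypothetical and handles only the simple uppercase form shown on the slide.

```python
def expand_dependency(line):
    """Expand one 'PARENT ... CHILD ...' line into (parent, child) pairs."""
    words = line.split()
    if not words or words[0] != "PARENT" or "CHILD" not in words:
        raise ValueError("expected: PARENT p1 [p2 ...] CHILD c1 [c2 ...]")

    split_at = words.index("CHILD")
    parents = words[1:split_at]
    children = words[split_at + 1:]
    if not parents or not children:
        raise ValueError("need at least one parent and one child")

    # Every child depends on every parent, so take all combinations.
    return [(parent, child) for parent in parents for child in children]


edges = expand_dependency("PARENT p1 p2 p3 CHILD c1 c2")
```

Three parents and two children give 3 × 2 = 6 connections, so a single line can stand in for many individual dependencies.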