Characterizing and Synthesizing Task Dependencies of Data-Parallel Jobs in Alibaba Cloud
Huangshi Tian, Yunchuan Zheng, Wei Wang @ HKUST
Nov. 21, 2019

Every job is born equal, but some are more complicated ...
  Hadoop
  Spark Ecosystem (MLlib, SQL, GraphX)

Do job DAGs have anything special?

A Glimpse into Production Clusters
In 2018, Alibaba released a trace that
  spans 8 days,
  records the activity of both long-running containers and batch jobs,
  comes from a cluster of 4034 machines.

Zoom in on Batch Jobs
Terminology
  task
  instance
  dependency
Dataset Scale
  4.2M jobs
  14.3M tasks
  1.4B instances
Applications
  SQL queries (90%)
  data analytics (10%)

Overview of DAG Jobs
  [Figure: Temporal Distribution]
  [Figure: Resource Consumption]
Takeaway: DAG jobs are prevalent and sometimes consume disproportionately many resources.

First Impression on Job DAGs: Trees Everywhere
  78.54% of all jobs are gather jobs;
  36.03% are scatter jobs;
  Within complex jobs, 81.68% of tasks can be decomposed into scatter or gather jobs.
Takeaway: There are opportunities for algorithmic scheduling.

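The slide does not define scatter and gather jobs, so the following is only a minimal sketch under assumed definitions: a gather job is taken to be a DAG in which no task has more than one child (results converge), and a scatter job one in which no task has more than one parent (results fan out). The function name `classify_tree` and the edge-list representation are hypothetical.

```python
from collections import Counter

def classify_tree(edges):
    """Return (is_gather, is_scatter) for a DAG given as (parent, child) pairs.

    Assumed definitions (not taken from the talk):
      gather  = no task has more than one child,
      scatter = no task has more than one parent.
    """
    children = Counter(parent for parent, _ in edges)  # out-degree per task
    parents = Counter(child for _, child in edges)     # in-degree per task
    is_gather = all(n <= 1 for n in children.values())
    is_scatter = all(n <= 1 for n in parents.values())
    return is_gather, is_scatter

# Three illustrative jobs: a fan-in, a fan-out, and a plain chain.
gather_job = [("m1", "r"), ("m2", "r"), ("m3", "r")]
scatter_job = [("s", "a"), ("s", "b")]
chain_job = [("a", "b"), ("b", "c")]
```

Under these assumed definitions a plain chain counts as both a gather and a scatter job, so the two categories are not mutually exclusive.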
Commonality or Peculiarity?
We introduce four datasets of DAGs for comparison:
  1. Alibaba DAGs extracted from the trace,
  2. Random DAGs generated by a uniformly random algorithm,
  3. TPC-DS DAGs from the namesake benchmark,
  4. TPC-H DAGs, likewise from the namesake benchmark.

Sparsity and Probable Cause
  Edge density defined as: (# dependencies) / (# possible dependencies)
  Chain ratio defined as: (# tasks with only one parent/child) / (# all tasks)
Takeaway: Job DAGs are sparse and have many chains.

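The two ratios can be sketched in a few lines. The slide leaves two details open, so this sketch makes two assumptions: the number of possible dependencies in a DAG of n tasks is taken as n(n-1)/2, and a "chain" task is taken as one with at most one parent and at most one child that is not isolated. The function names are hypothetical.

```python
from collections import Counter

def edge_density(num_tasks, edges):
    """# dependencies / # possible dependencies.

    Assumption: a DAG on n tasks admits at most n * (n - 1) / 2 edges.
    """
    possible = num_tasks * (num_tasks - 1) // 2
    return len(set(edges)) / possible if possible else 0.0

def chain_ratio(tasks, edges):
    """# tasks with only one parent/child / # all tasks.

    Assumption: a task counts if it has at most one parent, at most one
    child, and at least one dependency overall.
    """
    out_deg = Counter(parent for parent, _ in edges)
    in_deg = Counter(child for _, child in edges)
    chained = [
        t for t in tasks
        if in_deg[t] <= 1 and out_deg[t] <= 1 and in_deg[t] + out_deg[t] > 0
    ]
    return len(chained) / len(tasks) if tasks else 0.0

# A diamond-shaped job: a fans out to b and c, which both feed d.
tasks = ["a", "b", "c", "d"]
diamond = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
density = edge_density(len(tasks), diamond)
ratio = chain_ratio(tasks, diamond)
```

In the diamond, 4 of the 6 possible edges are present, and only `b` and `c` qualify as chain tasks.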
In- and Out-Degrees
Takeaway: A task can have many dependencies, but typically only a few children.

Shape of DAG
  Maximum Parallelism
  Critical Path
Takeaway: Production DAGs grow "wider" instead of "longer".

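Both shape metrics can be computed in one topological pass. This is a minimal sketch, assuming the critical path is measured as the number of tasks on the longest dependency chain and the maximum parallelism as the largest number of tasks sharing the same level, where a task's level is its longest distance from any source. The talk may define either metric differently, and `dag_shape` is a hypothetical name.

```python
from collections import Counter, defaultdict

def dag_shape(tasks, edges):
    """Return (critical_path_length, max_parallelism) for a non-empty DAG."""
    children = defaultdict(list)
    indegree = {t: 0 for t in tasks}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Kahn-style traversal: a task is visited only after all of its parents,
    # so its level is final by the time it is popped.
    level = {t: 1 for t in tasks}
    ready = [t for t in tasks if indegree[t] == 0]
    while ready:
        task = ready.pop()
        for child in children[task]:
            level[child] = max(level[child], level[task] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    critical_path = max(level.values())
    max_parallelism = max(Counter(level.values()).values())
    return critical_path, max_parallelism

diamond_shape = dag_shape(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
)
```

A "wider" DAG in this sense is one whose second value grows while the first stays bounded.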
Runtime Performance of DAG Jobs
Runtime Variability: troublemaker for cluster schedulers
  straggler tasks
  resource fragmentation
Measuring Variation
  dependent pair: ratio between metrics
  dependent set: geometric mean of all pairwise ratios

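The two variation measures can be sketched as follows. The slide does not say how the ratio is oriented, so this sketch assumes it is larger over smaller (always at least 1) and that metric values are positive. The function names are hypothetical.

```python
import math
from itertools import combinations

def pair_variation(x, y):
    """Ratio between the metrics of a dependent pair.

    Assumption: oriented as larger / smaller, so the result is >= 1.
    """
    return max(x, y) / min(x, y)

def set_variation(values):
    """Geometric mean of all pairwise ratios in a dependent set.

    Requires at least two positive values.
    """
    ratios = [pair_variation(x, y) for x, y in combinations(values, 2)]
    # Average the logs rather than multiplying, to avoid overflow on large sets.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical durations (in seconds) of three mutually dependent tasks.
durations = [1.0, 2.0, 4.0]
variation = set_variation(durations)
```

For the three durations above, the pairwise ratios are 2, 4 and 2, so the set-level variation is the cube root of 16.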
Does Dependency Constrain Runtime Variability?
  ... vary over 5x    Proportion
  Instance #          26.46%
  Duration            20.77%
  CPU Usage           1.89%
  Memory Usage        20.12%
Takeaway: Unfortunately, not that much.

Variability of "Recurrent" Jobs
We select "recurrent" jobs by the criteria of (1) isomorphic structures, (2) periodic submission intervals and (3) identical resource requests.
  ... vary over 2x    Proportion
  Instance #          69.25%
  Duration            75.69%
  CPU Usage           54.15%
  Memory Usage        57.61%
Takeaway: Recurrent tasks can have high variability.

How to Synthesize a DAG
  STEP I: Randomly draw a critical path length from the distribution.
  STEP II: Randomly decide how tasks are distributed along the path.
  STEP III: Randomly connect tasks on adjacent levels.
(Please refer to the paper for the evaluation results.)

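The three steps can be illustrated with a short sketch. It is not the authors' generator. It assumes the critical-path distribution is supplied as a list of lengths with weights, places surplus tasks on levels uniformly at random, and gives every non-root task one guaranteed parent on the previous level plus optional extra parents with probability `extra_edge_prob`. All names and parameters here are hypothetical.

```python
import random

def synthesize_dag(num_tasks, path_lengths, path_weights,
                   extra_edge_prob=0.1, rng=None):
    """Return (levels, edges) for one synthetic DAG with integer task ids."""
    rng = rng or random.Random()

    # STEP I: draw a critical path length from the supplied distribution.
    length = rng.choices(path_lengths, weights=path_weights, k=1)[0]
    length = min(length, num_tasks)  # cannot have more levels than tasks

    # STEP II: distribute tasks along the path. The first `length` tasks
    # seed one level each; the rest are placed uniformly at random (assumed).
    levels = [[] for _ in range(length)]
    for task in range(num_tasks):
        index = task if task < length else rng.randrange(length)
        levels[index].append(task)

    # STEP III: connect tasks on adjacent levels. One guaranteed parent keeps
    # the longest path equal to `length`; extra edges are optional (assumed).
    edges = []
    for upper, lower in zip(levels, levels[1:]):
        for child in lower:
            anchor = rng.choice(upper)
            edges.append((anchor, child))
            for parent in upper:
                if parent != anchor and rng.random() < extra_edge_prob:
                    edges.append((parent, child))
    return levels, edges

levels, edges = synthesize_dag(10, [2, 3, 4], [1, 1, 1], rng=random.Random(0))
```

Because edges only link adjacent levels and every level is non-empty, the longest path in the result has exactly as many tasks as there are levels.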
Trace Generator
  No need to manipulate 200GB+ of raw data.
  Flexibly control the duration, load and heterogeneity of the trace.

Summary
Structural Properties of Job DAGs, ...
  sparse
  "bounded" critical path
  increasing parallelism
Runtime Performance, ...
  salient variability
  ... even among recurrent tasks
and Trace Generator