A Network-aware Scheduler in Data-parallel Clusters for High Performance
Zhuozhao Li, Haiying Shen and Ankur Sarker
Department of Computer Science, University of Virginia
May 2018
Introduction
• Data-parallel clusters
  • Used to process large datasets efficiently
  • Deployed in many large organizations, e.g., Facebook, Google and Yahoo!
  • Shared by users from different groups
Motivations
• Network-intensive stages in data-parallel jobs

[1] M. Chowdhury, Y. Zhong, and I. Stoica. "Efficient coflow scheduling with Varys." In: Proc. of SIGCOMM, 2014.
MapReduce
[Figure: a MapReduce job — input blocks feed map tasks in the map stage; the map output is shuffled across the network to reduce tasks in the reduce stage, which write the output]
Motivations
• Network-intensive stage
  • E.g., 60% and 20% of the jobs on the Yahoo! and Facebook clusters, respectively, are reported to be shuffle-heavy
  • Jobs with large shuffle data sizes generate a large amount of network traffic
  • More than 50% of time is spent in network communication [2]
• Problem: a large number of shuffle-heavy jobs may cause a bottleneck on the cross-rack network
  • Oversubscribed network from rack to core in datacenters
  • Oversubscription ratios ranging from 3:1 to 20:1
  • Nearly 50% of cross-rack bandwidth used by background transfers [1]

[1] M. Chowdhury, Y. Zhong, and I. Stoica. "Efficient coflow scheduling with Varys." In: Proc. of SIGCOMM, 2014.
Outline
• Introduction
• Related Work
• Network-Aware Scheduler Design (NAS)
• Evaluation
• Conclusion
Related Work – Fair and Delay Schedulers
[Figure: map input data, map tasks and reduce tasks placed across Rack 1, Rack 2 and Rack 3]
• Place each map task close to its input data – data locality
• Problem: place the reduce tasks randomly
Related Work – ShuffleWatcher (ATC'14)
[Figure: map input data, map tasks and reduce tasks placed across Rack 1, Rack 2 and Rack 3]
• Pre-computes the map and reduce placement and attempts to place map and reduce tasks on the same racks to minimize cross-rack traffic
Related Work – ShuffleWatcher (ATC'14)
[Figure: map input data, map tasks and reduce tasks placed across Rack 1, Rack 2 and Rack 3]
• Problem: reduces the cross-rack shuffle traffic at the cost of reading remote map input data
Related Work – ShuffleWatcher (ATC'14)
[Figure: map input data, map tasks and reduce tasks placed across Rack 1, Rack 2 and Rack 3]
• Problem: resource contention on the racks – both intra-job and inter-job
Outline
• Introduction
• Related Work
• Network-Aware Scheduler Design (NAS)
• Evaluation
• Conclusion
Challenges
• Network-aware scheduler
  • How to reduce cross-rack congestion
  • How to reduce cross-rack traffic
• Idea
  • The network is not saturated at all times
  • Design schedulers that place tasks to
    • Balance the network load
    • Consider shuffle data locality in addition to input data locality
Network-Aware Scheduler (NAS)
• Map task scheduling (MTS)
  • Balances the network load
• Congestion-avoidance reduce task scheduling (CA-RTS)
  • Considers shuffle data locality to reduce cross-rack traffic
  • Adaptively adjusts the map completion threshold of jobs based on their shuffle data sizes
• Congestion-reduction reduce task scheduling (CR-RTS)
  • Reduces cross-rack congestion when the network is congested
Map Task Scheduling (MTS)
• Goal: balance the network load
• Set a TrafficThreshold for each node
  • A node cannot process more shuffle data than this threshold at one time
  • Constrains the shuffle data size generated at a time
• Map task scheduling considers
  • Map input data locality and fairness
  • Whether the generated shuffle data size on a node would exceed the TrafficThreshold after placing a task (see the sketch below)
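A minimal sketch of the MTS placement check described above. The data structures and helper names (node.current_shuffle_bytes, task.expected_shuffle_bytes, local_blocks) are illustrative assumptions, not the actual Hadoop implementation.

```python
# Sketch of the MTS placement check (names are illustrative assumptions).
def can_place_map_task(node, task, traffic_threshold):
    """Return True if placing `task` on `node` keeps the node's
    in-flight shuffle data within the per-node TrafficThreshold."""
    projected = node.current_shuffle_bytes + task.expected_shuffle_bytes
    return projected <= traffic_threshold

def schedule_map_task(nodes, task, traffic_threshold):
    # Prefer nodes holding the task's input block (input data locality),
    # then fall back to any node that passes the traffic check.
    local_nodes = [n for n in nodes if task.input_block in n.local_blocks]
    for candidates in (local_nodes, nodes):
        for node in candidates:
            if can_place_map_task(node, task, traffic_threshold):
                return node
    return None  # defer the task to a later scheduling round
```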
Map Task Scheduling (MTS)
• Setting the TrafficThreshold (sketched below)
  • Can be changed based on the workloads
  • Distributes the shuffle data evenly across the waves
• Task wave
  • Number of tasks >> number of containers
  • Tasks scheduled to all available containers form the first wave; then the second wave, the third wave, ...
• TrafficThreshold = TS / (N × W)
  • TS – total shuffle data size of jobs in the cluster
  • N – total number of nodes in the cluster
  • W – number of waves: total number of map tasks / total number of containers
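A small sketch of the TrafficThreshold computation above, assuming the scheduler can observe the total shuffle data size, node count, map task count and container count; the function name and example numbers are illustrative.

```python
import math

def traffic_threshold(total_shuffle_bytes, num_nodes, num_map_tasks, num_containers):
    """TrafficThreshold = TS / (N * W), where W is the number of task waves."""
    waves = max(1, math.ceil(num_map_tasks / num_containers))
    return total_shuffle_bytes / (num_nodes * waves)

# Example: 240 GB of total shuffle data, 40 nodes, 2400 map tasks, 600 containers
# -> 4 waves, so each node handles about 1.5 GB of shuffle data per wave.
print(traffic_threshold(240e9, 40, 2400, 600))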
Map Task Scheduling (MTS) – Example
• User 1
  • Job 1: 6 map tasks and 6 reduce tasks
  • Job 3: 6 map tasks and 6 reduce tasks
• User 2
  • Job 2: 6 map tasks and 6 reduce tasks
  • Job 4: 6 map tasks and 6 reduce tasks
• Each map task sends shuffle data to each reduce task
  • Job 1 and Job 2: 8
  • Job 3 and Job 4: 1
Congestion-Avoidance Reduce Task Scheduling (CA-RTS)
• Check the network status against a CongestionThreshold (e.g., 80% of the cross-rack bandwidth)
• Used when the CongestionThreshold is NOT reached
• Goal: reduce cross-rack traffic
  • A rack that holds more shuffle data of a job is assigned more reduce tasks of that job, reducing cross-rack traffic
• The number of reduce tasks of a job scheduled on a rack does not exceed ReduceNum
  • ReduceNum = TotalReduceNum × MapOutputPortion
  • Example: with 10 reduce tasks, a rack holding 70% of the map output gets 7 reduce tasks and a rack holding 30% gets 3 (see the sketch below)
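A hedged sketch of the CA-RTS per-rack quota, based on ReduceNum = TotalReduceNum × MapOutputPortion; the rounding policy and the data structures are assumptions for illustration.

```python
def reduce_quota_per_rack(total_reduce_tasks, map_output_bytes_per_rack):
    """Cap the reduce tasks of a job per rack in proportion to the
    fraction of the job's map output already stored on that rack."""
    total_output = sum(map_output_bytes_per_rack.values())
    quotas = {}
    for rack, out_bytes in map_output_bytes_per_rack.items():
        portion = out_bytes / total_output if total_output else 0.0
        quotas[rack] = round(total_reduce_tasks * portion)
    return quotas

# Example from the slide: 10 reduce tasks, 70% of the map output on rack1, 30% on rack2
print(reduce_quota_per_rack(10, {"rack1": 70, "rack2": 30}))  # {'rack1': 7, 'rack2': 3}
```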
Congestion-Reduction Reduce Task Scheduling (CR-RTS)
• Used when the CongestionThreshold is reached
• Goal: reduce cross-rack network congestion
• Launch a reduce task from a shuffle-light job (sketched below)
  • Small shuffle data size
  • Minimal impact on the cross-rack traffic
• If the current user has no such task, search the next user until a reduce task from a shuffle-light job is found
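A minimal sketch of the CR-RTS selection loop; the user ordering, the shuffle-light test, and the field names are assumptions made for illustration.

```python
def pick_reduce_task_cr_rts(users, shuffle_light_bytes):
    """When the cross-rack network is congested, pick a pending reduce
    task from a shuffle-light job, scanning the users in scheduling order."""
    for user in users:                    # e.g., ordered by fair-share deficit
        for job in user.jobs:
            if job.shuffle_bytes <= shuffle_light_bytes and job.pending_reduce_tasks:
                return job.pending_reduce_tasks[0]
    return None  # no shuffle-light reduce task available; wait for the congestion to ease
```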
Outline
• Introduction
• Related Work
• Network-Aware Scheduler Design (NAS)
• Evaluation
• Conclusion
Evaluation
• Real cluster experiment
  • Throughput
  • Average job completion time
  • Cross-rack congestion
  • Cross-rack traffic
  • Sensitivity analysis
• Simulation study
Evaluation
• Real cluster experiment
  • 40-node cluster organized into 8 racks, 5 nodes per rack
  • The 8 racks are interconnected by a core switch
  • Oversubscription ratio of 5:1 from rack to core
• Workload
  • 200 jobs from the Facebook synthesized execution framework [1]
• Baselines
  • Fair Scheduler (current scheduler in Hadoop)
  • Delay Scheduler (current scheduler in Hadoop)
  • ShuffleWatcher (ATC'14)

[1] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz. "The Case for Evaluating MapReduce Performance Using Workload Suites." In: Proc. of MASCOTS, 2011.
Throughput
[Bar chart: normalized throughput — Fair: 1, Delay: 1.1, ShuffleWatcher: 1.24, NAS: 1.63]
• NAS improves the throughput over Fair, Delay and ShuffleWatcher by 63%, 48% and 31%, respectively
Average Job Completion Time
[Bar chart: normalized average job completion time — Fair: 1, Delay: 0.89, ShuffleWatcher: 0.83, NAS: 0.56]
• NAS reduces the average job completion time over Fair, Delay and ShuffleWatcher by 44%, 37% and 33%, respectively
Cross-rack Congestion
[Bar chart: normalized total number of occurrences of cross-rack congestion — Fair: 1, Delay: 0.91, ShuffleWatcher: 0.84, NAS: 0.55]
• NAS reduces the cross-rack congestion over Fair, Delay and ShuffleWatcher by 45%, 40% and 34%, respectively
Conclusion
We can improve the performance of current state-of-the-art schedulers (e.g., the Fair and Delay schedulers in Hadoop) by
• balancing the network traffic and enforcing data locality for shuffle data,
• aggregating the data transfers to efficiently exploit the optical circuit switch in a hybrid electrical/optical datacenter network while still guaranteeing parallelism,
• and adaptively scheduling each job to either scale-up or scale-out machines, whichever benefits the job the most, in a hybrid scale-up/out cluster.
Zhuozhao Li
Ph.D. Candidate
Department of Computer Science, University of Virginia
ZL5UQ@VIRGINIA.edu
Backup
Shuffle Data Size Predictor
• MapOutput = (map output/input ratio) × MapInput (sketched below)
• Handles both unpredicted and predicted jobs
• The ratio is updated in real time
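A hedged sketch of the shuffle-size predictor described on this slide, assuming the output/input ratio starts from a default for unpredicted jobs and is refined online as map tasks finish; the class and field names are illustrative, not the authors' implementation.

```python
class ShuffleSizePredictor:
    """Predict MapOutput = (output/input ratio) * MapInput,
    updating the ratio in real time from completed map tasks."""
    def __init__(self, default_ratio=1.0):
        self.ratio = default_ratio      # used while a job is still "unpredicted"
        self.observed_in = 0
        self.observed_out = 0

    def update(self, map_input_bytes, map_output_bytes):
        # Refine the ratio as map tasks of the job complete.
        self.observed_in += map_input_bytes
        self.observed_out += map_output_bytes
        if self.observed_in > 0:
            self.ratio = self.observed_out / self.observed_in

    def predict(self, map_input_bytes):
        return self.ratio * map_input_bytes
```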