
Improving Preemptive Scheduling with Application-Transparent Checkpointing in Shared Clusters
Jack Li, Calton Pu (Georgia Institute of Technology); Yuan Chen, Vanish Talwar, Dejan Milojicic (Hewlett Packard Labs)


  1. Improving Preemptive Scheduling with Application-Transparent Checkpointing in Shared Clusters. Jack Li, Calton Pu (Georgia Institute of Technology); Yuan Chen, Vanish Talwar, Dejan Milojicic (Hewlett Packard Labs)

  2. Shared Clusters for Big Data
     – Systems: dynamic resource sharing across multiple frameworks, applications and users
     – Examples: Google cluster (Omega), Mesos, Hadoop YARN, Bing's Dryad
     – Benefits: improved utilization and data sharing, reduced cost
     [Figure: dedicated clusters (one framework per cluster: batch/MR, streaming/Storm, online/Vertica) vs. a shared cluster, where a cluster manager (e.g., YARN, Mesos) runs batch (MR), streaming (Storm), in-memory (Spark), graph (Giraph) and online (Vertica) workloads on shared hardware]

  3. Preemption in Shared Clusters
     – Preemption is used to coordinate resource sharing, guarantee QoS and enforce fairness
     – Problem: preemption in shared clusters is expensive!
       – Tasks are simply killed and restarted later
       – Significant resource waste
       – Delays the completion of long-running or low-priority jobs

  4. Real World Examples
     29-day trace from Google: 672,000 jobs on 12,500 machines
     [Figure: preemption rate (%) over the 29-day timeline, broken out by low-, medium- and high-priority tasks; many tasks are preempted]
     Evictions by task priority:
       Task priority       Num. of tasks   Percent evicted
       Free (0-1)          28.4M           20.26%
       Middle (2-8)        17.3M           0.55%
       Production (9-11)   1.70M           1.02%
     Evictions by latency sensitivity (even latency-sensitive tasks are evicted):
       Latency sensitivity   Num. of tasks   Percent evicted
       0 (lowest)            37.4M           11.76%
       1                     5.94M           18.87%
       2                     3.70M           8.14%
       3 (highest)           0.28M           14.80%
     – Google cluster: 12.4% of scheduled tasks preempted and up to 30k CPU-hours (35% of total capacity) wasted!
     – Microsoft Dryad cluster [1]: ~21% of jobs killed
     – Facebook Hadoop cluster [2]: long-running jobs repeatedly killed and restarted
     [1] Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. Ananthanarayanan et al. EuroSys 2011.
     [2] Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. Cheng et al. CNSM 2011.

  5. Real World Examples (continued)
     Same Google-trace statistics as the previous slide, plus the preemption frequency distribution:
     [Figure: number of distinct tasks (thousands) vs. number of preemptions (1 to >10); 43% of preempted tasks are preempted more than once, and 17% ten times or more]

  6. Checkpointing-based Preemptive Scheduling
     Our solution: use checkpoint/restore for preemption instead of kill/restart
     – Use a system-level, application-transparent checkpointing mechanism
       – Linux CRIU (Checkpoint/Restore In Userspace)
       – Distributed and remote checkpoint/restart
     – Leverage fast storage such as NVM for efficient checkpointing
       – Store checkpoints on NVM (NVM file system or NVRAM)
     – Adaptive preemption policies and optimization techniques
       – Combine checkpoint and kill, local and remote checkpointing/resumption
       – Incremental checkpointing with memory trackers

  7. Application-transparent Suspend-Resume
     Checkpointing using CRIU (Checkpoint/Restore In Userspace)
     – Freezes a running program and suspends it in memory or dumps it to disk
     – Saves sockets, threads, namespaces, memory mappings, pipes
     Dump (a sketch of the process-tree walk follows this slide)
     – Build the process tree from /proc/$pid/task/$tid/children and seize the processes with ptrace
     – Collect the VMAs, file descriptor numbers, registers, etc. of each process
     Restore
     – Read the process tree from the image files and recreate the saved processes with clone()
     – A new memory map is created and filled with the checkpointed data
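     The dump step starts from a plain /proc traversal. Below is a minimal Python sketch of that traversal only, assuming a Linux kernel that exposes /proc/$pid/task/$tid/children; it is illustrative and not CRIU's actual implementation, which is written in C and additionally seizes each task with ptrace and collects its VMAs, file descriptors and registers.

      # Minimal sketch of the process-tree discovery CRIU's dump phase performs:
      # walk /proc/$pid/task/$tid/children recursively. Illustrative only.
      import os

      def child_pids(pid):
          """Return the child PIDs reported by every thread of `pid`."""
          children = []
          task_dir = f"/proc/{pid}/task"
          for tid in os.listdir(task_dir):
              try:
                  with open(f"{task_dir}/{tid}/children") as f:
                      children += [int(c) for c in f.read().split()]
              except FileNotFoundError:
                  pass  # the thread exited while we were walking the tree
          return children

      def process_tree(root_pid):
          """Map each PID in the tree rooted at `root_pid` to its children."""
          tree, stack = {}, [root_pid]
          while stack:
              pid = stack.pop()
              tree[pid] = child_pids(pid)
              stack.extend(tree[pid])
          return tree

      print(process_tree(os.getpid()))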

  8. Suspend-Resume with DFS and NVM
     Support distributed and remote checkpoint-resume
     – Save checkpoints on HDFS (a sketch of this dump-and-stage flow follows this slide)
     Checkpoint with NVM
     – Use NVM as a fast disk
       – Save CRIU checkpoints in NVM-based file systems (e.g., PMFS)
     – Use NVM as virtual memory (NVRAM)
       – Copy checkpoints from DRAM to NVM using memory operations
       – Shadow buffer
       – Incremental checkpointing
     [Figure, top: a dump on one node writes checkpoint files to HDD/SSD/NVM and a distributed file system, from which another node restores the process address space. Bottom: checkpoints are copied from DRAM directly into distributed shared NVRAM with memory copies]
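     As a concrete illustration of the "NVM as fast disk" path, here is a hedged sketch that dumps a task with CRIU into an NVM-backed mount and stages the image directory to HDFS for remote resumption. The mount point, the HDFS path and the use of the hdfs CLI for staging are assumptions, not the paper's implementation.

      # Hedged sketch: CRIU-dump a task into an NVM-backed file system (e.g., a
      # PMFS mount) and stage the images to HDFS so another node can restore them.
      # Paths and the "hdfs dfs -put" staging step are illustrative assumptions.
      import subprocess

      PMFS_DIR = "/mnt/pmfs/checkpoints"   # assumed NVM-backed mount point
      HDFS_DIR = "/checkpoints"            # assumed HDFS staging directory

      def suspend_to_nvm_and_hdfs(pid):
          img_dir = f"{PMFS_DIR}/{pid}"
          subprocess.run(["mkdir", "-p", img_dir], check=True)
          # Freeze the process tree and write its image files to the NVM file system.
          subprocess.run(["criu", "dump", "-t", str(pid), "-D", img_dir,
                          "--shell-job", "--tcp-established"], check=True)
          # Stage the images to HDFS for remote resumption on another node.
          subprocess.run(["hdfs", "dfs", "-put", "-f", img_dir,
                          f"{HDFS_DIR}/{pid}"], check=True)

      def resume_from_local_images(pid):
          # Restore locally from the NVM copy; a remote node would first fetch the
          # image directory from HDFS with "hdfs dfs -get".
          subprocess.run(["criu", "restore", "-D", f"{PMFS_DIR}/{pid}",
                          "--shell-job", "--restore-detached"], check=True)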

  9. Suspend and Restore Performance
     [Figure: total, dump and restore time (s) vs. checkpoint size (0-10 GB), on a local file system (HDD, SSD, NVM) and on HDFS (HDD, SSD, PMFS)]

  10. Benefits of Incremental Checkpointing
      5 GB initial dump; 10% of the memory is then modified and dumped again (a sketch of incremental dumping with CRIU follows this slide)
        Storage   First checkpoint   Second (incremental) checkpoint
        HDD       169.18 s           15.34 s
        SSD       43.73 s            4.08 s
        PMFS      2.92 s             0.28 s
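      A hedged sketch of how such an incremental dump can be driven with CRIU's memory tracking, assuming a CRIU build with the pre-dump command and soft-dirty page tracking available; the directory layout is illustrative and this is not necessarily the exact tooling behind the numbers above.

      # Hedged sketch of incremental checkpointing with CRIU memory tracking.
      # "criu pre-dump --track-mem" records memory and arms the kernel's
      # soft-dirty tracker; a later dump pointed at the previous images with
      # --prev-images-dir writes only the pages that changed since then.
      import subprocess

      def incremental_checkpoint(pid, base_dir):
          pre_dir, final_dir = f"{base_dir}/pre", f"{base_dir}/final"
          subprocess.run(["mkdir", "-p", pre_dir, final_dir], check=True)
          # First pass: full memory pre-dump; the process keeps running afterwards.
          subprocess.run(["criu", "pre-dump", "-t", str(pid), "-D", pre_dir,
                          "--track-mem"], check=True)
          # Second pass: final dump writes only pages dirtied since the pre-dump.
          # --prev-images-dir is interpreted relative to the final image directory.
          subprocess.run(["criu", "dump", "-t", str(pid), "-D", final_dir,
                          "--prev-images-dir", "../pre", "--track-mem",
                          "--shell-job"], check=True)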

  11. Google Trace-driven Simulation
      [Figure, top row: normalized response time by job priority (lowest, medium, highest), wasted CPU capacity (core-hours) and energy consumption (kWh), comparing Preempt-Kill against Basic-HDD, Basic-SSD and Basic-NVM checkpointing]
      [Figure, bottom row: normalized completion time of high- and low-priority jobs and normalized power consumption vs. checkpoint bandwidth (0-5 GB/s), comparing wait, kill and checkpoint preemption]

  12. Adaptive Policies and Optimization
      – Adaptive preemption dynamically selects victim tasks and preemption mechanisms (checkpoint or kill) based on the progress of each task and its checkpoint/restore overhead.
      – Adaptive resumption restores preempted jobs/tasks locally or remotely according to their overheads and the available resources.
      – Incremental checkpointing with memory trackers.

  13. Adaptive Preemption Algorithms
      [Figure: listing of the adaptive preemption and resumption algorithms; a hedged sketch of the checkpoint-or-kill decision follows below]
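      The algorithm listing itself is only available as a figure, so the following is a minimal sketch based solely on the description on the previous slide: checkpoint a victim when the work it would lose by being killed outweighs its estimated checkpoint/restore cost, and kill it otherwise. The Task fields, the bandwidth-based overhead estimate and the comparison threshold are assumptions, not the authors' algorithm.

      # Hedged sketch of the checkpoint-or-kill decision described on slide 12,
      # NOT the paper's actual algorithm. All names and estimates are illustrative.
      from dataclasses import dataclass

      @dataclass
      class Task:
          task_id: str
          progress_sec: float        # useful work completed so far (seconds)
          checkpoint_size_gb: float  # estimated checkpoint image size

      def estimated_ckpt_restore_sec(task, storage_bw_gbps):
          # Dump + restore roughly move the image twice over the checkpoint storage.
          return 2.0 * task.checkpoint_size_gb / storage_bw_gbps

      def choose_preemption(task, storage_bw_gbps):
          """Return 'checkpoint' or 'kill' for a task picked as a preemption victim."""
          overhead = estimated_ckpt_restore_sec(task, storage_bw_gbps)
          return "checkpoint" if task.progress_sec > overhead else "kill"

      # Example: a task that has run 10 minutes with a 4 GB footprint on SSD-class
      # storage (~0.5 GB/s) is worth checkpointing; a task that just started is not.
      print(choose_preemption(Task("t1", 600.0, 4.0), 0.5))   # -> checkpoint
      print(choose_preemption(Task("t2", 5.0, 4.0), 0.5))     # -> kill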

  14. Performance Improvement with Adaptive Policies
      [Figure: normalized response time of lowest-, medium- and highest-priority jobs under Basic vs. Adaptive policies, with checkpoints on HDD, SSD and NVM; the annotated improvements of Adaptive over Basic range from about -0.5% (NVM) to 55%, depending on storage medium and priority class]

  15. Implementation with Hadoop YARN
      YARN – cluster resource manager
      – Global resource scheduler (ResourceManager)
      – ApplicationMasters (jobs) are submitted to the RM
      – Supports capacity and fair scheduling
      DistributedShell
      – Comes standard with YARN
      – Runs a shell command in a set of containers in a distributed and parallel manner
      [Figure: preemption flow. (1) a new job request arrives at the YARN ResourceManager's preemption scheduler; (2) the ResourceManager sends a preemption request to the ApplicationMaster; (3) the NodeManagers suspend the victim tasks with CRIU dump, writing the images to HDFS backed by HDD, SSD or NVM (PMFS); (4) suspend completes; (5) the freed containers are granted to the new request; (6) the suspended tasks are later resumed with CRIU restore, possibly on a different NodeManager]
      (a hedged sketch of the NodeManager-side suspend/resume handlers follows this slide)
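      To tie steps 3, 4 and 6 together, here is a hedged sketch of NodeManager-side suspend/resume handlers. This is not YARN's real NodeManager interface: the container registry, image root and notify callback are illustrative assumptions; only the criu invocations follow the mechanism described in the slides.

      # Hedged sketch of the NodeManager-side suspend/resume handlers in the flow
      # above (steps 3, 4 and 6). Not YARN's actual API: the container registry,
      # image root and `notify` callback are assumptions for illustration.
      import subprocess

      IMAGE_ROOT = "/mnt/pmfs/yarn-checkpoints"   # assumed NVM-backed image store

      class ContainerSuspender:
          def __init__(self, notify):
              self.notify = notify                # e.g., reports back to the AM/RM
              self.images = {}                    # container_id -> image directory

          def suspend(self, container_id, pid):   # step 3: Suspend
              img_dir = f"{IMAGE_ROOT}/{container_id}"
              subprocess.run(["mkdir", "-p", img_dir], check=True)
              subprocess.run(["criu", "dump", "-t", str(pid), "-D", img_dir,
                              "--shell-job"], check=True)
              self.images[container_id] = img_dir
              self.notify("suspend-complete", container_id)   # step 4

          def resume(self, container_id):         # step 6: Resume
              subprocess.run(["criu", "restore", "-D", self.images[container_id],
                              "--shell-job", "--restore-detached"], check=True)
              self.notify("resumed", container_id)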

  16. Testbed and Experiment Setup
      – 8-node Hadoop YARN cluster
        – Dual-socket Xeon 5650 CPUs (6 cores each)
        – 96 GB memory (48 GB emulated NVM using PMFS)
        – 500 GB HDD (un-optimized), 120 GB SSD
        – 24 concurrent containers (1 CPU / 2 GB memory each)
      – Workload
        – Modeled after a Facebook workload [1]
        – Mix of high- and low-priority jobs (7,000+ tasks)
      [1] Mitigating the Negative Impact of Preemption on Heterogeneous MapReduce Workloads. Cheng et al. CNSM 2011.

  17. Comparison of Different Preemption Policies on YARN
      [Figure: resource wastage (CPU wastage in core-hours), energy consumption (kWh) and average job response time (minutes, low- and high-priority) under basic preemption, comparing Kill against checkpointing to HDD, SSD and NVM]
      [Figure: CDF of response time (0-30 min) for Kill, Chk-HDD, Chk-SSD and Chk-NVM]

  18. Benefits of Adaptive Preemption
      [Figure: response time (minutes) of low- and high-priority jobs under Basic vs. Adaptive preemption with checkpoints on HDD, SSD and NVM; annotations give the per-class improvement of Adaptive over Basic]
