The Effect of Asymmetric Performance on Asynchronous Task-Based Runtimes
Debashis Ganguly and John R. Lange
ROSS 2017
Changing Face of HPC Environments
• Traditional: dedicated resources (simulation on the supercomputer, visualization on a separate processing and storage cluster)
• Future: collocated workloads (simulation and visualization share the supercomputer)
• Task-based runtimes: a potential solution
• Goal: Can asynchronous task-based runtimes handle asymmetric performance?
Task-Based Runtimes
• Experiencing a renewal of interest in the systems community
• Assumed to better address performance variability
• Adopt an (over-)decomposed task-based model
• Allow fine-grained scheduling decisions
• Able to adapt to asymmetric/variable performance
• But…
• Originally designed for application-induced load imbalances, e.g., adaptive mesh refinement (AMR) based applications
• Performance asymmetry can be of finer granularity, e.g., variable CPU time in time-shared environments
Basic Experimental Evaluation
• Synthetic situation: emulate performance asymmetry in a time-shared configuration
• Static and predictable setting
• Benchmark on 12 cores, sharing one core with a background workload
• Vary the percentage of CPU time given to the competing workload
• Environment: 12-core dual-socket compute node, hyperthreading disabled
• Used cpulimit to control the percentage of CPU time
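cpulimit throttles a target process externally by alternating SIGSTOP and SIGCONT in a duty cycle. As a rough illustration of the same idea, here is a minimal in-process sketch (the function names are our own, not from the talk):

```python
import time

def busy_spin(seconds):
    """Burn CPU for roughly the given wall-clock duration."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass

def competing_workload(cpu_fraction, period=0.1, iterations=5):
    """Duty-cycle throttling: spin for cpu_fraction of each period
    and sleep for the rest. cpulimit achieves a similar effect
    externally by alternating SIGSTOP/SIGCONT on the target process."""
    for _ in range(iterations):
        busy_spin(cpu_fraction * period)
        time.sleep((1.0 - cpu_fraction) * period)
```

Pinned to one core, competing_workload(0.5) approximates a background task consuming about 50% of that core's CPU time.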
Workload Configuration
[Diagram: in the 11-core setting the benchmark runs on 11 cores, leaving the 12th core to the competing workload; in the 12-core setting the benchmark runs on all 12 cores and shares the 12th core with the competing workload. The other node remains idle.]
Experimental Setup
• Evaluated two different runtimes:
• Charm++: LeanMD
• HPX-5: LULESH, HPCG, LibPXGL
• Competing workloads:
• Prime number generator: entirely CPU bound, with a minimal memory footprint
• Kernel compilation: stresses internal OS features such as the I/O and memory subsystems
Charm++
• Iterative, over-decomposed applications
• Object-based programming model
• Tasks implemented as C++ objects
• Objects can migrate across intra- and inter-node boundaries
Charm++
• A separate, centralized load balancer component
• Preempts application progress
• Actively migrates objects based on current state
• Causes computation to block across the other cores
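Charm++ ships several balancing strategies (GreedyLB, RefineLB, RefineSwapLB, and others). The following toy model, our own sketch rather than the Charm++ implementation, illustrates the core idea of a greedy strategy: place the heaviest remaining object on the core with the lowest projected finish time, accounting for unequal effective core speeds:

```python
import heapq

def greedy_balance(object_loads, core_speeds):
    """Toy greedy load balancer in the spirit of Charm++'s GreedyLB:
    repeatedly assign the heaviest remaining object to the core whose
    projected finish time (assigned load / core speed) is lowest."""
    # Min-heap of (projected_finish_time, core_index)
    heap = [(0.0, c) for c in range(len(core_speeds))]
    heapq.heapify(heap)
    assignment = {}
    loads = [0.0] * len(core_speeds)
    # Heaviest objects first, so large tasks anchor the schedule
    for obj, load in sorted(enumerate(object_loads), key=lambda kv: -kv[1]):
        _, core = heapq.heappop(heap)
        assignment[obj] = core
        loads[core] += load
        heapq.heappush(heap, (loads[core] / core_speeds[core], core))
    return assignment, loads
```

With 11 full-speed cores and one core at half effective speed (as in the time-shared setup above), the slow core is assigned proportionally less work.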
Choice of Load Balancer Matters
• Compared the performance of different load balancing strategies against running with no load balancer
[Figure: runtime vs. percentage of CPU utilized by the background prime number generator on the 12th core, for GreedyLB, RotateLB, RandCentLB, RefineLB, RefineSwapLB, and no load balancer; up to 198% divergence between strategies]
• We selected RefineSwapLB for the rest of the experiments.
Invocation Frequency Matters
• MetaLB: invoke the load balancer less frequently, based on heuristics
[Figure: total runtime and load-balancing overhead of RefineSwapLB, with and without MetaLB, vs. percentage of CPU utilized by the background prime number generator on the 12th core]
• We enabled MetaLB for our experiments.
Charm++: LeanMD
[Figure: runtime vs. percentage of CPU utilized by the background prime number generator on the 12th core; 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation for 12 threads on 12 cores; up to 53% divergence; cross-over at 25% of the core given to the background workload]
• 12 cores are worse than 11 cores…
• …unless the application gets at least 75% of the shared core's capacity.
• If the application cannot get more than 75% of the core's capacity, it is better off ignoring the core completely.
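The theoretical-expectation curve follows from a simple capacity model: with total work W spread perfectly over n cores, one of which yields only a fraction f of its capacity, the ideal runtime is W / (n - 1 + f). A one-line sketch (our own notation):

```python
def ideal_runtime(work, n_cores, shared_core_fraction):
    """Ideal runtime when one of n_cores yields only
    `shared_core_fraction` of its capacity and the work divides
    perfectly across all cores."""
    return work / (n_cores - 1 + shared_core_fraction)
```

In this ideal model the degraded core always helps (W / (11 + f) < W / 11 for any f > 0), so the measured cross-over near 75% of the core's capacity reflects runtime overheads such as load balancing and blocked computation, not the capacity model itself.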
Charm++: LeanMD (kernel compilation)
• More variable, but consistent mean performance.
[Figure: runtime vs. percentage of CPU utilized by the background kernel compilation workload on the 12th core; 12 threads on 12 cores vs. 11 threads on 11 cores; cross-over again near 25% given to the background workload]
HPX-5
• Parcel: contains a computational task and a reference to the data the task operates on
• Follows the work-first principle of Cilk-5
• Every scheduling entity processes parcels from the top of its own scheduling queue
HPX-5
• Implemented using random work stealing
• No centralized decision-making process
• The overhead of work stealing is assumed by the stealer
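A minimal sketch of random work stealing (our own toy model, not HPX-5's scheduler): each worker runs tasks from the top of its own queue, and an idle worker picks a random victim and steals from the other end, so the cost of stealing is paid by the thief rather than the victim:

```python
import random
from collections import deque

class Worker:
    """One scheduling entity with its own double-ended task queue."""
    def __init__(self):
        self.tasks = deque()

def run_all(workers, rng=random):
    """Run until every queue is empty; return the number of tasks run."""
    completed = 0
    while True:
        progressed = False
        for w in workers:
            if w.tasks:
                w.tasks.pop()          # run own task from the top
                completed += 1
                progressed = True
            else:
                victims = [v for v in workers if v is not w and v.tasks]
                if victims:
                    victim = rng.choice(victims)
                    # Thief pays the cost: steal from the bottom
                    w.tasks.append(victim.tasks.popleft())
                    progressed = True
        if not progressed:
            return completed
```

Seeding one worker with all the tasks shows the load spreading out without any central coordinator.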
OpenMP: LULESH
• Overall application performance is determined by the slowest rank
• Vulnerable to asymmetries in performance
• Relies on collective-based communication
[Figure: runtime vs. percentage of CPU utilized by the background prime number generator on the 12th core; 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation; up to 185% divergence]
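The slowest-rank effect can be captured in one line: in a bulk-synchronous step every rank must reach the barrier before any rank proceeds, so the step takes as long as the slowest rank. A toy model (our own names):

```python
def bsp_iteration_time(per_rank_work, core_fractions):
    """Time of one BSP step: every rank waits at the barrier, so the
    step lasts as long as the slowest rank. Per-rank time is modeled
    as work divided by the CPU fraction available on that rank's core."""
    return max(w / f for w, f in zip(per_rank_work, core_fractions))
```

One core at half capacity doubles the whole step, even though eleven of twelve cores are unaffected, which is why the BSP curves degrade so sharply.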
HPX-5: LULESH
• A traditional BSP application implemented using task-based programming
[Figure: runtime vs. percentage of CPU utilized by the background prime number generator on the 12th core; 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation; up to 42% divergence]
• No cross-over point
• 12 cores are consistently worse than 11 cores
HPX-5: HPCG
• Another BSP application implemented in the task-based model
[Figure: runtime vs. percentage of CPU utilization by the background prime number generator on the 12th core; 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation; about 5% divergence, with 10% of the core given to the background workload at the annotated point]
• Better than the theoretical expectation
• 12 cores are consistently worse than 11 cores
HPX-5: LibPXGL
• An asynchronous graph processing library
• A more natural fit for the task-based model
[Figure: runtime vs. percentage of CPU utilization by the background prime number generator on the 12th core; 12 threads on 12 cores, 11 threads on 11 cores, and the theoretical expectation; about 5% divergence, with 22% of the core given to the background workload at the annotated point]
• No cross-over point
• 12 cores are consistently worse than 11 cores
HPX-5: Kernel Compilation
• More immediate, rather than gradual, decline.
[Figures: runtime vs. percentage of CPU consumed by the background kernel compilation workload on the 12th core, for LULESH, HPCG, and LibPXGL; 12 threads on 12 cores vs. 11 threads on 11 cores]
Conclusion
• Performance asymmetry is still challenging
• Preliminary evaluation:
• Tightly controlled time-shared CPUs
• Static and consistent configuration
• Better than BSP, but…
• On average, a CPU loses its utility to a task-based runtime as soon as its performance diverges by only 25%.
Thank You
• Debashis Ganguly
• Ph.D. Student, Computer Science Department, University of Pittsburgh
• debashis@cs.pitt.edu
• https://people.cs.pitt.edu/~debashis/
• The Prognostic Lab
• http://www.prognosticlab.org