CPU+GPU Load Balance Guided by Execution Time Prediction
Jean-François Dollinger, Vincent Loechner
Inria CAMUS, ICube Lab., University of Strasbourg
jean-francois.dollinger@inria.fr, vincent.loechner@inria.fr
19 January 2015
Outline
1 Introduction
2 Prediction
  Overview
  Code generation
  Profiling
3 Runtime
  CPU + GPU
4 Conclusion
Introduction
Achieving and predicting performance on CPU/GPU systems is difficult. Performance is sensitive to:
• the input dataset (CUDA grid size, cache effects)
• compiler optimizations (unrolling, fission)
• cloud infrastructures
• hardware availability
• how efficiently the resources are exploited
Introduction
Because of these dynamic behaviors, compilers miss performance opportunities:
• PLUTO
• PPCG
• Par4All
• OpenACC/HMPP: manual tuning
→ Automatic methods are the way to go (for the typical use case)
→ Our interest: polyhedral codes
Introduction
How to get performance?
• Run the right code on the right PU (Processing Unit)
• Select the best code version on each given PU
• Ensure load balance between the PUs
→ Multi-versioning + runtime code selection = win
Prediction Overview
Multi-versioning: performance factors
• Static factors (instructions)
• External dynamic factors (scheduler)
• Internal dynamic factors (cache effects, memory contention)
Prediction Overview
Static code generation pipeline (reconstructed from the figure):
• The region between #pragma scop and #pragma endscop is extracted from the source.
• PLUTO parallelizes and chunks the loop nest for the CPU (#pragma omp parallel for); PPCG generates CUDA kernel launches for the GPU. Each path produces several code versions (version 0, version 1, version 2, ...), from which templates are built.
• Offline profiling measures memcpy and kernel durations, filling a bandwidth table and ranking tables.
• At runtime, the prediction component in the application binary replaces the scop with calls to schedule(...) and dispatch(...).
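To make the last step concrete, a minimal sketch of what the instrumented scop could look like; the chunk_t layout and the schedule/dispatch signatures are assumptions, only the two function names come from the figure.

#define MAX_PUS 8

/* Hypothetical runtime API: one chunk of the outermost parallel loop,
   assigned to a processing unit with a selected code version. */
typedef struct { long lb, ub; int pu; int version; } chunk_t;

int  schedule(long lb, long ub, chunk_t *chunks);        /* assumed signature */
void dispatch(const chunk_t *chunks, int n, double *A);  /* assumed signature */

void run_scop(long lb, long ub, double *A)
{
    chunk_t chunks[MAX_PUS];
    /* Predict per-PU execution times and split [lb, ub) accordingly. */
    int n = schedule(lb, ub, chunks);
    /* Run each chunk with its selected code version on its PU. */
    dispatch(chunks, n, A);
}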
Prediction Overview
The Pedro framework [Benoit Pradelle et al., 2011]
• Multi-versioning of polyhedral loop nests
• Target: multicore CPUs
Prediction: Code generation
A code version is defined by:
• Block size
• Tile size
• Schedule
→ controlled by PPCG command-line options

PPCG, a source-to-source compiler:
• Transforms C to CUDA
• Generates:
  • Ehrhart polynomials
  • Loop nest parameters

Python scripts:
• Fill the templates in the C code
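For intuition, an Ehrhart polynomial counts the integer points of a parametric iteration domain. A classic textbook example (not taken from the slides): the triangular nest for (i = 0; i < N; i++) for (j = 0; j <= i; j++) executes

$\#\{(i,j) \in \mathbb{Z}^2 \mid 0 \le i < N,\ 0 \le j \le i\} = \frac{N(N+1)}{2}$

iterations. Evaluating such a polynomial at runtime yields the iteration count of a chunk, and hence its CUDA grid size, without executing it.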
Prediction: Profiling
Data transfers: host ↔ device
• Parameter: message size
• Asymmetric and non-uniform bandwidth

Code simulation
• Parameters: number of CUDA blocks, sequential parameters
• Load balance
• Memory contention

How to model the performance curves?
• Detection of affine intervals
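A minimal sketch of how such a bandwidth table could be profiled with the standard CUDA runtime API (error checking omitted; the size sweep and output format are assumptions):

#include <cuda_runtime.h>
#include <stdio.h>

/* Profile host->device bandwidth for increasing message sizes. */
int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t size = 1 << 10; size <= 1 << 26; size <<= 1) {
        void *host, *dev;
        cudaMallocHost(&host, size);   /* pinned host memory */
        cudaMalloc(&dev, size);

        cudaEventRecord(start, 0);
        cudaMemcpy(dev, host, size, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%zu KB: %.1f MB/s\n", size / 1024,
               (size / (1024.0 * 1024.0)) / (ms / 1000.0));

        cudaFree(dev);
        cudaFreeHost(host);
    }
    return 0;
}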
Prediction: Testbed
1st test platform
• 2× Nvidia GTX 590 (16 SMs × 32 SPs)
• Asus P8P67-Pro (PCIe 2.0, x8 per card)
• Core i7 2700K, stock frequency

2nd test platform
• Nvidia GTX 680 (8 SMs × 192 SPs)
• Asus P8P67-Deluxe (PCIe 2.0, x16)
• Core i7 2600
Prediction: Data transfers (testbed 1)
[Figure: bandwidth (MB/s) as a function of transfer size (KB, log scale) on the GTX 590, comparing real and profiled dev→host and host→dev transfers.]
Prediction: Data transfers (testbed 2)
[Figure: bandwidth (MB/s) as a function of transfer size (KB, log scale) on the GTX 680, comparing real and profiled dev→host and host→dev transfers.]
Prediction: Kernel simulation (testbed 1)
[Figure: gemm, 32×16 blocks, GTX 590 — real and profiled execution time per iteration (ns) as a function of the number of blocks, log-log scale.]
Prediction: Kernel simulation (testbed 1)
[Figure: syr2k, GTX 590 — real execution time per iteration (ns) as a function of the parameter size p (0 to 1536); on each affine interval the time follows the model e_i = p_i · β + u_i.]
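A minimal sketch of fitting e = p · β + u on one affine interval by ordinary least squares; this is an illustration, not the authors' implementation:

/* Least-squares fit of e = beta * p + u over n profiled points. */
void fit_affine(const double *p, const double *e, int n,
                double *beta, double *u)
{
    double sp = 0, se = 0, spp = 0, spe = 0;
    for (int i = 0; i < n; i++) {
        sp  += p[i];        se  += e[i];
        spp += p[i] * p[i]; spe += p[i] * e[i];
    }
    *beta = (n * spe - sp * se) / (n * spp - sp * sp);
    *u    = (se - *beta * sp) / n;
}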
Runtime: CPU + GPU
The outermost parallel loop is split into chunks:
• Each chunk is assigned to one PU
• PU performances differ → ensure load balance

Multi-versioning
• Code optimized for each target (PLUTO + PPCG)
• Multiple code versions (combined)

Two components:
• Scheduler:
  • Predicts the execution time of the chunks [B. Pradelle et al.] + [J.-F. Dollinger et al.]
  • Adjusts the chunk sizes
• Dispatcher
Runtime: CPU + GPU
Scheduler operation ($t_i$: predicted time per iteration on PU $i$; $D_i$: iteration domain of its chunk):
1. Balance the predicted chunk times: $T_0 = t_0 \cdot \mathrm{Card}(D_0) \approx t_1 \cdot \mathrm{Card}(D_1) \approx \dots \approx t_n \cdot \mathrm{Card}(D_n)$
2. Each $T_i$ must tend to $\frac{1}{n}\sum_{i=0}^{n-1} t_i \cdot \mathrm{Card}(D_i) = \frac{1}{n}\,T_{all}$
3. $t_i = f(G_i, seq)$ on GPU
4. $t_i = g(P_i, S_i)$ on CPU
→ Eliminate inefficient PUs
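A minimal sketch of step 1 in C: making $t_i \cdot \mathrm{Card}(D_i)$ equal across PUs implies chunk sizes proportional to $1/t_i$ (the names and the rounding policy are assumptions):

/* Split `total` iterations so that chunk_size[i] * t[i] is roughly equal,
   where t[i] is the predicted time per iteration of PU i. */
void balance_chunks(const double *t, int n, long total, long *chunk_size)
{
    /* Equal times => chunk_size[i] proportional to 1 / t[i]. */
    double inv_sum = 0.0;
    for (int i = 0; i < n; i++)
        inv_sum += 1.0 / t[i];

    long assigned = 0;
    for (int i = 0; i < n; i++) {
        chunk_size[i] = (long)(total * (1.0 / t[i]) / inv_sum);
        assigned += chunk_size[i];
    }
    chunk_size[0] += total - assigned;  /* give rounding leftovers to PU 0 */
}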
Runtime: CPU + GPU
[Figure: two panels showing, per PU, the workload proportion and the execution time proportion over balancing steps 1 to 6.]
Runtime: CPU + GPU (speedup over one PU)
[Figure: speedup over a single PU for several polyhedral benchmark kernels, comparing CPU, 1 GPU, CPU+1GPU, CPU+2GPUs, CPU+3GPUs and CPU+4GPUs configurations.]
Runtime: CPU + GPU (load imbalance)
[Figure: load imbalance (0 to 1) of the CPU+1GPU, CPU+2GPUs, CPU+3GPUs and CPU+4GPUs configurations on the same benchmark kernels.]
Runtime: Multi-versioning CPU + GPU (speedup over the worst version)
[Figure: speedup of the syr2k code versions c1 to c9, and of the automatic selection over all versions ("all"), on CPU, 1GPU, CPU+1GPU, CPU+2GPUs, CPU+3GPUs and CPU+4GPUs.]
Runtime: Multi-versioning CPU + GPU (load imbalance)
[Figure: load imbalance (0 to 1) of the syr2k versions c1 to c9 and "all" on CPU+1GPU, CPU+2GPUs, CPU+3GPUs and CPU+4GPUs.]
Conclusion
Framework capabilities:
• Execution time prediction
• Fastest version selection
• CPU vs GPU competition
• CPU + GPU joint usage

Future work:
• Energy consumption
Annex
Offline profiling: ranking table

Number of threads   version 1   version 2   version 3
        1             40 ms       55 ms       32 ms
        2             32 ms       28 ms       17 ms
        3             22 ms       15 ms        9 ms
        4             14 ms        7 ms        8 ms

Online prediction: execution time computation
observation = {2000, 600, 300, 300}
prediction(version 1) = ((2000 − 600) × 40) + ((600 − 300) × 32) + (0 × 22) + (300 × 14) = 69800 ms
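A minimal sketch of this computation in C; the table encoding is an assumption, but the sketch reproduces the 69800 ms of the example:

/* Predicted time of one version from the per-thread-count ranking table.
   obs[k]  = iterations still running with k+1 active threads,
             a non-increasing sequence, e.g. {2000, 600, 300, 300}.
   time[k] = profiled time per iteration with k+1 threads. */
double predict(const long *obs, const double *time, int nthreads)
{
    double total = 0.0;
    for (int k = 0; k < nthreads; k++) {
        long next = (k + 1 < nthreads) ? obs[k + 1] : 0;
        total += (obs[k] - next) * time[k];
    }
    return total;
}

With obs = {2000, 600, 300, 300} and time = {40, 32, 22, 14}, predict() returns (1400 × 40) + (300 × 32) + (0 × 22) + (300 × 14) = 69800.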
Annex
The algorithm stages (see the sketch below):
• Init: distribute the iterations equitably among the PUs
• Repeat 10 times:
  • Compute the per-chunk execution time T_i
  • r_i = T_i / T_all
  • Adjust the chunk size according to r_i
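A minimal sketch of this refinement loop, assuming a multiplicative adjustment toward the target share 1/n; measure_chunk() is hypothetical, and the exact adjustment formula is not given in the slides:

double measure_chunk(int pu, long iters);  /* hypothetical: run + time chunk */

/* Iterative rebalancing sketch: shrink the chunks of slow PUs, grow fast ones. */
void refine(long *size, int n, long total)
{
    for (int step = 0; step < 10; step++) {
        double T[n], T_all = 0.0;
        for (int i = 0; i < n; i++) {
            T[i] = measure_chunk(i, size[i]);
            T_all += T[i];
        }
        long assigned = 0;
        for (int i = 0; i < n; i++) {
            double r = T[i] / T_all;        /* share of total time */
            /* Target share is 1/n: shrink if r > 1/n, grow if r < 1/n. */
            size[i] = (long)(size[i] * (1.0 / n) / r);
            assigned += size[i];
        }
        size[0] += total - assigned;        /* keep the iteration count constant */
    }
}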