Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Torsten Hoefler (with Tobias Grosser)
Evading various "ends" – the hardware view
Parallel Hardware, Sequential Software

[Figure: parallel hardware - a multi-core CPU (grid of CPU cores) and a many-core GPU accelerator (grid of GPU cores) - contrasted with sequential software written in Fortran and C/C++]

C/C++ example (sequential convolution):

    row = 0;
    output_image_ptr = output_image;
    output_image_ptr += (NN * dead_rows);
    for (r = 0; r < NN - KK + 1; r++) {
        output_image_offset = output_image_ptr;
        output_image_offset += dead_cols;
        col = 0;
        for (c = 0; c < NN - KK + 1; c++) {
            input_image_ptr = input_image;
            input_image_ptr += (NN * row);
            kernel_ptr = kernel;
    S0:     *output_image_offset = 0;
            for (i = 0; i < KK; i++) {
                input_image_offset = input_image_ptr;
                input_image_offset += col;
                kernel_offset = kernel_ptr;
                for (j = 0; j < KK; j++) {
    S1:             temp1 = *input_image_offset++;
    S1:             temp2 = *kernel_offset++;
    S1:             *output_image_offset += temp1 * temp2;
                }
                kernel_ptr += KK;
                input_image_ptr += NN;
            }
    S2:     *output_image_offset = ((*output_image_offset) / normal_factor);
            output_image_offset++;
            col++;
        }
        output_image_ptr += NN;
        row++;
    }
Design Goals

- Automatic accelerator mapping
- "Regression Free"
- High Performance - how close can we get?
- Non-Goal: algorithmic changes
Tool: Polyhedral Modeling

Program Code:

    for (i = 0; i <= N; i++)
        for (j = 0; j <= i; j++)
            S(i,j);

Iteration Space:

[Figure: for N = 4, the triangular set of points (0,0) through (4,4), bounded by the constraints 0 ≤ i, i ≤ N, 0 ≤ j, and j ≤ i]

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser et al., Parallel Processing Letters, 2012
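To make the correspondence concrete: the loop nest enumerates exactly the integer points of D. A minimal sketch in plain C (with N fixed to 4 as on the slide) that prints them:

    #include <stdio.h>

    #define N 4  /* matches the slide's example */

    int main(void) {
        /* Enumerate D = { (i,j) | 0 <= i <= N and 0 <= j <= i }. */
        int points = 0;
        for (int i = 0; i <= N; i++)
            for (int j = 0; j <= i; j++) {  /* S(i,j) would execute here */
                printf("(%d,%d) ", i, j);
                points++;
            }
        printf("\ntotal: %d points\n", points);  /* 15 points for N = 4, as in the figure */
        return 0;
    }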
Mapping Computation to Device

[Figure: the iteration space mapped onto a 2×2 grid of device blocks, each holding 4×3 threads]

Block-ID mapping: { (j,k) → (⌊j/4⌋ % 2, ⌊k/3⌋ % 2) }
Thread-ID mapping: { (j,k) → (j % 4, k % 3) }
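In CUDA terms, these two mappings pick the block and thread coordinates of each iteration. A minimal sketch of an equivalent kernel; the kernel name and statement body are illustrative assumptions, and the sketch assumes the iteration space fits in one grid (the % wraps in the mappings would let a fixed grid cover larger spaces cyclically):

    __global__ void s_kernel(int n, int m /* iteration-space extents */) {
        /* Thread-ID mapping: (j,k) -> (j % 4, k % 3); block dimensions are 4x3. */
        /* Block-ID mapping: (j,k) -> (floor(j/4) % 2, floor(k/3) % 2); grid is 2x2. */
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (j < n && k < m) {
            /* ... body of S(j,k) ... */
        }
    }

    /* Launch covering an 8x6 iteration space:
       dim3 grid(2, 2), block(4, 3);
       s_kernel<<<grid, block>>>(8, 6);       */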
Memory Hierarchy of a Heterogeneous System
Host-device data transfers
Mapping onto fast memory

Polyhedral parallel code generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013
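The "fast memory" here is the GPU's on-chip scratchpad: the idea in the cited PPCG work is to stage reused array tiles in shared memory. A minimal hand-written sketch of the pattern (tile sizes and names chosen arbitrarily, not generated code):

    __global__ void stage_tile(const float *A, int n) {
        /* Stage a 4x3 tile of A in on-chip shared memory before computing on it. */
        __shared__ float tile[4][3];
        int j = blockIdx.x * 4 + threadIdx.x;
        int k = blockIdx.y * 3 + threadIdx.y;
        if (j < n && k < n)
            tile[threadIdx.x][threadIdx.y] = A[j * n + k];  /* global -> shared copy */
        __syncthreads();                                    /* tile is now fully loaded */
        /* ... compute using tile[][] instead of re-reading A from global memory ... */
    }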
Profitability Heuristic

[Figure: decision funnel - all loop nests pass through static filters (Trivial, Unsuitable) and a dynamic filter (Insufficient Compute); only the remainder reaches GPU execution modeling]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
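A sketch of how such a filter chain could look in code; all names below (struct loop_nest, is_trivial, is_unsuitable, MIN_GPU_WORK) are invented for illustration and are not Polly-ACC's actual implementation:

    enum verdict { RUN_ON_CPU, RUN_ON_GPU };

    enum verdict classify(const struct loop_nest *nest, long dynamic_work) {
        /* Static filters, decided at compile time: */
        if (is_trivial(nest))    return RUN_ON_CPU;  /* e.g. a single, tiny loop    */
        if (is_unsuitable(nest)) return RUN_ON_CPU;  /* e.g. unsupported constructs */
        /* Dynamic filter, decided from runtime loop bounds: */
        if (dynamic_work < MIN_GPU_WORK)
            return RUN_ON_CPU;                       /* insufficient compute        */
        return RUN_ON_GPU;                           /* model execution on the GPU  */
    }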
From kernels to program – data transfers

    void heat(int n, float A[n], float hot, float cold) {
        float B[n];                            /* a VLA cannot take "= {0}"; zero it explicitly */
        for (int i = 0; i < n; i++) B[i] = 0;
        initialize(n, A, cold);
        setCenter(n, A, hot, n/4);
        for (int t = 0; t < T; t++) {
            average(n, A, B);
            average(n, B, A);
            printf("Iteration %d done\n", t);
        }
    }
Data Transfer – Per Kernel

(heat() code as on the previous slide)

[Timeline: Host Memory vs. Device Memory. After initialize() and setCenter() run on the device, their results are copied D → H; every average() call is preceded by an H → D copy and followed by a D → H copy, so the arrays bounce between host and device for each kernel.]
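In explicit CUDA terms, the per-kernel strategy looks roughly like this; a hand-written sketch, not Polly-ACC output, where dA/dB are assumed device copies of A and B and avg_kernel stands in for average():

    for (int t = 0; t < T; t++) {
        cudaMemcpy(dA, A, n * sizeof(float), cudaMemcpyHostToDevice);  /* H -> D */
        avg_kernel<<<blocks, threads>>>(n, dA, dB);                    /* average(n, A, B) */
        cudaMemcpy(B, dB, n * sizeof(float), cudaMemcpyDeviceToHost);  /* D -> H */

        cudaMemcpy(dB, B, n * sizeof(float), cudaMemcpyHostToDevice);  /* H -> D */
        avg_kernel<<<blocks, threads>>>(n, dB, dA);                    /* average(n, B, A) */
        cudaMemcpy(A, dA, n * sizeof(float), cudaMemcpyDeviceToHost);  /* D -> H */
        printf("Iteration %d done\n", t);
    }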
Data Transfer – Inter Kernel Caching

(heat() code as above)

[Timeline: initialize(), setCenter(), and the average() calls keep their data resident in device memory; only a single D → H / H → D pair remains where host code runs between kernels.]
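With inter-kernel caching, the copies hoist out of the kernel sequence. A sketch of the resulting host code, with the same assumed names as above:

    cudaMemcpy(dA, A, n * sizeof(float), cudaMemcpyHostToDevice);  /* once, up front */
    cudaMemcpy(dB, B, n * sizeof(float), cudaMemcpyHostToDevice);
    for (int t = 0; t < T; t++) {
        avg_kernel<<<blocks, threads>>>(n, dA, dB);   /* average(n, A, B) */
        avg_kernel<<<blocks, threads>>>(n, dB, dA);   /* average(n, B, A) */
        printf("Iteration %d done\n", t);             /* host code that touches no array data */
    }
    cudaMemcpy(A, dA, n * sizeof(float), cudaMemcpyDeviceToHost);  /* once, at the end */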
Evaluation

Workstation: 10-core Sandy Bridge CPU, NVIDIA Titan Black (Kepler)
Mobile: 4-core Haswell CPU, NVIDIA GT730M (Kepler)
LLVM Nightly Test Suite

[Chart: number of compute regions / kernels, log scale from 1 to 10000, for detected SCoPs and for 0-dim, 1-dim, 2-dim, and 3-dim kernels, with and without the profitability heuristics]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Some results: Polybench 3.2

[Chart: speedup over icc -O3; geomean ~6x, arithmean ~30x]

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Compiles all of SPEC CPU 2006 – Example: LBM

[Chart: runtime (m:s) of icc, icc -openmp, clang, and Polly-ACC on the Mobile and Workstation systems; Polly-ACC is ~20% faster on Mobile and ~4x faster on Workstation]

The Mobile system is essentially a 4-core x86 laptop with the (free) GPU that's in there.
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Cactus ADM (SPEC 2006)

[Charts: Workstation and Mobile runtime results]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Cactus ADM (SPEC 2006) – Data Transfer

[Charts: Workstation and Mobile data-transfer behavior]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Polly-ACC
http://spcl.inf.ethz.ch/Polly-ACC

Automatic, "Regression Free", High Performance

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Brave new compiler world!? Unfortunately not …

- Limited to affine code regions; maybe generalizes to control-restricted programs
- No distributed anything!
- Good news: much of traditional HPC fits that model, and the infrastructure is coming along
- Bad news: modern data-driven HPC and Big Data fit less well
- We need a programming model for distributed heterogeneous machines!
How do we program GPUs today?

[Figure: per-core timelines of load/store/compute slots on a device; instruction latency is hidden by switching among active threads]

CUDA:
- over-subscribe the hardware
- use spare parallel slack for latency hiding

MPI:
- host controlled
- full device synchronization

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)
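The host-controlled pattern the slide criticizes looks roughly like this in today's MPI+CUDA style; a sketch with assumed kernel and buffer names:

    /* Classic MPI+CUDA: the host fully synchronizes the device before
       communicating, so no computation overlaps the MPI exchange. */
    for (int step = 0; step < nsteps; step++) {
        compute_kernel<<<blocks, threads>>>(d_buf, n);
        cudaDeviceSynchronize();  /* host waits: the whole device idles here */
        cudaMemcpy(halo_out, d_halo_out, m * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Sendrecv(halo_out, m, MPI_FLOAT, right, 0,
                     halo_in,  m, MPI_FLOAT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_halo_in, halo_in, m * sizeof(float), cudaMemcpyHostToDevice);
    }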
Latency hiding at the cluster level?

[Figure: per-core timelines now including put operations; communication latency is hidden by other active threads]

dCUDA (distributed CUDA):
- unified programming model for GPU clusters
- avoid unnecessary device synchronization to enable system-wide latency hiding

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)
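For flavor only, here is what device-side latency hiding could look like: device-resident ranks communicate directly, so other threads keep computing while a put is in flight. Every function name below is invented for illustration; it is not dCUDA's actual API, which the SC16 paper defines:

    __global__ void stencil_rank(float *buf, int n) {
        for (int step = 0; step < NSTEPS; step++) {
            compute_boundary(buf, n);                    /* hypothetical helper          */
            put_notify(buf, HALO, neighbor_rank, step);  /* hypothetical device-side put */
            compute_interior(buf, n);                    /* overlaps the put in flight   */
            wait_notify(step);                           /* hypothetical: blocks only this
                                                            rank; other threads keep the
                                                            GPU busy */
        }
    }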
Talk on Wednesday

Tobias Gysi, Jeremiah Baer, TH: "dCUDA: Hardware Supported Overlap of Computation and Communication"
Wednesday, Nov. 16th, 4:00-4:30pm, Room 355-D