Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Torsten Hoefler (with Tobias Grosser)
Evading various "ends" – the hardware view
Parallel Hardware, Sequential Software

[Figure: parallel hardware - a multi-core CPU (grid of CPU cores) and a many-core GPU accelerator (grid of GPU cores) - contrasted with sequential software written in Fortran and C/C++]

C/C++ example (sequential convolution):

    row = 0;
    output_image_ptr = output_image;
    output_image_ptr += (NN * dead_rows);
    for (r = 0; r < NN - KK + 1; r++) {
        output_image_offset = output_image_ptr;
        output_image_offset += dead_cols;
        col = 0;
        for (c = 0; c < NN - KK + 1; c++) {
            input_image_ptr = input_image;
            input_image_ptr += (NN * row);
            kernel_ptr = kernel;
    S0:     *output_image_offset = 0;
            for (i = 0; i < KK; i++) {
                input_image_offset = input_image_ptr;
                input_image_offset += col;
                kernel_offset = kernel_ptr;
                for (j = 0; j < KK; j++) {
    S1:             temp1 = *input_image_offset++;
    S1:             temp2 = *kernel_offset++;
    S1:             *output_image_offset += temp1 * temp2;
                }
                kernel_ptr += KK;
                input_image_ptr += NN;
            }
    S2:     *output_image_offset = ((*output_image_offset) / normal_factor);
            output_image_offset++;
            col++;
        }
        output_image_ptr += NN;
        row++;
    }
Design Goals

- Automatic accelerator mapping
- "Regression Free"
- High Performance - how close can we get?
- Non-Goal: algorithmic changes
Tool: Polyhedral Modeling

Program Code:

    for (i = 0; i <= N; i++)
        for (j = 0; j <= i; j++)
            S(i,j);

Iteration Space:

[Figure: for N = 4, the triangular set of points (0,0) through (4,4), bounded by the constraints 0 ≤ i, i ≤ N, 0 ≤ j, and j ≤ i]

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser et al., Parallel Processing Letters, 2012
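To make the correspondence concrete: the loop nest enumerates exactly the integer points of D. A minimal sketch in plain C (with N fixed to 4 as on the slide) that prints them:

    #include <stdio.h>

    #define N 4  /* matches the slide's example */

    int main(void) {
        /* Enumerate D = { (i,j) | 0 <= i <= N and 0 <= j <= i }. */
        int points = 0;
        for (int i = 0; i <= N; i++)
            for (int j = 0; j <= i; j++) {  /* S(i,j) would execute here */
                printf("(%d,%d) ", i, j);
                points++;
            }
        printf("\ntotal: %d points\n", points);  /* 15 points for N = 4, as in the figure */
        return 0;
    }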
Mapping Computation to Device

[Figure: the iteration space mapped onto a 2×2 grid of device blocks, each holding 4×3 threads]

Block-ID mapping: { (j,k) → (⌊j/4⌋ % 2, ⌊k/3⌋ % 2) }
Thread-ID mapping: { (j,k) → (j % 4, k % 3) }
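In CUDA terms, these two mappings pick the block and thread coordinates of each iteration. A minimal sketch of an equivalent kernel; the kernel name and statement body are illustrative assumptions, and the sketch assumes the iteration space fits in one grid (the % wraps in the mappings would let a fixed grid cover larger spaces cyclically):

    __global__ void s_kernel(int n, int m /* iteration-space extents */) {
        /* Thread-ID mapping: (j,k) -> (j % 4, k % 3); block dimensions are 4x3. */
        /* Block-ID mapping: (j,k) -> (floor(j/4) % 2, floor(k/3) % 2); grid is 2x2. */
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (j < n && k < m) {
            /* ... body of S(j,k) ... */
        }
    }

    /* Launch covering an 8x6 iteration space:
       dim3 grid(2, 2), block(4, 3);
       s_kernel<<<grid, block>>>(8, 6);       */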
Memory Hierarchy of a Heterogeneous System
Host-device data transfers
Mapping onto fast memory

Polyhedral parallel code generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013
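The "fast memory" here is the GPU's on-chip scratchpad: the idea in the cited PPCG work is to stage reused array tiles in shared memory. A minimal hand-written sketch of the pattern (tile sizes and names chosen arbitrarily, not generated code):

    __global__ void stage_tile(const float *A, int n) {
        /* Stage a 4x3 tile of A in on-chip shared memory before computing on it. */
        __shared__ float tile[4][3];
        int j = blockIdx.x * 4 + threadIdx.x;
        int k = blockIdx.y * 3 + threadIdx.y;
        if (j < n && k < n)
            tile[threadIdx.x][threadIdx.y] = A[j * n + k];  /* global -> shared copy */
        __syncthreads();                                    /* tile is now fully loaded */
        /* ... compute using tile[][] instead of re-reading A from global memory ... */
    }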
Profitability Heuristic

[Figure: decision funnel - all loop nests pass through static filters (Trivial, Unsuitable) and a dynamic filter (Insufficient Compute); only the remainder reaches GPU execution modeling]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
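A sketch of how such a filter chain could look in code; all names below (struct loop_nest, is_trivial, is_unsuitable, MIN_GPU_WORK) are invented for illustration and are not Polly-ACC's actual implementation:

    enum verdict { RUN_ON_CPU, RUN_ON_GPU };

    enum verdict classify(const struct loop_nest *nest, long dynamic_work) {
        /* Static filters, decided at compile time: */
        if (is_trivial(nest))    return RUN_ON_CPU;  /* e.g. a single, tiny loop    */
        if (is_unsuitable(nest)) return RUN_ON_CPU;  /* e.g. unsupported constructs */
        /* Dynamic filter, decided from runtime loop bounds: */
        if (dynamic_work < MIN_GPU_WORK)
            return RUN_ON_CPU;                       /* insufficient compute        */
        return RUN_ON_GPU;                           /* model execution on the GPU  */
    }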
From kernels to program – data transfers

    void heat(int n, float A[n], float hot, float cold) {
        float B[n];                            /* a VLA cannot take "= {0}"; zero it explicitly */
        for (int i = 0; i < n; i++) B[i] = 0;
        initialize(n, A, cold);
        setCenter(n, A, hot, n/4);
        for (int t = 0; t < T; t++) {
            average(n, A, B);
            average(n, B, A);
            printf("Iteration %d done\n", t);
        }
    }
Data Transfer – Per Kernel

(heat() code as on the previous slide)

[Timeline: Host Memory vs. Device Memory. After initialize() and setCenter() run on the device, their results are copied D → H; every average() call is preceded by an H → D copy and followed by a D → H copy, so the arrays bounce between host and device for each kernel.]
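In explicit CUDA terms, the per-kernel strategy looks roughly like this; a hand-written sketch, not Polly-ACC output, where dA/dB are assumed device copies of A and B and avg_kernel stands in for average():

    for (int t = 0; t < T; t++) {
        cudaMemcpy(dA, A, n * sizeof(float), cudaMemcpyHostToDevice);  /* H -> D */
        avg_kernel<<<blocks, threads>>>(n, dA, dB);                    /* average(n, A, B) */
        cudaMemcpy(B, dB, n * sizeof(float), cudaMemcpyDeviceToHost);  /* D -> H */

        cudaMemcpy(dB, B, n * sizeof(float), cudaMemcpyHostToDevice);  /* H -> D */
        avg_kernel<<<blocks, threads>>>(n, dB, dA);                    /* average(n, B, A) */
        cudaMemcpy(A, dA, n * sizeof(float), cudaMemcpyDeviceToHost);  /* D -> H */
        printf("Iteration %d done\n", t);
    }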
Data Transfer – Inter Kernel Caching

(heat() code as above)

[Timeline: initialize(), setCenter(), and the average() calls keep their data resident in device memory; only a single D → H / H → D pair remains where host code runs between kernels.]
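With inter-kernel caching, the copies hoist out of the kernel sequence. A sketch of the resulting host code, with the same assumed names as above:

    cudaMemcpy(dA, A, n * sizeof(float), cudaMemcpyHostToDevice);  /* once, up front */
    cudaMemcpy(dB, B, n * sizeof(float), cudaMemcpyHostToDevice);
    for (int t = 0; t < T; t++) {
        avg_kernel<<<blocks, threads>>>(n, dA, dB);   /* average(n, A, B) */
        avg_kernel<<<blocks, threads>>>(n, dB, dA);   /* average(n, B, A) */
        printf("Iteration %d done\n", t);             /* host code that touches no array data */
    }
    cudaMemcpy(A, dA, n * sizeof(float), cudaMemcpyDeviceToHost);  /* once, at the end */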
Evaluation

Workstation: 10-core Sandy Bridge CPU, NVIDIA Titan Black (Kepler)
Mobile: 4-core Haswell CPU, NVIDIA GT730M (Kepler)
LLVM Nightly Test Suite

[Chart: number of compute regions / kernels, log scale from 1 to 10000, for detected SCoPs and for 0-dim, 1-dim, 2-dim, and 3-dim kernels, with and without the profitability heuristics]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Some results: Polybench 3.2

[Chart: speedup over icc -O3; geomean ~6x, arithmean ~30x]

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Compiles all of SPEC CPU 2006 – Example: LBM

[Chart: runtime (m:s) of icc, icc -openmp, clang, and Polly-ACC on the Mobile and Workstation systems; Polly-ACC is ~20% faster on Mobile and ~4x faster on Workstation]

The Mobile system is essentially a 4-core x86 laptop with the (free) GPU that's in there.
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Cactus ADM (SPEC 2006)

[Charts: Workstation and Mobile runtime results]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Cactus ADM (SPEC 2006) – Data Transfer

[Charts: Workstation and Mobile data-transfer behavior]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Polly-ACC
http://spcl.inf.ethz.ch/Polly-ACC

Automatic, "Regression Free", High Performance

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
Brave new compiler world!? Unfortunately not …

- Limited to affine code regions; maybe generalizes to control-restricted programs
- No distributed anything!
- Good news: much of traditional HPC fits that model, and the infrastructure is coming along
- Bad news: modern data-driven HPC and Big Data fit less well
- We need a programming model for distributed heterogeneous machines!
How do we program GPUs today?

[Figure: per-core timelines of load/store/compute slots on a device; instruction latency is hidden by switching among active threads]

CUDA:
- over-subscribe the hardware
- use spare parallel slack for latency hiding

MPI:
- host controlled
- full device synchronization

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)
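The host-controlled pattern the slide criticizes looks roughly like this in today's MPI+CUDA style; a sketch with assumed kernel and buffer names:

    /* Classic MPI+CUDA: the host fully synchronizes the device before
       communicating, so no computation overlaps the MPI exchange. */
    for (int step = 0; step < nsteps; step++) {
        compute_kernel<<<blocks, threads>>>(d_buf, n);
        cudaDeviceSynchronize();  /* host waits: the whole device idles here */
        cudaMemcpy(halo_out, d_halo_out, m * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Sendrecv(halo_out, m, MPI_FLOAT, right, 0,
                     halo_in,  m, MPI_FLOAT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_halo_in, halo_in, m * sizeof(float), cudaMemcpyHostToDevice);
    }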
Latency hiding at the cluster level?

[Figure: per-core timelines now including put operations; communication latency is hidden by other active threads]

dCUDA (distributed CUDA):
- unified programming model for GPU clusters
- avoid unnecessary device synchronization to enable system-wide latency hiding

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)
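For flavor only, here is what device-side latency hiding could look like: device-resident ranks communicate directly, so other threads keep computing while a put is in flight. Every function name below is invented for illustration; it is not dCUDA's actual API, which the SC16 paper defines:

    __global__ void stencil_rank(float *buf, int n) {
        for (int step = 0; step < NSTEPS; step++) {
            compute_boundary(buf, n);                    /* hypothetical helper          */
            put_notify(buf, HALO, neighbor_rank, step);  /* hypothetical device-side put */
            compute_interior(buf, n);                    /* overlaps the put in flight   */
            wait_notify(step);                           /* hypothetical: blocks only this
                                                            rank; other threads keep the
                                                            GPU busy */
        }
    }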
Talk on Wednesday

Tobias Gysi, Jeremiah Baer, TH: "dCUDA: Hardware Supported Overlap of Computation and Communication"
Wednesday, Nov. 16th, 4:00-4:30pm, Room 355-D