GPU research in the ES-group
Henk Corporaal (professor)
Gert-Jan van den Braak (postdoc)
Roel Jordans (postdoc)
Erkan Diken (PhD)
Rik Jongerius (PhD)
Ang Li (PhD)
Maurice Peemen (PhD)
Luc Waeijen (PhD)
Mark Wijtvliet (PhD)
PARsE research – http://parse.ele.tue.nl/
PARsE: Parallel Architecture Research Eindhoven
• Using advanced heterogeneous platforms
  • Multi-core CPUs
  • GPUs
  • DSPs
  • FPGAs
• Efficient code generation
  • Code transformation & generation
  • Compilers
• Even more efficient: new architectures
  • SIMD, CGRA, R-GPU
  • Accelerators
    − Neural networks (CNNs)
GPU research – overview (selection)
• Application mapping
  • Histogram, CNN
• Understanding GPUs
  • Modeling of GPU L1 cache
  • Cache bypassing
• Architecture modification
  • Hash functions in scratchpad memory
• Code generation
  • Bones source-to-source tools
Application mapping
• Histogram
• Convolutional Neural Networks (CNN)
Application mapping: histogram
• Load pixel
• Update votes
[1] High Performance Predictable Histogramming on GPUs: Exploring and Evaluating Algorithm Trade-offs
Histogram – replication
• Load pixel
• Update votes in replicated histograms
[2] GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU
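A minimal CUDA sketch of this replication scheme, assuming 8-bit pixels, 256 bins and one sub-histogram per thread block in scratchpad (shared) memory; it illustrates the idea only and is not the kernel from [1] or [2]:

#define NUM_BINS 256

// Each thread block votes into its own sub-histogram in shared memory, so
// atomics from different blocks never contend with each other.
__global__ void histogram_replicated(const unsigned char *image, int n,
                                     unsigned int *global_hist)
{
    __shared__ unsigned int sub_hist[NUM_BINS];

    // Clear this block's private copy of the histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        sub_hist[b] = 0;
    __syncthreads();

    // Load pixels and update votes with shared-memory atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&sub_hist[image[i]], 1u);
    __syncthreads();

    // Merge the per-block sub-histograms into the global result.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], sub_hist[b]);
}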
Scratchpad memory layout
• Scratchpad memory is divided in 32 banks
• Each bank has 32 lock-bits, 1024 in total
[Figure: streaming multiprocessor with warp schedulers, dispatch units, cores, LD/ST units, SFUs, register file, instruction cache and scratchpad memory / L1 cache; the scratchpad is drawn as 32 banks, each with its own lock-bits]
Application mapping: CNN
• Convolutional Neural Network (CNN)
• GTX 460: 35 fps
• Tegra X1: ~20 fps
[Figure: network topology – 720x1280 input; Layer 1: 6x6 conv. with 2x2 subsample, 6x358x638; Layer 2: 6x6 conv. with 2x2 subsample, 16x177x317; Layer 3: 5x5 conv., 80x173x313; Layer 4: 1x1 conv., 8x173x313 – object category + position at (x,y)]
[4] Speed Sign Detection and Recognition by Convolutional Neural Networks
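Each layer of such a network is essentially a nest of convolution and subsample loops. The C sketch below shows one output feature map of a merged 6x6 convolution / 2x2 subsample layer; the array layout, dimensions, tanh activation and function name are illustrative assumptions, not the code from [4]:

#include <math.h>

#define MAPS_IN 6       /* illustrative: feature maps of the previous layer */
#define H_IN 358
#define W_IN 638
#define K 6             /* 6x6 convolution kernel */
#define S 2             /* 2x2 subsampling */
#define H_OUT ((H_IN - K + 1) / S)
#define W_OUT ((W_IN - K + 1) / S)

/* One output feature map of a 6x6 convolution merged with 2x2 subsampling:
 * the output is only evaluated at every second position, so convolution and
 * subsampling become a single loop nest (boundary handling simplified). */
void conv_layer(const float in[MAPS_IN][H_IN][W_IN],
                const float w[MAPS_IN][K][K], float bias,
                float out[H_OUT][W_OUT])
{
    for (int y = 0; y < H_OUT; y++) {
        for (int x = 0; x < W_OUT; x++) {
            float acc = bias;
            for (int m = 0; m < MAPS_IN; m++)      /* input feature maps */
                for (int ky = 0; ky < K; ky++)     /* kernel window      */
                    for (int kx = 0; kx < K; kx++)
                        acc += w[m][ky][kx] * in[m][S * y + ky][S * x + kx];
            out[y][x] = tanhf(acc);                /* assumed activation */
        }
    }
}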
Understanding GPUs
• Modeling of GPU L1 cache
• Cache bypassing
• Transit model
Understanding GPUs: L1 cache modeling
• GPU cache model:
  • Execution model (threads, thread blocks)
  • Memory latencies
  • MSHRs (pending memory requests)
  • Cache associativity
[5] A Detailed GPU Cache Model Based on Reuse Distance Theory
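Reuse distance theory, on which [5] builds, assigns to every access the number of distinct cache lines touched since the previous access to the same line; an LRU cache of associativity A then hits exactly when the per-set reuse distance is below A. The model in [5] extends this basic idea with the GPU execution model, latencies and MSHRs. Below is a minimal, quadratic-time C sketch of the reuse-distance computation itself (trace representation and function name are illustrative):

#include <stddef.h>

// Reuse distance of access i: number of DISTINCT lines referenced since the
// previous access to trace[i]; (size_t)-1 ("infinite") for a first-time access.
size_t reuse_distance(const unsigned long *trace, size_t i)
{
    unsigned long line = trace[i];
    size_t distinct = 0;

    for (size_t j = i; j-- > 0; ) {          // walk back to the previous use
        if (trace[j] == line)
            return distinct;
        int seen = 0;                        // count each line only once
        for (size_t k = j + 1; k < i && !seen; k++)
            seen = (trace[k] == trace[j]);
        if (!seen)
            distinct++;
    }
    return (size_t)-1;                       // no previous access: "infinite"
}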
L1 cache model – results
• Mean absolute error of 6.4%
Understanding GPUs: cache bypassing
[6] Adaptive and Transparent Cache Bypassing for GPUs
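On NVIDIA GPUs an individual load can be steered past the L1 with the .cg (cache-global, L2-only) PTX qualifier, or a whole compilation with -Xptxas -dlcm=cg. The sketch below shows only this static, hard-coded per-load mechanism; it is not the adaptive, transparent selection scheme that [6] proposes:

// Load a float while bypassing the L1 cache (cached in L2 only) via the .cg
// qualifier in inline PTX; per-load equivalent of -Xptxas -dlcm=cg.
__device__ float load_bypass_l1(const float *p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Example use: stream a large array through L2 only, keeping L1 free for data
// that is actually reused. Deciding *which* loads to bypass is what [6]
// automates; here the choice is simply hard-coded.
__global__ void scale_streaming(const float *in, float *out, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * load_bypass_l1(&in[i]);
}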
Cache bypassing – results
[6] Adaptive and Transparent Cache Bypassing for GPUs
Understanding GPUs: Transit model
• Transit model: computation and memory sub-systems
[7] Transit: A Visual Analytical Model for Multithreaded Machines
Architecture modifications
• Scratchpad memory hash functions
• R-GPU
GPU modifications: bank & lock conflicts
• addr = 32 * id: all addresses fall in bank 0
[Figure: 16 cores issuing scratchpad accesses; part of the address selects the lock-bit, part selects one of the 32 banks – with a stride of 32 every access selects bank 0]
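Why a stride of 32 words is the worst case follows directly from the bank mapping. A small C sketch; the word-addressed "address mod 32" mapping below is the usual 32-bank scheme, assumed for illustration:

#include <stdio.h>

#define NUM_BANKS 32

// Classic scratchpad mapping: consecutive 32-bit words go to consecutive
// banks, so the bank index is simply the word address modulo 32.
static unsigned bank_of(unsigned word_addr) { return word_addr % NUM_BANKS; }

int main(void)
{
    // One warp of 32 threads, each accessing addr = 32 * id: every thread
    // lands in bank 0, a 32-way bank conflict, so the accesses are
    // serialized instead of being served in a single cycle.
    for (unsigned id = 0; id < 32; id++)
        printf("thread %2u: addr %4u -> bank %u\n", id, 32 * id, bank_of(32 * id));
    return 0;
}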
Resolving bank conflicts: hash functions
• addr = 32 * id: the scratchpad address is passed through a hash function before bank selection, spreading the accesses over the banks
[8] Simulation and Architecture Improvements of Atomic Operations on GPU Scratchpad Memory
Resolving bank conflicts: configurable hash functions
• addr = 33 * id: bank selected through a configurable hash function
[9] Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs
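A configurable XOR hash of the kind studied in [9] computes each bank-select bit as the parity (XOR) of a programmable subset of address bits. The C sketch below is a generic illustration only; the mask values and names are made up, not the configurations from [9]:

// Each of the 5 bank-select bits is the XOR (parity) of the address bits
// chosen by a per-bit mask; reprogramming the masks changes the hash.
// __builtin_parity is a GCC/Clang builtin.
static unsigned xor_hash_bank(unsigned addr, const unsigned mask[5])
{
    unsigned bank = 0;
    for (int b = 0; b < 5; b++)
        bank |= (unsigned)__builtin_parity(addr & mask[b]) << b;
    return bank;                      // 0..31
}

// Illustrative configuration: fold address bits 5..9 onto the classic
// "word address mod 32" bits 0..4.
static const unsigned example_mask[5] = {
    0x001 | 0x020,   // bank bit 0 = addr bit 0 XOR addr bit 5
    0x002 | 0x040,   // bank bit 1 = addr bit 1 XOR addr bit 6
    0x004 | 0x080,
    0x008 | 0x100,
    0x010 | 0x200,
};

// With this configuration, xor_hash_bank(32 * id, example_mask) maps the 32
// threads of a warp to 32 different banks, removing the conflict above.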
Architecture modifications: R-GPU
[Figure: R-GPU organization – the SM's cores, LD/ST units and shared memory / L1 cache interconnected through CR blocks]
Code generation: ASET & Bones
• How to generate efficient code for all these devices from sequential C code?
• Targets: CPU-OpenMP, CPU-OpenCL-Intel, CPU-OpenCL-AMD, GPU-CUDA, GPU-OpenCL-AMD, Multi-GPU (CUDA / OpenCL), XeonPhi-OpenCL, FPGA
[10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
Code generation: ASET & Bones
• sequential C code
  → 'ASET': Algorithmic Species Extraction Tool (PET / LLVM)
• species-annotated C code
  → 'Bones': skeleton-based compiler
• Targets: CPU-OpenMP, CPU-OpenCL-Intel, CPU-OpenCL-AMD, GPU-CUDA, GPU-OpenCL-AMD, Multi-GPU (CUDA / OpenCL), XeonPhi-OpenCL, FPGA
[10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
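As an illustration of what the intermediate "species-annotated C code" looks like, a hedged sketch is given below; the exact pragma syntax is reproduced from memory of the Bones/ASET publications and may differ in detail from the actual tools:

// Sequential C loop annotated with an algorithmic species: each output
// element of r is produced from one element of a (an "element -> element"
// pattern), which Bones maps onto a matching parallel skeleton for the
// chosen target (CUDA, OpenCL, OpenMP, ...).
#pragma species kernel a[0:N-1]|element -> r[0:N-1]|element
for (int i = 0; i < N; i++)
    r[i] = 2 * a[i];
#pragma species endkernel scale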
Example C to CUDA transformation
[Figure: tree-based summation of the example array 3 1 7 0 4 1 6 3]

Example 1: Sum (sequential C code)

int sum = 0;
for (int i = 0; i < N; i++) {
    sum = sum + in[i];
}

Optimized CUDA reduction kernel (one partial sum per thread block):

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sm, unsigned int tid) {
    if (blockSize >= 64) sm[tid] += sm[tid + 32];
    if (blockSize >= 32) sm[tid] += sm[tid + 16];
    if (blockSize >= 16) sm[tid] += sm[tid +  8];
    if (blockSize >=  8) sm[tid] += sm[tid +  4];
    if (blockSize >=  4) sm[tid] += sm[tid +  2];
    if (blockSize >=  2) sm[tid] += sm[tid +  1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sm[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    sm[tid] = 0;
    while (i < n) {
        sm[tid] += g_idata[i];
        sm[tid] += g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sm[tid] += sm[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sm[tid] += sm[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sm[tid] += sm[tid +  64]; } __syncthreads(); }
    if (tid < 32) { warpReduce<blockSize>(sm, tid); }
    if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
}