GPU research in the ES-group




  1. GPU research in the ES-group
     Henk Corporaal (professor), Gert-Jan van den Braak (postdoc), Roel Jordans (postdoc), Erkan Diken (PhD), Rik Jongerius (PhD), Ang Li (PhD), Maurice Peemen (PhD), Luc Waeijen (PhD), Mark Wijtvliet (PhD)

  2. PARsE research – http://parse.ele.tue.nl/

  3. PARsE – Parallel Architecture Research Eindhoven
     • Using advanced heterogeneous platforms: multi-core CPUs, GPUs, DSPs, FPGAs
     • Efficient code generation: code transformation & generation, compilers
     • Even more efficient: new architectures (SIMD, CGRA, R-GPU) and accelerators, e.g. neural networks (CNNs)

  4. GPU research – overview (selection)
     • Application mapping: histogram, CNN
     • Understanding GPUs: modeling of the GPU L1 cache, cache bypassing
     • Architecture modification: hash functions in scratchpad memory
     • Code generation: Bones source-to-source tools

  5. Application mapping
     • Histogram
     • Convolutional Neural Networks (CNN)

  6. Application mapping: histogram
     • Load pixel
     • Update votes
     [1] High Performance Predictable Histogramming on GPUs: Exploring and Evaluating Algorithm Trade-offs

  7. Histogram – replication
     • Load pixel
     • Update votes in a replicated histogram (a per-block sketch follows below)
     [2] GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU
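     A minimal CUDA sketch of the replication idea, assuming a 256-bin histogram over 8-bit pixels: each thread block votes into its own sub-histogram in scratchpad (shared) memory, so atomics only contend within a block, and the per-block copies are merged into the global histogram afterwards. This is an illustration, not one of the kernels evaluated in [1] or [2].

     #define NUM_BINS 256   // 8-bit pixels, assumed for illustration

     __global__ void histogram_replicated(const unsigned char *image, int n,
                                          unsigned int *hist) {
         __shared__ unsigned int sub_hist[NUM_BINS];        // per-block replica
         for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
             sub_hist[b] = 0;
         __syncthreads();

         // Load pixel, update votes: atomics only contend within this block
         for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
              i += gridDim.x * blockDim.x)
             atomicAdd(&sub_hist[image[i]], 1u);
         __syncthreads();

         // Merge the block-private histogram into the global one
         for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
             atomicAdd(&hist[b], sub_hist[b]);
     }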

  8. Scratchpad memory layout
     • Scratchpad memory is divided into 32 banks
     • Each bank has 32 lock-bits, 1024 in total
     (figure: streaming multiprocessor with warp schedulers, dispatch units, instruction cache, register file, cores, LD/ST units, SFUs, and the banked scratchpad memory / L1 cache with per-bank lock-bits)
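     To make the layout concrete, a small host-side sketch of the addressing, assuming the bank is selected by the five lowest word-address bits and, purely as an assumption chosen to match the 32 x 32 lock-bit layout in the figure, the lock-bit by the next five bits:

     #include <cstdio>
     #include <cstdint>

     // 32 banks; 32 lock-bits per bank (1024 in total) to serialize atomics.
     static unsigned bank_of(uint32_t word_addr) { return word_addr & 31u; }        // bits 4..0
     static unsigned lock_of(uint32_t word_addr) { return (word_addr >> 5) & 31u; } // bits 9..5 (assumed)

     int main() {
         for (uint32_t id = 0; id < 4; ++id) {
             uint32_t addr = 32u * id;   // stride-32 pattern, see slide 17
             printf("thread %u: word %3u -> bank %2u, lock-bit %2u\n",
                    id, addr, bank_of(addr), lock_of(addr));
         }
         return 0;   // every access lands in bank 0: a bank conflict
     }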

  9. Application mapping: CNN
     • Convolutional Neural Network (CNN) for speed sign detection and recognition
     • GTX 460: 35 fps; Tegra X1: ~20 fps
     • Network: 720x1280 input → Layer 1: 6x6 conv. with 2x2 subsample (6x358x638) → Layer 2: 6x6 conv. with 2x2 subsample (16x177x317) → Layer 3: 5x5 conv. (80x173x313) → Layer 4: 1x1 conv. (8x173x313), object category + position at (x,y) (a single-layer sketch follows below)
     [4] Speed Sign Detection and Recognition by Convolutional Neural Networks
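     As an illustration of the per-layer work, a minimal CUDA sketch of one 6x6 convolution with 2x2 subsampling over a single input and output feature map, one thread per output pixel. The activation function, data layout and single-feature-map simplification are assumptions here; the optimized multi-feature-map kernels are the subject of [4].

     #define K 6   // 6x6 convolution kernel, as in layers 1 and 2

     __global__ void conv_subsample(const float *in, int in_w,
                                    const float *weights, float bias,
                                    float *out, int out_w, int out_h) {
         int ox = blockIdx.x * blockDim.x + threadIdx.x;
         int oy = blockIdx.y * blockDim.y + threadIdx.y;
         if (ox >= out_w || oy >= out_h) return;

         // 2x2 subsampling: output pixel (ox,oy) starts at input pixel (2*ox, 2*oy);
         // the input is assumed large enough that all reads stay in bounds.
         int ix = 2 * ox, iy = 2 * oy;
         float acc = bias;
         for (int ky = 0; ky < K; ky++)
             for (int kx = 0; kx < K; kx++)
                 acc += weights[ky * K + kx] * in[(iy + ky) * in_w + (ix + kx)];

         out[oy * out_w + ox] = tanhf(acc);   // assumed sigmoid-like activation
     }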

  10. Understanding GPUs
      • Modeling of the GPU L1 cache
      • Cache bypassing
      • Transit model

  11. Understanding GPUs: L1 cache modeling
      • GPU cache model covering:
        • Execution model (threads, thread blocks)
        • Memory latencies
        • MSHRs (pending memory requests)
        • Cache associativity
      [5] A Detailed GPU Cache Model Based on Reuse Distance Theory
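      The central quantity in the model is the reuse distance: the number of distinct cache lines accessed between two accesses to the same line. A minimal host-side sketch of that computation is below; the model in [5] layers the GPU execution model, latencies, MSHRs and associativity on top of such a profile.

      #include <cstdio>
      #include <cstdint>
      #include <set>
      #include <unordered_map>
      #include <vector>

      // Reuse distance per access: -1 on a cold (first) access, otherwise the number
      // of distinct lines touched since the previous access to the same line.
      std::vector<long> reuse_distances(const std::vector<uint64_t> &trace) {
          std::unordered_map<uint64_t, size_t> last;   // line -> time of last access
          std::set<size_t> live;                       // last-access times, one per line
          std::vector<long> rd;
          for (size_t t = 0; t < trace.size(); ++t) {
              auto it = last.find(trace[t]);
              if (it == last.end()) {
                  rd.push_back(-1);
              } else {
                  // count lines whose last access came after the previous access to this line
                  rd.push_back((long)std::distance(live.upper_bound(it->second), live.end()));
                  live.erase(it->second);
              }
              live.insert(t);
              last[trace[t]] = t;
          }
          return rd;
      }

      int main() {
          std::vector<uint64_t> trace = {0, 1, 2, 0, 1, 3, 0};
          for (long d : reuse_distances(trace)) printf("%ld ", d);   // -1 -1 -1 2 2 -1 2
          printf("\n");
      }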

  12. L1 cache model – results
      • Mean absolute error of 6.4%

  13. Understanding GPUs: cache bypassing
      [6] Adaptive and Transparent Cache Bypassing for GPUs
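      The mechanism in [6] decides adaptively, at run time, which accesses bypass the L1 cache. As a simplified illustration of what bypassing means at the instruction level (not the paper's mechanism), a global load can be given the PTX .cg cache operator so it is cached in L2 only, the same effect as compiling with -Xptxas -dlcm=cg on Fermi/Kepler:

      // Load a float from global memory, bypassing L1 (cached in L2 only).
      __device__ float load_bypass_l1(const float *p) {
          float v;
          asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
          return v;
      }

      __global__ void copy_streaming(const float *__restrict__ in, float *out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              out[i] = load_bypass_l1(in + i);   // streaming data with no reuse: keep it out of L1
      }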

  14. Cache bypassing – results
      [6] Adaptive and Transparent Cache Bypassing for GPUs

  15. Understanding GPUs: Transit model
      • Transit model: computation and memory sub-systems
      [7] Transit: A Visual Analytical Model for Multithreaded Machines

  16. Architecture modifications
      • Scratchpad memory hash functions
      • R-GPU

  17. GPU modifications: bank & lock conflicts
      • With addr = 32 * id, all addresses map to bank 0, so the accesses (and their lock-bits) conflict

  18. Resolving bank conflicts: hash functions
      • addr = 32 * id, now with a hash function on the bank index
      [8] Simulation and Architecture Improvements of Atomic Operations on GPU Scratchpad Memory

  19. Resolving bank conflicts: hash functions
      • addr = 33 * id, now with a configurable hash function on the bank index (a configurable XOR hash sketch follows below)
      [9] Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs
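      A host-side sketch of a configurable XOR hash on the 5-bit bank index, in the spirit of [9]: each bank bit is the parity of a programmable subset of word-address bits, selected by a mask. The particular masks below are illustrative, not the configurations from the paper.

      #include <cstdio>
      #include <cstdint>

      static unsigned parity(uint32_t x) { x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1; return x & 1u; }

      // bank bit b = parity of the address bits selected by mask[b]
      static unsigned xor_hash_bank(uint32_t word_addr, const uint32_t mask[5]) {
          unsigned bank = 0;
          for (int b = 0; b < 5; ++b)
              bank |= parity(word_addr & mask[b]) << b;
          return bank;
      }

      int main() {
          // Identity configuration: bank = word_addr % 32, as in a plain banked scratchpad.
          const uint32_t identity[5] = {1u << 0, 1u << 1, 1u << 2, 1u << 3, 1u << 4};
          // Example configuration: fold address bits 5..9 into the bank bits.
          const uint32_t folded[5]   = {(1u << 0) | (1u << 5), (1u << 1) | (1u << 6),
                                        (1u << 2) | (1u << 7), (1u << 3) | (1u << 8),
                                        (1u << 4) | (1u << 9)};
          for (uint32_t id = 0; id < 8; ++id) {
              uint32_t addr = 32u * id;                      // the conflicting pattern from slide 17
              printf("id %u: plain bank %u, hashed bank %u\n", id,
                     xor_hash_bank(addr, identity),          // always 0: a 32-way conflict
                     xor_hash_bank(addr, folded));           // spread over banks 0..7
          }
      }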

  20. Architecture modifications: R-GPU
      (figure: R-GPU organization – rows of cores and LD/ST units connected through communication registers (CR), next to the shared memory / L1 cache)

  21. Code generation: ASET & Bones
      • Sequential C code – how to generate efficient code for all these devices?
      • Targets: CPU-OpenMP, CPU-OpenCL-AMD, CPU-OpenCL-Intel, GPU-CUDA, GPU-OpenCL-AMD, Multi-GPU (CUDA / OpenCL), XeonPhi-OpenCL, FPGA
      [10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

  22. Code generation: ASET & Bones
      • Tool flow: sequential C code → 'ASET' algorithmic species extraction tool (LLVM, PET) → species-annotated C code → skeleton-based 'Bones' compiler → CPU-OpenMP, CPU-OpenCL-AMD, CPU-OpenCL-Intel, GPU-CUDA, GPU-OpenCL-AMD, Multi-GPU (CUDA / OpenCL), XeonPhi-OpenCL, FPGA (an annotation sketch follows below)
      [10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
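      For illustration, a species-annotated loop in the style used by ASET and Bones. The classification idea (an element-wise mapping over N elements) is from [10]; the exact pragma syntax accepted by the tools may differ from what is shown here.

      /* Species annotation (approximate syntax): A is read element-wise,
       * B is written element-wise, so Bones can instantiate the matching
       * data-parallel skeleton for CUDA, OpenCL or OpenMP. */
      #pragma species kernel 0:N-1|element -> 0:N-1|element
      for (int i = 0; i < N; i++)
          B[i] = 2 * A[i] + 1;
      #pragma species endkernel example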

  23. Example C to CUDA transformation
      Example 1: sum (sequential C code):

      int sum = 0;
      for (int i = 0; i < N; i++) {
          sum = sum + in[i];
      }

      (figure: pairwise reduction tree over the values 3 1 7 0 4 1 6 3)

      Transformed into an optimized CUDA reduction kernel:

      template <unsigned int blockSize>
      __device__ void warpReduce(volatile int *sm, unsigned int tid) {
          if (blockSize >= 64) sm[tid] += sm[tid + 32];
          if (blockSize >= 32) sm[tid] += sm[tid + 16];
          if (blockSize >= 16) sm[tid] += sm[tid + 8];
          if (blockSize >=  8) sm[tid] += sm[tid + 4];
          if (blockSize >=  4) sm[tid] += sm[tid + 2];
          if (blockSize >=  2) sm[tid] += sm[tid + 1];
      }

      template <unsigned int blockSize>
      __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
          extern __shared__ int sm[];
          unsigned int tid = threadIdx.x;
          unsigned int i = blockIdx.x * (blockSize * 2) + tid;
          unsigned int gridSize = blockSize * 2 * gridDim.x;
          sm[tid] = 0;
          while (i < n) {
              sm[tid] += g_idata[i];
              sm[tid] += g_idata[i + blockSize];
              i += gridSize;
          }
          __syncthreads();
          if (blockSize >= 512) { if (tid < 256) { sm[tid] += sm[tid + 256]; } __syncthreads(); }
          if (blockSize >= 256) { if (tid < 128) { sm[tid] += sm[tid + 128]; } __syncthreads(); }
          if (blockSize >= 128) { if (tid <  64) { sm[tid] += sm[tid +  64]; } __syncthreads(); }
          if (tid < 32) { warpReduce<blockSize>(sm, tid); }
          if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
      }
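      The slide shows only the kernels. A hedged host-side sketch of how such a reduction is typically launched follows; the names d_in, d_partial, h_partial and the grid size are assumptions, and n is assumed to be a multiple of 2*blockSize so the i+blockSize read stays in bounds.

      const unsigned int blockSize = 256;
      const int blocks = 64;                                  // illustrative grid size
      size_t smem = blockSize * sizeof(int);                  // dynamic shared memory for sm[]
      reduce6<blockSize><<<blocks, blockSize, smem>>>(d_in, d_partial, n);
      cudaMemcpy(h_partial, d_partial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
      int sum = 0;
      for (int b = 0; b < blocks; ++b) sum += h_partial[b];   // final reduction of the partial sums on the CPU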
