Multi-Processors and GPU Philipp Koehn 2 May 2018 Philipp Koehn Computer Systems Fundamentals: Multi-Processors and GPU 2 May 2018

Predicted CPU Clock Speed
Clock speed 1971: 740 kHz, 2018: 45 GHz
Source: Kurzweil, "The Singularity is Near" (2005)

Actual CPU Clock Speed
Clock speed 2018: 3 GHz

Why?
Intel estimate, around 2000: 400 kW by 2018?

Moore’s Law
Number of transistors per chip still grows exponentially

What to do with the Transistors?
• More parallelism → faster execution of instructions
• More processors on a chip

multi-processors

Intel Core i7: Quad-Core

Intel Xeon Phi: 72 cores (2017)

Handling Multiple Processes
• Kernel can keep multiple processes running
• Each process is assigned to a core
  – each core has a local cache
  – all cores share a common cache, common memory
• Synchronization between cores is not trivial, e.g., cache coherence

More Parallelism
• Multiple processes not always the best way to parallelize
• Often, parallel execution within a process would be helpful
• Example: matrix multiplication
  – loops over different parts of the data
  – instructions highly independent → can be executed in parallel
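
To make the independence in the matrix multiplication example concrete, here is a minimal C++ sketch. The function name multiply_rows and the row-major std::vector layout are assumptions chosen for illustration. Each call writes only its own rows of the result, so calls on disjoint row ranges do not interfere and could run at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply the n x n row-major matrices a and b, writing only rows
// [row_begin, row_end) of the result c. Every output element is computed
// from a and b alone, so disjoint row ranges never touch the same element.
void multiply_rows(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, std::size_t n,
                   std::size_t row_begin, std::size_t row_end) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    assert(row_begin <= row_end && row_end <= n);
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Calling multiply_rows once for rows [0, n) gives the same result as calling it for [0, n/2) and [n/2, n) in either order.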

Multi-Threading
• Parallel execution within process
• No switching of process context (e.g., virtual address space)
• Supported by various libraries
  – pthread in C/C++
  – thread in C++11
  – thread in Python
• Programmer has to take care of conflicts
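
A small sketch of what "taking care of conflicts" can look like with the C++11 thread library. The function parallel_sum and its chunking scheme are illustrative assumptions, not taken from the slides. Each thread sums its own slice privately and touches the shared total only while holding a mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Sum the values using num_threads threads. Each thread adds up its own
// slice into a private variable, then adds that partial sum to the shared
// total while holding the mutex.
long parallel_sum(const std::vector<int>& values, unsigned num_threads) {
    assert(num_threads > 0);
    long total = 0;          // shared by all threads
    std::mutex total_mutex;  // guards every write to total
    const std::size_t chunk = (values.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(values.size(), t * chunk);
            const std::size_t end = std::min(values.size(), begin + chunk);
            long partial = 0;
            for (std::size_t i = begin; i < end; ++i) partial += values[i];
            std::lock_guard<std::mutex> lock(total_mutex);
            total += partial;
        });
    }
    for (std::thread& w : workers) w.join();
    return total;
}
```

Without the lock, two threads could read the same old value of total and one update would be lost. Depending on the toolchain, building this may require a threading flag such as -pthread.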

computer graphics

Computer Graphics

tl;dr
• Given
  – 3D models of objects
  – lighting, textures
  – ray tracing
• Lots of vector and matrix operations
• Color value for each pixel on the screen has to be computed
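
As one example of the vector and matrix operations involved, the sketch below moves a vertex with a 4x4 matrix in homogeneous coordinates. The type names Vec4 and Mat4 and the functions transform and translation are assumptions for illustration. A real graphics pipeline applies many such products per vertex and per frame.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;                 // (x, y, z, w)
using Mat4 = std::array<std::array<float, 4>, 4>;  // indexed as m[row][col]

// Matrix-vector product: each output component is a dot product of one
// matrix row with the input vector.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};  // starts as all zeros
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[row] += m[row][col] * v[col];
    return out;
}

// Identity matrix with the offset (tx, ty, tz) in the last column.
Mat4 translation(float tx, float ty, float tz) {
    Mat4 m{};  // starts as all zeros
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    m[0][3] = tx;
    m[1][3] = ty;
    m[2][3] = tz;
    return m;
}
```

Applying translation(10, 20, 30) to the point (1, 2, 3, 1) shifts it to (11, 22, 33, 1).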

High Demand
• Computer games on regular PCs
• Game consoles
  – Atari (1972-1996)
  – Nintendo/Wii (since 1977)
  – PlayStation (since 1994)
  – Xbox (since 2001)
• 100s of millions sold

history

VGA Controller

GPU

Co-Processor
• CPU handles the bulk of the complexity
• GPU focuses on specific problems

Graphics Pipeline
Initially: dedicated hardware for core steps

Unified GPU Architecture

gpu

Streaming Multiprocessor (SM)
• Fetches instruction (I-Cache)
• Has to apply it over a vector of data
• Each vector element is processed in one thread (MT Issue)
• Thread is handled by scalar processor (SP)
• Special function units (SFU)

Taxonomy
• SISD (single instruction, single data)
  – uni-processors (6502, Intel until 1990s)
• MIMD (multiple instruction, multiple data)
  – Intel Core i7
  – multiple cores on a chip
  – each core runs its own instructions on its own data
• SIMD (single instruction, multiple data)
  – Streaming Multi-Processors
  – multiple cores on a chip
  – same instruction executed on different data

GPU Architecture

Graphics Programming
• Libraries that support all steps of graphics pipeline
• Open standard: OpenGL
• Microsoft: Direct3D
• Libraries handle mapping to GPU hardware

Direct3D Pipeline

more uses for gpus

Deep Learning
• The latest machine learning hype
• Computationally
  – lots of matrix multiplications
  – lots of vector operations
  – massive data sets
• Just what GPUs are good at

CUDA
• Extension of C++ to support general GPU programming
• Fairly low-level
  – identify parts of program to be handled by GPU
  – define function to be executed by a thread
  – define how many threads are used
• Key concepts
  – kernel = function to be executed by a thread
  – thread block = set of threads to be executed in parallel
  – thread grid = set of thread blocks

Example
• Serial loop
    void example(int n, float alpha, float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }
    example(n, 2.0, x, y);
• Parallel with CUDA
    #define THREADS 256
    // kernel: each thread updates at most one element
    __global__ void cuda_example(int n, float alpha, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = alpha * x[i] + y[i];  // threads past the end do nothing
    }
    // round up so that nblocks * THREADS >= n; x and y must be in GPU memory
    int nblocks = (n + THREADS - 1) / THREADS;
    cuda_example<<<nblocks, THREADS>>>(n, 2.0, x, y);
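
The index arithmetic in the CUDA example can be checked on a CPU. The sketch below is plain C++, not CUDA: the hypothetical functions blocks_for and emulate_launch replace the parallel launch with two nested loops that visit every (block, thread) pair in sequence. It shows why the block count is rounded up and why the kernel needs the i < n guard.

```cpp
#include <cassert>
#include <vector>

// Number of blocks so that blocks * threads_per_block >= n (ceiling division).
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// CPU stand-in for the kernel launch. The loops run one after another here;
// on a GPU the same (block, thread) pairs would execute in parallel.
// Returns how many threads failed the bounds check and did no work.
int emulate_launch(int n, float alpha, const std::vector<float>& x,
                   std::vector<float>& y, int threads_per_block) {
    assert(static_cast<int>(x.size()) >= n && static_cast<int>(y.size()) >= n);
    const int nblocks = blocks_for(n, threads_per_block);
    int idle = 0;
    for (int block_idx = 0; block_idx < nblocks; ++block_idx) {
        for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
            // same formula as blockIdx.x * blockDim.x + threadIdx.x
            const int i = block_idx * threads_per_block + thread_idx;
            if (i < n)
                y[i] = alpha * x[i] + y[i];
            else
                ++idle;
        }
    }
    return idle;
}
```

With n = 1000 and 256 threads per block, four blocks are launched (1024 threads), and the last 24 threads fall outside the array and are stopped by the guard.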

Memory Levels

multiprocessor architecture

Nvidia Titan V
• 80 streaming multiprocessors, 5120 cores, 640 tensor cores
• Clock speed 1455 MHz
• Memory size 12 GB, bandwidth 650 GB/sec
• Retail price $2999
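
A little arithmetic on the Titan V figures above, written as C++ constants so it can be checked. The per-multiprocessor counts follow directly from the listed numbers. The peak throughput is an estimate under a stated assumption that is not on the slide: each core completes one multiply-add per clock cycle, counted as two floating-point operations.

```cpp
#include <cassert>

// Figures as listed for the Nvidia Titan V.
constexpr int kStreamingMultiprocessors = 80;
constexpr int kCores = 5120;
constexpr int kTensorCores = 640;
constexpr double kClockHz = 1455e6;

static_assert(kCores % kStreamingMultiprocessors == 0,
              "cores divide evenly among the multiprocessors");

constexpr int cores_per_sm() { return kCores / kStreamingMultiprocessors; }

constexpr int tensor_cores_per_sm() {
    return kTensorCores / kStreamingMultiprocessors;
}

// Assumption: one multiply-add per core per cycle, counted as 2 operations.
constexpr double peak_flops() { return kCores * kClockHz * 2.0; }
```

This gives 64 cores and 8 tensor cores per streaming multiprocessor, and about 14.9 × 10^12 single-precision operations per second under the stated assumption.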

Multithreaded Multiprocessor

Single Instruction, Multiple Thread
• Each scalar processor
  – executes same instruction
  – on different data
  – has own register file
• Branch synchronization
  – if threads diverge on conditional branches → execute different paths separately
• Shared memory
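
The cost of branch divergence can be illustrated with a CPU emulation. This is a simplified model, not how any particular GPU implements it: the hypothetical function run_diverging_branch treats each vector element as one thread's data, evaluates the condition for all threads, then runs each side of the branch as a separate pass in which only the matching threads are active.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates   if (x < 0) x = -x; else x = 2 * x;   for threads in lockstep.
// Returns the number of passes issued: 1 if all threads agree on the
// condition, 2 if they diverge, 0 if there are no threads.
int run_diverging_branch(std::vector<int>& data) {
    std::vector<bool> takes_then(data.size());
    bool any_then = false;
    bool any_else = false;
    for (std::size_t i = 0; i < data.size(); ++i) {
        takes_then[i] = data[i] < 0;
        if (takes_then[i]) any_then = true; else any_else = true;
    }
    int passes = 0;
    if (any_then) {  // pass for the "then" path, other threads masked off
        for (std::size_t i = 0; i < data.size(); ++i)
            if (takes_then[i]) data[i] = -data[i];
        ++passes;
    }
    if (any_else) {  // pass for the "else" path, other threads masked off
        for (std::size_t i = 0; i < data.size(); ++i)
            if (!takes_then[i]) data[i] = 2 * data[i];
        ++passes;
    }
    return passes;
}
```

For the data {-1, 2, -3, 4} the threads diverge and two passes are needed; for {1, 2, 3} all threads take the same path and one pass suffices.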

instructions

Basics
• Design more similar to MIPS than x86
• Various data types, each in different sizes
  – untyped bit arrays (8, 16, 32, 64 bits)
  – unsigned integers (8, 16, 32, 64 bits)
  – signed integers (8, 16, 32, 64 bits)
  – floating point (16, 32, 64 bits)

Basic Instructions
• Arithmetic instructions operate on registers
  – add d, a, b → d = a+b
  – mul d, a, b → d = a*b
  – mad d, a, b, c → d = a*b+c
  – mov d, a → d = a
• Special functions handled by SFU processors
  – square root (sqrt)
  – sine (sin)
  – cosine (cos)
  – binary logarithm (lg2)
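
The mad instruction matches the loop body of the earlier example, y[i] = alpha*x[i] + y[i], which is one multiply-add per element. The C++ sketch below uses std::fma as a stand-in; whether the hardware instruction rounds once, as std::fma does, is not stated here and is treated as an assumption. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// d = a*b + c, using the standard fused multiply-add as a stand-in for mad.
float mad(float a, float b, float c) { return std::fma(a, b, c); }

// One element of y = alpha*x + y expressed as a single mad.
float saxpy_element(float alpha, float x, float y) { return mad(alpha, x, y); }
```

For example, saxpy_element(2, 3, 4) computes 2*3 + 4 = 10.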

Memory Access
• Different memory spaces (global, shared, local, const)
• Different data sizes (8, 16, 32, 64 bits)
• Load (ld) and store (st)
• Atomic memory read, write, add, min, max, and, ...
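
Atomic operations let many threads update one memory location without losing updates. The sketch below shows the idea on a CPU with std::atomic rather than GPU instructions. Atomic add is available directly as fetch_add; atomic max is built here from a compare-exchange loop. The names atomic_max, CountAndMax, and count_and_max are illustrative assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <limits>
#include <thread>
#include <vector>

// Atomically raise target to at least value.
void atomic_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    // If the exchange fails, compare_exchange_weak stores the current value
    // in seen, so the loop re-checks against what another thread just wrote.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
    }
}

struct CountAndMax {
    int count;
    int max;
};

// Thread t handles elements t, t + num_threads, t + 2 * num_threads, ...
CountAndMax count_and_max(const std::vector<int>& values, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<int> count{0};
    std::atomic<int> largest{std::numeric_limits<int>::min()};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < values.size(); i += num_threads) {
                count.fetch_add(1);              // atomic add
                atomic_max(largest, values[i]);  // atomic max
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return {count.load(), largest.load()};
}
```

Every thread updates the same two variables, yet the final count and maximum are the same as in a serial loop.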

Control Flow
• Branch (conditional on register value = 0)
• Subroutine call: call, ret
• Synchronization: bar.sync forces all threads to synchronize
• Terminate thread: exit