Operating System Services for High Throughput Processors Mark Silberstein EE, Technion
Traditional Systems Software Stack: applications run on top of the OS, which runs on the CPU. [Feb 2014, Mark Silberstein - EE, Technion]
Modern Systems Software Stack: accelerated applications run on top of the OS, above heterogeneous hardware: CPUs, GPUs, FPGAs, DSPs, and manycore processors.
GPUs make a difference... The top 10 fastest supercomputers use GPUs, across domains such as HCI, vision, physics, meteorology, bioinformatics, graph algorithms, chemistry, linear algebra, and finance.
GPUs make a difference, but only in HPC! The HPC domains above benefit; for web servers, network services, antivirus, and file search: ???
The software-hardware gap is widening: accelerated applications sit on inadequate OS abstractions and management mechanisms, above increasingly diverse hardware: CPUs, GPUs, FPGAs, DSPs, manycore processors, and hybrid CPU-GPU chips.
Fundamentals in question: accelerators ≡ co-processors, or accelerators ≡ peer-processors?
Software stack for accelerated applications: accelerated applications on top of the OS, which provides accelerator abstractions and mechanisms, above the hardware: CPUs, GPUs, FPGAs, DSPs, and manycore processors.
Software stack for accelerator applications: accelerator applications (centralized and distributed); accelerator I/O services (network, files); accelerator OS support (inter-processor I/O, OS file system, network APIs); accelerator abstractions and mechanisms; hardware support for the OS; and the hardware itself: CPUs, GPUs, FPGAs, DSPs, and manycore processors.
This talk: the accelerator abstractions and mechanisms layer [ASPLOS'13, TOCS'14], with GPUs connected to storage and the network.
● GPU 101 ● GPUfs: File I/O support for GPUs ● Future work
Hybrid CPU-GPU architecture 101: a CPU with its own memory, and a GPU with its own separate memory.
Co-processor model: the CPU copies input data from CPU memory to GPU memory, launches a GPU kernel that performs the computation on the GPU, and copies the results back to CPU memory.
Building systems with GPUs is hard. Why?
GPU kernels are isolated: the GPU runs the parallel algorithm, while the CPU must manage data transfers, kernel invocation, and memory management.
Example: accelerating a photo collage application running on CPUs: While(Unhappy()){ Read_next_image_file(); Decide_placement(); Remove_outliers(); }
Offloading computations to the GPU: parts of the loop body are moved from the CPU to the GPU.
Offloading computations to the GPU: the CPU performs the data transfer, starts the kernel, and waits for kernel termination.
Overheads: kernel invocation latency, data transfer overhead (copy to GPU, copy from GPU), and synchronization.
Working around the overheads takes low-level GPU-CPU tricks: asynchronous invocation, data reuse management, double buffering, and buffer size optimization.
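The double-buffering trick above can be sketched in plain Python. This is an illustrative model only: copy_to_gpu and gpu_kernel stand in for a DMA transfer and an offloaded kernel, and are not real GPU APIs. While the kernel processes one buffer, the next chunk is copied in the background.

```python
from concurrent.futures import ThreadPoolExecutor

def copy_to_gpu(chunk):
    # Stands in for an asynchronous DMA transfer to GPU memory.
    return list(chunk)

def gpu_kernel(buf):
    # Stands in for the offloaded computation (here: square each element).
    return [x * x for x in buf]

def process_double_buffered(data, chunk_size):
    """Overlap 'transfers' and 'compute' using two in-flight buffers."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        next_buf = copier.submit(copy_to_gpu, chunks[0])
        for i in range(len(chunks)):
            buf = next_buf.result()              # wait for the in-flight copy
            if i + 1 < len(chunks):              # start copying the next chunk
                next_buf = copier.submit(copy_to_gpu, chunks[i + 1])
            results.extend(gpu_kernel(buf))      # compute overlaps that copy
    return results
```

The application still gets the same answer as the naive sequential version; only the schedule changes, which is exactly why this plumbing feels like it belongs below the application.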
Management overhead: why do we need to deal with these low-level system details?
The reason is... GPUs are peer-processors: they need OS I/O services.
GPUfs: application view. CPUs and GPUs (GPU1, GPU2, GPU3) share a system-wide namespace with a POSIX (CPU)-like API and persistent storage: each processor can open("shared_file"), mmap(), and write() through GPUfs, backed by the host file system.
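The idea of serving file data to the GPU through a local cache can be modeled in a few lines of Python. This is a toy sketch: the class name, the fetch_page callback (standing in for the GPU-to-host request that pages data in), and the page size are all illustrative, not the actual GPUfs API.

```python
PAGE = 8  # tiny illustrative page size; real page caches use e.g. 4 KiB

class BufferCache:
    """Toy model of a GPU-resident buffer cache over host files."""

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page   # callback: (path, page_no) -> bytes
        self.pages = {}                # (path, page_no) -> cached page
        self.misses = 0

    def read(self, path, offset, size):
        out = b""
        while size > 0:
            page_no, in_page = divmod(offset, PAGE)
            key = (path, page_no)
            if key not in self.pages:  # miss: page data in from the host
                self.pages[key] = self.fetch_page(path, page_no)
                self.misses += 1
            take = min(size, PAGE - in_page)
            out += self.pages[key][in_page:in_page + take]
            offset += take
            size -= take
        return out
```

Repeated or random accesses to the same pages are then served from GPU memory without crossing the slow CPU-GPU link again, which is the point of the buffer cache on the next slides.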
Accelerating the collage app with GPUfs: no CPU management code; the GPU calls open/read directly through GPUfs.
Accelerating the collage app with GPUfs: read-ahead into the GPUfs buffer cache overlaps computations and transfers.
Accelerating the collage app with GPUfs: it also supports data reuse and random data access.
Understanding the hardware
GPU hardware characteristics: parallelism, low serial performance, heterogeneous memory.
GPU hardware parallelism, level 1: multi-core. The GPU consists of many multiprocessors (MPs) sharing GPU memory.
GPU hardware parallelism, level 2: SIMD. Each multiprocessor executes wide SIMD vectors.
GPU hardware parallelism, level 3: parallelism for latency hiding. Each multiprocessor holds the execution state of many threads (T1, T2, T3, ...); as each thread issues a memory read (R 0x01, R 0x04, R 0x08, ...), the MP switches to another ready thread, hiding memory latency behind useful work.
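How many thread contexts latency hiding requires can be estimated with Little's law: requests in flight = latency × issue rate. A small sketch, with illustrative numbers that are not from the slides:

```python
def contexts_to_hide_latency(mem_latency_cycles, requests_per_cycle):
    """Little's law: requests that must be in flight to keep memory busy.

    With one outstanding request per thread, this is also the number of
    thread contexts an MP needs so it always has a ready thread to run.
    """
    return mem_latency_cycles * requests_per_cycle
```

For example, a 400-cycle memory latency with one request issued per cycle needs 400 requests in flight, i.e. on the order of hundreds of thread contexts per MP, which is why GPUs keep so much execution state resident.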
Putting it all together: 3 levels of hardware parallelism. The GPU has multiple MPs, each MP executes SIMD vectors, and each MP keeps k thread contexts (Thread Ctx 1/State 1 through Thread Ctx k/State k) for latency hiding.
Software-hardware mapping: software threads (Thread 1 through Thread n) are mapped onto the MPs, SIMD lanes, and hardware thread contexts.
(1) 10,000s of concurrent threads! NVIDIA K20X GPU: 64 thread contexts × 14 MPs × 32 SIMD lanes = 28,672 concurrent threads.
(2) Each thread is slow: roughly 100× slower than a CPU thread.
(3) Heterogeneous memory: CPU memory delivers 10-32 GB/s, GPU memory delivers 250 GB/s (about 20× more), but the CPU-GPU interconnect delivers only 12 GB/s.
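The imbalance is easy to make concrete with the bandwidth figures on the slide: moving data over the 12 GB/s CPU-GPU link is roughly 20× slower than streaming it from 250 GB/s GPU memory. A quick Python check (treating 1 GB as 1e9 bytes):

```python
def transfer_time_s(n_bytes, bandwidth_gb_per_s):
    """Seconds to move n_bytes at the given bandwidth (GB = 1e9 bytes)."""
    return n_bytes / (bandwidth_gb_per_s * 1e9)

one_gb = 1e9
t_interconnect = transfer_time_s(one_gb, 12)    # CPU <-> GPU link
t_gpu_memory = transfer_time_s(one_gb, 250)     # GPU local memory
slowdown = t_interconnect / t_gpu_memory        # about 20.8x
```

This ratio is why keeping data resident on the GPU (and caching it there, as GPUfs does) matters so much more than micro-optimizing the kernel itself.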
GPUfs: a file system layer for GPUs. Joint work with Bryan Ford, Idit Keidar, and Emmett Witchel [ASPLOS'13, TOCS'14].
GPUfs: a principled redesign of the whole file system stack ● Modified FS API semantics for massive parallelism ● Relaxed distributed FS consistency for non-uniform memory ● GPU-specific implementation of synchronization primitives, read-optimized data structures, memory allocation, ...
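The relaxed-consistency bullet can be illustrated with a toy close-to-open (session-semantics) model: writes made through one open session become visible to other sessions only after close. This is a sketch of the general idea only, not the exact GPUfs consistency protocol; all names here are hypothetical.

```python
class WeaklyConsistentFile:
    """Toy close-to-open consistency: sessions see a snapshot at open,
    and local writes are published to other readers only at close."""

    def __init__(self):
        self.host_data = b""

    def open(self):
        # Each session gets a private snapshot of the current contents.
        return {"snapshot": self.host_data, "dirty": None}

    def write(self, session, data):
        session["dirty"] = data            # local update, not yet visible

    def read(self, session):
        if session["dirty"] is not None:   # a session sees its own writes
            return session["dirty"]
        return session["snapshot"]

    def close(self, session):
        if session["dirty"] is not None:   # changes become visible on close
            self.host_data = session["dirty"]
```

Deferring visibility to close is what lets thousands of GPU threads access files without coordinating with the CPU on every write, at the cost of strict POSIX consistency.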