GPUfs: Integrating a File System with GPUs
Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)
Traditional system architecture: applications run on top of the OS, which manages the CPU.
Modern system architecture: accelerated applications run on top of the OS and a heterogeneous mix of hardware: CPUs, GPUs, FPGAs, manycore processors, and hybrid CPU-GPU chips.
The software-hardware gap is widening: accelerated applications target GPUs, FPGAs, manycore processors, and hybrid CPU-GPU chips, but the OS manages only the CPU.
In place of OS support, applications rely on ad-hoc abstractions and management mechanisms for each accelerator.
On-accelerator OS support closes the programmability gap: accelerated applications and native accelerator applications run on top of on-accelerator OS support, which coordinates with the host OS across CPUs, GPUs, FPGAs, manycore processors, and hybrid CPU-GPU chips.
GPUfs: File I/O support for GPUs
● Motivation
● Goals
● Understanding the hardware
● Design
● Implementation
● Evaluation
Building systems with GPUs is hard. Why?
Goal of GPU programming frameworks: the GPU runs the parallel algorithm, while the CPU handles data transfers, GPU invocation, and memory management.
Headache for GPU programmers: the CPU-side code for data transfers, invocation, and memory management dominates. In half of the CUDA SDK 4.1 samples there are at least 9 CPU LOC per 1 GPU LOC.
GPU kernels are isolated: the parallel algorithm on the GPU cannot do I/O on its own; all data transfers, invocation, and memory management must be driven from the CPU.
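To make that ratio concrete, here is a minimal sketch of the CPU-side management a single kernel launch over file data typically needs. The file name, the trivial process() kernel, and the launch configuration are placeholders, not code from the talk.

    // CPU-side boilerplate: read a file, ship it to the GPU, run a kernel, copy results back.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void process(const char *in, char *out, size_t n) {   // the 1 GPU LOC of interest
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main() {
        FILE *f = fopen("input.dat", "rb");                          // error handling omitted
        fseek(f, 0, SEEK_END); size_t n = ftell(f); rewind(f);

        char *h_in = (char *)malloc(n), *h_out = (char *)malloc(n);  // host memory management
        fread(h_in, 1, n, f); fclose(f);

        char *d_in, *d_out;                                          // GPU memory management
        cudaMalloc(&d_in, n); cudaMalloc(&d_out, n);
        cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);           // data transfer in

        process<<<(n + 255) / 256, 256>>>(d_in, d_out, n);           // kernel invocation
        cudaDeviceSynchronize();

        cudaMemcpy(h_out, d_out, n, cudaMemcpyDeviceToHost);         // data transfer out
        cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
        return 0;
    }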
Example: accelerating a photo collage (http://www.codeproject.com/Articles/36347/Face-Collage)
    while (Unhappy()) {
        Read_next_image_file()
        Decide_placement()
        Remove_outliers()
    }
CPU implementation: the application runs this loop across multiple CPUs.
Offloading computations to the GPU: the loop is moved from the CPUs to the GPU.
Offloading computations to the GPU follows the co-processor programming model: the CPU transfers data to the GPU, starts a kernel, and waits for kernel termination before copying results back.
Kernel start/stop overheads: invocation latency, copy to GPU, cache flush, copy back to CPU, and synchronization.
Hiding the overheads requires asynchronous invocation, manual data reuse management, and double buffering: the CPU copies the next chunk to the GPU while the previous kernel is still running.
All of this adds implementation complexity and management overhead. Why do we need to deal with such low-level system details at all?
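For reference, a minimal sketch of the double-buffering pattern described above, assuming two CUDA streams and pinned host buffers; process_chunk(), the chunk size, and the launch configuration are illustrative placeholders.

    // Double buffering with two CUDA streams: overlap CPU reads and host->GPU copies with kernels.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void process_chunk(char *buf, size_t n) { /* parallel algorithm over one chunk */ }

    void process_file(FILE *f, size_t chunk, size_t nchunks) {
        char *h_buf[2], *d_buf[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost(&h_buf[i], chunk);        // pinned memory, required for async copies
            cudaMalloc(&d_buf[i], chunk);
            cudaStreamCreate(&s[i]);
        }
        for (size_t c = 0; c < nchunks; ++c) {
            int b = c & 1;                           // alternate between the two buffers
            cudaStreamSynchronize(s[b]);             // wait until buffer b is free again
            fread(h_buf[b], 1, chunk, f);            // CPU reads the next chunk from the file
            cudaMemcpyAsync(d_buf[b], h_buf[b], chunk, cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<256, 256, 0, s[b]>>>(d_buf[b], chunk);
        }
        for (int i = 0; i < 2; ++i) {
            cudaStreamSynchronize(s[i]);
            cudaFreeHost(h_buf[i]); cudaFree(d_buf[i]); cudaStreamDestroy(s[i]);
        }
    }

Even this simplified version is all management code; none of it advances the algorithm itself.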
The reason is: GPUs are peer processors, and they need OS services such as I/O.
GPUfs: application view. CPUs and GPUs (GPU1, GPU2, GPU3) all call open("shared_file"), write(), and mmap() through GPUfs against the host file system: a system-wide shared namespace with a POSIX (CPU)-like API and persistent storage.
Accelerating the collage app with GPUfs: no CPU management code; the GPU opens and reads image files directly through GPUfs.
The GPUfs buffer cache adds read-ahead, overlapping computations and transfers.
It also enables data reuse and random data access.
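The collage loop, rewritten as GPU-side code, looks roughly like the sketch below. The gopen/gread/gclose names come from the GPUfs API; their exact signatures, the O_GRDONLY flag, and the helper routines are assumptions made for illustration.

    // Collage loop as a GPU kernel using GPUfs (a sketch, not the talk's implementation).
    __device__ bool Unhappy();
    __device__ const char *next_image_name();             // hypothetical: picks the next image file
    __device__ void Decide_placement(char *img);
    __device__ void Remove_outliers();

    __global__ void collage_kernel(char *image_buf, size_t image_bytes) {
        while (Unhappy()) {
            int fd = gopen(next_image_name(), O_GRDONLY);     // open a file directly from the GPU
            gread(fd, /*offset=*/0, image_bytes, image_buf);  // served by the GPUfs buffer cache
            gclose(fd);
            Decide_placement(image_buf);
            Remove_outliers();
        }
    }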
Challenge: GPU ≠ CPU
Massive parallelism: parallelism is essential for performance in deeply multi-threaded, wide-vector hardware, with 23,000-31,000 active threads on GPUs of this generation (AMD HD5870, NVIDIA Fermi). (From M. Houston / A. Lefohn / K. Fatahalian, "A trip through the architecture of modern GPUs".)
Heterogeneous memory: GPUs inherently impose high bandwidth demands on memory. GPU memory delivers 288-360 GB/s versus 10-32 GB/s for CPU memory, while the CPU-GPU interconnect provides only 6-16 GB/s, roughly a 20x gap between local and remote memory.
How to build an FS layer on this hardware?
GPUfs: principled redesign of the whole file system stack
● Relaxed FS API semantics for parallelism
● Relaxed FS consistency for heterogeneous memory
● GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, ...
GPUfs high-level design: on the GPU, applications use the GPUfs file API through a GPU file I/O library built for massive parallelism; on the CPU, unchanged applications use the OS file API, with GPUfs hooks at the OS file system interface. A GPUfs distributed buffer cache (page cache) spans the heterogeneous CPU and GPU memories and is backed by the host file system on disk.
Buffer cache semantics: local or distributed file system data consistency?
GPUfs buffer cache: weak data consistency model
● close(sync)-to-open semantics, as in AFS: writes made by one GPU (write(1)) are not visible to the CPU or to another GPU's read(1) until the writer calls fsync(); later writes (write(2)) again stay local until the next sync.
● This fits the hardware: the remote-to-local memory performance ratio is similar to that of a distributed system.
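A small sketch of what close(sync)-to-open consistency means for code; as before, the call names come from the API, while signatures and flag names are assumed.

    // Producer on GPU2: data reaches the shared file only at gfsync()/gclose().
    __global__ void producer(const char *data, size_t n) {
        int fd = gopen("shared_file", O_GWRONLY);   // flag name assumed
        gwrite(fd, 0, n, data);                     // buffered in the GPU page cache, not yet visible
        gfsync(fd);                                 // now visible to the CPU and to other GPUs
        gclose(fd);
    }

    // Consumer on GPU1: sees the producer's data only if it opens the file after the sync.
    __global__ void consumer(char *buf, size_t n) {
        int fd = gopen("shared_file", O_GRDONLY);
        gread(fd, 0, n, buf);
        gclose(fd);
    }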
On-GPU file I/O API (details in the paper):
    open/close   →  gopen/gclose
    read/write   →  gread/gwrite
    mmap/munmap  →  gmmap/gmunmap
    fsync/msync  →  gfsync/gmsync
    ftrunc       →  gftrunc
Changes in the semantics are crucial.
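A sketch of the gmmap/gmsync path; only the call names come from the table above, and the flags and argument order follow their POSIX counterparts as an assumption.

    // Map a file into GPU memory, update it in place, and push the dirty pages back.
    __global__ void touch_file(size_t nbytes) {
        int fd = gopen("data.bin", O_GRDWR);             // flag name assumed
        char *p = (char *)gmmap(NULL, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        for (size_t i = threadIdx.x; i < nbytes; i += blockDim.x)
            p[i] += 1;                                   // modify mapped pages in place
        gmsync(p, nbytes, MS_SYNC);                      // propagate dirty pages toward the host FS
        gmunmap(p, nbytes);
        gclose(fd);
    }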
Implementation bits (details in the paper):
● Paging support
● Dynamic data structures and memory allocators
● Lock-free radix tree
● Inter-processor communication (IPC)
● Hybrid H/W-S/W barriers
● Consistency module in the OS kernel
~1.5K GPU LOC, ~600 CPU LOC
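The slides do not spell out the IPC mechanism. One common way to build a GPU-to-CPU request channel on this hardware generation, and a plausible sketch of what such a channel could look like, uses mapped pinned host memory that the GPU writes and a CPU thread polls; all names below are illustrative.

    // A GPU-to-CPU request channel over mapped, pinned host memory (illustrative sketch).
    #include <cuda_runtime.h>

    struct Request { volatile int ready; int opcode; size_t arg; };

    __global__ void gpu_side(Request *req) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            req->opcode = 1; req->arg = 4096;      // hypothetical "read 4096 bytes" request
            __threadfence_system();                // make the payload visible to the CPU first
            req->ready = 1;                        // then publish the request
        }
    }

    void cpu_side() {
        Request *req, *d_req;
        cudaHostAlloc((void **)&req, sizeof(Request), cudaHostAllocMapped);
        req->ready = 0;
        cudaHostGetDevicePointer((void **)&d_req, req, 0);
        gpu_side<<<1, 32>>>(d_req);
        while (!req->ready) { }                    // a daemon thread would service the request here
        cudaDeviceSynchronize();
        cudaFreeHost(req);
    }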
Evaluation: all benchmarks are written as GPU kernels, with no CPU-side development.
Matrix-vector product (inputs/outputs in files). Vector: 1x128K elements, page size = 2 MB, GPU = Tesla C2075. [Chart: throughput (MB/s), up to ~3500, vs. input matrix size (280-11200 MB) for CUDA pipelined, CUDA optimized, and GPU file I/O.]
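The slides do not show the benchmark kernel; the sketch below illustrates how a matrix-vector product over a file-resident matrix can be structured with GPU file I/O when the matrix does not fit in GPU memory. The gopen/gread/gclose signatures, flag name, row-per-block mapping, and scratch buffer are assumptions for illustration.

    // y = A*x where A lives in a file; each block streams one row at a time via gread.
    __global__ void matvec_from_file(const float *x, float *y, float *rowbuf,
                                     size_t rows, size_t cols) {
        float *row = rowbuf + blockIdx.x * cols;          // per-block scratch in GPU global memory
        int fd = gopen("matrix.bin", O_GRDONLY);          // flag name assumed
        for (size_t r = blockIdx.x; r < rows; r += gridDim.x) {
            gread(fd, r * cols * sizeof(float), cols * sizeof(float), (char *)row);
            __syncthreads();                              // row fully read before use
            float partial = 0.f;
            for (size_t c = threadIdx.x; c < cols; c += blockDim.x)
                partial += row[c] * x[c];
            atomicAdd(&y[r], partial);                    // y must be zero-initialized by the caller
            __syncthreads();                              // done with row before the next gread
        }
        gclose(fd);
    }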
Word frequency count in text
● Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree (English dictionary: 58,000 words)
Challenges:
● Dynamic working set
● Small files: lots of file I/O (33,000 files, 1-5 KB each)
● Unpredictable output size
Results:
                                      8 CPUs    GPU-vanilla    GPU-GPUfs
Linux source (33,000 files, 524MB)    6h        50m (7.2x)     53m (6.8x, 8% overhead)
Shakespeare (1 file, 6MB)             292s      40s (7.3x)     40s (7.3x)
GPUfs adds unbounded input/output size support.
GPUfs is the first system to provide native access to host OS services from GPU programs. Code is available for download at: https://sites.google.com/site/silbersteinmark/Home/gpufs (http://goo.gl/ofJ6J)
Our life would have been easier with:
● PCIe atomics
● Preemptive background daemons
● GPU-CPU signaling support
● In-GPU exceptions
● A GPU virtual memory API (host-based or device)
● Compiler optimizations for register-heavy libraries (this seems to have been accomplished in CUDA 5.0)
Sequential access to a file: 3 versions
● CUDA whole-file transfer: the CPU reads the file and transfers it to the GPU in one shot
● CUDA pipelined transfer: the CPU reads chunks and transfers each chunk to the GPU in a pipeline
● GPU file I/O: the GPU gmmap()s the file directly
Sequential read: throughput vs. page size. [Chart: throughput (MB/s), up to ~4000, vs. page size (16K-2M) for GPU file I/O, CUDA whole file, and CUDA pipeline.]