ASPLOS 2013
GPUfs: Integrating a file system with GPUs
Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)
Traditional System Architecture
[Diagram: applications run on the OS, which runs on the CPU]
Modern System Architecture
[Diagram: accelerated applications run on the OS, atop manycore CPUs, GPUs, FPGAs, and hybrid CPU-GPU processors]
Software-hardware gap is widening
[Diagram: the same hardware as before, but accelerated applications now rely on ad-hoc abstractions and OS management mechanisms rather than the OS proper]
On-accelerator OS support closes the programmability gap
[Diagram: native accelerator applications use on-accelerator OS support, which coordinates with the host OS across manycore CPUs, GPUs, FPGAs, and hybrid CPU-GPU processors]
GPUfs: File I/O support for GPUs
● Motivation
● Goals
● Understanding the hardware
● Design
● Implementation
● Evaluation
Building systems with GPUs is hard. Why?
Goal of GPU programming frameworks
[Diagram: the GPU runs the parallel algorithm; the CPU handles data transfers, GPU invocation, and memory management]
Headache for GPU programmers
[Diagram: same split as before, with the CPU-side plumbing highlighted]
In half of the CUDA SDK 4.1 samples, there are at least 9 lines of CPU code for every line of GPU code.
GPU kernels are isolated
[Diagram: the GPU's parallel algorithm is cut off from the data transfers, invocation, and memory management handled by the CPU]
Example: accelerating photo collage
http://www.codeproject.com/Articles/36347/Face-Collage

    While (Unhappy()) {
        Read_next_image_file()
        Decide_placement()
        Remove_outliers()
    }
CPU implementation
[Diagram: the loop above runs across multiple CPUs in the application]
Offloading computations to GPU
[Diagram: the compute-heavy part of the loop moves from the CPUs to the GPU; the rest, including the file reads, stays on the CPU]
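To make the CPU-side plumbing concrete, here is a minimal sketch of the host boilerplate a single offloaded step requires. The kernel, its "outlier" rule, and the sizes are hypothetical, not from the talk:

    // Host-side boilerplate for one offloaded step (illustrative only).
    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    __global__ void remove_outliers_kernel(float *img, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && img[i] > 1.0f) img[i] = 1.0f;   // placeholder "outlier" rule
    }

    int main(void) {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_img = (float *)malloc(bytes);        // host buffer
        memset(h_img, 0, bytes);                      // stand-in for image load
        float *d_img;
        cudaMalloc(&d_img, bytes);                    // device buffer
        cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);   // copy in
        remove_outliers_kernel<<<(n + 255) / 256, 256>>>(d_img, n);
        cudaMemcpy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);   // copy out
        cudaFree(d_img);
        free(h_img);
        return 0;
    }

Every allocation, transfer, and launch here is management code with nothing to do with the algorithm itself.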
Offloading computations to GPU
Co-processor programming model:
[Diagram: the CPU copies data to the GPU, starts the kernel, waits for kernel termination, and copies the results back]
Kernel start/stop overheads
[Diagram: every invocation pays CPU-side invocation latency, a copy to the GPU, a GPU cache flush, synchronization, and a copy back to the CPU]
Hiding the overheads
[Diagram: asynchronous invocation, manual data-reuse management, and double buffering overlap copies to the GPU with kernel execution]
Implementation complexity, management overhead
Asynchronous invocation, manual data-reuse management, and double buffering hide the overheads, but each adds code the programmer must write and tune.
Why do we need to deal with low-level system details at all?
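For scale, here is a minimal double-buffering sketch using CUDA streams. It is a generic illustration of the pattern, not code from the talk; process_chunk and the sizes are hypothetical:

    // Overlap host-to-device copies with kernel execution using two streams.
    #include <cuda_runtime.h>

    __global__ void process_chunk(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                   // stand-in for real work
    }

    // h_in should be pinned memory (cudaHostAlloc) for true async overlap.
    void process_stream(float *h_in, int total, int chunk) {
        float *d_buf[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; b++) {
            cudaMalloc(&d_buf[b], chunk * sizeof(float));
            cudaStreamCreate(&s[b]);
        }
        for (int off = 0, b = 0; off < total; off += chunk, b ^= 1) {
            int n = (total - off < chunk) ? total - off : chunk;
            // Stream b copies and computes while the other stream does too.
            cudaMemcpyAsync(d_buf[b], h_in + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(n + 255) / 256, 256, 0, s[b]>>>(d_buf[b], n);
            cudaMemcpyAsync(h_in + off, d_buf[b], n * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize();
        for (int b = 0; b < 2; b++) {
            cudaStreamDestroy(s[b]);
            cudaFree(d_buf[b]);
        }
    }

All of this is still pure management code, which is exactly the point of the next slide.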
The reason is...
GPUs are peer processors. They need OS services, including file I/O.
GPUfs: application view
[Diagram: CPUs and GPU1, GPU2, GPU3 all call open("shared_file"), write(), and mmap() on the same file in the host file system through GPUfs]
● System-wide shared namespace
● POSIX (CPU)-like API
● Persistent storage
Accelerating the collage app with GPUfs
No CPU management code: the GPU itself opens and reads the image files (open/read from the GPU).
Accelerating the collage app with GPUfs
[Diagram: the GPUfs buffer cache provides read-ahead, overlapping computation with data transfers]
Accelerating the collage app with GPUfs
[Diagram: the GPUfs buffer cache also provides data reuse and random data access]
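A sketch of what the collage loop could look like as a pure GPU kernel under GPUfs. It assumes the g* calls listed later in the talk; the helper functions, flag name, and sizes are illustrative, and real GPUfs calls are collective, made by the threads of a threadblock together:

    // Illustrative GPUfs-style kernel: each threadblock pulls image files
    // straight from the host file system, with no CPU-side transfer code.
    #define IMG_BYTES (4 * 1024)                     // hypothetical image size

    __device__ void decide_placement(unsigned char *img);   // hypothetical
    __device__ void remove_outliers(unsigned char *img);    // hypothetical
    __device__ const char *image_name(int i);                // hypothetical

    __global__ void collage_kernel(int nfiles, unsigned char *scratch) {
        unsigned char *img = scratch + (size_t)blockIdx.x * IMG_BYTES;
        for (int f = blockIdx.x; f < nfiles; f += gridDim.x) {
            int fd = gopen(image_name(f), O_RDONLY);  // open from the GPU
            gread(fd, 0, IMG_BYTES, img);             // positional read
            decide_placement(img);
            remove_outliers(img);
            gclose(fd);
        }
    }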
Challenge: GPU ≠ CPU
Massive parallelism
Parallelism is essential for performance on deeply multithreaded, wide-vector hardware: an AMD HD5870 keeps ~23,000 threads active; an NVIDIA Fermi keeps ~31,000.
(Figures from M. Houston, A. Lefohn, and K. Fatahalian, "A trip through the architecture of modern GPUs".)
Heterogeneous memory
GPUs inherently impose high bandwidth demands on memory.
[Diagram: GPU memory delivers 288-360 GB/s, roughly 20x the CPU memory's 10-32 GB/s, while the CPU-GPU link carries only 6-16 GB/s]
How do we build a file-system layer on this hardware?
GPUfs: principled redesign of the whole file system stack
● Relaxed FS API semantics, for parallelism
● Relaxed FS consistency, for heterogeneous memory
● GPU-specific implementations of synchronization primitives, lock-free data structures, memory allocation, ...
GPUfs high-level design
[Diagram: on the CPU, unchanged applications use the OS file API against the host file system; GPU applications use the GPUfs file API via a GPU-side file I/O library. GPUfs hooks into the OS file-system interface, and a distributed buffer cache (page cache) spans CPU and GPU memory, backed by the host file system on disk. Massive parallelism shapes the GPU-side API; heterogeneous memory shapes the buffer cache.]
Buffer cache semantics
Should the buffer cache provide local or distributed file-system data consistency?
GPUfs buffer cache: weak data consistency model
● close(sync)-to-open semantics, as in AFS
[Diagram: GPU1 performs write(1), fsync(), write(2); GPU2's subsequent open() and read(1) see the synced write(1), while unsynced data is not yet visible to the CPU]
The remote-to-local memory performance ratio (>>1) is similar to that of a distributed system, so a distributed-FS consistency model is the natural fit.
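A sketch of how close(sync)-to-open consistency plays out across two GPUs, mirroring the timeline above. The g* calls are from the API slide that follows, but the kernels, buffers, and flag names are illustrative:

    // Producer on GPU1: data becomes visible to other processors
    // only at gfsync()/gclose().
    __global__ void producer(char *buf, int len) {
        int fd = gopen("shared_file", O_WRONLY);  // flag name illustrative
        gwrite(fd, 0, len, buf);       // write(1): private until synced
        gfsync(fd);                    // write(1) now visible to later opens
        gwrite(fd, len, len, buf);     // write(2): still private
        gclose(fd);                    // write(2) visible from here on
    }

    // Consumer on GPU2: an open() after the producer's gfsync() sees write(1).
    __global__ void consumer(char *buf, int len) {
        int fd = gopen("shared_file", O_RDONLY);
        gread(fd, 0, len, buf);        // observes everything synced before open
        gclose(fd);
    }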
On-GPU file I/O API (details in the paper)

    open/close     →  gopen/gclose
    read/write     →  gread/gwrite
    mmap/munmap    →  gmmap/gmunmap
    fsync/msync    →  gfsync/gmsync
    ftrunc         →  gftrunc

The changes in semantics are crucial: gread and gwrite, for example, take explicit file offsets (pread/pwrite-style), since a seek pointer shared by tens of thousands of threads would be meaningless.
Implementation bits (details in the paper)
● Paging support
● Dynamic data structures and memory allocators
● Lock-free radix tree
● Inter-processor communication (IPC)
● Hybrid hardware/software barriers
● Consistency module in the OS kernel
~1.5K GPU LOC, ~600 CPU LOC
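For flavor, a minimal sketch of a lock-free radix-tree insert of the kind a GPU buffer-cache index needs, publishing nodes with atomicCAS. This is a generic illustration, not the actual GPUfs code; alloc_node/free_node stand in for a GPU-side allocator:

    // Lock-free radix tree: concurrent inserts race to publish a child
    // pointer with atomicCAS; losers discard their node and adopt the winner's.
    struct Node {
        Node *child[16];                      // 4 bits of the key per level
    };

    __device__ Node *alloc_node();            // hypothetical GPU pool allocator
    __device__ void  free_node(Node *n);      // recycles a node that lost a race

    __device__ Node *find_or_insert(Node *root, unsigned key) {
        Node *n = root;
        for (int shift = 28; shift >= 0; shift -= 4) {
            int idx = (key >> shift) & 0xF;
            Node *next = n->child[idx];
            if (next == NULL) {
                Node *fresh = alloc_node();
                unsigned long long prev = atomicCAS(
                    (unsigned long long *)&n->child[idx],
                    0ULL, (unsigned long long)fresh);
                if (prev == 0ULL) next = fresh;                  // we published it
                else { free_node(fresh); next = (Node *)prev; }  // lost the race
            }
            n = next;
        }
        return n;                             // leaf for this 32-bit key
    }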
Evaluation
All benchmarks are written entirely as GPU kernels: no CPU-side development.
Matrix-vector product (inputs/outputs in files)
Vector: 1 x 128K elements; page size: 2MB; GPU: TESLA C2075.
[Figure: throughput (MB/s, up to ~3500) vs. input matrix size (280MB to 11,200MB) for three versions: CUDA pipelined, CUDA optimized, and GPU file I/O]
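A sketch of how such a benchmark might be structured under GPUfs: each threadblock greads matrix rows at explicit offsets and computes the corresponding output elements. The gread usage follows the API slide, but the scratch layout and the blockReduceSum helper are hypothetical:

    #define N (128 * 1024)                 // vector length from the slide

    __device__ float blockReduceSum(float v);  // hypothetical reduction helper

    // Each threadblock streams matrix rows straight from the input file.
    __global__ void matvec_from_file(int fd, const float *vec,
                                     float *scratch, float *out, int nrows) {
        float *row = scratch + (size_t)blockIdx.x * N;  // per-block row buffer
        for (int r = blockIdx.x; r < nrows; r += gridDim.x) {
            gread(fd, (size_t)r * N * sizeof(float),    // explicit offset
                  N * sizeof(float), row);              // one row per iteration
            float partial = 0.0f;
            for (int i = threadIdx.x; i < N; i += blockDim.x)
                partial += row[i] * vec[i];
            partial = blockReduceSum(partial);          // block-wide dot product
            if (threadIdx.x == 0) out[r] = partial;
        }
    }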
Word frequency count in text
● Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree
● English dictionary: 58,000 words
Challenges:
● Dynamic working set
● Small files: lots of file I/O (33,000 files, 1-5KB each)
● Unpredictable output size
Results

                                          8 CPUs   GPU-vanilla   GPU-GPUfs
    Linux source (33,000 files, 524MB)    6h       50m (7.2x)    53m (6.8x)
    Shakespeare (1 file, 6MB)             292s     40s (7.3x)    40s (7.3x)

GPUfs adds 8% overhead over the hand-tuned vanilla GPU version, while supporting unbounded input/output sizes.
GPUfs is the first system to provide native access to host OS services from GPU programs.
Code is available for download at:
https://sites.google.com/site/silbersteinmark/Home/gpufs (short link: http://goo.gl/ofJ6J)