GPUfs: Integrating a file system with GPUs


1. GPUfs: Integrating a file system with GPUs. Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

2. Traditional System Architecture: Applications → OS → CPU

3. Modern System Architecture: accelerated applications and the OS now sit atop many kinds of hardware: the CPU, GPUs, FPGAs, manycore processors, and hybrid CPU-GPU parts.

4. The software-hardware gap is widening: accelerated applications target GPUs, FPGAs, manycore processors, and hybrid CPU-GPUs, but the OS still manages only the CPU.

5. The software-hardware gap is widening: applications resort to ad-hoc abstractions and OS management mechanisms for GPUs, FPGAs, manycore processors, and hybrid CPU-GPUs.

6. On-accelerator OS support closes the programmability gap: accelerated applications and native accelerator applications run on top of on-accelerator OS support, which coordinates with the host OS across the CPU, GPUs, FPGAs, manycore processors, and hybrid CPU-GPUs.

7. GPUfs: File I/O support for GPUs ● Motivation ● Goals ● Understanding the hardware ● Design ● Implementation ● Evaluation

8. Building systems with GPUs is hard. Why?

9. Goal of GPU programming frameworks: the GPU runs the parallel algorithm; the CPU handles data transfers, GPU invocation, and memory management.

10. Headache for GPU programmers: the data-transfer, invocation, and memory-management code dwarfs the parallel algorithm itself. In half of the CUDA SDK 4.1 samples there are at least 9 lines of CPU code per line of GPU code.
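To make the "9 CPU lines per GPU line" figure concrete, here is a minimal sketch of the host-side ceremony around a one-line GPU kernel; the kernel and all names are illustrative, not from the talk:

```cuda
// The "1 line" of GPU code: a trivial scaling kernel.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// The CPU-side management code that surrounds it.
void run_scale(float *host_data, int n) {
    float *dev_data;
    cudaMalloc(&dev_data, n * sizeof(float));              // memory management
    cudaMemcpy(dev_data, host_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                    // data transfer in
    scale<<<(n + 255) / 256, 256>>>(dev_data, 2.0f, n);    // invocation
    cudaMemcpy(host_data, dev_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                    // data transfer out
    cudaFree(dev_data);                                    // memory management
}
```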

11. GPU kernels are isolated: everything outside the parallel algorithm (data transfers, invocation, memory management) still happens on the CPU.

12. Example: accelerating a photo collage (http://www.codeproject.com/Articles/36347/Face-Collage): While(Unhappy()){ Read_next_image_file(); Decide_placement(); Remove_outliers(); }

13. CPU implementation: the whole loop, Read_next_image_file(), Decide_placement(), Remove_outliers(), runs on the CPUs.

14. Offloading computations to the GPU: the compute steps of the loop move from the CPUs to the GPU.

15. Offloading computations to the GPU means adopting the co-processor programming model: CPU-driven data transfer, kernel start, kernel termination.

16. Kernel start/stop overheads: every iteration pays for the copy to the GPU, the kernel invocation, and the copy back to the CPU, on top of invocation latency, cache flushes, and GPU synchronization.

17. Hiding the overheads: asynchronous invocation, manual data-reuse management, and double buffering let copies to the GPU overlap with kernel execution.

18. The price is implementation complexity and management overhead: the asynchronous invocation, manual data-reuse management, and double buffering all live in application code.

19. Implementation complexity, management overhead: why do we need to deal with low-level system details at all?
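A sketch of what that double-buffering machinery costs in code, assuming a hypothetical process_chunk kernel and pinned host memory (required for the async copies to actually overlap):

```cuda
// Stand-in computation; in the collage app this would be the placement step.
__global__ void process_chunk(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

const int CHUNK = 1 << 20;   // elements per chunk

// Double buffering: while the GPU processes chunk i in one stream, the CPU
// copies chunk i+1 in the other. host_buf must be pinned (cudaHostAlloc)
// for cudaMemcpyAsync to be truly asynchronous.
void process_pipelined(const float *host_buf, int nchunks) {
    float *dev[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&dev[b], CHUNK * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }
    for (int i = 0; i < nchunks; i++) {
        int b = i % 2;                                   // alternate buffers
        cudaMemcpyAsync(dev[b], host_buf + (size_t)i * CHUNK,
                        CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process_chunk<<<CHUNK / 256, 256, 0, stream[b]>>>(dev[b], CHUNK);
    }
    cudaDeviceSynchronize();                             // drain both streams
    for (int b = 0; b < 2; b++) {
        cudaStreamDestroy(stream[b]);
        cudaFree(dev[b]);
    }
}
```

None of this logic is about the algorithm; it exists only to hide transfer and invocation overheads, which is exactly the complexity the slide is questioning.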

20. The reason: GPUs are peer processors, and they need I/O and other OS services.

21. GPUfs: application view. CPUs and GPUs (GPU1, GPU2, GPU3) operate on the same file: one calls open("shared_file"), others write() and mmap() it, all through GPUfs on top of the host file system.

22. GPUfs: application view. GPUfs provides a system-wide shared namespace, a POSIX (CPU)-like API, and persistent storage through the host file system.
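In code, file access from inside a kernel might look like the sketch below. The g* call names come from the API slide later in the talk, but the exact signatures and flag names here are assumptions (pread/pwrite-style explicit offsets, block-granularity calls):

```cuda
#include "gpufs.h"   // hypothetical device-side GPUfs header

// Each threadblock reads its own chunk of a shared file and writes
// results back, entirely from GPU code; no CPU transfer logic.
__global__ void process_file() {
    int fd_in  = gopen("shared_file", O_GRDONLY);   // flag names assumed
    int fd_out = gopen("results", O_GWRONLY);

    __shared__ char buf[4096];
    size_t off = (size_t)blockIdx.x * sizeof(buf);  // one chunk per threadblock
    size_t n = gread(fd_in, off, sizeof(buf), buf); // collective, block-wide call

    // ... all threads in the block process buf cooperatively ...

    gwrite(fd_out, off, n, buf);                    // write results at same offset
    gclose(fd_in);
    gclose(fd_out);
}
```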

23. Accelerating the collage app with GPUfs: no CPU management code; the GPU opens and reads the image files itself.

24. Accelerating the collage app with GPUfs: the GPUfs buffer cache adds read-ahead, overlapping computations with data transfers.

25. Accelerating the collage app with GPUfs: the buffer cache also provides data reuse and random data access.

26. Challenge: GPU ≠ CPU.

27. Massive parallelism: parallelism is essential for performance on deeply multi-threaded, wide-vector hardware, with roughly 23,000-31,000 active threads on the GPUs of the day (AMD HD5870, NVIDIA Fermi). (Figures from M. Houston/A. Lefohn/K. Fatahalian, "A trip through the architecture of modern GPUs.")

28. Heterogeneous memory: GPUs inherently impose high bandwidth demands on memory. GPU memory runs at 288-360 GB/s versus 10-32 GB/s for CPU memory (roughly a 20x gap), with only 6-16 GB/s between the two.

29. How do we build a file system layer on this hardware?

30. GPUfs: a principled redesign of the whole file system stack ● Relaxed FS API semantics for parallelism ● Relaxed FS consistency for heterogeneous memory ● GPU-specific implementations of synchronization primitives, lock-free data structures, memory allocation, and more

31. GPUfs high-level design: on the CPU, unchanged applications use the regular OS file API; on the GPU, applications use the GPUfs file API, built for massive parallelism. GPUfs hooks into the OS file system interface, and the GPUfs distributed buffer cache (page cache) spans the heterogeneous CPU and GPU memories, backed by the host file system on disk.

32. (The same high-level design, shown without the annotations.)

33. Buffer cache semantics: local file system or distributed file system data consistency?

34. GPUfs buffer cache: a weak data consistency model with close(sync)-to-open semantics, as in AFS. In the slide's timeline, GPU2 issues write(1), fsync(), write(2); after the fsync, GPU1's subsequent open() and read(1) see write(1), while write(2) is not yet visible to others. This fits because the remote-to-local memory performance ratio is similar to that of a distributed system.
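In terms of the g* API, that timeline might read as the following sketch (signatures and flag names assumed, as before):

```cuda
// GPU2, the producer:
int fd = gopen("shared_file", O_GWRONLY);    // flag name assumed
gwrite(fd, 0, n, data1);                     // write(1)
gfsync(fd);                                  // publish: write(1) visible to later opens
gwrite(fd, n, n, data2);                     // write(2): not yet visible elsewhere

// GPU1, the consumer, some time later:
int fd2 = gopen("shared_file", O_GRDONLY);   // open() after the gfsync()
gread(fd2, 0, n, buf);                       // sees write(1), but not write(2)
```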

35. On-GPU file I/O API (details in the paper): open/close → gopen/gclose; read/write → gread/gwrite; mmap/munmap → gmmap/gmunmap; fsync/msync → gfsync/gmsync; ftrunc → gftrunc. The changes in the semantics are crucial.
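For example, a gmmap-based read path might look like the sketch below, assuming gmmap mirrors mmap's argument order; the kernel, file name, and reduction are illustrative:

```cuda
// Sum a file of floats from GPU code via a memory-mapped view.
__global__ void sum_file(float *result, size_t bytes) {
    int fd = gopen("input.dat", O_GRDONLY);   // flag name assumed
    // Map a read-only window of the file into GPU-visible memory:
    float *data = (float *)gmmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, 0);

    float local = 0.0f;
    for (size_t i = threadIdx.x; i < bytes / sizeof(float); i += blockDim.x)
        local += data[i];                     // strided, coalesced reads
    atomicAdd(result, local);                 // crude reduction for brevity

    gmunmap(data, bytes);
    gclose(fd);
}
```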

36. Implementation bits (details in the paper) ● Paging support ● Dynamic data structures and memory allocators ● Lock-free radix tree ● Inter-processor communications (IPC) ● Hybrid H/W-S/W barriers ● Consistency module in the OS kernel. Roughly 1.5K GPU LOC and 600 CPU LOC in total.
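To give a flavor of the lock-free style this requires, here is a minimal sketch of a fixed-depth radix-tree lookup with atomicCAS-published nodes; it is illustrative only, not GPUfs's actual data structure:

```cuda
// Fixed-depth radix tree over 32-bit page indices: 8 levels of 4 bits.
// New interior nodes are published with atomicCAS, so concurrent GPU
// threads never block on a lock.
struct Node { Node *child[16]; };

__device__ Node node_pool[1 << 16];      // illustrative pre-allocated pool
__device__ unsigned pool_next;

__device__ Node *alloc_node() {
    unsigned i = atomicAdd(&pool_next, 1u);   // lock-free bump allocation
    return &node_pool[i];
}

__device__ Node *find_or_insert(Node *root, unsigned key) {
    Node *n = root;
    for (int shift = 28; shift >= 0; shift -= 4) {
        int idx = (key >> shift) & 0xF;
        Node *next = n->child[idx];
        if (next == NULL) {
            Node *fresh = alloc_node();
            // Try to publish; if another thread won the race, adopt its node
            // (the losing allocation is simply abandoned in this sketch).
            Node *prev = (Node *)atomicCAS(
                (unsigned long long *)&n->child[idx],
                0ULL, (unsigned long long)fresh);
            next = (prev == NULL) ? fresh : prev;
        }
        n = next;
    }
    return n;   // node for this key; a real cache would hang a page off it
}
```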

37. Evaluation: all benchmarks are written as a GPU kernel, with no CPU-side development.

38. Matrix-vector product with inputs and outputs in files. Vector: 1x128K elements; page size: 2MB; GPU: Tesla C2075. [Plot: throughput (MB/s, 0-3500) vs. input matrix size (280-11,200 MB) for CUDA pipelined, CUDA optimized, and GPU file I/O.]

39. Word frequency count in text ● Count the frequency of modern English words (a 58,000-word dictionary) in the works of Shakespeare and in the Linux kernel source tree ● Challenges: dynamic working set, small files, lots of file I/O (33,000 files of 1-5KB each), unpredictable output size

40. Results (8 CPUs / GPU-vanilla / GPU-GPUfs): Linux source (33,000 files, 524MB): 6h / 50m (7.2X) / 53m (6.8X). Shakespeare (1 file, 6MB): 292s / 40s (7.3X) / 40s (7.3X).

41. The same results, annotated: on the Linux source, GPU-GPUfs is within 8% of the hand-tuned GPU-vanilla version while adding unbounded input/output size support.

42. GPUfs is the first system to provide native access to host OS services from GPU programs. Code is available for download at https://sites.google.com/site/silbersteinmark/Home/gpufs (short link: http://goo.gl/ofJ6J).

43. Our life would have been easier with ● PCI atomics ● Preemptive background daemons ● GPU-CPU signaling support ● In-GPU exceptions ● A GPU virtual memory API (host-based or device-side) ● Compiler optimizations for register-heavy libraries ● Some of this seems to have been accomplished in CUDA 5.0

44. Sequential access to a file, three versions: (1) CUDA whole-file transfer: the CPU reads the entire file, then transfers it to the GPU; (2) CUDA pipelined transfer: the CPU repeatedly reads a chunk and transfers it to the GPU; (3) GPU file I/O: the GPU gmmap()s the file itself.

45. Sequential read: throughput vs. page size. [Plot: throughput (MB/s, 0-4000) vs. page size (16K-2M) for GPU file I/O, CUDA whole file, and CUDA pipelined.]
