ASPLOS 2013
GPUfs: Integrating a file system with GPUs
Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)
Traditional System Architecture
[Diagram: applications run on the OS, which runs on the CPU]
Modern System Architecture
[Diagram: accelerated applications run on the OS, atop manycore CPUs, GPUs, FPGAs, and hybrid CPU-GPU processors]
Software-hardware gap is widening
[Diagram: the same hardware as before, but accelerated applications now rely on ad-hoc abstractions and OS management mechanisms rather than the OS proper]
On-accelerator OS support closes the programmability gap
[Diagram: native accelerator applications use on-accelerator OS support, which coordinates with the host OS across manycore CPUs, GPUs, FPGAs, and hybrid CPU-GPU processors]
GPUfs: File I/O support for GPUs
● Motivation
● Goals
● Understanding the hardware
● Design
● Implementation
● Evaluation
Building systems with GPUs is hard. Why?
Goal of GPU programming frameworks
[Diagram: the GPU runs the parallel algorithm; the CPU handles data transfers, GPU invocation, and memory management]
Headache for GPU programmers
[Diagram: same split as before, with the CPU-side plumbing highlighted]
In half of the CUDA SDK 4.1 samples, there are at least 9 lines of CPU code for every line of GPU code.
GPU kernels are isolated
[Diagram: the GPU's parallel algorithm is cut off from the data transfers, invocation, and memory management handled by the CPU]
Example: accelerating photo collage
http://www.codeproject.com/Articles/36347/Face-Collage

    While (Unhappy()) {
        Read_next_image_file()
        Decide_placement()
        Remove_outliers()
    }
CPU implementation
[Diagram: the loop above runs across multiple CPUs in the application]
Offloading computations to GPU
[Diagram: the compute-heavy part of the loop moves from the CPUs to the GPU; the rest, including the file reads, stays on the CPU]
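To make the CPU-side plumbing concrete, here is a minimal sketch of the host boilerplate a single offloaded step requires. The kernel, its "outlier" rule, and the sizes are hypothetical, not from the talk:

    // Host-side boilerplate for one offloaded step (illustrative only).
    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    __global__ void remove_outliers_kernel(float *img, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && img[i] > 1.0f) img[i] = 1.0f;   // placeholder "outlier" rule
    }

    int main(void) {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_img = (float *)malloc(bytes);        // host buffer
        memset(h_img, 0, bytes);                      // stand-in for image load
        float *d_img;
        cudaMalloc(&d_img, bytes);                    // device buffer
        cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);   // copy in
        remove_outliers_kernel<<<(n + 255) / 256, 256>>>(d_img, n);
        cudaMemcpy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);   // copy out
        cudaFree(d_img);
        free(h_img);
        return 0;
    }

Every allocation, transfer, and launch here is management code with nothing to do with the algorithm itself.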
Offloading computations to GPU
Co-processor programming model:
[Diagram: the CPU copies data to the GPU, starts the kernel, waits for kernel termination, and copies the results back]
Kernel start/stop overheads
[Diagram: every invocation pays CPU-side invocation latency, a copy to the GPU, a GPU cache flush, synchronization, and a copy back to the CPU]
Hiding the overheads
[Diagram: asynchronous invocation, manual data-reuse management, and double buffering overlap copies to the GPU with kernel execution]
Implementation complexity, management overhead
Asynchronous invocation, manual data-reuse management, and double buffering hide the overheads, but each adds code the programmer must write and tune.
Why do we need to deal with low-level system details at all?
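For scale, here is a minimal double-buffering sketch using CUDA streams. It is a generic illustration of the pattern, not code from the talk; process_chunk and the sizes are hypothetical:

    // Overlap host-to-device copies with kernel execution using two streams.
    #include <cuda_runtime.h>

    __global__ void process_chunk(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                   // stand-in for real work
    }

    // h_in should be pinned memory (cudaHostAlloc) for true async overlap.
    void process_stream(float *h_in, int total, int chunk) {
        float *d_buf[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; b++) {
            cudaMalloc(&d_buf[b], chunk * sizeof(float));
            cudaStreamCreate(&s[b]);
        }
        for (int off = 0, b = 0; off < total; off += chunk, b ^= 1) {
            int n = (total - off < chunk) ? total - off : chunk;
            // Stream b copies and computes while the other stream does too.
            cudaMemcpyAsync(d_buf[b], h_in + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(n + 255) / 256, 256, 0, s[b]>>>(d_buf[b], n);
            cudaMemcpyAsync(h_in + off, d_buf[b], n * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize();
        for (int b = 0; b < 2; b++) {
            cudaStreamDestroy(s[b]);
            cudaFree(d_buf[b]);
        }
    }

All of this is still pure management code, which is exactly the point of the next slide.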
The reason is...
GPUs are peer processors. They need OS services, including file I/O.
GPUfs: application view
[Diagram: CPUs and GPU1, GPU2, GPU3 all call open("shared_file"), write(), and mmap() on the same file in the host file system through GPUfs]
● System-wide shared namespace
● POSIX (CPU)-like API
● Persistent storage
Accelerating the collage app with GPUfs
No CPU management code: the GPU itself opens and reads the image files (open/read from the GPU).
Accelerating the collage app with GPUfs
[Diagram: the GPUfs buffer cache provides read-ahead, overlapping computation with data transfers]
Accelerating the collage app with GPUfs
[Diagram: the GPUfs buffer cache also provides data reuse and random data access]
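A sketch of what the collage loop could look like as a pure GPU kernel under GPUfs. It assumes the g* calls listed later in the talk; the helper functions, flag name, and sizes are illustrative, and real GPUfs calls are collective, made by the threads of a threadblock together:

    // Illustrative GPUfs-style kernel: each threadblock pulls image files
    // straight from the host file system, with no CPU-side transfer code.
    #define IMG_BYTES (4 * 1024)                     // hypothetical image size

    __device__ void decide_placement(unsigned char *img);   // hypothetical
    __device__ void remove_outliers(unsigned char *img);    // hypothetical
    __device__ const char *image_name(int i);                // hypothetical

    __global__ void collage_kernel(int nfiles, unsigned char *scratch) {
        unsigned char *img = scratch + (size_t)blockIdx.x * IMG_BYTES;
        for (int f = blockIdx.x; f < nfiles; f += gridDim.x) {
            int fd = gopen(image_name(f), O_RDONLY);  // open from the GPU
            gread(fd, 0, IMG_BYTES, img);             // positional read
            decide_placement(img);
            remove_outliers(img);
            gclose(fd);
        }
    }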
Challenge: GPU ≠ CPU
Massive parallelism
Parallelism is essential for performance on deeply multithreaded, wide-vector hardware: an AMD HD5870 keeps ~23,000 threads active; an NVIDIA Fermi keeps ~31,000.
(Figures from M. Houston, A. Lefohn, and K. Fatahalian, "A trip through the architecture of modern GPUs".)
Heterogeneous memory
GPUs inherently impose high bandwidth demands on memory.
[Diagram: GPU memory delivers 288-360 GB/s, roughly 20x the CPU memory's 10-32 GB/s, while the CPU-GPU link carries only 6-16 GB/s]
How do we build a file-system layer on this hardware?
GPUfs: principled redesign of the whole file system stack
● Relaxed FS API semantics, for parallelism
● Relaxed FS consistency, for heterogeneous memory
● GPU-specific implementations of synchronization primitives, lock-free data structures, memory allocation, ...
GPUfs high-level design
[Diagram: on the CPU, unchanged applications use the OS file API against the host file system; GPU applications use the GPUfs file API via a GPU-side file I/O library. GPUfs hooks into the OS file-system interface, and a distributed buffer cache (page cache) spans CPU and GPU memory, backed by the host file system on disk. Massive parallelism shapes the GPU-side API; heterogeneous memory shapes the buffer cache.]
Buffer cache semantics
Should the buffer cache provide local or distributed file-system data consistency?
GPUfs buffer cache: weak data consistency model
● close(sync)-to-open semantics, as in AFS
[Diagram: GPU1 performs write(1), fsync(), write(2); GPU2's subsequent open() and read(1) see the synced write(1), while unsynced data is not yet visible to the CPU]
The remote-to-local memory performance ratio (>>1) is similar to that of a distributed system, so a distributed-FS consistency model is the natural fit.
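A sketch of how close(sync)-to-open consistency plays out across two GPUs, mirroring the timeline above. The g* calls are from the API slide that follows, but the kernels, buffers, and flag names are illustrative:

    // Producer on GPU1: data becomes visible to other processors
    // only at gfsync()/gclose().
    __global__ void producer(char *buf, int len) {
        int fd = gopen("shared_file", O_WRONLY);  // flag name illustrative
        gwrite(fd, 0, len, buf);       // write(1): private until synced
        gfsync(fd);                    // write(1) now visible to later opens
        gwrite(fd, len, len, buf);     // write(2): still private
        gclose(fd);                    // write(2) visible from here on
    }

    // Consumer on GPU2: an open() after the producer's gfsync() sees write(1).
    __global__ void consumer(char *buf, int len) {
        int fd = gopen("shared_file", O_RDONLY);
        gread(fd, 0, len, buf);        // observes everything synced before open
        gclose(fd);
    }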
On-GPU file I/O API (details in the paper)

    open/close     →  gopen/gclose
    read/write     →  gread/gwrite
    mmap/munmap    →  gmmap/gmunmap
    fsync/msync    →  gfsync/gmsync
    ftrunc         →  gftrunc

The changes in semantics are crucial: gread and gwrite, for example, take explicit file offsets (pread/pwrite-style), since a seek pointer shared by tens of thousands of threads would be meaningless.
Implementation bits (details in the paper)
● Paging support
● Dynamic data structures and memory allocators
● Lock-free radix tree
● Inter-processor communication (IPC)
● Hybrid hardware/software barriers
● Consistency module in the OS kernel
~1.5K GPU LOC, ~600 CPU LOC
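For flavor, a minimal sketch of a lock-free radix-tree insert of the kind a GPU buffer-cache index needs, publishing nodes with atomicCAS. This is a generic illustration, not the actual GPUfs code; alloc_node/free_node stand in for a GPU-side allocator:

    // Lock-free radix tree: concurrent inserts race to publish a child
    // pointer with atomicCAS; losers discard their node and adopt the winner's.
    struct Node {
        Node *child[16];                      // 4 bits of the key per level
    };

    __device__ Node *alloc_node();            // hypothetical GPU pool allocator
    __device__ void  free_node(Node *n);      // recycles a node that lost a race

    __device__ Node *find_or_insert(Node *root, unsigned key) {
        Node *n = root;
        for (int shift = 28; shift >= 0; shift -= 4) {
            int idx = (key >> shift) & 0xF;
            Node *next = n->child[idx];
            if (next == NULL) {
                Node *fresh = alloc_node();
                unsigned long long prev = atomicCAS(
                    (unsigned long long *)&n->child[idx],
                    0ULL, (unsigned long long)fresh);
                if (prev == 0ULL) next = fresh;                  // we published it
                else { free_node(fresh); next = (Node *)prev; }  // lost the race
            }
            n = next;
        }
        return n;                             // leaf for this 32-bit key
    }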
Evaluation
All benchmarks are written entirely as GPU kernels: no CPU-side development.
Matrix-vector product (inputs/outputs in files)
Vector: 1 x 128K elements; page size: 2MB; GPU: TESLA C2075.
[Figure: throughput (MB/s, up to ~3500) vs. input matrix size (280MB to 11,200MB) for three versions: CUDA pipelined, CUDA optimized, and GPU file I/O]
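A sketch of how such a benchmark might be structured under GPUfs: each threadblock greads matrix rows at explicit offsets and computes the corresponding output elements. The gread usage follows the API slide, but the scratch layout and the blockReduceSum helper are hypothetical:

    #define N (128 * 1024)                 // vector length from the slide

    __device__ float blockReduceSum(float v);  // hypothetical reduction helper

    // Each threadblock streams matrix rows straight from the input file.
    __global__ void matvec_from_file(int fd, const float *vec,
                                     float *scratch, float *out, int nrows) {
        float *row = scratch + (size_t)blockIdx.x * N;  // per-block row buffer
        for (int r = blockIdx.x; r < nrows; r += gridDim.x) {
            gread(fd, (size_t)r * N * sizeof(float),    // explicit offset
                  N * sizeof(float), row);              // one row per iteration
            float partial = 0.0f;
            for (int i = threadIdx.x; i < N; i += blockDim.x)
                partial += row[i] * vec[i];
            partial = blockReduceSum(partial);          // block-wide dot product
            if (threadIdx.x == 0) out[r] = partial;
        }
    }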
Word frequency count in text
● Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree
● English dictionary: 58,000 words
Challenges:
● Dynamic working set
● Small files: lots of file I/O (33,000 files, 1-5KB each)
● Unpredictable output size
Results

                                          8 CPUs   GPU-vanilla   GPU-GPUfs
    Linux source (33,000 files, 524MB)    6h       50m (7.2x)    53m (6.8x)
    Shakespeare (1 file, 6MB)             292s     40s (7.3x)    40s (7.3x)

GPUfs adds 8% overhead over the hand-tuned vanilla GPU version, while supporting unbounded input/output sizes.
GPUfs is the first system to provide native access to host OS services from GPU programs.
Code is available for download at:
https://sites.google.com/site/silbersteinmark/Home/gpufs (short link: http://goo.gl/ofJ6J)