GPU Direct IO with HDF5
John Ravi • Quincey Koziol • Suren Byna
Motivation
• As large-scale computing systems move towards using GPUs as the workhorses of computing, file I/O to move data between GPUs and storage devices becomes critical.
• I/O performance-optimizing technologies: NVIDIA's GPUDirect Storage (GDS) reduces the latency of data movement between GPUs and storage.
• In this presentation, we describe a recently developed virtual file driver (VFD) that takes advantage of the GDS technology, allowing data transfers between GPUs and storage without using CPU memory as a "bounce buffer".
Traditional Data Transfer without GPUDirect Storage
1. fd = open("file.txt", O_RDONLY);
2. buf = malloc(size);
3. pread(fd, buf, size, 0);
4. cudaMalloc(&d_buf, size);
5. cudaMemcpy(d_buf, buf, size, cudaMemcpyHostToDevice);
Data Transfer with GPUDirect Storage (GDS)

Traditional Data Transfer:
1. fd = open("file.txt", O_RDONLY, …);
2. buf = malloc(size);
3. pread(fd, buf, size, 0);
4. cudaMalloc(&d_buf, size);
5. cudaMemcpy(d_buf, buf, size, cudaMemcpyHostToDevice);

NVIDIA GPUDirect Storage (no need for a "bounce buffer"):
1. fd = open("file.txt", O_RDONLY | O_DIRECT, …);
2. cudaMalloc(&d_buf, size);
3. cuFileRead(fhandle, d_buf, size, 0);
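For context, a fuller sketch of the GDS read path using NVIDIA's cuFile API is shown below; error handling is omitted and "file.txt" is a placeholder. The HDF5 GDS VFD discussed later wraps this style of call sequence internally, but this listing is an illustration rather than the driver's source.

    /* Sketch of a GDS read with the cuFile API (error handling omitted).
     * Assumes libcufile and a GDS-capable filesystem. */
    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <unistd.h>
    #include <cuda_runtime.h>
    #include <cufile.h>

    int main(void)
    {
        const size_t size = 1 << 20;   /* 1 MiB read, for illustration */
        void *d_buf = NULL;

        /* Initialize the cuFile driver once per run */
        cuFileDriverOpen();

        /* GDS requires O_DIRECT, bypassing the OS page cache */
        int fd = open("file.txt", O_RDONLY | O_DIRECT);

        /* Register the POSIX fd with cuFile to obtain a cuFile handle */
        CUfileDescr_t descr = {0};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t fhandle;
        cuFileHandleRegister(&fhandle, &descr);

        /* Destination buffer lives in GPU global memory; no host bounce buffer */
        cudaMalloc(&d_buf, size);

        /* DMA directly from storage into GPU memory */
        cuFileRead(fhandle, d_buf, size, /*file offset*/ 0, /*buffer offset*/ 0);

        cuFileHandleDeregister(fhandle);
        close(fd);
        cudaFree(d_buf);
        cuFileDriverClose();
        return 0;
    }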
HPC I/O Software Stack
[Stack diagram: Applications → High Level I/O Library (HDF5, netCDF, ADIOS) → I/O Middleware (MPI-IO) → I/O Forwarding → Parallel File System (Lustre, GPFS, …) → I/O Hardware (disk-based, SSD-based, …)]
High Level I/O Library objectives:
• Ease-of-use
• Standardized format
• Portable performance optimizations
HDF5 Virtual File Driver(s)
[Architecture diagram: applications and tools (HDFview, netCDF-4, h5dump, H5Hut) sit on the HDF5 high-level and language APIs (Java, C++/FORTRAN/Python); the HDF5 library internals (memory management, datatype conversion, chunked storage, filters, version compatibility, …) manage the data model objects (files, groups, datasets, datatypes, dataspaces, attributes, …) and tunable properties (chunk size, I/O driver, …), and sit on the Virtual File Layer, which maps the HDF5 file format onto storage through a pluggable driver.]
• SEC2: default driver; uses POSIX file-system functions such as read and write to perform I/O to a single file
• DIRECT: forces data to be written directly to the file system; disables OS buffering
• MPIIO: used with Parallel HDF5 to provide parallel I/O support
• GDS: enables GPUDirect Storage, performing direct I/O between GPU memory and the file system
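As a small illustration of how a driver is selected, the sketch below sets a VFD on an HDF5 file access property list. H5Pset_fapl_sec2 is the standard call for the default driver; the GDS driver appears only as a commented placeholder, because its exact registration call and arguments are not given on the slide and are therefore an assumption.

    /* Minimal sketch: selecting an HDF5 virtual file driver through a
     * file access property list. */
    #include "hdf5.h"

    int main(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* Default POSIX driver (read/write on a single file) */
        H5Pset_fapl_sec2(fapl);

        /* Hypothetical: switch the same property list to the GDS driver instead.
         * H5Pset_fapl_gds(fapl, ...);   -- exact call/arguments not shown here */

        hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }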
GPU Data Management
[Animated figure, repeated across several slides: the traditional data path. An application I/O call goes through the OS kernel, which moves data from NVMe storage into host (CPU) memory over PCIe 3.0 (16 GB/s); the GPU copy engine then transfers it across a second PCIe 3.0 link (16 GB/s) into GPU global memory, from where it flows up the memory hierarchy (L2 cache, L1 cache/SMEM, registers) to the compute cores.]
GPU Data Management (with GDS)
[Animated figure, repeated across several slides: the GDS data path. With GPUDirect Storage, data moves directly between NVMe storage and GPU global memory over PCIe 3.0 (16 GB/s), bypassing host (CPU) memory as a bounce buffer.]
HDF5 GDS – Virtual File Driver
• GDS VFD differences from the SEC2 VFD:
  • The file descriptor is opened with O_DIRECT (disables all OS buffering)
  • The read and write handlers need to distinguish between CPU (metadata) and GPU memory pointers (see the sketch below)
  • The cuFile driver needs to be initialized once per run
• Some overhead is incurred for each I/O call:
  • Querying the CUDA runtime for information about memory pointers
  • cuFile buffer registration and deregistration
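The sketch below illustrates how a write handler might tell a host (metadata) pointer apart from a device pointer with cudaPointerGetAttributes, which is the kind of CUDA-runtime query the overhead bullet refers to. The helper name and fallback logic are assumptions for illustration, not the actual GDS VFD code.

    /* Illustrative sketch (not the VFD source): distinguishing a host pointer
     * (HDF5 metadata) from a device pointer before issuing the write. */
    #include <cuda_runtime.h>
    #include <cufile.h>
    #include <unistd.h>

    static ssize_t vfd_write_sketch(int fd, CUfileHandle_t fh,
                                    const void *buf, size_t size, off_t offset)
    {
        struct cudaPointerAttributes attr;

        /* Ask the CUDA runtime for the memory type of the caller's buffer;
         * this query is part of the per-call overhead noted above. */
        if (cudaPointerGetAttributes(&attr, buf) == cudaSuccess &&
            attr.type == cudaMemoryTypeDevice) {
            /* GPU memory: write directly from device memory via cuFile */
            return cuFileWrite(fh, buf, size, offset, 0);
        }

        /* Clear any error the query may set for non-CUDA pointers */
        (void)cudaGetLastError();

        /* Host memory (e.g. HDF5 metadata): fall back to a plain POSIX write */
        return pwrite(fd, buf, size, offset);
    }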
Experimental Evaluation – Lustre File System
• GDS VFD knobs (illustrated conceptually below):
  • num_threads: number of pthreads servicing one cuFile request
  • blocksize: transfer size of one cuFile request
Image Source: https://wiki.lustre.org/Introduction_to_Lustre
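A conceptual sketch of what these two knobs control is shown below: one large request is divided among num_threads pthreads, each issuing cuFileRead calls of at most blocksize bytes. The chunking scheme and helper names are assumptions; the real driver's threading model may differ.

    /* Conceptual sketch of the num_threads / blocksize knobs, not the
     * HDF5 GDS VFD implementation. */
    #include <pthread.h>
    #include <cufile.h>

    typedef struct {
        CUfileHandle_t fh;
        void          *d_buf;      /* base device pointer */
        size_t         blocksize;  /* transfer size of one cuFile request */
        size_t         start;      /* first byte this thread handles */
        size_t         end;        /* one past the last byte this thread handles */
        off_t          file_off;   /* file offset of byte 0 of the request */
    } chunk_args_t;

    static void *read_chunks(void *arg)
    {
        chunk_args_t *c = (chunk_args_t *)arg;
        for (size_t off = c->start; off < c->end; off += c->blocksize) {
            size_t len = c->end - off < c->blocksize ? c->end - off : c->blocksize;
            cuFileRead(c->fh, c->d_buf, len, c->file_off + (off_t)off, (off_t)off);
        }
        return NULL;
    }

    /* Divide [0, size) among num_threads workers, each reading blocksize chunks */
    static void threaded_read(CUfileHandle_t fh, void *d_buf, size_t size,
                              off_t file_off, int num_threads, size_t blocksize)
    {
        pthread_t    tids[num_threads];
        chunk_args_t args[num_threads];
        size_t share = (size + num_threads - 1) / num_threads;

        for (int i = 0; i < num_threads; i++) {
            size_t start = i * share;
            size_t end   = (i + 1) * share < size ? (i + 1) * share : size;
            args[i] = (chunk_args_t){ fh, d_buf, blocksize, start, end, file_off };
            pthread_create(&tids[i], NULL, read_chunks, &args[i]);
        }
        for (int i = 0; i < num_threads; i++)
            pthread_join(tids[i], NULL);
    }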
Experimental Evaluation
• System configuration:
  • NVIDIA DGX-2
  • 16x Tesla V100
  • 2x Samsung NVMe SM961/PM961 in RAID0 (sequential reads ≈ 6.4 GB/s, sequential writes ≈ 3.6 GB/s)
  • Lustre file system (4 OSTs, 1 MB stripe size)
• Benchmarks:
  • Local storage: sequential read/write rates
  • Lustre file system: multi-threaded sequential read/write rates; multi-GPU (one GPU per process, one file per process)
Write Performance – Local Storage
• HDF5 GDS achieves higher write rates for requests larger than 512 MB
• Possible optimizations:
  • Let the user specify the location (host or device) of the memory pointer for each transfer
  • Register the cuFile buffer before the I/O call
Read Performance – Local Storage
• HDF5 GDS achieves higher read rates for requests larger than 256 MB
• Possible optimizations:
  • Let the user specify the location (host or device) of the memory pointer for each transfer
  • Register the cuFile buffer before the I/O call (sketched below)
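A minimal sketch of the "register the buffer before the I/O call" optimization suggested on the last two slides, assuming the cuFile handle has already been set up as in the earlier example: cuFileBufRegister pins and maps the device buffer once so repeated transfers avoid per-call registration cost.

    #include <cuda_runtime.h>
    #include <cufile.h>

    void preregistered_writes(CUfileHandle_t fh, size_t size, int n_iters)
    {
        void *d_buf = NULL;
        cudaMalloc(&d_buf, size);

        /* One-time registration of the device buffer with cuFile */
        cuFileBufRegister(d_buf, size, 0);

        for (int i = 0; i < n_iters; i++) {
            /* ... fill d_buf on the GPU ... */
            cuFileWrite(fh, d_buf, size, (off_t)i * (off_t)size, 0);
        }

        cuFileBufDeregister(d_buf);
        cudaFree(d_buf);
    }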
Multi-Threaded Writes, Single GPU, Lustre File System
• Using more threads increases write rates dramatically (almost 2x when going from 4 to 8 threads)
• Varying the blocksize did not change performance much
• The default SEC2 driver has no threading; adding it would require a significant change
  • Some developers are working on relaxing the serial HDF5 "global lock"
Multi-Threaded Reads, Single GPU, Lustre File System
• SEC2 read rates are best in most cases
• Using more threads did not improve the read rate
• Read-ahead was left on for this experiment