Achieving Near-Native GPU Performance in the Cloud
John Paul Walters
Project Leader, USC Information Sciences Institute
jwalters@isi.edu
Outline
• Motivation
• ISI’s HPC Cloud Effort
• Background: PCI Passthrough, SR-IOV
• Results
• Conclusion
Motivation
• Scientific workloads demand increasing performance with greater power efficiency
  – Architectures have been driven towards specialization and heterogeneity
• Infrastructure-as-a-Service (IaaS) clouds can democratize access to the latest, most powerful accelerators
  – If performance goals are met
• Can we provide HPC-class performance in the cloud?
ISI’s HPC Cloud Work
• Cloud computing is traditionally seen as a resource for IT
  – Web servers, databases
• More recently, researchers have begun to leverage the public cloud as an HPC resource
  – An AWS virtual cluster ranks 101st on the Top500 list
• Major difference between HPC and IT in the cloud:
  – Types of resources, heterogeneity
• Our contribution: we’re developing heterogeneous HPC extensions for the OpenStack cloud computing platform
OpenStack Background
• OpenStack founded by Rackspace and NASA
• In use by Rackspace, HP, and others for their public clouds
• Open source with hundreds of participating companies
• In use for both public and private clouds
• Current stable release: OpenStack Juno
  – OpenStack Kilo to be released in April
[Figure: Google Trends searches for common open source IaaS projects: OpenStack, CloudStack, OpenNebula, Eucalyptus]
Accessing GPUs from Virtual Hosts Using API Remoting
[Figure: Host-to-device bandwidth, pageable (MB/sec) vs. transfer size (bytes) for Host, gVirtus, and LXC]
[Figure: Matrix multiply (GFlops/sec) for increasing size (NxM, single precision real) for Host, gVirtus, and LXC]
• I/O performance is low for gVirtus/KVM; LXC is much closer to native performance.
• Larger matrix multiplies amortize the I/O transfer cost; LXC and native performance are indistinguishable.
Accelerators and Virtualization
• Combine non-virtualized accelerators with virtual hosts
• Results in > 99% efficiency
[Figure: SHOC relative performance for common signal processing kernels (FFT/IFFT and SGEMM/DGEMM variants, single and double precision, with and without PCIe transfers) under KVM, Xen, LXC, and VMWare; relative performance ranges from 0.96 to 1.01]
PCI Passthrough Background
• 1:1 mapping of physical device to virtual machine
• Device remains non-virtualized
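To make the mechanism concrete, the sketch below uses the libvirt Python bindings to hand a GPU to a guest by its PCI address. This is a minimal illustration, not taken from the talk: the connection URI, domain name, and PCI address are placeholders, and a real deployment also needs the IOMMU enabled and the device bound to vfio-pci or pci-stub on the host.

```python
import libvirt  # libvirt Python bindings (libvirt-python)

# Placeholder values -- substitute the real VM name and the GPU's PCI
# address as reported by `lspci` on the host.
VM_NAME = "gpu-vm"
PCI_BUS, PCI_SLOT, PCI_FUNC = "0x04", "0x00", "0x0"

# libvirt <hostdev> element describing a PCI passthrough device. With
# managed='yes', libvirt detaches the device from its host driver and
# reattaches it when the guest releases it.
HOSTDEV_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='{bus}' slot='{slot}' function='{func}'/>
  </source>
</hostdev>
""".format(bus=PCI_BUS, slot=PCI_SLOT, func=PCI_FUNC)

conn = libvirt.open("qemu:///system")   # connect to the local KVM/QEMU hypervisor
dom = conn.lookupByName(VM_NAME)        # find the guest domain
dom.attachDeviceFlags(HOSTDEV_XML,      # hot-plug the GPU into the guest
                      libvirt.VIR_DOMAIN_AFFECT_LIVE |
                      libvirt.VIR_DOMAIN_AFFECT_CONFIG)
conn.close()
```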
SR-IOV Background
• SR-IOV partitions a single physical device into multiple virtual functions
• Virtual functions are almost indistinguishable from physical functions
• Virtual functions are passed to virtual machines using PCI passthrough
[Image from: http://docs.oracle.com/cd/E23824_01/html/819-3196/figures/sriov-intro.png]
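The sketch below shows one common way virtual functions are created on Linux: writing the desired VF count to the device's sriov_numvfs attribute in sysfs. This is a generic illustration rather than the exact configuration used here; the PCI address and VF count are placeholders, and Mellanox ConnectX adapters additionally require SR-IOV to be enabled in firmware and in the driver.

```python
from pathlib import Path

# Placeholder PCI address of the physical function (e.g. a ConnectX-3 HCA);
# find the real one with `lspci`.
PF_ADDR = "0000:04:00.0"
NUM_VFS = 4

pf_dir = Path("/sys/bus/pci/devices") / PF_ADDR

# The kernel reports how many VFs the device supports...
total_vfs = int((pf_dir / "sriov_totalvfs").read_text())
assert NUM_VFS <= total_vfs, "device does not support that many VFs"

# ...and creates them when a count is written to sriov_numvfs.
(pf_dir / "sriov_numvfs").write_text(str(NUM_VFS))

# Each VF now appears as its own PCI device (virtfn0, virtfn1, ...) and can
# be handed to a guest with PCI passthrough, just like a physical device.
for link in sorted(pf_dir.glob("virtfn*")):
    print(link.name, "->", link.resolve().name)
```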
Multi-GPU with SR-IOV and GPUDirect
• Many real applications extend beyond a single node’s capabilities
• Test multi-node performance with InfiniBand SR-IOV and GPUDirect
• 4 Sandy Bridge nodes equipped with K20/K40 GPUs
  – ConnectX-3 IB with SR-IOV enabled
  – Ported Mellanox OFED 2.1-1 to the 3.13 kernel
  – KVM hypervisor
• Test with LAMMPS, the OSU Microbenchmarks, and HOOMD
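For context, a run like the LAMMPS tests might be driven by a small launcher such as the one sketched below, which starts the LAMMPS GPU package across the VMs with MPI. None of this comes from the talk: the hostfile, binary name, input script, and rank/GPU counts are placeholders, and the actual harness used for these results may differ.

```python
import subprocess

# Placeholders: one VM per line in the hostfile, 8 MPI ranks per node,
# and the stock LAMMPS rhodopsin benchmark input from its bench/ directory.
HOSTFILE = "vm_hosts.txt"
RANKS = 32                 # e.g. the "32c" configuration across 4 nodes
GPUS_PER_NODE = 1          # "4g" total across 4 nodes

cmd = [
    "mpirun", "-np", str(RANKS), "-hostfile", HOSTFILE,
    "lmp_mpi",                          # LAMMPS binary name varies by build
    "-sf", "gpu",                       # use the gpu suffix styles
    "-pk", "gpu", str(GPUS_PER_NODE),   # GPU package: GPUs per node
    "-in", "in.rhodo",                  # rhodopsin benchmark input
]

print("running:", " ".join(cmd))
subprocess.run(cmd, check=True)
```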
LAMMPS Rhodopsin with SR-IOV Performance
[Figure: LAMMPS Rhodopsin performance, millions of atom-timesteps per second vs. problem size (32k–512k atoms), for VM 32c/4g, VM 4c/4g, Base 32c/4g, and Base 4c/4g]
LAMMPS Lennard-Jones with SR-IOV Performance
[Figure: LAMMPS Lennard-Jones performance, millions of atom-timesteps per second vs. problem size (2k–2048k atoms), for VM 32c/4g, VM 4c/4g, Base 32c/4g, and Base 4c/4g]
LAMMPS Virtualized Performance
• Achieve 96–99% efficiency
  – Performance gap decreases with increasing problem size
• Future work needed to validate results across much larger systems
  – This work is in the early stages
GPUDirect Advantage
• Validate GPUDirect over SR-IOV
  – Uses the nvidia_peer_memory-1.0-0 kernel module
• OSU GDR Microbenchmarks
• HOOMD MD
[Image source: http://old.mellanox.com/content/pages.php?pg=products_dyn&product_family=116]
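What GPUDirect RDMA buys in practice is that MPI can move data directly between GPU memory and the InfiniBand HCA, without staging through host buffers. The sketch below illustrates that programming model with a CUDA-aware MPI driven from Python; it assumes mpi4py (3.1+) and CuPy, which are illustrative choices and not the benchmark codes used in the talk (those are the C-based OSU GDR microbenchmarks and HOOMD).

```python
# Ping-pong between two ranks using device buffers. With a CUDA-aware MPI
# (and GPUDirect RDMA on the host), the GPU arrays are passed to MPI
# directly; no explicit copies to host staging buffers are needed.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly 2 MPI ranks"

n = 1 << 20                               # 4 MiB of float32
buf = cp.full(n, rank, dtype=cp.float32)  # buffer lives in GPU memory

if rank == 0:
    comm.Send([buf, MPI.FLOAT], dest=1, tag=0)   # device pointer handed to MPI
    comm.Recv([buf, MPI.FLOAT], source=1, tag=1)
else:
    recv = cp.empty(n, dtype=cp.float32)
    comm.Recv([recv, MPI.FLOAT], source=0, tag=0)
    comm.Send([recv, MPI.FLOAT], dest=0, tag=1)

comm.Barrier()
if rank == 0:
    print("round trip complete; buf[0] =", float(buf[0]))
```

Launched with, for example, `mpirun -np 2 python pingpong.py` on a CUDA-aware MPI installation.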
OSU GDR Microbenchmarks: Latency
[Figure: average latency (us) vs. message size (1 B to 1 MB), Native vs. Virtualized; inset shows small-message latencies (0–40 us)]
OSU GDR Microbenchmarks: Bandwidth
[Figure: bandwidth (MB/s) vs. message size (1 B to 4 MB), Native vs. Virtualized]
GPUDirect-enabled VM Performance
[Figure: HOOMD GPUDirect performance, 256K-particle Lennard-Jones simulation; average timesteps per second vs. number of nodes (1–4), for VM GPUDirect, VM No GPUDirect, Base GPUDirect, and Base No GPUDirect]
Discussion
• Take-away: GPUDirect RDMA (GDR) provides nearly a 10% improvement
• The SR-IOV interconnect results in < 2% overhead
• Further work needed to validate these results on larger systems
  – Small-scale results are promising
Future Work
• For full results see:
  – J.P. Walters, et al., "GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications," IEEE Cloud 2014.
  – A.J. Younge, et al., "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect," to appear in VEE 2015.
• Next steps:
  – Extend scalability results
  – OpenStack integration
• Code: https://github.com/usc-isi/nova
Questions and Comments
• Contact me:
  – jwalters@isi.edu
  – www.isi.edu/people/jwalters/