

  1. Achieving Near-Native GPU Performance in the Cloud. John Paul Walters, Project Leader, USC Information Sciences Institute, jwalters@isi.edu

  2. Outline  Motivation  ISI’s HPC Cloud Effort  Background: PCI Passthrough, SR-IOV  Results  Conclusion

  3. Motivation  Scientific workloads demand increasing performance with greater power efficiency – Architectures have been driven towards specialization and heterogeneity  Infrastructure-as-a-Service (IaaS) clouds can democratize access to the latest, most powerful accelerators – If performance goals are met  Can we provide HPC-class performance in the cloud?

  4. ISI’s HPC Cloud Work  Cloud computing is traditionally seen as a resource for IT – Web servers, databases  More recently, researchers have begun to leverage the public cloud as an HPC resource – An AWS virtual cluster ranks 101st on the Top500 list  Major difference between HPC and IT in the cloud: – Types of resources, heterogeneity  Our contribution: we’re developing heterogeneous HPC extensions for the OpenStack cloud computing platform

  5. OpenStack Background  OpenStack founded by Rackspace and NASA  In use by Rackspace, HP, and many others for their public clouds  Open source with hundreds of participating companies  In use for both public and private clouds  Current stable release: OpenStack Juno – OpenStack Kilo to be released in April [Figure: Google Trends searches for common open source IaaS projects: OpenStack, CloudStack, OpenNebula, Eucalyptus]

  6. Accessing GPUs from Virtual Hosts Using API Remoting [Figures: Host to Device Bandwidth, Pageable (MB/sec vs. transfer size in bytes) and Matrix Multiply for Increasing NxM (GFlops/sec vs. size NxM, single-precision real), each comparing Host, gVirtus, and LXC] I/O performance is low for gVirtus/KVM, while LXC is much closer to native performance. Larger matrix multiplies amortize the I/O transfer cost, making LXC and native performance indistinguishable.
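
For context, the sketch below shows the general shape of the two measurements plotted above: timing a pageable host-to-device copy and a single-precision matrix multiply with cuBLAS. It is an illustrative reconstruction, not the benchmark code from the talk; the transfer size, matrix dimension, and timing method are assumptions.

```cuda
// Minimal sketch of the two measurements behind the charts above: pageable
// host-to-device copy bandwidth and single-precision matrix-multiply rate.
// Illustrative only; the 64 MiB transfer and 4096x4096 matrix are assumed sizes.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const size_t BYTES = 64u << 20;   // 64 MiB pageable transfer (assumed)
    const int N = 4096;               // square matrix dimension (assumed)

    // --- Host-to-device bandwidth, pageable memory ---
    char *h_buf = (char *)malloc(BYTES);   // pageable (not pinned) host buffer
    char *d_buf;
    cudaMalloc(&d_buf, BYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, BYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D pageable: %.1f MB/s\n", (BYTES / 1e6) / (ms / 1e3));

    // --- Single-precision matrix multiply (SGEMM) throughput ---
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, sizeof(float) * N * N);
    cudaMalloc(&d_B, sizeof(float) * N * N);
    cudaMalloc(&d_C, sizeof(float) * N * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, d_A, N, d_B, N, &beta, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    double gflops = 2.0 * N * N * N / (ms / 1e3) / 1e9;
    printf("SGEMM %dx%d: %.1f GFlops/s\n", N, N, gflops);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```

With API remoting (gVirtus), every call above is intercepted and forwarded to the host, which is why the copy-bound test suffers while the compute-bound SGEMM recovers as the matrix grows.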

  7. Accelerators and Virtualization • Combine non-virtualized accelerators with virtual hosts • Results in > 99% efficiency [Figure: SHOC performance for common signal-processing kernels; relative performance (0.96 to 1.01) of KVM, Xen, LXC, and VMWare across SGEMM, DGEMM, and single- and double-precision FFT/IFFT kernels, with and without PCIe transfer]

  8. PCI Passthrough Background  1:1 mapping of physical device to virtual machine  Device remains non-virtualized
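
To make the 1:1 mapping concrete: because the VM owns the physical GPU, an ordinary device query run inside the guest reports the real hardware, exactly as on bare metal. The snippet below is illustrative and not from the talk.

```cuda
// Illustrative guest-side device query. With PCI passthrough the standard
// CUDA runtime inside the VM sees the physical GPU directly, so it reports
// the real device name, PCI address, and memory size.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s at PCI %04x:%02x:%02x, %.1f GB\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID,
               prop.pciDeviceID, prop.totalGlobalMem / 1e9);
    }
    return 0;
}
```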

  9. SR-IOV Background  SR-IOV partitions a single physical device into multiple virtual functions  Virtual functions almost indistinguishable from physical functions  Virtual functions passed to virtual machines using PCI passthrough (Image from: http://docs.oracle.com/cd/E23824_01/html/819-3196/figures/sriov-intro.png)
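
As a rough sketch of how virtual functions come into existence: the Linux kernel exposes a per-device sriov_numvfs attribute for creating them. The snippet below (plain host C, no GPU code, compiles under nvcc as well) shows only this generic interface; the PCI address and VF count are placeholders, and the ConnectX-3 setup in this talk may instead configure VFs through a Mellanox driver parameter.

```cuda
// Hypothetical sketch: request SR-IOV virtual functions via the kernel's
// generic sriov_numvfs attribute. Placeholder PCI address; run as root on a
// device whose driver supports SR-IOV. Not the exact procedure from the talk.
#include <stdio.h>

int main(void) {
    const char *path =
        "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs"; /* placeholder */
    const int num_vfs = 4;  /* e.g., one virtual function per guest */

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("open sriov_numvfs");
        return 1;
    }
    fprintf(f, "%d\n", num_vfs);
    fclose(f);
    printf("Requested %d virtual functions on %s\n", num_vfs, path);
    return 0;
}
```

Each resulting virtual function shows up as its own PCI device, which is what gets handed to a guest via PCI passthrough.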

  10. Multi-GPU with SR-IOV and GPUDirect  Many real applications extend beyond a single node’s capabilities  Test multi-node performance with InfiniBand SR-IOV and GPUDirect  4 Sandy Bridge nodes equipped with K20/K40 GPUs – ConnectX-3 IB with SR-IOV enabled – Ported Mellanox OFED 2.1-1 to the 3.13 kernel – KVM hypervisor  Test with LAMMPS, OSU Microbenchmarks, and HOOMD

  11. LAMMPS Rhodopsin with SR-IOV Performance [Figure: LAMMPS Rhodopsin performance in millions of atom-timesteps per second vs. problem size (32k to 512k), comparing VM and bare-metal (Base) runs at 32 cores/4 GPUs and 4 cores/4 GPUs]

  12. LAMMPS Lennard-Jones with SR-IOV Performance [Figure: LAMMPS Lennard-Jones performance in millions of atom-timesteps per second vs. problem size (2k to 2048k), comparing VM and bare-metal (Base) runs at 32 cores/4 GPUs and 4 cores/4 GPUs]

  13. LAMMPS Virtualized Performance  Achieve 96% to 99% efficiency – Performance gap decreases with increasing problem size  Future work needed to validate results across much larger systems – This work is in the early stages

  14. GPUDirect Advantage  Validate GPUDirect over SR-IOV – Uses the nvidia_peer_memory-1.0-0 kernel module  OSU GDR Microbenchmarks  HOOMD MD (Image source: http://old.mellanox.com/content/pages.php?pg=products_dyn&product_family=116)
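
A minimal sketch (not the OSU benchmark itself) of what GPUDirect enables: a CUDA-aware MPI such as MVAPICH2-GDR accepts device pointers directly, so GPU buffers move between nodes without an explicit staging copy through host memory. Message size and rank roles below are arbitrary choices for illustration.

```cuda
// Two-rank GPU-to-GPU transfer using a CUDA-aware MPI. With GPUDirect RDMA
// the HCA reads/writes GPU memory directly; without it, the MPI library
// silently stages the buffer through the host. Compile with the MPI compiler
// wrapper around nvcc.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;            // 1M floats (~4 MB), assumed size
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));
        // Device pointer handed straight to MPI: no cudaMemcpy to the host.
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", N);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Run with two ranks on two of the nodes above; with MVAPICH2-GDR, setting MV2_USE_CUDA=1 enables device-pointer support (exact launcher flags depend on the MPI build and are assumptions here).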

  15. OSU GDR Microbenchmarks: Latency [Figure: average latency (us) vs. message size from 1 byte to 1 MB, native vs. virtualized, with an inset detailing small-message latencies]

  16. OSU GDR Microbenchmarks: Bandwidth [Figure: bandwidth (MB/s) vs. message size from 1 byte to 4 MB, native vs. virtualized]

  17. GPUDirect-enabled VM Performance [Figure: HOOMD GPUDirect performance, 256K-particle Lennard-Jones simulation; average timesteps per second vs. number of nodes (1 to 4), comparing VM and bare-metal (Base) runs with and without GPUDirect]

  18. Discussion  Take-away: GPUDirect (GDR) provides nearly a 10% improvement  The SR-IOV interconnect results in < 2% overhead  Further work needed to validate these results on larger systems – Small-scale results are promising

  19. Future Work  For full results see: – J.P. Walters et al., "GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications," IEEE Cloud 2014 – A.J. Younge et al., "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect," to appear in VEE 2015  Next steps: – Extend scalability results – OpenStack integration  Code: https://github.com/usc-isi/nova

  20. Questions and Comments  Contact me: – jwalters@isi.edu – www.isi.edu/people/jwalters/
