MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
Jie Zhang, Xiaoyi Lu, Mark Arnold and Dhabaleswar K. Panda

Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
Network Based Computing Laboratory

Single Root I/O Virtualization (SR-IOV)
• Single Root I/O Virtualization (SR-IOV) provides new opportunities to design HPC clouds with very low overhead
– Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
– Each VF can be dedicated to a single VM through PCI pass-through
– VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
[Figure: three guests, each with a guest OS and a VF driver, run above a hypervisor holding the PF driver; an I/O MMU and PCI Express connect them to SR-IOV hardware exposing three Virtual Functions and one Physical Function]

Inter-VM Shared Memory (IVShmem)
• SR-IOV shows near-native performance for inter-node point-to-point communication
• However, SR-IOV is NOT VM locality aware
• IVShmem offers zero-copy access to data in the shared memory of co-resident VMs
[Figure: two guests, each running an MPI process in user space and a PCI device VF driver in kernel space, above a hypervisor with the PF driver; the IV-Shmem channel goes through /dev/shm/ in the host environment, and the SR-IOV channel goes through the Virtual Functions of the InfiniBand adapter]

Problem Statement
• How can a high-performance MPI library be designed to take advantage of SR-IOV and IVShmem efficiently, delivering VM locality aware communication and optimal performance?
• How can an HPC cloud with near-native performance for MPI applications be built over SR-IOV enabled InfiniBand clusters?
• How much performance improvement can the proposed design achieve on MPI point-to-point operations, collective operations and applications in HPC clouds?
• How much benefit can the proposed approach with InfiniBand provide compared to Amazon EC2?

VM Locality Aware MVAPICH2 Design Overview
• MVAPICH2 library running in native and virtualized environments
• In the virtualized environment
– Support shared-memory channels (SMP, IVShmem) and the SR-IOV channel
– Locality detection
– Communication coordination
[Figure: two MPI library stacks. Native hardware: Application, MPI Layer, ADI3 Layer, SMP Channel and Network Channel, Communication Device APIs (Shared Memory, InfiniBand API). Virtualized hardware: Application, MPI Layer, ADI3 Layer, a Virtual Machine Aware layer with Locality Detector and Communication Coordinator, SMP Channel, IVShmem Channel and SR-IOV Channel, Communication Device APIs (Shared Memory, InfiniBand API)]

Virtual Machine Locality Detection
• Create a VM List structure on the IVShmem region of each host
• Each MPI process writes its own membership information into the shared VM List structure according to its global rank
• One byte each, lock-free, O(N)
[Figure: four guests on one host running MPI ranks 0, 1, 4 and 5, each with a PCI device VF driver above the hypervisor's PF driver; the VM List in IVShmem (/dev/shm/) reads 1 1 0 0 1 1 0 0 0 0 0 0]
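
The VM List idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MVAPICH2 implementation: a plain bytearray stands in for the IVShmem region, the job size of 12 is taken from the flag pattern in the figure, and the function names are hypothetical.

```python
# Sketch of a per-host VM List: one byte per global MPI rank.
# Assumption: a bytearray stands in for the IVShmem region on the host.

def create_vm_list(job_size):
    """Return a zeroed VM List with one byte per global rank."""
    return bytearray(job_size)

def register(vm_list, global_rank):
    """Mark a rank as present on this host.

    Each process writes only the byte at its own global rank,
    so writers never touch the same location and need no lock.
    """
    vm_list[global_rank] = 1

def co_resident_ranks(vm_list):
    """Scan the list once, O(N), and return the ranks found on this host."""
    return [rank for rank, flag in enumerate(vm_list) if flag == 1]

vm_list = create_vm_list(12)      # hypothetical job size
for rank in (0, 1, 4, 5):         # ranks shown on the host in the figure
    register(vm_list, rank)

print(list(vm_list))
print(co_resident_ranks(vm_list))
```

Because every rank owns exactly one byte, registration needs no synchronization, which is what makes the structure lock-free.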

Communication Coordination
• Retrieve VM locality detection information
• Schedule communication channels based on VM locality information
• Fast index, light-weight
[Figure: Guest1 (MPI process rank 1) and Guest2 (MPI process rank 4) each hold a Communication Coordinator that reads the list 1 1 0 0 1 1 0 0 and chooses between the IVShmem channel (/dev/shm in the host environment) and the SR-IOV channel (Virtual Function on the InfiniBand adapter)]
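
A minimal sketch of how a coordinator might pick a channel from the locality flags is shown below. It assumes the flag list from the figure, uses hypothetical names, and leaves out the SMP channel that the design also lists, since the flags shown on the slide do not distinguish processes inside the same VM.

```python
# Sketch of channel selection driven by VM locality flags.
# Assumption: vm_list[r] == 1 means global rank r runs on the local host.

IVSHMEM = "IVShmem"
SR_IOV = "SR-IOV"

def select_channel(vm_list, peer_rank):
    """Choose the channel for a peer with one indexed lookup."""
    return IVSHMEM if vm_list[peer_rank] == 1 else SR_IOV

vm_list = bytes([1, 1, 0, 0, 1, 1, 0, 0])   # flag pattern from the figure
my_rank = 1

# Build the channel table for every peer of rank 1.
channels = {
    peer: select_channel(vm_list, peer)
    for peer in range(len(vm_list))
    if peer != my_rank
}
print(channels)
```

The lookup is a single array index per peer, which matches the "fast index, light-weight" property named on the slide.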

MVAPICH2 with SR-IOV over OpenStack
• OpenStack is one of the most popular open-source solutions to build a cloud and manage large numbers of virtual machines
• Deployment with OpenStack
– Supporting SR-IOV configuration
– Extending Nova in OpenStack to support IVShmem
– Virtual Machine Aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds
[Figure: OpenStack services around the VM. Heat orchestrates the cloud, Horizon provides the UI, Nova provisions VMs, Neutron provides the network, Glance provides images and stores them in Swift, Cinder provides volumes and backs them up in Swift, Ceilometer monitors, Keystone provides authentication]

Experimental HPC Cloud

Cloud Testbeds
Cluster | Nowlab Cloud | Amazon EC2
Instance | 4 Core/VM; 8 Core/VM | 4 Core/VM; 8 Core/VM
Platform | RHEL 6.5, Qemu+KVM HVM | Amazon Linux (EL6), Xen HVM C3.xlarge Instance; Amazon Linux (EL6), Xen HVM C3.2xlarge Instance
CPU | SandyBridge Intel(R) Xeon E5-2670 (2.6GHz) | IvyBridge Intel(R) Xeon E5-2680v2 (2.8GHz)
RAM | 6 GB; 12 GB | 7.5 GB; 15 GB
Interconnect | FDR (56Gbps) InfiniBand, Mellanox ConnectX-3 with SR-IOV | 10 GigE with Intel ixgbevf SR-IOV driver

Performance Evaluation
• Performance of MPI Level Point-to-point Operations
– Inter-node MPI Level Two-sided Operations
– Intra-node MPI Level Two-sided Operations
– Intra-node MPI Level One-sided Operations
• Performance of MPI Level Collective Operations
– Broadcast, Allreduce, Allgather and Alltoall
• Performance of Typical MPI Benchmarks and Applications
– NAS and Graph500
*Amazon EC2 does not so far let users explicitly place VMs on one physical node. We allocate multiple VMs in one logical group and compare the point-to-point performance of each pair of VMs. We treat the pair of VMs with the lowest latency as located within one physical node (intra-node), and all other pairs as inter-node.
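
The footnote's placement heuristic can be sketched as follows. The VM names and latency values are made up for illustration, and the function name is hypothetical; only the rule itself (lowest-latency pair counts as intra-node) comes from the slide.

```python
# Sketch of the EC2 placement heuristic from the footnote.
# Assumption: latency_us maps a (vm_a, vm_b) pair to its measured
# point-to-point latency in microseconds.

def classify_vm_pairs(latency_us):
    """Label the lowest-latency pair intra-node and all others inter-node."""
    closest_pair = min(latency_us, key=latency_us.get)
    return {
        pair: "intra-node" if pair == closest_pair else "inter-node"
        for pair in latency_us
    }

# Made-up measurements, not results from the evaluation.
measured = {
    ("vm0", "vm1"): 62.0,
    ("vm0", "vm2"): 95.0,
    ("vm1", "vm2"): 97.0,
}
placement = classify_vm_pairs(measured)
print(placement)
```

This is an inference from timing rather than a guarantee of co-location, which is why the slides present it as a working assumption.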

Inter-node MPI Level Two-sided Point-to-Point Performance
• EC2 C3.xlarge instances
• Similar performance to SR-IOV-Def
• Compared to Native, similar overhead to that at the basic IB level
• Compared to EC2, up to 29X and 16X speedup on latency and bandwidth

Intra-node MPI Level Two-sided Point-to-Point Performance
• EC2 C3.xlarge instances
• Compared to SR-IOV-Def, up to 84% and 158% improvement on latency and bandwidth
• Compared to Native, 3%-7% overhead for latency, 3%-8% overhead for bandwidth
• Compared to EC2, up to 160X and 28X speedup on latency and bandwidth
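
The slides report some gains as percentages and others as "X" factors. The sketch below shows one way to relate the two. The definitions are assumptions, since the slides do not say how the percentages were computed: a latency improvement is read as the fraction of baseline latency removed, and a bandwidth improvement as the fraction of baseline bandwidth gained.

```python
# Sketch: convert a percentage improvement into a multiplicative factor.
# Both definitions below are assumptions, not stated in the slides.

def latency_speedup(improvement_pct):
    """Factor by which latency shrinks if it drops by improvement_pct
    percent of the baseline (valid for values below 100)."""
    return 1.0 / (1.0 - improvement_pct / 100.0)

def bandwidth_ratio(improvement_pct):
    """Factor by which bandwidth grows if it rises by improvement_pct
    percent of the baseline."""
    return 1.0 + improvement_pct / 100.0

lat_factor = round(latency_speedup(84), 2)    # 84% figure from the slide
bw_factor = round(bandwidth_ratio(158), 2)    # 158% figure from the slide
print(lat_factor, bw_factor)
```

Under these assumed definitions the percentage figures and the "X" figures sit on different scales, so they should not be compared directly.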

Intra-node MPI Level One-sided Put Performance
• EC2 C3.xlarge instances
• Compared to SR-IOV-Def, up to 63% and 42% improvement on latency and bandwidth
• Compared to EC2, up to 134X and 33X speedup on latency and bandwidth

Intra-node MPI Level One-sided Get Performance
• EC2 C3.xlarge instances
• Compared to SR-IOV-Def, up to 70% improvement on both latency and bandwidth
• Compared to EC2, up to 121X and 24X speedup on latency and bandwidth

MPI Level Collective Operations Performance (4 cores/VM * 4 VMs)
• EC2 C3.xlarge instances
• Compared to SR-IOV-Def, up to 74% and 60% improvement on Broadcast and Allreduce
• Compared to EC2, up to 65X and 22X speedup on Broadcast and Allreduce

MPI Level Collective Operations Performance (4 cores/VM * 4 VMs)
• EC2 C3.xlarge instances
• Compared to SR-IOV-Def, up to 74% and 81% improvement on Allgather and Alltoall
• Compared to EC2, up to 28X and 45X speedup on Allgather and Alltoall

MPI Level Collective Operations Performance (4 cores/VM * 16 VMs)
• Compared to SR-IOV-Def, up to 41% and 45% improvement on Broadcast and Allreduce

MPI Level Collective Operations Performance (4 cores/VM * 16 VMs)
• Compared to SR-IOV-Def, up to 40% and 39% improvement on Allgather and Alltoall

Performance of Typical MPI Benchmarks and Applications (8 cores/VM * 4 VMs)
• EC2 C3.2xlarge instances
• Compared to Native, 2%-9% overhead for NAS, around 6% overhead for Graph500
• Compared to EC2, up to 4.4X (FT) speedup for NAS, up to 12X (20,10) speedup for Graph500

Performance of Typical MPI Benchmarks and Applications (8 cores/VM * 8 VMs)
• EC2 C3.2xlarge instances
• Compared to Native, 6%-9% overhead for NAS, around 8% overhead for Graph500