Building Efficient HPC Clouds with MVAPICH2 and OpenStack over SR-IOV-enabled Heterogeneous Clusters
Talk at OpenStack Summit 2018, Vancouver, Canada
by Xiaoyi Lu and Dhabaleswar K. (DK) Panda, The Ohio State University
– High Performance Interconnects – InfiniBand (with SR-IOV): <1 µsec latency, 200 Gbps bandwidth
– Multi-/Many-core processors
– SSDs, object storage clusters
– Large memory nodes (up to 2 TB)
– Cloud deployments on systems such as SDSC Comet and TACC Stampede
– Single Root I/O Virtualization (SR-IOV) allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
– VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
– Each VF can be dedicated to a single VM through PCI pass-through (see the sketch below)
[Figure: SR-IOV architecture, showing three guest VMs with VF drivers bound to Virtual Functions, a hypervisor with the PF driver and I/O MMU, and SR-IOV hardware exposing one Physical Function and multiple Virtual Functions over PCI Express]
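To make the VF concept concrete, the sketch below uses the standard Linux sysfs interface to ask a Physical Function to expose Virtual Functions. The device name (mlx5_0) and the VF count are assumptions for illustration, not values from this talk.

/* Minimal sketch: request 4 Virtual Functions from an InfiniBand
 * Physical Function through the standard Linux sysfs interface.
 * Device name and VF count are illustrative; root is required. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *numvfs = "/sys/class/infiniband/mlx5_0/device/sriov_numvfs";
    FILE *f = fopen(numvfs, "w");
    if (!f) {
        perror("fopen");          /* needs an SR-IOV-capable HCA */
        return EXIT_FAILURE;
    }
    fprintf(f, "4\n");            /* the PF now exposes 4 VFs on PCIe */
    fclose(f);
    /* Each VF can then be handed to a guest VM via PCI pass-through. */
    return EXIT_SUCCESS;
}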
– KVM, Docker, Singularity, etc.
– SR-IOV, IVShmem, IPC-Shm, CMA, etc.
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Offload; Non-blocking; Topology-aware
– Multiple end-points per node
– Migration support with virtual machines
– OpenStack, Slurm, etc.
– Standalone, OpenStack
– SLURM alone, SLURM + OpenStack
– RDMA-Hadoop, OpenStack Swift
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM), available since 2015
– Used by more than 2,900 organizations in 86 countries
– More than 469,000 (> 0.46 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘17 ranking)
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
– SR-IOV shows near-native performance for inter-node point-to-point communication
– IVSHMEM offers shared-memory-based data access across co-resident VMs
– Locality Detector: maintains the locality information of co-resident virtual machines
– Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively (see the sketch below)
[Figure: MVAPICH2-Virt architecture on one host: two guest VMs, each running an MPI process with a VF driver bound to a passed-through Virtual Function; the hypervisor holds the PF driver for the InfiniBand adapter's Physical Function, and a /dev/shm-backed IVSHMEM region provides the IV-Shmem channel alongside the SR-IOV channel]
Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014
MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015
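To make the Locality Detector / Communication Coordinator idea concrete, here is a minimal sketch of the selection logic. The locality table and helper are hypothetical stand-ins, not MVAPICH2-Virt's internal API.

/* Sketch: co-resident VMs communicate over IVSHMEM, everything else
 * goes through the SR-IOV Virtual Function. The static table stands in
 * for the Locality Detector, which gathers this information at startup
 * (e.g., via a region under /dev/shm). */
#include <stdio.h>

typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

static const int coresident[4] = { 1, 1, 0, 0 };  /* 1 = peer VM on same host */

/* Communication Coordinator: pick the channel for a given peer rank. */
static channel_t select_channel(int peer_rank)
{
    return coresident[peer_rank] ? CHANNEL_IVSHMEM : CHANNEL_SRIOV;
}

int main(void)
{
    for (int peer = 0; peer < 4; peer++)
        printf("peer %d -> %s\n", peer,
               select_channel(peer) == CHANNEL_IVSHMEM ? "IVSHMEM" : "SR-IOV");
    return 0;
}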
[Figure: Application execution time for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native on SPEC MPI2007 (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) and Graph500 (varying scale and edgefactor)]
– Standalone, OpenStack
– SLURM alone, SLURM + OpenStack
– RDMA-Hadoop, OpenStack Swift
[Figure: Proposed high-performance VM migration framework: source and target hosts each run a hypervisor with an InfiniBand adapter, an Ethernet adapter, IVShmem, and SR-IOV VFs for the guest VMs; controller components include the Migration Trigger, Ready-to-Migrate Detector, Network Suspend Trigger, Network Reactive Notifier, and Migration Done Detector]
– Addresses the challenges of detachment/re-attachment of the virtualized IB device and the IB connection
– Migration Controller coordinates with MPI applications during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, track migration status) and reactivates communication channels afterwards (see the sequence sketch below)
– Proposes both a Progress Engine (PE) based and a Migration-Thread based (MT) design to optimize VM migration and MPI application performance
High Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017
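The sequence below sketches the migration steps listed above. Every function is a named placeholder standing in for the controller's actions, not an MVAPICH2-Virt API.

/* Hypothetical outline of one VM migration: quiesce MPI traffic,
 * detach the SR-IOV VF and IVSHMEM device, live-migrate the VM over
 * the Ethernet adapter, reattach the devices, and resume on IB. */
#include <stdio.h>

static void set_pre_migration(void)  { puts("PRE_MIGRATION: suspend IB channels"); }
static void detach_vf(void)          { puts("detach SR-IOV VF"); }
static void remove_ivshmem(void)     { puts("remove IVSHMEM device"); }
static void migrate_vm(void)         { puts("live-migrate VM over Ethernet"); }
static void attach_vf(void)          { puts("attach SR-IOV VF"); }
static void add_ivshmem(void)        { puts("add IVSHMEM device"); }
static void set_post_migration(void) { puts("POST_MIGRATION: reactivate channels"); }

int main(void)
{
    set_pre_migration();    /* Ready-to-Migrate detected, traffic quiesced */
    detach_vf();
    remove_ivshmem();
    migrate_vm();
    attach_vf();
    add_ivshmem();
    set_post_migration();   /* Migration Done detected, MPI resumes on IB */
    return 0;
}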
[Figure: Breakdown of VM migration time over TCP, IPoIB, and RDMA into the stages Set PRE_MIGRATION, Detach VF, Remove IVSHMEM, Migration, Attach VF, Add IVSHMEM, and Set POST_MIGRATION]
[Figure: Total migration time for 2, 4, 8, and 16 VMs, comparing the sequential migration framework with the proposed migration framework]
[Figure: Pt2Pt latency and Bcast latency (4 VMs × 2 procs/VM) during migration for 1 B to 1 MB messages, comparing PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]
[Figure: Application execution time with migration for NAS (LU.C, EP.C, IS.C, MG.C, CG.C) and Graph500, comparing PE, MT-worst, MT-typical, and NM]
– Standalone, OpenStack
– SLURM alone, SLURM + OpenStack
– RDMA-Hadoop, OpenStack Swift
– What is the performance overhead of running MPI applications on multiple containers per host in an HPC cloud?
– Can we propose a design to reduce the performance bottleneck on such a container-based HPC cloud?
– Can the optimized design deliver near-native performance for different container deployment scenarios?
– Proposed a locality-aware design with CMA and shared-memory channels for MPI communication across co-resident containers (see the CMA sketch below)
High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016
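The sketch below shows the CMA (Cross Memory Attach) primitive behind the design above: one process_vm_readv() call copies data straight out of a co-resident peer's address space, avoiding the extra copy that a shared-memory staging buffer needs. The helper name is illustrative, and the peer PID and remote address are assumed to have been exchanged beforehand (for example through a small shared-memory region).

/* Minimal CMA sketch: copy 'len' bytes directly from a co-resident
 * peer process. This helper is illustrative, not the library's code. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

ssize_t cma_read(pid_t peer_pid, void *local_buf,
                 void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* The kernel copies peer -> local in one step; no intermediate
     * shared-memory buffer and no second copy are required. */
    return process_vm_readv(peer_pid, &local, 1, &remote, 1, 0);
}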
[Figure: Execution time with Container-Def, Container-Opt, and Native for NAS Class D (MG, FT, EP, LU, CG) and Graph500 BFS (scale 20, edgefactor 16) with 1 container × 16 processes, 2 containers × 8 processes, and 4 containers × 4 processes]
[Figure: MPI bandwidth with Singularity vs. native, intra-node and inter-node, for 1 KB to 1 MB messages on Haswell and on KNL; Singularity stays within about 7% of native]
– Shared-memory transfers bear the cost of maintaining cache coherence
– The interconnect outperforms shared-memory-based transfer for large message sizes
[Figure: Application execution time with Singularity vs. native for Graph500 (scale/edgefactor 22,16 to 26,20) and NAS Class D (CG, EP, FT, IS, LU, MG); overhead stays within about 7%]
Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC, 2017. (Best Student Paper Award)
– Standalone, OpenStack
– SLURM alone, SLURM + OpenStack
– RDMA-Hadoop, OpenStack Swift
[Figure: Nested virtualization stack: hardware, host OS, and hypervisor hosting a RedHat VM1 and an Ubuntu VM2, each running a Docker Engine with two containers (bins/libs plus application stack)]
1. Intra-VM Intra-Container (across core 4 and core 5)
2. Intra-VM Inter-Container (across core 13 and core 14)
3. Inter-VM Inter-Container (across core 6 and core 12)
4. Inter-Node Inter-Container (across core 15 and a core on a remote node)
[Figure: Dual-socket node with QPI linking two NUMA nodes (cores 0-7 on NUMA 0, cores 8-15 on NUMA 1), each with its own memory controller; VM 0 and VM 1 each host two containers, and the four communication patterns above are marked 1-4]
[Figure: Proposed design on the same dual-socket node: a Two-Layer Locality Detector (Container Locality Detector, VM Locality Detector, Nested Locality Combiner) feeds a Two-Layer NUMA-Aware Communication Coordinator that selects among the CMA, shared memory (SHM), and network (HCA) channels]
– Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in co-resident VMs
– Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message characteristics to select the appropriate communication channel (sketched below)
Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017
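A minimal sketch of the two-layer channel choice (names and policy are illustrative): CMA inside a VM, the shared-memory channel across co-resident VMs, and the SR-IOV HCA otherwise. The real coordinator additionally weighs NUMA placement and message size.

/* Two-layer channel selection sketch for container-in-VM deployments.
 * 'same_vm' and 'same_host' stand in for the Two-Layer Locality
 * Detector's output; the policy shown here is simplified. */
#include <stdio.h>
#include <stddef.h>

typedef enum { CH_CMA, CH_SHM, CH_HCA } channel_t;

static channel_t select_channel(int same_vm, int same_host, size_t msg_size)
{
    (void)msg_size;               /* the real design also uses size/NUMA info */
    if (same_vm)   return CH_CMA; /* co-resident containers in one VM   */
    if (same_host) return CH_SHM; /* co-resident VMs on the same host   */
    return CH_HCA;                /* remote node: SR-IOV VF over IB     */
}

int main(void)
{
    printf("%d %d %d\n",
           select_channel(1, 1, 64),   /* CH_CMA */
           select_channel(0, 1, 64),   /* CH_SHM */
           select_channel(0, 0, 64));  /* CH_HCA */
    return 0;
}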
[Figure: Point-to-point latency and bandwidth across co-resident containers for 1 B to 1 MB messages, comparing Default, 1Layer, 2Layer, Native (w/o CMA), and Native]
[Figure: Point-to-point latency and bandwidth for 1 B to 1 MB messages, comparing Default, 1Layer, 2Layer, Basic-Hybrid, Enhanced-Hybrid, Native (w/o CMA), and Native]
[Figure: Application execution time under nested virtualization for Graph500 BFS (scale/edgefactor 22,20 to 28,16) and NAS Class D (IS, MG, EP, FT, CG, LU), comparing Default, 1Layer, and 2Layer-Enhanced-Hybrid]
– The proposed two-layer design reduces execution time for both Graph 500 and NAS, with the largest NAS benefit on LU
– Standalone, OpenStack
– SLURM alone, SLURM + OpenStack
– RDMA-Hadoop, OpenStack Swift
[Figure: SPANK-based SLURM-V workflow: mpirun_vm loads the SPANK plugin; Slurmctld registers the job step with the Slurmd daemons, which load SPANK to run the VM Config Reader, VM Launcher, and VM Reclaimer; the MPI job runs across the launched VMs via a VM hostfile, and hosts are released once VMs are reclaimed]
– VM Config Reader: registers all VM configuration options in the job control environment so that they are visible to all allocated nodes
– VM Launcher: sets up VMs on each allocated node, exclusively allocates a free VF, and dynamically attaches it and an IVSHMEM device to each VM
– VM Reclaimer: tears down the VMs and reclaims resources (a SPANK plugin skeleton follows below)
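The skeleton below shows how the three components can hang off Slurm's SPANK plugin interface. The entry points and return codes are standard SPANK; the VM handling is described only in comments and is not SLURM-V's actual code.

/* SPANK plugin skeleton in the spirit of SLURM-V (sketch only). */
#include <slurm/spank.h>

SPANK_PLUGIN(slurm_v_sketch, 1);

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    /* VM Config Reader: register VM configuration options in the job
     * control environment so they are visible on all allocated nodes. */
    return ESPANK_SUCCESS;
}

int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    /* VM Launcher: on each allocated node, lock and allocate a free
     * SR-IOV VF, attach it and an IVSHMEM device, and boot the VM. */
    return ESPANK_SUCCESS;
}

int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    /* VM Reclaimer: tear down the VMs and reclaim VFs and hosts. */
    return ESPANK_SUCCESS;
}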
– Alternative design: offload VM launching and reclaiming to the underlying OpenStack infrastructure
SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016
[Figure: SPANK plugin over OpenStack: mpirun_vm loads SPANK and the VM Config Reader; Slurmd asks the OpenStack daemon to launch VMs (VM Launcher) and later to reclaim them (VM Reclaimer), while job steps are registered through Slurmctld and hosts are released at the end]
[Figure: Graph500 BFS execution time, VM vs. native, under three SLURM-V scenarios: exclusive allocations with sequential jobs, shared-host allocations with concurrent jobs, and exclusive allocations with concurrent jobs; the observed VM overhead is around 4%]
– Standalone, OpenStack
– SLURM alone, SLURM + OpenStack
1 https://github.com/francopestilli/life
The Brain Connectome. Illustration of a set of fascicles (white matter bundles) obtained by using a tractography algorithm; these white-matter tracts (shown with different colors here) connect different cortical areas of the human brain. LiFE1 (Linear Fascicle Evaluation) is an approach to predict diffusion measurements in brain connectomes.
Technology is the key!
– The two key operations, w = M^T y and y = Mw, are parallelized among multiple MPI processes (see the sketch below)
1 https://github.com/francopestilli/life
2 http://mvapich.cse.ohio-state.edu/
[Figure: Computation of w = M^T y and of y = Mw using 2 MPI processes]
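A compact sketch of how the two operations can be split across MPI ranks: M is row-partitioned, so y = Mw needs no communication, while w = M^T y sums per-rank partial products with MPI_Allreduce. The dense layout and sizes are simplifications for illustration; the real LiFE matrix is large and sparse.

/* Row-partitioned sketch of the two LiFE kernels.
 *   y = M w   : purely local per rank
 *   w = M^T y : local partial products + MPI_Allreduce(SUM) */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int rows_local = 1000, cols = 500;   /* illustrative sizes */
    double *M  = calloc((size_t)rows_local * cols, sizeof *M);
    double *w  = calloc(cols, sizeof *w);
    double *y  = calloc(rows_local, sizeof *y);
    double *wp = calloc(cols, sizeof *wp);     /* partial M^T y */
    double *wt = calloc(cols, sizeof *wt);     /* global  M^T y */

    /* y = M w : each rank multiplies only its own block of rows. */
    for (int i = 0; i < rows_local; i++)
        for (int j = 0; j < cols; j++)
            y[i] += M[(size_t)i * cols + j] * w[j];

    /* w = M^T y : sum the per-rank partial products across all ranks. */
    for (int i = 0; i < rows_local; i++)
        for (int j = 0; j < cols; j++)
            wp[j] += M[(size_t)i * cols + j] * y[i];
    MPI_Allreduce(wp, wt, cols, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(M); free(w); free(y); free(wp); free(wt);
    MPI_Finalize();
    return 0;
}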
[Figure: LiFE speedup on a single node for 1-64 MPI processes, and execution time breakup on Stampede2: NNLS, Load, MM, Reduce, Gather, Bcast]
[Figure: LiFE speedup with Docker on Chameleon for 1-24 nodes, comparing Native, 1 Container, and 2 Containers]
– Standalone, OpenStack
– SLURM alone, SLURM + OpenStack
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
– Virtualization-aware designs for the main Hadoop components:
– HDFS: Virtualization-aware Block Management to improve fault-tolerance
– YARN: Extensions to Container Allocation Policy to reduce network traffic
– MapReduce: Extensions to Map Task Scheduling Policy to reduce network traffic
– Hadoop Common: Topology Detection Module for automatic topology detection
– Communications go through RDMA-based designs over SR-IOV enabled InfiniBand
[Figure: RDMA-Hadoop-Virt architecture: Big Data applications (CloudBurst, MR-MS Polygraph, others) run over HDFS, YARN, MapReduce, Hadoop Common, and HBase, extended with Virtualization-Aware Block Management, the Container Allocation Policy Extension, the Map Task Scheduling Policy Extension, and the Topology Detection Module, on virtual machines, bare-metal nodes, and containers]
Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.
– 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
– 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
[Figure: Execution time of CloudBurst and Self-Join in Default and Distributed Modes, RDMA-Hadoop vs. RDMA-Hadoop-Virt]
– HTTP-based object storage service (a GET example appears below the architecture figure)
– Multiple Object Servers: store the data
– Few Proxy Servers: act as a proxy for all requests
– Ring: handles metadata
– Typical use cases: input/output source for Big Data applications (the most common use case), software/data backup, and storage of VM/Docker images
[Figure: Swift architecture: clients send PUT/GET requests (PUT/GET /v1/<account>/<container>/<object>) to a Proxy Server, which consults the Ring and forwards them to Object Servers, each managing multiple disks]
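As an illustration of the HTTP path in the figure, this libcurl snippet issues a Swift GET against the /v1/<account>/<container>/<object> URL. The endpoint, account, container, object name, and token are placeholders.

/* Hypothetical Swift GET via libcurl; the object body goes to stdout. */
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    struct curl_slist *hdrs = curl_slist_append(NULL, "X-Auth-Token: <token>");
    curl_easy_setopt(curl, CURLOPT_URL,
        "http://proxy.example.com:8080/v1/AUTH_account/container/object");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    /* GET is libcurl's default; a PUT upload would set CURLOPT_UPLOAD
     * and CURLOPT_READDATA instead. */
    CURLcode rc = curl_easy_perform(curl);

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}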
– Proxy server is a bottleneck for large-scale deployments
– Object upload/download operations are network intensive
– Can an RDMA-based approach benefit?
– Re-designed Swift architecture for improved scalability and performance, with two proposed designs: a Client-Oblivious Design (D1) and a Metadata Server-based Design (D2, with data transferred directly between clients and object servers, bypassing the proxy server)
– RDMA-based communication framework for accelerating networking performance
– High-performance I/O framework to provide maximum performance
Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud. Accepted at CCGrid'17, May 2017
OpenStack Summit 2018 44 Network Based Computing Laboratory 5 10 15 20 25 Swift PUT Swift-X (D1) PUT Swift-X (D2) PUT Swift GET Swift-X (D1) GET Swift-X (D2) GET LATENCY (S) TIME BREAKUP OF GET AND PUT Communication I/O Hashsum Other
5 10 15 20
1MB 4MB 16MB 64MB 256MB 1GB 4GB
LATENCY (s) OBJECT SIZE GET LATENCY EVALUATION Swift Swift-X (D2) Swift-X (D1) Reduced by 66%
3.8x for PUT and up to 2.8x for GET
– CentOS 7 KVM SR-IOV: Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/
– MPI bare-metal cluster complex appliance (based on Heat): deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/
– MPI + SR-IOV KVM cluster (based on Heat): deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation and configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/
– CentOS 7 SR-IOV RDMA-Hadoop: built from the CentOS 7 appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/
– High-Performance SR-IOV + InfiniBand – High-Performance MVAPICH2 Library over bare-metal InfiniBand clusters – High-Performance MVAPICH2 Library with Virtualization Support over SR-IOV enabled KVM clusters – High-Performance Hadoop with RDMA-based Enhancements Support
[*] Only includes appliances contributed by OSU NowLab
– Standalone, OpenStack, Slurm, and Slurm + OpenStack
– Support Virtual Machine Migration with SR-IOV InfiniBand devices
– Support Virtual Machine, Container (Docker and Singularity), and Nested Virtualization
– SR-IOV, IVSHMEM, Docker support, OpenStack
MVAPICH2-GDR, ...) and RDMA-Spark/Memcached
June 27, 2018 (Wednesday) 1:45-2:45 pm
HPC Cloud BoF 2017 was held in conjunction with SC’17 http://sc17.supercomputing.org/presentation/?id=bof165&sess=sess357
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
MVAPICH/MVAPICH2: http://mvapich.cse.ohio-state.edu/