

SLIDE 1

Building Efficient HPC Clouds with MVAPICH2 and OpenStack over SR-IOV-enabled Heterogeneous Clusters

Talk at OpenStack Summit 2018 Vancouver, Canada by

Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
Xiaoyi Lu, The Ohio State University, E-mail: luxi@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~luxi

SLIDE 2

  • Cloud Computing is widely adopted in industry computing environments
  • Cloud Computing provides high resource utilization and flexibility
  • Virtualization is the key technology to enable Cloud Computing
  • An Intersect360 study shows cloud is the fastest-growing class of HPC
  • HPC meets Cloud: the convergence of Cloud Computing and HPC

HPC Meets Cloud Computing

SLIDE 3

Drivers of Modern HPC Cluster and Cloud Architecture

  • Multi-core/many-core technologies, Accelerators
  • Large memory nodes
  • Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
  • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  • Single Root I/O Virtualization (SR-IOV)

[Figure: building blocks of modern HPC clusters and clouds – high-performance interconnects such as InfiniBand with SR-IOV (<1 µs latency, 200 Gbps bandwidth), multi-/many-core processors, SSDs and object storage clusters, and large-memory nodes (up to 2 TB); example systems: SDSC Comet and TACC Stampede]

SLIDE 4

  • Single Root I/O Virtualization (SR-IOV) provides new opportunities to design HPC clouds with very low overhead

Single Root I/O Virtualization (SR-IOV)

  • Allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
  • VFs are designed based on the existing non-virtualized PFs; no driver change is needed
  • Each VF can be dedicated to a single VM through PCI pass-through
  • Works with 10/40 GigE and InfiniBand (a small host-side query sketch is shown below)

[Figure: SR-IOV architecture – guest OSes with VF drivers, hypervisor with the PF driver and I/O MMU, and SR-IOV hardware exposing one Physical Function and multiple Virtual Functions over PCI Express]
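The sketch below is an illustrative example (not part of MVAPICH2) of how the host side of SR-IOV can be inspected: it reads the standard sysfs attributes sriov_totalvfs and sriov_numvfs that the Linux PCI layer exposes for an SR-IOV-capable device. The PCI address used here is a placeholder; substitute the address of the actual InfiniBand HCA (e.g., from lspci).

    /* Illustrative sketch: query SR-IOV VF counts for a PCI Physical Function
     * via sysfs. The PCI address below is a placeholder. */
    #include <stdio.h>

    static long read_sysfs_long(const char *path)
    {
        FILE *f = fopen(path, "r");
        long val = -1;
        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        const char *pf = "0000:03:00.0";   /* placeholder PF address */
        char path[128];

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/sriov_totalvfs", pf);
        long total = read_sysfs_long(path);

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/sriov_numvfs", pf);
        long enabled = read_sysfs_long(path);

        printf("PF %s: %ld VFs supported, %ld currently enabled\n", pf, total, enabled);
        return 0;
    }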

SLIDE 5

  • Virtualization Support with Virtual Machines and Containers

– KVM, Docker, Singularity, etc.

  • Communication coordination among optimized communication channels on Clouds

– SR-IOV, IVShmem, IPC-Shm, CMA, etc.

  • Locality-aware communication
  • Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)

  • Scalable Collective communication

– Offload; Non-blocking; Topology-aware

  • Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)

– Multiple end-points per node

  • NUMA-aware communication for nested virtualization
  • Integrated Support for GPGPUs and Accelerators
  • Fault-tolerance/resiliency

– Migration support with virtual machines

  • QoS support for communication and I/O
  • Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, …)
  • Energy-Awareness
  • Co-design with resource management and scheduling systems on Clouds

– OpenStack, Slurm, etc.

Broad Challenges of Building Efficient HPC Clouds

SLIDE 6

  • MVAPICH2-Virt with SR-IOV and IVSHMEM

– Standalone, OpenStack

  • SR-IOV-enabled VM Migration Support in MVAPICH2
  • MVAPICH2 with Containers (Docker and Singularity)
  • MVAPICH2 with Nested Virtualization (Container over VM)
  • MVAPICH2-Virt on SLURM

– SLURM alone, SLURM + OpenStack

  • Neuroscience Applications on HPC Clouds
  • Big Data Libraries on Cloud

– RDMA-Hadoop, OpenStack Swift

Approaches to Build HPC Clouds

SLIDE 7

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,900 organizations in 86 countries
– More than 469,000 (> 0.46 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘17 ranking)

  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 9th, 556,104 cores (Oakforest-PACS) in Japan
  • 12th, 368,928-core (Stampede2) at TACC
  • 17th, 241,108-core (Pleiades) at NASA
  • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade
SLIDE 8

  • Redesign MVAPICH2 to make it virtual-machine aware

– SR-IOV shows near-native performance for inter-node point-to-point communication
– IVSHMEM offers shared-memory-based data access across co-resident VMs
– Locality Detector: maintains the locality information of co-resident virtual machines
– Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively (an illustrative sketch follows the figure below)

  • Support deployment with OpenStack

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

[Figure: MVAPICH2-Virt host environment – MPI processes in co-resident guests communicate either over the IVSHMEM channel (/dev/shm-backed IV-SHM device) or over dedicated SR-IOV Virtual Functions of the InfiniBand adapter, with the Physical Function driver residing in the hypervisor]
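To make the role of the Locality Detector and Communication Coordinator concrete, here is a deliberately simplified, illustrative C sketch of the channel-selection idea: co-resident peers use the shared-memory (IVSHMEM) channel, all other peers use the SR-IOV network channel. The data structures and names are assumptions for illustration only and are not the actual MVAPICH2-Virt internals.

    /* Illustrative-only sketch of the channel-selection idea: co-resident
     * peers use IVSHMEM, everything else uses the SR-IOV channel. Not the
     * actual MVAPICH2-Virt implementation. */
    #include <stdio.h>
    #include <string.h>

    typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

    /* Locality info assumed to be one host name per MPI rank. */
    static channel_t select_channel(const char *my_host, const char *peer_host)
    {
        return strcmp(my_host, peer_host) == 0 ? CHANNEL_IVSHMEM : CHANNEL_SRIOV;
    }

    int main(void)
    {
        const char *hosts[] = { "nodeA", "nodeA", "nodeB" };  /* ranks 0..2 */
        for (int peer = 1; peer < 3; peer++) {
            channel_t c = select_channel(hosts[0], hosts[peer]);
            printf("rank 0 -> rank %d: %s\n", peer,
                   c == CHANNEL_IVSHMEM ? "IVSHMEM channel (co-resident)"
                                        : "SR-IOV channel");
        }
        return 0;
    }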

  • J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
  • J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014
  • J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015

SLIDE 9

[Figure: SPEC MPI2007 (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) execution time (s) and Graph500 execution time (ms) vs. problem size (Scale, Edgefactor) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

  • 32 VMs, 6 Core/VM
  • Compared to Native, 2-5% overhead for Graph500 with 128 Procs
  • Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs

Application-Level Performance on Chameleon


SLIDE 10

  • MVAPICH2-Virt with SR-IOV and IVSHMEM

– Standalone, OpenStack

  • SR-IOV-enabled VM Migration Support in MVAPICH2
  • MVAPICH2 with Containers (Docker and Singularity)
  • MVAPICH2 with Nested Virtualization (Container over VM)
  • MVAPICH2-Virt on SLURM

– SLURM alone, SLURM + OpenStack

  • Neuroscience Applications on HPC Clouds
  • Big Data Libraries on Cloud

– RDMA-Hadoop, OpenStack Swift

Approaches to Build HPC Clouds

SLIDE 11

Execute Live Migration with SR-IOV Device

SLIDE 12

High Performance SR-IOV enabled VM Migration Support in MVAPICH2

[Figure: SR-IOV-enabled VM migration framework – an external Controller works with a Ready-to-Migrate Detector, Network Suspend Trigger, Network Reactive Notifier, Migration Trigger, and Migration Done Detector; guest VMs on each host use SR-IOV VFs and IVShmem on the IB adapter, while an Ethernet adapter carries the migration traffic]

  • J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

  • Migration with an SR-IOV device has to handle the challenges of detaching/re-attaching the virtualized IB device and the IB connection
  • Consists of an SR-IOV-enabled IB cluster and an external Migration Controller
  • Multiple parallel libraries notify MPI applications during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, migration status)
  • Handles suspending and reactivating the IB connection
  • Proposes Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and MPI application performance

SLIDE 13

  • Compared with TCP, the RDMA scheme reduces the total migration time by 20%
  • Total time is dominated by the 'Migration' step; times for the other steps are similar across the different schemes
  • The proposed migration framework reduces migration time by up to 51%

Performance Evaluation of VM Migration Framework

[Figure: breakdown of VM migration time (s) – Set PRE_MIGRATION, Detach VF, Remove IVSHMEM, Migration, Attach VF, Add IVSHMEM, Set POST_MIGRATION – for TCP, IPoIB, and RDMA schemes, and multiple-VM migration time (s) for 2–16 VMs with the sequential vs. proposed migration framework]

SLIDE 14

Bcast (4VMs * 2Procs/VM)

  • Migrate a VM from one machine to another while the benchmark is running inside (an illustrative ping-pong latency sketch is shown below)
  • Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock overhead
  • No benefit from MT because no computation is involved

Performance Evaluation of VM Migration Framework

Pt2Pt Latency

[Figure: point-to-point latency and broadcast latency (us) vs. message size (1 B–1 MB) for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]
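For reference, the point-to-point latency numbers above come from an OSU-style ping-pong pattern. The following is a minimal, illustrative MPI sketch of that pattern (not the actual OSU micro-benchmark); compile with mpicc and run with exactly two ranks, e.g., one rank in each VM.

    /* Illustrative ping-pong latency sketch (OSU-style, not the OSU code).
     * Build: mpicc latency.c -o latency ; run with exactly 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_SIZE 8
    #define ITERS    1000

    int main(int argc, char **argv)
    {
        int rank;
        char buf[MSG_SIZE] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)   /* each iteration is one round trip */
            printf("Avg one-way latency: %.2f us\n",
                   elapsed / (2.0 * ITERS) * 1e6);

        MPI_Finalize();
        return 0;
    }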

SLIDE 15

Graph500

  • 8 VMs in total; 1 VM carries out migration while the application is running
  • MT-worst and PE incur some overhead compared with NM
  • MT-typical allows migration to be completely overlapped with computation

Performance Evaluation of VM Migration Framework

NAS

[Figure: NAS (Class C: LU, EP, IS, MG, CG) and Graph500 execution times for PE, MT-worst, MT-typical, and NM]

SLIDE 16

  • MVAPICH2-Virt with SR-IOV and IVSHMEM

– Standalone, OpenStack

  • SR-IOV-enabled VM Migration Support in MVAPICH2
  • MVAPICH2 with Containers (Docker and Singularity)
  • MVAPICH2 with Nested Virtualization (Container over VM)
  • MVAPICH2-Virt on SLURM

– SLURM alone, SLURM + OpenStack

  • Neuroscience Applications on HPC Clouds
  • Big Data Libraries on Cloud

– RDMA-Hadoop, OpenStack Swift

Approaches to Build HPC Clouds

SLIDE 17

  • Container-based technologies (e.g., Docker) provide lightweight virtualization solutions
  • Container-based virtualization – containers share the host kernel

Overview of Containers-based Virtualization

[Figure: VM-based vs. container-based virtualization stacks]

SLIDE 18

  • What are the performance bottlenecks when running MPI applications on multiple containers per host in an HPC cloud?
  • Can we propose a new design to overcome the bottlenecks on such a container-based HPC cloud?
  • Can the optimized design deliver near-native performance for different container deployment scenarios?
  • Locality-aware design to enable the CMA and shared-memory channels for MPI communication across co-resident containers (a CMA sketch is shown after the reference below)

Containers-based Design: Issues, Challenges, and Approaches

  • J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016
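The CMA channel mentioned above relies on the Linux Cross Memory Attach system calls, which let one process copy data directly from another co-resident process's address space. The standalone sketch below (not MVAPICH2 code) demonstrates the mechanism with process_vm_readv(); note that the kernel's ptrace permission rules apply, which is why the parent reads from its child here.

    /* Standalone demonstration of Cross Memory Attach (CMA): the parent copies
     * a buffer directly out of its child's address space with process_vm_readv(),
     * with no intermediate shared-memory staging copy.
     * Build on Linux: gcc cma_demo.c -o cma_demo */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/uio.h>
    #include <sys/wait.h>

    int main(void)
    {
        static char buf[64];
        pid_t child = fork();

        if (child == 0) {                 /* child: fill the buffer, then linger */
            strcpy(buf, "hello from the child address space");
            sleep(2);
            return 0;
        }

        sleep(1);                         /* crude sync: let the child write first */

        char local_buf[64] = {0};
        struct iovec local  = { local_buf, sizeof(local_buf) };
        struct iovec remote = { buf,       sizeof(buf) };  /* same vaddr in child */

        ssize_t n = process_vm_readv(child, &local, 1, &remote, 1, 0);
        printf("parent copied %zd bytes: \"%s\"\n", n, local_buf);

        wait(NULL);
        return 0;
    }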

SLIDE 19

[Figure: NAS Class D (MG, FT, EP, LU, CG) execution time (s) for Container-Def, Container-Opt, and Native]

  • 64 containers across 16 nodes, pinning 4 cores per container
  • Compared to Container-Def, up to 11% and 73% reduction in execution time for NAS and Graph500
  • Compared to Native, less than 9% and 5% overhead for NAS and Graph500

Application-Level Performance on Docker with MVAPICH2

[Figure: Graph500 BFS execution time (ms), Scale 20 and Edgefactor 16, for 1 container x 16 processes, 2 containers x 8 processes, and 4 containers x 4 processes – Container-Def, Container-Opt, and Native]

SLIDE 20

Singularity Performance on Different Processor Architectures

  • MPI point-to-point Bandwidth
  • On both Haswell and KNL, less than 7% overhead for Singularity solution
  • KNL shows worse intra-node performance than Haswell because of its lower CPU frequency, complex cluster modes, and the cost of maintaining cache coherence
  • On KNL, inter-node performs better than intra-node beyond around 256 KB, as the Omni-Path interconnect outperforms shared-memory-based transfer for large message sizes (an illustrative bandwidth sketch is shown below)

[Figure: MPI point-to-point bandwidth (MB/s) vs. message size on Haswell and on KNL – Singularity vs. Native, intra-node and inter-node; less than 7% overhead]
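The bandwidth results above follow the usual OSU-style windowed streaming pattern. Below is a minimal, illustrative MPI sketch of such a measurement (not the actual OSU benchmark); the message size, window, and iteration counts are arbitrary example values. Compile with mpicc and run with exactly two ranks.

    /* Illustrative windowed bandwidth sketch (OSU-style, not the OSU code).
     * Build: mpicc bw.c -o bw ; run with exactly 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)   /* 1 MB messages (example value) */
    #define WINDOW   32          /* messages in flight per iteration */
    #define ITERS    20

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc((size_t)MSG_SIZE * WINDOW);
        MPI_Request req[WINDOW];

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Isend(buf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                              1, 0, MPI_COMM_WORLD, &req[w]);
                MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Irecv(buf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                              0, 0, MPI_COMM_WORLD, &req[w]);
                MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);  /* ack */
            }
        }

        double elapsed = MPI_Wtime() - start;
        if (rank == 0)
            printf("Bandwidth: %.2f MB/s\n",
                   (double)MSG_SIZE * WINDOW * ITERS / 1e6 / elapsed);

        free(buf);
        MPI_Finalize();
        return 0;
    }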

SLIDE 21

[Figure: Graph500 BFS execution time (ms) vs. problem size (Scale, Edgefactor) – Singularity vs. Native]

Singularity Performance on Haswell with InfiniBand

[Figure: Class D NAS (CG, EP, FT, IS, LU, MG) execution time (s) – Singularity vs. Native]

  • 512 processes across 32 Haswell nodes
  • Singularity delivers near-native performance: less than 7% overhead on Haswell with InfiniBand

  • J. Zhang, X. Lu, D. K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC, 2017. (Best Student Paper Award)

SLIDE 22

  • MVAPICH2-Virt with SR-IOV and IVSHMEM

– Standalone, OpenStack

  • SR-IOV-enabled VM Migration Support in MVAPICH2
  • MVAPICH2 with Containers (Docker and Singularity)
  • MVAPICH2 with Nested Virtualization (Container over VM)
  • MVAPICH2-Virt on SLURM

– SLURM alone, SLURM + OpenStack

  • Neuroscience Applications on HPC Clouds
  • Big Data Libraries on Cloud

– RDMA-Hadoop, OpenStack Swift

Approaches to Build HPC Clouds

SLIDE 23

Nested Virtualization: Containers over Virtual Machines

  • Useful for live migration, application sandboxing, legacy system integration, software deployment, etc.
  • Performance issues because of the redundant call stacks (two-layer virtualization) and isolated physical resources

[Figure: nested stack – hardware, host OS, and hypervisor hosting guest VMs (e.g., RedHat and Ubuntu), each running a Docker Engine with multiple containers (app stack plus bins/libs)]

SLIDE 24

Multiple Communication Paths in Nested Virtualization

1. Intra-VM Intra-Container (across core 4 and core 5)
2. Intra-VM Inter-Container (across core 13 and core 14)
3. Inter-VM Inter-Container (across core 6 and core 12)
4. Inter-Node Inter-Container (across core 15 and a core on a remote node)

[Figure: two-socket node with NUMA 0 and NUMA 1 connected by QPI, each with its own memory controller; VM 0 (Container 0 and Container 1) runs on NUMA 0 cores and VM 1 (Container 2 and Container 3) on NUMA 1 cores, illustrating the four communication paths above]

  • Different VM placements introduce multiple communication paths at the container level
SLIDE 25

Overview of Proposed Design in MVAPICH2

[Figure: proposed design – a Two-Layer Locality Detector (Container Locality Detector, VM Locality Detector, and Nested Locality Combiner) feeds a Two-Layer NUMA-Aware Communication Coordinator, which selects among the CMA, Shared Memory (SHM), and Network (HCA) channels for the four communication paths]

  • Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as those in the co-resident VMs
  • Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and the message to select the appropriate communication channel

  • J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand. VEE, 2017

SLIDE 26

Inter-VM Inter-Container Pt2Pt (Intra-Socket)

[Figure: intra-socket inter-VM inter-container point-to-point latency (us) and bandwidth (MB/s) vs. message size for Default, 1Layer, 2Layer, Native (w/o CMA), and Native]

  • 1Layer has similar performance to the Default
  • Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for latency and BW, respectively

SLIDE 27

Inter-VM Inter-Container Pt2Pt (Inter-Socket)

[Figure: inter-socket inter-VM inter-container point-to-point latency (us) and bandwidth (MB/s) vs. message size for Default, 1Layer, 2Layer, Basic-Hybrid, Enhanced-Hybrid, Native (w/o CMA), and Native]

  • 1-Layer has similar performance to the Default
  • 2-Layer has near-native performance for small msg, but clear overhead on large msg
  • Compared to 2-Layer, the Hybrid design brings up to 42% and 25% improvement for latency and BW, respectively

SLIDE 28

Application-level Evaluations

[Figure: Graph500 BFS execution time (s) vs. problem size and Class D NAS (IS, MG, EP, FT, CG, LU) execution time (s) for Default, 1Layer, and 2Layer-Enhanced-Hybrid]

  • 256 processes across 64 containers on 16 nodes
  • Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph500 and 10% (LU) for NAS
  • Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit

SLIDE 29

  • MVAPICH2-Virt with SR-IOV and IVSHMEM

– Standalone, OpenStack

  • SR-IOV-enabled VM Migration Support in MVAPICH2
  • MVAPICH2 with Containers (Docker and Singularity)
  • MVAPICH2 with Nested Virtualization (Container over VM)
  • MVAPICH2-Virt on SLURM

– SLURM alone, SLURM + OpenStack

  • Neuroscience Applications on HPC Clouds
  • Big Data Libraries on Cloud

– RDMA-Hadoop, OpenStack Swift

Approaches to Build HPC Clouds

SLIDE 30

[Figure: SPANK-based workflow – mpirun_vm starts an MPI job across VMs with a vm hostfile; Slurmctld and the Slurmd daemons load the SPANK plugin (VM Config Reader, VM Launcher, VM Reclaimer), register and run the job step, then reclaim the VMs and release the hosts]

  • VM Configuration Reader – registers all VM configuration options, set in the job control environment so that they are visible to all allocated nodes
  • VM Launcher – sets up VMs on each allocated node
    – File-based lock to detect occupied VFs and exclusively allocate a free VF
    – Assigns a unique ID to each IVSHMEM device and dynamically attaches it to each VM
  • VM Reclaimer – tears down VMs and reclaims resources (a hypothetical SPANK plugin skeleton is sketched below)

SLURM SPANK Plugin based Design

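For orientation, Slurm SPANK plugins are C shared objects that hook into job life-cycle events, which is where the VM Launcher and VM Reclaimer logic would sit. The skeleton below is a hypothetical sketch using the standard SPANK callback names; the VM-handling bodies are placeholders only and do not reproduce the actual MVAPICH2-Virt/Slurm-V plugin.

    /* Hypothetical SPANK plugin skeleton (not the actual Slurm-V plugin).
     * Callback names follow the standard Slurm SPANK API; the VM-handling
     * bodies below are placeholders only.
     * Build (example): gcc -shared -fPIC vm_spank.c -o vm_spank.so */
    #include <slurm/spank.h>

    SPANK_PLUGIN(mvapich2_virt_sketch, 1);

    /* Runs when the plugin is loaded; on the compute node (remote context)
     * this is where a "VM Launcher" step could set up VMs, attach a free VF,
     * and attach the IVSHMEM device. */
    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
        if (spank_context() == S_CTX_REMOTE)
            slurm_info("vm sketch: would launch VM(s) and attach VF/IVSHMEM here");
        return ESPANK_SUCCESS;
    }

    /* Runs at teardown; this is where a "VM Reclaimer" step could destroy the
     * VMs and release the file-based VF lock. */
    int slurm_spank_exit(spank_t sp, int ac, char **av)
    {
        if (spank_context() == S_CTX_REMOTE)
            slurm_info("vm sketch: would tear down VM(s) and release resources here");
        return ESPANK_SUCCESS;
    }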

SLIDE 31

  • VM Configuration Reader – registers VM options
  • VM Launcher, VM Reclaimer – offloaded to the underlying OpenStack infrastructure
    – PCI whitelist to pass a free VF through to the VM
    – Extend Nova to enable IVSHMEM when launching the VM

SLURM SPANK Plugin with OpenStack based Design

  • J. Zhang, X. Lu, S. Chakraborty, D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016

[Figure: SPANK + OpenStack workflow – mpirun_vm loads the SPANK VM Config Reader with a VM hostfile; Slurmd asks the OpenStack daemon to launch the VMs (VM Launcher) and later to reclaim them (VM Reclaimer), registers and runs the job step, then releases the hosts]

SLIDE 32

  • 32 VMs across 8 nodes, 6 Core/VM
  • EASJ - Compared to Native, less than 4% overhead with 128 Procs
  • SACJ, EACJ – Also minor overhead, when running NAS as concurrent job with 64 Procs

Application-Level Performance on Chameleon (Graph500)

[Figure: Graph500 BFS execution time (ms) vs. problem size (Scale, Edgefactor), VM vs. Native, under Exclusive Allocations Sequential Jobs (EASJ), Shared-host Allocations Concurrent Jobs (SACJ), and Exclusive Allocations Concurrent Jobs (EACJ); less than 4% overhead]

SLIDE 33

  • MVAPICH2-Virt with SR-IOV and IVSHMEM

– Standalone, OpenStack

  • SR-IOV-enabled VM Migration Support in MVAPICH2
  • MVAPICH2 with Containers (Docker and Singularity)
  • MVAPICH2 with Nested Virtualization (Container over VM)
  • MVAPICH2-Virt on SLURM

– SLURM alone, SLURM + OpenStack

  • Neuroscience Applications on HPC Clouds
  • Big Data Libraries on Cloud

– RDMA-Hadoop, OpenStack Swift

Approaches to Build HPC Clouds

SLIDE 34

  • Easy and Fast Discovery is the key!

NeuroScience Meets HPC Cloud

[1] https://github.com/francopestilli/life

The Brain Connectome: illustration of a set of fascicles (white matter bundles) obtained by using a tractography algorithm. Fascicles are grouped together, forming white matter tracts (shown with different colors here) connecting different cortical areas of the human brain. LiFE [1] (Linear Fascicle Evaluation) is an approach to predict diffusion measurements in brain connectomes.

  • Easy-to-use and high-performance technology (virtualization, cloud computing) is the key!

SLIDE 35

  • Identified the computationally intensive tasks as the matrix-by-vector products w = M^T y and y = M w
  • The computationally intensive functions have been parallelized using MPI by dividing the task among multiple MPI processes (a row-partitioned sketch is shown below)
  • The implementation uses MVAPICH2 [2], from the OSU team

MPI-based LiFE for Brain Health: Initial Design using MVAPICH2 MPI Library

[1] https://github.com/francopestilli/life
[2] http://mvapich.cse.ohio-state.edu/
[Figure: computation of w = M^T y and of y = M w, each using 2 MPI processes]
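A minimal sketch of the row-partitioned scheme described above: each rank owns a block of rows of M and the matching slice of y, forms a partial w = M^T y that is combined with MPI_Allreduce, and then computes its local rows of y = M w. The matrix dimensions and values are toy placeholders; the real MPI-LiFE code operates on large sparse data.

    /* Illustrative row-partitioned sketch of w = M^T y and y = M w with MPI.
     * Toy sizes and values; not the actual MPI-LiFE implementation.
     * Build: mpicc life_sketch.c -o life_sketch */
    #include <mpi.h>
    #include <stdio.h>

    #define NROWS_LOCAL 4    /* rows of M owned by each rank */
    #define NCOLS       8    /* length of w */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Local block of M (NROWS_LOCAL x NCOLS) and the matching slice of y. */
        double M[NROWS_LOCAL][NCOLS], y_local[NROWS_LOCAL];
        for (int i = 0; i < NROWS_LOCAL; i++) {
            y_local[i] = 1.0;
            for (int j = 0; j < NCOLS; j++)
                M[i][j] = (double)(rank + i + j) / size;
        }

        /* w = M^T y: each rank forms a partial sum over its rows, then the
         * partial vectors are combined with MPI_Allreduce. */
        double w_partial[NCOLS] = {0}, w[NCOLS];
        for (int i = 0; i < NROWS_LOCAL; i++)
            for (int j = 0; j < NCOLS; j++)
                w_partial[j] += M[i][j] * y_local[i];
        MPI_Allreduce(w_partial, w, NCOLS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* y = M w: each rank only needs its own rows of M and the full w. */
        for (int i = 0; i < NROWS_LOCAL; i++) {
            y_local[i] = 0.0;
            for (int j = 0; j < NCOLS; j++)
                y_local[i] += M[i][j] * w[j];
        }

        if (rank == 0)
            printf("w[0] = %f, y_local[0] = %f\n", w[0], y_local[0]);

        MPI_Finalize();
        return 0;
    }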

SLIDE 36

Design and Evaluation with MVAPICH2: Single Node with MPI on Intel Knights Landing (KNL)

  • Evaluation on TACC Stampede KNL (Intel Xeon Phi KNL CPUs, 68 cores, 96 GB memory per node)
  • Up to 8.7x speed up

[Figure: speedup on a single Stampede2 KNL node vs. number of MPI processes (1–64), and a time breakup (NNLS, Load, MM, Reduce, Gather, Bcast) of the evaluation on Stampede2]

The MPI-LiFE software is available from http://neurohpc.cse.ohio-state.edu as a Docker-containerized version that can run anywhere from a laptop to a cluster.

SLIDE 37

Design and Evaluation of LiFE code with MVAPICH2-Virt+Docker

  • Evaluation on Chameleon with Docker (Intel Haswell CPUs, 24 cores, 128 GB memory per node)
  • Up to 5.5x speed up on Chameleon

[Figure: LiFE speedup with Docker on Chameleon – Native, 1 Container, and 2 Containers]

SLIDE 38

  • MVAPICH2-Virt with SR-IOV and IVSHMEM

– Standalone, OpenStack

  • SR-IOV-enabled VM Migration Support in MVAPICH2
  • MVAPICH2 with Containers (Docker and Singularity)
  • MVAPICH2 with Nested Virtualization (Container over VM)
  • MVAPICH2-Virt on SLURM

– SLURM alone, SLURM + OpenStack

  • Neuroscience Applications on HPC Clouds
  • Big Data Libraries on Cloud

– RDMA-Hadoop, OpenStack Swift

Approaches to Build HPC Clouds

SLIDE 39

  • RDMA for Apache Spark
  • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions

  • RDMA for Apache HBase
  • RDMA for Memcached (RDMA-Memcached)
  • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
  • OSU HiBD-Benchmarks (OHB)

– HDFS, Memcached, HBase, and Spark Micro-benchmarks

  • http://hibd.cse.ohio-state.edu
  • User base: 285 organizations from 34 countries
  • More than 26,400 downloads from the project site

The High-Performance Big Data (HiBD) Project

  • Available for InfiniBand and RoCE; also runs on Ethernet
  • Available for x86 and OpenPOWER
  • Support for Singularity and Docker

SLIDE 40

Overview of RDMA-Hadoop-Virt Architecture

  • Virtualization-aware modules in all four main Hadoop components:
    – HDFS: Virtualization-aware Block Management to improve fault tolerance
    – YARN: Extensions to the Container Allocation Policy to reduce network traffic
    – MapReduce: Extensions to the Map Task Scheduling Policy to reduce network traffic
    – Hadoop Common: Topology Detection Module for automatic topology detection

  • Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV enabled InfiniBand

[Figure: RDMA-Hadoop-Virt architecture – Big Data applications (e.g., CloudBurst, MR-MS Polygraph) running on HDFS, YARN, MapReduce, Hadoop Common, HBase, and others, extended with the Virtualization-Aware Block Management, Container Allocation Policy, Map Task Scheduling Policy, and Topology Detection modules, deployed on virtual machines, containers, or bare-metal nodes]

  • S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.

SLIDE 41

Evaluation with Applications

– 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
– 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join

[Figure: execution time of CloudBurst and Self-Join in Default Mode and Distributed Mode – RDMA-Hadoop vs. RDMA-Hadoop-Virt (30% and 55% reductions in Distributed Mode)]

SLIDE 42

  • Distributed Cloud-based Object Storage Service
  • Deployed as part of OpenStack installation
  • Can also be deployed as a standalone storage solution
  • Worldwide data access via Internet

– HTTP-based

  • Architecture
    – Multiple Object Servers: To store data
    – Few Proxy Servers: Act as a proxy for all requests
    – Ring: Handles metadata
  • Usage
    – Input/output source for Big Data applications (most common use case)
    – Software/Data backup
    – Storage of VM/Docker images

  • Based on traditional TCP sockets communication (a minimal PUT/GET sketch against the HTTP interface is shown below)

OpenStack Swift Overview

[Figure: Swift architecture – clients send PUT/GET requests for /v1/<account>/<container>/<object> to a Proxy Server, which consults the Ring and forwards them to the Object Servers and their disks]
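Since Swift's object interface is plain HTTP, a client interaction can be sketched with libcurl as below. The endpoint, account, container, and token are placeholders (a real client would first obtain the storage URL and X-Auth-Token from Keystone); this is an illustrative example, not part of Swift or Swift-X.

    /* Illustrative Swift client sketch with libcurl: PUT an object, then GET it
     * back via /v1/<account>/<container>/<object>. Endpoint and token are
     * placeholders. Build: gcc swift_sketch.c -o swift_sketch -lcurl */
    #include <stdio.h>
    #include <string.h>
    #include <curl/curl.h>

    int main(void)
    {
        const char *url   = "http://proxy.example.com:8080/v1/AUTH_demo/mycontainer/hello.txt";
        const char *token = "X-Auth-Token: PLACEHOLDER_TOKEN";
        const char *data  = "hello swift";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        struct curl_slist *hdrs = curl_slist_append(NULL, token);

        /* PUT: upload the object body through the proxy server. */
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, data);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, (long)strlen(data));
        CURLcode rc = curl_easy_perform(curl);
        printf("PUT: %s\n", curl_easy_strerror(rc));

        /* GET: download it again (response body goes to stdout by default). */
        curl_easy_reset(curl);
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        rc = curl_easy_perform(curl);
        printf("\nGET: %s\n", curl_easy_strerror(rc));

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }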

SLIDE 43

  • Challenges
    – The proxy server is a bottleneck for large-scale deployments
    – Object upload/download operations are network-intensive
    – Can an RDMA-based approach benefit?

  • Design
    – Re-designed Swift architecture for improved scalability and performance; two proposed designs:
      • Client-Oblivious Design: no changes required on the client side
      • Metadata Server-based Design: direct communication between client and object servers; bypasses the proxy server
    – RDMA-based communication framework for accelerating networking performance
    – High-performance I/O framework to provide maximum overlap between communication and I/O

Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds

  • S. Gugnani, X. Lu, and D. K. Panda. Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud. Accepted at CCGrid '17, May 2017
[Figure: Client-Oblivious Design (D1) and Metadata Server-based Design (D2)]

SLIDE 44

[Figure: time breakup of PUT and GET latency (s) into Communication, I/O, Hashsum, and Other for Swift, Swift-X (D1), and Swift-X (D2)]

Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds

[Figure: GET latency (s) vs. object size (1 MB–4 GB) for Swift, Swift-X (D1), and Swift-X (D2) – reduced by up to 66%]

  • Up to 66% reduction in GET latency
  • Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET

SLIDE 45

Available Appliances on Chameleon Cloud*

  • CentOS 7 KVM SR-IOV – Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/
  • MPI bare-metal cluster complex appliance (based on Heat) – deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/
  • MPI + SR-IOV KVM cluster (based on Heat) – deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation and configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/
  • CentOS 7 SR-IOV RDMA-Hadoop – built from the CentOS 7 appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/

  • Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with

– High-Performance SR-IOV + InfiniBand
– High-Performance MVAPICH2 Library over bare-metal InfiniBand clusters
– High-Performance MVAPICH2 Library with Virtualization Support over SR-IOV enabled KVM clusters
– High-Performance Hadoop with RDMA-based Enhancements Support

[*] Only includes appliances contributed by OSU NowLab

SLIDE 46

  • MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC Clouds

– Standalone, OpenStack, Slurm, and Slurm + OpenStack
– Support Virtual Machine Migration with SR-IOV InfiniBand devices
– Support Virtual Machine, Container (Docker and Singularity), and Nested Virtualization

  • Very little overhead with virtualization, near native performance at application level
  • Much better performance than Amazon EC2
  • MVAPICH2-Virt is available for building HPC Clouds (http://mvapich.cse.ohio-state.edu)

– SR-IOV, IVSHMEM, Docker support, OpenStack

  • Neuroscience applications can benefit from technologies on HPC clouds
  • Big Data analytics stacks such as RDMA-Hadoop can benefit from cloud-aware designs
  • Appliances for MVAPICH2-Virt and RDMA-Hadoop are available for building HPC Clouds
  • SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X, MVAPICH2-GDR, ...) and RDMA-Spark/Memcached

Conclusions

SLIDE 47

The 2nd International BoF on Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications (HPC Cloud BoF)

HPC Cloud BoF 2018 will be held with ISC ‘18, Frankfurt, Germany

June 27, 2018 (Wednesday) 1:45-2:45 pm

HPC Cloud BoF 2017 was held in conjunction with SC’17 http://sc17.supercomputing.org/presentation/?id=bof165&sess=sess357

SLIDE 48

Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
MVAPICH/MVAPICH2: http://mvapich.cse.ohio-state.edu/

{panda, luxi}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi