

  1. Comet Virtual Clusters – What’s underneath? Philip Papadopoulos San Diego Supercomputer Center ppapadopoulos@ucsd.edu

  2. Overview • NSF Award #1341698, “Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science” • PI: Michael Norman • Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr • SDSC project in collaboration with Indiana University (led by Geoffrey Fox)

  3. Comet: System Characteristics
  • Total peak flops ~2.1 PF • Dell primary integrator • Intel Haswell processors w/ AVX2 • Mellanox FDR InfiniBand
  • Hybrid fat-tree topology • FDR (56 Gbps) InfiniBand • Rack-level (72 nodes, 1,728 cores) full bisection bandwidth • 4:1 oversubscription cross-rack
  • 1,944 standard compute nodes (46,656 cores) • Dual CPUs, each 12-core, 2.5 GHz • 128 GB DDR4 2133 MHz DRAM • 2 × 160 GB SSDs (local disk)
  • 36 → 108 GPU nodes • Same as standard nodes plus • Two NVIDIA K80 cards, each with dual Kepler GPUs (36) • Two NVIDIA P100 GPUs (72)
  • 4 large-memory nodes • 1.5 TB DDR4 1866 MHz DRAM • Four Haswell processors/node • 64 cores/node
  • Performance Storage (Aeon) • 7.6 PB, 200 GB/s; Lustre • Scratch & persistent storage segments
  • Durable Storage (Aeon) • 6 PB, 100 GB/s; Lustre • Automatic backups of critical data
  • Home directory storage • Gateway hosting nodes • Virtual image repository
  • 100 Gbps external connectivity to Internet2 & …
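As a sanity check on the ~2.1 PF figure above, here is a minimal sketch of the peak-flops arithmetic for the standard compute partition only, assuming 16 double-precision flops per cycle per Haswell core (two AVX2 FMA units × 4 doubles × 2 flops); the GPU and large-memory partitions account for the remainder.

```python
# Back-of-the-envelope peak flops for Comet's standard compute partition.
# Assumption: 16 DP flops/cycle/core for Haswell (2 AVX2 FMA units x 4 doubles x 2 flops).
nodes = 1944
cores_per_node = 2 * 12          # dual 12-core CPUs
clock_hz = 2.5e9                 # 2.5 GHz
dp_flops_per_cycle = 16

cores = nodes * cores_per_node               # 46,656 cores
peak_flops = cores * clock_hz * dp_flops_per_cycle

print(f"{cores:,} cores -> {peak_flops / 1e15:.2f} PF peak (CPU partition only)")
# ~1.87 PF from the CPU nodes; GPUs and large-memory nodes bring the system total to ~2.1 PF.
```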

  4. Comet Network Architecture: InfiniBand compute, Ethernet storage
  [Network diagram; key elements recoverable from the figure:]
  • 27 racks of 72 Haswell (HSWL) nodes with 320 GB node-local storage; 7 × 36-port FDR switches in each rack wired as a full fat-tree; 4:1 oversubscription between racks
  • Core InfiniBand: 72 FDR links into 2 × 108-port switches; mid-tier 36-port FDR switches; 36 GPU nodes and 4 large-memory nodes on the same fabric
  • 4 IB-Ethernet bridges (18-port each) into a 40GbE Arista fabric (2×); data mover nodes; Juniper router with 100 Gbps to Internet2 and the Research & Education network
  • Ethernet side: home file systems, login, gateway/VM image repository, management hosts, and a 10 GbE Ethernet management network; additional support components not shown for clarity
  • Performance Storage: 7.7 PB, 200 GB/s, 32 storage servers • Durable Storage: 6 PB, 100 GB/s, 64 storage servers
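A small sketch of the rack-level fat-tree arithmetic behind the diagram, assuming each rack's 72 nodes hang off 36-port leaf switches with half the ports used as uplinks (full bisection inside the rack) and 18 FDR uplinks leaving each rack; the 18-uplink figure is an inference consistent with the 4:1 cross-rack oversubscription, not a value stated explicitly on the slide.

```python
# Fat-tree arithmetic for one Comet rack, under the stated assumptions.
nodes_per_rack = 72
switch_ports = 36                 # 36-port FDR switches, 7 per rack
leaf_down = switch_ports // 2     # 18 node ports per leaf (full bisection in-rack)
leaf_up = switch_ports - leaf_down

leaves = nodes_per_rack // leaf_down          # 4 leaf switches
spine_links = leaves * leaf_up                # 72 links to the in-rack spine switches
rack_uplinks = 18                             # assumed cross-rack FDR uplinks per rack

oversub = nodes_per_rack / rack_uplinks
print(f"{leaves} leaf switches, {spine_links} in-rack spine links, "
      f"{oversub:.0f}:1 cross-rack oversubscription")   # -> 4:1
```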

  5. Fun with IB ↔ Ethernet Bridging
  • Comet has four (4) Ethernet ↔ IB bridge switches • 18 FDR links and 18 40GbE links each (72 total of each) • 4 × 16-port + 4 × 2-port LAGs on the Ethernet side
  • Issue #1: significant bandwidth limitation cluster → storage
  • Why? (IB routing) 1. Each LAG group has a single IB local ID (LID). 2. IB switches are destination routed – the default is that all sources for the same destination LID take the same route (port).
  • Solution: change the LID Mask Control (LMC) from 0 to 2 → every LID becomes 2^LMC addresses. At each switch level there are now 2^LMC routes to a destination LID (better route dispersion).
  • Drawbacks: IB can have about 48K endpoints. When you increase LMC for better route balancing, you reduce the size of your network: LMC=2 → ~12K nodes, LMC=3 → ~6K nodes.
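A minimal sketch of the LMC trade-off described on this slide: each endpoint consumes 2^LMC LIDs out of the roughly 48K unicast LID space, so better path dispersion comes directly at the cost of maximum fabric size. The 48K figure and the LMC=2/LMC=3 node counts are the ones quoted above.

```python
# LMC trade-off: more routes per destination vs. fewer addressable endpoints.
UNICAST_LIDS = 48 * 1024          # ~48K usable unicast LIDs in an IB subnet (per the slide)

def lmc_tradeoff(lmc: int):
    lids_per_endpoint = 2 ** lmc          # each port now answers to 2^LMC LIDs
    routes_per_dest = 2 ** lmc            # distinct routes per destination at each switch level
    max_endpoints = UNICAST_LIDS // lids_per_endpoint
    return routes_per_dest, max_endpoints

for lmc in (0, 1, 2, 3):
    routes, endpoints = lmc_tradeoff(lmc)
    print(f"LMC={lmc}: {routes} route(s) per destination, ~{endpoints:,} endpoints max")
# LMC=2 -> ~12K endpoints, LMC=3 -> ~6K, matching the slide.
```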

  6. Lustre Storage – More IB to Ethernet Issues
  • PROBLEM: losing Ethernet paths from nodes to storage
  • Mellanox bridges use proxy ARP: when an IPoIB interface on a compute node ARPs for IP address XX.YY, the bridge “answers” with its own MAC address; when it receives a packet destined for IP XX.YY, it forwards it (Layer 2) to the appropriate MAC.
  • The vendor advertised that it could handle 3K proxy ARP entries per bridge. Our network config worked for 18+ months.
  • Then came a change in opensm (the subnet manager): whenever a subnet change occurred, an ARP flood ensued (2K nodes each asking for O(64) Ethernet MAC addresses).
  • The bridge CPUs were woefully underpowered, taking minutes to respond to all the ARP requests. Lustre wasn’t happy.
  • → Redesigned the network from Layer 2 to Layer 3 (using routers inside our Arista fabric).
  (Figure: an IPoIB node asks “Who has XX.YY?”; the IB/Ethernet bridge at MAC bb answers “I do, at bb” on behalf of the Lustre storage at IP XX.YY, MAC aa – proxy ARP.)
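A rough sketch of why the bridges fell over, using the numbers on the slide (2K nodes × O(64) storage MACs per subnet event, spread across the four bridges); the per-bridge ARP processing rate below is an illustrative assumption, not a measured value.

```python
# Rough scale of the proxy-ARP storm after an opensm subnet change.
nodes = 2000                       # compute nodes re-ARPing (from the slide)
macs_per_node = 64                 # O(64) Ethernet MACs each node asks about (from the slide)
bridges = 4                        # IB <-> Ethernet bridges sharing the load

arp_requests = nodes * macs_per_node
per_bridge = arp_requests / bridges

# Hypothetical processing rate for an underpowered bridge CPU (illustrative only).
arps_per_second = 500
print(f"{arp_requests:,} ARP requests, ~{per_bridge:,.0f} per bridge, "
      f"~{per_bridge / arps_per_second / 60:.1f} minutes to drain at {arps_per_second}/s")
```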

  7. Virtualized Clusters on Comet • Goal: provide a near-bare-metal HPC performance and management experience • Target use: projects that could manage their own cluster, and: • can’t fit OUR software environment, and • don’t want to buy hardware, or • have bursty or intermittent need

  8. User Perspective
  • The user is a system administrator – we give them their own HPC cluster
  • Nucleus handles: scheduling • storage management • coordinating network changes • VM launch & shutdown
  • API: request nodes • console & power
  (Figure: a persistent virtual front end and active virtual compute nodes, each attached to and synchronized with its disk image; idle disk images remain in the image store.)

  9. User-Customized HPC: 1:1 physical-to-virtual compute node
  (Figure: a physical hosting frontend and disk image vault on the public network; each virtual frontend also sits on the public network and fronts its own private network of virtual compute nodes, with every virtual compute node mapped 1:1 onto a physical compute node.)

  10. High Performance Virtual Cluster Characteristics – Comet: Providing Virtualized HPC for XSEDE
  • InfiniBand virtualization: 8% latency overhead, nominal bandwidth overhead
  • All nodes have: private Ethernet • InfiniBand • local disk storage
  • Virtual compute nodes can network boot (PXE) from their virtual frontend
  • All disks retain state • keep user configuration between boots
  (Figure: a virtual frontend connected over private Ethernet to its virtual compute nodes.)

  11. Bare Metal “Experience”
  • Can install the virtual frontend from a bootable ISO image
  • Subordinate nodes can PXE boot
  • Compute nodes retain disk state (turning off a compute node is equivalent to turning off power on a physical node)
  • → We don’t want cluster owners to learn an entirely “new way” of doing things
  • Side comment: you don’t always have to run the way “Google does it” to do good science
  • → If you have tools to manage physical nodes today, you can use those same tools to manage your virtual cluster

  12. Benchmark Results

  13. Single Root I/O Virtualization in HPC
  • Problem: virtualization has generally resulted in significant I/O performance degradation (e.g., excessive DMA interrupts)
  • Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand host channel adapters
  • One physical function → multiple virtual functions, each lightweight but with its own DMA streams, memory space, and interrupts
  • Allows DMA to bypass the hypervisor and go directly to the VMs
  • SR-IOV enables a virtual HPC cluster with near-native InfiniBand latency/bandwidth and minimal overhead
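As a minimal sketch (not Comet's actual provisioning code), this is how SR-IOV virtual functions are typically inspected and enabled on Linux through the standard sriov_totalvfs / sriov_numvfs sysfs attributes; the PCI address is a placeholder, and older mlx4-era ConnectX-3 adapters may instead require the mlx4_core num_vfs module parameter.

```python
# Inspect and enable SR-IOV virtual functions via the standard Linux sysfs interface.
# The PCI address below is a placeholder; root privileges are required to write sriov_numvfs.
from pathlib import Path

DEVICE = Path("/sys/bus/pci/devices/0000:03:00.0")   # hypothetical HCA PCI address

def total_vfs() -> int:
    """Read the maximum number of virtual functions the device supports."""
    return int((DEVICE / "sriov_totalvfs").read_text())

def enable_vfs(count: int) -> None:
    """Enable `count` virtual functions (the kernel requires writing 0 before a new value)."""
    numvfs = DEVICE / "sriov_numvfs"
    numvfs.write_text("0")
    numvfs.write_text(str(count))

if __name__ == "__main__":
    supported = total_vfs()
    print(f"device supports up to {supported} virtual functions")
    enable_vfs(min(supported, 8))   # e.g., one VF per VM on a host running 8 VMs
```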

  14. MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones

  15. MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones

  16. WRF Weather Modeling • 96-core (4-node) calculation • Nearest-neighbor communication • Scalable algorithms • Test case: 3-hour forecast, 2.5 km resolution of the continental US (CONUS) • 2% slower w/ SR-IOV vs native IB (Figure: WRF 3.4.1 – 3hr forecast)

  17. MrBayes: Software for Bayesian inference of phylogeny. • Widely used, including by CIPRES gateway. • 32-core (2 node) calculation • Hybrid MPI/OpenMP Code. • 8 MPI tasks, 4 OpenMP threads per task. • Compilers: gcc + mvapich2 v2.2, AVX options. • Test Case: 218 taxa, 10,000 generations. • 3% slower with SR-IOV vs native IB.

  18. Quantum ESPRESSO • 48-core (3-node) calculation • CG matrix inversion – irregular communication • 3D FFT matrix transposes (all-to-all communication) • Test case: DEISA AUSURF112 benchmark • 8% slower w/ SR-IOV vs native IB

  19. RAxML: Code for Maximum Likelihood-based inference of large phylogenetic trees. • Widely used, including by CIPRES gateway. • 48-core (2 node) calculation • Hybrid MPI/Pthreads Code. • 12 MPI tasks, 4 threads per task. • Compilers: gcc + mvapich2 v2.2, AVX options. • Test Case: Comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified. • 19% slower w/ SR-IOV vs native IB.

  20. NAMD: Molecular Dynamics, ApoA1 Benchmark • 48-core (2 node) calculation • Test Case: ApoA1 benchmark. • 92,224 atoms, periodic, PME. • Binary used: NAMD 2.11, ibverbs, SMP. • Directly used prebuilt binary which uses ibverbs for multi-node runs. • 23% slower w/ SR-IOV vs native IB.
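For slides 16–20, a minimal sketch of how a “N% slower w/ SR-IOV vs native IB” figure is typically derived from wall-clock runtimes; the timing values below are placeholders for illustration, not measurements from these runs.

```python
# Percent slowdown of a virtualized (SR-IOV) run relative to a native InfiniBand run.
def slowdown_pct(t_sriov: float, t_native: float) -> float:
    return 100.0 * (t_sriov - t_native) / t_native

# Placeholder wall-clock times in seconds (illustrative only, not measured data):
t_native, t_sriov = 1000.0, 1020.0
print(f"{slowdown_pct(t_sriov, t_native):.0f}% slower w/ SR-IOV vs native IB")  # -> 2%
```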

  21. Accessing Virtual Cluster Capabilities – a much smaller API than OpenStack/EC2/GCE • REST API • Command-line interface • Command shell for scripting • Console access • (Portal) • The user does NOT see: Rocks, Slurm, etc.
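A sketch of what driving such a REST API from a script might look like; the host name, endpoint paths, and token handling here are hypothetical placeholders for illustration, not the actual Nucleus API.

```python
# Hypothetical client sketch for a small virtual-cluster REST API (placeholder endpoints).
import requests

BASE = "https://comet-nucleus.example.org/api"     # placeholder host
TOKEN = "REPLACE_ME"                                # placeholder credential
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def power_on(cluster: str, node: str) -> dict:
    """Request power-on of one virtual compute node (hypothetical endpoint)."""
    r = requests.put(f"{BASE}/cluster/{cluster}/compute/{node}/power",
                     json={"state": "on"}, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()

def console_url(cluster: str, node: str) -> str:
    """Fetch a console URL for a node (hypothetical endpoint)."""
    r = requests.get(f"{BASE}/cluster/{cluster}/console/{node}",
                     headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["url"]

if __name__ == "__main__":
    print(power_on("myvc", "compute-0"))
    print(console_url("myvc", "compute-0"))
```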
