Mapping MPI+X Applications to Multi-GPU Architectures: A Performance-Portable Approach
Edgar A. León, Computer Scientist
GPU Technology Conference, San Jose, CA, March 28, 2018
LLNL-PRES-746812. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
Application developers face greater complexity

§ Hardware architectures
— SMT
— GPUs
— FPGAs
— NVRAM
— NUMA, multi-rail

§ Applications
— Need the hardware topology to run efficiently
— Need to run on more than one architecture
— Need multiple programming abstractions (hybrid applications)

§ Programming abstractions
— MPI
— OpenMP, POSIX threads
— CUDA, OpenMP 4.5, OpenACC
— Kokkos, RAJA
How do we map hybrid applications to increasingly complex hardware?

§ Compute power is not the bottleneck
§ Data movement dominates energy consumption
§ HPC applications are dominated by the memory system
— Latency and bandwidth
— Capacity tradeoffs (multi-level memories)
§ Leverage local resources
— Avoid remote accesses

More than compute resources, it is about the memory system!
The Sierra system that will replace Sequoia features a GPU-accelerated architecture

Compute System
— 4,320 nodes
— 1.29 PB memory
— 240 compute racks
— 125 PFLOPS
— ~12 MW

Compute Rack
— Standard 19"
— Warm-water cooling

Compute Node
— 2 IBM POWER9 CPUs
— 4 NVIDIA Volta GPUs
— 256 GiB DDR4
— 16 GiB HBM2 associated with each GPU, globally addressable
— Coherent shared memory
— NVMe-compatible PCIe 1.6 TB SSD

Components
— IBM POWER9: Gen2 NVLink
— NVIDIA Volta: 7 TFlop/s, HBM2, Gen2 NVLink
— Mellanox interconnect: single-plane EDR InfiniBand, 2-to-1 tapered fat tree
— GPFS file system: 154 PB usable storage, 1.54 TB/s R/W bandwidth
[Figure: hwloc topology of the 2017 CORAL EA machine (IBM Power8+ S822LC, 256 GB total) — 2 sockets, 10 cores per socket, SMT-8, private L1/L2/L3 per core, 1 InfiniBand NIC per socket, 2 NVIDIA Pascal (Tesla P100) GPUs per socket, 2 Ethernet NICs, NVMe SSD. Figure generated with hwloc.]
Existing approaches and their limitations

§ MPI/RM approaches
— By thread
— By core
— By socket
— Latency (IBM Spectrum MPI)
— Bandwidth (IBM Spectrum MPI)

§ OpenMP approaches (see the sketch below)
— Policies: spread, close, master
— Predefined places: threads, cores, sockets

§ Limitations
— Memory system is not the primary concern
— No coherent mapping across programming abstractions
— No support for heterogeneous devices
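The OpenMP policies and places named above are normally selected through the standard OMP_PROC_BIND and OMP_PLACES environment variables. As a rough illustration (the program below is my own sketch, not taken from the talk), each OpenMP thread can report the hardware thread it was pinned to, which makes the difference between close and spread visible on an SMT machine:

```c
// check_omp_binding.c — print the CPU each OpenMP thread runs on (illustrative sketch).
// Build:  cc -fopenmp -o check_omp_binding check_omp_binding.c
// Run:    OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./check_omp_binding
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        // sched_getcpu() reports the hardware thread (PU) currently executing
        // this OpenMP thread; with OMP_PROC_BIND set, the runtime pins threads
        // so this value stays stable for the lifetime of the thread.
        printf("OpenMP thread %2d of %2d on CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```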
A portable algorithm for multi-GPU architectures: mpibind

§ 2 IBM Power8+ processors
§ Per socket
— 10 SMT-8 cores
— 1 InfiniBand NIC
— 2 Pascal GPUs
§ NVMe SSD
§ Private L1, L2, L3 per core

[Figure: node schematic — two sockets, each with local memory and two GPUs; every core has private L1/L2/L3 caches and eight hardware threads.]
mpibind's primary consideration: the memory hierarchy

[Figure: memory tree of the CORAL EA node. Level l0: machine (start); l1: 2 NUMA domains, with NIC-0/1, GPU-0..3, and NVM-0 attached; l2: L3 caches; l3: L2 caches; l4: L1 caches; l5: cores c0–c19 with hardware threads p0–p159. Example annotations: Workers = 8 and Vertices(l2) = 20, so the 8 workers split across the 2 NUMA domains, 4 workers per pair of GPUs.]
The algorithm

§ Get the hardware topology
§ Devise a memory tree
— Assign devices to memory vertices
§ Calculate the number of workers w (all processes and threads)
§ Traverse the tree to determine the level k with at least w vertices
§ Traverse the subtrees, selecting compute resources for each vertex: m': vertices(k) → PU
§ Map workers to vertices, respecting NUMA boundaries: m: workers → vertices(k) → PU

(A minimal sketch of this level selection and mapping appears below.)
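As a rough, hedged illustration of the level selection and worker mapping, here is a self-contained C sketch. It assumes a balanced memory tree described only by per-level vertex counts (modeled after the CORAL EA node) and a contiguous block of PUs under each vertex; the real mpibind walks the actual hwloc topology and attaches GPUs and NICs to memory vertices, and none of the identifiers below come from its source.

```c
// mpibind_sketch.c — toy version of mpibind's level selection and mapping.
// Assumes a balanced memory tree; vertex counts below model the CORAL EA
// node from the slides (1 machine, 2 NUMA, 20 L3, 20 L2, 20 L1, 160 PUs).
#include <stdio.h>

#define NLEVELS 6

int main(void)
{
    const char *level_name[NLEVELS] = { "machine", "NUMA", "L3", "L2", "L1", "PU" };
    const int   vertices[NLEVELS]   = { 1, 2, 20, 20, 20, 160 };

    int ntasks = 4, nthreads = 2;          /* example: MPI ranks x threads   */
    int workers = ntasks * nthreads;       /* w = all processes and threads  */

    /* Pick the level closest to the root with at least w vertices, so each
       worker gets its own memory domain whenever possible. */
    int k = NLEVELS - 1;
    for (int l = 0; l < NLEVELS; l++)
        if (vertices[l] >= workers) { k = l; break; }

    printf("workers=%d -> level %d (%s) with %d vertices\n",
           workers, k, level_name[k], vertices[k]);

    /* Map workers to vertices at level k, then to the PUs under each vertex.
       NUMA boundaries are respected implicitly here because, in a balanced
       tree, vertices at level k are enumerated one NUMA domain at a time. */
    int pus_per_vertex = vertices[NLEVELS - 1] / vertices[k];
    for (int w = 0; w < workers; w++) {
        int v        = (w * vertices[k]) / workers;   /* spread evenly */
        int first_pu = v * pus_per_vertex;
        printf("worker %d -> %s vertex %d -> PUs %d-%d\n",
               w, level_name[k], v, first_pu, first_pu + pus_per_vertex - 1);
    }
    return 0;
}
```

With 8 workers, the sketch selects level l2 (the L3 caches, 20 vertices), places workers 0-3 on socket 0 and workers 4-7 on socket 1, and hands each worker the 8 hardware threads of one core, mirroring the tree example on the previous slide.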
Example mapping: one task per GPU (4 tasks on the CORAL EA node)

CPU assignments (PU IDs) per task under each affinity policy:

Task | default | latency | bandwidth | mpibind
0    | 0-79    | 0-7     | 0-7       | 0,8,16,24,32
1    | 80-159  | 8-15    | 8-15      | 40,48,56,64,72
2    | 0-79    | 16-23   | 80-87     | 80,88,96,104,112
3    | 80-159  | 24-31   | 88-95     | 120,128,136,144,152

[Figure: node diagram indicating which socket and GPU each task lands near under each policy.]
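A mapping like the one in the table can be verified at run time. The MPI program below is a minimal sketch (file name and output format are my own); each rank prints the CPU set the launcher gave it and the contents of CUDA_VISIBLE_DEVICES, which an mpibind-style launcher typically restricts per task.

```c
// show_binding.c — print each MPI rank's CPU set and visible GPUs (illustrative sketch).
// Build:  mpicc -o show_binding show_binding.c
// Run:    launch with your usual MPI/affinity settings, e.g. 4 tasks per node.
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* CPU binding as seen by the OS for this process. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    char cpus[1024] = "";
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask)) {
            char buf[16];
            snprintf(buf, sizeof(buf), "%d,", cpu);
            if (strlen(cpus) + strlen(buf) < sizeof(cpus)) strcat(cpus, buf);
        }

    /* GPU visibility as restricted by the launcher (if it sets this variable). */
    const char *gpus = getenv("CUDA_VISIBLE_DEVICES");

    printf("rank %d: cpus=%s gpus=%s\n", rank, cpus, gpus ? gpus : "(unset)");

    MPI_Finalize();
    return 0;
}
```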
Evaluation: synchronous collectives, GPU compute and bandwidth, application benchmark

— Machine: CORAL EA system
— Affinity: Spectrum MPI default, Spectrum MPI latency, Spectrum MPI bandwidth, mpibind
— Benchmarks: MPI_Barrier, MPI_Allreduce, Bytes&Flops compute, Bytes&Flops bandwidth, SW4lite
— Number of nodes: 1, 2, 4, 8, 16
— Processes (tasks) per node: 4, 8, 20
Enabled uniform access to GPU resources

Compute micro-benchmark (kokkos/benchmarks/bytes_and_flops)
§ Execute multiple instances concurrently: 4 and 8 processes per node (PPN)
§ Measure GPU FLOPS
§ Processes time-share GPUs by default
§ Performance without mpibind is severely limited because of the GPU mapping

[Figure: mean GFLOPS per process (higher is better) under bandwidth, default, latency, and mpibind affinity at target PPN 4 and 8.]
Enabled access to the memory of all GPUs without user intervention

Memory bandwidth micro-benchmark (kokkos/benchmarks/bytes_and_flops)
§ Execute multiple instances concurrently: 4 or 8 processes per node (PPN)
§ Measure GPU global memory bandwidth
§ Processes time-share GPUs by default
§ Without mpibind, some processes fail by running out of memory

[Figure: mean GB/s per process (higher is better) under bandwidth, default, latency, and mpibind affinity at target PPN 4 and 8.]
Impact on SW4lite, an earthquake ground-motion simulation

§ Simplified version of SW4
— Layer over half space (LOH.2)
— 17 million grid points (h100)
— CPU version: MPI + OpenMP; GPU version: MPI + RAJA
§ Multiple runs, calculate the mean execution time
— 6 runs for GPU
— 10 runs for CPU
§ Performance speedup
— CPU: mpibind over default: 3.7x
— GPU: mpibind over bandwidth: 2.2x
— GPU over CPU: 9.7x

Affinity    TPP  CPP  PPN  CPN
bandwidth    8    1    4    4   (under-subscribed)
default     80   10    4   40   (over-subscribed)
latency      8    1    4    4   (under-subscribed)
mpibind      5    5    4   20

TPP: threads per process, CPP: cores per process, PPN: processes per node, CPN: cores per node

[Figure: mean execution time in seconds (lower is better) for bandwidth, default, latency, and mpibind, CPU and GPU versions; bars break down the Forcing, Supergrid, Scheme, BC_phys, and BC_comm phases.]