Mapping MPI+X Applications to Multi-GPU Architectures: A Performance-Portable Approach
Edgar A. León, Computer Scientist
GPU Technology Conference, San Jose, CA, March 28, 2018
LLNL-PRES-746812. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
Application developers face greater complexity

§ Hardware architectures
— SMT
— GPUs
— FPGAs
— NVRAM
— NUMA, multi-rail

§ Applications
— Need the hardware topology to run efficiently
— Need to run on more than one architecture
— Need multiple programming abstractions (hybrid applications)

§ Programming abstractions
— MPI
— OpenMP, POSIX threads
— CUDA, OpenMP 4.5, OpenACC
— Kokkos, RAJA
How do we map hybrid applications to increasingly complex hardware?

§ Compute power is not the bottleneck
§ Data movement dominates energy consumption
§ HPC applications are dominated by the memory system
— Latency and bandwidth
— Capacity tradeoffs (multi-level memories)
§ Leverage local resources
— Avoid remote accesses

More than compute resources, it is about the memory system!
The Sierra system that will replace Sequoia features a GPU-accelerated architecture

Compute System
— 4,320 nodes
— 1.29 PB memory
— 240 compute racks
— 125 PFLOPS
— ~12 MW

Compute Rack
— Standard 19"
— Warm-water cooling

Compute Node
— 2 IBM POWER9 CPUs
— 4 NVIDIA Volta GPUs
— 256 GiB DDR4
— 16 GiB HBM2 associated with each GPU, globally addressable
— Coherent shared memory
— NVMe-compatible PCIe 1.6 TB SSD

Components
— IBM POWER9: Gen2 NVLink
— NVIDIA Volta: 7 TFlop/s, HBM2, Gen2 NVLink
— Mellanox interconnect: single-plane EDR InfiniBand, 2-to-1 tapered fat tree
— GPFS file system: 154 PB usable storage, 1.54 TB/s R/W bandwidth
[Figure: hwloc topology of the 2017 CORAL EA machine (IBM Power8+ S822LC, 256 GB total) — 2 sockets, 10 cores per socket, SMT-8, private L1/L2/L3 per core, 1 InfiniBand NIC per socket, 2 NVIDIA Pascal (Tesla P100) GPUs per socket, 2 Ethernet NICs, NVMe SSD. Figure generated with hwloc.]
Existing approaches and their limitations

§ MPI/RM approaches
— By thread
— By core
— By socket
— Latency (IBM Spectrum MPI)
— Bandwidth (IBM Spectrum MPI)

§ OpenMP approaches (see the sketch below)
— Policies: spread, close, master
— Predefined places: threads, cores, sockets

§ Limitations
— Memory system is not the primary concern
— No coherent mapping across programming abstractions
— No support for heterogeneous devices
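The OpenMP policies and places named above are normally selected through the standard OMP_PROC_BIND and OMP_PLACES environment variables. As a rough illustration (the program below is my own sketch, not taken from the talk), each OpenMP thread can report the hardware thread it was pinned to, which makes the difference between close and spread visible on an SMT machine:

```c
// check_omp_binding.c — print the CPU each OpenMP thread runs on (illustrative sketch).
// Build:  cc -fopenmp -o check_omp_binding check_omp_binding.c
// Run:    OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./check_omp_binding
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        // sched_getcpu() reports the hardware thread (PU) currently executing
        // this OpenMP thread; with OMP_PROC_BIND set, the runtime pins threads
        // so this value stays stable for the lifetime of the thread.
        printf("OpenMP thread %2d of %2d on CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```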
A portable algorithm for multi-GPU architectures: mpibind

§ 2 IBM Power8+ processors
§ Per socket
— 10 SMT-8 cores
— 1 InfiniBand NIC
— 2 Pascal GPUs
§ NVMe SSD
§ Private L1, L2, L3 per core

[Figure: node schematic — two sockets, each with local memory and two GPUs; every core has private L1/L2/L3 caches and eight hardware threads.]
mpibind's primary consideration: the memory hierarchy

[Figure: memory tree of the CORAL EA node. Level l0: machine (start); l1: 2 NUMA domains, with NIC-0/1, GPU-0..3, and NVM-0 attached; l2: L3 caches; l3: L2 caches; l4: L1 caches; l5: cores c0–c19 with hardware threads p0–p159. Example annotations: Workers = 8 and Vertices(l2) = 20, so the 8 workers split across the 2 NUMA domains, 4 workers per pair of GPUs.]
The algorithm

§ Get the hardware topology
§ Devise a memory tree
— Assign devices to memory vertices
§ Calculate the number of workers w (all processes and threads)
§ Traverse the tree to determine the level k with at least w vertices
§ Traverse the subtrees, selecting compute resources for each vertex: m': vertices(k) → PU
§ Map workers to vertices, respecting NUMA boundaries: m: workers → vertices(k) → PU

(A minimal sketch of this level selection and mapping appears below.)
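As a rough, hedged illustration of the level selection and worker mapping, here is a self-contained C sketch. It assumes a balanced memory tree described only by per-level vertex counts (modeled after the CORAL EA node) and a contiguous block of PUs under each vertex; the real mpibind walks the actual hwloc topology and attaches GPUs and NICs to memory vertices, and none of the identifiers below come from its source.

```c
// mpibind_sketch.c — toy version of mpibind's level selection and mapping.
// Assumes a balanced memory tree; vertex counts below model the CORAL EA
// node from the slides (1 machine, 2 NUMA, 20 L3, 20 L2, 20 L1, 160 PUs).
#include <stdio.h>

#define NLEVELS 6

int main(void)
{
    const char *level_name[NLEVELS] = { "machine", "NUMA", "L3", "L2", "L1", "PU" };
    const int   vertices[NLEVELS]   = { 1, 2, 20, 20, 20, 160 };

    int ntasks = 4, nthreads = 2;          /* example: MPI ranks x threads   */
    int workers = ntasks * nthreads;       /* w = all processes and threads  */

    /* Pick the level closest to the root with at least w vertices, so each
       worker gets its own memory domain whenever possible. */
    int k = NLEVELS - 1;
    for (int l = 0; l < NLEVELS; l++)
        if (vertices[l] >= workers) { k = l; break; }

    printf("workers=%d -> level %d (%s) with %d vertices\n",
           workers, k, level_name[k], vertices[k]);

    /* Map workers to vertices at level k, then to the PUs under each vertex.
       NUMA boundaries are respected implicitly here because, in a balanced
       tree, vertices at level k are enumerated one NUMA domain at a time. */
    int pus_per_vertex = vertices[NLEVELS - 1] / vertices[k];
    for (int w = 0; w < workers; w++) {
        int v        = (w * vertices[k]) / workers;   /* spread evenly */
        int first_pu = v * pus_per_vertex;
        printf("worker %d -> %s vertex %d -> PUs %d-%d\n",
               w, level_name[k], v, first_pu, first_pu + pus_per_vertex - 1);
    }
    return 0;
}
```

With 8 workers, the sketch selects level l2 (the L3 caches, 20 vertices), places workers 0-3 on socket 0 and workers 4-7 on socket 1, and hands each worker the 8 hardware threads of one core, mirroring the tree example on the previous slide.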
Example mapping: one task per GPU (4 tasks on the CORAL EA node)

CPU assignments (PU IDs) per task under each affinity policy:

Task | default | latency | bandwidth | mpibind
0    | 0-79    | 0-7     | 0-7       | 0,8,16,24,32
1    | 80-159  | 8-15    | 8-15      | 40,48,56,64,72
2    | 0-79    | 16-23   | 80-87     | 80,88,96,104,112
3    | 80-159  | 24-31   | 88-95     | 120,128,136,144,152

[Figure: node diagram indicating which socket and GPU each task lands near under each policy.]
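A mapping like the one in the table can be verified at run time. The MPI program below is a minimal sketch (file name and output format are my own); each rank prints the CPU set the launcher gave it and the contents of CUDA_VISIBLE_DEVICES, which an mpibind-style launcher typically restricts per task.

```c
// show_binding.c — print each MPI rank's CPU set and visible GPUs (illustrative sketch).
// Build:  mpicc -o show_binding show_binding.c
// Run:    launch with your usual MPI/affinity settings, e.g. 4 tasks per node.
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* CPU binding as seen by the OS for this process. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    char cpus[1024] = "";
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask)) {
            char buf[16];
            snprintf(buf, sizeof(buf), "%d,", cpu);
            if (strlen(cpus) + strlen(buf) < sizeof(cpus)) strcat(cpus, buf);
        }

    /* GPU visibility as restricted by the launcher (if it sets this variable). */
    const char *gpus = getenv("CUDA_VISIBLE_DEVICES");

    printf("rank %d: cpus=%s gpus=%s\n", rank, cpus, gpus ? gpus : "(unset)");

    MPI_Finalize();
    return 0;
}
```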
Evaluation: synchronous collectives, GPU compute and bandwidth, application benchmark

— Machine: CORAL EA system
— Affinity: Spectrum MPI default, Spectrum MPI latency, Spectrum MPI bandwidth, mpibind
— Benchmarks: MPI_Barrier, MPI_Allreduce, Bytes&Flops compute, Bytes&Flops bandwidth, SW4lite
— Number of nodes: 1, 2, 4, 8, 16
— Processes (tasks) per node: 4, 8, 20
Enabled uniform access to GPU resources

Compute micro-benchmark (kokkos/benchmarks/bytes_and_flops)
§ Execute multiple instances concurrently: 4 and 8 processes per node (PPN)
§ Measure GPU FLOPS
§ Processes time-share GPUs by default
§ Performance without mpibind is severely limited because of the GPU mapping

[Figure: mean GFLOPS per process (higher is better) under bandwidth, default, latency, and mpibind affinity at target PPN 4 and 8.]
Enabled access to the memory of all GPUs without user intervention

Memory bandwidth micro-benchmark (kokkos/benchmarks/bytes_and_flops)
§ Execute multiple instances concurrently: 4 or 8 processes per node (PPN)
§ Measure GPU global memory bandwidth
§ Processes time-share GPUs by default
§ Without mpibind, some processes fail by running out of memory

[Figure: mean GB/s per process (higher is better) under bandwidth, default, latency, and mpibind affinity at target PPN 4 and 8.]
Impact on SW4lite, an earthquake ground-motion simulation

§ Simplified version of SW4
— Layer over half space (LOH.2)
— 17 million grid points (h100)
— CPU version: MPI + OpenMP; GPU version: MPI + RAJA
§ Multiple runs, calculate the mean execution time
— 6 runs for GPU
— 10 runs for CPU
§ Performance speedup
— CPU: mpibind over default: 3.7x
— GPU: mpibind over bandwidth: 2.2x
— GPU over CPU: 9.7x

Affinity    TPP  CPP  PPN  CPN
bandwidth    8    1    4    4   (under-subscribed)
default     80   10    4   40   (over-subscribed)
latency      8    1    4    4   (under-subscribed)
mpibind      5    5    4   20

TPP: threads per process, CPP: cores per process, PPN: processes per node, CPN: cores per node

[Figure: mean execution time in seconds (lower is better) for bandwidth, default, latency, and mpibind, CPU and GPU versions; bars break down the Forcing, Supergrid, Scheme, BC_phys, and BC_comm phases.]