
UPC++: A High-Performance Communication Framework for Asynchronous Computation


  1. UPC++: A High-Performance Communication Framework for Asynchronous Computation
Amir Kamil
upcxx.lbl.gov | pagoda@lbl.gov | https://upcxx.lbl.gov/training
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA

  2. Acknowledgements
This presentation includes the efforts of the following past and present members of the Pagoda group and collaborators: Hadia Ahmed, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Erich Strohmaier, Daniel Waters, Katherine Yelick.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation’s exascale computing imperative.
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

  3. Some motivating applications
PGAS is well suited to applications that use irregular data structures:
• Sparse matrix methods
• Adaptive mesh refinement
• Graph problems, distributed hash tables
http://tinyurl.com/yxqarenl
Processes may send different amounts of information to other processes; the amount can be data dependent and dynamic. (Courtesy of Jim Demmel)

  4. The impact of fine-grained communication
The first exascale systems will appear soon. In the USA: Frontier (2021) (https://tinyurl.com/y2ptx3th)
Some apps employ fine-grained communication:
• Messages are short, so the overhead term dominates the communication time α + F(β∞⁻¹, n)
• They are latency-limited, and latency is only improving slowly
Memory per core is dropping, an effect that can force more frequent fine-grained communication.
We need to reduce communication costs:
• Asynchronous communication and execution are critical
• But we also need to keep overhead costs to a minimum
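To see why the overhead term dominates, consider a purely illustrative calculation (the numbers below are assumptions for the sake of the example, not measurements from any of these systems): with a per-message overhead of α ≈ 1 µs and an asymptotic bandwidth β∞ ≈ 10 GiB/s, an 8-byte message costs about 1 µs + 8 B / (10 GiB/s) ≈ 1.0007 µs, so more than 99.9% of the transfer time is overhead. Doubling the bandwidth barely changes this; reducing or hiding the per-message overhead is what helps.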

  5. Reducing communication overhead
What if we could let each process directly access one another’s memory via a global pointer?
• We don’t need to match sends to receives
• We don’t need to guarantee message ordering
• There are no unexpected messages
Communication is one-sided:
• All metadata is provided by the initiator, rather than split between sender and receiver
• It looks like shared memory
Observation: modern network hardware provides the capability to directly access memory on another node: Remote Direct Memory Access (RDMA)
• RDMA can be compiled to load/store if source and target share physical memory
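As a rough sketch of what this one-sided model looks like in UPC++ (illustrative code, not from the slides; it assumes at least two processes and uses made-up variable names):

    #include <upcxx/upcxx.hpp>

    int main() {
      upcxx::init();
      // Each process allocates one integer in its shared segment.
      upcxx::global_ptr<int> my_cell = upcxx::new_<int>(0);
      // Share rank 1's pointer with everyone (one simple way to exchange it).
      upcxx::global_ptr<int> target = upcxx::broadcast(my_cell, 1).wait();
      if (upcxx::rank_me() == 0) {
        // One-sided put: rank 1 posts no matching receive and is not interrupted.
        upcxx::rput(42, target).wait();
      }
      upcxx::barrier();  // ensure the put has landed before anyone relies on it
      upcxx::finalize();
    }

The initiator supplies all the metadata (value and destination pointer); the target process does nothing, which is exactly the property that lets the transfer map onto RDMA hardware.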

  6. RMA performance: GASNet-EX vs MPI-3
[Chart: 8-byte RMA operation latency (one-at-a-time), in µs; down is good. Series: GASNet-EX Put, MPI RMA Put, GASNet-EX Get, MPI RMA Get, measured on Cori-I, Cori-II, Summit, and Gomez.]
Three different MPI implementations; two distinct network hardware types.
On these four systems the performance of GASNet-EX meets or exceeds MPI RMA:
• 8-byte Put latency 6% to 55% better
• 8-byte Get latency 5% to 45% better
• Better flood bandwidth efficiency, typically saturating at ½ or ¼ the transfer size (next slide)
GASNet-EX results from v2018.9.0 and v2019.6.0. MPI results from Intel MPI Benchmarks v2018.1. For more details see Languages and Compilers for Parallel Computing (LCPC'18): https://doi.org/10.25344/S4QP4W. More recent results on Summit here replace the paper’s results from the older Summitdev.

  7. RMA performance: GASNet-EX vs MPI-3
[Charts: uni-directional flood bandwidth (many-at-a-time), in GiB/s vs. transfer size from 256 B to 4 MiB; up is good. Series: GASNet-EX Put, MPI RMA Put, GASNet-EX Get, MPI RMA Get, MPI ISend/IRecv. Four panels:]
• Cori-I: Haswell, Aries, Cray MPI
• Cori-II: Xeon Phi, Aries, Cray MPI
• Summit: IBM POWER9, Dual-Rail EDR InfiniBand, IBM Spectrum MPI
• Gomez: Haswell-EX, InfiniBand, MVAPICH2

  8. RMA microbenchmarks
Experiments on NERSC Cori, a Cray XC40 system, with two processor partitions:
• Intel Haswell (2 x 16 cores per node)
• Intel KNL (1 x 68 cores per node)
[Charts: round-trip Put latency (lower is better) and flood Put bandwidth (higher is better). Data collected on Cori Haswell (https://doi.org/10.25344/S4V88H)]

  9. The PGAS model
Partitioned Global Address Space:
• Support global visibility of storage, leveraging the network’s RDMA capability
• Distinguish private and shared memory
• Separate synchronization from data movement
Languages that support PGAS: UPC, Titanium, Chapel, X10, Co-Array Fortran (Fortran 2008)
Libraries that support PGAS: Habanero UPC++, OpenSHMEM, Co-Array C++, Global Arrays, DASH, MPI-RMA
This presentation is about UPC++, a C++ library developed at Lawrence Berkeley National Laboratory.

  10. Execution model: SPMD
Like MPI, UPC++ uses a SPMD model of execution, where a fixed number of processes run the same program.

    #include <iostream>
    #include <upcxx/upcxx.hpp>
    using namespace std;

    int main() {
      upcxx::init();
      cout << "Hello from " << upcxx::rank_me() << endl;
      upcxx::barrier();
      if (upcxx::rank_me() == 0)
        cout << "Done." << endl;
      upcxx::finalize();
    }

[Diagram: program start; every process prints its greeting; barrier; rank 0 prints "Done."; program end]
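As a usage note (a sketch; exact paths, flags, and launchers vary by installation), a program like this is typically built with the upcxx compiler wrapper and started with upcxx-run, e.g. "upcxx hello.cpp -o hello" followed by "upcxx-run -n 4 ./hello" to launch four processes.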

  11. A Partitioned Global Address Space
Global address space:
• Processes may read and write shared segments of memory
• Global address space = union of all the shared segments
Partitioned:
• Global pointers to objects in shared memory have an affinity to a particular process
• Explicitly managed by the programmer to optimize for locality
• In conventional shared memory, pointers do not encode affinity
[Diagram: processes 0–3 each own a shared segment, which together form the global address space, plus a private segment in private memory]
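A minimal sketch of the private-versus-shared distinction (illustrative code, not from the slides; the variable names are made up):

    #include <upcxx/upcxx.hpp>

    int main() {
      upcxx::init();
      // Private memory: an ordinary object, not reachable through the
      // global address space by other processes.
      double private_val = 3.14;
      (void)private_val;
      // Shared segment: allocated with upcxx::new_, reachable by any process
      // holding the global pointer, with affinity to the allocating rank.
      upcxx::global_ptr<double> shared_val = upcxx::new_<double>(3.14);
      upcxx::delete_(shared_val);  // shared-segment objects are freed explicitly
      upcxx::finalize();
    }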

  12. Global vs. raw pointers
We can create data structures with embedded global pointers.
Raw C++ pointers can be used on a process to refer to objects in the global address space that have affinity to that process.
[Diagram: objects x with values 7, 1, and 5 in the shared segments; shared global pointers p link them across processes, while each process’s private memory holds a raw pointer l and a global pointer g]

  13. What is a global pointer?
A global pointer carries both an address and the affinity for the data.
It is parameterized by the type of object it points to, as with a C++ (raw) pointer: e.g. global_ptr<double>.
The affinity identifies the process that created the object.
[Diagram: same picture as the previous slide, showing objects x and pointers p, l, and g across the shared and private segments]
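The affinity carried by a global pointer can be queried, and a pointer whose target is local can be downcast to a raw C++ pointer, which connects this slide to the previous one. A minimal sketch (illustrative, not from the slides):

    #include <cassert>
    #include <upcxx/upcxx.hpp>

    int main() {
      upcxx::init();
      upcxx::global_ptr<double> gp = upcxx::new_<double>(1.0);
      assert(gp.where() == upcxx::rank_me());  // affinity = the creating process
      if (gp.is_local()) {                     // true here, since we just created it
        double *raw = gp.local();              // raw pointer, valid only on this process
        *raw = 2.0;                            // plain load/store, no communication
      }
      upcxx::delete_(gp);
      upcxx::finalize();
    }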

  14. How does UPC++ deliver the PGAS model?
A “Compiler-Free,” library approach:
• UPC++ leverages C++ standards, needs only a standard C++ compiler
Relies on GASNet-EX for low-overhead communication:
• Efficiently utilizes the network, whatever that network may be, including any special-purpose offload support
• Active messages efficiently support Remote Procedure Calls (RPCs), which are expensive to implement in other models
• Enables portability (laptops to supercomputers)
Designed to allow interoperation with existing programming systems:
• Same process model as MPI, enabling hybrid applications
• OpenMP and CUDA can be mixed with UPC++ in the same way as MPI+X

  15. What does UPC++ offer?
Asynchronous behavior based on futures/promises:
• RMA: low-overhead, zero-copy, one-sided communication; get/put to a remote location in another address space
• RPC: Remote Procedure Call: move computation to the data
Design principles encourage performant program design:
• All communication is syntactically explicit
• All communication is asynchronous: futures and promises
• Scalable data structures that avoid unnecessary replication
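A short sketch of how these pieces combine (illustrative code with made-up names, not from the tutorial): an RPC returns a future, and a continuation is chained onto it instead of blocking.

    #include <iostream>
    #include <upcxx/upcxx.hpp>

    int main() {
      upcxx::init();
      int neighbor = (upcxx::rank_me() + 1) % upcxx::rank_n();
      // RPC: run a lambda on the neighbor; its return value arrives as a future.
      upcxx::future<int> f = upcxx::rpc(neighbor,
          []() { return (int)upcxx::rank_me() * 10; });
      // Chain a continuation rather than blocking immediately.
      upcxx::future<> done = f.then([](int v) {
        std::cout << "rank " << upcxx::rank_me() << " got " << v << "\n";
      });
      done.wait();  // drive progress until the chained work completes
      upcxx::barrier();
      upcxx::finalize();
    }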
