Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters

G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala and D. K. Panda
Presented by: Miao Luo

National Center for Supercomputing Applications
Dept. of Computer Science and Engineering, The Ohio State University
Introduction

• High-End Computing (HEC) systems (approaching petascale capability)
  – Systems with thousands to hundreds of thousands of cores
  – Meet the requirements of grand-challenge problems
• Greater emphasis on programming models
  – One-sided communication is gaining popularity
    • Minimizes the need for synchronization
    • Ability to overlap computation and communication
• Scalable application communication patterns
  – Clique-based communication
    • Nearest neighbor: ocean/climate modeling, PDE solvers
    • Cartesian grids: 3D FFT
Introduction: HPC Clusters

• HPC has been the key driving force
  – Provides immense computing power by increasing the scale of parallel machines
• Approaching petascale capabilities
  – Increased node performance
  – Faster/larger memory
  – Hundreds of thousands of cores
• Commodity clusters with modern interconnects (InfiniBand, Myrinet, 10GigE, etc.)
Introduction: Message Passing Interface (MPI)

• MPI: the dominant programming model
• Very portable
  – Available on all high-end systems
• Two-sided message passing
  – Requires a handshake between the sender and receiver
  – Matching sends and receives
• One-sided programming models are becoming popular
  – MPI also provides one-sided communication semantics
Introduction: One-Sided Communication

• P0 reads/writes directly into the address space of P1
• Only one process (P0) is involved in the communication
• MPI-2 standard (extension to MPI-1)
  – One-Sided Communication, also called Remote Memory Access (RMA)
• MPI-3 standard coming up...

[Figure: two nodes, each with memory and processes (P0, P1 and P2, P3), connected through PCI/PCI-Express-attached InfiniBand HCAs; P0 accesses P1's memory directly over the network.]
Introduction: MPI-2 One-Sided Communication

• The sender (origin) can access the receiver's (target's) remote address space (window) directly
• Decouples data transfer and synchronization operations
• Communication operations
  – MPI_Put, MPI_Get, MPI_Accumulate
  – Contiguous and non-contiguous operations
• Synchronization modes
  – Active synchronization
    • Post/Start, Wait/Complete
    • Fence (collective)
  – Passive synchronization
    • Lock/Unlock
Introduction: Fence Synchronization

[Figure: three processes (0, 1, 2). All call Fence to start epoch 0, issue Put/Get operations to peers (e.g., Put(2), Get(1), Put(0)), call Fence to end epoch 0 and start epoch 1, issue further Puts (e.g., Put(1), Put(2)), and call Fence again to end epoch 1.]
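To make the fence-delimited access epoch in the figure concrete, here is a minimal sketch using standard MPI-2 RMA calls; the window size, datatypes and ring-neighbor targets are illustrative choices, not taken from the paper's benchmarks:

/* mpicc fence_epoch.c -o fence_epoch && mpirun -np 4 ./fence_epoch */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process exposes three integers as its RMA window (target side). */
    int win_buf[3] = {0, rank, 0};
    MPI_Win win;
    MPI_Win_create(win_buf, 3 * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int right = (rank + 1) % nprocs;
    int val = rank, got = -1;

    /* Active-target synchronization with fence: the three RMA operations
       below are only guaranteed complete after the closing fence.        */
    MPI_Win_fence(0, win);                                      /* open epoch    */
    MPI_Put(&val, 1, MPI_INT, right, 0, 1, MPI_INT, win);       /* write disp 0  */
    MPI_Get(&got, 1, MPI_INT, right, 1, 1, MPI_INT, win);       /* read  disp 1  */
    MPI_Accumulate(&val, 1, MPI_INT, right, 2, 1, MPI_INT,
                   MPI_SUM, win);                               /* add to disp 2 */
    MPI_Win_fence(0, win);                                      /* close epoch   */

    printf("rank %d: got %d, window = {%d,%d,%d}\n",
           rank, got, win_buf[0], win_buf[1], win_buf[2]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}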
Introduction: Top 100 Interconnect Share

• The use of InfiniBand in top systems has grown significantly
• 58 of the top 100 systems in the Top500 (over 50%) use InfiniBand
Introduction: InfiniBand Overview

• The InfiniBand Architecture (IBA): an open standard for high-speed interconnects
• IBA supports send/recv and RDMA semantics
  – Can provide good hardware support for the RMA/one-sided communication model
• Very good performance with many features
  – Minimum latency ~1 usec, peak bandwidth ~2500 MB/s
  – RDMA Read, RDMA Write (match well with one-sided get/put semantics)
  – RDMA Write with Immediate (explored in this work)
• Several High End Computing systems use InfiniBand
  – Examples: Ranger at TACC (62,976 cores), Chinook at PNNL (18,176 cores)
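As a concrete illustration of the RDMA Write with Immediate primitive, below is a minimal libibverbs sketch of posting such a work request. The queue pair, memory region, remote address and rkey are assumed to be already set up (connection management is omitted), and the helper name is purely illustrative:

#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Post an RDMA Write with Immediate on an already-connected QP.
 * Unlike a plain RDMA Write, the immediate data generates a completion
 * on the *target* side (it consumes a posted receive), which is what a
 * fence design can exploit to detect remote completion of puts.        */
static int post_rdma_write_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *local_buf, size_t len,
                               uint64_t remote_addr, uint32_t rkey,
                               uint32_t imm)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;                          /* user cookie            */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM; /* write + immediate data */
    wr.send_flags          = IBV_SEND_SIGNALED;          /* local completion       */
    wr.imm_data            = htonl(imm);                 /* delivered to target CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);              /* 0 on success */
}

On the target side the immediate data consumes a pre-posted receive and surfaces as a completion with opcode IBV_WC_RECV_RDMA_WITH_IMM, so the target learns that the write has fully arrived without any extra message; this property is what the fence designs later in the talk exploit.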
Presentation Layout

• Introduction
• Problem Statement
• Design Alternatives
• Experimental Evaluation
• Conclusions and Future Work
Problem Statement

• How can we explore the design space for implementing fence synchronization on modern interconnects?
• Can we design a novel fence synchronization mechanism that leverages InfiniBand's RDMA Write with Immediate primitive?
  – Reduced synchronization overhead and network traffic
  – Increased scope for overlap
Presentation Layout

• Introduction
• Problem Statement
• Design Alternatives
• Experimental Evaluation
• Conclusions and Future Work
Design Space

• Deferred approach
  – All operations and synchronization are deferred to the subsequent fence
  – Uses two-sided operations
  – Certain optimizations are possible to reduce the latency of operations and the overhead of synchronization
  – Capability for overlap is lost
• Immediate approach
  – Synchronization and communication operations happen as they are issued
  – Uses RDMA for communication operations
  – Can achieve good overlap of computation and communication
  – How do we handle remote completions?
• Characterize the performance
  – Overlap capability
  – Synchronization overhead
Fence Designs

• Deferred approach (Fence-2S)
  – Two-sided based approach
  – The first fence does nothing
  – All one-sided operations are queued locally
  – The second fence goes through the queue, issues the operations, and handles completion
  – The last message in the epoch can signal completion
  – Optimization: combining a put with the ensuing synchronization reduces synchronization overhead
  – Cons: no scope for providing overlap (a schematic sketch follows)
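A schematic of the deferred idea, with hypothetical names (pending_op, op_queue, issue_over_two_sided) chosen purely for illustration; the actual MVAPICH2 implementation differs:

/* Illustrative sketch of the deferred (two-sided) fence design. */
#include <stddef.h>

typedef struct pending_op {
    int      type;        /* PUT / GET / ACCUMULATE            */
    int      target;      /* target rank                       */
    void    *origin_buf;  /* local buffer                      */
    size_t   len;
    size_t   target_disp; /* displacement in the target window */
    struct pending_op *next;
} pending_op;

static pending_op *op_queue = NULL;   /* per-window queue of deferred ops */

/* A put under the deferred design: nothing goes on the wire here. */
static void deferred_put(pending_op *op)
{
    op->next = op_queue;              /* just enqueue locally */
    op_queue = op;
}

/* A fence under the deferred design. */
static void deferred_fence(void)
{
    /* 1. Walk the queue and issue every operation over two-sided
     *    send/recv channels (issue_over_two_sided is a stand-in).
     * 2. Piggyback the synchronization on the last message of the
     *    epoch, so no separate collective is needed to close it.  */
    for (pending_op *op = op_queue; op != NULL; op = op->next)
        /* issue_over_two_sided(op); */ (void)op;
    op_queue = NULL;
}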
Fence Designs

• Immediate approach
  – Issue a completion message on all the channels
  – Is issuing a Barrier after the operations sufficient?

[Figure: four processes enter a two-step barrier. A Put issued from Process 0 to Process 3 before the barrier can still arrive after barrier step 2 completes, so the barrier alone does not guarantee remote completion.]
Fence-Imm Naive Design (Fence-1S)

[Figure: timeline across P0-P3. Epoch 0 Puts are issued immediately; at the fence, each process waits for local completion, sends finish messages to its peers, and a Reduce-Scatter detects completion of the finish messages before epoch 1 starts at fence end.]
Fence-Imm Opt Design (Fence-1S-Barrier)

[Figure: same flow as Fence-1S (Puts, local completion, finish messages, Reduce-Scatter for finish-message completion), followed by a Barrier before the fence ends and epoch 1 starts.]
Novel Fence-RI Design

[Figure: timeline across P0-P3. Epoch 0 Puts are issued as RDMA Write with Immediate; at the fence, each process waits for local completion, an All-Reduce is performed, remote RDMA-immediate completions are detected, and a Barrier precedes the start of epoch 1.]
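One plausible reading of this flow, sketched below under the assumption that the All-Reduce sums per-target put counts and that poll_for_immediates is a hypothetical helper draining the InfiniBand completion queue; neither name comes from the MVAPICH2 source:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: drain the completion queue until 'expected'
 * RDMA-write-with-immediate completions have been seen.             */
static void poll_for_immediates(int expected) { (void)expected; /* ... */ }

/* Sketch of the Fence-RI fence. imm_sent[i] = number of puts this
 * process issued to rank i in the closing epoch, each posted as an
 * RDMA Write with Immediate.                                         */
static void fence_ri(MPI_Comm comm, int nprocs, int myrank, int *imm_sent)
{
    /* 1. Wait for local completion of the issued RDMA writes (omitted). */

    /* 2. One All-Reduce over the per-target put counts: after the SUM,
     *    entry i holds the total number of puts targeting rank i.       */
    int *totals = malloc(nprocs * sizeof(int));
    MPI_Allreduce(imm_sent, totals, nprocs, MPI_INT, MPI_SUM, comm);

    /* 3. Wait until that many immediates have landed here; each one is
     *    a remote-completion notification generated by the hardware.    */
    poll_for_immediates(totals[myrank]);

    /* 4. A Barrier separates this epoch from the next one.              */
    MPI_Barrier(comm);

    free(totals);
}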
Presentation Layout

• Introduction
• Problem Statement
• Design Alternatives
• Experimental Evaluation
• Conclusions and Future Work
Experimental Evaluation

Experimental testbed:
• 64-node Intel cluster
• 2.33 GHz quad-core processors
• 4 GB main memory
• RedHat Linux AS4
• Mellanox MT25208 HCAs with PCI Express interfaces
• Silverstorm 144-port switch
• MVAPICH2 software stack

Experiments conducted:
• Overlap measurements
• Fence synchronization microbenchmarks
• Halo exchange communication pattern
MVAPICH/MVAPICH2 Software Distributions

• High-performance MPI library for InfiniBand and iWARP clusters
  – MVAPICH2 (MPI-2)
  – Used by more than 975 organizations worldwide
  – Empowering many TOP500 clusters
  – Available with the software stacks of many InfiniBand, iWARP and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
• http://mvapich.cse.ohio-state.edu/
Overlap

• Overlap metric
  – An increasing amount of computation is inserted between the put and the fence synchronization
  – Percentage overlap is measured as the amount of computation that can be inserted without increasing the overall latency
• The two-sided implementation (Fence-2S) uses the deferred approach
  – No scope for overlap
• The one-sided implementations can achieve overlap

[Figure: percentage overlap vs. message size (16 B to 256 KB) for Fence-2S, Fence-1S, Fence-1S-Barrier, and Fence-RI.]
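The overlap measurement is typically structured along the following lines; do_computation and timed_put_fence are illustrative stand-ins, and the actual benchmark used in the paper may differ in details:

#include <mpi.h>

/* Stand-in for the computation inserted between the put and the fence. */
static void do_computation(double usec)
{
    double start = MPI_Wtime();
    while ((MPI_Wtime() - start) * 1e6 < usec)
        ;                                   /* busy-wait */
}

/* Measure the latency of a put + fence epoch with compute_usec of work
 * inserted before the closing fence. 'win', 'buf', 'msg_size' and
 * 'target' are assumed to be set up beforehand (window creation omitted). */
static double timed_put_fence(MPI_Win win, char *buf, int msg_size,
                              int target, double compute_usec)
{
    double t0 = MPI_Wtime();
    MPI_Win_fence(0, win);
    MPI_Put(buf, msg_size, MPI_CHAR, target, 0, msg_size, MPI_CHAR, win);
    do_computation(compute_usec);           /* overlap candidate */
    MPI_Win_fence(0, win);
    return (MPI_Wtime() - t0) * 1e6;        /* usec */
}

/* The reported overlap percentage corresponds to the largest compute_usec
 * for which the measured time does not exceed the no-computation baseline. */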
Latency of Fence (Zero-Put)

• Performance of fence alone, without any one-sided operations
  – Measures the overhead of synchronization alone
• Fence-1S performs badly due to the all-pairs synchronization needed to indicate the start of the next epoch
• Fence-2S performs the best since it does not need an additional collective to indicate the start of an epoch

[Figure: fence latency (usec) vs. number of processes (8-64) for Fence-2S, Fence-1S, Fence-1S-Barrier, and Fence-RI.]
Latency of Fence with Put Operations

• Performance of fence with put operations
  – Measures synchronization together with communication operations
  – A single put is issued by every process between two fences
• Fence-1S performs the worst
• Fence-RI performs better than Fence-1S-Barrier

[Figure: latency (usec) vs. number of processes (8-64) for the four schemes with a single put per process.]
Latency of Fence with Multiple Put Operations

• Performance of fence with multiple put operations
  – Each process issues puts to 8 neighbors
• Fence-RI performs better than Fence-1S-Barrier
• Fence-2S still performs the best
  – However, it has poor overlap capability

[Figure: latency (usec) vs. number of processes (8-64) for the four schemes with 8 puts per process.]
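These latency microbenchmarks have roughly the following shape (an illustrative sketch; the iteration count, neighbor selection, and window setup are assumptions, not the paper's exact code):

#include <mpi.h>

#define ITERS     1000
#define NUM_PUTS  8        /* puts per process per epoch, as in the 8-put test */

/* Average time per fence epoch containing NUM_PUTS puts.
 * 'win' and 'buf' are assumed to be created over MPI_COMM_WORLD. */
static double fence_put_latency(MPI_Win win, int *buf, int rank, int nprocs)
{
    MPI_Win_fence(0, win);                     /* open the first epoch    */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        for (int k = 1; k <= NUM_PUTS; k++) {
            int target = (rank + k) % nprocs;  /* k-th neighbor           */
            MPI_Put(buf, 1, MPI_INT, target, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);                 /* close epoch / open next */
    }
    return (MPI_Wtime() - t0) * 1e6 / ITERS;   /* usec per epoch          */
}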
Halo Communication Pattern

• Mimics a halo or ghost-cell update
• The Fence-RI scheme performs the best

[Figure: latency (usec) vs. number of processes (8-64) for Fence-2S, Fence-1S, Fence-1S-Barrier, and Fence-RI on the halo exchange pattern.]
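The halo (ghost-cell) pattern being mimicked looks roughly like the following 1-D sketch with left/right neighbors; the benchmark in the paper uses its own decomposition and neighbor count, so this is only illustrative:

#include <mpi.h>

#define HALO 16   /* ghost-cell width in elements (illustrative) */

/* One halo-update step using fence-synchronized puts: each process
 * writes its boundary cells into the ghost regions of its neighbors.
 * 'win' exposes the local array 'u' of length n with HALO-wide ghost
 * regions at both ends; window setup is assumed.                     */
static void halo_update(MPI_Win win, double *u, int n, int rank, int nprocs)
{
    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;

    MPI_Win_fence(0, win);
    /* my left boundary -> right ghost region of the left neighbor  */
    MPI_Put(&u[HALO], HALO, MPI_DOUBLE, left,  n - HALO, HALO, MPI_DOUBLE, win);
    /* my right boundary -> left ghost region of the right neighbor */
    MPI_Put(&u[n - 2 * HALO], HALO, MPI_DOUBLE, right, 0, HALO, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);
    /* ghost cells are now valid; the interior update can proceed   */
}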
Presentation Layout

• Introduction
• Problem Statement
• Design Alternatives
• Experimental Evaluation
• Conclusions and Future Work