A Bandwidth-saving Optimization for MPI Broadcast Collective Operation

Huan Zhou, Vladimir Marjanovic, Christoph Niethammer, José Gracia
HLRS, University of Stuttgart, Germany

P2S2-2015, Peking, China, 01.09.2015
Outline

1 Introduction
2 Problem statement
3 Proposed design for the MPI broadcast algorithm
4 Experimental evaluation
5 Conclusions
What are MPI collective operations?

What is the Message Passing Interface (MPI)?
  - A portable parallel programming model for distributed-memory systems
  - Provides point-to-point, RMA and collective operations

What are MPI collective operations?
  - Invoked by multiple processes/threads to send or receive data simultaneously
  - Frequently used in MPI scientific applications
    ◮ Use collective communications to synchronize or exchange data
  - Types of collective operations
    ◮ All-to-All (MPI_Allgather, MPI_Allscatter, MPI_Allreduce and MPI_Alltoall)
    ◮ All-to-One (MPI_Gather and MPI_Reduce)
    ◮ One-to-All (MPI_Bcast and MPI_Scatter)
Why is MPI_Bcast important?

MPI_Bcast
  - A typical One-to-All dissemination interface
    ◮ The root process broadcasts a copy of the source data to all other processes
  - Broadly used in scientific applications
  - A profiling study of the LS-DYNA software shows its impact on application performance

Calls for optimization of MPI_Bcast!
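For reference, a minimal MPI_Bcast usage sketch in C is shown below; the payload size and the choice of rank 0 as the root are illustrative, not taken from the slides.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Illustrative payload: 1 MiB of doubles, filled by the root only. */
        const int count = 131072;
        double *buf = malloc(count * sizeof(double));
        if (rank == 0)
            for (int i = 0; i < count; i++) buf[i] = (double)i;

        /* The root (rank 0) broadcasts a copy of buf to all other processes. */
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d: buf[42] = %g\n", rank, buf[42]);

        free(buf);
        MPI_Finalize();
        return 0;
    }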
Why is MPICH important?

  - A portable, frequently used and freely available implementation of MPI
  - Implements the MPI-3 standard
  - MPICH and its derivatives play a dominant role in state-of-the-art supercomputers
MPI_Bcast in MPICH3

  - Multiple algorithms are used, selected by message size and process count (a selection sketch follows below)
  - A scatter-ring-allgather approach
    ◮ Adopted when long messages (lmsg) are transferred, or when medium messages are transferred with non-power-of-two process counts (mmsg-npof2)
    ◮ Consists of a binomial scatter followed by a ring allgather operation
  - MPI_Bcast_native is a user-level implementation of the scatter-ring-allgather algorithm
    ◮ Without multi-core awareness
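To make the dispatch concrete, the sketch below mimics an MPICH-style selection between broadcast algorithms. The threshold values, the enum and the function names are assumptions for illustration only; they are not MPICH's actual tuning parameters.

    #include <stddef.h>

    /* Hypothetical broadcast algorithm choices, modelled after the description above. */
    enum bcast_algo {
        BCAST_BINOMIAL,                 /* short messages */
        BCAST_SCATTER_RDB_ALLGATHER,    /* medium messages, power-of-two process count */
        BCAST_SCATTER_RING_ALLGATHER    /* lmsg, or mmsg with non-power-of-two count */
    };

    static int is_power_of_two(int n) { return n > 0 && (n & (n - 1)) == 0; }

    enum bcast_algo choose_bcast_algo(size_t nbytes, int nprocs)
    {
        const size_t SHORT_MSG = 12 * 1024;   /* assumed cut-off */
        const size_t LONG_MSG  = 512 * 1024;  /* assumed cut-off */

        if (nbytes < SHORT_MSG || nprocs < 8)
            return BCAST_BINOMIAL;
        if (nbytes < LONG_MSG && is_power_of_two(nprocs))
            return BCAST_SCATTER_RDB_ALLGATHER;
        return BCAST_SCATTER_RING_ALLGATHER;  /* the case this talk targets */
    }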
The binomial scatter algorithm

8 processes (2 to the third power)
  - Completed in 3 = log2(8) steps
  - The root 0 divides the source data into 8 chunks, marked 0, 1, ..., 7, sequentially

10 processes (non-power-of-two)
  - Completed in 4 = ⌈log2(10)⌉ steps
  - The root 0 divides the source data into 10 chunks, marked 0, 1, ..., 9, sequentially

Theoretically, process i is supposed to own data chunk i in the end.
Practically, non-leaf processes hold all the data chunks destined for their descendants.
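The chunk ownership left behind by the scatter can be made concrete with a small sketch. The helper below is an assumption-based reading of the binomial scatter (rank 0 as root, data split into comm_size chunks), not the MPICH source; for 8 processes it reports that P0 holds chunks 0-7, P2 holds 2-3, P4 holds 4-7 and P6 holds 6-7, which is what the next slide builds on.

    #include <stdio.h>

    /* Number of chunks present in process r's buffer once the binomial scatter
     * completes, assuming rank 0 is the root and the data is split into p chunks:
     * the root keeps everything; every other process keeps the whole segment it
     * received, even though it forwards parts of it to its children. */
    static int chunks_after_scatter(int r, int p)
    {
        if (r == 0)
            return p;              /* the root holds all chunks */
        int seg = r & -r;          /* size of the binomial subtree rooted at r */
        return seg < p - r ? seg : p - r;
    }

    int main(void)
    {
        const int p = 8;           /* try 8 or 10, as on the slides */
        for (int r = 0; r < p; r++)
            printf("P%d holds chunks %d..%d\n",
                   r, r, r + chunks_after_scatter(r, p) - 1);
        return 0;
    }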
The native ring allgather algorithm

8 processes (enclosed ring)
  - 7 steps and 56 data transmissions in total (a sketch of this phase follows below)
  - P0, P2, P4 and P6 repeatedly receive data chunks they already hold
    ◮ This introduces redundant data transmissions

This algorithm is not optimal.
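The sketch below shows the native phase as described on this slide: for comm_size - 1 steps every process forwards the most recently received chunk to its successor, regardless of what it already holds. The buffer layout (comm_size equally sized chunks) and the function name are assumptions.

    #include <mpi.h>
    #include <stddef.h>

    /* Native (enclosed-ring) allgather phase: comm_size - 1 steps and
     * comm_size * (comm_size - 1) chunk transmissions in total. */
    static void ring_allgather_native(char *base, int chunk_bytes, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int right = (rank + 1) % size;          /* successor in the ring   */
        int left  = (rank - 1 + size) % size;   /* predecessor in the ring */

        for (int i = 0; i < size - 1; i++) {
            int send_chunk = (rank - i + size) % size;      /* chunk forwarded this step */
            int recv_chunk = (rank - i - 1 + size) % size;  /* chunk arriving this step  */
            MPI_Sendrecv(base + (size_t)send_chunk * chunk_bytes, chunk_bytes, MPI_BYTE, right, 0,
                         base + (size_t)recv_chunk * chunk_bytes, chunk_bytes, MPI_BYTE, left,  0,
                         comm, MPI_STATUS_IGNORE);
        }
    }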
Motivation

  - For lmsg in particular, bandwidth usage matters
  - The native ring allgather algorithm can be optimized
    ◮ Avoid the redundant data transmissions
      ⋆ Each data transmission corresponds to a point-to-point operation
      ⋆ Saves bandwidth
      ⋆ Potentially reduces communication time
The tuned design of the native scatter-ring-allgather algorithm I

  - MPI_Bcast_opt is a user-level implementation of the tuned scatter-ring-allgather algorithm
    ◮ Without multi-core awareness
    ◮ Leaves the scatter algorithm unchanged and tunes the native allgather algorithm
  - The tuned allgather algorithm in the case of 8 processes
    ◮ Non-enclosed ring instead of the enclosed ring
    ◮ Compared with the native algorithm's 7 steps and 56 data transmissions, 12 data transmissions can be saved
The tuned design of the native scatter-ring-allgather algorithm II

  - The tuned allgather algorithm in the case of 10 processes
    ◮ Non-enclosed ring
    ◮ 9 steps and 75 data transmissions in total; 15 data transmissions are saved compared with the native algorithm
  - The two graphs show that each process sends or receives message segments adaptively, according to the chunks it already owns (a sketch reproducing both savings counts follows below)
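The saved-transmission counts quoted on these two slides follow directly from the chunk ownership the binomial scatter leaves behind: every chunk a process already holds beyond its own is one receive the tuned ring can skip. A small self-contained sketch of that count, using the same ownership assumption as in the scatter sketch above:

    /* Redundant receives avoided by the tuned allgather: one per chunk a process
     * already holds beyond its own after the binomial scatter (rank 0 as root). */
    static int saved_transmissions(int p)
    {
        int saved = p - 1;                    /* the root already holds all p chunks */
        for (int r = 1; r < p; r++) {
            int seg = r & -r;                 /* chunks held by rank r after the scatter */
            if (seg > p - r) seg = p - r;
            saved += seg - 1;
        }
        return saved;                         /* 12 for p = 8, 15 for p = 10 */
    }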
A brief pseudo-code for the tuned allgather algorithm

    In: step, flag, comm_size

    // Collect data chunks in (comm_size - 1) steps at most
    for i = 1 ... comm_size - 1 do
        // Each process uses step to judge whether it has reached the point
        // at which it becomes send-only or recv-only
        if step <= comm_size - i then
            // The process sends a data chunk to its successor and, at the same
            // time, receives one from its predecessor
            MPI_Sendrecv
        else if flag = 1 then
            // The process has reached the recv-only point
            MPI_Recv
        else
            // The process has reached the send-only point
            MPI_Send
        end if
    end for
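The pseudo-code compresses a lot; the C sketch below fills in one plausible reading of the loop. The chunk layout, and how step and flag are derived from the scatter result, are assumptions rather than the authors' exact implementation.

    #include <mpi.h>
    #include <stddef.h>

    /* Tuned (non-enclosed-ring) allgather phase. `step` controls at which iteration
     * this process stops exchanging in both directions; `flag` selects whether it
     * then only receives (flag == 1) or only sends. Both are assumed to be
     * precomputed from the chunks this process already holds after the scatter. */
    static void ring_allgather_tuned(char *base, int chunk_bytes,
                                     int step, int flag, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        for (int i = 1; i <= size - 1; i++) {
            int send_chunk = (rank - i + 1 + size) % size;
            int recv_chunk = (rank - i + size) % size;

            if (step <= size - i) {
                /* Middle of the open ring: send to the successor and receive
                 * from the predecessor in the same step. */
                MPI_Sendrecv(base + (size_t)send_chunk * chunk_bytes, chunk_bytes, MPI_BYTE, right, 0,
                             base + (size_t)recv_chunk * chunk_bytes, chunk_bytes, MPI_BYTE, left,  0,
                             comm, MPI_STATUS_IGNORE);
            } else if (flag == 1) {
                /* Recv-only point: the successor needs nothing more from us. */
                MPI_Recv(base + (size_t)recv_chunk * chunk_bytes, chunk_bytes, MPI_BYTE, left, 0,
                         comm, MPI_STATUS_IGNORE);
            } else {
                /* Send-only point: everything the predecessor would deliver is already here. */
                MPI_Send(base + (size_t)send_chunk * chunk_bytes, chunk_bytes, MPI_BYTE, right, 0, comm);
            }
        }
    }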