Advanced Message Passing
ASD Distributed Memory HPC Workshop
Computer Systems Group, Research School of Computer Science
Australian National University, Canberra, Australia
October 31, 2017
Day 2 – Schedule
Outline
1. Performance Measures and Models
2. Collective Communications in MPI
3. Collective Communication Algorithms
4. Message Passing Extensions
Overview: Performance Measures and Models
- granularity of parallel programs
- parallel speedup and overhead
- Amdahl's Law
- efficiency and cost
- example: adding n numbers
- scalability and strong/weak scaling
- measuring time
Ref: Grama et al. sect 3.1, ch 5; Lin & Snyder
Granularity
- MIMD divides the computation into multiple tasks or processes that execute in parallel
- granularity: the size of the tasks
  - coarse grain: large tasks / lots of instructions
  - fine grain: small tasks / few instructions
- granularity metric: t_compute / t_communication
  (would the startup part of the communication time be better?)
- granularity may depend on the number of processors (why?)
- case study: parallel LU factorization
  - aim: to increase granularity (why?)
Speedup
- speedup is the relative performance between single and multiprocessor systems:
  S(p) = (execution time on a single processor) / (execution time using p processors) = t_seq / t_par
  (should we use walltime or CPU time?)
- t_seq should be for the fastest known sequential algorithm; the best parallel algorithm may be different
- may also consider speedup in terms of operation count:
  S_op(p) = (operation count rate with p processors) / (operation count rate on a single processor)
- linear speedup: the maximum possible speedup is p on p processors, i.e. assuming no overhead etc.,
  S(p) = t_seq / (t_seq / p) = p
- super-linear speedup: when S(p) > p
  - may imply a sub-optimal sequential algorithm: go back and re-implement the parallel algorithm on 1 processor!
  - may arise from unique features of the architecture that favour parallel computation – suggestions?
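In practice t_par is measured as wall-clock time. A minimal sketch (not from the slides) using the standard MPI_Wtime timer, taking the slowest rank as the parallel time; do_work() is an illustrative stand-in for the computation being measured:

```c
#include <mpi.h>
#include <stdio.h>

/* illustrative stand-in for the computation being timed */
static void do_work(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 10000000L; i++)
        s += (double)i;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all ranks together */
    double t0 = MPI_Wtime();
    do_work();
    double t_local = MPI_Wtime() - t0;      /* wall-clock time on this rank */

    double t_par;                           /* t_par = time of the slowest rank */
    MPI_Reduce(&t_local, &t_par, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("t_par = %.6f s; speedup = t_seq / t_par\n", t_par);

    MPI_Finalize();
    return 0;
}
```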
Parallel Overhead
Factors that limit parallel scalability:
- periods when not all processors perform useful work, including times when just one processor is active on sequential parts of the code
- load imbalance
- extra computations not in the sequential code, e.g. re-computation of intermediates locally (may be quicker than a send from another processor)
- communication times
Jumpshot and VAMPIR are tools that give a graphical display of a parallel computation. See also details on profiling an MPI application on Raijin.
(Figure: timeline visualization showing per-process computing, startup, waiting-to-send and message-send times for processes 0–3.)
Amdahl's Law #1
Assume some part (fraction f) cannot be divided, while the rest is perfectly divided among p processors (no overhead):
  t_par = f t_seq + (1 − f) t_seq / p
(Figure: one processor executes the serial section f t_seq followed by the parallelizable sections; with p processors the parallelizable part takes (1 − f) t_seq / p.)
  S(p) = t_seq / (f t_seq + (1 − f) t_seq / p) = p / (1 + (p − 1) f)
  lim_{p→∞} S(p) = 1 / f
Amdahl's Law #2: Speedup Curves
(Figure: speedup curves S(p) for f = 0.05 and f = 0.01.)
"Better to have two strong oxen pulling your plough across the country than a thousand chickens. Chickens are OK, but we can't make them work together yet" (. . . or can we?)
Efficiency and Cost
- efficiency: how well are you using the processors
  E = t_seq / (p × t_par) = S(p) / p × 100%
- cost: the product of the parallel execution time and the total number of processors used
  cost = t_par × p = t_seq × p / S(p) = t_seq / E
- cost optimal: if the cost of solving a problem on a parallel computer has the same asymptotic growth (as a function of the input size) as the fastest known sequential algorithm on a single processor
Adding n numbers on n processors
(Figure: tree summation of 16 numbers on 16 processors; at each of the lg n steps, pairs of partial sums are combined, halving the number of active processors.)
- parallel time is O(lg n), so the speedup over sequential is O(n / lg n)
- cost is O(n lg n), so not cost optimal!
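A minimal sketch (not from the slides) of this tree summation written with point-to-point MPI calls; it assumes the number of processes is a power of two and that each rank starts with a single value x:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);   /* assumed to be a power of two */

    double x = rank + 1.0;               /* each rank holds one number */

    /* lg p steps: at stride s, ranks that are odd multiples of s send their
       partial sum to rank - s, which accumulates it and stays active */
    for (int s = 1; s < p; s *= 2) {
        if (rank % (2 * s) == s) {
            MPI_Send(&x, 1, MPI_DOUBLE, rank - s, 0, MPI_COMM_WORLD);
            break;                       /* this rank is now done */
        } else if (rank % (2 * s) == 0 && rank + s < p) {
            double y;
            MPI_Recv(&y, 1, MPI_DOUBLE, rank + s, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            x += y;
        }
    }

    if (rank == 0)
        printf("sum = %.1f\n", x);       /* expect p*(p+1)/2 */

    MPI_Finalize();
    return 0;
}
```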
Adding n numbers on p processors #1
(Figure: 16 numbers added on 4 processors; the numbers are first communicated down a tree, then the partial sums are added.)
- the algorithm takes O((n/p) lg p) to communicate the numbers, then O(n/p) to add the partial sums, giving a total execution time of O((n/p) lg p)
- cost is O(n lg p), which is not cost optimal either!
Adding n numbers on p processors #2
(Figure: each of the 4 processors first adds its own n/p numbers locally, then the p partial sums are combined in a tree.)
- the algorithm takes O(n/p + lg p)
- cost is O(n + p lg p), so if n = Ω(p lg p) (i.e. n ≥ p lg p), the cost is O(n), which is cost optimal
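A minimal sketch (not from the slides) of the cost-optimal scheme: each rank adds its n/p local numbers in O(n/p) time, then the p partial sums are combined with MPI_Reduce, which is typically implemented as a lg p-depth tree. The problem size and values here are illustrative assumptions:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const long n = 1L << 20;                 /* illustrative problem size */
    long nlocal = n / p;                     /* assume p divides n */

    /* each rank holds its own block of n/p numbers (here: all ones) */
    double *a = malloc(nlocal * sizeof *a);
    for (long i = 0; i < nlocal; i++)
        a[i] = 1.0;

    /* O(n/p) local additions ... */
    double partial = 0.0;
    for (long i = 0; i < nlocal; i++)
        partial += a[i];

    /* ... then O(lg p) steps to combine the p partial sums */
    double total;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %.1f (expected %ld)\n", total, n);

    free(a);
    MPI_Finalize();
    return 0;
}
```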
Scalability
Scalability is an imprecise measure:
- hardware scalability: does increasing the size of the basic hardware give increased performance?
  - consider ring, crossbar and hypercube topologies, and what changes as we add processors
- algorithmic scalability: can the basic algorithm accommodate more processors?
- combined: an increased problem size can be accommodated on an increased number of processors
  - consider the effect of doubling the computation size: for two N × N matrices, doubling the value of N increases the cost of addition by a factor of 4, but the cost of multiplication by a factor of 8
Gustafson's Law: Strong/Weak Scaling
- recall we assume a serial computation can be split into serial and parallel parts:
  t_seq = f t_seq + (1 − f) t_seq
  and the parallel time is given by
  t_par = f t_seq + (1 − f) t_seq / p
  and the speedup is S(p) = t_seq / t_par
- Amdahl's Law: constant problem size scaling (strong scaling)
  S(p) = p / (1 + (p − 1) f)
- Gustafson's Law: time constrained scaling (i.e. the problem size depends on the processor count, weak scaling)
  - assumes the parallel execution time t_par is fixed (for simplicity, assume t_par = 1) and the sequential time component f t_seq is a constant, yielding a speedup of
    S(p) = p + (1 − p) f t_seq
  - speedup is now a line of negative slope, rather than the rapid reduction observed previously
  - 5% serial on 20 processors implies S(p) = 19.05, but under Amdahl's Law S(p) = 10.26
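As a quick check of those numbers (a sketch, not part of the slides), both formulas can be evaluated directly for f = 0.05 and p = 20:

```c
#include <stdio.h>

int main(void)
{
    double f = 0.05;   /* serial fraction */
    int    p = 20;     /* number of processors */

    /* Amdahl (strong scaling): fixed problem size */
    double s_amdahl = p / (1.0 + (p - 1) * f);

    /* Gustafson (weak scaling): fixed (unit) parallel time,
       of which f*t_seq = 0.05 is the serial component */
    double s_gustafson = p + (1 - p) * f;

    printf("Amdahl:    S(%d) = %.2f\n", p, s_amdahl);    /* 10.26 */
    printf("Gustafson: S(%d) = %.2f\n", p, s_gustafson); /* 19.05 */
    return 0;
}
```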
Hands-on Exercise: Performance Profiling
Outline
1. Performance Measures and Models
2. Collective Communications in MPI
3. Collective Communication Algorithms
4. Message Passing Extensions
Collective Communications: Basic Ideas
- synchronization: a barrier inhibits further execution until all processes have participated
  (e.g. use a simple ping-pong between two processes)
- broadcast: send the same message to many processes; must define the source of the message
- scatter: one process sends unique data to every other process in the group
- gather: the reverse of scatter
- reduction: a gather combined with an arithmetic/logical operation; the result can go to just one process, or to all processes
(Figure: collective communication patterns; courtesy LLNL.)
All of these can be constructed from simple sends and receives, and all require the group of participating processes to be defined.
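For reference, a minimal sketch (not from the slides) showing the MPI collectives that correspond to these operations; the root rank, counts and data values are illustrative assumptions:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    const int root = 0;

    MPI_Barrier(MPI_COMM_WORLD);                      /* synchronization */

    int n = 42;                                       /* broadcast: same value to all */
    MPI_Bcast(&n, 1, MPI_INT, root, MPI_COMM_WORLD);

    /* scatter: root sends one distinct element to each process */
    int *src = NULL, mine;
    if (rank == root) {
        src = malloc(p * sizeof *src);
        for (int i = 0; i < p; i++) src[i] = i * i;
    }
    MPI_Scatter(src, 1, MPI_INT, &mine, 1, MPI_INT, root, MPI_COMM_WORLD);

    /* gather: the reverse -- root collects one element from each process */
    int *dst = (rank == root) ? malloc(p * sizeof *dst) : NULL;
    MPI_Gather(&mine, 1, MPI_INT, dst, 1, MPI_INT, root, MPI_COMM_WORLD);

    /* reduction: combine with an operation; result on the root only ... */
    int sum;
    MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
    /* ... or on every process */
    MPI_Allreduce(&mine, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == root) printf("sum of squares = %d\n", sum);

    free(src); free(dst);
    MPI_Finalize();
    return 0;
}
```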
MPI Communicators
- a communicator is a group that MPI processes can join
- MPI_COMM_WORLD is the communicator defined in MPI_Init(), and contains all processes created at that point
- communicators can be used to specify the group of processes taking part in a collective communication
- they can also prevent conflicts between messages, e.g. between those internal to a library and those used by the application program
(Figure: the same four processes, 0–3, belong to Communicator 1, used by the user code, and Communicator 2, used by the library.)
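A minimal sketch (not from the slides) of the usual idiom for avoiding such conflicts: the library duplicates the caller's communicator with MPI_Comm_dup, so its internal messages can never match receives posted by the application; library_solve() is a hypothetical library entry point:

```c
#include <mpi.h>

/* hypothetical library routine: communicates only on its private communicator */
void library_solve(MPI_Comm user_comm)
{
    MPI_Comm lib_comm;
    MPI_Comm_dup(user_comm, &lib_comm);   /* same group, separate message space */

    int rank;
    MPI_Comm_rank(lib_comm, &rank);
    int token = rank;
    /* internal traffic on lib_comm cannot be intercepted by the application,
       even if the application uses the same tags on user_comm */
    MPI_Bcast(&token, 1, MPI_INT, 0, lib_comm);

    MPI_Comm_free(&lib_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    library_solve(MPI_COMM_WORLD);        /* application passes its communicator */
    MPI_Finalize();
    return 0;
}
```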