Efficient MPI Support for Advanced Hybrid Programming Models
Torsten Hoefler, Greg Bronevetsky, Brian Barrett, Bronis R. de Supinski, and Andrew Lumsdaine
EuroMPI 2010, Stuttgart, Germany, Sep. 13th, 2010

Threaded/Hybrid MPI Programming
• Hybrid programming is gaining importance
  – Reduces the surface-to-volume ratio (less communication)
  – Will be necessary at peta- and exascale!
• MPI supports hybrid programming
  – Offers four thread levels: single, funneled, serialized, multiple
  – MPI_THREAD_MULTIPLE is becoming more common
    • E.g., codes using OpenMP tasks

MPI Messaging Details
• MPI_Probe is used to receive messages of unknown size
  – MPI_Probe(…, status)
  – size = get_count(status) * size_of(datatype)
  – buffer = malloc(size)
  – MPI_Recv(buffer, …)
• MPI_Probe peeks into the matching queue
  – Does not change it → stateful object

Multithreaded MPI Messaging
• Two threads, A and B, each perform the probe (P), malloc (M), receive (R) sequence
  – E.g., serialized: A_P → A_M → A_R → B_P → B_M → B_R
• Possible ordering
  – A_P → B_P → B_M → B_R → A_M → A_R
  – Wrong matching!
  – Thread A’s message was “stolen” by B
  – Access to the queue needs mutual exclusion
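The stolen-message ordering on this slide can be replayed deterministically without MPI. The sketch below is a toy model, not MPI code: `ProbeQueue`, `pq_probe`, `pq_recv`, and `probe_recv_overflows` are hypothetical names, and the queue stores only message sizes. It assumes both threads probe with the same (source, tag), so both match the head of the queue.

```c
#include <assert.h>
#include <stddef.h>

#define PQ_CAP 8

/* Toy stand-in for MPI's matching queue: a FIFO of message sizes in bytes. */
typedef struct {
    size_t sizes[PQ_CAP];
    int head;
    int tail;
} ProbeQueue;

static void pq_push(ProbeQueue *q, size_t size) {
    assert(q->tail < PQ_CAP);
    q->sizes[q->tail++] = size;
}

/* Like MPI_Probe: report the size of the first message but leave it queued. */
static size_t pq_probe(const ProbeQueue *q) {
    assert(q->head < q->tail);
    return q->sizes[q->head];
}

/* Like MPI_Recv: remove the first message and return its size. */
static size_t pq_recv(ProbeQueue *q) {
    assert(q->head < q->tail);
    return q->sizes[q->head++];
}

/* Replays one schedule of threads A and B over a queue that holds a 16-byte
 * message followed by a 4096-byte message. Returns 1 if A receives more
 * bytes than the buffer it sized from its own probe, 0 otherwise. */
int probe_recv_overflows(int interleaved) {
    ProbeQueue q = { {0}, 0, 0 };
    size_t a_buf, a_got;

    pq_push(&q, 16);
    pq_push(&q, 4096);

    if (interleaved) {
        a_buf = pq_probe(&q);   /* A probes: sees 16 bytes             */
        (void)pq_probe(&q);     /* B probes: sees the same message     */
        (void)pq_recv(&q);      /* B receives the 16-byte message      */
        a_got = pq_recv(&q);    /* A receives the 4096-byte message    */
    } else {
        a_buf = pq_probe(&q);   /* A probes and receives first         */
        a_got = pq_recv(&q);
        (void)pq_probe(&q);     /* then B probes and receives          */
        (void)pq_recv(&q);
    }
    return a_got > a_buf;
}
```

In this model `probe_recv_overflows(0)` returns 0 and `probe_recv_overflows(1)` returns 1: in the interleaved schedule A sized its buffer for a message that B has already consumed.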

“Obvious” Solution 1
• Separate threads with “channels”
  – Needs t*p tags or communicators
    • Not scalable
  – Threads cannot “share” messages
    • Not flexible for load balancing (master/worker)
  – Problems with libraries
    • Each needs t*p tags or communicators
• This solution is impractical!

“Obvious” Solution 2
• Lock each P, M, R sequence
  – Unnecessary synchronization
  – This sequence might be slow (malloc)
    • Only one thread can perform it at a time
  – Observation:
    • E.g., (tag, src) = (4, 5) and (5, 5) do not “conflict”

Solution 3 – 2d Locking
• Lock each (src, tag) pair
  – Requires a 2d lock matrix
    • Should be sparse!
      lock(src, tag)
      P, M, R (e.g., irecv)
      unlock(src, tag)
  – Wildcards (ANY_SRC, ANY_TAG) acquire locks for a whole row/column or the whole matrix
  – Minimizes lock overhead
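One way to read the row/column rule is as an overlap test between two (src, tag) keys: two probe sequences need the same lock exactly when both fields overlap, and a wildcard overlaps everything. The sketch below shows only that test, not a lock implementation. `MatchKey`, `keys_conflict`, and the `ANY` marker are hypothetical; real code would compare against MPI_ANY_SOURCE and MPI_ANY_TAG.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical wildcard marker for this sketch only. */
enum { ANY = -1 };

typedef struct {
    int src;
    int tag;
} MatchKey;

/* Two fields overlap if either is a wildcard or both are equal. */
static bool field_overlaps(int a, int b) {
    return a == ANY || b == ANY || a == b;
}

/* True if two probe/malloc/receive sequences would contend for the same
 * cell, row, or column of the 2d lock matrix. */
bool keys_conflict(MatchKey x, MatchKey y) {
    /* Sources and tags are non-negative or the wildcard marker. */
    assert(x.src >= ANY && x.tag >= ANY);
    assert(y.src >= ANY && y.tag >= ANY);
    return field_overlaps(x.src, y.src) && field_overlaps(x.tag, y.tag);
}
```

With this test, (src, tag) = (5, 4) and (5, 5) do not conflict and can proceed concurrently, while (0, 2) and (0, ANY) do conflict because the wildcard covers the whole row for source 0.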

Solution 3 is incorrect
• Can lead to deadlocks
  – A correct MPI code (threads A+B):
      Rank 0                      Rank 1
      A: send(..., 1, 1, comm)    A: probe/recv(0, 2, comm)
         recv(..., 1, 1, comm)    B: probe/recv(0, ANY_TAG, comm)
         send(..., 1, 2, comm)       ...
                                     send(..., 0, 1, comm)
  – On rank 1, thread A holds lock (0,2) while it waits in its probe; B’s wildcard needs the same lock, so B waits forever (deadlock)

Updated Solution 3
• Obvious fix: don’t block, poll
  – Only needed if the code uses wildcards
  – Several variants:

Solution 4 - Matching Outside MPI
• Helper thread calls MPI_Probe
  – Receives all incoming messages
  – Full matching logic on top of that
    • Replicates MPI logic (thread safe)
    • Allows blocking on MPI calls
  – High overhead though

Fixing the MPI Standard?
• Avoid state in the library
  – Return a handle, remove the message from the queue

    MPI_Message msg;
    MPI_Status status;

    /* Match a message */
    MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &msg, &status);

    /* Allocate memory to receive the message */
    int count;
    MPI_Get_count(&status, MPI_BYTE, &count);
    char* buffer = malloc(count);

    /* Receive this message. */
    MPI_Mrecv(buffer, count, MPI_BYTE, &msg, MPI_STATUS_IGNORE);
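To see why returning a handle removes the race, the same two-thread schedule (A probes, B probes, B receives, A receives) can be replayed against a toy queue whose probe dequeues the message. This is a sketch under stated assumptions, not MPI code: `HandleQueue`, `MsgHandle`, `hq_mprobe`, `hq_mrecv`, and `mprobe_schedule_is_safe` are hypothetical stand-ins for the matching queue, MPI_Message, MPI_Mprobe, and MPI_Mrecv.

```c
#include <assert.h>
#include <stddef.h>

#define HQ_CAP 8

/* Toy stand-in for MPI's matching queue: a FIFO of message sizes in bytes. */
typedef struct {
    size_t sizes[HQ_CAP];
    int head;
    int tail;
} HandleQueue;

/* Hypothetical stand-in for MPI_Message: owns exactly one matched message. */
typedef struct {
    size_t size;
    int valid;
} MsgHandle;

static void hq_push(HandleQueue *q, size_t size) {
    assert(q->tail < HQ_CAP);
    q->sizes[q->tail++] = size;
}

/* Like MPI_Mprobe: match the first message AND remove it from the queue. */
static MsgHandle hq_mprobe(HandleQueue *q) {
    assert(q->head < q->tail);
    MsgHandle h = { q->sizes[q->head++], 1 };
    return h;
}

/* Like MPI_Mrecv: consume the handle and return the bytes delivered. */
static size_t hq_mrecv(MsgHandle *h) {
    assert(h->valid);
    h->valid = 0;
    return h->size;
}

/* Replays A probes, B probes, B receives, A receives over a queue holding
 * a 16-byte and then a 4096-byte message. Returns 1 if each thread receives
 * exactly the message it sized its buffer for. */
int mprobe_schedule_is_safe(void) {
    HandleQueue q = { {0}, 0, 0 };
    hq_push(&q, 16);
    hq_push(&q, 4096);

    MsgHandle a = hq_mprobe(&q);   /* A matches and takes the 16-byte message */
    MsgHandle b = hq_mprobe(&q);   /* B can only match the next message       */
    size_t a_buf = a.size;         /* buffer sizes chosen from the handles    */
    size_t b_buf = b.size;
    size_t b_got = hq_mrecv(&b);   /* B receives first                        */
    size_t a_got = hq_mrecv(&a);   /* A still gets its own message            */

    return a_got == a_buf && b_got == b_buf;
}
```

Because the probe removes the message, no later probe can observe it, so the malloc between probe and receive needs no lock in this model.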

Implementation
• Open MPI as reference implementation
• Low-level matching (e.g., MX) will need firmware (FW) support

Test System
• Sif at Indiana University
  – Eight-core 1.86 GHz Xeon
  – Myrinet 10G (MX)
  – Open MPI rev. 22973 + mprobe patch
    • --enable-mpi-thread-multiple
• Using MPI_THREAD_MULTIPLE with TCP BTL

Benchmarks
• Receive Message Rate
  – MT receive (j processes send to j threads)
    • 2d locking (2D)
    • Outside MPI matching (OUT)
    • Mprobe reference (MPROBE)
• Threaded Roundtrip Time
  – Send n RTT messages between threads
  – Report average latency

ANY_SRC, ANY_TAG Receive
[Plot; annotation: each message copied twice]

Directed Receive
[Plot; annotations: lower than wildcard (locking overhead); higher than wildcard (less contention)]

ANY_SRC, ANY_TAG Latency
[Plot; annotations: Mprobe optimization potential; each message copied twice]

Directed Latency
[Plot; annotation: 2d lock higher than wildcard (locking overhead)]

Conclusions
• MPI_Probe is not thread-safe
  – Arguably a bug in MPI-2.2
• Obvious solutions do not help
  – Resource exhaustion
• Complex solutions are tricky
  – Too complex for the average MPI user
• Change the standard to add a stateless interface
  – Mprobe proposal under consideration for MPI-3
  – Encouraging initial performance results!
