Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC
Patrick Carribault, Marc Pérache and Hervé Jourdren
CEA, DAM, DIF, F-91297 Arpajon, France
IWOMP'2010, June 15th 2010
Introduction/Context
- HPC architecture: Petaflop/s era
  - Multicore processors as basic blocks
  - Clusters of ccNUMA nodes
- Parallel programming models
  - MPI: distributed-memory model
  - OpenMP: shared-memory model
  - Hybrid MPI/OpenMP (or mixed-mode programming)
    - Promising solution (benefit from both models for data parallelism)
    - How to hybridize an application?
- Contributions
  - Approaches for hybrid programming
  - Unified MPI/OpenMP framework (MPC) for lower hybrid overhead
Outline
- Introduction/Context
- Hybrid MPI/OpenMP Programming
  - Overview
  - Extended taxonomy
- MPC Framework
  - OpenMP runtime implementation
  - Hybrid optimization
- Experimental Results
  - OpenMP performance
  - Hybrid performance
- Conclusion & Future Work
Hybrid MPI/OpenMP Programming: Overview
- MPI (Message Passing Interface)
  - Inter-node communication
  - Implicit locality
  - But unnecessary data duplication and needless shared-memory message transfers inside a node
- OpenMP
  - Fully exploits shared-memory data parallelism
  - No standard support for inter-node execution
  - No standard control of data locality (ccNUMA nodes)
- Hybrid programming
  - Mix MPI and OpenMP inside one application
  - Benefit from both pure-MPI and pure-OpenMP modes
Hybrid MPI/OpenMP Programming: Approaches
- Traditional approaches
  - Exploit one core with one execution flow
  - E.g., MPI for inter-node communication, OpenMP otherwise
  - E.g., multicore CPU socket exploitation with OpenMP
- Oversubscribing approaches
  - Exploit one core with several execution flows
  - Load balancing on the whole node
  - Adaptive behavior between parallel regions
- Mixing depth
  - Communications outside parallel regions
    - Network bandwidth saturation
  - Communications inside parallel regions
    - Requires MPI thread safety (see the sketch below)
- Extended taxonomy from [Hager09]
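To make the two mixing depths concrete, here is a minimal plain MPI/OpenMP sketch (not MPC-specific): communication outside the parallel region needs at most MPI_THREAD_FUNNELED, whereas communication performed by every thread inside the region requires the library to grant MPI_THREAD_MULTIPLE. The buffer sizes, the ring exchange pattern and the use of the thread id as message tag are illustrative choices, not taken from the slides.

```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* Ask for full thread safety; the library may grant less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    double buf[1024];

    /* Mixing depth 1: communication outside the parallel region.
       MPI_THREAD_FUNNELED (or even SINGLE) is enough here. */
    #pragma omp parallel for
    for (int i = 0; i < 1024; i++)
        buf[i] = rank + i;
    MPI_Sendrecv_replace(buf, 1024, MPI_DOUBLE, next, 0, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Mixing depth 2: communication inside the parallel region.
       Every thread calls MPI, so MPI_THREAD_MULTIPLE is required. */
    if (provided >= MPI_THREAD_MULTIPLE) {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            double chunk[16];
            for (int i = 0; i < 16; i++)
                chunk[i] = rank * 100 + tid;
            /* The thread id is used as the tag so the concurrent
               messages of different threads match up. */
            MPI_Sendrecv_replace(chunk, 16, MPI_DOUBLE, next, tid, prev, tid,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}
```

With a typical toolchain this kind of program would be built with something like `mpicc -fopenmp` (compiler driver names vary per MPI implementation).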
Hybrid MPI/OpenMP Extended Taxonomy
[Figure: extended taxonomy diagram of hybrid MPI/OpenMP approaches]
MPC Framework
- User-level thread library [EuroPar’08]
  - Pthreads API, debugging with GDB [MTAAP’2010]
- Thread-based MPI [EuroPVM/MPI’09]
  - MPI 1.3 compliant
  - Optimized to save memory
- NUMA-aware memory allocator (for multithreaded applications)
- Contribution: hybrid representation inside MPC
  - Implementation of an OpenMP runtime (2.5 compliant)
    - Compiler part w/ patched GCC (4.3.x and 4.4.x)
  - Optimizations for hybrid applications
    - Efficient oversubscribed OpenMP (more threads than cores; see the sketch below)
    - Unified representation of MPI tasks and OpenMP threads
    - Scheduler-integrated polling methods
    - Message-buffer privatization and parallel message reception
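As a minimal illustration of the oversubscription point, the sketch below simply requests more OpenMP threads than the machine has cores; the slide's claim is that MPC's user-level threads keep such a configuration cheap, whereas kernel-thread runtimes pay for extra context switches. The factor of 4 is an arbitrary example, not a value from the slides.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int ncores = omp_get_num_procs();

    /* Deliberately request 4x more threads than cores (oversubscription).
       With a user-level thread runtime this mostly costs scheduling,
       not kernel context switches. */
    omp_set_num_threads(4 * ncores);

    #pragma omp parallel
    {
        #pragma omp single
        printf("%d threads running on %d cores\n",
               omp_get_num_threads(), ncores);
    }
    return 0;
}
```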
MPC’s Hybrid Execution Model (Fully Hybrid)
- Application with 1 MPI task per node
MPC’s Hybrid Execution Model (Fully Hybrid)
- Initialization of OpenMP regions (on the whole node)
MPC’s Hybrid Execution Model (Fully Hybrid)
- Entering OpenMP parallel region w/ 6 threads
MPC’s Hybrid Execution Model (Simple Mixed)
- 2 MPI tasks + OpenMP parallel region w/ 4 threads (on 2 cores)
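The fully hybrid and simple mixed layouts run the same kind of code; they differ only in how many MPI tasks are placed per node and how many OpenMP threads each task spawns. A minimal, non-MPC-specific probe such as the following just reports which layout the current run uses (one task per node spanning all cores is fully hybrid, several tasks per node each owning a few cores is simple mixed).

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Only the master thread reports the team size of its MPI task. */
        #pragma omp master
        printf("MPI task %d/%d runs an OpenMP region with %d threads\n",
               rank, size, omp_get_num_threads());
    }
    /* Fully hybrid: 1 task per node, threads == cores of the node.
       Simple mixed: several tasks per node, each with fewer threads. */
    MPI_Finalize();
    return 0;
}
```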
Experimental Environment
- Architecture
  - Dual-socket quad-core Nehalem-EP machine
  - 24 GB of memory, Linux 2.6.31 kernel
- Programming-model implementations
  - MPI: MPC, IntelMPI 3.2.1, MPICH2 1.1, OpenMPI 1.3.3
  - OpenMP: MPC, ICC 11.1, GCC 4.3.0 and 4.4.0, SunCC 5.1
  - Best option combination
    - OpenMP thread pinning (KMP_AFFINITY, GOMP_CPU_AFFINITY)
    - OpenMP wait policy (OMP_WAIT_POLICY, SUN_MP_THR_IDLE=spin)
    - MPI task placement (I_MPI_PIN_DOMAIN=omp)
- Benchmarks
  - EPCC suite (pure OpenMP / fully hybrid) [Bull et al. 01]
  - Microbenchmarks for mixed-mode OpenMP/MPI [Bull et al. IWOMP’09]
EPCC: OpenMP Parallel-Region Overhead
[Chart: execution time (us) vs. number of threads (1, 2, 4, 8) for MPC, ICC 11.1, GCC 4.3.0, GCC 4.4.0 and SunCC 5.1]
EPCC: OpenMP Parallel-Region Overhead (cont.)
[Chart: execution time (us, 0-5 scale) vs. number of threads (1, 2, 4, 8) for MPC, ICC 11.1, GCC 4.4.0 and SunCC 5.1]
EPCC: OpenMP Parallel-Region Overhead (cont.)
[Chart: execution time (us) vs. number of threads (8, 16, 32, 64) for MPC, ICC 11.1, GCC 4.4.0 and SunCC 5.1]
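For reference, the EPCC syncbench methodology behind these plots times a fixed amount of work executed sequentially, then the same work wrapped in repeated parallel regions, and reports the per-region difference. The sketch below is a simplified reconstruction of that idea, not the actual EPCC code; the repetition count and delay length are placeholders.

```c
#include <omp.h>
#include <stdio.h>

#define REPS 10000

/* Artificial work; the volatile sink keeps the compiler from removing it. */
static volatile double sink;
static void delay(int n)
{
    double a = 0.0;
    for (int i = 0; i < n; i++)
        a += (double)i * 0.5;
    sink = a;
}

int main(void)
{
    const int work = 500;

    /* Reference: the work executed sequentially, REPS times. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        delay(work);
    double t_ref = (omp_get_wtime() - t0) / REPS;

    /* Measured: the same work inside REPS parallel regions. */
    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        delay(work);
    }
    double t_par = (omp_get_wtime() - t0) / REPS;

    /* Approximate cost of opening/closing one parallel region. */
    printf("parallel-region overhead: %.3f us\n", (t_par - t_ref) * 1e6);
    return 0;
}
```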
Hybrid Funneled Ping-Pong (1KB)
[Chart: ratio vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1, OpenMPI/GCC 4.4.0 and OpenMPI/ICC 11.1]
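The funneled ping-pong pattern from the mixed-mode suite can be sketched as follows: all OpenMP threads fill the message buffer, but only the master thread inside the parallel region calls MPI, so MPI_THREAD_FUNNELED support suffices. This is a hedged reconstruction of the benchmark pattern, not the original code; it assumes exactly 2 MPI tasks, the 1 KB message size matches the slide, and the repetition count is a placeholder.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define COUNT 128   /* 128 doubles = 1 KB, matching the slide's message size */
#define REPS  1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[COUNT];
    int peer = 1 - rank;            /* assumes exactly 2 MPI tasks */

    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        {
            /* All threads write their share of the message... */
            #pragma omp for
            for (int i = 0; i < COUNT; i++)
                buf[i] = rank + i;

            /* ...but only the master thread communicates (funneled). */
            #pragma omp master
            {
                if (rank == 0) {
                    MPI_Send(buf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else {
                    MPI_Recv(buf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
                }
            }
            /* Wait until the exchange is done before the next iteration. */
            #pragma omp barrier
        }
    }
    if (rank == 0)
        printf("time per ping-pong: %.2f us\n",
               (MPI_Wtime() - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}
```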
Hybrid Multiple Ping-Pong (1KB)
[Chart: ratio vs. number of OpenMP threads (2, 4) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]
Hybrid Multiple Ping-Pong (1KB) (cont.)
[Chart: ratio vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]
Hybrid Multiple Ping-Pong (1MB)
[Chart: ratio vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]
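The multiple ping-pong pattern generalizes the funneled sketch: every OpenMP thread performs its own concurrent ping-pong with the partner task, tagging messages with its thread id, which requires the MPI library to grant MPI_THREAD_MULTIPLE. Again this is a hedged reconstruction assuming 2 MPI tasks with equal thread counts; COUNT and REPS are placeholders (raising COUNT to 131072 doubles gives the 1 MB case).

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define COUNT 128   /* 1 KB per thread; the 1 MB case just raises this */
#define REPS  1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* the library must grant MULTIPLE */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;                /* assumes exactly 2 MPI tasks */

    double t0 = MPI_Wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double buf[COUNT];
        for (int i = 0; i < COUNT; i++)
            buf[i] = rank * 1000 + tid;

        /* Each thread runs its own ping-pong; the thread id is used as
           the tag so concurrent messages do not get mixed up. */
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {
                MPI_Send(buf, COUNT, MPI_DOUBLE, peer, tid, MPI_COMM_WORLD);
                MPI_Recv(buf, COUNT, MPI_DOUBLE, peer, tid, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, COUNT, MPI_DOUBLE, peer, tid, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, COUNT, MPI_DOUBLE, peer, tid, MPI_COMM_WORLD);
            }
        }
    }
    if (rank == 0)
        printf("time per round: %.2f us\n", (MPI_Wtime() - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}
```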
Alternating (MPI Tasks Waiting)
[Chart: ratio vs. number of OpenMP threads (2, 4, 8, 16) for MPC, Intel MPI, MPICH2 1.1/GCC 4.4.0, MPICH2 1.1/ICC 11.1 and OpenMPI/GCC 4.4.0]
Conclusion
- Mixing MPI+OpenMP is a promising solution for next-generation computer architectures
  - How to avoid large overhead?
- Contributions
  - Taxonomy of hybrid approaches
  - MPC: a framework unifying both programming models
    - Lower hybrid overhead
    - Fully compliant MPI 1.3 and OpenMP 2.5 (with patched GCC)
  - Freely available at http://mpc.sourceforge.net (version 2.0)
  - Contact: patrick.carribault@cea.fr or marc.perache@cea.fr
- Future Work
  - Optimization of the OpenMP runtime (e.g., NUMA barrier)
  - OpenMP 3.0 (tasks)
  - Thread/data affinity (thread placement, data locality)
  - Tests on large applications