MPI Datatype Processing Using Runtime Compilation

  1. MPI Datatype Processing Using Runtime Compilation. Timo Schneider, Fredrik Kjolstad, Torsten Hoefler.

  2. What Your Vendor Sold

  3. What Your Applications Get: 10% of ping-pong performance.

  4. What Your Applications Get: 10% of ping-pong performance. Why?

  5. What Your Applications Get: 10% of ping-pong performance. Why? How to measure?

  6. What MPI Offers: manual packing vs. MPI datatypes.

     Manual packing:

       sbuf = malloc(N*sizeof(double))
       rbuf = malloc(N*sizeof(double))
       for (i=1; i<N-1; ++i)
         sbuf[i] = data[i*N+N-1]
       MPI_Isend(sbuf, …)
       MPI_Irecv(rbuf, …)
       MPI_Waitall(…)
       for (i=1; i<N-1; ++i)
         data[i*N] = rbuf[i]
       free(sbuf)
       free(rbuf)

     MPI datatypes:

       MPI_Datatype nt
       MPI_Type_vector(N-2, 1, N, MPI_DOUBLE, &nt)
       MPI_Type_commit(&nt)
       MPI_Isend(&data[N+N-1], 1, nt, …)
       MPI_Irecv(&data[N], 1, nt, …)
       MPI_Waitall(…)
       MPI_Type_free(&nt)

     • No explicit copying
     • Less code
     • Often slower than manual packing (see [1])

     [1] Schneider, Gerstenberger, Hoefler: Micro-Applications for Communication Data Access Patterns and MPI Datatypes.
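
     A complete, compilable version of the datatype variant is sketched below. The peer rank, the tag, and the request handling are illustrative assumptions, not part of the slide.

       #include <mpi.h>

       /* Sketch: exchange one column of an N x N row-major array of
        * doubles with a neighbor rank, using the vector type above. */
       void exchange_column(double *data, int N, int peer, MPI_Comm comm)
       {
           MPI_Datatype nt;
           MPI_Request reqs[2];

           /* N-2 elements, one double each, N doubles apart: a column. */
           MPI_Type_vector(N - 2, 1, N, MPI_DOUBLE, &nt);
           MPI_Type_commit(&nt);

           /* Send the last column of the inner rows, receive the first. */
           MPI_Isend(&data[N + N - 1], 1, nt, peer, 0, comm, &reqs[0]);
           MPI_Irecv(&data[N],         1, nt, peer, 0, comm, &reqs[1]);
           MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

           MPI_Type_free(&nt);
       }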

  7. Interpretation vs. Compilation. MPI DDTs are interpreted at runtime, while manual pack loops are compiled. Example datatype:

       bt = Vector(2, 1, 2, MPI_BYTE)
       nt = Vector(N, 1, 4, bt)

     Interpreter (sketch):

       if (dt.type == VECTOR) {
         for (int i=0; i<dt.count; i++) {
           tin = inbuf; tout = outbuf
           for (b=0; b<dt.blklen; b++) {
             interpret(dt.basetype, tin, tout)
           }
           tin += dt.stride * dt.base.extent
           tout += dt.blklen * dt.base.size
         }
         inbuf += dt.extent
         outbuf += dt.size
       }

     Internal representation (tree of type descriptors):

       Vector    (count: N, blklen: 1, stride: 4, size: 10, extent: 51)
         Vector    (count: 2, blklen: 1, stride: 2, size: 2, extent: 3)
           Primitive (size: 1, extent: 1)
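
     A minimal sketch (assuming N = 5, which matches the size 10 and extent 51 recorded in the tree) that builds the same nested vector with real MPI calls and queries the numbers the interpreter keeps in its descriptors:

       #include <mpi.h>
       #include <stdio.h>

       void show_tree_numbers(void)
       {
           MPI_Datatype bt, nt;
           MPI_Aint lb, extent;
           int size, N = 5;

           MPI_Type_vector(2, 1, 2, MPI_BYTE, &bt);  /* inner vector */
           MPI_Type_vector(N, 1, 4, bt, &nt);        /* outer vector */
           MPI_Type_commit(&nt);

           MPI_Type_size(nt, &size);                 /* -> 10 */
           MPI_Type_get_extent(nt, &lb, &extent);    /* -> 51 */
           printf("size=%d extent=%ld\n", size, (long)extent);

           MPI_Type_free(&nt);
           MPI_Type_free(&bt);
       }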

  8. Interpretation vs. Compilation. The same interpreter loop as on slide 7, with two observations:

     • None of these variables are known when this code is compiled.
     • Many nested loops.

  9. Interpretation vs. Compilation. The equivalent compiled pack loop for the example datatype:

       for (int i=0; i<N; ++i) {
         for (j=0; j<2; ++j) {
           outbuf[j] = inbuf[j*2]
         }
         inbuf += 3*4
         outbuf += 2
       }

  10. Interpretation vs. Compilation. Same code as slide 9; the compiler can now apply:

     • Loop unrolling

  11. Interpretation vs. Compilation.

       for (int i=0; i<N; ++i) {
         int j = 0
         outbuf[j] = inbuf[j*2]
         outbuf[j+1] = inbuf[(j+1)*2]
         inbuf += 3*4
         outbuf += 2
       }

     • Loop unrolling
     • Constant propagation

  12. Interpretation vs. Compilation.

       for (int i=0; i<N; ++i) {
         outbuf[0] = inbuf[0]
         outbuf[1] = inbuf[2]
         inbuf += 12
         outbuf += 2
       }

     • Loop unrolling
     • Constant propagation
     • Strength reduction

  13. Interpretation vs. Compilation.

       bound = outbuf + 2*N
       while (outbuf < bound) {
         outbuf[0] = inbuf[0]
         outbuf[1] = inbuf[2]
         inbuf += 12
         outbuf += 2
       }

     • Loop unrolling
     • Constant propagation
     • Strength reduction

  14. Interpretation vs. Compilation. Same code as slide 13, with one more optimization:

     • Loop unrolling
     • Constant propagation
     • Strength reduction
     • Unrolling of outer loop

  15. Interpretation vs. Compilation. The full list of optimizations applied to the pack loop:

     • Loop unrolling
     • Constant propagation
     • Strength reduction
     • Unrolling of outer loop
     • SIMDization

     (A complete C version of the resulting loop follows below.)
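
     The end product of the chain above, written out as a complete C function. The loop body and strides come from the slide code; the function name and buffer types are illustrative assumptions.

       void pack_vector_opt(const char *inbuf, char *outbuf, long N)
       {
           char *bound = outbuf + 2 * N;   /* induction variable eliminated */
           while (outbuf < bound) {
               outbuf[0] = inbuf[0];       /* inner loop fully unrolled,    */
               outbuf[1] = inbuf[2];       /* offsets constant-folded       */
               inbuf  += 12;               /* strength-reduced stride       */
               outbuf += 2;
           }
       }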

  16. Runtime-Compiled Pack Functions

     • MPI_Type_vector(cnt, blklen, …) → record the arguments in an internal representation (a tree of C++ objects).
     • MPI_Type_commit(new_ddt) → generate a pack(*in, cnt, *out) function in LLVM IR, compile it to machine code, and store the function pointer.
     • MPI_Send(buf, cnt, new_ddt, …) → call new_ddt.pack(buf, cnt, tmpbuf), then PMPI_Send(… tmpbuf, MPI_BYTE).
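
     A sketch of how such a library can slot in without modifying MPI itself, via the profiling interface (cf. slide 31). pack_fn_t and lookup_pack_fn are hypothetical names for the function pointer stored at commit time and its lookup; the stub here always falls through to the normal path.

       #include <mpi.h>
       #include <stdlib.h>

       typedef void (*pack_fn_t)(const void *in, int count, void *out);

       /* Hypothetical lookup of the JIT-compiled pack function; the real
        * library would consult a table filled in by MPI_Type_commit. */
       static pack_fn_t lookup_pack_fn(MPI_Datatype dt) { (void)dt; return NULL; }

       int MPI_Send(const void *buf, int count, MPI_Datatype dt,
                    int dest, int tag, MPI_Comm comm)
       {
           pack_fn_t pack = lookup_pack_fn(dt);
           if (pack) {
               int size;
               MPI_Type_size(dt, &size);
               void *tmp = malloc((size_t)size * count);
               pack(buf, count, tmp);                   /* JIT-compiled pack */
               int rc = PMPI_Send(tmp, size * count, MPI_BYTE,
                                  dest, tag, comm);
               free(tmp);
               return rc;
           }
           return PMPI_Send(buf, count, dt, dest, tag, comm);
       }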

  17. Copying Blocks

     • Even for non-contiguous transfers, the "leaves" of the DDT are consecutive blocks.
     • It is important that we copy those blocks as efficiently as possible.
     • If the size of the contiguous block is less than 256 B, we completely unroll the loop around it.
     • Use the fastest available instructions (SSE2 on our test system); see the sketch below.
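
     A sketch of a fully unrolled SSE2 copy for one small contiguous leaf. The 64-byte block size is an illustrative assumption; the generator emits the equivalent straight-line code in LLVM IR for whatever leaf size the datatype has.

       #include <emmintrin.h>  /* SSE2 intrinsics */

       /* Copy one 64-byte leaf: four 16-byte unaligned loads and stores,
        * no loop, no induction variables. */
       static void copy_block64_sse2(const char *in, char *out)
       {
           __m128i r0 = _mm_loadu_si128((const __m128i *)(in +  0));
           __m128i r1 = _mm_loadu_si128((const __m128i *)(in + 16));
           __m128i r2 = _mm_loadu_si128((const __m128i *)(in + 32));
           __m128i r3 = _mm_loadu_si128((const __m128i *)(in + 48));
           _mm_storeu_si128((__m128i *)(out +  0), r0);
           _mm_storeu_si128((__m128i *)(out + 16), r1);
           _mm_storeu_si128((__m128i *)(out + 32), r2);
           _mm_storeu_si128((__m128i *)(out + 48), r3);
       }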

  18. Block Copy Performance

     [Chart: block-copy performance, annotated "35%". In-cache measurement on an AMD Interlagos CPU (Blue Waters test system).]

  19. Packing Vectors

     • The vector count and the size and extent of the subtype are always known at commit time.
     • Use this to eliminate induction variables and reduce loop overhead.
     • Unroll the innermost loop 16 times (see the sketch below).
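
     A sketch of the 16x unrolling with the remainder handled separately. The element type and the stride parameter are illustrative assumptions; the generated code has them baked in as constants.

       void pack_elems_16x(const double *in, double *out,
                           long count, long stride)
       {
           long i = 0;
           for (; i + 16 <= count; i += 16)   /* unrolled hot loop */
               for (int k = 0; k < 16; ++k)   /* constant trip count:
                                                 compilers flatten it */
                   out[i + k] = in[(i + k) * stride];
           for (; i < count; ++i)             /* remainder */
               out[i] = in[i * stride];
       }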

  20. Vector Packing Performance

     HVector(2,1,6144) of Vector(8,8,32) of Contig(6) of MPI_FLOAT. This datatype is used by the quantum chromodynamics code MILC [2].

     [Chart: vector packing performance, annotated "14x faster". In-cache measurement on an AMD Interlagos CPU (Blue Waters test system).]

     [2] Bernard et al.: Studying quarks and gluons on MIMD parallel computers.
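
     One plausible construction of that type with standard MPI calls. A sketch: it reads the Vector stride as subtype extents and the HVector stride as bytes, which is how MPI defines those constructors.

       #include <mpi.h>

       MPI_Datatype make_milc_type(void)
       {
           MPI_Datatype contig, vec, hvec;

           MPI_Type_contiguous(6, MPI_FLOAT, &contig);
           MPI_Type_vector(8, 8, 32, contig, &vec);
           MPI_Type_create_hvector(2, 1, 6144, vec, &hvec);
           MPI_Type_commit(&hvec);

           /* The committed outer type keeps the inner ones alive. */
           MPI_Type_free(&vec);
           MPI_Type_free(&contig);
           return hvec;
       }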

  21. Irregular Datatypes

     Minimize loop overhead by unrolling the loop over the index list:

       for (i=0; i<idx.len; i+=3) {
         inb0 = load(idx[i+0]) + inb
         inb1 = load(idx[i+1]) + inb
         inb2 = load(idx[i+2]) + inb
         // load outb and len the same way
         copy(inb0, outb0, len0)
         copy(inb1, outb1, len1)
         copy(inb2, outb2, len2)
       }

     Or, depending on the index list length, inline the indexes into the code:

       copy(inb+off[0], outb+…, len[0])
       copy(inb+off[1], outb+…, len[1])
       copy(inb+off[2], outb+…, len[2])
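
     The unrolled variant as concrete C (a sketch; it assumes the number of blocks is a multiple of three and that the offset and length arrays were recorded at commit time):

       #include <string.h>

       void pack_indexed_3x(const char *inb, char *outb,
                            const long *off, const long *len, long n)
       {
           for (long i = 0; i < n; i += 3) {
               memcpy(outb, inb + off[i],   (size_t)len[i]);   outb += len[i];
               memcpy(outb, inb + off[i+1], (size_t)len[i+1]); outb += len[i+1];
               memcpy(outb, inb + off[i+2], (size_t)len[i+2]); outb += len[i+2];
           }
       }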

  22. Irregular Packing Performance

     Hindexed DDT with random displacements.

     [Chart: irregular packing performance, annotated "33% faster".]
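
     One way to build such a benchmark type (a sketch; the block count, block length, and displacement range are made-up parameters):

       #include <mpi.h>
       #include <stdlib.h>

       MPI_Datatype make_random_hindexed(int nblocks, int blocklen,
                                         MPI_Aint range)
       {
           int      *lens  = malloc(nblocks * sizeof *lens);
           MPI_Aint *disps = malloc(nblocks * sizeof *disps);
           for (int i = 0; i < nblocks; ++i) {
               lens[i]  = blocklen;
               disps[i] = (MPI_Aint)(rand() % range);  /* random offset */
           }
           MPI_Datatype dt;
           MPI_Type_create_hindexed(nblocks, lens, disps, MPI_BYTE, &dt);
           MPI_Type_commit(&dt);
           free(lens);
           free(disps);
           return dt;
       }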

  23. What's the Catch?

     • Emitting and compiling IR is expensive!
     • Commit should tune the DDT, but we do not know how often it will be used. How much tuning is acceptable?
     • Let's see how often the datatypes are reused in a real application!

  24. Performance Study: MILC

     [Histogram of speedups over Cray MPI.] The 0-1x bin is empty: we don't make anything slower than Cray MPI.

  25. Performance Study: MILC

     Most datatypes become seven times faster!

  26. Performance Study: MILC

     Some are even 38 times faster. Packing is faster, but commit is now slower: how often do we need to use a DDT to break even?
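
     The break-even count is simple arithmetic: the extra commit cost divided by the per-use packing savings. A sketch with purely illustrative numbers; the measured counts appear on the next slides.

       /* Uses needed before JIT compilation pays off. */
       double break_even(double commit_jit, double commit_interp,
                         double pack_interp, double pack_jit)
       {
           return (commit_jit - commit_interp) / (pack_interp - pack_jit);
       }
       /* e.g. break_even(3000.0, 10.0, 1.0, 0.9) == 29900 uses;
        * all times hypothetical, in microseconds. */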

  27. Performance Study: MILC

     Most datatypes have to be reused 180 to 5,000 times to break even.

  28. Performance Study: MILC

     But some need 30,000 uses to amortize their commit-time cost.

  29. Performance Hints for DDTs. Useful hints would answer:

     • How often will the DDT be reused?
     • How will it be used (send/recv/pack/unpack)?
     • Will the buffer argument always be the same?
     • Will the data to pack be in cache or not?

     (One way to pass such hints is sketched below.)
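
     MPI defines no performance hints for datatypes, but its real attribute-caching interface could carry them. A sketch of a hypothetical expected-reuse-count hint that a library like ours could read at commit time; the keyval and the hint semantics are assumptions.

       #include <mpi.h>
       #include <stddef.h>

       static int reuse_keyval = MPI_KEYVAL_INVALID;

       /* Attach a (hypothetical) expected-use-count hint to a datatype. */
       void hint_expected_reuse(MPI_Datatype dt, long expected_uses)
       {
           if (reuse_keyval == MPI_KEYVAL_INVALID)
               MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN,
                                      MPI_TYPE_NULL_DELETE_FN,
                                      &reuse_keyval, NULL);
           /* Commit could scale its tuning effort by this value. */
           MPI_Type_set_attr(dt, reuse_keyval, (void *)expected_uses);
       }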

  30. Can We Beat Manual Packing?

  31. Future Work

     • Currently we do not support pipelining of packing and communication.
     • Our packing library is not yet integrated with an MPI implementation; we use the MPI profiling interface (PMPI) to intercept calls.

     http://spcl.inf.ethz.ch/Research/Parallel_Programming/MPI_Datatypes/libpack

  32. Thank You! Questions?
