
The Challenges of Using an Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly 1, Manuel Saldaña 2 and Paul Chow 1
1 Department of Electrical and Computer Engineering, University of Toronto
2 Arches Computing Systems, Toronto, Canada


  1. Class II: Processor-based Optimizations
     Direct Memory Access MPI Engine: MPI_Send(...)
     1. Processor writes 4 words: destination rank, address of data buffer, message size, message tag
     2. PLB_MPE decodes the message header
     3. PLB_MPE transfers the data from memory


  4. Class II: Processor-based Optimizations
     Direct Memory Access MPI Engine: MPI_Recv(...)
     1. Processor writes 4 words: source rank, address of data buffer, message size, message tag
     2. PLB_MPE decodes the message header
     3. PLB_MPE transfers the data to memory
     4. PLB_MPE notifies the processor


  18. Class II: Processor-based Optimizations
     Direct Memory Access MPI Engine
     • The DMA engine is completely transparent to the user
       – Exact same MPI functions are called
       – DMA setup is handled by the implementation
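  To make the transparency concrete, here is a minimal sketch of what the MPI_Send implementation might do internally when the DMA engine is present. The register address, the plb_mpe_write() helper and the simplified argument list are illustrative assumptions, not the actual PLB_MPE interface.

     /* Hypothetical sketch (32-bit target assumed): the processor writes only
      * the 4-word header; the engine then moves the payload by itself. */
     #define MPE_CMD_FIFO ((volatile unsigned int *)0x80000000)  /* assumed command FIFO address */

     void plb_mpe_write(unsigned int word)
     {
         *MPE_CMD_FIFO = word;                 /* push one command word to the engine */
     }

     void mpe_dma_send(void *buf, int size, int dest, int tag)
     {
         plb_mpe_write((unsigned int)dest);    /* destination rank       */
         plb_mpe_write((unsigned int)buf);     /* address of data buffer */
         plb_mpe_write((unsigned int)size);    /* message size           */
         plb_mpe_write((unsigned int)tag);     /* message tag            */
         /* PLB_MPE decodes the header and transfers the data from memory. */
     }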

  19. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Two types of MPI message functions
       – Blocking functions: return only when the buffer can be safely reused
       – Non-blocking functions: return immediately
     • A request handle is required so the message status can be checked later
     • Non-blocking functions are used to overlap communication and computation
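  For reference, a minimal sketch contrasting the two flavours with the full standard MPI argument lists (the slides elide them as '...'; the embedded subset may expose reduced signatures):

     #include <mpi.h>

     void send_example(int *buf, int n, int dest)
     {
         /* Blocking: returns only once buf can be safely reused. */
         MPI_Send(buf, n, MPI_INT, dest, 0, MPI_COMM_WORLD);

         /* Non-blocking: returns immediately; the request handle is used
          * to wait for (or test) completion later. */
         MPI_Request request;
         MPI_Status  status;
         MPI_Isend(buf, n, MPI_INT, dest, 0, MPI_COMM_WORLD, &request);
         /* ... overlap communication with computation here ... */
         MPI_Wait(&request, &status);          /* buf may be reused after this */
     }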

  20. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Typical HPC non-blocking use case:
          MPI_Request request;
          ...
          MPI_Isend(..., &request);
          prepare_computation();
          MPI_Wait(&request, ...);
          finish_computation();

  21. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Class II interactions have a different use case
       – Hardware engines are responsible for computation
       – Embedded processors only need to send messages as fast as possible
     • DMA hardware allows messages to be queued
     • 'Fire-and-forget' message model
       – Message status is not important
       – Request handles are serviced by expensive interrupts

  22. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • The standard MPI protocol provides a mechanism for 'fire-and-forget':
          MPI_Request request_dummy;
          ...
          MPI_Isend(..., &request_dummy);
          MPI_Request_free(&request_dummy);

  23. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • The standard implementation still incurs overhead:
       – Setting up the interrupt
       – Removing the interrupt
       – Extra function call overhead
       – Memory space for the MPI_Request data structure
     • For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck

  24. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Proposed modification to the MPI protocol:
          #define MPI_REQUEST_NULL NULL
          ...
          MPI_Isend(..., MPI_REQUEST_NULL);
     • Non-blocking functions check that the request pointer is valid before setting up interrupts
     • Circumvents the overhead
     • Not standard, but a minor modification that works well for embedded processors with DMA
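  A minimal sketch of how the non-blocking send could honour this inside the library; the simplified signature and the queue_dma_transfer() and setup_completion_interrupt() helpers are hypothetical, and only the NULL check is the point.

     /* Hypothetical helpers, assumed to exist in the implementation. */
     void queue_dma_transfer(void *buf, int size, int dest, int tag);
     void setup_completion_interrupt(MPI_Request *request);

     /* Sketch, not the actual implementation (MPI_Request and
      * MPI_REQUEST_NULL as defined by the embedded MPI header). */
     int MPI_Isend_sketch(void *buf, int size, int dest, int tag, MPI_Request *request)
     {
         queue_dma_transfer(buf, size, dest, tag);   /* hand the message off to the DMA engine */

         if (request != MPI_REQUEST_NULL) {
             /* Pay for the interrupt and the request bookkeeping only
              * when the caller actually wants a handle back. */
             setup_completion_interrupt(request);
         }
         return 0;
     }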

  25. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message without DMA (figure: execution timeline; the processor enters MPI_Send(), transfers all the data words itself, then returns, so the time spent inside the function grows with the message length; legend: non-MPI code, function preamble/postamble, MPI function code)

  33. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message with DMA (figure: MPI_Send() now transfers only four words, regardless of message length, before returning)

  34. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message with DMA (figure: execution-time breakdown): 28.7% + 15.6% = 44.3% of the time is still spent on the message (function preamble/postamble plus MPI function code), versus 55.6% in non-MPI code

  36. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message with DMA: message queueing (figure: msg 1, msg 2 and msg 3 queued)

  37. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • Inline all MPI functions? (figure: msg 1, msg 2, msg 3 with the MPI code inlined at each call site)
       – Increases program length!

  41. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • Standard MPI functions:
          void *msg_buf;
          int msg_size;
          ...
          MPI_Isend(msg_buf, msg_size, ...);
          MPI_Irecv(msg_buf, msg_size, ...);

  42. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
          void MPI_Coalesce(
              // MPI_Coalesce-specific arguments
              MPI_Function *mpi_fn, int mpi_fn_count,
              // Array of point-to-point MPI function arguments
              void **msg_buf, int *msg_size, ...)
          {
              for (int i = 0; i < mpi_fn_count; i++) {
                  if (mpi_fn[i] == MPI_Isend)
                      inline MPI_Isend(msg_buf[i], msg_size[i], ...);
                  else if (mpi_fn[i] == MPI_Irecv)
                      inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
              }
          }


  48. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI_Coalesce (figure: msg 1, msg 2 and msg 3 issued from a single for loop; legend: non-MPI code, function preamble/postamble, MPI function code, for loop)


  52. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI_Coalesce is not part of the MPI Standard
     • Behaviour can be easily reproduced
       – Even when source code is not available
     • Maintains compatibility with MPI code
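  A hedged usage sketch of how a call site might batch three transfers through MPI_Coalesce; out_a, out_b, in_c and N are placeholder names, and the remaining arguments are abbreviated as '...' exactly as on the slides.

     /* Illustrative only: three point-to-point transfers issued in one call. */
     MPI_Function fn[3]   = { MPI_Isend, MPI_Isend, MPI_Irecv };
     void        *buf[3]  = { out_a,     out_b,     in_c      };
     int          size[3] = { N,         N,         N         };

     MPI_Coalesce(fn, 3, buf, size, ...);

     /* Equivalent, but pays the function preamble/postamble three times:
      *   MPI_Isend(out_a, N, ...);
      *   MPI_Isend(out_b, N, ...);
      *   MPI_Irecv(in_c,  N, ...);
      */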

  53. Class II: Processor-based Optimizations
     Results
     • Application: Restricted Boltzmann Machines [1]
       – Neural network FPGA implementation
       – Platform: Berkeley Emulation Engine 2 (BEE2)
     • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
     • Inter-FPGA communication:
       – Latency: 6 cycles
       – Bandwidth: 1.73 GB/s
     [1] D. Ly et al., "A Multi-FPGA Architecture for Restricted Boltzmann Machines," FPL, Sept. 2009.


  55. Class II: Processor-based Optimizations
     Results
     Message #   Source   Destination   Size [# of words]
         1         R0         R1                0
         2         R0         R1                3
         3         R0         R6                0
         4         R0         R6                3
         5         R0         R11               0
         6         R0         R11               3
         7         R0         R16               0
         8         R0         R16               3
         9         R0         R1                4
        10         R0         R6                4
        11         R0         R11               4
        12         R0         R16               4

  56. Class II: Processor-based Optimizations
     Results (figure: execution-time comparison; measured speedups of 2.33x, 3.94x and 5.32x)

  63. Class III: Hardware-based Optimizations
     • Background
     • Dataflow Message Passing Model
       – Case Study: Vector Addition

  64. Class III: Hardware-based Optimizations
     Background
     • Processor-based, software model
       – Function calls are atomic
       – Program flow is quantized in units of message functions
       – Cannot execute communication and computation simultaneously
     • Hardware engines
       – Significantly more parallelism
       – Communication and computation can be simultaneous

  65. Class III: Hardware-based Optimizations
     Dataflow Message Passing Model
     • Standard message processing model:
          MPI_Recv(...);
          compute();
          MPI_Send(...);
     • Hardware uses a dataflow model (figure: logic block)

  66. Class III: Hardware-based Optimizations
     Case Study: Vector Addition
     • Vector addition: vc = va + vb, element-wise vc[i] = va[i] + vb[i]
     • va comes from Rank 1, vb comes from Rank 2
     • Compute vc, send the result back to Ranks 1 and 2

  67. Class III: Hardware-based Optimizations
     Case Study: Vector Addition
     • Software model:
          int va[N], vb[N], vc[N];
          MPI_Recv(va, N, MPI_INT, rank1, ...);
          MPI_Recv(vb, N, MPI_INT, rank2, ...);
          for (int i = 0; i < N; i++)
              vc[i] = va[i] + vb[i];
          MPI_Send(vc, N, MPI_INT, rank1, ...);
          MPI_Send(vc, N, MPI_INT, rank2, ...);

  68. Class III: Hardware-based Optimizations
     Case Study: Vector Addition (figures)
