Class II: Processor-based Optimizations
Direct Memory Access MPI Engine

MPI_Send(...)
1. Processor writes 4 words:
   • destination rank
   • address of data buffer
   • message size
   • message tag
2. PLB_MPE decodes message header
3. PLB_MPE transfers data from memory
MPI_Recv(...)
1. Processor writes 4 words:
   • source rank
   • address of data buffer
   • message size
   • message tag
2. PLB_MPE decodes message header
3. PLB_MPE transfers data to memory
4. PLB_MPE notifies processor
• DMA engine is completely transparent to the user
  – Exact same MPI functions are called
  – DMA setup is handled by the implementation
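To make the send path concrete, the sketch below shows what the implementation-side DMA setup could look like: the processor writes the four header words to memory-mapped engine registers and returns, and the engine streams the payload on its own. The base address, register layout, and function name are assumptions for illustration only, not the actual PLB_MPE interface.

    #include <stdint.h>

    /* Hypothetical memory-mapped register block for the DMA MPI engine.
     * The base address and layout are illustrative, not the real hardware map. */
    #define PLB_MPE_BASE 0x80000000u

    typedef struct {
        volatile uint32_t rank;     /* destination (send) or source (recv) rank */
        volatile uint32_t address;  /* physical address of the data buffer      */
        volatile uint32_t size;     /* message size in words                    */
        volatile uint32_t tag;      /* message tag                              */
    } plb_mpe_regs_t;

    /* Sketch of the send path: the processor only writes the 4-word header;
     * the engine decodes it and moves the payload out of memory by itself. */
    static void mpe_send_setup(int dest_rank, const void *buf, int size_words, int tag)
    {
        plb_mpe_regs_t *mpe = (plb_mpe_regs_t *)PLB_MPE_BASE;

        mpe->rank    = (uint32_t)dest_rank;
        mpe->address = (uint32_t)(uintptr_t)buf;
        mpe->size    = (uint32_t)size_words;
        mpe->tag     = (uint32_t)tag;
        /* From here on, the PLB_MPE transfers the buffer without processor help. */
    }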
Non-Interrupting, Non-Blocking Functions

• Two types of MPI message functions
  – Blocking functions: return only when the buffer can be safely reused
  – Non-blocking functions: return immediately
    • A request handle is required so the message status can be checked later
• Non-blocking functions are used to overlap communication and computation
• Typical HPC non-blocking use case:

    MPI_Request request;
    ...
    MPI_Isend(..., &request);
    prepare_computation();
    MPI_Wait(&request, ...);
    finish_computation();
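For reference, a minimal self-contained version of this pattern with the elided arguments filled in; the buffer size, peer rank, tag, and the prepare_computation/finish_computation helpers are illustrative stand-ins, not taken from the original code.

    #include <mpi.h>

    static void prepare_computation(void) { /* application work that overlaps the send */ }
    static void finish_computation(void)  { /* work done once the buffer is reusable   */ }

    void overlap_example(int peer, int tag, MPI_Comm comm)
    {
        int buf[256] = {0};   /* illustrative payload */
        MPI_Request request;

        /* Start the send, then do useful work while the message is in flight. */
        MPI_Isend(buf, 256, MPI_INT, peer, tag, comm, &request);
        prepare_computation();

        /* Block only at the point where the buffer must be safe to reuse. */
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        finish_computation();
    }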
• Class II interactions have a different use case
  – Hardware engines are responsible for computation
  – Embedded processors only need to send messages as fast as possible
• DMA hardware allows messages to be queued
• 'Fire-and-forget' message model
  – Message status is not important
  – Request handles are serviced by expensive interrupts
• Standard MPI protocol provides a mechanism for 'fire-and-forget':

    MPI_Request request_dummy;
    ...
    MPI_Isend(..., &request_dummy);
    MPI_Request_free(&request_dummy);
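The same idiom written out in full using only standard MPI calls; the peer rank, tag, and datatype are illustrative. One caveat worth noting: after MPI_Request_free the sender can no longer observe completion, so the buffer must stay valid until the send has actually finished.

    #include <mpi.h>

    /* Fire-and-forget with standard MPI only: start the send, then release the
     * request handle immediately so its status is never checked. */
    void fire_and_forget(const int *buf, int count, int peer, int tag, MPI_Comm comm)
    {
        MPI_Request request_dummy;

        MPI_Isend(buf, count, MPI_INT, peer, tag, comm, &request_dummy);
        MPI_Request_free(&request_dummy);   /* the send still proceeds in the background */
    }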
• The standard implementation still incurs overhead:
  – Setting up the interrupt
  – Removing the interrupt
  – Extra function call overhead
  – Memory space for the MPI_Request data structure
• For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck
• Proposed modification to the MPI protocol:

    #define MPI_REQUEST_NULL NULL
    ...
    MPI_Isend(..., MPI_REQUEST_NULL);

• Non-blocking functions check that the request pointer is valid before setting up interrupts
• Circumvents the overhead
• Not standard, but a minor modification that works well for embedded processors with DMA
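A minimal sketch of how the modified non-blocking send could branch on the request pointer; all names here (isend_sketch, dma_queue_send, arm_completion_interrupt) are hypothetical stand-ins, not the actual implementation. The design point is simply that a NULL handle means the interrupt machinery is never touched.

    #include <stddef.h>
    #include <stdio.h>

    typedef struct { int pending; } my_request_t;   /* illustrative request handle */

    /* Hypothetical stand-ins for the real DMA-queue and interrupt plumbing. */
    static void dma_queue_send(int dest, const void *buf, int count, int tag)
    {
        (void)buf;
        printf("queued %d words for rank %d (tag %d)\n", count, dest, tag);
    }

    static void arm_completion_interrupt(my_request_t *req)
    {
        req->pending = 1;   /* the real code would enable a completion interrupt here */
    }

    /* Sketch of the modified non-blocking send: a NULL request pointer
     * (MPI_REQUEST_NULL in the proposed modification) means fire-and-forget,
     * so the interrupt setup/teardown cost is skipped entirely. */
    static int isend_sketch(const void *buf, int count, int dest, int tag, my_request_t *request)
    {
        dma_queue_send(dest, buf, count, tag);

        if (request != NULL)
            arm_completion_interrupt(request);   /* only pay for tracking when asked */

        return 0;
    }

    int main(void)
    {
        int data[3] = {1, 2, 3};
        isend_sketch(data, 3, 1, 0, NULL);   /* fire-and-forget: no handle, no interrupt */
        return 0;
    }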
Series of messages – MPI_Coalesce()

• MPI message without DMA (figure): MPI_Send() is called, lots of data words are transferred inside the call, then the function returns
  (Legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code)
• MPI message with DMA (figure): MPI_Send() transfers only four words, regardless of message length, then returns
• Time breakdown with DMA: 28.7% + 15.6% = 44.3% of the time is still spent in function preamble/postamble and MPI function code, versus 55.6% in non-MPI code
• MPI message with DMA – message queueing (figure): msg 1, msg 2, and msg 3 are queued back-to-back
• Inline all MPI functions? (figure: msg 1, msg 2, msg 3)
  – Increases program length!
• Standard MPI functions:

    void *msg_buf;
    int msg_size;
    ...
    MPI_Isend(msg_buf, msg_size, ...);
    MPI_Irecv(msg_buf, msg_size, ...);
• Proposed MPI_Coalesce():

    void MPI_Coalesce(
        // MPI_Coalesce-specific arguments
        MPI_Function *mpi_fn, int mpi_fn_count,
        // Arrays of point-to-point MPI function arguments
        void **msg_buf, int *msg_size, ...)
    {
        for (int i = 0; i < mpi_fn_count; i++) {
            // 'inline' is pseudocode: the implementation expands the call body
            // in place, so no nested function call is made per message
            if (mpi_fn[i] == MPI_Isend)
                inline MPI_Isend(msg_buf[i], msg_size[i], ...);
            else if (mpi_fn[i] == MPI_Irecv)
                inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
        }
    }
• MPI_Coalesce (figure): msg 1, msg 2, and msg 3 are all issued from inside a single call
  (Legend: Non-MPI Code, Function Preamble/Postamble, MPI Function Code, For loop)
• MPI_Coalesce is not part of the MPI Standard
• Behaviour can be easily reproduced
  – Even when source code is not available
• Maintains compatibility with MPI code
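One reading of "easily reproduced": on a platform without the optimized implementation, the same batched interface can be provided by a plain-C shim that loops over ordinary non-blocking calls. The sketch below assumes an operation enum plus per-message peer and tag arrays (none of which appear in the slides); it gives up the inlining benefit but keeps application code portable.

    #include <mpi.h>

    /* Hypothetical operation selector mirroring the mpi_fn argument above. */
    typedef enum { COALESCE_ISEND, COALESCE_IRECV } coalesce_op_t;

    /* Portable fallback: same batched interface, implemented with standard calls.
     * peer[i] and tag[i] identify the other rank and the tag for message i. */
    static void coalesce_fallback(const coalesce_op_t *op, int count,
                                  void **buf, const int *size,
                                  const int *peer, const int *tag,
                                  MPI_Comm comm, MPI_Request *req)
    {
        for (int i = 0; i < count; i++) {
            if (op[i] == COALESCE_ISEND)
                MPI_Isend(buf[i], size[i], MPI_INT, peer[i], tag[i], comm, &req[i]);
            else
                MPI_Irecv(buf[i], size[i], MPI_INT, peer[i], tag[i], comm, &req[i]);
        }
        /* Callers that do not need the handles can follow up with
         * MPI_Waitall(count, req, MPI_STATUSES_IGNORE); */
    }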
Results

• Application: Restricted Boltzmann Machines [1]
  – Neural network FPGA implementation
  – Platform: Berkeley Emulation Engine 2 (BEE2)
    • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
    • Inter-FPGA communication:
      – Latency: 6 cycles
      – Bandwidth: 1.73 GB/s

[1] D. Ly et al., "A Multi-FPGA Architecture for Restricted Boltzmann Machines," FPL, Sept. 2009.
    Message #   Source   Destination   Size [# of words]
    1           R0       R1            0
    2           R0       R1            3
    3           R0       R6            0
    4           R0       R6            3
    5           R0       R11           0
    6           R0       R11           3
    7           R0       R16           0
    8           R0       R16           3
    9           R0       R1            4
    10          R0       R6            4
    11          R0       R11           4
    12          R0       R16           4
• Results (chart): speedups of 2.33x, 3.94x, and 5.32x
Class III: Hardware-based Optimizations

• Background
• Dataflow Message Passing Model
  – Case Study: Vector Addition
Background

• Processor-based, software model
  – Function calls are atomic
  – Program flow is quantized into message-function units
  – Cannot execute communication and computation simultaneously
• Hardware engines
  – Significantly more parallelism
  – Communication and computation can be simultaneous
Dataflow Message Passing Model

• Standard message-processing model:

    MPI_Recv(...);
    compute();
    MPI_Send(...);

• Hardware uses a dataflow model (figure showing a 'Logic' block)
Case Study: Vector Addition

• Vector addition: v_c = v_a + v_b, i.e. v_c,i = v_a,i + v_b,i for each element i
• v_a comes from Rank 1, v_b comes from Rank 2
• Compute v_c, send result back to Rank 1 and 2
• Software model:

    int va[N], vb[N], vc[N];

    MPI_Recv(va, N, MPI_INT, rank1, ...);
    MPI_Recv(vb, N, MPI_INT, rank2, ...);

    for(int i = 0; i < N; i++)
        vc[i] = va[i] + vb[i];

    MPI_Send(vc, N, MPI_INT, rank1, ...);
    MPI_Send(vc, N, MPI_INT, rank2, ...);
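For completeness, a runnable version of this rank with the elided arguments filled in; the vector length, tag 0, and MPI_COMM_WORLD are illustrative choices, not values from the slides. Note how the two receives, the loop, and the two sends execute strictly one after another, which is exactly the serialization the dataflow hardware model removes.

    #include <mpi.h>

    #define N 1024   /* illustrative vector length */

    /* Software-model rank: rank1 and rank2 are the ranks that own va and vb. */
    void vector_add_rank(int rank1, int rank2)
    {
        int va[N], vb[N], vc[N];

        MPI_Recv(va, N, MPI_INT, rank1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(vb, N, MPI_INT, rank2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 0; i < N; i++)
            vc[i] = va[i] + vb[i];

        MPI_Send(vc, N, MPI_INT, rank1, 0, MPI_COMM_WORLD);
        MPI_Send(vc, N, MPI_INT, rank2, 0, MPI_COMM_WORLD);
    }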
• (Figure sequence for the vector addition case study)