
The Challenges of Using an Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly 1, Manuel Saldaña 2 and Paul Chow 1
1 Department of Electrical and Computer Engineering, University of Toronto
2 Arches Computing Systems, Toronto, Canada


  1. Class II: Processor-based Optimizations
     Direct Memory Access MPI Engine: MPI_Send(...)
     1. Processor writes 4 words: destination rank, address of data buffer, message size, message tag
     2. PLB_MPE decodes the message header
     3. PLB_MPE transfers the data from memory


  4. Class II: Processor-based Optimizations
     Direct Memory Access MPI Engine: MPI_Recv(...)
     1. Processor writes 4 words: source rank, address of data buffer, message size, message tag
     2. PLB_MPE decodes the message header
     3. PLB_MPE transfers the data to memory
     4. PLB_MPE notifies the processor


  18. Class II: Processor-based Optimizations
     Direct Memory Access MPI Engine
     • The DMA engine is completely transparent to the user
       – Exact same MPI functions are called
       – DMA setup is handled by the implementation
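  To make the transparency concrete, here is a minimal sketch of what the MPI_Send implementation might do internally when the DMA engine is present. The register address, the plb_mpe_write() helper and the simplified argument list are illustrative assumptions, not the actual PLB_MPE interface.

     /* Hypothetical sketch (32-bit target assumed): the processor writes only
      * the 4-word header; the engine then moves the payload by itself. */
     #define MPE_CMD_FIFO ((volatile unsigned int *)0x80000000)  /* assumed command FIFO address */

     void plb_mpe_write(unsigned int word)
     {
         *MPE_CMD_FIFO = word;                 /* push one command word to the engine */
     }

     void mpe_dma_send(void *buf, int size, int dest, int tag)
     {
         plb_mpe_write((unsigned int)dest);    /* destination rank       */
         plb_mpe_write((unsigned int)buf);     /* address of data buffer */
         plb_mpe_write((unsigned int)size);    /* message size           */
         plb_mpe_write((unsigned int)tag);     /* message tag            */
         /* PLB_MPE decodes the header and transfers the data from memory. */
     }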

  19. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Two types of MPI message functions
       – Blocking functions: return only when the buffer can be safely reused
       – Non-blocking functions: return immediately
     • A request handle is required so the message status can be checked later
     • Non-blocking functions are used to overlap communication and computation
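  For reference, a minimal sketch contrasting the two flavours with the full standard MPI argument lists (the slides elide them as '...'; the embedded subset may expose reduced signatures):

     #include <mpi.h>

     void send_example(int *buf, int n, int dest)
     {
         /* Blocking: returns only once buf can be safely reused. */
         MPI_Send(buf, n, MPI_INT, dest, 0, MPI_COMM_WORLD);

         /* Non-blocking: returns immediately; the request handle is used
          * to wait for (or test) completion later. */
         MPI_Request request;
         MPI_Status  status;
         MPI_Isend(buf, n, MPI_INT, dest, 0, MPI_COMM_WORLD, &request);
         /* ... overlap communication with computation here ... */
         MPI_Wait(&request, &status);          /* buf may be reused after this */
     }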

  20. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Typical HPC non-blocking use case:
          MPI_Request request;
          ...
          MPI_Isend(..., &request);
          prepare_computation();
          MPI_Wait(&request, ...);
          finish_computation();

  21. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Class II interactions have a different use case
       – Hardware engines are responsible for computation
       – Embedded processors only need to send messages as fast as possible
     • DMA hardware allows messages to be queued
     • 'Fire-and-forget' message model
       – Message status is not important
       – Request handles are serviced by expensive interrupts

  22. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • The standard MPI protocol provides a mechanism for 'fire-and-forget':
          MPI_Request request_dummy;
          ...
          MPI_Isend(..., &request_dummy);
          MPI_Request_free(&request_dummy);

  23. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • The standard implementation still incurs overhead:
       – Setting up the interrupt
       – Removing the interrupt
       – Extra function call overhead
       – Memory space for the MPI_Request data structure
     • For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck

  24. Class II: Processor-based Optimizations
     Non-Interrupting, Non-Blocking Functions
     • Proposed modification to the MPI protocol:
          #define MPI_REQUEST_NULL NULL
          ...
          MPI_Isend(..., MPI_REQUEST_NULL);
     • Non-blocking functions check that the request pointer is valid before setting up interrupts
     • Circumvents the overhead
     • Not standard, but a minor modification that works well for embedded processors with DMA
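  A minimal sketch of how the non-blocking send could honour this inside the library; the simplified signature and the queue_dma_transfer() and setup_completion_interrupt() helpers are hypothetical, and only the NULL check is the point.

     /* Hypothetical helpers, assumed to exist in the implementation. */
     void queue_dma_transfer(void *buf, int size, int dest, int tag);
     void setup_completion_interrupt(MPI_Request *request);

     /* Sketch, not the actual implementation (MPI_Request and
      * MPI_REQUEST_NULL as defined by the embedded MPI header). */
     int MPI_Isend_sketch(void *buf, int size, int dest, int tag, MPI_Request *request)
     {
         queue_dma_transfer(buf, size, dest, tag);   /* hand the message off to the DMA engine */

         if (request != MPI_REQUEST_NULL) {
             /* Pay for the interrupt and the request bookkeeping only
              * when the caller actually wants a handle back. */
             setup_completion_interrupt(request);
         }
         return 0;
     }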

  25. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message without DMA (figure: execution timeline; the processor enters MPI_Send(), transfers all the data words itself, then returns, so the time spent inside the function grows with the message length; legend: non-MPI code, function preamble/postamble, MPI function code)

  33. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message with DMA (figure: MPI_Send() now transfers only four words, regardless of message length, before returning)

  34. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message with DMA (figure: execution-time breakdown): 28.7% + 15.6% = 44.3% of the time is still spent on the message (function preamble/postamble plus MPI function code), versus 55.6% in non-MPI code

  36. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI message with DMA: message queueing (figure: msg 1, msg 2 and msg 3 queued)

  37. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • Inline all MPI functions? (figure: msg 1, msg 2, msg 3 with the MPI code inlined at each call site)
       – Increases program length!

  41. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • Standard MPI functions:
          void *msg_buf;
          int msg_size;
          ...
          MPI_Isend(msg_buf, msg_size, ...);
          MPI_Irecv(msg_buf, msg_size, ...);

  42. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
          void MPI_Coalesce(
              // MPI_Coalesce-specific arguments
              MPI_Function *mpi_fn, int mpi_fn_count,
              // Array of point-to-point MPI function arguments
              void **msg_buf, int *msg_size, ...)
          {
              for (int i = 0; i < mpi_fn_count; i++) {
                  if (mpi_fn[i] == MPI_Isend)
                      inline MPI_Isend(msg_buf[i], msg_size[i], ...);
                  else if (mpi_fn[i] == MPI_Irecv)
                      inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
              }
          }


  48. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI_Coalesce (figure: msg 1, msg 2 and msg 3 issued from a single for loop; legend: non-MPI code, function preamble/postamble, MPI function code, for loop)


  52. Class II: Processor-based Optimizations
     Series of messages – MPI_Coalesce()
     • MPI_Coalesce is not part of the MPI Standard
     • Behaviour can be easily reproduced
       – Even when source code is not available
     • Maintains compatibility with MPI code
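  A hedged usage sketch of how a call site might batch three transfers through MPI_Coalesce; out_a, out_b, in_c and N are placeholder names, and the remaining arguments are abbreviated as '...' exactly as on the slides.

     /* Illustrative only: three point-to-point transfers issued in one call. */
     MPI_Function fn[3]   = { MPI_Isend, MPI_Isend, MPI_Irecv };
     void        *buf[3]  = { out_a,     out_b,     in_c      };
     int          size[3] = { N,         N,         N         };

     MPI_Coalesce(fn, 3, buf, size, ...);

     /* Equivalent, but pays the function preamble/postamble three times:
      *   MPI_Isend(out_a, N, ...);
      *   MPI_Isend(out_b, N, ...);
      *   MPI_Irecv(in_c,  N, ...);
      */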

  53. Class II: Processor-based Optimizations
     Results
     • Application: Restricted Boltzmann Machines [1]
       – Neural network FPGA implementation
       – Platform: Berkeley Emulation Engine 2 (BEE2)
     • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
     • Inter-FPGA communication:
       – Latency: 6 cycles
       – Bandwidth: 1.73 GB/s
     [1] D. Ly et al., "A Multi-FPGA Architecture for Restricted Boltzmann Machines," FPL, Sept. 2009.


  55. Class II: Processor-based Optimizations
     Results
     Message #   Source   Destination   Size [# of words]
         1         R0         R1                0
         2         R0         R1                3
         3         R0         R6                0
         4         R0         R6                3
         5         R0         R11               0
         6         R0         R11               3
         7         R0         R16               0
         8         R0         R16               3
         9         R0         R1                4
        10         R0         R6                4
        11         R0         R11               4
        12         R0         R16               4

  56. Class II: Processor-based Optimizations
     Results (figure: execution-time comparison; measured speedups of 2.33x, 3.94x and 5.32x)

  63. Class III: Hardware-based Optimizations
     • Background
     • Dataflow Message Passing Model
       – Case Study: Vector Addition

  64. Class III: Hardware-based Optimizations
     Background
     • Processor-based, software model
       – Function calls are atomic
       – Program flow is quantized in units of message functions
       – Cannot execute communication and computation simultaneously
     • Hardware engines
       – Significantly more parallelism
       – Communication and computation can be simultaneous

  65. Class III: Hardware-based Optimizations
     Dataflow Message Passing Model
     • Standard message processing model:
          MPI_Recv(...);
          compute();
          MPI_Send(...);
     • Hardware uses a dataflow model (figure: logic block)

  66. Class III: Hardware-based Optimizations
     Case Study: Vector Addition
     • Vector addition: vc = va + vb, element-wise vc[i] = va[i] + vb[i]
     • va comes from Rank 1, vb comes from Rank 2
     • Compute vc, send the result back to Ranks 1 and 2

  67. Class III: Hardware-based Optimizations
     Case Study: Vector Addition
     • Software model:
          int va[N], vb[N], vc[N];
          MPI_Recv(va, N, MPI_INT, rank1, ...);
          MPI_Recv(vb, N, MPI_INT, rank2, ...);
          for (int i = 0; i < N; i++)
              vc[i] = va[i] + vb[i];
          MPI_Send(vc, N, MPI_INT, rank1, ...);
          MPI_Send(vc, N, MPI_INT, rank2, ...);

  68. Class III: Hardware-based Optimizations
     Case Study: Vector Addition (figures)
