sPIN: High-performance streaming Processing in the Network


  1. T. Hoefler, S. Di Girolamo, K. Taranov, R. E. Grant, R. Brightwell: sPIN: High-performance streaming Processing in the Network

  2. The Development of High-Performance Networking Interfaces
     Milestones from 1980 to 2020 include: Ethernet+TCP/IP, Scalable Coherent Interface, Fast Messages, Myrinet GM+MX, Virtual Interface Architecture, Quadrics QsNet, IB Verbs, OFED, Cray Gemini, Portals 4, libfabric.
     Capabilities introduced along the way: protocol offload, sockets, (active) message based, remote direct memory access (RDMA), coherent memory access, OS bypass, zero copy, triggered operations.
     June 2017: 95 of the top-100 and more than 285 of the top-500 systems use RDMA (image: businessinsider.com).

  3. Data Processing in modern RDMA networks
     [Diagram: local node (Core i7 Haswell) and remote nodes reached via the network. Memory hierarchy latencies: registers → L1 4 cycles (~1.3 ns), → L2 11 cycles (~3.6 ns), → L3 34 cycles (~11.3 ns), → main memory 125 cycles (~41.6 ns). The RDMA NIC sits on the PCIe bus and DMAs arriving packets from its input buffer into main memory; the data is ~250 ns away from the processing unit.]

  4. Data Processing in modern RDMA networks
     [Same diagram as slide 3, annotated with message rates: a Mellanox ConnectX-5 must handle 1 message every 5 ns; tomorrow's 400G NICs must handle 1 message every 1.2 ns.]

  7. The future of High-Performance Networking Interfaces
     [Same interface timeline as slide 2.] The question: how do we carry the established principles for compute acceleration over to the network — programmability, specialization, libraries, ease-of-use, portability, efficiency?
     June 2017: 95 of the top-100 and more than 285 of the top-500 systems use RDMA.

  8. sPIN — Streaming Processing In the Network: the future of High-Performance Networking Interfaces
     sPIN extends the timeline of slide 2 with the next step: fully programmable NIC acceleration, combining the established compute-acceleration principles (programmability, specialization, libraries, ease-of-use, portability, efficiency) with protocol offload, RDMA, OS bypass, zero copy, and triggered operations.

  9. sPIN NIC — Abstract Machine Model
     [Diagram: the host CPU uploads handlers into fast shared memory on the NIC and manages NIC memory. A packet scheduler (packet input buffer) dispatches arriving packets to handler processing units (HPU 0–3), which execute the handlers out of the fast shared memory and use a DMA unit for R/W access to host memory.]
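     The execution model behind this diagram can be summarized in a short, purely conceptual sketch: the header handler runs once on the first packet of a message, the payload handler runs for every packet (possibly concurrently on several HPUs), and the completion handler runs once after the last packet. All names and types below are hypothetical illustrations, not part of the sPIN specification.

       /* Conceptual sketch of how the packet scheduler drives the three handler
        * types for one message; a real NIC runs payload handlers in parallel. */
       typedef struct { const void *base; int length; } pkt_t;   /* hypothetical */

       typedef int (*header_fn_t)(const pkt_t *pkt, void *state);
       typedef int (*payload_fn_t)(const pkt_t *pkt, void *state);
       typedef int (*completion_fn_t)(int dropped_bytes, int flow_ctrl, void *state);

       static void deliver_message(const pkt_t *pkts, int npkts, void *state,
                                   header_fn_t hh, payload_fn_t ph, completion_fn_t ch)
       {
         hh(&pkts[0], state);                /* once, when the header arrives     */
         for (int i = 0; i < npkts; i++)     /* per packet; HPUs may run these    */
           ph(&pkts[i], state);              /* concurrently on real hardware     */
         ch(0, 0, state);                    /* once, after the last packet       */
       }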

  10. RDMA vs. sPIN in action: Simple Ping Pong
      [Diagram: message timeline between Initiator and Target, contrasting the RDMA and sPIN control flows.]

  12. RDMA vs. sPIN in action: Streaming Ping Pong
      [Diagram: message timeline between Initiator and Target, contrasting the RDMA and sPIN control flows for a streamed (multi-packet) message.]

  14. sPIN — Programming Interface
      [Diagram: the packet scheduler splits an incoming message into a header packet, payload packets, and a tail packet; each kind triggers the corresponding handler.]

      Header handler:
        __handler int pp_header_handler(const ptl_header_t h, void *state) {
          pingpong_info_t *i = state;
          i->source = h.source_id;
          return PROCESS_DATA; // execute payload handler to put from device
        }

      Payload handler:
        __handler int pp_payload_handler(const ptl_payload_t p, void *state) {
          pingpong_info_t *i = state;
          PtlHandlerPutFromDevice(p.base, p.length, 1, 0, i->source, 10, 0, NULL, 0);
          return SUCCESS;
        }

      Completion handler:
        __handler int pp_completion_handler(int dropped_bytes, bool flow_control_triggered, void *state) {
          return SUCCESS;
        }

      connect(peer, /* … */, &pp_header_handler, &pp_payload_handler, &pp_completion_handler);
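      For completeness, a minimal sketch of the state type shared by the three handlers above; only the source field is implied by the slide, and giving it the Portals 4 ptl_process_t type is an assumption in the style of the rest of the interface.

        typedef struct {
          ptl_process_t source;   /* written by the header handler, read by the
                                     payload handler to address the reply Put  */
        } pingpong_info_t;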

  16. Possible sPIN implementations
      ▪ sPIN is a programming abstraction, similar to CUDA or OpenCL combined with OFED or Portals 4.
      ▪ It enables a large variety of NIC implementations, for example massively multithreaded HPUs, including warp-like scheduling strategies; at 400G a NIC must process more than 833 million messages/s.
      ▪ Main goal: sPIN must not obstruct line rate, so the programmer must limit the processing time per packet. Little's Law: 500 instructions per handler at 2.5 GHz with IPC=1 and 1 Tb/s line rate requires about 25 KiB of packet-buffer memory (worked out below).
      ▪ Relies on fast shared memory (processing in packet buffers): scratchpad or registers.
      ▪ Quick (single-cycle) handler invocation on packet arrival: pre-initialized memory and context.
      ▪ Can be implemented in most RDMA NICs with a firmware update, or in software on programmable SmartNICs — e.g., Catapult, Innova Flex (Kintex/Virtex FPGAs), or the BCM58800 SoC (full Linux).
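      As a sanity check on the Little's Law figure above, a small stand-alone calculation using the slide's numbers (500 instructions, 2.5 GHz, IPC = 1, 1 Tb/s); the byte conversion and rounding are mine, for illustration only.

        #include <stdio.h>

        int main(void)
        {
          double instr_per_handler = 500.0;
          double clock_hz          = 2.5e9;     /* 2.5 GHz, IPC = 1 */
          double line_rate_bps     = 1e12;      /* 1 Tb/s           */

          double handler_s = instr_per_handler / clock_hz;         /* 200 ns            */
          double buf_bytes = line_rate_bps * handler_s / 8.0;      /* rate x time, in B */

          printf("handler time %.0f ns -> buffer %.1f kB (~%.1f KiB)\n",
                 handler_s * 1e9, buf_bytes / 1e3, buf_bytes / 1024.0);
          return 0;
        }

      This prints a handler time of 200 ns and a buffer of 25.0 kB (~24.4 KiB), matching the slide's ~25 KiB figure.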

  17. Simulating a sPIN NIC — Ping Pong
      ▪ LogGOPSim v2 [1] combines LogGOPSim (packet-level network simulation) with gem5 (cycle-accurate CPU simulation); it supports Portals 4 and MPI.
      ▪ Network (LogGOPSim), parametrized for a future InfiniBand: o = 65 ns (measured), g = 6.7 ns (150 M msgs/s), G = 2.5 ps (400 Gb/s), switch L = 50 ns (measured), wire L = 33.4 ns (10 m cable); header matching m = 30 ns per message plus 2 ns per packet, overlapped with g.
      ▪ NIC HPU: 2.5 GHz ARM Cortex A15 OOO, ARMv8-A 32-bit ISA, single-cycle-access SRAM (no DRAM).
      ▪ Ping-pong handler cost: 18 instructions + 1 Put.
      ▪ Result: sPIN (streaming) achieves 35% lower latency and 32% higher bandwidth than RDMA.
      [1] S. Di Girolamo, K. Taranov, T. Schneider, E. Stalder, T. Hoefler: LogGOPSim+gem5: Simulating Network Offload Engines Over Packet-Switched Networks. Presented at ExaMPI'17.

  18. Simulating a sPIN NIC — Ping Pong (continued)
      Same simulation setup and parameters as slide 17 [1]. The evaluation covers three classes of use cases: network group communication, distributed data management, and data layout transformation.
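      For intuition only, a naive back-of-the-envelope estimate of the small-message latency from the parameters on slide 17, assuming an initiator–switch–target path over two 10 m cables and ignoring matching, gap, and handler costs; this is simple arithmetic on the stated constants, not the LogGOPSim v2 model.

        #include <stdio.h>

        int main(void)
        {
          /* parameters from slide 17, in nanoseconds */
          double o = 65.0, L_switch = 50.0, L_wire = 33.4;

          /* naive one-way: send overhead + wire + switch + wire + receive overhead */
          double one_way = o + L_wire + L_switch + L_wire + o;
          printf("rough one-way %.1f ns, ping-pong RTT %.1f ns\n",
                 one_way, 2.0 * one_way);
          return 0;
        }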

  19. Use Case 1: Broadcast acceleration (network group communication)
      [Plot: broadcast latency vs. number of processes for 8-byte messages, RDMA baseline.]
      Liu, J., et al.: High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 2004.

  20. Use Case 1: Broadcast acceleration (network group communication)
      [Plot: as on slide 19, adding offloaded collectives (e.g., ConnectX-2, Portals 4 triggered operations) to the comparison.]
      Underwood, K.D., et al.: Enabling flexible collective communication offload with triggered operations. HOTI'11.
      Liu, J., et al.: High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 2004.
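      To make the use case concrete, a minimal sketch (an assumption, not the paper's implementation) of a sPIN payload handler that forwards each arriving broadcast packet to the node's children in a tree directly from the NIC; bcast_info_t, MAX_CHILDREN, and children[] are hypothetical, and the PtlHandlerPutFromDevice arguments simply mirror the call shown on slide 14.

        #define MAX_CHILDREN 8                  /* hypothetical fan-out limit        */

        typedef struct {
          int           num_children;           /* 0 at the leaves of the tree       */
          ptl_process_t children[MAX_CHILDREN]; /* next hops in the broadcast tree   */
        } bcast_info_t;

        __handler int bcast_payload_handler(const ptl_payload_t p, void *state)
        {
          bcast_info_t *i = state;
          /* forward this packet to every child without waking the host CPU */
          for (int c = 0; c < i->num_children; c++)
            PtlHandlerPutFromDevice(p.base, p.length, 1, 0, i->children[c],
                                    10, 0, NULL, 0);
          return SUCCESS;
        }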
