sPIN: High-performance streaming Processing in the Network


  1. T. Hoefler, S. Di Girolamo, K. Taranov, R. E. Grant, R. Brightwell: sPIN: High-performance streaming Processing in the Network

  2. The Development of High-Performance Networking Interfaces
     Milestones from 1980 to 2020 include: Ethernet+TCP/IP, Scalable Coherent Interface, Fast Messages, Myrinet GM+MX, Virtual Interface Architecture, Quadrics QsNet, IB Verbs, OFED, Cray Gemini, Portals 4, libfabric.
     Capabilities introduced along the way: protocol offload, sockets, (active) message based, remote direct memory access (RDMA), coherent memory access, OS bypass, zero copy, triggered operations.
     June 2017: 95 of the top-100 and more than 285 of the top-500 systems use RDMA (image: businessinsider.com).

  3. Data Processing in modern RDMA networks
     [Diagram: local node (Core i7 Haswell) and remote nodes reached via the network. Memory hierarchy latencies: registers → L1 4 cycles (~1.3 ns), → L2 11 cycles (~3.6 ns), → L3 34 cycles (~11.3 ns), → main memory 125 cycles (~41.6 ns). The RDMA NIC sits on the PCIe bus and DMAs arriving packets from its input buffer into main memory; the data is ~250 ns away from the processing unit.]

  4. Data Processing in modern RDMA networks
     [Same diagram as slide 3, annotated with message rates: a Mellanox ConnectX-5 must handle 1 message every 5 ns; tomorrow's 400G NICs must handle 1 message every 1.2 ns.]

  7. The future of High-Performance Networking Interfaces
     [Same interface timeline as slide 2.] The question: how do we carry the established principles for compute acceleration over to the network — programmability, specialization, libraries, ease-of-use, portability, efficiency?
     June 2017: 95 of the top-100 and more than 285 of the top-500 systems use RDMA.

  8. sPIN — Streaming Processing In the Network: the future of High-Performance Networking Interfaces
     sPIN extends the timeline of slide 2 with the next step: fully programmable NIC acceleration, combining the established compute-acceleration principles (programmability, specialization, libraries, ease-of-use, portability, efficiency) with protocol offload, RDMA, OS bypass, zero copy, and triggered operations.

  9. sPIN NIC — Abstract Machine Model
     [Diagram: the host CPU uploads handlers into fast shared memory on the NIC and manages NIC memory. A packet scheduler (packet input buffer) dispatches arriving packets to handler processing units (HPU 0–3), which execute the handlers out of the fast shared memory and use a DMA unit for R/W access to host memory.]
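     The execution model behind this diagram can be summarized in a short, purely conceptual sketch: the header handler runs once on the first packet of a message, the payload handler runs for every packet (possibly concurrently on several HPUs), and the completion handler runs once after the last packet. All names and types below are hypothetical illustrations, not part of the sPIN specification.

       /* Conceptual sketch of how the packet scheduler drives the three handler
        * types for one message; a real NIC runs payload handlers in parallel. */
       typedef struct { const void *base; int length; } pkt_t;   /* hypothetical */

       typedef int (*header_fn_t)(const pkt_t *pkt, void *state);
       typedef int (*payload_fn_t)(const pkt_t *pkt, void *state);
       typedef int (*completion_fn_t)(int dropped_bytes, int flow_ctrl, void *state);

       static void deliver_message(const pkt_t *pkts, int npkts, void *state,
                                   header_fn_t hh, payload_fn_t ph, completion_fn_t ch)
       {
         hh(&pkts[0], state);                /* once, when the header arrives     */
         for (int i = 0; i < npkts; i++)     /* per packet; HPUs may run these    */
           ph(&pkts[i], state);              /* concurrently on real hardware     */
         ch(0, 0, state);                    /* once, after the last packet       */
       }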

  10. RDMA vs. sPIN in action: Simple Ping Pong
      [Diagram: message timeline between Initiator and Target, contrasting the RDMA and sPIN control flows.]

  12. RDMA vs. sPIN in action: Streaming Ping Pong
      [Diagram: message timeline between Initiator and Target, contrasting the RDMA and sPIN control flows for a streamed (multi-packet) message.]

  14. sPIN — Programming Interface
      [Diagram: the packet scheduler splits an incoming message into a header packet, payload packets, and a tail packet; each kind triggers the corresponding handler.]

      Header handler:
        __handler int pp_header_handler(const ptl_header_t h, void *state) {
          pingpong_info_t *i = state;
          i->source = h.source_id;
          return PROCESS_DATA; // execute payload handler to put from device
        }

      Payload handler:
        __handler int pp_payload_handler(const ptl_payload_t p, void *state) {
          pingpong_info_t *i = state;
          PtlHandlerPutFromDevice(p.base, p.length, 1, 0, i->source, 10, 0, NULL, 0);
          return SUCCESS;
        }

      Completion handler:
        __handler int pp_completion_handler(int dropped_bytes, bool flow_control_triggered, void *state) {
          return SUCCESS;
        }

      connect(peer, /* … */, &pp_header_handler, &pp_payload_handler, &pp_completion_handler);
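      For completeness, a minimal sketch of the state type shared by the three handlers above; only the source field is implied by the slide, and giving it the Portals 4 ptl_process_t type is an assumption in the style of the rest of the interface.

        typedef struct {
          ptl_process_t source;   /* written by the header handler, read by the
                                     payload handler to address the reply Put  */
        } pingpong_info_t;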

  16. Possible sPIN implementations
      ▪ sPIN is a programming abstraction, similar to CUDA or OpenCL combined with OFED or Portals 4.
      ▪ It enables a large variety of NIC implementations, for example massively multithreaded HPUs, including warp-like scheduling strategies; at 400G a NIC must process more than 833 million messages/s.
      ▪ Main goal: sPIN must not obstruct line rate, so the programmer must limit the processing time per packet. Little's Law: 500 instructions per handler at 2.5 GHz with IPC=1 and 1 Tb/s line rate requires about 25 KiB of packet-buffer memory (worked out below).
      ▪ Relies on fast shared memory (processing in packet buffers): scratchpad or registers.
      ▪ Quick (single-cycle) handler invocation on packet arrival: pre-initialized memory and context.
      ▪ Can be implemented in most RDMA NICs with a firmware update, or in software on programmable SmartNICs — e.g., Catapult, Innova Flex (Kintex/Virtex FPGAs), or the BCM58800 SoC (full Linux).
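      As a sanity check on the Little's Law figure above, a small stand-alone calculation using the slide's numbers (500 instructions, 2.5 GHz, IPC = 1, 1 Tb/s); the byte conversion and rounding are mine, for illustration only.

        #include <stdio.h>

        int main(void)
        {
          double instr_per_handler = 500.0;
          double clock_hz          = 2.5e9;     /* 2.5 GHz, IPC = 1 */
          double line_rate_bps     = 1e12;      /* 1 Tb/s           */

          double handler_s = instr_per_handler / clock_hz;         /* 200 ns            */
          double buf_bytes = line_rate_bps * handler_s / 8.0;      /* rate x time, in B */

          printf("handler time %.0f ns -> buffer %.1f kB (~%.1f KiB)\n",
                 handler_s * 1e9, buf_bytes / 1e3, buf_bytes / 1024.0);
          return 0;
        }

      This prints a handler time of 200 ns and a buffer of 25.0 kB (~24.4 KiB), matching the slide's ~25 KiB figure.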

  17. Simulating a sPIN NIC — Ping Pong
      ▪ LogGOPSim v2 [1] combines LogGOPSim (packet-level network simulation) with gem5 (cycle-accurate CPU simulation); it supports Portals 4 and MPI.
      ▪ Network (LogGOPSim), parametrized for a future InfiniBand: o = 65 ns (measured), g = 6.7 ns (150 M msgs/s), G = 2.5 ps (400 Gb/s), switch L = 50 ns (measured), wire L = 33.4 ns (10 m cable); header matching m = 30 ns per message plus 2 ns per packet, overlapped with g.
      ▪ NIC HPU: 2.5 GHz ARM Cortex A15 OOO, ARMv8-A 32-bit ISA, single-cycle-access SRAM (no DRAM).
      ▪ Ping-pong handler cost: 18 instructions + 1 Put.
      ▪ Result: sPIN (streaming) achieves 35% lower latency and 32% higher bandwidth than RDMA.
      [1] S. Di Girolamo, K. Taranov, T. Schneider, E. Stalder, T. Hoefler: LogGOPSim+gem5: Simulating Network Offload Engines Over Packet-Switched Networks. Presented at ExaMPI'17.

  18. Simulating a sPIN NIC — Ping Pong (continued)
      Same simulation setup and parameters as slide 17 [1]. The evaluation covers three classes of use cases: network group communication, distributed data management, and data layout transformation.
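      For intuition only, a naive back-of-the-envelope estimate of the small-message latency from the parameters on slide 17, assuming an initiator–switch–target path over two 10 m cables and ignoring matching, gap, and handler costs; this is simple arithmetic on the stated constants, not the LogGOPSim v2 model.

        #include <stdio.h>

        int main(void)
        {
          /* parameters from slide 17, in nanoseconds */
          double o = 65.0, L_switch = 50.0, L_wire = 33.4;

          /* naive one-way: send overhead + wire + switch + wire + receive overhead */
          double one_way = o + L_wire + L_switch + L_wire + o;
          printf("rough one-way %.1f ns, ping-pong RTT %.1f ns\n",
                 one_way, 2.0 * one_way);
          return 0;
        }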

  19. Use Case 1: Broadcast acceleration (network group communication)
      [Plot: broadcast latency vs. number of processes for 8-byte messages, RDMA baseline.]
      Liu, J., et al.: High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 2004.

  20. Use Case 1: Broadcast acceleration (network group communication)
      [Plot: as on slide 19, adding offloaded collectives (e.g., ConnectX-2, Portals 4 triggered operations) to the comparison.]
      Underwood, K.D., et al.: Enabling flexible collective communication offload with triggered operations. HOTI'11.
      Liu, J., et al.: High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 2004.
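      To make the use case concrete, a minimal sketch (an assumption, not the paper's implementation) of a sPIN payload handler that forwards each arriving broadcast packet to the node's children in a tree directly from the NIC; bcast_info_t, MAX_CHILDREN, and children[] are hypothetical, and the PtlHandlerPutFromDevice arguments simply mirror the call shown on slide 14.

        #define MAX_CHILDREN 8                  /* hypothetical fan-out limit        */

        typedef struct {
          int           num_children;           /* 0 at the leaves of the tree       */
          ptl_process_t children[MAX_CHILDREN]; /* next hops in the broadcast tree   */
        } bcast_info_t;

        __handler int bcast_payload_handler(const ptl_payload_t p, void *state)
        {
          bcast_info_t *i = state;
          /* forward this packet to every child without waking the host CPU */
          for (int c = 0; c < i->num_children; c++)
            PtlHandlerPutFromDevice(p.base, p.length, 1, 0, i->children[c],
                                    10, 0, NULL, 0);
          return SUCCESS;
        }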
