
InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
A Tutorial at IT4 Innovations'18
Latest version of the slides can be obtained from http://www.cse.ohio-state.edu/~panda/it4-advanced.pdf


1. Memory Overheads in Large-Scale Systems
• Different transport protocols with IB
  – Reliable Connection (RC) is the most common
  – Unreliable Datagram (UD) is used in some cases
• Buffers need to be posted at each receiver to receive messages from any sender
  – Buffer requirement can increase with system size
• Connections need to be established across processes under RC
  – Each connection requires a certain amount of memory for handling related data structures
  – Memory required for all connections can increase with system size
• Both issues have become critical as large-scale IB deployments have taken place
  – Being addressed by both the IB specification and upper-level middleware

2. Shared Receive Queue (SRQ)
Figure: one RQ (P buffers) per connection, with (M*N) - 1 connections per process, vs. one SRQ (Q buffers) shared by all connections
• SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
  – Introduced in specification v1.2
• 0 < Q << P * ((M*N) - 1)
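To make the SRQ mechanism concrete, here is a minimal sketch (not from the tutorial) of creating one shared receive queue with the verbs API; the protection domain pd and the queue depth of 512 are illustrative assumptions.

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Create one SRQ that all connections of this process will draw
       receive buffers from, instead of one RQ per connection. */
    struct ibv_srq *create_shared_rq(struct ibv_pd *pd)
    {
        struct ibv_srq_init_attr srq_attr;

        memset(&srq_attr, 0, sizeof(srq_attr));
        srq_attr.attr.max_wr    = 512;   /* Q: receive WRs shared by all connections (illustrative) */
        srq_attr.attr.max_sge   = 1;     /* one scatter/gather entry per buffer */
        srq_attr.attr.srq_limit = 0;     /* no low-watermark event requested */

        return ibv_create_srq(pd, &srq_attr);
    }

    /* When creating each RC QP, point it at the shared queue:
       qp_init_attr.srq = srq;  -- the QP then has no private receive queue. */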

3. eXtended Reliable Connection (XRC)
Figure: with M = # of processes/node and N = # of nodes, RC needs M^2 x (N - 1) connections/node vs. M x (N - 1) connections/node with XRC
• Each QP takes at least one page of memory
  – Connections between all processes are very costly for RC
• New IB transport added: eXtended Reliable Connection
  – Allows connections between nodes instead of processes
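A small back-of-the-envelope sketch (illustrative numbers, not from the slides) of the per-node QP counts implied by the two formulas above, assuming M = 16 processes/node, N = 1,024 nodes, and one 4 KB page per QP:

    #include <stdio.h>

    int main(void)
    {
        long M = 16, N = 1024, page = 4096;      /* assumed example sizes */
        long rc_qps  = M * M * (N - 1);          /* RC:  M^2 x (N - 1) QPs per node */
        long xrc_qps = M * (N - 1);              /* XRC: M   x (N - 1) QPs per node */

        printf("RC : %ld QPs/node (~%ld MB)\n", rc_qps,  rc_qps  * page / (1024 * 1024));
        printf("XRC: %ld QPs/node (~%ld MB)\n", xrc_qps, xrc_qps * page / (1024 * 1024));
        return 0;
    }
    /* Prints roughly 261,888 QPs (~1 GB) per node for RC vs. 16,368 QPs (~64 MB) for XRC. */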

4. XRC Addressing
Figure: Process 0 directs sends to SRQ#1 (Process 2) and SRQ#2 (Process 3) on the remote node over a single XRC connection
• XRC uses SRQ Numbers (SRQN) to direct where an operation should complete
• Hardware does all routing of data, so P2 is not actually involved in the data transfer
• Connections are not bi-directional, so P3 cannot send to P0

5. DC (Dynamically Connected) Connection Model, Communication Objects and Addressing Scheme
Figure: four nodes (two processes each) connected through the IB network
• Communication objects & addressing scheme
  – DCINI
    • Analogous to the send QPs
    • Can transmit data to any peer
  – DCTGT
    • Receive objects
    • Must be backed by an SRQ
    • Identified on a node by a "DCT Number"
  – Messages routed with a combination of DCT Number + LID
  – Requires a "DC Key" to enable communication
    • Must be the same across all processes
• Constant connection cost
  – One QP for any peer
• Full feature set
  – RDMA, Atomics, etc.

6. User-Mode Memory Registration (UMR)
• Supports direct local and remote non-contiguous memory access
• Avoids packing at the sender and unpacking at the receiver
• Steps to create memory regions with UMR (process <-> kernel <-> HCA/RNIC):
  1. UMR creation request: send the number of blocks
  2. HCA issues uninitialized memory keys for future UMR use
  3. Kernel maps virtual->physical and pins the region into physical memory
  4. HCA caches the virtual-to-physical mapping

7. On-Demand Paging (ODP)
• Applications no longer need to pin down the underlying physical pages
• Memory Regions (MRs) are NEVER pinned by the OS
  – Paged in by the HCA when needed
  – Paged out by the OS when reclaimed
• ODP can be divided into two classes
  – Explicit ODP
    • Applications still register memory buffers for communication, but this operation is used to define access control for I/O rather than to pin down the pages
  – Implicit ODP
    • Applications are provided with a special memory key that represents their complete address space; no need to register any virtual address range
• Advantages
  – Simplifies programming
  – Unlimited MR sizes
  – Physical memory optimization
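As a hedged illustration of the two ODP classes (a sketch based on the upstream verbs API, not code from the tutorial), registration simply adds the IBV_ACCESS_ON_DEMAND flag; for implicit ODP the whole address space is assumed to be covered by passing a NULL base and a maximal length, provided the HCA and driver report ODP support:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Explicit ODP: register a specific buffer, but let the HCA fault
       pages in on demand instead of pinning them at registration time. */
    struct ibv_mr *reg_explicit_odp(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
    }

    /* Implicit ODP: one memory key covering the entire address space,
       so individual buffers never need to be registered. */
    struct ibv_mr *reg_implicit_odp(struct ibv_pd *pd)
    {
        return ibv_reg_mr(pd, NULL, SIZE_MAX, /* whole address space */
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
    }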

8. Implicit On-Demand Paging (ODP)
• Introduced by Mellanox to avoid pinning the pages of registered memory regions
• An ODP-aware runtime can reduce the size of pin-down buffers while maintaining performance
Figure: execution time (s, log scale) of applications on 256 processes (CG, EP, FT, IS, MG, LU, SP, AWP-ODC, Graph500) comparing Pin-down, Explicit-ODP, and Implicit-ODP
M. Li, X. Lu, H. Subramoni, and D. K. Panda, "Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand", HiPC '17

9. Collective Offload Support on the Adapters
• Performance of collective operations (broadcast, barrier, reduction, all-reduce, etc.) is very critical to the overall performance of MPI applications
• Currently done with basic point-to-point operations (send/recv and RDMA) using host-based operations
• Mellanox ConnectX-2, ConnectX-3, ConnectX-4, and ConnectX-5 adapters support offloading some of these operations to the adapters (CORE-Direct)
  – Provides overlap of computation and collective communication
  – Reduces OS jitter (since everything is done in hardware)

10. One-to-many Multi-Send
• Sender creates a task list consisting of only send and wait WQEs
  – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task list
  – A wait WQE is added to make the HCA wait for the ACK packet from the receiver
Figure: application task list (Send, Wait, Send, Send, Send, Wait) posted to the InfiniBand HCA management queue (MQ/MCQ) alongside the regular Send/Recv queues and CQs over the physical link

11. Scalable Hierarchical Aggregation Protocol (SHArP)
• Management and execution of MPI operations in the network by using SHArP
• Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
  – Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
  – AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
  – Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*
Figures: physical network topology* and the logical SHArP tree built on top of it*
* Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. R. L. Graham, D. Bureddy, P. Lui, G. Shainer, H. Rosenstock, G. Bloch, D. Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir, L. Levi, A. Margolin, T. Ronen, A. Shpiner, O. Wertheim, E. Zahavi, Mellanox Technologies, Inc. First Workshop on Optimization of Communication in HPC Runtime Systems (COM-HPC 2016)

12. Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A

13. The Ethernet Ecosystem
Figure: Ethernet Alliance roadmap of the Ethernet ecosystem
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/

14. Emergence of 25 GigE and Benefits
Figure: 25 GigE can slash the number of top-of-rack switches (Source: IEEE 802.3)
Courtesy:
http://www.eetimes.com/document.asp?doc_id=1323184
http://www.networkcomputing.com/data-centers/25-gbe-big-deal-will-arrive/1714647938
http://www.plexxi.com/2014/07/whats-25-gigabit-ethernet-want/
http://www.qlogic.com/Products/adapters/Pages/25Gb-Ethernet.aspx

15. Matching PCIe and Ethernet Speeds
Ethernet Rate (Gb/s)   PCIe Gen3 Lanes, Single Port   PCIe Gen3 Lanes, Dual Port
100                    16                             32 (Uncommon)
40                     8                              16
25                     4                              8
10                     2                              4
• 25 GigE requires half the number of lanes compared to 40G (x4 instead of x8 PCIe lanes)
• Better PCIe bandwidth utilization (25/32 = 78% vs. 40/64 = 62.5%) with lower power impact
Courtesy: http://www.ieee802.org/3/cfi/0314_3/CFI_03_0314.pdf

16. Detailed Specifications for 25 and 50 GigE and Looking Forward
• 25G & 50G Ethernet specification extends IEEE 802.3 to work at increased data rates
• Features in Draft 1.4 of the specification
  – PCS/PMA operation at 25 Gb/s over a single lane
  – PCS/PMA operation at 50 Gb/s over two lanes
  – Optional Forward Error Correction modes
  – Optional auto-negotiation using an OUI next page
  – Optional link training
• Standards for 50 Gb/s, 200 Gb/s and 400 Gb/s under development
  – Next standards expected around 2017-2018?
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/

17. Ethernet Roadmap – To Terabit Speeds?
• 50G, 100G, 200G and 400G by 2018-2019
• Terabit speeds by 2025?!?!
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/

18. Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A

19. RDMA over Converged Enhanced Ethernet
• Takes advantage of IB and Ethernet
  – Software written with IB Verbs
  – Link layer is Converged (Enhanced) Ethernet (CE)
• Pros:
  – Works natively in Ethernet environments
    • Entire Ethernet management ecosystem is available
  – Has all the benefits of IB verbs
  – Link layer is very similar to the link layer of native IB, so there are no missing features
• RoCE v2: additional benefits over RoCE
  – Traditional network management tools apply
  – ACLs (metering, accounting, firewalling)
  – IGMP snooping for optimized multicast
  – Network monitoring tools
• Cons:
  – Network bandwidth might be limited to Ethernet switches
    • 10/40GE switches available; 56 Gbps IB is available
Figure (network stack comparison): IB vs. RoCE vs. RoCE v2 — all three expose IB Verbs and IB Transport; IB uses the IB network and link layers, RoCE replaces the IB link layer with Ethernet, and RoCE v2 additionally replaces the IB network layer with UDP/IP over Ethernet
Figure (packet header comparison): RoCE = ETH L2 Hdr (Ethertype) + IB GRH L3 Hdr + IB BTH+ L4 Hdr; RoCE v2 = ETH L2 Hdr + IP Hdr (Proto #) + UDP Hdr (Port #) + IB BTH+ L4 Hdr
Courtesy: OFED, Mellanox

20. Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A

21. Software Convergence with OpenFabrics
• Open source organization (formerly OpenIB) – www.openfabrics.org
• Incorporates IB, RoCE, and iWARP in a unified manner
  – Support for Linux and Windows
• Users can download the entire stack and run it
  – Latest stable release is OFED 4.8.1
    • New naming convention to get aligned with Linux kernel development
  – OFED 4.8.2 is under development

22. OpenFabrics Software Stack
Figure: the OpenFabrics software stack — user-level applications (IP-based, sockets-based, block storage, file-system access, clustered DB access, various MPIs, management/diag tools) sit on user-level verbs/APIs for InfiniBand and iWARP R-NICs; kernel space hosts upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), a mid-layer (SA client, MAD, SMA, connection manager / CMA), and hardware-specific provider drivers over the InfiniBand HCA / iWARP R-NIC hardware; kernel bypass is provided for the verbs paths
Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; SM = Subnet Manager; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; CMA = Connection Manager Abstraction; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC

23. Programming with OpenFabrics: Sample Steps
Figure: sender and receiver processes, each reaching the HCA through the kernel-bypass path
1. Create QPs (endpoints)
2. Register memory for sending and receiving
3. Send
   – Channel semantics
     • Post receive
     • Post send
   – RDMA semantics
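The next two slides show the post-send and RDMA-write calls; for the "post receive" step, here is a minimal sketch in the same fragment style (qp, buf, len and mr_handle are assumed to be set up exactly as in those examples):

    struct ibv_recv_wr rr, *bad_rr;
    struct ibv_sge sg_entry;

    memset(&rr, 0, sizeof(rr));
    rr.next    = NULL;
    rr.wr_id   = 0;
    rr.num_sge = 1;
    rr.sg_list = &sg_entry;

    sg_entry.addr   = (uintptr_t) buf;     /* where the incoming message should land */
    sg_entry.length = len;
    sg_entry.lkey   = mr_handle->lkey;     /* local key from memory registration */

    ret = ibv_post_recv(qp, &rr, &bad_rr); /* must be posted before the matching send arrives */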

24. Verbs: Post Send
• Prepare and post a send descriptor (channel semantics)

    struct ibv_send_wr *bad_wr;
    struct ibv_send_wr sr;
    struct ibv_sge sg_entry;

    sr.next = NULL;
    sr.opcode = IBV_WR_SEND;
    sr.wr_id = 0;
    sr.num_sge = 1;
    if (len < max_inline_size) {
        sr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;
    } else {
        sr.send_flags = IBV_SEND_SIGNALED;
    }
    sr.sg_list = &(sg_entry);

    sg_entry.addr = (uintptr_t) buf;
    sg_entry.length = len;
    sg_entry.lkey = mr_handle->lkey;

    ret = ibv_post_send(qp, &sr, &bad_wr);

25. Verbs: Post RDMA Write
• Prepare and post an RDMA write (memory semantics)

    struct ibv_send_wr *bad_wr;
    struct ibv_send_wr sr;
    struct ibv_sge sg_entry;

    sr.next = NULL;
    sr.opcode = IBV_WR_RDMA_WRITE;          /* set type to RDMA Write */
    sr.wr_id = 0;
    sr.num_sge = 1;
    sr.send_flags = IBV_SEND_SIGNALED;
    sr.wr.rdma.remote_addr = remote_addr;   /* remote virtual addr. */
    sr.wr.rdma.rkey = rkey;                 /* from remote node */
    sr.sg_list = &(sg_entry);

    sg_entry.addr = (uintptr_t) buf;        /* local buffer */
    sg_entry.length = len;
    sg_entry.lkey = mr_handle->lkey;

    ret = ibv_post_send(qp, &sr, &bad_wr);
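Both examples above request IBV_SEND_SIGNALED, so the sender must reap a completion; a minimal sketch (not from the deck) of polling the send completion queue cq that the QP was created with:

    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);          /* non-blocking; returns the number of completions drained */
    } while (n == 0);                          /* busy-poll until one completion is available */

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        /* error: wc.status describes the failure, wc.wr_id identifies the offending WQE */
    }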

26. Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A

27. Libfabrics Connection Model
Figure: connection setup between a server process and a client process, each going through an OFI provider (Sockets/Verbs/PSM) down to the HCA (GigE/IB/TrueScale)
• Server: open fabric (fi_fabric), open passive EP (fi_passive_ep), open event queue (fi_eq_open), bind passive EP (fi_pep_bind), listen for incoming connections (fi_listen); when a new event is detected on the EQ, validate that it is FI_CONNREQ (fi_eq_sread), then open a domain (fi_domain) and register memory (fi_mr_reg)
• Client: open fabric (fi_fabric), open domain (fi_domain), open event queue (fi_eq_open), register memory (fi_mr_reg), open endpoint (fi_endpoint), open completion queue (fi_cq_open), bind EP to CQ (fi_ep_bind), connect to the remote EP (fi_connect)

28. Libfabrics Connection Model (Cont.)
Figure: remainder of the exchange between server and client
• Server: open endpoint (fi_endpoint), open completion queue (fi_cq_open), bind EP to CQ (fi_ep_bind), accept the connection (fi_accept); when a new event is detected on the EQ, validate that it is FI_CONNECTED (fi_eq_sread)
• Client: validate that the new event is FI_CONNECTED (fi_eq_sread)
• Both sides: post receives (fi_recv), post sends (fi_send), and poll / wait for data and completions (fi_cq_read / fi_cq_sread)
• Teardown: shut down the channel (fi_shutdown) and close all open resources (fi_close on each one)

29. Scalable EndPoints vs. Shared TX/RX Context
Figure: a normal endpoint with its own transmit, receive and completion resources; shared TX/RX contexts where several endpoints share one transmit and one receive context; scalable endpoints where one endpoint drives multiple transmit/receive contexts
• Shared TX/RX context
  – Shares HW resources
  – Similar to a socket / QP
  – # EPs >> HW resources
• Scalable EndPoints
  – Use more HW resources
  – Higher performance per EP
  – Simple / easy to use
Courtesy: http://www.slideshare.net/seanhefty/ofa-workshop2015ofiwg?ref=http://ofiwg.github.io/libfabric/

30. Libfabrics: Fabric, Domain and Endpoint Creation
• Open fabric, domain and EP

    struct fi_info *info, *hints;
    struct fid_fabric *fabric;
    struct fid_domain *dom;
    struct fid_ep *ep;

    hints = fi_allocinfo();

    /* Obtain fabric information */
    rc = fi_getinfo(VERSION, node, service, flags, hints, &info);

    /* Free fabric information */
    fi_freeinfo(hints);

    /* Open fabric */
    rc = fi_fabric(info->fabric_attr, &fabric, NULL);

    /* Open domain */
    rc = fi_domain(fabric, entry.info, &dom, NULL);

    /* Open end point */
    rc = fi_endpoint(dom, entry.info, &ep, NULL);
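Before data transfer the endpoint still needs a completion queue bound to it and must be enabled; a minimal sketch of those steps (assumed here, not shown on the slide), reusing dom and ep from above:

    struct fid_cq *cq;
    struct fi_cq_attr cq_attr;

    memset(&cq_attr, 0, sizeof(cq_attr));
    cq_attr.format = FI_CQ_FORMAT_CONTEXT;            /* report the user context per completion */

    rc = fi_cq_open(dom, &cq_attr, &cq, NULL);        /* open a completion queue on the domain */
    rc = fi_ep_bind(ep, &cq->fid, FI_SEND | FI_RECV); /* use it for both transmit and receive */
    rc = fi_enable(ep);                               /* endpoint is now ready for data-transfer calls */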

31. Libfabrics: Memory Registration
• Open fabric / domain and create EQ and EPs to end nodes
  – Connection establishment is abstracted out using the connection management APIs (fi_cm): fi_listen, fi_connect, fi_accept
  – A fabric provider can implement them with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication
• Register memory

    int fi_mr_reg(struct fid_domain *domain, const void *buf, size_t len,
                  uint64_t access, uint64_t offset, uint64_t requested_key,
                  uint64_t flags, struct fid_mr **mr, void *context);

    rc = fi_mr_reg(domain, buffer, size, FI_SEND | FI_RECV,
                   0, 0, 0, &mr, NULL);
    rc = fi_mr_reg(domain, buffer, size, FI_REMOTE_READ | FI_REMOTE_WRITE,
                   0, user_key, 0, &mr, NULL);

• Permissions can be set as needed

32. Libfabrics: Post Receive (Channel Semantics)
• Prepare and post a receive request

    ssize_t fi_recv(struct fid_ep *ep, void *buf, size_t len,
                    void *desc, fi_addr_t src_addr, void *context);
    – For connected EPs

    ssize_t fi_recvmsg(struct fid_ep *ep, const struct fi_msg *msg,
                       uint64_t flags);
    – For connected and un-connected EPs

    struct fid_ep *ep;
    struct fid_mr *mr;

    /* Post recv request */
    rc = fi_recv(ep, buf, size, fi_mr_desc(mr), 0,
                 (void *)(uintptr_t)RECV_WCID);

33. Libfabrics: Post Send (Channel Semantics)
• Prepare and post a send descriptor

    ssize_t fi_send(struct fid_ep *ep, void *buf, size_t len,
                    void *desc, fi_addr_t dest_addr, void *context);
    – For connected EPs

    ssize_t fi_sendmsg(struct fid_ep *ep, const struct fi_msg *msg,
                       uint64_t flags);
    – For connected and un-connected EPs

    ssize_t fi_inject(struct fid_ep *ep, void *buf, size_t len,
                      fi_addr_t dest_addr);
    – Buffer is available for re-use as soon as the function returns
    – No completion event generated for the send

    struct fid_ep *ep;
    struct fid_mr *mr;
    static fi_addr_t remote_fi_addr;

    rc = fi_send(ep, buf, size, fi_mr_desc(mr), 0,
                 (void *)(uintptr_t)SEND_WCID);
    rc = fi_inject(ep, buf, size, remote_fi_addr);

34. Libfabrics: Post Remote Read (Memory Semantics)
• Prepare and post a remote read request

    ssize_t fi_read(struct fid_ep *ep, void *buf, size_t len,
                    void *desc, fi_addr_t src_addr, uint64_t addr,
                    uint64_t key, void *context);
    – For connected EPs

    ssize_t fi_readmsg(struct fid_ep *ep, const struct fi_msg_rma *msg,
                       uint64_t flags);
    – For connected and un-connected EPs

    struct fid_ep *ep;
    struct fid_mr *mr;
    struct fi_context fi_ctx_read;

    /* Post remote read request */
    ret = fi_read(ep, buf, size, fi_mr_desc(mr), local_addr,
                  remote_addr, remote_key, &fi_ctx_read);

35. Libfabrics: Post Remote Write (Memory Semantics)
• Prepare and post a remote write request

    ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len,
                     void *desc, fi_addr_t dest_addr, uint64_t addr,
                     uint64_t key, void *context);
    – For connected EPs

    ssize_t fi_writemsg(struct fid_ep *ep, const struct fi_msg_rma *msg,
                        uint64_t flags);
    – For connected and un-connected EPs

    ssize_t fi_inject_write(struct fid_ep *ep, const void *buf, size_t len,
                            fi_addr_t dest_addr, uint64_t addr, uint64_t key);
    – Buffer is available for re-use as soon as the function returns
    – No completion event generated for the write

    ssize_t fi_writedata(struct fid_ep *ep, const void *buf, size_t len,
                         void *desc, uint64_t data, fi_addr_t dest_addr,
                         uint64_t addr, uint64_t key, void *context);
    – Similar to fi_write
    – Allows sending remote CQ data along with the write
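All of the data-transfer calls above complete asynchronously; a minimal sketch (assumed, not from the slides) of reaping completions from the CQ the endpoint was bound to (cq in the earlier sketch):

    struct fi_cq_entry entry;
    ssize_t n;

    do {
        n = fi_cq_read(cq, &entry, 1);       /* non-blocking read of up to one completion */
    } while (n == -FI_EAGAIN);               /* nothing has completed yet; poll again */

    if (n < 0) {
        struct fi_cq_err_entry err;
        fi_cq_readerr(cq, &err, 0);          /* fetch details of the failed operation */
    }
    /* fi_cq_sread() is the blocking alternative used in the connection-model slides */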

36. Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A

37. Network Management Infrastructure and Tools
• Management Infrastructure
  – Subnet Manager
  – Diagnostic tools
    • System Discovery Tools
    • System Health Monitoring Tools
    • System Performance Monitoring Tools
  – Fabric management tools

38. Concepts in IB Management
• Agents
  – Processes or hardware units running on each adapter, switch, and router (everything on the network)
  – Provide the capability to query and set parameters
• Managers
  – Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
  – Used for interactions between the manager and agents (or between agents)
• Messages

39. InfiniBand Management
• All IB management happens using packets called Management Datagrams
  – Popularly referred to as "MAD packets"
• Four major classes of management mechanisms
  – Subnet Management
  – Subnet Administration
  – Communication Management
  – General Services

40. Subnet Management & Administration
• Consists of at least one subnet manager (SM) and several subnet management agents (SMAs)
  – Each adapter, switch, and router has an agent running
  – Communication between the SM and agents (or between agents) happens using MAD packets called Subnet Management Packets (SMPs)
• SM's responsibilities include:
  – Discovering the physical topology of the subnet
  – Assigning LIDs to the end nodes, switches and routers
  – Populating switches and routers with routing paths
  – Subnet sweeps to discover topology changes

41. Subnet Manager
Figure: a subnet with active and inactive links; compute nodes send multicast-join requests to the subnet manager, which performs the multicast setup on the switches

42. Subnet Manager Sweep Behavior
• The SM can be configured to sweep once or continuously
• On the first sweep:
  – All ports are assigned LIDs
  – All routes are set up on the switches
• On subsequent sweeps:
  – If there has been any change to the topology, the appropriate routes are updated
  – If DLID X is down, the packet is not sent all the way
    • The first hop will not have a forwarding entry for LID X
• Sweep time is configured by the system administrator
  – Cannot be too high or too low

43. Subnet Manager Scalability Issues
• A single subnet manager has issues on large systems
  – Performance and overhead of scanning
  – Hardware implementations on switches are faster, but work only for small systems (memory usage)
  – Software implementations are more popular (OpenSM)
• Multi-SM models
  – Two benefits: fault tolerance (if one SM dies) and scalability (different SMs can handle different portions of the network)
  – Current SMs only provide a fault-tolerance model
  – Network subsetting is still being investigated
• Asynchronous events specified to improve scalability
  – E.g., TRAPs are events sent by an agent to the SM when a link goes down

44. Multicast Group Management
• Creation, joining/leaving, and deletion of multicast groups occur as SA requests
  – The requesting node sends a request to an SA
  – The SA sends MAD packets to SMAs on the switches to set up routes for the multicast packets
    • Each switch contains information on which ports to forward the multicast packet to
• Multicast itself does not go through the subnet manager
  – Only the setup and teardown go through the SM

45. Network Management Infrastructure and Tools
• Management Infrastructure
  – Subnet Manager
  – Diagnostic tools
    • System Discovery Tools
    • System Health Monitoring Tools
    • System Performance Monitoring Tools
  – Fabric management tools

46. Tools to Analyze InfiniBand Networks
• Different types of tools exist:
  – High-level tools that internally talk to the subnet manager using management datagrams
  – Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters
• Possible to write your own tools based on the management datagram interface
  – Several vendors provide such IB management tools

47. Network Discovery Tools
• Starting with almost no knowledge about the system, we can identify several details of the network configuration
• Example tools include:
  – ibstatus: shows adapter status
  – smpquery: SMP query tool
  – perfquery: reports performance/error counters of a port
  – ibportstate: shows the status of an IB port; can enable/disable the port
  – ibhosts: finds all the network adapters in the system
  – ibswitches: finds all the network switches in the system
  – ibnetdiscover: finds the connectivity between the ports
  – ... and many others exist
• Possible to write your own tools based on the management datagram interface
  – Several vendors provide such IB management tools

48. Health and Performance Monitoring Tools
• Several tools exist to monitor the health and performance of the InfiniBand network
  – Example health monitoring tools include:
    • ibdiagnet: queries overall fabric health
    • ibportstate: identifies the state and link speed of an InfiniBand port
    • ibdatacounts: gets InfiniBand port data counters
  – Example performance monitoring tools include:
    • ibv_send_lat, ibv_write_lat: IB verbs level performance tests
    • perfquery: queries performance counters in the IB HCA

49. Tools for Network Switching and Routing

    % ibroute -G 0x66a000700067c
    Lid    Out Port   Destination / Info
    0x0001 001      : (Channel Adapter portguid 0x0002c9030001e3f3: 'HCA-1')
    0x0002 013      : (Channel Adapter portguid 0x0002c9020023c301: 'HCA-1')
    0x0003 014      : (Channel Adapter portguid 0x0002c9030001e603: 'HCA-1')
    0x0004 015      : (Channel Adapter portguid 0x0002c9020023c305: 'HCA-2')
    0x0005 016      : (Channel Adapter portguid 0x0011750000ffe005: 'HCA-1')
    0x0014 017      : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 8, Chip A')
    0x0015 020      : (Channel Adapter portguid 0x0002c9020023c131: 'HCA-2')
    0x0016 019      : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 10, Chip A')
    0x0017 019      : (Channel Adapter portguid 0x0002c9030001c937: 'HCA-1')
    0x0018 019      : (Channel Adapter portguid 0x0002c9020023c039: 'HCA-2')
    ...

• Example: packets to LID 0x0001 will be sent out on Port 001

50. Static Analysis of Network Contention
Figure: a fat-tree with spine blocks and leaf blocks, showing the paths taken by several communicating node pairs through the switch
• Based on destination LIDs and switching/routing information, the exact path of the packets can be identified
  – If the application communication pattern is known, we can statically identify possible network contention

51. Dynamic Analysis of Network Contention
• IB provides many optional performance counters that can be queried:
  – PortXmitWait: number of ticks in which there was data to send, but no flow-control credits
  – RNR NAKs: number of times a message was sent, but the receiver had not yet posted a receive buffer
    • This can time out, so it can be an error in some cases
  – PortXmitFlowPkts: number of (link-level) flow-control packets transmitted on the port
  – SWPortVLCongestion: number of packets dropped due to congestion

52. Network Management Infrastructure and Tools
• Management Infrastructure
  – Subnet Manager
  – Diagnostic tools
    • System Discovery Tools
    • System Health Monitoring Tools
    • System Performance Monitoring Tools
  – Fabric management tools

53. In-band Management vs. Out-of-band Management
• InfiniBand provides two forms of management
  – Out-of-band management (similar to other networks)
  – In-band management (used by the subnet manager)
• Out-of-band management requires a separate Ethernet port on the switch, where an administrator can plug in a laptop (Ethernet connectivity)
• In-band management allows the switch to receive management commands directly over the regular InfiniBand communication network (InfiniBand connectivity)

54. Overview of OSU INAM
• A network monitoring and analysis tool capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• OSU INAM v0.9.2 released on 10/31/2017
  – Significant enhancements to the user interface to enable scaling to clusters with thousands of nodes
  – Improved database insert times by using 'bulk inserts'
  – Capability to look up the list of nodes communicating through a network link
  – Capability to classify data flowing over a network link at job-level and process-level granularity in conjunction with MVAPICH2-X 2.3b
  – "Best practices" guidelines for deploying OSU INAM on different clusters
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication
  – Point-to-point, collectives and RMA
• Ability to filter data based on type of counters using a "drop down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes

55. OSU INAM Features
Figures: clustered view of Comet@SDSC (1,879 nodes, 212 switches, 4,377 network links) and finding routes between nodes
• Show the network topology of large clusters
• Visualize traffic patterns on different links
• Quickly identify congested links / links in an error state
• See the history unfold – play back the historical state of the network

56. OSU INAM Features (Cont.)
Figures: visualizing a job (5 nodes) and estimated process-level link utilization
• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view – details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g., XmitDiscard, RcvError) per rank/node
• Estimated Link Utilization view
  – Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1
    • Job level and process level

57. Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A

58. Common Challenges for Large-Scale Installations
• Adapters and interactions
  – I/O bus
  – Multi-port adapters
  – NUMA
• Switches
  – Topologies
  – Switching / routing
• Bridges
  – IB interoperability

59. Common Challenges in Building HEC Systems with IB and HSE
• Network adapters and interactions with other components
  – I/O bus interactions and limitations
  – Multi-port adapters and bottlenecks
  – NUMA interactions
• Network switches
• Network bridges

60. I/O Bus Limitations
Figure: two processors (P0, P1), each with four cores and local memory, connected over the I/O bus to the network adapter and then to the network switch
• Data communication traverses three buses (or links) before it reaches the network switch
  – Memory bus (memory to I/O hub)
  – I/O link (I/O hub to the network adapter)
  – Network link (network adapter to switch)
• For optimal communication, all of these need to be balanced
• Network bandwidth:
  – 4X SDR (8 Gbps), 4X DDR (16 Gbps), 4X QDR (32 Gbps), 4X FDR (56 Gbps), 4X EDR (100 Gbps) and 4X HDR (200 Gbps)
  – 40 GigE (40 Gbps)
• I/O link bandwidth:
  – Tricky because several aspects need to be considered
    • Connector capacity vs. link capacity
    • I/O link communication headers, etc.
• Memory bandwidth:
  – Shared bandwidth (incoming and outgoing)
  – For IB FDR (56 Gbps), memory bandwidth greater than 112 Gbps is required to fully utilize the network

61. PCI Express
• Common I/O interconnect used on most current platforms
  – Can be configured with multiple lanes (1X, 4X, 8X, 16X, 32X)
    • Generation 1 provides 2 Gbps of bandwidth per lane, Gen 2 provides 4 Gbps, and Gen 3 provides 8 Gbps per lane
  – Compatible with adapters using fewer lanes (beware)
    • If a PCIe connector is 16X, it will still support an 8X adapter by using only 8 lanes
  – Provides multiplexing across a single lane (beware)
    • A 1X PCIe bus can be connected to an 8X PCIe connector (allowing an 8X adapter to be plugged in)
• I/O interconnects are like networks with packetized communication (beware)
  – Communication headers for each packet
  – Reliability acknowledgments
  – Flow-control acknowledgments
  – Typical usable I/O bandwidth efficiency is around 75-80% with 256-byte PCIe packets

62. Multi-port Adapters
• Several multi-port adapters are available in the market
  – A single adapter can drive multiple network ports at full bandwidth
  – Important to measure other overheads (memory bandwidth and I/O link bandwidth) before assuming a performance benefit
• Case study: IB dual-port 4x QDR adapter
  – Each network link is 32 Gbps (dual-port adapters can drive 64 Gbps)
  – A PCIe Gen2 8X link can give a 32 Gbps data rate -> around 24 Gbps effective rate (20% encoding overheads!!)
    • Dual-port IB QDR is not expected to give any benefit in this case
  – A PCIe Gen3 8X link can give a 64 Gbps data rate -> 64 Gbps (minimal encoding overheads)
    • Delivers close to peak performance with dual-port IB adapters
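A hedged sketch of the arithmetic behind the case study (approximate figures only; it combines 8b/10b and 128b/130b encoding with the ~75-80% packet-level efficiency quoted on the PCI Express slide, which is one plausible reading of the 24 Gbps number):

    #include <stdio.h>

    int main(void)
    {
        /* PCIe Gen2 x8: 8 lanes x 5 GT/s, 8b/10b encoding, ~75% packet efficiency */
        double gen2_data = 8 * 5.0 * 8.0 / 10.0;    /* = 32 Gbps data rate        */
        double gen2_eff  = gen2_data * 0.75;        /* ~= 24 Gbps effective rate  */

        /* PCIe Gen3 x8: 8 lanes x 8 GT/s, 128b/130b encoding (minimal overhead)  */
        double gen3_data = 8 * 8.0 * 128.0 / 130.0; /* ~= 63 Gbps data rate       */

        printf("Gen2 x8: %.0f Gbps data, ~%.0f Gbps effective -> below dual-port QDR (64 Gbps)\n",
               gen2_data, gen2_eff);
        printf("Gen3 x8: ~%.0f Gbps data -> close to the dual-port QDR peak\n", gen3_data);
        return 0;
    }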

63. Common Challenges in Building HEC Systems with IB and HSE
• Network adapters and interactions with other components
  – I/O bus interactions and limitations
  – Multi-port adapters and bottlenecks
  – NUMA interactions
• Network switches
• Network bridges

64. NUMA Interactions
Figure: a four-socket NUMA platform (four cores and local memory per socket) connected via QPI or HT; the network card hangs off the PCIe bus of Socket 0
• Different cores in a NUMA platform have different communication costs

65. Impact of NUMA on Inter-node Latency
Figure: send latency (us) vs. message size (2 bytes to 2 KB) for core pairs 0->0 and 7->7 (Socket 0) and 14->14 and 27->27 (Socket 1)
• Cores in Socket 0 (closest to the network card) have the lowest latency
• Cores in Socket 1 (one hop from the network card) have the highest latency
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches

66. Impact of NUMA on Inter-node Bandwidth
Figure: send bandwidth (MBps) vs. message size on Intel Broadwell (cores 0, 7, 14, 27) and AMD MagnyCours (cores 0, 6, 12, 18)
• NUMA interactions have a significant impact on bandwidth
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches
ConnectX-2 QDR (36 Gbps): 2.5 GHz hex-core (MagnyCours) AMD with IB (QDR) switches

67. Common Challenges in Building HEC Systems with IB and HSE
• Network adapters and interactions with other components
  – I/O bus interactions and limitations
  – Multi-port adapters and bottlenecks
  – NUMA interactions
• Network switches
• Network bridges

68. Common Challenges in Building HEC Systems with IB and HSE
• Network adapters and interactions with other components
• Network switches
  – Switch topologies
  – Switching and routing
• Network bridges

69. Switch Topologies
• InfiniBand installations come in multiple topologies
  – Single crossbar switches (up to 36 ports for QDR or FDR)
    • Applicable only to very small systems (hard to scale to large clusters)
  – Fat-tree topologies (medium-scale topologies)
    • Provide full bisection bandwidth: given independent communication between processes, you can find a switch configuration that provides fully non-blocking paths (though the same configuration might have contention if the communication pattern changes)
    • Issue: the number of switch components increases super-linearly with the number of nodes (not scalable for large-scale systems)
  – Large-scale installations can use more conservative topologies
    • Partial fat-tree topologies (over-provisioning)
    • 3D Torus (Sandia Red Sky and SDSC Gordon), Hypercube (SGI Altix), and 10D Hypercube (NASA Pleiades) topologies

70. Switch Topology: Absolute Performance vs. Scalability
Figure: a crossbar ASIC (all-to-all connectivity); a full fat-tree topology built from spine and leaf blocks (full bisection bandwidth, super-linear scaling of switch components); a partial fat-tree topology where only a few spine links are connected (reduced inter-switch connectivity for more out-ports); and a torus/hypercube topology (linear scaling of switch components, but slower than a full fat-tree topology)

71. Static Routing in IB + Adaptive Routing Models from Qlogic (Intel) and Mellanox
• The IB standard only supports static routing
  – Not scalable for large systems where traffic might be non-deterministic, causing hot-spots
• Next-generation IB switches support adaptive routing (in addition to static routing): outside the IB standard
• Qlogic (Intel) support for adaptive routing
  – Continually monitors application messaging patterns and selects the optimum path for each traffic flow, eliminating slowdowns caused by pathway bottlenecks
  – Dispersive routing load-balances traffic among multiple pathways
  – http://ir.qlogic.com/phoenix.zhtml?c=85695&p=irol-newsarticle&id=1428788
• Mellanox support for adaptive routing
  – Supports moving traffic via multiple parallel paths
  – Dynamically and automatically re-routes traffic to alleviate congested ports
  – http://www.mellanox.com/related-docs/prod_silicon/PB_InfiniScale_IV.pdf

72. Common Challenges in Building HEC Systems with IB and HSE
• Network adapters and interactions with other components
• Network switches
• Network bridges
  – IB interoperability with Ethernet and FC

73. IB-Ethernet and IB-FC Bridging Solutions
• Mainly developed for backward compatibility with existing infrastructure
  – Ethernet over IB (EoIB)
  – Fibre Channel over IB (FCoIB)
Figure: hosts with virtual Ethernet/FC adapters over IB adapters connect through a packet convertor (e.g., Mellanox BridgeX) to an Ethernet/FC switch and hosts with native Ethernet/FC adapters

74. Ethernet/FC over IB
• Can be used in an infrastructure where a part of the nodes are connected over Ethernet or FC
  – All of the IB-connected nodes can communicate over IB
  – The same nodes can communicate with nodes in the older infrastructure using Ethernet-over-IB or FC-over-IB
• Does not have the performance benefits of IB
  – The host thinks it is using an Ethernet or FC adapter
  – For example, with Ethernet, communication will use TCP/IP
    • There is some hardware support for segmentation offload, but the rest of the IB features are unutilized
• Note that this is different from VPI, as there is only one network connectivity from the adapter

75. Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A

76. System-Specific Challenges for HPC Systems
• Common challenges
  – Adapters and interactions (I/O bus, multi-port adapters, NUMA)
  – Switches (topologies, switching / routing)
  – Bridges (IB interoperability)
• HPC
  – MPI: multi-rail, collectives, scalability, application scalability, energy awareness
  – PGAS: programmability with performance, optimized resource utilization
  – GPU / Xeon Phi: programmability with performance, hiding data movement costs, heterogeneity-aware design, streaming, deep learning

77. HPC System Challenges and Case Studies
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing

78. Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2, MPI-3.0, and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,850 organizations in 85 countries
  – More than 440,000 (> 0.44 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 12th, 368,928-core (Stampede2) at TACC
    • 17th, 241,108-core (Pleiades) at NASA
    • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)

79. Design Challenges and Sample Results
• Interaction with multi-rail environments
• Collective communication
• Scalability for large-scale systems
• Energy awareness

80. Impact of Multiple Rails on Inter-node MPI Bandwidth
Figure: bandwidth (MBytes/sec) vs. message size (1 byte to 1 MB) for 1, 2, 4, 8 and 16 pairs, with single-rail and dual-rail configurations
ConnectX-4 EDR (100 Gbps): 2.4 GHz deca-core (Haswell) Intel with IB (EDR) switches
Designs based on: S. Sur, M. J. Koop, L. Chai and D. K. Panda, "Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms", IEEE Hot Interconnects, 2007

81. Hardware Multicast-aware MPI_Bcast on Stampede
Figure: latency (us) of Default vs. Multicast MPI_Bcast for small and large messages (102,400 cores), and for a 16-byte and a 32 KByte message as the number of nodes increases
ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel, PCI Gen3, with Mellanox IB FDR switch

82. Hardware Multicast-aware MPI_Bcast on Broadwell + EDR
Figure: latency (us) of Default vs. Multicast MPI_Bcast for small and large messages (1,120 cores), and for a 16-byte and a 32 KByte message as the number of nodes increases
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with Mellanox IB (EDR) switches

83. Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders
Figure: OSU micro-benchmark latency (us) vs. message size on 16 nodes, 28 PPN (23% improvement) and HPCG communication latency (seconds) vs. number of processes at 28 PPN (40% improvement), comparing MVAPICH2, the proposed socket-based design, and MVAPICH2+SHArP (lower is better)
• The socket-based design can reduce communication latency by 23% and 40% on Xeon + IB nodes
• Support is available in MVAPICH2 2.3a and MVAPICH2-X 2.3b
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, "Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design", Supercomputing '17

84. Performance of MPI_Allreduce on Stampede2 (10,240 Processes)
Figure: OSU micro-benchmark latency (us) vs. message size (4 bytes to 256 KB) at 64 PPN, comparing MVAPICH2, MVAPICH2-OPT and IMPI (up to 2.4X improvement)
• MPI_Allreduce latency with 32K bytes is reduced by 2.4X
