CompSci 514: Computer Networks Lecture 17: Network Support for Remote Direct Memory Access Xiaowei Yang Some slides adapted from http://www.cs.unh.edu/~rdr/rdma-intro-module.ppt
Overview • Introduction to RDMA • DCQCN: congestion control for large-scale RDMA deployments • Experience of deploying RDMA in a large-scale datacenter network
What is RDMA? • A (relatively) new method for high-speed inter-machine communication – new standards – new protocols – new hardware interface cards and switches – new software
Remote Direct Memory Access • Read, write, send, receive, etc. do not go through the CPU
• Two machines (Intel Xeon E5-2660 2.2 GHz, 16 cores, 128 GB RAM, 40 Gbps NICs, Windows Server 2012 R2) connected via a 40 Gbps switch.
Remote Direct Memory Access • Remote – data transfers between nodes in a network • Direct – no operating system kernel involvement in transfers – everything about a transfer is offloaded onto the interface card • Memory – transfers between user-space application virtual memory – no extra copying or buffering • Access – send, receive, read, write, atomic operations
RDMA Benefits • High throughput • Low latency • High messaging rate • Low CPU utilization • Low memory bus contention • Message boundaries preserved • Asynchronous operation
RDMA Technologies • InfiniBand – (41.8% of top 500 supercomputers) – SDR 4x – 8 Gbps – DDR 4x – 16 Gbps – QDR 4x – 32 Gbps – FDR 4x – 54 Gbps • iWARP – Internet Wide Area RDMA Protocol – 10 Gbps • RoCE – RDMA over Converged Ethernet – 10 Gbps – 40 Gbps
RDMA architecture layers
Software RDMA Drivers • Softiwarp – www.zurich.ibm.com/sys/rdma – open-source kernel module that implements the iWARP protocols on top of ordinary kernel TCP sockets – interoperates with hardware iWARP at the other end of the wire • Soft RoCE – www.systemfabricworks.com/downloads/roce – open-source IB transport and network layers in software over ordinary Ethernet – interoperates with hardware RoCE at the other end of the wire
Similarities between TCP and RDMA • Both utilize the client-server model • Both require a connection for reliable transport • Both provide a reliable transport mode – TCP provides a reliable in-order sequence of bytes – RDMA provides a reliable in-order sequence of messages
How RDMA differs from TCP/IP • “zero copy” – data is transferred directly from virtual memory on one node to virtual memory on another node • “kernel bypass” – no operating system involvement during data transfers • asynchronous operation – threads are not blocked during I/O transfers
TCP/IP setup [Diagram: the client calls connect(); the server calls bind(), listen(), accept(); all control passes through the kernel TCP stack to the channel adapter (CA) and the wire. Blue lines: control information; red lines: user data; green lines: control and data]
RDMA setup [Diagram: the client calls rdma_connect(); the server calls rdma_bind(), rdma_listen(), rdma_accept(); the kernel is involved only during connection setup]
TCP/IP transfer [Diagram: send() and recv() copy user data into and out of kernel buffers on both sides before it reaches the wire]
RDMA transfer [Diagram: rdma_post_send() and rdma_post_recv() hand user buffers directly to the channel adapter; data moves between user virtual memory and the wire with no kernel involvement or copies]
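To make the RDMA setup sequence concrete, here is a minimal, hedged sketch of the client side using the librdmacm API (rdma_create_event_channel, rdma_create_id, rdma_resolve_addr, rdma_resolve_route, rdma_create_qp, and rdma_connect are real librdmacm calls; the server address/port, queue-pair capacities, and omitted error handling are illustrative, not from the slides):

    /* Sketch: client-side RDMA connection setup with librdmacm.
     * Error handling omitted; address and QP parameters are illustrative. */
    #include <rdma/rdma_cma.h>
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *id;
        struct rdma_cm_event *event;
        struct addrinfo *addr;

        rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);      /* reliable-connection service */
        getaddrinfo("192.0.2.1", "7471", NULL, &addr);   /* illustrative server address/port */

        /* Resolve the destination address and route (asynchronous; events arrive on ec). */
        rdma_resolve_addr(id, NULL, addr->ai_addr, 2000 /* ms */);
        rdma_get_cm_event(ec, &event);                   /* expect RDMA_CM_EVENT_ADDR_RESOLVED */
        rdma_ack_cm_event(event);

        rdma_resolve_route(id, 2000);
        rdma_get_cm_event(ec, &event);                   /* expect RDMA_CM_EVENT_ROUTE_RESOLVED */
        rdma_ack_cm_event(event);

        /* Minimal queue pair; NULL pd lets librdmacm use a default protection domain. */
        struct ibv_qp_init_attr qp_attr;
        memset(&qp_attr, 0, sizeof(qp_attr));
        qp_attr.cap.max_send_wr  = 1;
        qp_attr.cap.max_recv_wr  = 1;
        qp_attr.cap.max_send_sge = 1;
        qp_attr.cap.max_recv_sge = 1;
        qp_attr.qp_type = IBV_QPT_RC;
        rdma_create_qp(id, NULL, &qp_attr);

        struct rdma_conn_param param;
        memset(&param, 0, sizeof(param));
        rdma_connect(id, &param);                        /* the rdma_connect step in the diagram */
        rdma_get_cm_event(ec, &event);                   /* expect RDMA_CM_EVENT_ESTABLISHED */
        rdma_ack_cm_event(event);

        printf("connected\n");
        rdma_disconnect(id);
        rdma_destroy_qp(id);
        rdma_destroy_id(id);
        rdma_destroy_event_channel(ec);
        freeaddrinfo(addr);
        return 0;
    }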
“Normal” TCP/IP socket access model • Byte streams – requires the application to delimit / recover message boundaries • Synchronous – blocks until data is sent/received – O_NONBLOCK and MSG_DONTWAIT are not asynchronous; they are “try” and “try again” • send() and recv() are paired – both sides must participate in the transfer • Requires a data copy into system buffers – order and timing of send() and recv() are irrelevant – user memory is accessible immediately before and immediately after each send() and recv() call
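For contrast, a minimal sketch of the framing a TCP application must do itself because the byte stream does not preserve message boundaries; the 4-byte length prefix is an assumed application-level convention, not part of TCP:

    /* Sketch: receive one length-prefixed message over a blocking TCP socket. */
    #include <stdint.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>

    static int recv_all(int fd, void *buf, size_t len)
    {
        char *p = buf;
        while (len > 0) {
            ssize_t n = recv(fd, p, len, 0);   /* blocks until some bytes arrive */
            if (n <= 0)
                return -1;                     /* error or peer closed the connection */
            p += n;
            len -= (size_t)n;
        }
        return 0;
    }

    int recv_message(int fd, char *buf, uint32_t maxlen, uint32_t *msglen)
    {
        uint32_t netlen;
        if (recv_all(fd, &netlen, sizeof(netlen)) < 0)
            return -1;
        *msglen = ntohl(netlen);               /* application recovers the boundary itself */
        if (*msglen > maxlen)
            return -1;
        return recv_all(fd, buf, *msglen);
    }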
TCP RECV() [Diagram: recv() enters the operating system, which allocates metadata, adds it to tables, and blocks the thread; the NIC delivers data packets into kernel TCP buffers and ACKs them; the OS copies the data into user virtual memory and then wakes the thread with status]
RDMA RECV() [Diagram: the application registers memory and posts recv metadata to the recv queue; in parallel, the channel adapter places arriving data packets directly into user virtual memory and ACKs them, then posts status to the completion queue, which the application retrieves with poll_cq()]
RDMA access model • Messages – preserves the user's message boundaries • Asynchronous – no blocking during a transfer, which – starts when metadata is added to the work queue – finishes when status is available in the completion queue • 1-sided (unpaired) and 2-sided (paired) transfers • No data copying into system buffers – order and timing of send() and recv() are relevant • recv() must be posted before the matching send() is issued – memory involved in a transfer is untouchable between the start and completion of the transfer
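A hedged sketch of the receive side of this model using the verbs API (ibv_reg_mr, ibv_post_recv, and ibv_poll_cq are real libibverbs calls; the protection domain pd, queue pair qp, and completion queue cq are assumed to exist from connection setup, and busy-polling is used for simplicity):

    /* Sketch: register memory, post a receive work request, poll for completion. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_and_wait(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                      char *buf, size_t len)
    {
        /* Register the user buffer so the adapter may DMA into it directly. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = 1;                /* application-chosen identifier */
        wr.sg_list = &sge;
        wr.num_sge = 1;

        /* The recv must be posted before the remote side issues its send. */
        if (ibv_post_recv(qp, &wr, &bad_wr))
            return -1;

        /* Asynchronous completion: poll the completion queue for status. */
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);

        ibv_dereg_mr(mr);
        return (n == 1 && wc.status == IBV_WC_SUCCESS) ? (int)wc.byte_len : -1;
    }

Note that between ibv_post_recv() and the successful poll, the buffer belongs to the adapter and must not be touched, exactly as the slide states.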
Congestion Control for Large-Scale RDMA Deployments By Yibo Zhu et al.
Problem • RDMA requires a lossless data link layer • Ethernet is not lossless • Solution → RDMA over Converged Ethernet (RoCE)
RoCE details • Priority-based Flow Control (PFC) – When an ingress queue fills past a threshold, send PAUSE to the upstream sender – When the queue drains, send RESUME
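A simplified sketch of the per-priority PAUSE/RESUME decision a PFC-enabled switch port makes; the XOFF/XON threshold names and values are illustrative, not from the slides:

    /* Sketch: per-{ingress port, priority} PFC logic with hysteresis. */
    #include <stdbool.h>

    #define XOFF_CELLS 512   /* queue depth above which we pause the upstream sender */
    #define XON_CELLS  384   /* queue depth below which we resume it */

    struct pfc_state {
        int  queue_cells;    /* buffered cells for this ingress port + priority */
        bool paused;
    };

    /* Called whenever the ingress queue depth changes; returns the frame to emit. */
    const char *pfc_update(struct pfc_state *s)
    {
        if (!s->paused && s->queue_cells >= XOFF_CELLS) {
            s->paused = true;
            return "PAUSE";          /* per-port, per-priority: every flow behind it stops */
        }
        if (s->paused && s->queue_cells <= XON_CELLS) {
            s->paused = false;
            return "RESUME";
        }
        return NULL;                 /* no frame needed */
    }

Because the decision is keyed only on the ingress port and priority, one congested flow pauses every flow sharing that port, which is the root of the unfairness and head-of-line blocking discussed next.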
Problems with PFC • Per-port, not per flow • Unfairness: port-fair, not flow-fair • Collateral damage: head-of-line blocking for some flows
Experimental topology
Unfairness • H1-H4 write to R • H4 has no contention at port P2 • H1, H2, and H3 have contention on P3 and P4
Head-of-line blocking • VS → VR • H11-H14, H31-H32 → R • T4 is congested and sends PAUSE messages • T1 pauses all its incoming links regardless of their destinations
Solution • Per-flow congestion control • Existing work: – QCN (Quantized Congestion Notification) • Uses Ethernet SRC/DST addresses and a flow ID to define a flow • Switch sends congestion notifications to the sender based on the source MAC address • Only works at L2 • This work: DCQCN – Works for IP-routed networks
Why QCN does not work for IP-routed networks • The same packet carries different SRC/DST MAC addresses on each L3 hop, so a congested switch cannot identify the flow's original sender by MAC address.
DCQCN • DCQCN is a rate-based, end-to-end congestion control protocol • Most of the DCQCN functionality is implemented in the NICs
High-level ideas • Switches ECN-mark packets at the egress queue • The receiver sends Congestion Notification Packets (CNPs) to the sender • The sender reduces its sending rate
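A hedged sketch of the sender-side (reaction point) rate cut described in the DCQCN paper: on each CNP the current rate is remembered as a target and cut in proportion to alpha, an EWMA estimate of how often packets are being marked, with gain g. The constants here are illustrative, and the rate-increase phases (fast recovery, additive increase, hyper increase) are omitted:

    /* Sketch: DCQCN reaction-point rate decrease. Increase phases omitted. */
    #define LINE_RATE 40e9        /* bits per second, illustrative */
    #define G         (1.0 / 16.0)

    struct dcqcn_flow {
        double rc;     /* current sending rate */
        double rt;     /* target rate, used later by the increase phases */
        double alpha;  /* estimate of the fraction of ECN-marked packets */
    };

    void dcqcn_init(struct dcqcn_flow *f)
    {
        f->rc = LINE_RATE;
        f->rt = LINE_RATE;
        f->alpha = 1.0;
    }

    /* Called when a Congestion Notification Packet (CNP) arrives from the receiver. */
    void dcqcn_on_cnp(struct dcqcn_flow *f)
    {
        f->rt = f->rc;                          /* remember where we were */
        f->rc = f->rc * (1.0 - f->alpha / 2.0); /* multiplicative decrease */
        f->alpha = (1.0 - G) * f->alpha + G;    /* congestion looks worse: raise alpha */
    }

    /* Called by a timer when no CNP has arrived for one update period. */
    void dcqcn_on_quiet_period(struct dcqcn_flow *f)
    {
        f->alpha = (1.0 - G) * f->alpha;        /* congestion abating: decay alpha */
    }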
Challenges • How to set buffer sizes at the egress queue • How often to send congestion notifications • How a sender should reduce its sending rate to ensure both convergence and fairness
Solutions provided by the paper • ECN marking must be triggered before PFC – use the PFC thresholds to bound the ECN marking thresholds at the egress queue • Use a fluid model to tune the congestion control parameters
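The egress marking itself is RED-like: nothing is marked below a low threshold, everything is marked above a high threshold, and the probability ramps linearly in between. A sketch with illustrative Kmin/Kmax/Pmax values (these are exactly the kind of parameters the fluid model is used to tune, together with the PFC thresholds, so that marking starts well before PFC would trigger):

    /* Sketch: RED-like ECN marking probability at the egress queue. */
    #include <stdbool.h>
    #include <stdlib.h>

    #define KMIN   5000.0   /* bytes: start marking above this depth (illustrative) */
    #define KMAX 200000.0   /* bytes: mark every packet above this depth (illustrative) */
    #define PMAX      0.01  /* marking probability just below KMAX (illustrative) */

    bool ecn_mark(double queue_bytes)
    {
        double p;
        if (queue_bytes <= KMIN)
            p = 0.0;
        else if (queue_bytes >= KMAX)
            p = 1.0;
        else
            p = PMAX * (queue_bytes - KMIN) / (KMAX - KMIN);
        return ((double)rand() / RAND_MAX) < p;   /* true = set the ECN CE bits */
    }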
RDMA over Commodity Ethernet at Scale Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn Microsoft
What this paper is about • Extending PFC to IP-routed networks • Safety issues of RDMA – Livelock – Deadlock – PFC pause-frame storm – Slow-receiver symptom • Performance observed in production networks
Livelock experiment: 4 MB messages (about 1K packets each); drop packets whose IP ID's last byte is 0xff (a 1/256 loss rate)
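The paper traces the livelock to the NIC's go-back-0 retransmission: restarting the whole message on every loss means a ~1K-packet message never completes under a deterministic 1/256 drop, whereas go-back-N makes steady progress. A hedged sketch of the difference (the loss pattern and message size below are illustrative models of the experiment, not the NIC's actual firmware logic):

    /* Sketch: why go-back-0 livelocks while go-back-N finishes. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MSG_PKTS 1000

    static bool dropped(long seq_sent)      /* deterministic 1-in-256 loss, illustrative */
    {
        return (seq_sent % 256) == 255;
    }

    static long transfer(bool go_back_zero)
    {
        long sent = 0, next = 0;            /* next packet of the message to deliver */
        while (next < MSG_PKTS && sent < 10 * 1000 * 1000) {
            bool lost = dropped(sent++);
            if (!lost) {
                next++;                     /* packet delivered, move on */
            } else if (go_back_zero) {
                next = 0;                   /* restart the whole message: livelock */
            }
            /* go-back-N: resume from the lost packet, i.e. keep 'next' as is */
        }
        return sent;                        /* packets transmitted (capped if livelocked) */
    }

    int main(void)
    {
        printf("go-back-N : %ld packets sent\n", transfer(false));
        printf("go-back-0 : %ld packets sent (hits the cap: never completes)\n",
               transfer(true));
        return 0;
    }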
S3 is dead. T1.p2 is congested; PAUSE is sent to T1.p3, La.p1, T0.p2, and S1.
S4 → S2, but S2 is dead. The blue packet is flooded to T0.p2; T0.p2 is paused. Ingress T0.p3 pauses Lb.p0; Lb.p1 pauses T1.p4; T1.p1 pauses S4.
Summary • What is RDMA • DCQCN: congestion control for RDMA • Deployment issues for RDMA