IEEE 802 Industry Connections Report: The Next Generation Lossless Network in the Data Center
BrightTalk, Data Center Transformation 3.0, January 2019
Paul Congdon, PhD
Disclaimer
⚫ All speakers presenting information on IEEE standards speak as individuals, and their views should be considered the personal views of that individual rather than the formal position, explanation, or interpretation of the IEEE.
Acknowledgements
⚫ The initial technical contribution and sponsorship for this work was provided by Huawei Technologies Co., Ltd.
⚫ This presentation summarizes work from the IEEE 802 "Network Enhancements for the Next Decade" Industry Connections Activity (Nendica), an IEEE Industry Connections Activity organized under the IEEE 802.1 Working Group.
https://1.ieee802.org/802-nendica/
Report freely available at: https://ieeexplore.ieee.org/document/8462819
Our Digital Lives are driving Innovation in the DC
Interactive speech recognition, interactive image recognition, human/machine interaction, autonomous driving.
Critical Use Case – Online Data Intensive Services (OLDI)
• OLDI applications have real-time deadlines and run in parallel on 1000s of servers.
• Incast is a naturally occurring phenomenon.
• Tail latency reduces the quality of the results.
(Figure: a request fans out from a root aggregator (deadline = 250 ms) through mid-level aggregators (deadline = 50 ms) to workers (deadline = 10 ms), and the responses converge back up the tree.)
Critical Use Case – Deep Learning
• Massively parallel HPC applications, such as AI training, are dependent on a low-latency, high-throughput network.
• Billions of parameters.
• Scale-out is limited by network performance.
(Figures: data-parallel training, with each rank feeding a dataset partition and exchanging weights via MPI Allreduce; consumed time vs. number of computing nodes, where the overall time has a "sweet spot" beyond which network time outweighs computing time.)
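The MPI Allreduce step in the figure is what couples training speed to the network. As an illustration (not taken from the report), a toy ring all-reduce shows the neighbor-to-neighbor communication pattern whose latency and throughput the fabric must sustain:

```python
def ring_allreduce(grads):
    """Toy ring all-reduce over n ranks holding equal-length gradient
    vectors. Two phases of n-1 neighbor exchanges each: scatter-reduce
    (chunks accumulate around the ring), then all-gather (completed
    chunks circulate). Every rank ends with the element-wise sum."""
    n, dim = len(grads), len(grads[0])
    assert dim % n == 0, "toy version: length must divide evenly"
    k = dim // n                               # chunk size per rank
    buf = [list(g) for g in grads]
    for s in range(n - 1):                     # phase 1: scatter-reduce
        for r in range(n):
            c, dst = (r - s) % n, (r + 1) % n  # rank r sends chunk c
            for i in range(c * k, (c + 1) * k):
                buf[dst][i] += buf[r][i]
    for s in range(n - 1):                     # phase 2: all-gather
        for r in range(n):
            c, dst = (r + 1 - s) % n, (r + 1) % n
            for i in range(c * k, (c + 1) * k):
                buf[dst][i] = buf[r][i]
    return buf
```

Each gradient element crosses the ring roughly 2(n-1)/n times, so once network time exceeds computing time, adding nodes no longer helps, which is the "sweet spot" on the slide.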
Critical Use Case – NVMe over Fabrics
• Disaggregated resource pooling technologies, such as NVMe over Fabrics, use RDMA and run over converged network infrastructure.
• Low latency and lossless operation are critical.
• Ease of deployment and cloud scale are important success factors.
Critical Use Case – Cloudification of the Central Office
• Massive growth in mobile and Internet traffic is driving Network Function Virtualization Infrastructure investment.
• To meet the performance requirements of traditional purpose-built equipment, SDN and NFV must run on low-latency, low-loss, scalable, and highly available network infrastructure.
(Figure: a traditional central office, with purpose-built base-band units, CDN, firewall, BRAS, IP telephony, VPN, and DPI serving subscribers, versus a cloudified central office running the same functions under orchestration on standard Ethernet switches and high-speed storage.)
We are dealing with massive amounts of data and computing
Requirements:
• Fast, scalable storage
• Parallel applications and data
• Cloud-ified infrastructure
(Figure: divide-and-conquer workloads, neural networks, and real-time natural human/machine response running on cloud infrastructure with high-speed networks and storage.)
Congestion Creates the Problems
Massive data, massive compute, and massive messaging rely on parallelism; parallelism can create network congestion, which leads to packet loss, latency loss, and throughput loss, making end users unhappy.
The Impact of Congestion in a Lossless Network
⚫ The impact of congestion on network performance can be very serious.
⚫ As shown by Pedro J. Garcia et al. (IEEE Micro, 2006) [1], injecting hot-spot traffic causes throughput to diminish by 70% and latency to increase by three orders of magnitude.
Network performance degrades dramatically after congestion appears.
(Figures: network throughput vs. generated traffic, and average packet latency, before and after hot-spot injection.)
[1] Garcia, Pedro Javier, et al. "Efficient, scalable congestion management for interconnection networks." IEEE Micro 26.5 (2006): 52-66.
Dealing with Congestion today
• Explicit Congestion Notification (ECN) + Priority-based Flow Control (PFC): switches mark packets with ECN to give senders end-to-end congestion feedback, while PFC pauses the upstream link per priority to avoid loss.
• ECMP (Equal-Cost MultiPath) routing hashes flows across the parallel paths of the fabric.
(Figures: ECN marking and congestion feedback across a leaf-spine fabric; PFC pausing an upstream hop; ECMP spreading flows over equal-cost paths.)
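The path-selection half of this picture can be sketched in a few lines. This is an illustration of the general ECMP idea, not any particular switch's hash; the flows shown are hypothetical:

```python
import hashlib

def ecmp_path(five_tuple, n_paths):
    """Hash a flow's 5-tuple to one of n equal-cost uplinks, as an
    ECMP switch does. All packets of a flow take the same path, so
    no reordering occurs, but distinct flows can collide on a link."""
    key = "|".join(map(str, five_tuple)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port).
flow = ("10.0.0.1", "10.0.1.1", 6, 49152, 4791)
assert ecmp_path(flow, 4) == ecmp_path(flow, 4)  # stable per flow
```

Because the mapping is per-flow and oblivious to load, it avoids reordering at the cost of possible collisions, which is exactly the weakness the next slide and LPS address.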
Ongoing challenges with congestion
• ECN control-loop delay: congestion feedback must travel back to the sender and take effect, and queues keep growing in the meantime.
• Head-of-line blocking (HOLB): a PFC pause stops every flow in the paused priority, so victim flows are blocked behind the congested one.
• ECMP collisions: flow hashing can place multiple large flows on the same link while other links stay idle (in the figure, 30G flows colliding on 40G links are reduced to 15G).
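How likely are ECMP collisions in practice? A birthday-bound calculation (an illustration, not from the report, with hypothetical fabric sizes) shows that even a handful of elephant flows collide more often than not:

```python
import math

def p_collision(flows, links):
    """Probability that hashing `flows` flows uniformly onto `links`
    equal-cost links places at least two on the same link (birthday
    bound); a rough model of ECMP elephant-flow collisions."""
    return 1 - math.perm(links, flows) / links ** flows

# Two flows over two links collide half the time; four elephant
# flows over eight uplinks still collide with probability ~0.59.
assert p_collision(2, 2) == 0.5
assert p_collision(4, 8) > 0.5
```

This is why stateless hashing alone cannot keep a fabric balanced, motivating the load-aware approaches that follow.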
Potential New Lossless Technologies for the Data Center
Goal = No Loss:
⚫ No packet loss
⚫ No latency loss
⚫ No throughput loss
Solutions:
⚫ Virtual Input Queuing (VIQ)
⚫ Dynamic Virtual Lanes (DVL)
⚫ Load-Aware Packet Spraying (LPS)
⚫ Push & Pull Hybrid Scheduling (PPH)
VIQ (Virtual Input Queues): Resolve Internal Packet Loss
Problem: incast congestion leading to internal packet loss.
1. During an incast scenario, the ingress queue counter does not exceed the PFC threshold, so no PFC pause frame is sent upstream; packets keep arriving on the ingress port.
2. But the physical egress queue builds a backlog because of the convergence effect, and packet loss occurs inside the switch without egress-ingress coordination.
Solution: coordinated egress-ingress queuing. VIQ assigns, on each output port, a dedicated virtual queue for every input port. Buffer memory changes from fully shared to virtually dedicated per input port, so every input port receives fair scheduling and application tail latency can be controlled effectively.
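The egress-side scheduling described above can be sketched as follows; this is a minimal model of the idea, with illustrative port counts, not switch silicon:

```python
from collections import deque

class VirtualInputQueues:
    """Sketch of VIQ at one egress port: a dedicated virtual queue per
    ingress port, served round-robin, so an incast burst from one
    ingress cannot starve traffic arriving from the others."""

    def __init__(self, n_ingress):
        self.vq = [deque() for _ in range(n_ingress)]
        self._rr = 0                        # round-robin pointer

    def enqueue(self, ingress_port, pkt):
        self.vq[ingress_port].append(pkt)   # memory dedicated per ingress

    def dequeue(self):
        for _ in range(len(self.vq)):       # fair egress scheduling
            q = self.vq[self._rr]
            self._rr = (self._rr + 1) % len(self.vq)
            if q:
                return q.popleft()
        return None                         # egress idle
```

With two ingress ports, a three-packet burst from port 0 and a single packet from port 1 drain as port0, port1, port0, port0: the lone packet is not stuck behind the burst.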
DVL (Dynamic Virtual Lanes)
1. Identify the flow causing congestion and isolate it locally into a dedicated virtual queue on the ingress and egress ports.
2. Signal the upstream neighbor with a Congestion Isolation Packet (CIP) when the congested queue fills.
3. The upstream switch isolates the flow too, eliminating head-of-line blocking for the non-congested flows.
4. If the congested queue continues to fill, invoke PFC to remain lossless.
(Figure: upstream and downstream switches with virtual queues separating congested flows from non-congested flows.)
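The four steps above can be sketched as a small state machine; the thresholds and flow names are hypothetical tuning values for illustration, not figures from the report:

```python
class DVLIngress:
    """Sketch of Dynamic Virtual Lanes at one port: a flow whose queue
    contribution crosses an isolation threshold is shunted to a separate
    congested-flow queue, a Congestion Isolation Packet (CIP) is sent
    upstream, and PFC fires only if the congested queue keeps filling."""
    ISOLATE_AT = 10    # packets queued by one flow before isolation
    PFC_AT = 4         # congested-queue depth that triggers PFC pause

    def __init__(self):
        self.normal_q, self.congested_q = [], []
        self.isolated = set()
        self.sent = []                       # control messages upstream

    def receive(self, flow, queued_by_flow):
        if flow not in self.isolated and queued_by_flow >= self.ISOLATE_AT:
            self.isolated.add(flow)          # step 1: isolate locally
            self.sent.append(("CIP", flow))  # step 2: signal neighbor
        if flow in self.isolated:
            self.congested_q.append(flow)    # steps 3-4: separate lane
            if len(self.congested_q) >= self.PFC_AT:
                self.sent.append(("PFC", "pause"))  # lossless backstop
        else:
            self.normal_q.append(flow)       # victim flows keep moving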
LPS (Load-Aware Packet Spraying)
Load-balancing design space:
• Framework: centralized (e.g. Hedera, B4, SWAN) or distributed (e.g. ECMP, Flare, LocalFlow)
• State: stateless, local, or global
• Granularity: flow, flowlet, flowcell, or packet
Notes: centralized schemes are slow to react for data centers; stateless schemes handle asymmetric traffic poorly; packet-level spraying may require packet re-ordering.
LPS = packet spraying + endpoint reordering + load awareness
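The "load awareness" part can be sketched as a least-loaded-link picker; this is a toy model with hypothetical link loads, and it assumes (as LPS does) that the endpoint restores packet order:

```python
def spray(packet_sizes, link_load):
    """Load-aware packet spraying: each packet (not each flow) is sent
    on the currently least-loaded uplink. `link_load` is a mutable list
    of queued bytes per uplink; units are illustrative."""
    picks = []
    for size in packet_sizes:
        link = min(range(len(link_load)), key=link_load.__getitem__)
        link_load[link] += size             # congestion awareness
        picks.append(link)
    return picks

# Hypothetical fabric: uplink 0 already carries background load, so
# the sprayer alternates over the two idle uplinks around it.
loads = [5.0, 0.0, 0.0]
assert spray([1.0] * 6, loads) == [1, 2, 1, 2, 1, 2]
assert loads == [5.0, 3.0, 3.0]
```

Contrast this with the earlier ECMP sketch: per-packet, load-aware decisions cannot collide the way per-flow hashing can, at the price of possible reordering.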
PPH (Push & Pull Hybrid Scheduling)
PPH = congestion-aware traffic scheduling: push when load is light, pull when load is high.
• Light load: all push, to acquire low latency (data is sent immediately, short RTT).
• Light congestion: open pull for part of the congested path.
• Heavy load: all pull, to reduce queuing delay and improve throughput (long RTT: 1. Request, 2. Grant, 3. Data).
(Figure: a leaf-spine fabric where pushed data is sent directly from source to destination, while pulled data waits for a request/grant handshake from the destination.)
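The mode decision reduces to a threshold policy; this sketch uses a hypothetical load estimate and threshold, and rough RTT costs, purely to illustrate the trade-off named on the slide:

```python
def pph_mode(path_load, pull_threshold=0.7):
    """Toy PPH policy: sender pushes while estimated path load is
    light, and falls back to receiver-driven pull once the load
    crosses a threshold. The 0.7 threshold is a hypothetical knob."""
    return "push" if path_load < pull_threshold else "pull"

def transfer_cost_rtts(path_load):
    # Push delivers in ~0.5 RTT (data only); pull spends ~1.5 RTTs on
    # the request/grant/data handshake but keeps fabric queues short.
    return 0.5 if pph_mode(path_load) == "push" else 1.5

assert pph_mode(0.2) == "push"      # light load: low-latency push
assert pph_mode(0.9) == "pull"      # heavy load: scheduled pull
```

Pull trades an extra round trip for a guarantee that data only enters the fabric when the destination can absorb it, which is what prevents incast loss under heavy load.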
Innovation for the Lossless Network
• Coordinated Resources: Virtual Input Queues. Congestion impact: ingress thresholds are unrelated to egress buffer availability, so incast causes internal packet loss. Mitigation: coordinate egress availability with ingress demand; avoid internal switch packet loss.
• Isolate Congestion: Dynamic Virtual Lanes. Congestion impact: Priority-based Flow Control is coarse-grained, and victim flows are hurt by the congested flows. Mitigation: move congested flows out of the way; eliminate head-of-line blocking; allow time for end-to-end congestion control.
• Spread the Load: Load-Aware Packet Spraying. Congestion impact: unbalanced load sharing; elephant flow collisions block mice flows. Mitigation: load-balance flows at finer granularity; use congestion awareness to avoid collisions.
• Schedule Appropriately: Push & Pull Hybrid Scheduling. Congestion impact: unscheduled, network-resource-unaware many-to-one communication leads to incast packet loss. Mitigation: integrate scheduling information from the source, the network, and the destination.
Thank You