RouteBricks: Exploiting Parallelism To Scale Software Routers
Paweł Bedyński
12 January 2011
About the paper
Published: October 2009, 14 pages
People: Mihai Dobrescu and Norbert Egi (interns at Intel Labs Berkeley), Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy
Institutions: EPFL, Lausanne, Switzerland; Lancaster University, Lancaster, UK; Intel Research Labs, Berkeley, CA
„While certainly an improvement, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new programming paradigm; in the worst, she must be aware of (and program to avoid) low-level issues (…)”
What for?
A little bit of history:
- Network equipment has focused primarily on performance
- Limited forms of packet processing
- New functionality and services have renewed interest in programmable and extensible network equipment
Main issue:
- High-end routers are difficult to extend
- „Software routers” – easily programmable, but so far suitable only for low-packet-rate environments
Goal:
- Individual link speeds of 10Gbps are already widespread
- Carrier-grade routers range from 10Gbps up to 92Tbps
- Software routers have had problems exceeding 1-5Gbps
- RouteBricks: parallelization across servers and across tasks within each server
Design Principles
Requirements:
- Variables: N ports, each port full-duplex, line rate R bps
- Router functionality: (1) packet processing, such as route lookup or classification; (2) packet switching from input to output ports
Existing solutions (N: 10-10k, R: 1-40 Gbps):
- Hardware routers: packet processing happens in the linecard (per one or a few ports), so each linecard must process at rate cR; packet switching goes through a switch fabric with a centralized scheduler, hence at rate NR
- Software routers: both switching and packet processing at rate NR
Design Principles
1. Router functionality should be parallelized across multiple servers
- NR is unrealistic for a single-server solution: it is 2-3 orders of magnitude away from current server performance (a back-of-the-envelope check follows below)
2. Router functionality should be parallelized across multiple processing paths within each server
- Even cR (lowest c = 2) is too much for a single server if we don't use the potential of the multicore architecture (1-4Gbps is reachable)
Drawbacks, tradeoffs: packet reordering, increased latency, more „relaxed” performance guarantees
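As a back-of-the-envelope check of principle 1, the tiny program below compares the aggregate rate N·R against the per-server rate c·R; the port count N = 32 and the ~4Gbps single-server forwarding figure are assumptions chosen only to make the gap concrete, not numbers from the paper.

```c
/* Rough arithmetic behind the two design principles (illustrative values). */
#include <stdio.h>

int main(void)
{
    const double R_gbps = 10.0;  /* external line rate per port              */
    const int    N      = 32;    /* example port count -- an assumption      */
    const double c      = 3.0;   /* per-server factor: c = 3 for plain VLB,
                                    c = 2 for Direct VLB                     */
    const double single_server_gbps = 4.0;  /* rough single-server forwarding
                                               rate (1-4 Gbps on the slides) */

    printf("aggregate rate N*R  = %6.0f Gbps\n", N * R_gbps);
    printf("per-server rate c*R = %6.0f Gbps\n", c * R_gbps);
    printf("gap vs. one server (~%.0f Gbps): %.0fx\n",
           single_server_gbps, N * R_gbps / single_server_gbps);
    return 0;
}
```

For N = 32 and R = 10Gbps the aggregate rate is 320Gbps, roughly two orders of magnitude beyond the 1-4Gbps a single software router reached, whereas 2R-3R per server is within reach of a well-parallelized multicore machine.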
Parallelizing across servers
Switching guarantees:
- (1) 100% throughput – all output ports can run at the full line rate R bps, if the input traffic demands it
- (2) Fairness – each input port gets its fair share of the capacity of any output port
- (3) No packet reordering
Constraints of commodity servers:
- Limited internal link rates – internal links cannot run at a rate higher than the external line rate R
- Limited per-node processing rate – a single server cannot process at a rate higher than cR for a small constant c > 1
- Limited per-node fanout – the number of physical connections from each server is constant and independent of the number of servers
Parallelizing across servers
Routing algorithms:
- Static single-path – requires high link „speedups”, which violates our constraint
- Adaptive single-path – needs centralized scheduling, and the scheduler would have to run at rate NR
- Load-balanced routing – VLB (Valiant Load Balancing)
Benefits of VLB (sketched below):
- Guarantees 100% throughput and fairness without centralized scheduling
- Doesn't require link speedups – traffic is uniformly split across the cluster's internal links
- Adds only +R (for intermediate traffic) to the per-server traffic rate compared to a solution without VLB (R for traffic coming in from the external line, R for traffic the server should send out)
Problems:
- Packet reordering
- Limited fanout: prevents us from using a full-mesh topology when N exceeds the server's fanout
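A minimal sketch of the two-phase VLB forwarding decision in a full-mesh cluster, assuming a hypothetical 8-server cluster and a trivial port-to-server mapping (this is not the authors' Click configuration): phase 1 spreads each incoming packet to a uniformly chosen intermediate server, phase 2 forwards it to the server that owns its output port, so no internal link carries more than R and no central scheduler is needed.

```c
/* Two-phase Valiant Load Balancing next-hop choice (illustrative sketch). */
#include <stdio.h>
#include <stdlib.h>

#define NUM_SERVERS 8   /* assumed cluster size */

/* placeholder: map an external output port to the server owning it (1:1 here) */
static int output_server_for(int output_port)
{
    return output_port % NUM_SERVERS;
}

/* phase 1: uniform spreading over all servers, no central scheduler */
static int vlb_phase1_next_hop(void)
{
    return rand() % NUM_SERVERS;
}

/* phase 2: deliver to the server that owns the external output port */
static int vlb_phase2_next_hop(int output_port)
{
    return output_server_for(output_port);
}

int main(void)
{
    int output_port = 5;   /* assumed destination port */
    printf("ingress -> intermediate server %d -> output server %d\n",
           vlb_phase1_next_hop(), vlb_phase2_next_hop(output_port));
    return 0;
}
```

Direct VLB, used by RB4 later in the talk, forwards traffic straight to the output server when the traffic matrix allows it, which is what lets the per-server factor drop from 3R towards 2R.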
Parallelizing across servers
Server configurations:
- Current servers: each server can handle one router port and accommodate 5 NICs
- More NICs: 1 RP (router port), 20 NICs
- Faster servers & more NICs: 2 RP, 20 NICs
Number of servers required to build an N-port, R = 10Gbps/port router, for four different server configurations
Parallelizing within servers
Parallelizing within servers
Each network queue should be accessed by a single core:
- Locking is expensive
- Separate threads for polling and writing
- Threads statically assigned to cores
Each packet should be handled by a single core:
- The pipeline approach is outperformed by the parallel approach (see the sketch below)
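A minimal sketch of this "parallel" (run-to-completion) arrangement, assuming hypothetical driver helpers poll_rx_queue and process_and_tx rather than the real NIC API: one worker thread per core is pinned with pthread_setaffinity_np, exclusively owns one receive and one transmit queue (so no queue ever needs a lock), and carries each packet from poll to transmit on the same core.

```c
/* Statically pinned per-core workers, each owning its own RX/TX queue.
 * Driver calls are stand-ins; the loop structure is the point. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NUM_CORES 4

struct worker { int core, rx_queue, tx_queue; };

/* stand-ins for the NIC driver (assumed names, no real I/O here) */
static int  poll_rx_queue(int q, void **pkt) { (void)q; (void)pkt; return 0; }
static void process_and_tx(void *pkt, int q) { (void)pkt; (void)q; }

static void *worker_loop(void *arg)
{
    struct worker *w = arg;

    /* pin the thread so the queue-to-core mapping never changes */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;) {                                  /* poll-driven, no interrupts */
        void *pkt;
        if (poll_rx_queue(w->rx_queue, &pkt))
            process_and_tx(pkt, w->tx_queue);   /* same core end to end */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_CORES];
    struct worker w[NUM_CORES];

    for (int i = 0; i < NUM_CORES; i++) {
        w[i] = (struct worker){ .core = i, .rx_queue = i, .tx_queue = i };
        pthread_create(&tid[i], NULL, worker_loop, &w[i]);
    }
    for (int i = 0; i < NUM_CORES; i++)         /* forwarding loops run until killed */
        pthread_join(tid[i], NULL);
    return 0;
}
```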
Parallelizing within servers
„Batch” processing (NIC-driven and poll-driven):
- 3-fold performance improvement
- Increased latency
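A small sketch of the poll-driven half of this idea, with assumed helper names (rx_burst, tx_burst) rather than the actual RouteBricks driver API: each poll dequeues up to a whole batch of packets and the batch is processed and transmitted together, so per-poll bookkeeping is amortized over many packets; that is where the throughput gain comes from and why latency grows while a batch fills.

```c
/* Poll-driven batching: amortize per-poll overhead over BATCH packets. */
#include <stdio.h>

#define BATCH 16   /* assumed batch size */

/* stand-ins for NIC driver calls (assumed names, no real I/O) */
static int  rx_burst(int q, void **pkts, int max) { (void)q; (void)pkts; (void)max; return 0; }
static void process_packet(void *pkt)             { (void)pkt; }
static void tx_burst(int q, void **pkts, int n)   { (void)q; (void)pkts; (void)n; }

static void poll_once(int rx_q, int tx_q)
{
    void *pkts[BATCH];
    int n = rx_burst(rx_q, pkts, BATCH);   /* one poll, up to BATCH packets */

    for (int i = 0; i < n; i++)
        process_packet(pkts[i]);

    if (n > 0)
        tx_burst(tx_q, pkts, n);           /* one TX notification per batch */
}

int main(void)
{
    poll_once(0, 0);                       /* a real loop would call this forever */
    return 0;
}
```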
Evaluation: server parallelism
Workloads:
- Minimal forwarding – traffic arriving at port i is just forwarded to port j (no routing-table lookup, etc.)
- IP routing – full routing with checksum calculation, header updates, etc.
- IPsec packet encryption
The RB4 Parallel Router
Specification:
- 4 Nehalem servers
- Full-mesh topology
- Direct-VLB routing
- Each server assigned a single 10 Gbps external line
RB4 – implementation
Minimizing packet processing:
- By encoding the output node in the MAC address (done once)
- Only works if each „internal” port has as many receive queues as there are external ports
Avoiding reordering:
- A standard VLB cluster allows reordering (multiple cores or load-balancing)
- Perfectly synchronized clocks would solve it, but require custom operating systems and hardware
- Sequence-number tags – the CPU becomes the bottleneck
- Solution – avoid reordering within TCP/UDP flows:
  - Same-flow packets are assigned to the same queue
  - Sets of same-flow packets (within a δ-msec window) are sent through the same intermediate node
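The sketch below shows one plausible way to implement the two reordering-avoidance rules; the hash function, the queue and server counts and the δ = 100 ms window are illustrative assumptions, not the RB4 code. Hashing the flow's 5-tuple pins same-flow packets to one receive queue (and therefore one core), and mixing the hash with a δ-msec epoch pins same-flow packets seen close together in time to one intermediate node.

```c
/* Flow-to-queue and flow-to-intermediate-node assignment (illustrative). */
#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES   8
#define NUM_SERVERS  4
#define DELTA_MS   100      /* assumed value of the delta window */

struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* crude illustrative hash over the 5-tuple; a real system could reuse
 * the NIC's RSS hash instead */
static uint32_t flow_hash(const struct flow *f)
{
    return f->src_ip ^ f->dst_ip ^
           ((uint32_t)f->src_port << 16) ^ f->dst_port ^ f->proto;
}

/* rule 1: same flow -> same receive queue -> same core, per-flow order holds */
static int rx_queue_for(const struct flow *f)
{
    return flow_hash(f) % NUM_QUEUES;
}

/* rule 2: same flow within the same delta-msec epoch -> same intermediate node */
static int intermediate_for(const struct flow *f, uint64_t now_ms)
{
    uint64_t epoch = now_ms / DELTA_MS;
    return (flow_hash(f) ^ (uint32_t)epoch) % NUM_SERVERS;
}

int main(void)
{
    struct flow f = { 0x0a000001, 0x0a000002, 12345, 80, 6 };   /* example flow */
    printf("queue=%d intermediate=%d\n",
           rx_queue_for(&f), intermediate_for(&f, 5000));
    return 0;
}
```

Only a flow that stays idle for more than δ msec can be re-balanced to a different intermediate node, which matches the slide's claim that reordering is avoided within TCP/UDP flows rather than across all packets.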
RB4 – performance
Forwarding performance:

Workload | RB4     | Expected       | Explanation
64B      | 12 Gbps | 12.7-19.4 Gbps | Extra overhead caused by the reordering-avoidance algorithm
Abilene  | 35 Gbps | 33-49 Gbps     | Limited number of PCIe slots on the prototype server

Reordering (Abilene trace – single input, single output):
- <p1,p2,p3,p4,p5> received as <p1,p4,p2,p3,p5> – one reordered sequence
- Reordering is measured as the fraction of same-flow packet sequences that were reordered
- RB4: 0.15% when using reordering avoidance, 5.5% when using Direct VLB
Latency:
- Per-server packet latency: 24 μs, of which DMA transfers (two back-and-forth transfers between NIC and memory: packet and descriptor) – 2.56 μs; routing – 0.8 μs; batching – 12.8 μs
- Traversal through RB4 includes 2-3 hops, hence the estimated latency of 47.6-66.4 μs
- Cisco 6500 Series router – 26.3 μs (packet processing latency)
Discussion & Conclusions
Not only performance-driven:
Space:
- Limiting space in RB4 by integrating Ethernet controllers directly on the motherboard (as done in laptops) – but the idea was not to change the hardware
- Estimates made by extrapolating results: a 400mm motherboard could accommodate 6 controllers to drive 2x10Gbps and 30x1Gbps interfaces; with this we could connect 30-40 servers, resulting in a 300-400Gbps router that occupies 30U (rack units)
- For reference: Cisco 7600 Series – 360Gbps in a 21U form factor
Power:
- RB4 consumes 2.6KW
- For reference: the nominal power rating of a popular mid-range router loaded for 40Gbps is 1.6KW
- RB4 can reduce power consumption by slowing down components not stressed by the workload
Cost:
- RB4 prototype: $14,000 „raw cost”
- For reference: a 40Gbps Cisco 7603 router costs $70,000
How to measure programmability?
Q&A