  1. RouteBricks: Exploiting Parallelism To Scale Software Routers. Paweł Bedyński, 12 January 2011

  2. About the paper
     • Published: October 2009, 14 pages
     • People: Mihai Dobrescu and Norbert Egi (interns at Intel Labs Berkeley), Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy
     • Institutions: EPFL Lausanne, Switzerland; Lancaster University, Lancaster, UK; Intel Research Labs, Berkeley, CA
     • „While certainly an improvement, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new programming paradigm; in the worst, she must be aware of (and program to avoid) low-level issues (…)”

  3. What for?
     A little bit of history:
     • Network equipment has focused primarily on performance
     • Limited forms of packet processing
     • New functionality and services renewed interest in programmable and extensible network equipment
     Main issue:
     • High-end routers are difficult to extend
     • „Software routers” are easily programmable but have so far been suitable only for low-packet-rate environments
     Goal:
     • Individual link speed: 10 Gbps is already widespread
     • Carrier-grade routers range from 10 Gbps up to 92 Tbps
     • Software routers have had problems exceeding 1-5 Gbps
     • RouteBricks: parallelization across servers and across tasks within each server

  4. Design Principles
     Variables:
     • N ports, each port full-duplex
     • Line rate R bps
     Requirements (router functionality):
     • (1) packet processing: route lookup or classification
     • (2) packet switching from input to output ports
     Existing solutions (N: 10-10k, R: 1-40 Gbps):
     • Hardware routers: packet processing happens in the linecard (serving one or a few ports), so each linecard must process at rate cR; packet switching goes through a switch fabric with a centralized scheduler, hence the fabric runs at rate NR
     • Software routers: both switching and packet processing happen at rate NR in a single machine
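
     To make the rate requirements above concrete, here is a small illustrative sketch (my own, not from the paper); N = 32 ports and R = 10 Gbps are assumed example values, and c is the small per-linecard constant mentioned above:

        # Illustrative sketch (not from the paper): per-component rate requirements
        # for an N-port router with line rate R bps per port.

        def hardware_router_rates(num_ports, line_rate, c=2):
            """Hardware router: each linecard processes at c*R, the fabric switches at N*R."""
            return {"per_linecard_processing": c * line_rate,
                    "switch_fabric": num_ports * line_rate}

        def single_server_software_router_rates(num_ports, line_rate):
            """Single-server software router: one machine must process and switch at N*R."""
            return {"processing": num_ports * line_rate,
                    "switching": num_ports * line_rate}

        if __name__ == "__main__":
            N, R = 32, 10e9                                   # assumed example: 32 ports at 10 Gbps
            print(hardware_router_rates(N, R))                # per linecard: 20 Gbps, fabric: 320 Gbps
            print(single_server_software_router_rates(N, R))  # 320 Gbps in one box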

  5. Design Principles
     1. Router functionality should be parallelized across multiple servers: NR is unrealistic in a single-server solution; it is 2-3 orders of magnitude away from current server performance.
     2. Router functionality should be parallelized across multiple processing paths within each server: even cR (with the lowest c = 2) is too much for a single server if we don't use the potential of multicore architecture (1-4 Gbps is reachable otherwise).
     Drawbacks and tradeoffs: packet reordering, increased latency, more „relaxed” performance guarantees.
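
     A tiny arithmetic sketch (my own; the core counts are assumed example values) of why the second principle matters: splitting the per-server load cR across cores brings the per-core rate down toward the few-Gbps range a single processing path can sustain.

        # Illustrative only: per-core load when the per-server rate cR is split
        # across k cores (core counts are assumed example values).
        c, R = 2, 10e9                            # c = 2, R = 10 Gbps per port
        for cores in (1, 4, 8):
            print(f"{cores} core(s): {c * R / cores / 1e9:.1f} Gbps per core")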

  6. Parallelizing across servers
     Switching guarantees:
     • (1) 100% throughput: all output ports can run at full line rate R bps, if the input traffic demands it
     • (2) Fairness: each input port gets its fair share of the capacity of any output port
     • (3) Avoids reordering packets
     Constraints of commodity servers:
     • Limited internal link rates: internal links cannot run at a rate higher than the external line rate R
     • Limited per-node processing rate: a single server's rate is no higher than cR for a small constant c > 1
     • Limited per-node fanout: the number of physical connections from each server is constant and independent of the number of servers

  7. Parallelizing across servers
     Routing algorithms:
     • Static single-path: requires high link „speedups”, which violates our constraints
     • Adaptive single-path: needs centralized scheduling, which would have to run at rate NR
     • Load-balanced routing: VLB (Valiant Load Balancing)
     Benefits of VLB:
     • Guarantees 100% throughput and fairness without centralized scheduling
     • Doesn't require link speedups: traffic is uniformly split across the cluster's internal links
     • Adds only +R (for intermediate traffic) to the per-server traffic rate compared to a solution without VLB (R for traffic coming in from the external line, R for traffic which the server should send out)
     Problems:
     • Packet reordering
     • Limited fanout: prevents us from using a full-mesh topology when N exceeds the server's fanout
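
     A minimal sketch (my own illustration, not the authors' implementation) of the two-phase VLB idea over a full mesh: each packet is first relayed through a uniformly chosen intermediate server, which then forwards it to the server owning the output port. This is why each server handles at most R of incoming external traffic, R of relayed traffic, and R of traffic destined for its own external line.

        import random

        def vlb_route(src, dst, servers):
            """Return the server path of one packet under basic VLB (illustrative)."""
            intermediate = random.choice(servers)   # phase 1: uniform split over all servers
            path = [src]
            if intermediate != src:
                path.append(intermediate)           # relay hop
            if dst != path[-1]:
                path.append(dst)                    # phase 2: deliver to the output server
            return path

        servers = ["s0", "s1", "s2", "s3"]          # assumed example cluster
        print(vlb_route("s0", "s2", servers))       # e.g. ['s0', 's3', 's2']

     Direct VLB, used later in RB4, sends traffic straight to the destination server when the load allows, avoiding the relay hop for most packets.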

  8. Parallelizing across servers
     Server configurations:
     • Current servers: each server can handle one router port and accommodate 5 NICs
     • More NICs: 1 RP (router port), 20 NICs
     • Faster servers & more NICs: 2 RP, 20 NICs
     [Figure: number of servers required to build an N-port, R = 10 Gbps/port router, for four different server configurations]
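
     A rough sketch (my own; it only covers the full-mesh lower bound and ignores the extra relay-only servers that a fanout-limited, multi-hop topology needs) of how the figure's server counts relate to the configurations above:

        import math

        def servers_full_mesh(num_ports, ports_per_server):
            """Lower bound: every server terminates external ports, full mesh between them."""
            return math.ceil(num_ports / ports_per_server)

        for n in (8, 32, 128):                    # assumed example port counts
            print(n, "ports ->", servers_full_mesh(n, ports_per_server=2), "servers (2 RP each)")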

  9. Parallelizing within servers

  10. Parallelizing within servers
     • Each network queue should be accessed by a single core
       - Locking is expensive
       - Separate threads for polling and writing
       - Threads are statically assigned to cores
     • Each packet should be handled by a single core
       - The pipeline approach is outperformed by the parallel approach
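
     A minimal sketch (my own, in Python for readability; the real system uses Click and multi-queue NICs) of the "one queue, one core" rule above: each core polls its own receive queue and writes only to its own per-port transmit queues, so every queue has exactly one reader and one writer and no locks are needed.

        import queue
        import threading

        NUM_CORES, NUM_PORTS = 4, 4                      # assumed example sizes

        # One receive queue per core, and a private transmit queue per
        # (core, output port) pair, so no queue is ever shared between cores.
        rx = [queue.Queue() for _ in range(NUM_CORES)]
        tx = [[queue.Queue() for _ in range(NUM_PORTS)] for _ in range(NUM_CORES)]

        def run_core(core_id, process):
            """Each packet is polled, processed, and enqueued by the same core."""
            while True:
                pkt = rx[core_id].get()
                out_port, out_pkt = process(pkt)
                tx[core_id][out_port].put(out_pkt)       # lock-free by construction

        workers = [threading.Thread(target=run_core, args=(i, lambda p: (0, p)), daemon=True)
                   for i in range(NUM_CORES)]
        for w in workers:
            w.start()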

  11. Parallelizing within servers
     • „Batch” processing (NIC-driven, poll-driven)
       - 3-fold performance improvement
       - Increased latency
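
     A minimal sketch (my own; the batch size is an assumed example, not the paper's value) of poll-driven batching: a core pulls several packets per poll and processes them in one pass, amortizing per-packet overhead at the cost of some extra latency.

        import queue

        def poll_batch(rx_queue, batch_size=16):
            """Drain up to batch_size packets from the queue in one poll."""
            batch = []
            while len(batch) < batch_size:
                try:
                    batch.append(rx_queue.get_nowait())
                except queue.Empty:
                    break
            return batch

        def run_core_batched(rx_queue, handle_batch):
            while True:
                batch = poll_batch(rx_queue)
                if batch:
                    handle_batch(batch)      # e.g. look up routes for the whole batch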

  12. Evaluation: server parallelism
     Workloads:
     • Minimal forwarding: traffic arriving at port i is just forwarded to port j (no routing-table lookup, etc.)
     • IP routing: full routing with checksum calculation, header updates, etc.
     • IPsec packet encryption

  13. The RB4 Parallel Router
     Specification:
     • 4 Nehalem servers
     • Full-mesh topology
     • Direct-VLB routing
     • Each server is assigned a single 10 Gbps external line

  14. RB4 - implementation
     Minimizing packet processing:
     • Encode the output node in the MAC address (done once)
     • Only works if each „internal” port has as many receive queues as there are external ports
     Avoiding reordering:
     • A standard VLB cluster allows reordering (multiple cores, load balancing)
     • Perfectly synchronized clocks would solve it, but require custom operating systems and hardware
     • Sequence-number tags: the CPU becomes a bottleneck
     • Adopted solution: avoid reordering within TCP/UDP flows
       - Packets of the same flow are assigned to the same queue
       - Sets of same-flow packets (within a δ-msec window) are sent through the same intermediate node
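
     A minimal sketch (my own; the packet field names, hash choice, and δ value are assumptions for illustration) of the flow-pinning idea above: hash the 5-tuple so same-flow packets land in the same queue, and keep a flow on the same intermediate node for each δ-millisecond window.

        import time
        import zlib

        DELTA_MS = 100                                   # assumed example window, not the paper's value

        def flow_id(pkt):
            """Hash of the 5-tuple; dictionary keys are assumed for illustration."""
            key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])
            return zlib.crc32(repr(key).encode())

        def pick_queue(pkt, num_queues):
            """Same flow -> same receive queue -> same core."""
            return flow_id(pkt) % num_queues

        def pick_intermediate(pkt, servers, now_ms=None):
            """Same flow -> same intermediate node within one delta-msec window."""
            now_ms = int(time.time() * 1000) if now_ms is None else now_ms
            epoch = now_ms // DELTA_MS
            return servers[(flow_id(pkt) ^ epoch) % len(servers)]

        pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
               "src_port": 1234, "dst_port": 80, "proto": "tcp"}
        print(pick_queue(pkt, num_queues=8), pick_intermediate(pkt, ["s0", "s1", "s2", "s3"]))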

  15. RB4 - performance
     Forwarding performance:
     Workload | RB4     | Expected (Gbps) | Explanation
     64B      | 12 Gbps | 12.7-19.4       | Extra overhead caused by the reordering-avoidance algorithm
     Abilene  | 35 Gbps | 33-49           | Limited number of PCIe slots on the prototype server
     Reordering (Abilene trace, single input, single output):
     • <p1,p2,p3,p4,p5> received as <p1,p4,p2,p3,p5> counts as one reordered sequence
     • Reordering is measured as the fraction of same-flow packet sequences that were reordered
     • RB4: 0.15% when using reordering avoidance, 5.5% when using Direct VLB
     Latency:
     • Per-server packet latency: 24 μs. DMA transfer (two back-and-forth transfers between NIC and memory: packet and descriptor): 2.56 μs; routing: 0.8 μs; batching: 12.8 μs
     • Traversal through RB4 includes 2-3 hops; hence the estimated latency is 47.6-66.4 μs
     • Cisco 6500 Series router: 26.3 μs (packet processing latency)
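
     A small sketch (my own) of the reordering metric described above: the fraction of same-flow sequences in which at least one packet arrives earlier than a packet sent before it.

        def is_reordered(arrivals):
            """arrivals: per-flow sequence numbers in order of arrival."""
            return any(b < a for a, b in zip(arrivals, arrivals[1:]))

        def reordering_fraction(flows):
            """flows: list of per-flow arrival sequences; returns the reordered fraction."""
            if not flows:
                return 0.0
            return sum(1 for f in flows if is_reordered(f)) / len(flows)

        # Example from the slide: <p1,p4,p2,p3,p5> is one reordered sequence.
        print(reordering_fraction([[1, 4, 2, 3, 5], [1, 2, 3, 4, 5]]))   # 0.5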

  16. Discussion & Conclusions
     Not only performance-driven:
     Space:
     • Space in RB4 could be reduced by integrating Ethernet controllers directly on the motherboard (as done in laptops), but the idea was not to change hardware
     • Estimates made by extrapolating results: a 400 mm motherboard could accommodate 6 controllers to drive 2x10Gbps and 30x1Gbps interfaces. With this we could connect 30-40 servers, resulting in a 300-400 Gbps router that occupies 30U (rack units)
     • For reference: a Cisco 7600 Series router offers 360 Gbps in a 21U form factor
     Power:
     • RB4 consumes 2.6 kW
     • For reference: the nominal power rating of a popular mid-range router loaded for 40 Gbps is 1.6 kW
     • RB4 could reduce power consumption by slowing down components not stressed by the workload
     Cost:
     • RB4 prototype: $14,000 „raw cost”
     • For reference: a 40 Gbps Cisco 7603 router costs $70,000
     How to measure programmability?

  17. Q&A
