  1. RouteBricks: Exploiting Parallelism To Scale Software Routers. Paweł Bedyński, 12 January 2011

  2. About the paper
     • Published: October 2009, 14 pages
     • People: Mihai Dobrescu and Norbert Egi (interns at Intel Labs Berkeley), Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy
     • Institutions: EPFL Lausanne, Switzerland; Lancaster University, Lancaster, UK; Intel Research Labs, Berkeley, CA
     • „While certainly an improvement, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new programming paradigm; in the worst, she must be aware of (and program to avoid) low-level issues (…)”

  3. What for?
     A little bit of history:
     • Network equipment has focused primarily on performance
     • Limited forms of packet processing
     • New functionality and services renewed interest in programmable and extensible network equipment
     Main issue:
     • High-end routers are difficult to extend
     • „Software routers” are easily programmable but have so far been suitable only for low-packet-rate environments
     Goal:
     • Individual link speed: 10 Gbps is already widespread
     • Carrier-grade routers range from 10 Gbps up to 92 Tbps
     • Software routers have had problems exceeding 1-5 Gbps
     • RouteBricks: parallelization across servers and across tasks within each server

  4. Design Principles
     Variables:
     • N ports, each port full-duplex
     • Line rate R bps
     Requirements (router functionality):
     • (1) packet processing: route lookup or classification
     • (2) packet switching from input to output ports
     Existing solutions (N: 10-10k, R: 1-40 Gbps):
     • Hardware routers: packet processing happens in the linecard (serving one or a few ports), so each linecard must process at rate cR; packet switching goes through a switch fabric with a centralized scheduler, hence the fabric runs at rate NR
     • Software routers: both switching and packet processing happen at rate NR in a single machine
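
     To make the rate requirements above concrete, here is a small illustrative sketch (my own, not from the paper); N = 32 ports and R = 10 Gbps are assumed example values, and c is the small per-linecard constant mentioned above:

        # Illustrative sketch (not from the paper): per-component rate requirements
        # for an N-port router with line rate R bps per port.

        def hardware_router_rates(num_ports, line_rate, c=2):
            """Hardware router: each linecard processes at c*R, the fabric switches at N*R."""
            return {"per_linecard_processing": c * line_rate,
                    "switch_fabric": num_ports * line_rate}

        def single_server_software_router_rates(num_ports, line_rate):
            """Single-server software router: one machine must process and switch at N*R."""
            return {"processing": num_ports * line_rate,
                    "switching": num_ports * line_rate}

        if __name__ == "__main__":
            N, R = 32, 10e9                                   # assumed example: 32 ports at 10 Gbps
            print(hardware_router_rates(N, R))                # per linecard: 20 Gbps, fabric: 320 Gbps
            print(single_server_software_router_rates(N, R))  # 320 Gbps in one box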

  5. Design Principles
     1. Router functionality should be parallelized across multiple servers: NR is unrealistic in a single-server solution; it is 2-3 orders of magnitude away from current server performance.
     2. Router functionality should be parallelized across multiple processing paths within each server: even cR (with the lowest c = 2) is too much for a single server if we don't use the potential of multicore architecture (1-4 Gbps is reachable otherwise).
     Drawbacks and tradeoffs: packet reordering, increased latency, more „relaxed” performance guarantees.
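
     A tiny arithmetic sketch (my own; the core counts are assumed example values) of why the second principle matters: splitting the per-server load cR across cores brings the per-core rate down toward the few-Gbps range a single processing path can sustain.

        # Illustrative only: per-core load when the per-server rate cR is split
        # across k cores (core counts are assumed example values).
        c, R = 2, 10e9                            # c = 2, R = 10 Gbps per port
        for cores in (1, 4, 8):
            print(f"{cores} core(s): {c * R / cores / 1e9:.1f} Gbps per core")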

  6. Parallelizing across servers
     Switching guarantees:
     • (1) 100% throughput: all output ports can run at full line rate R bps, if the input traffic demands it
     • (2) Fairness: each input port gets its fair share of the capacity of any output port
     • (3) Avoids reordering packets
     Constraints of commodity servers:
     • Limited internal link rates: internal links cannot run at a rate higher than the external line rate R
     • Limited per-node processing rate: a single server's rate is no higher than cR for a small constant c > 1
     • Limited per-node fanout: the number of physical connections from each server is constant and independent of the number of servers

  7. Parallelizing across servers
     Routing algorithms:
     • Static single-path: requires high link „speedups”, which violates our constraints
     • Adaptive single-path: needs centralized scheduling, which would have to run at rate NR
     • Load-balanced routing: VLB (Valiant Load Balancing)
     Benefits of VLB:
     • Guarantees 100% throughput and fairness without centralized scheduling
     • Doesn't require link speedups: traffic is uniformly split across the cluster's internal links
     • Adds only +R (for intermediate traffic) to the per-server traffic rate compared to a solution without VLB (R for traffic coming in from the external line, R for traffic which the server should send out)
     Problems:
     • Packet reordering
     • Limited fanout: prevents us from using a full-mesh topology when N exceeds the server's fanout
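
     A minimal sketch (my own illustration, not the authors' implementation) of the two-phase VLB idea over a full mesh: each packet is first relayed through a uniformly chosen intermediate server, which then forwards it to the server owning the output port. This is why each server handles at most R of incoming external traffic, R of relayed traffic, and R of traffic destined for its own external line.

        import random

        def vlb_route(src, dst, servers):
            """Return the server path of one packet under basic VLB (illustrative)."""
            intermediate = random.choice(servers)   # phase 1: uniform split over all servers
            path = [src]
            if intermediate != src:
                path.append(intermediate)           # relay hop
            if dst != path[-1]:
                path.append(dst)                    # phase 2: deliver to the output server
            return path

        servers = ["s0", "s1", "s2", "s3"]          # assumed example cluster
        print(vlb_route("s0", "s2", servers))       # e.g. ['s0', 's3', 's2']

     Direct VLB, used later in RB4, sends traffic straight to the destination server when the load allows, avoiding the relay hop for most packets.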

  8. Parallelizing across servers
     Server configurations:
     • Current servers: each server can handle one router port and accommodate 5 NICs
     • More NICs: 1 RP (router port), 20 NICs
     • Faster servers & more NICs: 2 RP, 20 NICs
     [Figure: number of servers required to build an N-port, R = 10 Gbps/port router, for four different server configurations]
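
     A rough sketch (my own; it only covers the full-mesh lower bound and ignores the extra relay-only servers that a fanout-limited, multi-hop topology needs) of how the figure's server counts relate to the configurations above:

        import math

        def servers_full_mesh(num_ports, ports_per_server):
            """Lower bound: every server terminates external ports, full mesh between them."""
            return math.ceil(num_ports / ports_per_server)

        for n in (8, 32, 128):                    # assumed example port counts
            print(n, "ports ->", servers_full_mesh(n, ports_per_server=2), "servers (2 RP each)")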

  9. Parallelizing within servers

  10. Parallelizing within servers
     • Each network queue should be accessed by a single core
       - Locking is expensive
       - Separate threads for polling and writing
       - Threads are statically assigned to cores
     • Each packet should be handled by a single core
       - The pipeline approach is outperformed by the parallel approach
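
     A minimal sketch (my own, in Python for readability; the real system uses Click and multi-queue NICs) of the "one queue, one core" rule above: each core polls its own receive queue and writes only to its own per-port transmit queues, so every queue has exactly one reader and one writer and no locks are needed.

        import queue
        import threading

        NUM_CORES, NUM_PORTS = 4, 4                      # assumed example sizes

        # One receive queue per core, and a private transmit queue per
        # (core, output port) pair, so no queue is ever shared between cores.
        rx = [queue.Queue() for _ in range(NUM_CORES)]
        tx = [[queue.Queue() for _ in range(NUM_PORTS)] for _ in range(NUM_CORES)]

        def run_core(core_id, process):
            """Each packet is polled, processed, and enqueued by the same core."""
            while True:
                pkt = rx[core_id].get()
                out_port, out_pkt = process(pkt)
                tx[core_id][out_port].put(out_pkt)       # lock-free by construction

        workers = [threading.Thread(target=run_core, args=(i, lambda p: (0, p)), daemon=True)
                   for i in range(NUM_CORES)]
        for w in workers:
            w.start()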

  11. Parallelizing within servers
     • „Batch” processing (NIC-driven, poll-driven)
       - 3-fold performance improvement
       - Increased latency
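
     A minimal sketch (my own; the batch size is an assumed example, not the paper's value) of poll-driven batching: a core pulls several packets per poll and processes them in one pass, amortizing per-packet overhead at the cost of some extra latency.

        import queue

        def poll_batch(rx_queue, batch_size=16):
            """Drain up to batch_size packets from the queue in one poll."""
            batch = []
            while len(batch) < batch_size:
                try:
                    batch.append(rx_queue.get_nowait())
                except queue.Empty:
                    break
            return batch

        def run_core_batched(rx_queue, handle_batch):
            while True:
                batch = poll_batch(rx_queue)
                if batch:
                    handle_batch(batch)      # e.g. look up routes for the whole batch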

  12. Evaluation: server parallelism
     Workloads:
     • Minimal forwarding: traffic arriving at port i is just forwarded to port j (no routing-table lookup, etc.)
     • IP routing: full routing with checksum calculation, header updates, etc.
     • IPsec packet encryption

  13. The RB4 Parallel Router
     Specification:
     • 4 Nehalem servers
     • Full-mesh topology
     • Direct-VLB routing
     • Each server is assigned a single 10 Gbps external line

  14. RB4 - implementation
     Minimizing packet processing:
     • Encode the output node in the MAC address (done once)
     • Only works if each „internal” port has as many receive queues as there are external ports
     Avoiding reordering:
     • A standard VLB cluster allows reordering (multiple cores, load balancing)
     • Perfectly synchronized clocks would solve it, but require custom operating systems and hardware
     • Sequence-number tags: the CPU becomes a bottleneck
     • Adopted solution: avoid reordering within TCP/UDP flows
       - Packets of the same flow are assigned to the same queue
       - Sets of same-flow packets (within a δ-msec window) are sent through the same intermediate node
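
     A minimal sketch (my own; the packet field names, hash choice, and δ value are assumptions for illustration) of the flow-pinning idea above: hash the 5-tuple so same-flow packets land in the same queue, and keep a flow on the same intermediate node for each δ-millisecond window.

        import time
        import zlib

        DELTA_MS = 100                                   # assumed example window, not the paper's value

        def flow_id(pkt):
            """Hash of the 5-tuple; dictionary keys are assumed for illustration."""
            key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])
            return zlib.crc32(repr(key).encode())

        def pick_queue(pkt, num_queues):
            """Same flow -> same receive queue -> same core."""
            return flow_id(pkt) % num_queues

        def pick_intermediate(pkt, servers, now_ms=None):
            """Same flow -> same intermediate node within one delta-msec window."""
            now_ms = int(time.time() * 1000) if now_ms is None else now_ms
            epoch = now_ms // DELTA_MS
            return servers[(flow_id(pkt) ^ epoch) % len(servers)]

        pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
               "src_port": 1234, "dst_port": 80, "proto": "tcp"}
        print(pick_queue(pkt, num_queues=8), pick_intermediate(pkt, ["s0", "s1", "s2", "s3"]))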

  15. RB4 - performance
     Forwarding performance:
     Workload | RB4     | Expected (Gbps) | Explanation
     64B      | 12 Gbps | 12.7-19.4       | Extra overhead caused by the reordering-avoidance algorithm
     Abilene  | 35 Gbps | 33-49           | Limited number of PCIe slots on the prototype server
     Reordering (Abilene trace, single input, single output):
     • <p1,p2,p3,p4,p5> received as <p1,p4,p2,p3,p5> counts as one reordered sequence
     • Reordering is measured as the fraction of same-flow packet sequences that were reordered
     • RB4: 0.15% when using reordering avoidance, 5.5% when using Direct VLB
     Latency:
     • Per-server packet latency: 24 μs. DMA transfer (two back-and-forth transfers between NIC and memory: packet and descriptor): 2.56 μs; routing: 0.8 μs; batching: 12.8 μs
     • Traversal through RB4 includes 2-3 hops; hence the estimated latency is 47.6-66.4 μs
     • Cisco 6500 Series router: 26.3 μs (packet processing latency)
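
     A small sketch (my own) of the reordering metric described above: the fraction of same-flow sequences in which at least one packet arrives earlier than a packet sent before it.

        def is_reordered(arrivals):
            """arrivals: per-flow sequence numbers in order of arrival."""
            return any(b < a for a, b in zip(arrivals, arrivals[1:]))

        def reordering_fraction(flows):
            """flows: list of per-flow arrival sequences; returns the reordered fraction."""
            if not flows:
                return 0.0
            return sum(1 for f in flows if is_reordered(f)) / len(flows)

        # Example from the slide: <p1,p4,p2,p3,p5> is one reordered sequence.
        print(reordering_fraction([[1, 4, 2, 3, 5], [1, 2, 3, 4, 5]]))   # 0.5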

  16. Discussion & Conclusions
     Not only performance-driven:
     Space:
     • Space in RB4 could be reduced by integrating Ethernet controllers directly on the motherboard (as done in laptops), but the idea was not to change hardware
     • Estimates made by extrapolating results: a 400 mm motherboard could accommodate 6 controllers to drive 2x10Gbps and 30x1Gbps interfaces. With this we could connect 30-40 servers, resulting in a 300-400 Gbps router that occupies 30U (rack units)
     • For reference: a Cisco 7600 Series router offers 360 Gbps in a 21U form factor
     Power:
     • RB4 consumes 2.6 kW
     • For reference: the nominal power rating of a popular mid-range router loaded for 40 Gbps is 1.6 kW
     • RB4 could reduce power consumption by slowing down components not stressed by the workload
     Cost:
     • RB4 prototype: $14,000 „raw cost”
     • For reference: a 40 Gbps Cisco 7603 router costs $70,000
     How to measure programmability?

  17. Q&A
