Ch 6b: Data-center networking
Holger Karl, Computer Networks Group, Universität Paderborn
Future Internet
Outline
• Evolution of data centres
• Topologies
• Networking issues
• Case study: Jupiter rising
Evolution of data centres
• Scale
• Workloads: shift from north-south to east-west traffic
  • Data-parallel applications, map-reduce frameworks
  • Requires a different optimization target: bisection bandwidth
  • Latency!
• Virtualization
  • Many virtual machines – scale
  • Moving virtual machines – reassign MAC addresses?
Evolution: Scale
Example: CERN LHC
• 24 gigabytes/s produced
• 30 petabytes produced per year
• > 300 petabytes of online disk storage
• > 1 petabyte processed per day (https://home.cern/about/computing)
• > 550,000 cores
• > 2 million jobs/day
(www.computerworld.com/article/2960642/cloud-storage/cerns-data-stores-soar-to-530m-gigabytes.html)
Evolution: Workloads
• Conventional: mostly north-south traffic
  • From individual machines to the gateway
  • Typical: web-server farm
• Modern: east-west traffic
  • From server to server
  • Typical: data-parallel applications like map/reduce
Programming model – Rough idea
(Figure: several servers run a Map phase over their input shards, the intermediate results are redistributed across servers in a Shuffle phase, and Reduce tasks combine them into the final results.)
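To make the three phases in the figure concrete, here is a minimal word-count sketch in Python; the shard contents and function names are made up for illustration and are not part of any particular framework.

```python
from collections import defaultdict

# Input shards, one per server (hypothetical toy data).
shards = [
    "lorem ipsum dolor",
    "duis dolor vel",
    "ipsum vel vel",
]

def map_phase(shard):
    """Map: emit (key, value) pairs from one shard."""
    return [(word, 1) for word in shard.split()]

def shuffle_phase(mapped_shards):
    """Shuffle: group all values by key, across servers."""
    groups = defaultdict(list)
    for pairs in mapped_shards:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key."""
    return {key: sum(values) for key, values in groups.items()}

mapped = [map_phase(s) for s in shards]       # runs in parallel on servers 1..n
result = reduce_phase(shuffle_phase(mapped))  # the shuffle is server-to-server traffic
print(result)  # {'lorem': 1, 'ipsum': 2, 'dolor': 2, 'duis': 1, 'vel': 3}
```

The shuffle step is exactly the all-to-all, east-west traffic pattern that motivates high bisection bandwidth in the fabric.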
Evolution: Virtualization
• Virtualize machines!
• Many more MAC addresses to handle
  • Easily hundreds of thousands of VMs
  • Scaling problem for switches
• Assign hierarchical MAC addresses? Eases routing (see the sketch below)
  • But: ARP!
• More problematic: moving a VM from one physical machine to another
  • Must not change its IP address – one L2 domain!
  • ARP? Caching?
  • Keep the MAC address? Makes hierarchical MACs infeasible
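A sketch of why hierarchical MAC addresses ease routing, loosely in the spirit of PortLand-style pseudo-MACs; the exact field layout below is an assumption chosen for illustration, not a standard format.

```python
def make_hierarchical_mac(pod: int, pos: int, port: int, vmid: int) -> str:
    """Encode a VM's location into a 48-bit MAC: pod(16) . pos(8) . port(8) . vmid(16).
    Field widths are an assumption made for this illustration."""
    octets = [pod >> 8, pod & 0xFF, pos, port, vmid >> 8, vmid & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)

def pod_of(mac: str) -> int:
    """A core switch only needs the first two octets to pick the right pod."""
    o = mac.split(":")
    return (int(o[0], 16) << 8) | int(o[1], 16)

mac = make_hierarchical_mac(pod=5, pos=2, port=17, vmid=3)
print(mac, "-> pod", pod_of(mac))   # 00:05:02:11:00:03 -> pod 5
```

The price is visible immediately: switches can forward on a short prefix instead of per-VM state, but the address is location-dependent. If the VM migrates to another pod while keeping its MAC, the hierarchy (and every cached ARP entry) is wrong, which is exactly the tension the slide points out.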
Topologies in data centres
• Basic physical setup
  • 19" racks, often 42 units high (one unit: 1.75")
  • Servers: 1U – 4U
  • Two 32-core processors per 1U server: up to 42 × 2 × 32 = 2688 cores per rack (as of 2019); one core easily handles 10 VMs
  • Blade enclosure: 10 U
• Networking inside a rack: top-of-rack (ToR) switch
  • 48 ports, 1G or 10G typical
  • 2–4 uplinks, often 10G, evolving to 40G, perhaps 100G in the future
• Some (small) number of gateways to the outside world
• Core question: how to connect the ToRs?
  • To support both N/S and E/W traffic
Topologies: requirements
• High throughput / bisection bandwidth (see the sketch below)
• Fault-tolerant setup: typically 2-connected
  • Means: multiple paths between any two end hosts in operation!
  • Not just a spanning tree!
  • But: loop freedom
• VM migration support – one L2 domain!
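To make "bisection bandwidth" concrete: it is the minimum total link capacity crossing any split of the nodes into two equal halves. The brute-force sketch below is only feasible for toy topologies and uses hypothetical node names.

```python
from itertools import combinations

def bisection_bandwidth(nodes, links):
    """Minimum capacity crossing any equal split of the nodes.
    links maps frozenset({a, b}) -> link capacity in Gbit/s.
    Brute force over all bipartitions: toy-sized topologies only."""
    best = float("inf")
    for half in combinations(nodes, len(nodes) // 2):
        side = set(half)
        cut = sum(cap for link, cap in links.items() if len(link & side) == 1)
        best = min(best, cut)
    return best

# Four ToRs in a full mesh of 10G links (hypothetical toy example).
tors = ["tor1", "tor2", "tor3", "tor4"]
mesh = {frozenset(pair): 10 for pair in combinations(tors, 2)}
print(bisection_bandwidth(tors, mesh))   # 40 Gbit/s: 4 links cross every equal split
```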
Topology: Example
• Example: Cisco standard recommendation
Clos Network
• Idea: build an n×n crossbar switch out of smaller k×k crossbar switches
• Nonblocking
• 3-stage and 5-stage Clos variants
(Figure: https://upload.wikimedia.org/wikipedia/en/9/9a/Closnetwork.png; IEEE ANTS 2012 Tutorial)
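A quick way to see the classic result is to compute it: a three-stage Clos with r ingress switches of n external ports each and m middle-stage switches is strict-sense nonblocking when m ≥ 2n − 1 and rearrangeably nonblocking when m ≥ n. The helper below is just a sketch of that textbook condition.

```python
def clos_properties(n: int, m: int, r: int) -> dict:
    """Three-stage Clos(n, m, r): r ingress switches with n external ports each,
    m middle switches, r egress switches. Returns size and (non)blocking class."""
    return {
        "external_ports": n * r,                     # total inputs = outputs
        "ingress_switch_size": f"{n}x{m}",
        "middle_switch_size": f"{r}x{r}",
        "strict_sense_nonblocking": m >= 2 * n - 1,  # Clos' 1953 condition
        "rearrangeably_nonblocking": m >= n,
    }

# Example: emulate a 16x16 crossbar from 4x7 ingress/egress and 4x4 middle switches.
print(clos_properties(n=4, m=7, r=4))
```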
Fat-Tree Topology: Special case of Clos
(Figure: fat-tree topology built from identical switches; IEEE ANTS 2012 Tutorial)
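For the standard k-ary fat-tree construction from identical k-port switches, the component counts follow directly from the topology; the sketch below computes them. These are the usual textbook formulas, not numbers taken from the figure.

```python
def fat_tree_sizes(k: int) -> dict:
    """Component counts of a k-ary fat tree built from k-port switches (k even)."""
    assert k % 2 == 0
    return {
        "pods": k,
        "edge_switches": k * (k // 2),        # k/2 edge switches per pod
        "agg_switches": k * (k // 2),         # k/2 aggregation switches per pod
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,                 # (k/2)^2 hosts per pod, k pods
        "equal_cost_core_paths": (k // 2) ** 2,  # between hosts in different pods
    }

print(fat_tree_sizes(4))    # 16 hosts, 4 core switches
print(fat_tree_sizes(48))   # 27648 hosts from 48-port switches
```

The attraction for data centres: full bisection bandwidth with nothing but cheap, identical commodity switches, at the cost of many links and many parallel paths to manage.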
Questions to answer
• Which path to use?
  • To exploit the entire bisection bandwidth, without overload
• Options
  • Central point
  • Valiant load balancing
  • Equal-Cost Multi-Pathing (ECMP)
    • Choose path by hashing (see the sketch below)
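ECMP typically hashes a flow's five-tuple and uses the result to pick one of the equal-cost next hops, so all packets of one flow stay on one path (no reordering) while different flows spread across the fabric. A minimal sketch with a hypothetical uplink list; real switches use hardware hash functions, hashlib merely stands in here.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one of several equal-cost next hops by hashing the 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

uplinks = ["agg-1", "agg-2", "agg-3", "agg-4"]   # hypothetical ToR uplinks
print(ecmp_next_hop("10.0.1.5", "10.0.9.7", 40123, 443, "tcp", uplinks))
```

The well-known drawback: two large flows can hash onto the same uplink and collide, which is what central schedulers such as Hedera try to avoid.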
Papers to know
• If we had time, we would now talk about:
  • PortLand
  • VL2
  • Helios
  • Hedera
Networking issues
• How to make sure forwarding works in a huge L2 domain?
  • With multi-pathing, so no spanning-tree solution is plausible
• One approach: IETF TRILL (Transparent Interconnection of Lots of Links)
  • Idea: start from a plain Ethernet with bridges that run spanning tree
  • But replace (a subset of) the bridges with Routing Bridges (RBridges)
  • Operating on L2
  • Still looks like one giant Ethernet domain to IP
• Other buzzwords: VLAN bridging (802.1Q), Provider Bridging (802.1ad), Provider Backbone Bridging (802.1ah), Shortest Path Bridging (IEEE 802.1aq), data center bridging (802.1Qaz, 802.1Qbb, 802.1Qau)
TRILL operations
• RBridges find each other using a link-state protocol
• Do routing on these link states
• Along the computed paths, tunnel over the RBridges
  • Needs an extra header; the first RBridge encapsulates the packet
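A rough sketch of that data path: RBridges flood link states, compute shortest paths over them, and the ingress RBridge wraps the original Ethernet frame in a TRILL header carrying ingress/egress nicknames and a hop count. The header fields follow the TRILL header (RFC 6325) in spirit, but the Python representation, topology, and names are simplified assumptions.

```python
import heapq
from dataclasses import dataclass

# Link-state database: cost of each RBridge-to-RBridge link (toy example).
LSDB = {
    ("RB1", "RB2"): 1, ("RB2", "RB3"): 1,
    ("RB1", "RB4"): 1, ("RB4", "RB3"): 1,
}

def shortest_path(src, dst):
    """Dijkstra over the link-state database (links treated as undirected)."""
    graph = {}
    for (a, b), cost in LSDB.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue, seen = [(0, src, [src])], set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, cost in graph.get(node, []):
            heapq.heappush(queue, (dist + cost, nxt, path + [nxt]))
    return None

@dataclass
class TrillFrame:
    ingress: str        # nickname of the first (encapsulating) RBridge
    egress: str         # nickname of the RBridge closest to the destination
    hop_count: int      # decremented per hop, like an IP TTL
    inner_frame: bytes  # the original Ethernet frame, carried unchanged

path = shortest_path("RB1", "RB3")
frame = TrillFrame(ingress="RB1", egress="RB3", hop_count=len(path),
                   inner_frame=b"original Ethernet frame")
print(path, frame)
```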
Case study: Jupiter rising
• Google SIGCOMM paper, 2015
  • https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf
  • Figures and tables taken from that paper
• Describes the evolution of Google's internal data-center networks
• Starting point: ToRs connected to a ring of routers
(Figure 9: A 128x10G port Watchtower chassis (top left). The internal non-blocking topology over eight linecards (bottom left). Four chassis housed in two racks, cabled with fiber (right).)
Jupiter rising
Jupiter rising: Challenges
Table 1: High-level summary of the challenges faced and the approach to address them (paper section numbers in parentheses).
• Introducing the network to production → Initially deploy as bag-on-the-side with a fail-safe big-red button (3.2)
• High availability from cheaper components → Redundancy in fabric, diversity in deployment, robust software, necessary protocols only, reliable out-of-band control plane (3.2, 3.3, 5.1)
• High fiber count for deployment → Cable bundling to optimize and expedite deployment (3.3)
• Individual racks can leverage full uplink capacity to external clusters → Introduce Cluster Border Routers to aggregate external bandwidth shared by all server racks (4.1)
• Incremental deployment → Depopulate switches and optics (3.3)
• Routing scalability → Scalable in-house IGP, centralized topology view and route control (5.2)
• Interoperate with external vendor gear → Use standard BGP between Cluster Border Routers and vendor gear (5.2.5)
• Small on-chip buffers → Congestion window bounding on servers, ECN, dynamic sharing of chip buffers, QoS (6.1)
• Routing with massive multipath → Granular control over ECMP tables with proprietary IGP (5.1)
• Operating at scale → Leverage existing server installation and monitoring software; tools build and operate the fabric as a whole; move beyond an individual chassis-centric network view; single cluster-wide configuration (5.3)
• Inter-cluster networking → Portable software, modular hardware in other applications in the network hierarchy (4.2)
Jupiter rising: Generations
Table 2: Multiple generations of datacenter networks. (B) indicates blocking, (NB) indicates nonblocking.
• Four-Post CRs (2004): vendor silicon; ToR 48x1G; no aggregation or spine blocks; fabric 10G; host 1G; bisection BW 2T
• Firehose 1.0 (2005): merchant silicon 8x10G, 4x10G (ToR); ToR 2x10G up, 24x1G down; aggregation block 2x32x10G (B); spine block 32x10G (NB); fabric 10G; host 1G; bisection BW 10T
• Firehose 1.1 (2006): merchant silicon 8x10G; ToR 4x10G up, 48x1G down; aggregation block 64x10G (B); spine block 32x10G (NB); fabric 10G; host 1G; bisection BW 10T
• Watchtower (2008): merchant silicon 16x10G; ToR 4x10G up, 48x1G down; aggregation block 4x128x10G (NB); spine block 128x10G (NB); fabric 10G; host nx1G; bisection BW 82T
• Saturn (2009): merchant silicon 24x10G; ToR 24x10G; aggregation block 4x288x10G (NB); spine block 288x10G (NB); fabric 10G; host nx10G; bisection BW 207T
• Jupiter (2012): merchant silicon 16x40G; ToR 16x40G; aggregation block 8x128x40G (B); spine block 128x40G (NB); fabric 10/40G; host nx10G / nx40G; bisection BW 1.3P
Jupiter rising: Firehose 1.0
(Figure 5: Firehose 1.0 topology. Top right shows a sample 8x10G port fabric board in Firehose 1.0, which formed Stages 2, 3 or 4 of the topology.)
Jupiter rising: Saturn
(Figure 12: Components of a Saturn fabric. A 24x10G Pluto ToR switch and a 12-linecard 288x10G Saturn chassis (including logical topology) built from the same switch chip. Four Saturn chassis housed in two racks, cabled with fiber (right).)
Jupiter rising: Jupiter
(Figure 13: Building blocks used in the Jupiter topology. Figure 14: Jupiter Middle Blocks housed in racks. Figure 15: Four options to connect to the external network layer. Figure 16: Two-stage fabrics used for inter-cluster and intra-campus connectivity.)