Networking Challenges for the Next Decade
Amin Vahdat
On behalf of Google Technical Infrastructure and Google Cloud Platform
April 4, 2017
Google Network: More than a collection of data centers [Map: network fiber, including the submarine cables FASTER (US, JP, TW) 2016, SJC (JP, HK, SG) 2013, and Unity (US, JP) 2010; points of presence >100; Google Global Cache edge nodes]
Google Cloud Regions: Adding 11 new regions [Map: current and future regions, each labeled with its number of zones. Regions shown: Oregon, California, Iowa, S Carolina, N Virginia, Montreal, São Paulo, London, Belgium, Netherlands, Frankfurt, Finland, Mumbai, Singapore, Taiwan, Tokyo, Sydney]
Ubiquitous Cloud...10x Scaling
Datacenter (10x): Next-gen disaggregation of storage, memory and compute
Campus & Metro (10x): Cloud regions and campus expansion driving DC interconnect
WAN (10x): Cloud replication and bandwidth intensive cloud services (e.g., turnkey video, IoT)
Step Function Disruptions: Bandwidth, Latency, Availability, Predictability
The Pillars of SDN @ Google: B4 (WAN Interconnect), Andromeda (NFV and network virtualization), Jupiter (Datacenter Networking)
The Pillars of SDN @ Google: B4 (WAN Interconnect), Andromeda (NFV and network virtualization), Jupiter (Datacenter Networking), Espresso (SDN for public Internet)
B4: Google's Software Defined WAN B4: [Jain et al, SIGCOMM 13] BwE: [Kumar et al, SIGCOMM 15]
B4: From Copy Network to Business Critical [Chart: B4 traffic, 2012 to 2016]
Andromeda [Diagram: virtual networks (VNET: 10.1.1/24, VNET: 192.168.32/24, VNET: 5.4/16) and Google Infrastructure Services (Load Balancing, DoS, ACLs, VPN, NFV) running over ToR switches on the internal network (10.1.1/24, 10.1.2/24, 10.1.3/24, 10.1.4/24)]
Google Datacenter Network Innovation: And hardware scale that we could not buy [Chart: capacity over time across fabric generations 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter; annotation: 1.3Pb/s clusters in 2013]
The Pillars of SDN @ Google: B4 (WAN Interconnect), Andromeda (NFV and network virtualization), Jupiter (Datacenter Networking), Public Internet?
The Pillars of SDN @ Google: B4 (WAN Interconnect), Andromeda (NFV and network virtualization), Jupiter (Datacenter Networking), Espresso (SDN for public Internet)
Espresso in Context B4 Jupiter Data Center Google
Espresso in Context Peering Metro B2 B4 Jupiter Data Center Google Google
Espresso in Context User Peering Metro B2 Espresso B4 Jupiter Data Center Google Internet Google
Espresso: Before and After Router Espresso Cloud 1.0 Centric SDN Protocols Peering Local view Per-metro and global view Connectivity first Application signals Coarse fault recovery Real-time optimization
Espresso Architecture Overview Espresso Metro Peering Fabric BGP speaker Label-switched Fabric eBGP Peering External Peer
Espresso Architecture Overview Espresso Metro Host Peering Fabric Host Host Host Host Host BGP Packet speaker Processor Host Labeled packets Label-switched Host specify egress Fabric Host Host Host eBGP Peering External Peer
Espresso Architecture Overview [Diagram: Global Controller with Application Signals; within each Espresso Metro, Local Control, hosts running a BGP speaker and a Packet Processor, and a label-switched Peering Fabric; labeled packets specify egress; eBGP peering with the External Peer]
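A minimal sketch of the "labeled packets specify egress" idea from the overview, assuming a host-side table that maps destination prefixes to labels and a fabric that forwards on the label alone. The class names, label values, and peer names are hypothetical illustrations, not Espresso's actual interfaces.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPacket:
    label: int      # hypothetical label identifying the egress toward a peer
    dst: str
    payload: bytes


class HostPacketProcessor:
    """Hypothetical host-side table: destination prefix -> egress label."""

    def __init__(self):
        self._table = {}

    def program(self, prefix, label):
        # In this sketch, a controller would call this to install a mapping.
        self._table[ipaddress.ip_network(prefix)] = label

    def encapsulate(self, dst, payload):
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self._table if addr in net]
        if not matches:
            raise LookupError(f"no egress programmed for {dst}")
        # Longest-prefix match picks the most specific mapping.
        best = max(matches, key=lambda net: net.prefixlen)
        return LabeledPacket(self._table[best], dst, payload)


class PeeringFabric:
    """Hypothetical label-switched fabric: forwards on the label only."""

    def __init__(self, label_to_peer):
        self._label_to_peer = dict(label_to_peer)

    def forward(self, packet):
        return self._label_to_peer[packet.label]


host = HostPacketProcessor()
host.program("203.0.113.0/24", 100)
host.program("203.0.113.128/25", 200)
fabric = PeeringFabric({100: "peer-A", 200: "peer-B"})
print(fabric.forward(host.encapsulate("203.0.113.200", b"hello")))
```

The fabric here never inspects the destination address, which is the point of the design: the egress decision is made at the host and carried in the label.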
Next Decade Challenges in Networking
The next wave in computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
Last Decade: Cloud 1.0. Virtualization delivers capex savings to enterprise DCs
Now: HW on Demand. Cloud 1.0, Cloud 2.0. Public cloud frees enterprise from private HW infrastructure. Scheduling, load balancing primitives, “big data” query processing
The Third Wave of Cloud Computing: Compute, not servers. Cloud 1.0, Cloud 2.0, Cloud 3.0. Serverless compute, real-time intelligence, and machine learning. Not data placement, load balancing, OS configuration and patching
The Third Wave of Cloud Computing: Cloud 1.0, Cloud 2.0, Cloud 3.0. Networking should be aiming for Cloud 3.0
Networking and Cloud 3.0
• Storage disaggregation: the datacenter is the storage appliance
• Seamless telemetry and scale up/down
• Transparent live migration
• Open Marketplace of services, securely placed and accessed
Networking and Cloud 3.0
• Applications+Functions, not VMs
• Policy, not middleboxes
• Actionable Intelligence, not data processing
• SLOs, not placement/load balancing/scheduling
Next Decade Challenges in Networking
• The network will enable next-generation compute infrastructure
• The network can define next-generation storage infrastructure
• The right network infrastructure can deliver fundamental new capability
How we Prioritize Infrastructure Work Performance Stranding Velocity Manageability Availability
Availability is Paramount
• First things first: an insecure infrastructure is an unavailable infrastructure
• Stability is more important than efficiency
• Network management is critical
• Configuration is hard
• Automation matters but can be counter to availability
“Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure.” SIGCOMM 2016.
Build for Velocity
• Velocity is the speed of iteration
• Retrospective on “Tussle in Cyberspace: Defining Tomorrow’s Internet”
• Build for hitless upgrades and self-validation
• Debugging and tracing matter
  ○ Without visibility, performance does not matter
• Network fabrics built for expansion and evolution
• Launch and Iterate
Isolation is Critical; Stranding is Terrible
Isolation with reservations is easy but leads to huge resource stranding
● General-purpose, shared infrastructure to approximate custom-built and reserved
Isolation has many components
● Latency, bandwidth, but also the control plane
● Accounting and chargeback are big missing pieces
Congestion Control is still really hard
● Rationalizing multiple control loops: flow, endpoint, flow group, Traffic Engineering
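To make the stranding point concrete, here is a small sketch that measures how much reserved capacity sits idle when each tenant reserves for its own peak. The function name and the tenant numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
def stranded_fraction(reserved, used):
    """Fraction of reserved capacity that is idle (stranded).

    reserved[i] is what tenant i has set aside; used[i] is what it
    actually consumes. Usage is capped at the reservation.
    """
    if len(reserved) != len(used):
        raise ValueError("reserved and used must have the same length")
    total_reserved = sum(reserved)
    if total_reserved <= 0:
        raise ValueError("total reservation must be positive")
    total_used = sum(min(u, r) for u, r in zip(used, reserved))
    return (total_reserved - total_used) / total_reserved


# Hypothetical example: three tenants each reserve 10 units for peak load
# but typically use 2, 3, and 1 units.
print(round(stranded_fraction([10, 10, 10], [2, 3, 1]), 2))  # 0.8
```

Under these assumed numbers, 80% of the reserved capacity is stranded, which is the cost that shared infrastructure tries to recover while still approximating reserved-style isolation.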
Performance only Matters if End to End
Amdahl’s law applies and so an incredible, localized optimization that takes any effort to adopt will be ignored
1. Scale
2. Jitter
3. Storage Disaggregation
Must optimize from the application all the way to the end user
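The Amdahl's law argument can be checked with a few lines of arithmetic. This sketch uses the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of end-to-end time affected and s is the local speedup; the 10% and 100x figures are hypothetical.

```python
def amdahl_speedup(fraction_improved, local_speedup):
    """End-to-end speedup when only part of the total time gets faster."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Hypothetical example: a component that accounts for 10% of end-to-end
# time is made 100x faster.
print(round(amdahl_speedup(0.10, 100.0), 2))  # 1.11
```

Even a 100x local improvement yields only about an 11% end-to-end gain in this example, which is why the slide argues for optimizing from the application all the way to the end user.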
How we Prioritize Infrastructure Work Performance Stranding Velocity Manageability Availability
Next Decade Challenges in Networking
The next wave of computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
Thank You!
Open Source [Diagram: Google systems Borg, MapReduce, Bigtable, Dremel]
Open Source: TCP BBR, OpenConfig, QUIC, gRPC, ...