The Forwarding Plane: An Old New Frontier of Networking Research
CS244, Spring 2019
Changhoon Kim, chang@barefootnetworks.com
What is SDN in plain English?
• Ideally, at a level a college freshman can follow
  – Because if you can’t explain it that simply, you don’t really understand it! [Feynman’s guiding principle]
“Making programming networks as easy as programming computers.”
Natural questions that follow
“Making programming networks as easy as programming computers.”
• Why should we program a network?
  – To realize some “beautiful ideas” easily, preferably on our own
• What are those “beautiful ideas”?
  – Any impactful or intriguing apps in particular?
• Why couldn’t we do this easily in the pre-SDN era?
  – Any fundamental shifts happening?
Pre-SDN state of the network industry
[Diagram: a feature request flows from the network owner to the network equipment vendor, whose software engineering and ASIC teams each take years to deliver it.]
Compared to other industries, this is very unnatural
• Because we all know how to realize our own ideas by programming CPUs, GPUs, TPUs, etc.
  – Programs are used in every phase (implement, verify, test, deploy, and maintain)
  – Extremely fast iteration and differentiation
  – We own our own ideas
  – A sustainable ecosystem where all participants benefit
Can we replicate this healthy, sustainable ecosystem for networking?
What SDN pioneers had realized …
[Diagram: the same feature pipeline, with the software team moving from the network equipment vendor to the network owner; the vendor’s ASIC team still takes years.]
And, SDN started to unfold …
[Diagram: the network owner now gets control-plane features from a software vendor in weeks to months, while forwarding-plane features still come from the network ASIC vendor and take years.]
And, SDN started to unfold …
[Diagram: the control plane becomes an innovation-rich, programmable layer serving various software projects in days to weeks, while the forwarding plane remains an innovation-deprived, ossified layer on a years-long cycle.]
Reality: Packet forwarding speeds
[Chart: per-chip forwarding speed in Gb/s, log scale, 1990-2020. Switch chips reach 6.4 Tb/s, far outpacing CPUs.]
Reality: Packet forwarding speeds
[Same chart, annotated: switch chips forward roughly 80x faster than CPUs. Is the forwarding plane an unaccommodating, performance-dominated zone?]
“Programmable switches are 10-100x slower than fixed-function switches. They cost more and consume more power.”
– Conventional wisdom in networking
Evidence: Tofino 6.5Tb/s switch (arrived Dec 2016)
The world’s fastest and most programmable switch.
No power or cost penalty compared to fixed-function switches.
An incarnation of PISA (Protocol Independent Switch Architecture)
Domain-specific processors
[Diagram: each domain pairs a language, a compiler, and a domain-specific processor: Java → CPU (computers), OpenCL → GPU (graphics), TensorFlow → TPU (machine learning), Matlab → DSP (signal processing). For networking, the language, compiler, and chip are all question marks.]
Domain-specific processors
[Same diagram, with networking filled in: P4 → compiler → PISA (Protocol-Independent Switch Architecture).]
PISA: An architecture for high-speed programmable packet forwarding
PISA: Protocol Independent Switch Architecture
[Diagram: a programmable parser followed by match-action units, each pairing match memory with an action ALU.]
PISA: Protocol Independent Switch Architecture
[Diagram: programmable parser → ingress match-action pipeline → buffer → egress match-action pipeline.]
PISA: Protocol Independent Switch Architecture
• Match logic: a mix of SRAM and TCAM for lookup tables, counters, meters, and generic hash tables
• Action logic: ALUs for standard boolean and arithmetic operations, header modification, hashing, etc.
[Diagram: programmable parser → ingress match-action stages (pre-switching) → buffer → egress match-action stages (post-switching), with a packet generator, a recirculation path, and the CPU (control plane) attached.]
A generalization of RMT [SIGCOMM’13]
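In P4, the split between SRAM and TCAM shows up through a table’s match kinds: exact matches typically compile to SRAM hash tables, while ternary and longest-prefix matches need TCAM. A hedged sketch (the table and action names here are illustrative, not from the talk):

```p4
// Hypothetical P4_16 table fragments showing how match kinds map onto
// PISA match memory on a typical target.
table ipv4_lpm {
    key = { hdr.ipv4.dstAddr : lpm; }        // longest-prefix match -> TCAM
    actions = { set_next_hop; drop_pkt; }
}

table acl {
    key = {
        hdr.ipv4.srcAddr  : ternary;          // wildcard match -> TCAM
        hdr.ipv4.protocol : exact;            // exact match -> SRAM hash table
    }
    actions = { permit; deny; }
}
```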
Why we call it protocol-independent packet processing
Device does not understand any protocols until it gets programmed
[Diagram: a logical data-plane view (your P4 program) with L2, IPv4, IPv6, and ACL tables, mapped onto a switch pipeline of programmable parser, match tables with action ALUs, and fixed-action queues.]
Mapping a logical data-plane design to physical resources
[Diagram: each logical table in the P4 program (L2, IPv4, IPv6, ACL) is compiled onto physical match-table memory and action ALUs (action macros) in successive pipeline stages, ending at the queues.]
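The logical view in the slide is just the program’s control flow; the compiler decides which physical stages each table lands in. A minimal sketch of what that control flow might look like in P4_16 (table and header names are hypothetical):

```p4
// Illustrative ingress control flow: apply L2, then IPv4 or IPv6, then ACL.
// The compiler maps each table to one or more physical match-action stages.
apply {
    l2_table.apply();
    if (hdr.ipv4.isValid()) {
        ipv4_table.apply();
    } else if (hdr.ipv6.isValid()) {
        ipv6_table.apply();
    }
    acl_table.apply();
}
```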
Re-programming in the field
[Diagram: the same pipeline, recompiled so that a new MyEncap table and action replace part of the original program (IPv4/IPv6 tables and actions) without changing the hardware.]
P4 language components
• Parser program: state machine; field extraction
• Match tables + actions: table lookup and update; field manipulation
• Control flow
• Deparser program: field assembly
Not in the language: memory (pointers), loops, recursion, floating point
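The four components above map one-to-one onto the blocks of a P4_16 program. A minimal sketch against the open-source v1model architecture; every name here is illustrative, not from the talk:

```p4
// Minimal P4_16 program for a v1model (PISA-style) target.
#include <core.p4>
#include <v1model.p4>

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

struct headers_t  { ethernet_t ethernet; }
struct metadata_t { }

// 1. Parser program: a state machine that extracts header fields.
parser MyParser(packet_in pkt, out headers_t hdr,
                inout metadata_t meta,
                inout standard_metadata_t std_meta) {
    state start {
        pkt.extract(hdr.ethernet);
        transition accept;
    }
}

// 2. Match table + actions, and 3. control flow.
control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t std_meta) {
    action forward(bit<9> port) { std_meta.egress_spec = port; }
    action drop_pkt() { mark_to_drop(std_meta); }

    table l2_fwd {
        key = { hdr.ethernet.dstAddr : exact; }
        actions = { forward; drop_pkt; }
        default_action = drop_pkt();
    }
    apply { l2_fwd.apply(); }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t std_meta) { apply { } }
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

// 4. Deparser program: reassembles the headers back onto the wire.
control MyDeparser(packet_out pkt, in headers_t hdr) {
    apply { pkt.emit(hdr.ethernet); }
}

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```

Note what is absent, matching the slide: no pointers, no loops, no recursion, no floating point; the parser is a finite state machine and the control flow is a straight-line table-application program.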
Questions and critiques …?
§ What does a compiler do?
§ What’s the latest on P4? Have you heard of P4_16?
§ How do you update tables at runtime?
§ Why is it important to derive a runtime API from a P4 program?
§ What about queueing, scheduling, and congestion control?
What exactly does a compiler do?
[Diagram: one match-action stage in detail. The PHV (Packet Header Vector) feeds a crossbar that selects the match key; the match table (SRAM or TCAM), aided by a hash generator, yields action parameters, instructions, and constants; ALUs and action memory rewrite the PHV passed to the next stage.]
P4_16: Why and how?
§ Embrace target heterogeneity without language churn
  § Architectural heterogeneity via architecture-language separation
  § Functional heterogeneity via extern types
§ Help reuse code more easily: portability and composability
  § Standard architecture and standard library
  § Local name spaces, local variables, lexical scoping, parameterization, and sub-procedure-like constructs
§ Make P4 programs more intuitive and explicit
  § Expressions, sequential execution semantics for actions, strong typing, and explicit deparsing
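Extern types are how an architecture exposes target-specific capabilities (stateful memory, checksums, hashes) without baking them into the language. A hedged sketch using the v1model register extern to keep a per-port packet counter (the names `port_pkts` and `count_port` are hypothetical):

```p4
// Stateful per-port packet counter via the v1model register extern.
// The register is an architecture-provided extern, not a core-language type.
register<bit<32>>(512) port_pkts;   // 512 cells of 32-bit state

action count_port() {
    bit<32> n;
    port_pkts.read(n, (bit<32>) std_meta.ingress_port);   // fetch current count
    port_pkts.write((bit<32>) std_meta.ingress_port, n + 1);  // increment it
}
```

A different architecture can offer different externs with the same core language, which is the architecture-language separation the slide describes.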
To recap: Why data-plane programming?
1. New features: Realize your beautiful ideas very quickly
2. Reduce complexity: Remove unnecessary features and tables
3. Efficient use of H/W resources: Achieve the biggest bang for the buck
4. Greater visibility: New diagnostics, telemetry, OAM, etc.
5. Modularity: Compose forwarding behavior from libraries
6. Portability: Specify forwarding behavior once; compile to many devices
7. Own your own ideas: No need to share your ideas with others
“Protocols are being lifted off chips and into software” – Ben Horowitz
What kind of “stunt” can you do by programming data planes?
§ Advanced network measurement, analysis, and diagnostics
  § In-band Network Telemetry [SIGCOMM’15], Packet History [NSDI’14], FlowRadar [NSDI’16], Marple [SIGCOMM’17]
§ Advanced congestion control
  § RCP, XCP, TeXCP, DCQCN++, Timely++
§ Novel DC network fabrics
  § Flowlet switching, CONGA [SIGCOMM’15], HULA [SOSR’16], NDP [SIGCOMM’17]
§ World’s fastest middleboxes
  § L4 connection load balancing [SIGCOMM’17], TCP SYN authentication, etc.
§ Offloading parts of distributed apps
  § NetCache [SOSP’17], NetChain [NSDI’18], SwitchPaxos [SOSR’15, ACM CCR’16]
§ Jointly optimizing the network and the apps running on it
  § Mostly-ordered Multicast [NSDI’15, SOSP’15]
§ And many more … we’re just starting to scratch the surface!
PISA: An architecture for high-speed programmable I/O event processing, not just packet forwarding
What we have seen so far: Accelerating parts of computing with PISA
1. DNS cache
2. Key-value cache [NetCache, SOSP’17]
3. Key-value replication [NetChain, NSDI’18]
4. Consensus acceleration [P4xos, CCR’16; Eris, SOSP’17]
5. Parameter service for distributed deep learning
6. Pub-sub service
7. String searching [PPS, SOSR’19]
8. Pre-processing DB queries and streams
NetCache: Accelerating KV caching
Suppose a KV cluster coping with a highly-skewed & rapidly-changing workload
[Diagram: gets and puts flow through a ToR switch to the KV servers; per-server load is highly skewed.]
Suppose a KV cluster coping with a highly-skewed & rapidly-changing workload
[Diagram: gets and puts flow through a ToR switch to the KV servers; per-server load is highly skewed.]
Q: How can you ensure high throughput and bounded tail latency?
Here comes the problem
[Chart: throughput in billion queries per second for NoCache vs. NetCache (servers) vs. NetCache (cache), under uniform, zipf-0.9, zipf-0.95, and zipf-0.99 workload distributions.]
What if we had a very fast front-end server?
[Diagram: gets and puts pass through the ToR switch to a front-end server holding a read-only cache that handles hot keys directly, with the KV servers behind it.]
Q: How big and fast should the front-end cache be?
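NetCache’s answer is to put that read-only cache in the ToR switch itself. A heavily simplified sketch of the data-plane side: an exact-match table on the request key decides whether the item is cached, and the hit action fills the reply from on-switch register state so hot keys never reach the servers. All names (`kv_hdr`, `cache_value`, `cache_hit`) are hypothetical, and the real system also tracks key hotness and handles multi-slot values:

```p4
// Hedged sketch of a NetCache-style on-switch read-only cache.
register<bit<32>>(65536) cache_value;     // value store in switch SRAM

action cache_hit(bit<32> idx) {
    cache_value.read(kv_hdr.value, idx);  // fill the reply from on-switch state
    kv_hdr.is_hit = 1;                    // mark as served by the cache
    // (then swap src/dst addresses and bounce the packet back to the client)
}

table kv_cache {
    key = { kv_hdr.key : exact; }
    actions = { cache_hit; NoAction; }    // miss -> forward to the KV server
    default_action = NoAction();
}
```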