Microsoft’s Production Configurable Cloud (Derek Chiou, Microsoft)


  1. Microsoft’s Production Configurable Cloud • Derek Chiou • Microsoft Azure Cloud Silicon • UT Austin • H2RC, Nov 14, 2016

  2. Today’s Data Centers • O(100K) servers/data center • Very dense, maximize number of servers • Tens of megawatts • Strict power and cooling requirements • Secure, hot, noisy • Incrementally upgraded • 3-year server depreciation, upgraded quarterly • Applications change very rapidly (weekly, monthly) • Many advantages, including economies of scale, data all in one place, etc. • At data center scale, an improvement need not be an order of magnitude to make sense • Positive ROI is easier to achieve at large scale • How can we improve efficiency?

  3. Efficiency via Specialization • [Chart labels: FPGAs, ASICs] • Source: Bob Brodersen, Berkeley Wireless group

  4. What Does a Data Center Server With an FPGA Look Like? • Depends on your point of view

  5. Classic View of Computer • [Diagram labels: CPU, DRAM, storage, network]

  6. Networking View of Computer • [Diagram labels: CPU, DRAM, storage, network, network “offload” accelerator (Acc)]

  7. “Offload” Accelerator View of Server • [Diagram labels: CPU, DRAM, storage, NIC, network, Intel MCP, Acc (three)]

  8. Our View of a Data Center Computer • [Diagram labels: CPU (two), DRAM (three), FPGA, Acc, storage, network]

  9. Benefits • Software receives packets slowly: interrupt or polling, then parse the packet and start the right work • The FPGA processes every packet anyway: packet arrival is an event the FPGA deals with; it identifies FPGA work and passes CPU work to the CPU • Map common-case work to the FPGA: the processor never sees the packet; the FPGA can read/modify system memory to keep app state consistent • The CPU is a complexity offload engine for the FPGA! • Many possibilities: distributed machine learning, software-defined networking, memcached get
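
The division of labor on this slide can be sketched in software. This is a minimal illustration only, not the Catapult implementation: `Packet`, `FpgaFrontEnd`, the `kind` tag, and the dictionary standing in for application state are all hypothetical names chosen for the example. The common case (a memcached-style get that hits) is answered without the CPU ever seeing the packet; everything else is passed up to software.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    kind: str        # hypothetical tag, e.g. "memcached_get" or "other"
    key: str = ""


class FpgaFrontEnd:
    """Toy model of an FPGA that sees every packet before the CPU does."""

    def __init__(self, cache):
        self.cache = cache      # stands in for app state kept in system memory
        self.cpu_queue = []     # packets handed up to software

    def on_packet(self, pkt):
        # Common case: a get that hits is served here; the CPU never sees it.
        if pkt.kind == "memcached_get" and pkt.key in self.cache:
            return self.cache[pkt.key]
        # Uncommon or complex case: the CPU acts as the offload engine.
        self.cpu_queue.append(pkt)
        return None


fpga = FpgaFrontEnd({"user:1": b"alice"})
hit = fpga.on_packet(Packet("memcached_get", "user:1"))
miss = fpga.on_packet(Packet("memcached_get", "user:2"))
```

Here `hit` is served from the fast path and `miss` ends up in `cpu_queue`, which mirrors the "CPU is a complexity offload engine" framing.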

  10. Converged Bing/Azure Architecture: Catapult v2 Mezzanine Card • [Diagram labels: WCS 2.0 Server Blade, Catapult V2, CPU (two), QPI, DRAM (three), FPGA, NIC, Gen3 2x8, Gen3 x8, QSFP (three), 40Gb/s (two), Switch] • [Photo labels: WCS Gen4.1 Blade with Mellanox NIC and Catapult FPGA, Pikes Peak, Option Card Mezzanine Connectors, WCS Tray Backplane] • Completely flexible architecture: 1. Local compute accelerator 2. Remote compute accelerator 3. Network/storage accelerator

  11. Network Connectivity (IP)

  12. Case 1: Local compute accelerator • Bing Ranking as a Service

  13. Bing Document Ranking Flow • [Diagram: Query → Selection as a Service (SaaS 1 to SaaS 48) → selected documents → Ranking as a Service (RaaS 1 to RaaS 48) → 10 blue links; each stage is drawn over IFM 1 to IFM 44] • Selection-as-a-Service (SaaS): find all docs that contain query terms; filter and select candidate documents for ranking • Ranking-as-a-Service (RaaS): compute scores for how relevant each selected document is for the search query; sort the scores and return the results
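
The two stages described on this slide can be sketched as a select-then-rank pipeline. This is a toy illustration under stated assumptions, not Bing's algorithm: `select` requires that a document contain every query term (the slide does not say whether all or any terms must match), and the `score` function, a simple sum of term counts, is a hypothetical stand-in for the real relevance model.

```python
def select(query, docs):
    """Selection stage: keep documents that contain every query term (assumed)."""
    terms = set(query.lower().split())
    return [d for d in docs if terms <= set(d.lower().split())]


def rank(query, selected, top_k=10):
    """Ranking stage: score each selected document, sort, return the top results."""
    terms = query.lower().split()

    def score(doc):
        tokens = doc.lower().split()
        # Hypothetical relevance score: total occurrences of the query terms.
        return sum(tokens.count(t) for t in terms)

    return sorted(selected, key=score, reverse=True)[:top_k]


docs = [
    "fpga configuration guide",
    "fpga fpga configuration configuration",
    "cpu guide",
]
results = rank("FPGA configuration", select("FPGA configuration", docs))
```

The default `top_k=10` echoes the "10 blue links" in the diagram; the selection stage cheaply narrows the candidate set so that the more expensive scoring only runs on documents that can matter.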

  14. FE: Feature Extraction • Query: “FPGA Configuration” • {Query, Document} → NumberOfOccurrences_0 = 7, NumberOfOccurrences_1 = 4, NumberOfTuples_0_1 = 1 • ~4K dynamic features, ~2K synthetic features → L2 score
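
A rough sketch of how features with these names might be computed from a {query, document} pair. The slide gives only the feature names and example values, so the definitions here are assumptions: `NumberOfOccurrences_i` is taken to be the count of query term `i` in the document, and `NumberOfTuples_i_j` the number of places where term `i` is immediately followed by term `j`. The function name `extract_features` and the whitespace tokenization are likewise illustrative.

```python
def extract_features(query, document):
    """Compute toy per-term and adjacent-pair counts for a {query, document} pair."""
    terms = query.lower().split()
    tokens = document.lower().split()
    features = {}

    # Assumed meaning: occurrences of each query term in the document.
    for i, term in enumerate(terms):
        features[f"NumberOfOccurrences_{i}"] = tokens.count(term)

    # Assumed meaning: positions where term i is directly followed by term i+1.
    for i in range(len(terms) - 1):
        adjacent = zip(tokens, tokens[1:])
        features[f"NumberOfTuples_{i}_{i + 1}"] = sum(
            1 for a, b in adjacent if a == terms[i] and b == terms[i + 1]
        )
    return features


features = extract_features(
    "FPGA Configuration",
    "FPGA configuration loads an FPGA bitstream",
)
```

For this short document the result contains two occurrences of term 0, one of term 1, and one adjacent tuple; the production system computes thousands of such features per document before scoring.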

  15. Feature Extraction Accelerator • [Diagram labels: PCIe, Compressed Document Stream, Stream Preprocessing FSM, Control/Data Tokens, Distribution latches, Feature Gathering Network, Free Form Expression (FFE)]

  16. Bing Production Results • [Plot: 99.9% query latency versus queries/sec, HW vs. SW latency and load; series: 99.9% software latency, 99.9% FPGA latency, average software query load, average FPGA query load]

  17. Case 2: Remote accelerator

  18. Feature Extraction FPGA Faster Than Needed • A single feature-extraction FPGA is much faster than a single server requires • Result: wasted capacity and/or wasted FPGA resources • Two choices: somehow reduce performance and save FPGA resources, or allow multiple servers to use a single FPGA? • Use the network to transfer requests and return responses

  19. Inter-FPGA communication • FPGAs can encapsulate their own UDP packets • Low-latency inter-FPGA communication (LTL) • Can provide strong network primitives • But this topology opens up other opportunities • [Diagram labels: SP0 to SP3; L1/L2: CS0 to CS3; L0: ToR switches; eight servers, each with a NIC and an FPGA]

  20. Lightweight Transport Layer (LTL) Latencies • [Plot: round-trip latency (us, 0 to 25) versus number of reachable hosts/FPGAs (1 to 250K, log scale); series: LTL average latency, LTL 99.9th percentile, 6x8 torus latency (can reach up to 48 FPGAs); annotations: example latency histograms for LTL L0 (same ToR), LTL L1, and LTL L2 (different pairs of FPGAs)]

  21. Hardware Acceleration as a Service Across Data Center (or even across Internet) • [Diagram: FPGAs under ToR and CS switches grouped by service: HPC, speech to text, large-scale deep learning, Bing Ranking HW, Bing Ranking SW]

  22. BrainWave: Scaling FPGAs To Ultra-Large DNN Models • Distribute NN models across as many FPGAs as needed (up to thousands) • Use HaaS and LTL to manage multi-FPGA execution • Very close to live production

  23. BrainWave Publicly Demoed • Ignite 2016 • Translation DNN running on FPGAs • 2 orders of magnitude lower latency than CPU implementation • Less than 10% of the power

  24. Case 3: Networking accelerator

  25. FPGA SmartNIC for Cloud Networking • Azure runs Software Defined Networking on the hosts • Software Load Balancer, Virtual Networks: new features each month • Before, we relied on ASICs to scale and to be COGS-competitive at 40G+ • But a 12 to 18 month ASIC cycle plus the time to roll out new HW is too slow to keep up with SDN • SmartNIC gives us the agility of SDN with the speed and COGS of HW • Base SmartNIC provides common functions like crypto, GFT, QoS, RDMA on all hosts • [Diagram: VFP in the VMSwitch holds layered rule/action tables (SLB Decap → Decap, SLB NAT → DNAT, VNET → Rewrite, ACL → Allow, Metering → Meter); the 50G SmartNIC holds the GFT flow/action table, e.g. flow 1.2.3.1->1.3.4.1, 62362->80 with action Decap, DNAT, Rewrite, Meter, alongside QoS, crypto, and RDMA; the VM reaches the SmartNIC through SR-IOV (host bypass)]
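
The flow/action table in the diagram can be illustrated with a small flow cache. This is a sketch under assumptions, not Azure's VFP or GFT code: the `FlowTable` class, the `LAYERS` list, and the choice to drop "Allow" from the cached action list (so that the result matches the "Decap, DNAT, Rewrite, Meter" entry on the slide) are all hypothetical. The idea shown is that the layered rules are evaluated once per flow on a slow path, and later packets of the same flow reuse the cached action list.

```python
from typing import Callable, Dict, List, Tuple

FlowKey = Tuple[str, str, int, int]  # (src_ip, dst_ip, src_port, dst_port)

# Hypothetical layered policy: each layer maps a flow to one action name.
LAYERS: List[Callable[[FlowKey], str]] = [
    lambda flow: "Decap",    # SLB Decap layer
    lambda flow: "DNAT",     # SLB NAT layer
    lambda flow: "Rewrite",  # VNET layer
    lambda flow: "Allow",    # ACL layer
    lambda flow: "Meter",    # Metering layer
]


class FlowTable:
    """Toy exact-match flow cache in front of a layered rule evaluation."""

    def __init__(self, layers):
        self.layers = layers
        self.table: Dict[FlowKey, List[str]] = {}
        self.slow_path_hits = 0

    def actions_for(self, flow: FlowKey) -> List[str]:
        if flow not in self.table:
            # Slow path: walk every layer once, then cache the combined result.
            self.slow_path_hits += 1
            actions = [layer(flow) for layer in self.layers]
            # "Allow" changes nothing in the packet, so it is not cached here.
            self.table[flow] = [a for a in actions if a != "Allow"]
        return self.table[flow]


gft = FlowTable(LAYERS)
flow = ("1.2.3.1", "1.3.4.1", 62362, 80)
first = gft.actions_for(flow)
second = gft.actions_for(flow)
```

After both lookups `slow_path_hits` is still 1: only the first packet of the flow pays for rule evaluation, which is the property that lets the per-packet work move into hardware.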

  26. Azure Accelerated Networking • SR-IOV turned on • VM accesses NIC hardware directly; VM sends messages with no OS/hypervisor call • FPGA determines the flow of each packet and rewrites the header to make it data center compatible • Reduces latency to roughly bare metal • Azure now has the fastest public cloud network • 25Gb/s at 25us latency • Fast crypto developed • [Diagram labels: Guest OS, Hypervisor, VFP, VM, NIC, GFT/FPGA]

  27. We Are Hiring and Collaborating • We are hiring FPGA and software folks • Academic engagements • Research.Microsoft.com/catapult • Will provide boards to a limited number of academics (1-page proposal) • Will be giving access to clusters of up to 48 at TACC • Research grants • Internships • Please contact me if you’re interested • dechiou@microsoft.com • catapult@Microsoft.com

  28. Will Configurable Clouds Change the World? • Being deployed for all new Azure and Bing machines • Many other properties as well • Ability to reprogram a datacenter’s hardware • Specialized compute acceleration • Networking, storage, security • Can turn homogeneous machines into specialized SKUs dynamically • Hyperscale performance with low-latency communication • Exa-ops of performance with an O(10us) diameter • What should we do with the world’s most powerful configurable fabric?
