

  1. (Re)Configurable Clouds and the Dawn of a New Era. Doug Burger @ Microsoft Research NExT. FPL Keynote, August 30, 2016.

  2. [Diagram: Client vs. Cloud and Training vs. Inference, with the labels Humans, GPUs, ASICs, and a question mark.]

  3. Cloud Services: Diversity. 5.8+ billion queries each month; 250+ million active users; 400+ million active accounts worldwide; 2.4+ million emails per day; 8.6+ trillion objects in Microsoft Azure storage; 50+ billion minutes of connections handled each month; 48+ million users in 41 markets; 50+ million active users; 1 in 4 enterprise customers; 200+ cloud services. 1+ billion customers · 20+ million businesses · 90+ markets worldwide.

  4. What Drives a Post-CPU “Enhanced” Cloud? Homogeneity vs. efficiency (ASICs).

  5. Catapult V0: BFB (2011)
     • Commodity SuperMicro servers: 1U rack-mounted, 12 Intel Westmere cores (2 sockets), 2 x 10GbE ports, 3 x16 PCIe slots
     • 6 Xilinx LX240T FPGAs in one appliance per rack
     • All rack machines communicate with the appliance over 1Gb Ethernet

  6. Bing Ranking Implementation Details
     [Block diagram: a preprocessing FSM feeds the FE/FFE stages and a grid of DTT tiles over a feature-transmission network, with control/data tokens and distribution latches.]
     • FE: 89 non-BodyBlock features (34 state machines), 55 BodyBlock features (20 state machines)
     • FFE: 64 cores per chip, 256-512 threads, roughly 45-55% utilization
     • DTT: 48 tiles per chip, 240 tree processors, 2,880 trees per chip (a toy scoring sketch follows below)
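The DTT figures above describe hardware that evaluates large ensembles of decision trees against each document's extracted features. As an illustration only (a minimal sketch with invented features and trees, not the production Bing model or its FPGA implementation), the following Python shows the kind of computation a tree processor performs: walking each tree on a feature vector and summing the leaf contributions.

```python
# Illustrative sketch of decision-tree ensemble scoring, the kind of work the
# DTT tiles perform in hardware. The trees and feature names are invented.

def score_tree(tree, features):
    """Walk one binary decision tree; return the value at the leaf reached."""
    node = tree
    while "leaf" not in node:
        value = features[node["feature"]]
        node = node["left"] if value < node["threshold"] else node["right"]
    return node["leaf"]

def score_document(forest, features):
    """Sum the contributions of every tree in the ensemble."""
    return sum(score_tree(tree, features) for tree in forest)

# A toy ensemble with two trees over two hypothetical features.
forest = [
    {"feature": "num_title_matches", "threshold": 2,
     "left": {"leaf": 0.1}, "right": {"leaf": 0.5}},
    {"feature": "body_match_score", "threshold": 0.5,
     "left": {"leaf": -0.2}, "right": {"leaf": 0.25}},
]
features = {"num_title_matches": 3, "body_match_score": 0.8}
print(score_document(forest, features))  # 0.5 + 0.25 = 0.75
```

On the FPGA, the 2,880 trees per chip are spread across the 240 tree processors and evaluated in parallel rather than one at a time as in this sketch.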

  7. Fundamental flaws of Catapult V0:
     • Additional single point of failure
     • Additional SKU to maintain
     • Too much load on the 1Gb network
     • Inelastic FPGA scaling or stranded capacity

  8. Catapult V1 Card (2012-2013)
     • Altera Stratix V D5: 172.6K ALMs, 2,014 M20Ks (457 KLEs; 1 KLE == ~12K gates; each M20K is a 2.5KB SRAM)
     • PCIe Gen 3 x8, 8GB DDR3
     • 20Gb network among FPGAs
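As a quick sanity check on those capacity numbers (illustrative arithmetic using only the figures on the slide), 457 KLEs at roughly 12K gate-equivalents per KLE and 2,014 M20K blocks at 2.5KB each come to a few million gate-equivalents and about 5MB of on-chip SRAM:

```python
# Back-of-envelope capacity of the Stratix V D5, using the slide's figures.
kles = 457                 # thousands of logic elements
gates_per_kle = 12_000     # ~12K gate-equivalents per KLE (slide's rule of thumb)
m20k_blocks = 2014
m20k_bytes = 2.5 * 1024    # each M20K block is a 2.5KB SRAM

gate_equivalents = kles * gates_per_kle              # ~5.5 million gate-equivalents
on_chip_sram_mb = m20k_blocks * m20k_bytes / 2**20   # ~4.9 MB of block RAM

print(f"~{gate_equivalents/1e6:.1f}M gate-equivalents, ~{on_chip_sram_mb:.1f} MB on-chip SRAM")
```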

  9. Mapped Fabric into a Pod
     [Diagram: 48 servers under one Top-of-Rack switch (TOR), connected by 10Gb Ethernet links; each server hosts an FPGA, and the FPGAs are wired together in a separate torus network.]
     • Low-latency access to a local FPGA
     • Compose multiple FPGAs to accelerate large workloads (see the hop-count sketch below)
     • Low-latency, high-bandwidth sharing of storage and memory across server boundaries
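One reason a torus is attractive for composing FPGAs is that worst-case hop counts stay small. A minimal sketch, assuming the 48 FPGAs in a pod are arranged as a 6x8 two-dimensional torus (the arrangement is an assumption here, not stated on the slide):

```python
# Hop distance between two FPGAs in a 2D torus, assuming the 48 FPGAs in a
# pod are arranged as a 6x8 grid with wraparound links (illustrative only).
ROWS, COLS = 6, 8

def torus_hops(a, b):
    """Manhattan distance with wraparound in both dimensions."""
    (ra, ca), (rb, cb) = a, b
    dr = min(abs(ra - rb), ROWS - abs(ra - rb))
    dc = min(abs(ca - cb), COLS - abs(ca - cb))
    return dr + dc

# Worst case: the farthest pair is only (6 // 2) + (8 // 2) = 7 hops apart.
worst = max(torus_hops((0, 0), (r, c)) for r in range(ROWS) for c in range(COLS))
print(worst)  # 7
```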

  10. 1,632-server pilot deployed in a production datacenter

  11. Fundamental flaws of Catapult V1:
     • Microsoft was converging on a single SKU
     • No one else wanted the secondary network
     • Complex, difficult to handle failures
     • Difficult to service boxes
     • No killer infrastructure accelerator
     • Application presence is too small

  12. Catapult V2 Architecture
     [Diagram: Catapult V2 mezzanine card ("Pikes Peak") in a WCS 2.0 server blade, shown as a WCS Gen4.1 blade with a Mellanox NIC and the Catapult FPGA on the WCS tray backplane. The FPGA sits between the NIC and the 40Gb/s network switch, with PCIe Gen3 links to the CPUs and NIC, its own DRAM, and QSFP ports.]
     • The architecture justifies the economics:
       1. Can act as a local compute accelerator
       2. Can act as a network/storage accelerator
       3. Can act as a remote compute accelerator

  13. (Also need to build a complete platform)
     [Diagram: the hardware acceleration platform spans hardware development, drivers, deployment, product software, and operations. Pieces include: the FPGA Shell (DRAM, PCIe DMA, network, LTL, FIFOs, RAMs, flight recorder, watchdog, health & config, golden image) and the Role with its HW API; HW/RTL libraries (LZMA, JPEG compression, encryption, DTS); the Catapult runtime library, kernel driver, software dev kit, and OpenCL/HLS compiler with an OpenCL BSP; packaging and build (shell and FPGA flow packages, SDK package, HEX queue manager with AutoPilot integration, CloudBuild, license servers); and qualification and test (CSI board & SKU qualification process, factory & integration verification, service team test suites).]

  14. Case 1: Use as a local accelerator

  15. Production Results (December 2015)
     [Chart: 99.9th-percentile query latency versus queries/sec, HW vs. SW latency and load. Series: 99.9% software latency, 99.9% FPGA latency, average FPGA query load, average software load.]

  16. Case 2: Use as an infrastructure accelerator

  17. FPGA SmartNIC for Cloud Networking
     • Azure runs Software Defined Networking on the hosts
     • Software Load Balancer, Virtual Networks: new features each month
     • We rely on ASICs to scale and to be COGS-competitive at 40G+
     • But a 12- to 18-month ASIC cycle, plus the time to roll out new HW, is too slow to keep up with SDN
     • SmartNIC gives us the agility of SDN with the speed and COGS of HW
     • Base SmartNIC will provide common functions like crypto, GFT, QoS, and RDMA on all hosts
     • 40Gb/s networking and 20Gb/s crypto take a significant fraction of a 24-core machine
     • Example: crypto and vswitch inline on the FPGA: 0% CPU cost
     [Diagram: VFP match-action layers (SLB Decap, SLB NAT, VNET, ACL, Metering) are compiled into a single GFT flow table on the SmartNIC; e.g. flow 1.2.3.1->1.3.4.1, 62362->80 maps to the actions Decap, DNAT, Rewrite, Meter. SR-IOV lets VMs bypass the host VMSwitch, with QoS, crypto, and RDMA offloaded at 50G. A toy flow-table lookup is sketched below.]
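To make the GFT idea concrete, here is a minimal, illustrative sketch (not the real GFT data structures or Azure's VFP code) of a flow table keyed on a connection 5-tuple: the first packet of a flow is resolved through the policy layers once, and later packets hit the cached action list.

```python
# Toy sketch of a GFT-style flow cache: match a packet's 5-tuple, then apply
# the cached action list. Names and actions are illustrative, not Azure's API.
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "src_ip dst_ip src_port dst_port proto")

# Action list computed once by the slow path (the host SDN stack) for this flow.
flow_table = {
    FiveTuple("1.2.3.1", "1.3.4.1", 62362, 80, "TCP"):
        ["Decap", "DNAT", "Rewrite", "Meter"],
}

def process_packet(pkt):
    """Fast path: apply cached actions; otherwise punt to the slow path."""
    actions = flow_table.get(pkt)
    if actions is None:
        return "exception -> send to host SDN stack, install new flow"
    return " -> ".join(actions)

print(process_packet(FiveTuple("1.2.3.1", "1.3.4.1", 62362, 80, "TCP")))
# Decap -> DNAT -> Rewrite -> Meter
```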

  18. Case 3: Use as a remote accelerator

  19. Inter-FPGA Communication
     • FPGAs can encapsulate their own UDP packets (a toy encapsulation sketch follows below)
     • Low-latency inter-FPGA communication (LTL)
     • Can provide strong network primitives
     • But this topology opens up other opportunities
     [Diagram: datacenter network with spine switches (SP0-SP3), cluster switches (CS0-CS3) at L1/L2, and ToR switches at L0; under each ToR, every server's FPGA sits in front of its NIC.]
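Since LTL messages ride inside ordinary UDP datagrams, a rough software analogue of what the FPGA does in hardware is simply to prepend a small message header and hand the bytes to UDP. The header fields below are invented for illustration; the real LTL wire format is not shown on the slide.

```python
# Minimal sketch: wrap a payload in a small header and send it as UDP, a
# software stand-in for what LTL does in FPGA hardware. The header layout
# (dest FPGA id, connection id, sequence number) is invented for illustration.
import socket
import struct

def ltl_like_datagram(dst_fpga, conn_id, seq, payload: bytes) -> bytes:
    header = struct.pack("!HHI", dst_fpga, conn_id, seq)  # 8-byte toy header
    return header + payload

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
msg = ltl_like_datagram(dst_fpga=7, conn_id=3, seq=42, payload=b"vector chunk")
sock.sendto(msg, ("10.0.0.7", 5000))  # hypothetical peer address and port
```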

  20. FPGA-to-FPGA LTL Round-Trip Latencies

  21. Hardware Acceleration as a Service (HaaS)
     [Diagram: FPGAs pooled across racks (ToR and cluster switches) and allocated to different workloads: Bing Ranking HW, Bing Ranking SW, speech to text, large-scale deep learning, HPC.]

  22. BrainWave: Scaling FPGAs to Ultra-Large Models (thanks to Eric Chung and team)
     • Distribute NN models across as many FPGAs as needed (up to thousands)
     • Recent ImageNet competition: 152-layer model
     • Use HaaS and LTL to manage multi-FPGA execution
     • Very close to live production
     • Only vectors travel over the network
     • Low FPGA-to-FPGA latency: ~1.8us per L0 hop (see the latency sketch below)
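As a rough illustration of why the ~1.8us L0 hop latency matters for model partitioning (only the hop latency comes from the slide; the partition count and per-FPGA compute time are invented), the network cost of chaining a model across several FPGAs under one ToR is just the per-hop latency times the number of FPGA-to-FPGA transfers:

```python
# Back-of-envelope pipeline latency for a model split across FPGAs.
HOP_US = 1.8                # FPGA-to-FPGA latency per L0 hop (from the slide)
fpgas = 8                   # hypothetical number of FPGAs the model spans
compute_us_per_fpga = 20.0  # hypothetical compute time per partition

network_us = (fpgas - 1) * HOP_US            # vectors forwarded between stages
total_us = fpgas * compute_us_per_fpga + network_us
print(f"network: {network_us:.1f} us, total: {total_us:.1f} us")
# network: 12.6 us, total: 172.6 us -- the fabric adds little end-to-end latency
```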

  23. The V2 Architecture Makes Configurable Clouds Possible
     [Diagram: CPUs, FPGAs, and DRAM in each server, with the FPGA plane spanning ToR and cluster switches: the FPGA fabric is an independent computer running outside of the CPU domain.]
     • Massive amounts of programmable logic will change datacenter architecture broadly
     • Will affect network architecture (protocols, switches), storage architecture, and security models

  24. Will Catapult V2 be Deployed at Scale?

  25. Configurable Clouds Will Change the World
     • Ability to reprogram a datacenter’s hardware protocols: networking, storage, security
     • Can turn homogeneous machines into specialized SKUs dynamically
     • Unprecedented performance and low latency at hyperscale: exa-ops of performance with a 10-microsecond diameter (a rough calculation follows below)
     • What would you do with the world’s most powerful fabric?
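To give a feel for the exa-ops claim (illustrative arithmetic only; the fleet size and per-FPGA throughput below are assumptions, not figures from the talk), a fleet of around a million FPGAs each sustaining a few tera-ops lands in exa-op territory:

```python
# Illustrative back-of-envelope for "exa-ops at hyperscale". Both inputs are
# assumptions made for the arithmetic, not numbers from the talk.
fpgas_in_fleet = 1_000_000   # hypothetical fleet size
teraops_per_fpga = 2.0       # hypothetical sustained tera-ops per FPGA

total_ops = fpgas_in_fleet * teraops_per_fpga * 1e12
print(f"{total_ops / 1e18:.1f} exa-ops")  # 2.0 exa-ops
```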
