  1. Data Centre Acceleration
     Monica Qin Li, Aaron Chelvan, Sijun Zhu

  2. Background
     ● By 2019:
       ○ Data centre traffic will reach 10.4 zettabytes, with an annual growth rate of ~25%.
       ○ 83% of traffic will come from the cloud.
       ○ 80% of workloads will be processed in the cloud.
     ● Failure of Dennard scaling.
     ● Up to 50%-80% of the chip may be kept powered down in order to comply with thermal constraints.

  3. Solution: Application-Specific Accelerators
     ● Can significantly increase the performance of data centres within a fixed power budget.
     ● Can be used either as a coprocessor or as a complete replacement.
     ● Studies have shown FPGA-based acceleration gives:
       ○ 25x better performance per watt
       ○ 50-75x latency improvement

  4. Cloud Applications
     ● Two types of cloud applications:
       ○ Offline batch processing: high volumes of data, involving several complicated processes.
         ■ Large amounts of data are offloaded to the FPGA, so the overhead of communication between the FPGA and the processor is minimised.
       ○ Online streaming processing: smaller volumes of streaming data, involving simpler processing. Packet processing on the network interface card has the highest computational complexity.
         ■ FPGAs can be used to offload both the NIC and the actual processing of data packets.

  5. Issues to Overcome for FPGA-Based Accelerators
     ● A heterogeneous system increases programming complexity. Main issues:
       ○ Virtualisation and partitioning of FPGAs
       ○ Configuration of FPGAs
       ○ Scheduling of hardware accelerators

  6. Frameworks Presented for FPGA Accelerators

  7. Virtualised Hardware Accelerators (University of Toronto)
     ● Aims to 'virtualise' FPGAs and offer them as a cloud resource.
     ● The FPGA is split into several reconfigurable regions, with each region viewed as a single resource (Virtualised FPGA Resource, or VFR).
     ● VFRs are offered to users via OpenStack.

  8. Framework
     ● VM as resource vs. FPGA as resource.
     ● A hardware accelerator can be loaded across multiple FPGAs.
     ● Instead of a single bitstream, a collection of partial bitstreams is passed to the agent.
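To make the partial-bitstream idea concrete, here is a minimal C sketch of how such a request could be represented; all type and field names are hypothetical and are not taken from the Toronto framework's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one partial bitstream targeting a
 * specific reconfigurable region (VFR) on a specific FPGA. */
typedef struct {
    uint32_t fpga_id;           /* which physical FPGA in the pool      */
    uint32_t region_id;         /* which reconfigurable region (VFR)    */
    const uint8_t *bitstream;   /* partial bitstream data               */
    size_t length;              /* size of the partial bitstream, bytes */
} partial_bitstream_t;

/* An accelerator request is a collection of partial bitstreams,
 * allowing one hardware accelerator to span multiple FPGAs. */
typedef struct {
    const char *accelerator_name;
    partial_bitstream_t *parts;
    size_t num_parts;
} accelerator_request_t;
```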

  9. Reconfigurable Cloud Computing Environment (Technical University of Dresden)
     ● Users implement and execute their own hardware designs on virtual FPGAs (vFPGAs).
     ● They can allocate either a complete physical FPGA or a portion of one as a vFPGA.
     ● A hypervisor has access to a database containing all physical and virtual FPGA devices and their allocation status.
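A rough sketch, under assumed names, of the kind of allocation record the hypervisor's database might hold, together with a simple lookup for a free vFPGA; the Dresden system's real schema is not described in the slides.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation record for one virtual FPGA slot. */
typedef struct {
    uint32_t physical_fpga_id;  /* physical device hosting this vFPGA */
    uint32_t vfpga_id;          /* virtual FPGA identifier             */
    bool     allocated;         /* is this slot currently assigned?    */
    uint32_t tenant_id;         /* owner, valid only when allocated    */
} vfpga_record_t;

/* Return the index of the first free vFPGA, or -1 if none is free.
 * A real hypervisor would query its device database rather than
 * scanning an in-memory array. */
static int find_free_vfpga(const vfpga_record_t *db, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!db[i].allocated)
            return (int)i;
    }
    return -1;
}
```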

  10. FPGAs in Hyperscale Data Centers (IBM Zurich)
     ● Users can build their own programmable fabrics of vFPGAs in the cloud.
     ● They rent only the FPGAs they require.

  11. Implemented Hardware Accelerators: Ryft ONE

  12. Commercial Product: Ryft ONE
     ● Simultaneously analyses up to 48 TB of batch and streaming data.
     ● Can achieve up to 100x speedup while reducing costs by 70%.
     ● Functionality includes commonly used tasks, e.g. term frequency and fuzzy search.

  13. Implemented Hardware Accelerators: Microsoft Catapult v1

  14. Board Design
     ● Hardware acceleration applied to a group of 1632 servers.
     ● 1 Altera Stratix V D5 FPGA per server, connected via PCIe.
     ● FPGAs are interconnected so that resources can be shared: a "reconfigurable fabric".
     ● Requirement: no jumper cables (for power or signalling).
       ○ Limiting power draw to under 25 W lets the PCIe bus provide all necessary power.
       ○ Limiting it to under 20 W keeps the increase in power consumption below 10%.
     ● Each FPGA has 8 GB of local DRAM, since SRAM is too expensive.
     ● Industrial-grade materials allow the FPGA to operate at up to 100°C.
     ● Electromagnetic-interference shielding is added to the FPGAs.

  15. PC-to-FPGA Interface
     ● Requirements:
       ○ Low latency (< 10 μs to transfer 16 KB)
       ○ Safe for multithreading
     ● Custom PCIe interface with DMA support.
     ● Low latency: avoid using system calls to transfer data.
       ○ 1 input buffer and 1 output buffer in user-level memory.
       ○ The FPGA is given base pointers to those buffers.
     ● Thread safety: divide the buffers into 64 equal sections.
       ○ Give each thread exclusive access to 1 or more sections.
       ○ Each section is 64 KB.
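A simplified sketch of the buffer-partitioning scheme (names are illustrative, not the actual Catapult driver code): the user-level buffer is divided into 64 sections of 64 KB, and each thread only ever touches the sections it owns, which is what makes the interface safe for multithreading without locks or system calls on the data path.

```c
#include <stdint.h>
#include <string.h>

#define NUM_SECTIONS 64
#define SECTION_SIZE (64 * 1024)                  /* 64 KB per section */
#define BUFFER_SIZE  (NUM_SECTIONS * SECTION_SIZE)

/* User-level input buffer; the FPGA is given its base pointer and
 * reads sections via DMA, so no system call is needed per transfer. */
static uint8_t input_buffer[BUFFER_SIZE];

/* Return a pointer to the section owned by a given thread.  Each
 * thread gets exclusive access to one (or more) sections, so no
 * locking is needed on the data path. */
static uint8_t *section_for_thread(unsigned thread_id) {
    unsigned section = thread_id % NUM_SECTIONS;  /* 1 section per thread here */
    return &input_buffer[(size_t)section * SECTION_SIZE];
}

/* Example: a thread copies its request into its own section before
 * notifying the FPGA (doorbell/notification mechanism omitted). */
static void submit_request(unsigned thread_id, const void *data, size_t len) {
    if (len > SECTION_SIZE)
        len = SECTION_SIZE;                       /* truncate for this sketch */
    memcpy(section_for_thread(thread_id), data, len);
    /* ...signal the FPGA that this section is ready... */
}
```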

  16. Network Design
     ● FPGAs are connected together in a dedicated network.
     ● Low latency, high bandwidth.
     ● 6 x 8 2D torus network topology.
       ○ Balances routability with cabling complexity.
     ● 20 Gb/s bidirectional bandwidth at < 1 microsecond latency.
     Source: https://en.wikipedia.org/wiki/Torus_interconnect
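To illustrate what the 6 x 8 2D torus topology means, this small illustrative sketch computes a node's four neighbours with wrap-around in both dimensions; the wrap-around is what keeps hop counts low compared with a plain mesh.

```c
#include <stdio.h>

#define ROWS 6
#define COLS 8

/* Node (r, c) in a 2D torus has four neighbours; indices wrap around
 * at the edges, unlike a plain mesh. */
static void torus_neighbours(int r, int c, int out[4][2]) {
    out[0][0] = (r + ROWS - 1) % ROWS; out[0][1] = c;   /* up    */
    out[1][0] = (r + 1) % ROWS;        out[1][1] = c;   /* down  */
    out[2][0] = r; out[2][1] = (c + COLS - 1) % COLS;   /* left  */
    out[3][0] = r; out[3][1] = (c + 1) % COLS;          /* right */
}

int main(void) {
    int n[4][2];
    torus_neighbours(0, 0, n);  /* corner node: wrap-around matters here */
    for (int i = 0; i < 4; i++)
        printf("neighbour %d: (%d, %d)\n", i, n[i][0], n[i][1]);
    return 0;
}
```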

  17. Overview of Bing Search
     [Pipeline diagram] The top-level aggregator (TLA) front-end checks the cache; on a cache miss, the selection service finds documents that match the search query, the ranking service ranks the documents, and the search results are returned.

  18. Search Result Ranking
     ● 3 stages:
       ○ Feature Extraction (FE)
       ○ Free Form Expressions (FFE)
       ○ Document Scoring
     ● 8 FPGAs are arranged in a chain.
     ● A queue manager passes documents from the selection service through the chain.
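A toy sketch of the chain idea, purely for illustration: documents from the selection service flow through the three stages in order. In the real system each stage is implemented on one or more of the 8 FPGAs, and documents move between them over the inter-FPGA network rather than through a local loop.

```c
#include <stdio.h>

/* The three ranking stages the slides describe. */
typedef enum { STAGE_FE, STAGE_FFE, STAGE_SCORING, NUM_STAGES } stage_t;

typedef struct {
    int id;
    double score;               /* filled in by the final stage */
} document_t;

/* Placeholder per-stage processing; each stage would really be a
 * hardware pipeline on one or more FPGAs in the chain. */
static void run_stage(stage_t s, document_t *doc) {
    if (s == STAGE_SCORING)
        doc->score = 0.5;       /* dummy score for the sketch */
}

/* Queue manager: push each document from the selection service
 * through every stage of the chain, in order. */
static void process_documents(document_t *docs, int n) {
    for (int i = 0; i < n; i++)
        for (int s = 0; s < NUM_STAGES; s++)
            run_stage((stage_t)s, &docs[i]);
}

int main(void) {
    document_t docs[2] = { { .id = 1 }, { .id = 2 } };
    process_documents(docs, 2);
    printf("doc %d score %.2f\n", docs[0].id, docs[0].score);
    return 0;
}
```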

  19. Search Result Ranking, Stage 1: Feature Extraction
     ● Search each document for features related to the search query.
     ● Assign each feature a score.
     ● The hardware allows multiple feature-extraction engines to run simultaneously.
       ○ Multiple instruction, single data (MISD).
     ● Stream Preprocessing FSM: splits the input into control and data signals.
     ● Feature Gathering Network: groups the features and sends them onwards.

  20. Search Result Ranking, Stage 2: Free Form Expressions
     ● Mathematical combinations of features.
     ● Involves complex maths with large floating-point values.
     ● A custom core was designed for computing FFEs.
     ● 60 cores fit on a single D5 FPGA.
     ● Characteristics:
       1. Each core supports 4 threads.
       2. Threads are prioritised based on expected latency.
       3. Long-latency operations can be split between multiple FPGAs.
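A small sketch of latency-aware thread selection on one FFE core, with assumed names and an assumed policy: the slides only say that threads are prioritised based on expected latency, so the direction chosen here (longest expected latency first) is an illustrative choice, not a statement of Catapult's actual policy.

```c
#include <stdbool.h>

#define THREADS_PER_CORE 4

/* Hypothetical per-thread state on one FFE core (4 threads/core). */
typedef struct {
    bool     ready;              /* has work to run this cycle        */
    unsigned expected_latency;   /* estimated cycles left for its FFE */
} ffe_thread_t;

/* Pick which of the core's threads to issue next.  This sketch gives
 * priority to the ready thread with the longest expected latency, so
 * slow expressions start early; the real policy may differ.
 * Returns -1 if no thread is ready. */
static int select_thread(const ffe_thread_t t[THREADS_PER_CORE]) {
    int best = -1;
    for (int i = 0; i < THREADS_PER_CORE; i++) {
        if (!t[i].ready)
            continue;
        if (best < 0 || t[i].expected_latency > t[best].expected_latency)
            best = i;
    }
    return best;
}
```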

  21. Search Result Ranking, Stage 3: Document Scoring
     [Diagram] The features and free form expressions are fed into a machine learning model that produces a floating-point score for each document.
     ● Search results are ranked in order of document score.
     ● Documents are compressed to 64 KB before being passed to the ranking service (the software implementation does not do this).
       ○ Compression has minimal impact.

  22. Error Handling and Recovery
     ● Health Monitor
       ○ Queries servers to check their status.
       ○ If unresponsive: soft boot → hard boot → flag for manual service.
       ○ If responsive: returns info about the FPGA:
         ■ PCIe errors
         ■ Errors in inter-FPGA network communication
         ■ DRAM status
         ■ Whether a temperature shutdown occurred
     ● Mapping Manager
       ○ Manages the roles of the FPGAs.
       ○ Performs reconfigurations.
       ○ A reconfiguring FPGA may send garbage data, so a "TX halt" signal is sent to its neighbours, telling them to temporarily ignore any data received.
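A minimal sketch of the health monitor's escalation policy for an unresponsive server, with illustrative names rather than Catapult's actual code:

```c
/* Escalation steps for an unresponsive server:
 * soft boot first, then hard boot, then flag for manual service. */
typedef enum {
    ACTION_SOFT_BOOT,
    ACTION_HARD_BOOT,
    ACTION_FLAG_FOR_MANUAL_SERVICE,
} recovery_action_t;

/* Decide the next recovery step from how many recovery attempts have
 * already failed for this server. */
static recovery_action_t next_recovery_action(int failed_attempts) {
    switch (failed_attempts) {
    case 0:  return ACTION_SOFT_BOOT;
    case 1:  return ACTION_HARD_BOOT;
    default: return ACTION_FLAG_FOR_MANUAL_SERVICE;
    }
}
```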

  23. Deployment Results
     ● Deployed to 34 pods (each pod is a 6 x 8 torus) → 1632 servers.
     ● Increased ranking throughput by 95% at latency similar to a software-only solution.
     ● Increase in power consumption was below 10%.
     ● Increase in cost was below 30%.

  24. A Cloud-Scale Acceleration Architecture: Microsoft v2

  25. Overview
     ● Done by Microsoft, as work towards the Microsoft Catapult v2 system.
     ● Aims to fix various issues that came with the Catapult v1 system.
     ● Implemented on an Altera Stratix V D5 FPGA board that supports dual 40 Gbps LAN ports.
     ● Tested on 5760 servers with synthetic and mirrored production data.
     ● Achieved noticeable improvement in:
       ○ Local acceleration
       ○ Network acceleration
       ○ Remote acceleration

  26. Issues with the Catapult v1 System
     1. The secondary network (6x8 torus) required expensive and complex cabling.
     2. Each FPGA needs full awareness of the physical location of other machines.
     3. Failure handling in the torus requires complex re-routing of traffic to neighbouring nodes.
        a. Performance loss
        b. Isolation of nodes
     4. Limited scalability for direct communication.
        a. One rack → 48 nodes
     5. The FPGA can be used for accelerating applications, but has limited ability to enhance the datacenter infrastructure.

  27. Proposed Solution (Bump-in-the-Wire)
     ● Couple the FPGA with the network interface.
       a. The FPGA shares the same network topology as the server itself.
     ● All network traffic is routed through the FPGA.
       a. Allows it to accelerate high-bandwidth network flows.
     ● The FPGA uses a PCIe connection to the host.
       a. Gives it the capability for local acceleration.
     ● FPGAs are able to generate and consume their own data packets, and communicate using LTL (Lightweight Transport Layer).
       a. Every FPGA can reach every other FPGA within a small number of microseconds, at hyperscale.
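A purely illustrative sketch of the bump-in-the-wire data path, using hypothetical names: every packet between the top-of-rack switch and the host NIC passes through the FPGA, which either accelerates the flow or forwards the packet unchanged, so unaccelerated traffic still works normally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet handle; not Microsoft's actual shell code. */
typedef struct {
    uint8_t *data;
    size_t   len;
} packet_t;

/* Hypothetical predicate: does an accelerator role handle this flow? */
static bool role_handles(const packet_t *p) { (void)p; return false; }

/* Hypothetical role-specific processing (e.g. crypto, compression). */
static void role_process(packet_t *p) { (void)p; }

/* Called for each packet arriving from the TOR switch on its way to
 * the host NIC; the reverse direction is symmetric. */
static void fpga_ingress(packet_t *p) {
    if (role_handles(p))
        role_process(p);      /* accelerate this flow in the FPGA        */
    /* forward to the host NIC either way (forwarding code omitted)      */
}
```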

  28. Proposed Solution (Bump-in-the-Wire)
     ● Discrete NIC (Network Interface Card)
       ○ Allows simple bypassing, rather than wasting FPGA resources on implementing NIC logic.
     ● Possible drawback of this design:
       ○ A buggy application may cut off the network coming into its server.
       ○ However, a failure in one server does not affect the others.
       ○ Servers have automatic health checks; an unresponsive server is rebooted, so the proper FPGA image (the golden image) is used again.

  29. Shell Architecture
     ● Role: application logic.
     ● Shell: I/O and board-specific logic.
     ● Elastic Router: intra-FPGA communication interface for roles.
     ● LTL engine: inter-FPGA communication interface for roles.

  30. Resource Usage
     ● 44% of the FPGA is used to support all shell functions.
     ● Enough space is left for the role(s) to provide a large speed-up.
