Datacentre Acceleration


  1. Datacentre Acceleration Ted Zhang (z5019142) Meisi Li (z5119623) Han Zhao (z5099931) Nattamon Paiboon (z5176930)

  2. Background

  3. Data Centres ❖ Centralised computing and network infrastructure ❖ Cloud applications for storage and computation

  4. Current Issues - Performance ❖ End of Moore's Law ❖ Mainstream hardware unable to keep up with growing demand

  5. Current Issues - Power Efficiency ❖ Lots of designed redundancy in data centres ➢ Peak load handling ❖ Growing attention towards sustainability ➢ Environmental and cost concerns

  6. Solution? Application specific hardware acceleration ❖ 25x better performance per watt ❖ 50-75x latency improvement

  7. Cloud Computing Characteristics Two broad categories of cloud applications: ❖ Offline - process large quantities of data, complex operations ➢ Big data, MapReduce ❖ Online - data streaming and delivery ➢ Search engine, video streaming

  8. Implementation Frameworks

  9. Accelerator Frameworks Virtualised Hardware Accelerators (IBM) ❏ Abstracts portions of FPGAs as a pool of resources ❏ Predefined functions such as encryption and hashing FPGAs in the Cloud (University of Toronto) ❏ Abstracts reconfigurable FPGA accelerators as Virtual Machines ❏ OpenStack for resource control and allocation

  10. Accelerator Frameworks Virtualised FPGA Accelerators (University of Warwick) ❏ Integration within server machines ❏ Usually provide a library of operations which are faster on FPGA FPGAs in Hyperscale Data Centres (IBM) ❏ Direct user allocation of FPGA partitions

  11. Implementations and Evaluation

  12. Speedup Metrics ❖ Performance speedup ➢ Kernel speedup - execution of specific task ➢ System speedup - execution of entire application ❖ Energy efficiency - fraction multiplier
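As a rough illustration of how kernel speedup and system speedup relate, the sketch below applies Amdahl's law. The 4.3x kernel speedup and the 60% accelerated fraction are assumed numbers for this example, not figures from the presentation.

```c
#include <stdio.h>

/* Illustrative only: relates kernel speedup to system speedup via Amdahl's
 * law. The accelerated fraction (60%) is an assumption for this sketch. */
int main(void) {
    double kernel_speedup = 4.3;   /* speedup of the offloaded task alone  */
    double accel_fraction = 0.60;  /* assumed share of total runtime       */

    /* system speedup = 1 / ((1 - f) + f / s) */
    double system_speedup =
        1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup);

    printf("kernel speedup %.2fx -> system speedup %.2fx\n",
           kernel_speedup, system_speedup);
    return 0;
}
```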

  13. Successful Implementations Reconfigurable MapReduce Accelerator (University of Athens) ❏ V1 - Map done by standard processors, reduce moved to FPGA ❏ V2 - Map moved to FPGA, reconfigurable by HLS Speedup 4.3x Efficiency 33x Memcached Acceleration (HP) ❏ Two distinct accelerator blocks ❏ Network accelerator ❏ Memcached accelerator ❏ Example of streaming acceleration Speedup 1x Efficiency 10.9x

  14. Successful Implementations Microsoft Catapult - Bing Search Engine ❏ Altera FPGA PCIe board installed inside standard server machines ❏ Aids machine-learning page ranking Speedup 1.95x Efficiency -

  15. Implemented Hardware Accelerators: Microsoft Catapult v1

  16. Design ● 6 x 8 2D torus embedded into a half-rack of 48 servers ● 1,632 servers in the deployment ● 1 Altera Stratix V D5 FPGA and local DRAM per server ● FPGA attached to the host over PCI Express (PCIe) ● Each FPGA has 8 GB of local DRAM ● 20 Gbit/s of bidirectional bandwidth over passive copper cables only

  17. Software Interface Communication between FPGA and host CPU design: ● Interface via PCIe ● Interface must incur low latency ● Interface must be multi-threading safe ● The FPGA is given a pointer to a user-space buffer ● The buffer is divided into 64 slots ● Each thread is statically assigned exclusive access to one or more slots
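A minimal C sketch of the slot idea follows, assuming one 64 KB slot per thread; the names and sizes are illustrative and are not the actual Catapult driver interface.

```c
#include <stdint.h>

/* Minimal sketch of a slot-based, lock-free interface: the user-space buffer
 * is split into 64 slots and each thread owns one slot exclusively, so no
 * locking is needed on the fast path. The slot size is an assumption. */
#define NUM_SLOTS 64
#define SLOT_SIZE (64 * 1024)

static uint8_t buffer[NUM_SLOTS][SLOT_SIZE];  /* buffer shared with the FPGA over PCIe */

/* Static, exclusive assignment: thread_id must be < NUM_SLOTS. */
static inline uint8_t *slot_for_thread(unsigned thread_id) {
    return buffer[thread_id];
}
```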

  18. Software Infrastructure The software infrastructure needs to ● Ensure correct operation ● Detect failures and recover from them Two services are introduced for these tasks: ● Mapping manager: configures FPGAs with the correct application images ● Health monitor: invoked when there is a suspected failure in one or more systems

  19. Correct Operation FPGA reconfiguration may cause instability in the system Reason: ● The FPGA being reprogrammed can appear as a failed PCIe device, which raises a non-maskable interrupt ● It may corrupt its neighbours by randomly sending traffic that appears valid Solution: ● The driver behind the reprogramming must disable non-maskable interrupts ● A "TX Halt" message is sent so that neighbours ignore all traffic from the node until the link is re-established
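The sequence below sketches that reprogramming procedure in C. Every function named here is a hypothetical placeholder used to make the steps explicit; it is not a real driver API.

```c
/* Hedged sketch of the reprogramming sequence described above.
 * All functions are hypothetical placeholders, not a real driver API. */
void disable_nonmaskable_interrupts(int fpga_id);
void send_tx_halt_to_neighbours(int fpga_id);   /* neighbours drop this node's traffic */
void load_bitstream(int fpga_id, const char *image);
void wait_for_links_reestablished(int fpga_id);
void enable_nonmaskable_interrupts(int fpga_id);

void reconfigure_fpga(int fpga_id, const char *image)
{
    disable_nonmaskable_interrupts(fpga_id); /* avoid NMI from the "failed" PCIe device   */
    send_tx_halt_to_neighbours(fpga_id);     /* stop traffic that could corrupt neighbours */
    load_bitstream(fpga_id, image);          /* perform the reconfiguration                */
    wait_for_links_reestablished(fpga_id);   /* traffic resumes once links are back up     */
    enable_nonmaskable_interrupts(fpga_id);
}
```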

  20. Failure Detection and Recovery ● A monitoring service notices unresponsive servers ● The health monitor contacts each machine to get its status ● A healthy server responds with the status of its local FPGA ● The health monitor updates the list of failed machines ● The mapping manager moves the application to healthy machines
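A hedged sketch of that health-monitor flow is shown below; the types and functions are hypothetical placeholders, not Catapult code.

```c
/* Hedged sketch of the health-monitor / mapping-manager interaction.
 * Every type and function here is a hypothetical placeholder. */
typedef struct { int id; } server_t;

int  ping_server(int server_id);              /* hypothetical: is the server responsive?  */
int  query_local_fpga_status(int server_id);  /* hypothetical: is its local FPGA healthy? */
void mark_failed(int server_id);              /* hypothetical: update failed-machine list */
void mapping_manager_relocate(int server_id); /* hypothetical: move the application away  */

void health_check(const server_t *servers, int n)
{
    for (int i = 0; i < n; i++) {
        int ok = ping_server(servers[i].id) &&
                 query_local_fpga_status(servers[i].id);
        if (!ok) {
            mark_failed(servers[i].id);
            mapping_manager_relocate(servers[i].id);
        }
    }
}
```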

  21. Application Used in Bing's ranking engine Overview: ● If possible, the query is served from a front-end cache ● Otherwise the TLA (Top-Level Aggregator) sends the query to a large number of machines ● These machines find matching documents ● The documents are sent to machines running the ranking service ● The ranked search results are returned

  22. Macropipeline ● The processing pipeline is divided into macro-pipeline stages ● The time budget per macro-pipeline stage is 8 microseconds ● This corresponds to 1,600 FPGA clock cycles ● The queue manager passes documents from the selection service through the chain ● Tasks are distributed in this fashion: ○ 1 FPGA for feature extraction ○ 2 FPGAs for free-form expressions ○ 1 FPGA for compression ○ 3 FPGAs to hold machine-learning models ○ 1 FPGA as a spare in case of machine failure
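The 1,600-cycle figure follows directly from the 8 microsecond budget if the FPGA clock is 200 MHz, an assumption that is consistent with the slide's own numbers, as the small check below shows.

```c
#include <stdio.h>

/* Sanity check: 8 us per macro-pipeline stage at an assumed 200 MHz FPGA
 * clock gives the 1,600-cycle budget quoted on the slide. */
int main(void) {
    double stage_budget_us = 8.0;
    double fpga_clock_mhz  = 200.0;                     /* assumed clock     */
    double cycles = stage_budget_us * fpga_clock_mhz;   /* us * MHz = cycles */
    printf("cycle budget per stage: %.0f\n", cycles);   /* prints 1600       */
    return 0;
}
```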

  23. Workload ● 3 stages: ○ Feature Extraction (FE) ○ Free Form Expressions (FFE) ○ Document Scoring ● Documents are only transmitted in compressed form to save bandwidth ● Due to the slot based communication interface, the compressed documents are truncated to 64 KB

  24. Feature Extraction ● Searches each document for features related to the search query ● The hardware runs multiple feature-extraction engines simultaneously, all working on the same input stream ● This is a multiple-instruction, single-data (MISD) computation ● Stream Preprocessing FSM: produces a series of control and data messages ● Feature Gathering Network: collects the generated feature and value pairs and forwards them to the next pipeline stage
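A small C sketch of the MISD pattern follows: several engines, each computing a different feature, consume the same document. The types and the notion of an "engine" here are illustrative only, not the Bing design.

```c
#include <stddef.h>

/* Hedged sketch of MISD feature extraction: several engines apply different
 * computations to the same document (multiple instruction, single data). */
typedef struct { const char *text; size_t len; } document_t;
typedef double (*feature_engine_fn)(const document_t *doc, const char *query);

/* On the FPGA the engines consume the same stream in parallel; in this
 * software sketch they simply run one after another over the same input. */
static void extract_features(const document_t *doc, const char *query,
                             feature_engine_fn *engines, size_t n_engines,
                             double *features_out) {
    for (size_t i = 0; i < n_engines; i++)
        features_out[i] = engines[i](doc, query);
}
```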

  25. Free Form Expressions ● Custom multicore processor that is efficient at processing thousands of threads with long-latency floating point operations ● 60 cores on a single FPGA ● Characteristics: ○ Each core supports 4 threads ○ Threads are prioritised based on expected latency ○ Long latency operations can be split between multiple FPGAs
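One way to picture the latency-based prioritisation is the selection routine below: among ready threads, the one whose expression has the longest expected latency is issued first, so slow operations start earliest. This is an illustrative sketch, not the actual FFE core scheduler.

```c
/* Hedged sketch of latency-based thread selection for the FFE multicore. */
typedef struct {
    int    ready;                /* thread has work to run               */
    double expected_latency_us;  /* estimated latency of its expression  */
} ffe_thread_t;

static int pick_next_thread(const ffe_thread_t *threads, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (threads[i].ready &&
            (best < 0 ||
             threads[i].expected_latency_us > threads[best].expected_latency_us))
            best = i;
    }
    return best;  /* -1 if no thread is ready */
}
```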

  26. Document Scoring ● Takes the features and FFE results as inputs and produces a single floating-point score per document ● The resulting scores are sorted

  27. Cloud-scale Acceleration

  28. Limitations of Catapult V1.0 ● The secondary network was complex and expensive ● Failure handling in the torus required complex re-routing of traffic to neighbouring nodes, causing both performance loss and isolation of nodes under certain failure patterns ● The number of FPGAs that could communicate with each other was limited to a single rack ● Application-scale accelerators could not be applied to datacenter-wide infrastructure such as network and storage flows

  29. A new cloud-scale FPGA-based Architecture This architecture eliminates all of the limitations listed above with a single design. The architecture has been — and is being — deployed in the majority of new servers in Microsoft’s production data centers across more than 15 countries and 5 continents. We could call it Catapult V2.0.

  30. Network Topology of the Architecture ● PCIe to the local host CPU (local accelerator) ● The FPGA is placed between the NIC and the network switch (bump-in-the-wire) ● NIC (network interface card) ● QSFP (Quad Small Form-factor Pluggable) network ports

  31. Flexibility of the Model By enabling the FPGAs to generate and consume their own networking packets independently of the hosts, each and every FPGA in the datacenter can reach every other one. The LTL (Lightweight Transport Layer) protocol provides low-latency communication between pairs of FPGAs. Every host can use remote FPGA resources, and FPGAs not being used by their host are donated to a global pool.
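A toy sketch of the global-pool idea is given below; the pool structure and function names are hypothetical, and distributed coordination is ignored entirely.

```c
/* Toy sketch of the global FPGA pool: hosts donate FPGAs they are not using,
 * and any host may acquire a remote FPGA from the pool. The names, the
 * fixed-size array and the single-machine view are all simplifications. */
#define POOL_CAP 1024

static int pool[POOL_CAP];   /* ids of donated FPGAs */
static int pool_size = 0;

static void donate_fpga(int fpga_id) {         /* host is not using its FPGA  */
    if (pool_size < POOL_CAP)
        pool[pool_size++] = fpga_id;
}

static int acquire_remote_fpga(void) {         /* returns -1 if pool is empty */
    return pool_size > 0 ? pool[--pool_size] : -1;
}
```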

  32. Hardware Architecture Datacenter accelerators must be highly manageable, which means having few variations or versions. A single design must provide positive value across an extremely large, homogeneous deployment. They must be highly flexible at the system level, in addition to being programmable, to justify deployment across a hyperscale infrastructure. Divided into three parts: 1. Local acceleration handles high-value scenarios such as search ranking acceleration, where every server can benefit from having its own FPGA. 2. Network acceleration can support services such as intrusion detection, deep packet inspection and network encryption. 3. Global acceleration permits accelerators unused by their host servers to be made available for large-scale applications.

  33. Board Design

  34. Shell Architecture Includes an Elastic Router (ER) with virtual channel support allowing multiple Roles to access the network, and a Lightweight Transport Layer (LTL) engine used for enabling inter-FPGA communication.

  35. Datacenter Deployment 5,760 servers containing this accelerator architecture were placed into a production datacenter, with 3,081 of these machines using the FPGA for local compute acceleration. Only two FPGAs had hard failures, and there was a low number of soft errors, all of which were correctable.

  36. Local Acceleration Bing Search page ranking. With a single local FPGA, at the target 99th-percentile latency, throughput can be safely increased by 2.25x, which means that fewer than half as many servers would be needed to sustain the target throughput at the required latency.
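As a worked example of the 2.25x claim, the calculation below shows the server count dropping to under half; the baseline fleet size is an assumed number for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Worked example: if each FPGA-equipped server sustains 2.25x the throughput
 * at the same tail latency, a fixed load needs 1/2.25 (about 44%) as many
 * servers. The baseline fleet size is an assumption for the sketch. */
int main(void) {
    double baseline_servers = 1000.0;  /* assumed fleet size */
    double throughput_gain  = 2.25;
    printf("servers needed with FPGAs: %.0f\n",
           ceil(baseline_servers / throughput_gain));  /* 445 < 500 */
    return 0;
}
```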
