
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis. ISCA, June 22, 2016.


  1. DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
     Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis
     ISCA, June 22, 2016

  2. FPGA-Based Accelerators
     - Improve performance and energy efficiency
     - Good balance between flexibility (CPUs) and efficiency (ASICs)
     - Recently used for many datacenter apps
       - Image/video processing, web search, neural networks, …
     Pictures: Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA'14

  3. Motivation
     - Deploy FPGAs in cost- and power-constrained systems
     - Datacenter systems
       - High-density FPGAs for large accelerators for multiple apps
       - Low-power FPGAs to simplify integration in servers and racks
     - Mobile systems
       - High-density FPGAs for accelerators for multiple apps
       - Low-power FPGAs for low cost and long battery life

  4. DRAF in a Nutshell
     - A high-density & low-power FPGA
       - Bit-level reconfigurable, just like conventional FPGAs
     - Uses dense DRAM technology for lookup tables
       - Replacing the SRAM technology in conventional FPGAs
     - DRAF vs. FPGA
       - 10-100x logic density
       - 1/3 power consumption
       - Multi-context support with fast context switch

  5. Challenges of Building DRAM-based FPGAs

  6. DRAM Array Structure
     [Figure: a DRAM subarray composed of MATs, with labels for the row decoder, master wordline, local wordline, bitline, sense-amp, input, and output]
     - A DRAM subarray is naturally a lookup table
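
The idea that a subarray is "naturally a lookup table" can be illustrated with a toy model in which the row address plays the role of the LUT input and the row contents are the LUT output. This is only a sketch: the class name `SubarrayLUT` and the tiny sizes are assumptions made for illustration, and it ignores timing and destructive reads.

```python
class SubarrayLUT:
    """Toy model: a memory array read by row address behaves like a LUT."""

    def __init__(self, input_bits, output_bits):
        self.input_bits = input_bits
        self.output_bits = output_bits
        # One row per input combination; each row holds output_bits bits.
        self.rows = [0] * (1 << input_bits)

    def program(self, func):
        """Fill every row with func(address), truncated to output_bits."""
        mask = (1 << self.output_bits) - 1
        for addr in range(len(self.rows)):
            self.rows[addr] = func(addr) & mask

    def read(self, addr):
        """Select row `addr` and return its contents (the LUT output)."""
        return self.rows[addr]


# Hypothetical example: a 3-input, 2-output LUT that counts set input bits.
lut = SubarrayLUT(input_bits=3, output_bits=2)
lut.program(lambda a: bin(a).count("1"))
popcounts = [lut.read(a) for a in range(8)]
```

Programming the array is the analogue of loading a configuration; reading a row is the analogue of evaluating the logic function.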

  7. Challenges
     [Figure: DRAM subarray annotated with ~1k rows (~10-bit input), ~8k-bit output, and 10-30 ns delay]
     - Mismatched LUT size: an 8192-bit LUT?
     - Slow speed: a LUT with 10 ns delay?
     - Destructive access: data lost after access?

  8. Destructive Access
     - Explicit activation, restoration, and precharge operations
       - Longer access delay due to serialization
     - Issue of LUT chaining: order of LUT access
       - Must activate L2 after L1
       - Must activate L4 after both L2 and L3
     [Figure: physical path through LUTs L1-L4 between registers R1 and R2, shown against the user clock]

  9. DRAF Architecture: Basic Logic Element, Multi-Context Support, Timing

  10. DRAF Overview
      - Same island layout and configurable interconnect as FPGA
      - CLB: contains multiple basic logic elements (BLEs)
      - DSP: in DRAM technology; slower but not critical
      - Block RAM: uses DRAM arrays

  11. Basic Logic Element
      - 7-10 bits input, 2-4 bits output
      - Narrower MAT: 1k bits to 8-16 bits
      - Specialized column logic: better flexibility
      - Additional FFs & MUXs: registering & retiming
      - Single-MAT access: multi-context
      [Figure: BLE built from a subarray of MATs with row decoder, master and local wordlines, bitlines, sense-amps, column logic, and FFs]

  12. Multi-Context Support
      - DRAF supports 8-16 contexts per chip
        - Context: one MAT per BLE
        - Efficient use of MATs with little area and power overhead
      - Instant switch between active contexts
        - Similar to a context switch between processes on a CPU
      - Context uses
        - One context per accelerator design or application
        - One context per part of a very large accelerator design
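
A minimal sketch of the multi-context idea, assuming each context is simply a separate truth table (standing in for one MAT) and that switching only changes which table is selected. The class `MultiContextBLE` and its methods are hypothetical names for illustration, not part of any DRAF tool.

```python
class MultiContextBLE:
    """Toy BLE: one truth table (standing in for one MAT) per context."""

    def __init__(self, num_contexts, input_bits):
        self.tables = [[0] * (1 << input_bits) for _ in range(num_contexts)]
        self.active = 0

    def load(self, context, func):
        """Program one context's table from a Python function."""
        table = self.tables[context]
        for addr in range(len(table)):
            table[addr] = func(addr)

    def switch(self, context):
        """Select another context; no table contents are moved or reloaded."""
        if not 0 <= context < len(self.tables):
            raise ValueError("no such context")
        self.active = context

    def read(self, addr):
        return self.tables[self.active][addr]


ble = MultiContextBLE(num_contexts=8, input_bits=2)
ble.load(0, lambda a: ((a >> 1) & 1) & (a & 1))  # context 0: 2-input AND
ble.load(1, lambda a: ((a >> 1) & 1) ^ (a & 1))  # context 1: 2-input XOR
and_out = [ble.read(a) for a in range(4)]
ble.switch(1)
xor_out = [ble.read(a) for a in range(4)]
```

Because every context stays resident, the same physical element can serve two different designs back to back, which is the behavior the slide compares to a process context switch.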

  13. Timing – Destructive Access
      - Issue of LUT chaining: order of LUT access
      - Solution: phases, similar to critical path finding
      [Figure: physical path through LUTs L1-L4 between registers R1 and R2, with each LUT assigned to Phase 0, 1, or 2, and a phase timeline dividing the user clock into Phase 0, Phase 1, and Phase 2]
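
One way to read "similar to critical path finding" is as a levelization of the LUT dependency graph: a LUT's phase is the length of the longest chain of LUTs feeding it. The sketch below assumes exactly that rule; the function `assign_phases` is a hypothetical helper, and the actual DRAF tool flow may assign phases differently.

```python
def assign_phases(deps):
    """Return {lut: phase}, where phase is the longest chain of LUTs feeding it.

    deps maps each LUT name to the list of LUTs whose outputs it reads.
    """
    phases = {}

    def phase_of(lut, visiting=()):
        if lut in phases:
            return phases[lut]
        if lut in visiting:
            raise ValueError(f"combinational loop through {lut}")
        preds = deps.get(lut, [])
        if not preds:
            phases[lut] = 0
        else:
            # A LUT may only be activated after all of its producers.
            phases[lut] = 1 + max(phase_of(p, visiting + (lut,)) for p in preds)
        return phases[lut]

    for lut in deps:
        phase_of(lut)
    return phases


# Dependencies stated on the slides: L2 follows L1; L4 follows both L2 and L3.
deps = {"L1": [], "L2": ["L1"], "L3": [], "L4": ["L2", "L3"]}
phases = assign_phases(deps)
```

With these dependencies, L1 and L3 can be activated first, L2 next, and L4 last, so three phases fit within one user clock period.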

  14. Timing – Latency Optimization
      - Issue: precharge and restore delays
      - Solution: 3-way delay overlapping
        - Hide PRE/RST delays with wire propagation delay
      - Performance gap between DRAF and FPGA reduces from >10x to 2-4x
      [Figure: timelines for LUT-1 and LUT-2 (PRE, ACT, RST, wire), before and after overlapping, with the saved delay marked]
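
The effect of overlapping can be sketched with simple arithmetic. The model below is an assumption based on the slide's timeline: without overlapping, each LUT stage pays PRE + ACT + RST plus the wire delay in sequence; with overlapping, restoring the current LUT, propagating on the wire, and precharging the next LUT proceed concurrently. The delay values are made-up placeholders, not measurements from the paper.

```python
def stage_delay(pre, act, rst, wire, overlap):
    """Per-stage delay of a LUT chain under the assumed timing model."""
    if overlap:
        # RST of this LUT, wire propagation, and PRE of the next LUT
        # run in parallel, so only the slowest of the three is exposed.
        return act + max(pre, rst, wire)
    # Fully serialized: every operation waits for the previous one.
    return pre + act + rst + wire


# Hypothetical delays in nanoseconds, chosen only for illustration.
PRE, ACT, RST, WIRE = 3, 4, 3, 3

serialized = stage_delay(PRE, ACT, RST, WIRE, overlap=False)
overlapped = stage_delay(PRE, ACT, RST, WIRE, overlap=True)
saved = serialized - overlapped
```

Under this model the activation is the only array operation that always sits on the critical path, which is why overlapping narrows the gap to an SRAM-based FPGA.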

  15. Summary
      - Challenges → solutions
        - Mismatched LUT size → multi-context BLE
        - Destructive access → phase-based timing
        - Slow speed → 3-way delay overlapping
      - Other design features (see paper)
        - Sense-amp as register
        - Time-multiplexed routing
        - Handling DRAM refresh

  16. Evaluation: area, power, and performance against FPGA and CPU

  17. Methodology
      - Synthesize, place & route with Yosys + VTR
      - CACTI-3DD with 45 nm power and area models
      - Comparisons
        - 70 mm² FPGA based on Xilinx Virtex-6
        - 70 mm² DRAF device, 8-context
        - Intel Xeon E5-2630 multi-core processor (2.3 GHz)
      - 18 accelerator designs
        - MachSuite, Sirius, Vivado HLS Video Library, VTR benchmark suite
        - Web service, image processing, analytics, neural networks, …

  18. DRAF Chip Area & Power
      - 10x area improvement
      - 50x peak power reduction
      [Figure: two log-scale plots for FPGA and DRAF, chip area (mm²) and peak chip power (W), each against logic capacity in million 6-LUT equivalents]

  19. FPGA vs. DRAF (Area)
      - 8-context DRAF occupies 19% less area than 1-context FPGA
        - 10x area efficiency: 8 designs in less silicon area than 1 design before
        - But only one context can be active at a time
      [Figure: normalized minimum bounding area per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing; chart annotations: "Inefficient use of larger DRAM LUT" and "exp/log functions"]
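
The "10x area efficiency" figure follows from the two numbers on the slide: eight resident designs in 19% less area than one FPGA design. A short check of that arithmetic, with variable names chosen here for illustration:

```python
contexts = 8                 # designs resident on the 8-context DRAF
relative_area = 1 - 0.19     # DRAF area relative to the 1-context FPGA

# Designs stored per unit of FPGA-equivalent area.
designs_per_area = contexts / relative_area
```

The result is a little under 10, consistent with the rounded claim. It measures storage density only, since a single context is active at any time.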

  20. FPGA vs. DRAF (Power)
      - Use one context in DRAF
      - DRAF consumes 1/3 the power of FPGA and 15% less energy
        - Note: current CAD tools are less efficient with DRAF
      [Figure: normalized power per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing]

  21. Performance
      - DRAF is 2.7x slower than FPGA
      - DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core
      [Figure: normalized throughput (log scale) per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi) for CPU, 4 CPU, FPGA, and DRAF; chart annotations: "Efficient line buffer" and "exp/log functions"]
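
The "ideal 4-core" number is consistent with assuming perfect linear scaling of the single-core baseline. A small check, where the helper name `speedup_over_ideal_multicore` is hypothetical:

```python
def speedup_over_ideal_multicore(single_core_speedup, cores):
    """Speedup over `cores` cores, assuming the CPU scales perfectly."""
    return single_core_speedup / cores


draf_vs_cpu = 13.5  # from the slide
draf_vs_4core = speedup_over_ideal_multicore(draf_vs_cpu, cores=4)
```

This gives 3.375, which rounds to the 3.4x quoted on the slide.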

  22. Conclusions
      - DRAF: high-density and low-power reconfigurable fabric
        - Based on dense DRAM technology
        - Optimized timing + multi-context support
      - DRAF targets cost- and power-constrained applications
        - E.g., datacenters and mobile systems
      - DRAF trades off some performance for area & power efficiency
        - 10x smaller area, 3x less power, and 2.7x slower than FPGA
        - Still 13x speedup over Xeon cores

  23. Thanks! Questions?

  24. Backup

  25. Design flow
      - Verilog/VHDL programming and similar synthesis flow
        - DRAF has the same primitives (LUT, FF, DSP, BRAM) as FPGA
      - Specific tweaks
        - Wider LUT: more efficient packing
        - Optimize for latency rather than area
          - Routing delay is easier to handle
        - Additional timing requirements, e.g. phase, etc.

  26. Multi-Context
      - Why not do multi-context in SRAM FPGAs?
      - Store contexts in-place
        - High area overhead; the area could be used to implement more normal LUTs
        - In DRAF: little overhead due to dense DRAM MAT array
      - On-chip backup storage
        - Significant context switch overheads in power and latency
        - In DRAF: zero latency and power for context switch

  27. Design Exploration
      - Lots of data in paper
      - Main tradeoff is between area and latency
        - Larger LUT: better area, worse latency
        - Smaller LUT: worse area, better latency
      - A major limitation is the CAD tool
        - Cannot efficiently map applications to large LUTs
      - Final LUT size
        - 7-input, 2-output, 8-context
        - 64 rows, 32 columns, 2048-bit subarray
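
The final sizes on this slide are mutually consistent if the 2048-bit subarray is read as holding the truth tables of all eight contexts. That reading is an assumption; the arithmetic itself can be checked directly, using a hypothetical helper `lut_bits`:

```python
def lut_bits(inputs, outputs):
    """Bits needed to store a full truth table with the given I/O width."""
    return (1 << inputs) * outputs


bits_per_context = lut_bits(inputs=7, outputs=2)  # 128 entries x 2 bits
bits_all_contexts = bits_per_context * 8          # 8 contexts
subarray_bits = 64 * 32                           # rows x columns
```

Both totals come to 2048 bits, far smaller than the roughly 1k-row by 8k-bit array that the challenges slide starts from.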

  28. DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
      Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis
      Session 8A, Wednesday 9am

  29. The Need for High-density & Low-power FPGAs
      - FPGA accelerators improve performance and energy efficiency
        - Recently used for many datacenter apps (Microsoft, Baidu, …)
      - Datacenter systems
        - Need high-density FPGAs for large accelerators for multiple apps
        - Need low-power FPGAs to simplify integration in servers and racks
      - Mobile systems
        - Need high-density FPGAs for accelerators for multiple apps
        - Need low-power FPGAs for low cost and long battery life

  30. DRAF: A High-density & Low-power FPGA
      - Based on dense DRAM arrays instead of SRAM LUTs
        - 10-100x density of conventional FPGAs
        - 1/3 power consumption of conventional FPGAs
        - 13x speedup over Xeon cores
      - Come to the talk to learn about
        - Dense, slow DRAM arrays as small, fast LUTs
        - Phase-based timing to address the problem of destructive reads
        - Multi-context support with instantaneous context switch
      - Session 8A, Wednesday 9am
