
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis. ISCA, June 22, 2016.


  1. DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
     Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis
     ISCA, June 22, 2016

  2. FPGA-Based Accelerators
     - Improve performance and energy efficiency
     - Good balance between flexibility (CPUs) and efficiency (ASICs)
     - Recently used for many datacenter apps
       o Image/video processing, web search, neural networks, …
     - Pictures: Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA'14

  3. Motivation
     - Deploy FPGAs in cost- and power-constrained systems
     - Datacenter systems
       o High-density FPGAs for large accelerators for multiple apps
       o Low-power FPGAs to simplify integration in servers and racks
     - Mobile systems
       o High-density FPGAs for accelerators for multiple apps
       o Low-power FPGAs for low cost and long battery life

  4. DRAF in a Nutshell
     - A high-density & low-power FPGA
       o Bit-level reconfigurable, just like conventional FPGAs
     - Uses dense DRAM technology for lookup tables
       o Replacing the SRAM technology in conventional FPGAs
     - DRAF vs. FPGA
       o 10-100x logic density
       o 1/3 power consumption
       o Multi-context support with fast context switch

  5. Challenges of Building DRAM-based FPGAs

  6. DRAM Array Structure
     - [Figure: DRAM subarray, labeled with master wordline, row decoder, local wordline, bitline, MAT, sense-amp, input, and output]
     - A DRAM subarray is naturally a lookup table
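
The observation on slide 6 can be illustrated with a toy model: the row address plays the role of the LUT input, and the bits stored in the selected row are the LUT output. This is only a sketch; the class name `SubarrayLUT`, its methods, and the sizes are hypothetical and chosen for illustration, and timing and destructive reads are not modeled.

```python
class SubarrayLUT:
    """Toy model of a DRAM subarray viewed as a lookup table (illustrative only)."""

    def __init__(self, input_bits, output_bits):
        self.input_bits = input_bits
        self.output_bits = output_bits
        # One stored word per row; the row decoder selects rows[addr].
        self.rows = [0] * (1 << input_bits)

    def program(self, func):
        """Fill every row with func(address), truncated to output_bits."""
        mask = (1 << self.output_bits) - 1
        for addr in range(len(self.rows)):
            self.rows[addr] = func(addr) & mask

    def read(self, addr):
        """Row decode + sense: return the word stored in the addressed row."""
        return self.rows[addr]


# Example: a 4-input, 2-output LUT holding the population count modulo 4.
lut = SubarrayLUT(input_bits=4, output_bits=2)
lut.program(lambda a: bin(a).count("1"))
```

With a ~10-bit row address and a row thousands of bits wide, the same structure would be a far larger table than the 6-input LUTs of a conventional FPGA, which is the size mismatch raised on the next slide.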

  7. Challenges
     - [Figure: DRAM subarray annotated with ~10-bit input, ~1k rows, ~8k-bit output, and 10-30 ns delay]
     - Mismatched LUT size: an 8192-bit LUT?
     - Slow speed: a LUT with 10 ns delay?
     - Destructive access: data lost after access?

  8. Destructive Access
     - Explicit activation, restoration, and precharge operations
       o Longer access delay due to serialization
     - Issue of LUT chaining: order of LUT access
       o Must activate L2 after L1
       o Must activate L4 after both L2 & L3
     - [Figure: physical path through LUTs L1-L4 and registers R1, R2, relative to the user clock]

  9. DRAF Architecture
     - Basic logic element, multi-context support, timing

  10. DRAF Overview
     - Same island layout and configurable interconnect as FPGA
     - CLB: contains multiple basic logic elements (BLEs)
     - DSP: in DRAM technology; slower but not critical
     - Block RAM: uses DRAM arrays

  11. Basic Logic Element
     - 7-10 bits input, 2-4 bits output
     - Narrower MAT: 1k bits to 8-16 bits
     - Specialized column logic: better flexibility
     - Additional FFs & MUXs: registering & retiming
     - Single-MAT access: multi-context
     - [Figure: BLE built from a subarray, labeled with master wordline, row decoder, local wordline, bitline, MATs, sense-amp, column logic, and FFs]

  12. Multi-Context Support
     - DRAF supports 8-16 contexts per chip
       o Context: one MAT per BLE
       o Efficient use of MATs with little area and power overhead
     - Instant switch between active contexts
       o Similar to a context switch between processes on a CPU
     - Context uses
       o One context per accelerator design or application
       o One context per part of a very large accelerator design
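
As a rough illustration of "one MAT per BLE" per context, the sketch below models a BLE as several independent tables with a single active-context index, so that switching contexts changes only which table is read. The class `MultiContextBLE` and its methods are hypothetical names for this sketch, not part of the DRAF design as presented.

```python
class MultiContextBLE:
    """Toy BLE: one lookup table (MAT) per context, only one active at a time."""

    def __init__(self, num_contexts, input_bits):
        self.mats = [[0] * (1 << input_bits) for _ in range(num_contexts)]
        self.active = 0

    def load(self, context, table):
        """Program the MAT belonging to one context."""
        self.mats[context] = list(table)

    def switch(self, context):
        """Context switch: only the selected MAT changes, no reprogramming."""
        self.active = context

    def lookup(self, addr):
        return self.mats[self.active][addr]


# Example: context 0 holds a 2-input AND, context 1 a 2-input XOR.
ble = MultiContextBLE(num_contexts=2, input_bits=2)
ble.load(0, [0, 0, 0, 1])
ble.load(1, [0, 1, 1, 0])
and_result = ble.lookup(0b11)
ble.switch(1)
xor_result = ble.lookup(0b11)
```

The same inputs give different outputs before and after `switch`, which mirrors the slide's point that only one context is active at a time while the others stay resident.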

  13. Timing: Destructive Access
     - Issue of LUT chaining: order of LUT access
     - Solution: phases, similar to critical path finding
     - [Figure: LUTs on the physical path between registers R1 and R2 assigned to phases (L1 and L3 in phase 0, L2 in phase 1, L4 in phase 2); timeline of phases 0-2 within one user clock]
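
One way to read "similar to critical path finding" is as a longest-path levelization: a LUT's phase is the length of the longest chain of LUTs feeding it. The sketch below assumes that reading and an acyclic netlist; the function `assign_phases` and the `fanin` representation are hypothetical, and the slides do not describe the actual CAD algorithm.

```python
def assign_phases(fanin):
    """Return {lut: phase}, where phase = longest chain of upstream LUTs.

    fanin maps each LUT to the LUTs whose outputs it consumes within the
    same user clock cycle (register boundaries end a chain). Assumes a DAG.
    """
    phases = {}

    def phase_of(lut):
        if lut not in phases:
            preds = fanin.get(lut, ())
            # A LUT with no upstream LUTs lands in phase 0.
            phases[lut] = 1 + max((phase_of(p) for p in preds), default=-1)
        return phases[lut]

    for lut in fanin:
        phase_of(lut)
    return phases


# The chaining constraints from slide 8: L2 after L1, L4 after both L2 and L3.
example = {"L1": [], "L3": [], "L2": ["L1"], "L4": ["L2", "L3"]}
phases = assign_phases(example)
```

For this example the result puts L1 and L3 in phase 0, L2 in phase 1, and L4 in phase 2, so every LUT is activated only after all of its producers.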

  14. Timing: Latency Optimization
     - Issue: precharge and restore delays
     - Solution: 3-way delay overlapping
       o Hide PRE/RST delays with wire propagation delay
     - Performance gap between DRAF and FPGA reduces from >10x to 2-4x
     - [Figure: timelines of LUT-1 and LUT-2 (PRE, ACT, RST) and the wire between them, before and after overlapping, with the saved delay marked]
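
The effect of overlapping can be sketched with a simple latency model. It assumes that, between two chained LUTs, the restore of the first, the wire propagation, and the precharge of the second run concurrently, so only the longest of the three is exposed. Both functions and all delay values below are hypothetical; the paper's timing model may differ.

```python
def chain_latency_serial(n_luts, pre, act, rst, wire):
    """Every LUT runs PRE, ACT, RST back to back, with a wire hop between LUTs."""
    return n_luts * (pre + act + rst) + (n_luts - 1) * wire


def chain_latency_overlapped(n_luts, pre, act, rst, wire):
    """Between LUTs, RST (previous), wire, and PRE (next) overlap.

    Only the first PRE and the last RST remain fully exposed.
    """
    hop = max(rst, wire, pre)
    return pre + n_luts * act + (n_luts - 1) * hop + rst


# Hypothetical delays in arbitrary time units, for a two-LUT chain.
serial = chain_latency_serial(2, pre=3, act=4, rst=3, wire=3)
overlapped = chain_latency_overlapped(2, pre=3, act=4, rst=3, wire=3)
saved = serial - overlapped
```

With these made-up numbers the chain drops from 23 to 17 time units. The saving grows with chain length, since each additional hop hides two of the three delays.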

  15. Summary
     - Challenges → solutions
       o Mismatched LUT size → multi-context BLE
       o Destructive access → phase-based timing
       o Slow speed → 3-way delay overlapping
     - Other design features (see paper)
       o Sense-amp as register
       o Time-multiplexed routing
       o Handling DRAM refresh

  16. Evaluation
     - Area, power, and performance against FPGA and CPU

  17. Methodology
     - Synthesize, place & route with Yosys + VTR
     - CACTI-3DD with 45 nm power and area models
     - Comparisons
       o 70 mm² FPGA based on Xilinx Virtex-6
       o 70 mm² DRAF device, 8-context
       o Intel Xeon E5-2630 multi-core processor (2.3 GHz)
     - 18 accelerator designs
       o MachSuite, Sirius, Vivado HLS Video Library, VTR benchmark suite
       o Web service, image processing, analytics, neural networks, …

  18. DRAF Chip Area & Power
     - 10x area improvement
     - 50x peak power reduction
     - [Figure: two log-scale plots, chip area (mm²) and peak chip power (W), versus logic capacity (in million 6-LUT equivalents), for FPGA and DRAF]

  19. FPGA vs. DRAF (Area)
     - 8-context DRAF occupies 19% less area than 1-context FPGA
       o 10x area efficiency: 8 designs in less silicon area than 1 design before
       o But only one context can be active at a time
     - [Figure: normalized minimum bounding area per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing; annotations: "Inefficient use of larger DRAM LUT", "exp/log functions"]
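
The "10x area efficiency" figure follows from the two numbers on this slide: eight resident designs in 81% of the area of one FPGA design. A quick check, using only the slide's values (the variable names are illustrative):

```python
contexts = 8                 # designs resident in the 8-context DRAF
draf_area = 1.0 - 0.19       # DRAF area relative to the 1-context FPGA
designs_per_unit_area = contexts / draf_area  # about 9.9, i.e. roughly 10x
```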

  20. FPGA vs. DRAF (Power)
     - Use one context in DRAF
     - DRAF consumes 1/3 the power of FPGA and 15% less energy
       o Note: current CAD tools are less efficient with DRAF
     - [Figure: normalized power per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing]

  21. Performance
     - DRAF is 2.7x slower than FPGA
     - DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core
     - [Figure: normalized throughput (log scale) per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi) for CPU, 4 CPU, FPGA, and DRAF; annotations: "Efficient line buffer", "exp/log functions"]
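
The headline ratios on this slide are consistent with each other if "ideal 4-core" is taken to mean perfect 4x scaling over a single core, which is an assumption here. The implied FPGA-over-CPU figure is only a rough composition of two averages, not a number reported on the slide.

```python
draf_vs_cpu = 13.5    # DRAF speedup over a single CPU core (from the slide)
fpga_vs_draf = 2.7    # FPGA speedup over DRAF (from the slide)
ideal_cores = 4       # assumption: "ideal 4-core" means a perfect 4x speedup

draf_vs_ideal_4core = draf_vs_cpu / ideal_cores   # 3.375, reported as 3.4x
implied_fpga_vs_cpu = draf_vs_cpu * fpga_vs_draf  # roughly 36x, derived only
```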

  22. Conclusions
     - DRAF: high-density and low-power reconfigurable fabric
       o Based on dense DRAM technology
       o Optimized timing + multi-context support
     - DRAF targets cost- and power-constrained applications
       o E.g., datacenters and mobile systems
     - DRAF trades off some performance for area & power efficiency
       o 10x smaller area, 3x less power, and 2.7x slower than FPGA
       o Still 13x speedup over Xeon cores

  23. Thanks! Questions?
