
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

Mingyu Gao and Christos Kozyrakis, Stanford University, http://mast.stanford.edu. HPCA, March 14, 2016.


  1. HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA – March 14, 2016

  2. PIM is Coming Back …
     - End of Dennard scaling: energy-bound systems
     - In-memory analytics: MapReduce, graph processing, deep neural networks, …
     - 3D stacking: HMC, HBM
     - Near-Data Processing (NDP)
     Figs: www.extremetech.com, www.cisl.columbia.edu/grads/tuku/research/, www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html

  3. NDP Logic Requirements
     - Area-efficient
       - High processing throughput to match the high memory bandwidth
       - 128 GBps per 50 mm² stack → > 32 Gflops → > 0.6 Gflops/mm²
     - Power-efficient
       - Thermal constraints limit clock frequency
       - 5 W per stack → 100 mW/mm²
     - Flexible
       - Must amortize manufacturing cost through reuse across apps
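The per-stack budget on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope reading: the slide gives the end figures but not the derivation, so the 4 bytes consumed per operation (`BYTES_PER_OP`) is an assumption chosen because it reproduces the 32 Gflops figure, and the function names are illustrative.

```python
# Back-of-the-envelope check of the NDP logic budget per memory stack.
# Figures taken from the slide: 128 GBps bandwidth, 50 mm^2 area, 5 W power.
STACK_BANDWIDTH_GBPS = 128.0
STACK_AREA_MM2 = 50.0
STACK_POWER_W = 5.0

# ASSUMPTION (not stated on the slide): one operation per 4-byte word streamed.
BYTES_PER_OP = 4.0


def required_gflops(bandwidth_gbps: float, bytes_per_op: float) -> float:
    """Throughput needed to keep up with the memory bandwidth."""
    return bandwidth_gbps / bytes_per_op


def compute_density(gflops: float, area_mm2: float) -> float:
    """Required compute density in Gflops per mm^2."""
    return gflops / area_mm2


def power_density_mw(power_w: float, area_mm2: float) -> float:
    """Power budget in mW per mm^2."""
    return power_w * 1000.0 / area_mm2


gflops = required_gflops(STACK_BANDWIDTH_GBPS, BYTES_PER_OP)
density = compute_density(gflops, STACK_AREA_MM2)
power = power_density_mw(STACK_POWER_W, STACK_AREA_MM2)
print(gflops, density, power)
```

Under that assumption the numbers line up with the slide: 32 Gflops per stack, 0.64 Gflops/mm² (quoted as "> 0.6"), and 100 mW/mm².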

  4. NDP Logic Options (compared on area efficiency, power efficiency, and flexibility)
     - Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM]
     - FPGA (fine-grained) [Active Pages]
     - CGRA (coarse-grained) [NDA]
     - ASIC [MSA, LiM]

  5. Reconfigurable Logic Challenges
     - FPGA
       - Area overhead due to support for bit-level configuration
     - CGRA
       - Traditional CGRAs
         - Limited flexibility in interconnects, only for regular computation patterns
       - DySER [HPCA’11] and NDA [HPCA’15]
         - High power due to circuit-switched routing
         - Inefficient for branches and irregular data layouts
     - Heterogeneity: achieve the best of FPGA and CGRA

  6. Outline
     - Motivation
     - NDP System Design
     - Heterogeneous Reconfigurable Logic (HRL)
     - Evaluation
     - Conclusions

  7. Overall System Architecture
     - Host processor: multi-core chip with cache hierarchy; runs code with high temporal locality
     - Memory stack with NDP capability: runs memory-intensive code
     - Multiple stacks linked to host processor through high-speed serial links

  8. NDP Stack
     - Vault
       - Vertical channel
       - Dedicated memory controller
       - 8 – 16 vaults per stack
     - Vault vs. DDR3 channel
       - 10x bandwidth (160 GBps)
       - 3-5x power improvement
     - Vault logic
       - Multiple PEs + control logic
       - NoC to interconnect vaults
     [Figure: stack cross-section with DRAM dies (banks, channel) above a logic die (vault logic, NoC)]

  9. Iterative Execution Flow
     - Processing phase: PEs run tasks independently and in parallel
     - Communication phase: data exchange and sync between PEs [PACT’15]
       - Communication within and across stacks
     [Figure: tasks, process on PEs, local buffer, pull, next tasks]
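The two-phase flow on this slide resembles a bulk-synchronous loop. The sketch below is a minimal software model under stated assumptions, not the hardware mechanism: `run_iteration`, `task_fn`, and the per-destination local buffers are hypothetical names, and PEs are modeled as plain Python lists processed one after another rather than in parallel.

```python
from collections import defaultdict


def run_iteration(pe_inputs, task_fn):
    """Run one iteration: a processing phase followed by a communication phase.

    pe_inputs: list with one list of input items per PE.
    task_fn(item, num_pes): returns a list of (destination_pe, value) pairs.
    Returns the inputs for the next iteration, one list per PE.
    """
    num_pes = len(pe_inputs)

    # Processing phase: each PE works only on its own inputs and writes
    # results into its own local buffer, keyed by destination PE.
    local_buffers = []
    for items in pe_inputs:
        buffer = defaultdict(list)
        for item in items:
            for dest, value in task_fn(item, num_pes):
                buffer[dest].append(value)
        local_buffers.append(buffer)

    # Communication phase: every PE pulls the data addressed to it from
    # all local buffers; this acts as the synchronization point.
    next_inputs = [[] for _ in range(num_pes)]
    for dest in range(num_pes):
        for buffer in local_buffers:
            next_inputs[dest].extend(buffer.get(dest, []))
    return next_inputs


def increment_and_route(item, num_pes):
    """Toy task: add one, then send the result to PE (result mod num_pes)."""
    result = item + 1
    return [(result % num_pes, result)]


result = run_iteration([[0, 1], [2, 3]], increment_and_route)
print(result)  # [[2, 4], [1, 3]]
```

Keeping outputs in a local buffer until the pull step means no PE depends on another PE during the processing phase, which is what allows the tasks to run independently.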

  10. Vault Logic
      - Handles task control and data communication
      - Allows the use of reconfigurable or custom PEs
      [Figure: vault logic block diagram. Labels: Global Controller (Host Processor); Task Queue with (op, addr, sz) entries; input & output data info from host or other PEs; Generic Load/Store Unit (coalesce writebacks, in-place combine ops); DMA Ctrl for streaming data (e.g., graph edge list); Scratchpad Buffer for cached data (e.g., graph vertices); PEs built from fixed logic or user-defined logic; Output Queues for output writes (separate queue for each consumer); Mem Ctrl to local memory; Router to remote memory]
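The slide lists "coalesce writebacks" and "in-place combine ops" for the load/store unit without describing the mechanism. One plausible reading, sketched below purely as an illustration, is a write queue that merges pending writes to the same address and applies a combine operation when draining to memory. The class `CoalescingWriteQueue` and its methods are hypothetical names, and a Python dict stands in for DRAM.

```python
import operator


class CoalescingWriteQueue:
    """Toy model of a write queue that coalesces and combines writes.

    ASSUMPTION: writes to the same address are merged with `combine`
    before reaching memory; the slide does not specify this behavior.
    """

    def __init__(self, combine):
        self.combine = combine
        self.pending = {}  # address -> combined pending value

    def write(self, addr, value):
        """Queue a write, merging it with any pending write to addr."""
        if addr in self.pending:
            self.pending[addr] = self.combine(self.pending[addr], value)
        else:
            self.pending[addr] = value

    def drain(self, memory):
        """Apply pending writes to memory and return how many were issued."""
        issued = len(self.pending)
        for addr, value in self.pending.items():
            if addr in memory:
                # In-place combine: fold the update into the stored value.
                memory[addr] = self.combine(memory[addr], value)
            else:
                memory[addr] = value
        self.pending.clear()
        return issued


queue = CoalescingWriteQueue(operator.add)
queue.write(0x10, 1.0)
queue.write(0x20, 2.0)
queue.write(0x10, 0.5)  # coalesced with the earlier write to 0x10

memory = {0x10: 4.0}
issued = queue.drain(memory)
print(issued, memory[0x10], memory[0x20])  # 2 5.5 2.0
```

Three queued writes turn into two memory updates here, which is the kind of traffic reduction that coalescing is meant to provide for updates such as accumulating values into graph vertices.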

  11. Outline
      - Motivation
      - NDP System Design
      - Heterogeneous Reconfigurable Logic (HRL)
      - Evaluation
      - Conclusions

  12. HRL Features
      - Fine-grained + coarse-grained reconfigurable blocks (area-efficiency and flexibility)
        - LUTs for flexible control
        - ALUs for efficient arithmetic
      - Static interconnects (power-efficiency)
        - Wide network for data
        - Separate and narrow network for control
      - Special blocks for branches & irregular data layout (flexibility)
      - Compute throughput per Watt: 2.2x over FPGA, 1.7x over CGRA

  13. HRL Array: Logic Blocks
      - FPGA-style configurable block (CLB)
        - LUTs for embedded control logic
        - Special functions: sigmoid, tanh, etc.
      - CGRA-style functional unit (FU)
        - Efficient 48-bit arithmetic/logic ops
        - Registers for pipelining and retiming
      - Flexible IO interface
        - Simple but flexible alignment
      - Output MUX Block (OMB)
        - Configurable MUXes (tree, cascading, parallel)
        - Put close to output, low cost and flexible
      [Figure: HRL array with control and data inputs at the top, three rows of one CLB and five FUs, a bottom row of one CLB and five OMBs, control and data outputs at the bottom, 1-bit control routing tracks and 16-bit data routing tracks]

  14. HRL Array: Routing
      [Figure: HRL array of CLB, FU, and OMB blocks with control/data inputs and outputs, 1-bit control routing tracks and 16-bit data routing tracks]

  15. HRL Array: Routing
      - Fully static for low power
      - Separate data and control to reduce area
        - 1-bit control, few tracks
        - 16-bit data, bus-based
        - No switches between the two networks
      [Figure: routing detail. Labels: 1-bit control net tracks; 16-bit data net tracks; routing switch box; block output to routing track connection; block input MUX connection; connection box and switch; FU (A, B, Y); MXB (INA, INB, sel, out); CLB (in, OUT)]

  16. HRL Array: IO
      [Figure: HRL array of CLB, FU, and OMB blocks with control/data inputs and outputs, 1-bit control routing tracks and 16-bit data routing tracks]

  17. HRL Array: IO
      - Data IO
        - Connects to data net
        - Simple 16-bit chunk alignment
      - Control IO
        - Connects to control tracks
        - Same as FPGA
      - Low cost and sufficiently flexible even for irregular data
      [Figure: a 48-bit fixed-point field, a 16-bit short, and a 32-bit int mapped onto 16-bit data tracks through the data IO; control IO connected to 1-bit control tracks]
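The 16-bit chunk alignment on this slide can be illustrated by splitting mixed-width fields into 16-bit pieces, one per data track. This is a minimal sketch, not the actual IO logic: the helper names (`to_chunks`, `from_chunks`, `pack_record`) are hypothetical, and the low-chunk-first ordering is an assumption, since the slide does not specify chunk order.

```python
CHUNK_BITS = 16  # width of one data track, per the slide


def to_chunks(value, width_bits):
    """Split an unsigned field into 16-bit chunks, lowest chunk first (assumed order)."""
    if width_bits % CHUNK_BITS != 0:
        raise ValueError("field width must be a multiple of 16 bits")
    mask = (1 << CHUNK_BITS) - 1
    return [(value >> shift) & mask for shift in range(0, width_bits, CHUNK_BITS)]


def from_chunks(chunks):
    """Reassemble a field from its 16-bit chunks, lowest chunk first."""
    value = 0
    for index, chunk in enumerate(chunks):
        value |= chunk << (index * CHUNK_BITS)
    return value


def pack_record(fields):
    """Lay out (value, width_bits) fields back to back on 16-bit tracks."""
    tracks = []
    for value, width_bits in fields:
        tracks.extend(to_chunks(value, width_bits))
    return tracks


# The three field types shown on the slide: 48-bit fixed-point,
# 16-bit short, and 32-bit int. The values are arbitrary examples.
record = [(0x123456789ABC, 48), (0xBEEF, 16), (0xDEADC0DE, 32)]
tracks = pack_record(record)
print([hex(chunk) for chunk in tracks])
```

The record occupies six tracks (three, one, and two chunks respectively). Because every field width is a multiple of 16 bits, alignment only requires steering whole chunks, which fits the slide's point that the scheme is low cost yet flexible enough for irregular data.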

  18. Outline
      - Motivation
      - NDP System Design
      - Heterogeneous Reconfigurable Logic (HRL)
      - Evaluation
      - Conclusions

  19. Methodology
      - Workloads
        - 3 analytics frameworks: MapReduce, graph, DNN
        - 9 representative applications, 11 kernel circuits (KCs)
      - Technology
        - 45 nm area and power model
        - CGRA: DySER as in NDA [HPCA’11, HPCA’15]
        - FPGA: Xilinx Virtex-6
      - Tools
        - Synthesize, place & route by Yosys + VTR
        - System simulation by zsim

  20. Array Area
      - Coarse-grained FUs improve area efficiency
      - Same logic capacity for each type of array
      [Figure: single array area (mm²) for FPGA, CGRA, and HRL, broken down into logic, routing, data routing, and ctrl routing]

  21. Array Power
      - HRL: static but flexible routing
      - CGRA: high power on circuit-switched network
      - FPGA: half the frequency
      [Figure: power (mW) of FPGA, CGRA, and HRL for KC1 through KC11 and the average]

  22. Vault Power Efficiency
      - HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt
      [Figure: normalized perf/Watt of FPGA, CGRA, and HRL for KC1 through KC11 and the average]

  23. Overall Performance
      - ASIC represents the upper bound of efficiency
      - Cores, FPGA, CGRA only match 30% to 80% of ASIC
      - HRL has 92% of ASIC performance on average
      [Figure: normalized performance of Cores, FPGA, CGRA, HRL, and ASIC for GroupBy, Hist, LinReg, PageRank, SSSP, CC, ConvNet, MLP, and dA, with applications grouped into "memory bandwidth not saturated" and "memory bandwidth saturated"]

  24. Conclusions
      - NDP logic requirements: area + power efficiency, flexibility
      - Heterogeneous reconfigurable logic (HRL)
        - Fine-grained + coarse-grained logic blocks
        - Static and separate data and control networks
        - Special blocks for branching and layout management
        - Vault logic handles communication and control
      - HRL for in-memory analytics
        - 2.2x performance/Watt over FPGA and 1.7x over CGRA
        - 92% of ASIC performance on average

  25. Thanks! Questions?
