HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA – March 14, 2016
PIM is Coming Back …
o End of Dennard scaling: energy-bound systems
o In-memory analytics: MapReduce, graph processing, deep neural networks, …
o 3D memory stacking: HMC, HBM
Near-Data Processing (NDP)
Figs: www.extremetech.com, www.cisl.columbia.edu/grads/tuku/research/, www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html
NDP Logic Requirements
Area-efficient
o High processing throughput to match the high memory bandwidth
o 128 GBps per 50 mm² stack → 32 Gflops → 0.6 Gflops/mm²
Power-efficient
o Thermal constraints limit clock frequency
o 5 W per stack → 100 mW/mm²
Flexible
o Must amortize manufacturing cost through reuse across apps
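The budgets above follow from simple arithmetic. The snippet below is a minimal sketch of that calculation. It assumes one floating-point operation per 4-byte operand streamed from memory; the slide does not state this, but it is the ratio that reproduces the 32 Gflops figure. All names are illustrative.

```python
# Back-of-the-envelope check of the NDP logic budgets.
# ASSUMPTION (not stated on the slide): one floating-point operation per
# 4-byte operand streamed from memory.

STACK_BANDWIDTH_GB_PER_S = 128.0  # from the slide
STACK_AREA_MM2 = 50.0             # from the slide
STACK_POWER_W = 5.0               # from the slide
BYTES_PER_OPERAND = 4.0           # assumption


def required_gflops(bandwidth_gb_per_s, bytes_per_operand):
    """Throughput needed to consume operands as fast as memory delivers them."""
    return bandwidth_gb_per_s / bytes_per_operand


gflops = required_gflops(STACK_BANDWIDTH_GB_PER_S, BYTES_PER_OPERAND)
gflops_per_mm2 = gflops / STACK_AREA_MM2
power_mw_per_mm2 = STACK_POWER_W * 1000.0 / STACK_AREA_MM2

print(f"{gflops:.0f} Gflops, {gflops_per_mm2:.2f} Gflops/mm^2, "
      f"{power_mw_per_mm2:.0f} mW/mm^2")
```

Under this assumption the compute density comes out at 0.64 Gflops/mm², which the slide rounds to 0.6.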
NDP Logic Options
Compared on area efficiency, power efficiency, and flexibility:
o Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM]
o FPGA (fine-grained) [Active Pages]
o CGRA (coarse-grained) [NDA]
o ASIC [MSA, LiM]
[Table: the per-option ratings were shown graphically on the slide]
Reconfigurable Logic Challenges
FPGA
o Area overhead due to support for bit-level configuration
CGRA
o Traditional CGRAs
• Limited flexibility in interconnects, only for regular computation patterns
o DySER [HPCA’11] and NDA [HPCA’15]
• High power due to circuit-switched routing
• Inefficient for branches and irregular data layouts
Heterogeneity: achieve the best of FPGA and CGRA
Outline: Motivation; NDP System Design; Heterogeneous Reconfigurable Logic (HRL); Evaluation; Conclusions
Overall System Architecture
Host processor
o Multi-core chip with cache hierarchy
o Runs code with high temporal locality
Memory stack with NDP capability
o Runs memory-intensive code
Multiple stacks linked to host processor through high-speed serial links
NDP Stack
Vault
• Vertical channel
• Dedicated memory controller
• 8 – 16 vaults per stack
Compared with a DDR3 channel
• 10x bandwidth (160 GBps)
• 3-5x power improvement
Vault logic
• Multiple PEs + control logic
• NoC to interconnect vaults
[Figure: stacked DRAM dies divided into banks and vertical channels (vaults) on top of a logic die containing per-vault logic and the NoC]
Iterative Execution Flow
Processing phase: PEs run tasks independently and in parallel
Communication phase: data exchange and sync b/w PEs [PACT’15]
o Communication within and across stacks
[Figure: tasks processed by PEs, then pulled into the local buffers of the PEs that run the next round of tasks]
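The two-phase flow can be modeled in a few lines. This is a sequential sketch only: the function names and the ring-forwarding kernel are hypothetical, and the exchange is modeled as messages delivered at a barrier, whereas the slide's figure labels the exchange as a pull.

```python
from collections import defaultdict


def run_iterations(num_pes, initial, task, num_iters):
    """Alternate processing and communication phases.

    initial: list of per-PE local buffers (lists of values)
    task:    hypothetical kernel, (pe_id, local_buffer) -> [(dest_pe, value)]
    """
    buffers = [list(b) for b in initial]
    for _ in range(num_iters):
        # Processing phase: each PE works only on its own local buffer.
        outgoing = [task(pe, buffers[pe]) for pe in range(num_pes)]
        # Communication phase: exchange data, then synchronize. In this
        # sequential model the end of the loop body acts as the barrier.
        inboxes = defaultdict(list)
        for messages in outgoing:
            for dest, value in messages:
                inboxes[dest].append(value)
        buffers = [inboxes[pe] for pe in range(num_pes)]
    return buffers


def ring_task(pe, local, num_pes=4):
    # Example kernel: sum the local values and forward the sum to the
    # next PE in a ring.
    return [((pe + 1) % num_pes, sum(local))]


result = run_iterations(4, [[1], [2], [3], [4]], ring_task, 2)
print(result)
```

Because no PE touches another PE's buffer during the processing phase, the tasks in that phase could run in parallel without locks; all sharing is confined to the communication phase.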
Vault Logic
Handles task control and data communication
Allows the use of reconfigurable or custom PEs
Fixed logic
o Global Task Controller: input & output data info from host or other PEs
o Generic Load/Store Unit (op, addr, sz requests)
• Coalesce writebacks
• In-place combine ops
o DMA controller, memory controller (to local memory), router (to remote memory)
o Output write queues: separate queue for each consumer
User-defined logic
o PEs, each with a scratchpad buffer
Data types
o Streaming data (e.g., graph edge list)
o Cached data (e.g., graph vertices)
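The slide lists "coalesce writebacks" and "in-place combine ops" without describing the hardware. The sketch below is one possible software model of a per-consumer output queue with those two behaviors. The class, its methods, and the addresses used are hypothetical, not the actual design.

```python
class OutputQueue:
    """Hypothetical model of one per-consumer output write queue."""

    def __init__(self, combine=None):
        # addr -> pending value; dicts keep insertion order, so draining
        # returns addresses in order of their first write.
        self.pending = {}
        # Optional binary function used to merge writes to the same address.
        self.combine = combine

    def write(self, addr, value):
        if addr in self.pending and self.combine is not None:
            # In-place combine: merge with the value already queued.
            self.pending[addr] = self.combine(self.pending[addr], value)
        else:
            # Coalesce: keep a single entry per address, newest value wins.
            self.pending[addr] = value

    def drain(self):
        """Return all pending (addr, value) writebacks and empty the queue."""
        items = list(self.pending.items())
        self.pending.clear()
        return items


# Example: accumulate contributions to the same vertex before writing back.
adder = OutputQueue(combine=lambda old, new: old + new)
adder.write(0x10, 1)
adder.write(0x20, 5)
adder.write(0x10, 2)
combined = adder.drain()
print(combined)
```

In this model, three writes leave the queue as two writebacks, which is the kind of traffic reduction that coalescing and combining aim for.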
Outline: Motivation; NDP System Design; Heterogeneous Reconfigurable Logic (HRL); Evaluation; Conclusions
HRL Features
Fine-grained + coarse-grained reconfigurable blocks: area-efficiency and flexibility
o LUTs for flexible control
o ALUs for efficient arithmetic
Static interconnects: power-efficiency
o Wide network for data
o Separate and narrow network for control
Special blocks for branches & irregular data layout: flexibility
Compute throughput per Watt: 2.2x over FPGA, 1.7x over CGRA
HRL Array: Logic Blocks
CLB: FPGA-style configurable block
• LUTs for embedded control logic
• Special functions: sigmoid, tanh, etc.
FU: CGRA-style functional unit
• Efficient 48-bit arithmetic/logic ops
• Registers for pipelining and retiming
OMB: Output MUX Block
• Configurable MUXes (tree, cascading, parallel)
• Put close to output, low cost and flexible
Flexible IO Interface
• Simple but flexible alignment
[Figure: HRL array of CLBs, FUs, and a row of OMBs, connected by 1-bit control routing tracks and 16-bit data routing tracks, with control and data inputs and outputs]
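One way to picture how FUs and a MUX block cooperate on a branch is if-conversion: compute both paths, then let a 1-bit control signal select the result. The sketch below illustrates that general technique under stated assumptions. The function names are hypothetical, and the slides do not specify that branches are mapped exactly this way.

```python
MASK48 = (1 << 48) - 1  # FUs operate on 48-bit values


def fu_sub(a, b):
    """Model of a 48-bit FU subtraction (wraps modulo 2**48)."""
    return (a - b) & MASK48


def compare_lt(a, b):
    """Produce a 1-bit control signal (hypothetical comparison block)."""
    return 1 if a < b else 0


def mux(sel, in_a, in_b):
    """Two-input MUX: the control bit picks in_b when set, else in_a."""
    return in_b if sel else in_a


def abs_diff(a, b):
    # Software form:  if a < b: r = b - a  else: r = a - b
    # Spatial form: both subtractions are evaluated, the MUX picks one.
    sel = compare_lt(a, b)
    return mux(sel, fu_sub(a, b), fu_sub(b, a))


print(abs_diff(3, 10), abs_diff(10, 3))
```

The data path carries the wide operands while the select line is a single bit, which matches the split between a wide data network and a narrow control network.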
HRL Array: Routing
Separate data and control networks to reduce area
o 1-bit control net, few tracks
o 16-bit data net, bus-based
o No switches b/w two networks
Fully static for low power
[Figure: connection box and switch box details, showing the block output to routing track connection, the routing switch box, and the block input MUX connection, for an FU (inputs A, B; output Y), a CLB, and a MUX block (sel, INA, INB, OUT)]
HRL Array: IO
Data IO
• Connects to data net
• Simple 16-bit chunk alignment
Control IO
• Connects to control tracks
• Same as FPGA
Low cost and sufficiently flexible even for irregular data
[Figure: 48-bit fixed-point, 32-bit int, and 16-bit short values mapped onto 16-bit data tracks; control IO mapped onto 1-bit control tracks]
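The 16-bit chunk alignment can be illustrated by splitting values of each width into track-sized pieces. This is a minimal sketch with hypothetical helper names; the little-endian chunk order is an assumption, since the slide does not specify which chunk goes on which track.

```python
CHUNK_BITS = 16  # width of one data routing track
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def to_chunks(value, width_bits):
    """Split an unsigned value into 16-bit chunks, least significant first."""
    if width_bits % CHUNK_BITS != 0:
        raise ValueError("width must be a multiple of 16 bits")
    count = width_bits // CHUNK_BITS
    return [(value >> (CHUNK_BITS * i)) & CHUNK_MASK for i in range(count)]


def from_chunks(chunks):
    """Reassemble a value from chunks given least significant first."""
    value = 0
    for i, chunk in enumerate(chunks):
        value |= (chunk & CHUNK_MASK) << (CHUNK_BITS * i)
    return value


fixed48 = to_chunks(0x123456789ABC, 48)  # occupies three data tracks
int32 = to_chunks(0xDEADBEEF, 32)        # occupies two data tracks
short16 = to_chunks(0x00FF, 16)          # occupies one data track
print([hex(c) for c in fixed48], [hex(c) for c in int32], [hex(c) for c in short16])
```

Because every supported width is a multiple of 16 bits, alignment reduces to choosing which tracks a value's chunks occupy, with no bit-level shifting.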
Outline: Motivation; NDP System Design; Heterogeneous Reconfigurable Logic (HRL); Evaluation; Conclusions
Methodology
Workloads
o 3 analytics frameworks: MapReduce, graph, DNN
o 9 representative applications, 11 kernel circuits (KCs)
Technology
o 45 nm area and power model
o CGRA: DySER as in NDA [HPCA’11, HPCA’15]
o FPGA: Xilinx Virtex-6
Tools
o Synthesize, place & route by Yosys + VTR
o System simulation by zsim
Array Area
Coarse-grained FUs improve area efficiency
Same logic capacity for each type of array
[Figure: single array area (mm²) for FPGA, CGRA, and HRL, broken down into logic, routing, data routing, and ctrl routing]
Array Power
o HRL: static but flexible routing
o CGRA: high power on circuit-switched network
o FPGA: half the frequency
[Figure: array power (mW) of FPGA, CGRA, and HRL for KC1 to KC11 and the average]
Vault Power Efficiency
HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt
[Figure: normalized perf/Watt of FPGA, CGRA, and HRL for KC1 to KC11 and the average]
Overall Performance
o ASIC represents the upper bound of efficiency
o Cores, FPGA, CGRA only match 30% to 80% of ASIC
o HRL has 92% of ASIC performance on average
[Figure: normalized performance of Cores, FPGA, CGRA, HRL, and ASIC for GroupBy, Hist, LinReg, PageRank, SSSP, CC, ConvNet, MLP, and dA, with annotations marking where memory bandwidth is not saturated and where it is saturated]
Conclusions
NDP logic requirements: area + power efficiency, flexibility
Heterogeneous reconfigurable logic (HRL)
o Fine-grained + coarse-grained logic blocks
o Static and separate data and control networks
o Special blocks for branching and layout management
o Vault logic handles communication and control
HRL for in-memory analytics
o 2.2x performance/Watt over FPGA and 1.7x over CGRA
o Reaches 92% of ASIC performance on average
Thanks! Questions?