Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis - PowerPoint PPT Presentation

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA – March 14, 2016

PIM is Coming Back …
End of Dennard scaling; energy-bound systems
In-memory analytics: MapReduce, graph processing, deep neural networks, …
3D stacking: HMC, HBM
Near-Data Processing (NDP)
Figs: www.extremetech.com, www.cisl.columbia.edu/grads/tuku/research/, www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html

NDP Logic Requirements
▪ Area-efficient
  o High processing throughput to match the high memory bandwidth
  o 128 GBps per 50 mm² stack → > 32 Gflops → > 0.6 Gflops/mm²
▪ Power-efficient
  o Thermal constraints limit clock frequency
  o 5 W per stack → 100 mW/mm²
▪ Flexible
  o Must amortize manufacturing cost through reuse across apps
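
The figures on this slide follow from simple division. A minimal sketch of that arithmetic, assuming one floating-point operation per 4-byte word streamed from memory (the slide does not state the operand size; 4 bytes is the ratio implied by 128 GBps and 32 Gflops):

```python
# Back-of-the-envelope check of the NDP logic requirements.
# Assumption (not stated on the slide): one flop per 4-byte word read from memory.

BANDWIDTH_GBPS = 128.0   # memory bandwidth per stack, GB/s (from the slide)
STACK_AREA_MM2 = 50.0    # area per stack, mm^2 (from the slide)
POWER_BUDGET_W = 5.0     # power budget per stack, W (from the slide)
BYTES_PER_FLOP = 4.0     # assumed operand size


def required_gflops(bandwidth_gbps: float, bytes_per_flop: float) -> float:
    """Throughput needed to consume data as fast as memory supplies it."""
    return bandwidth_gbps / bytes_per_flop


def per_area(total: float, area_mm2: float) -> float:
    """Density of a per-stack quantity over the stack area."""
    return total / area_mm2


gflops = required_gflops(BANDWIDTH_GBPS, BYTES_PER_FLOP)
gflops_per_mm2 = per_area(gflops, STACK_AREA_MM2)
mw_per_mm2 = per_area(POWER_BUDGET_W * 1000.0, STACK_AREA_MM2)

print(f"{gflops:.0f} Gflops, {gflops_per_mm2:.2f} Gflops/mm^2, {mw_per_mm2:.0f} mW/mm^2")
```

Under that assumption the sketch reproduces the slide's bounds: 32 Gflops, a little over 0.6 Gflops/mm², and 100 mW/mm².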

NDP Logic Options
Compared on: Area Efficiency, Power Efficiency, Flexibility
▪ Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM]
▪ FPGA (fine-grained) [Active Pages]
▪ CGRA (coarse-grained) [NDA]
▪ ASIC [MSA, LiM]

Reconfigurable Logic Challenges
▪ FPGA
  o Area overhead due to support for bit-level configuration
▪ CGRA
  o Traditional CGRAs
    • Limited flexibility in interconnects, only for regular computation patterns
  o DySER [HPCA’11] and NDA [HPCA’15]
    • High power due to circuit-switched routing
    • Inefficient for branches and irregular data layouts
Heterogeneity: achieve the best of FPGA and CGRA

Outline
▪ Motivation
▪ NDP System Design
▪ Heterogeneous Reconfigurable Logic (HRL)
▪ Evaluation
▪ Conclusions

Overall System Architecture
Memory stack with NDP capability: runs memory-intensive code
Multi-core chip with cache hierarchy: runs code with high temporal locality
Multiple stacks linked to host processor through serial links
[Figure: memory stacks and host processor connected by high-speed serial links]

NDP Stack
Vault:
• Vertical channel
• Dedicated memory controller
• 8–16 vaults per stack
Vault vs. DDR3 channel:
• 10x bandwidth (160 GBps)
• 3-5x power improvement
Vault logic:
• Multiple PEs + control logic
• NoC to interconnect vaults
[Figure: DRAM dies with banks and channels stacked over a logic die containing the vault logic and NoC]

Iterative Execution Flow
▪ Processing phase: PEs run tasks independently and in parallel
▪ Communication phase: data exchange and sync between PEs [PACT’15]
  o Communication within and across stacks
[Figure: tasks running on PEs with local buffers; the two phases are labeled Process and Pull]
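
The two phases can be sketched in software as a loop that alternates independent per-PE work with a data exchange that also acts as a barrier. This is only an illustration under stated assumptions: the `PE` class, the summing task, and the ring communication pattern are hypothetical and are not taken from the slides.

```python
# Sketch of the two-phase iterative flow: process locally, then exchange data.
# All names (PE, communicate, run_iteration) and the summing task are hypothetical.
from dataclasses import dataclass, field


@dataclass
class PE:
    pe_id: int
    local_data: list[int]
    inbox: list[int] = field(default_factory=list)               # data pulled from other PEs
    outbox: dict[int, list[int]] = field(default_factory=dict)   # destination PE id -> values

    def process(self, num_pes: int) -> None:
        """Processing phase: touches only this PE's own state."""
        total = sum(self.local_data) + sum(self.inbox)
        self.inbox.clear()
        self.local_data = [total]
        # Stage one value for the next PE in a ring (an arbitrary example pattern).
        self.outbox = {(self.pe_id + 1) % num_pes: [total]}


def communicate(pes: list[PE]) -> None:
    """Communication phase: each PE pulls what the others staged for it."""
    for consumer in pes:
        for producer in pes:
            consumer.inbox.extend(producer.outbox.get(consumer.pe_id, []))
    for pe in pes:
        pe.outbox.clear()


def run_iteration(pes: list[PE]) -> None:
    for pe in pes:          # conceptually parallel; sequential here for clarity
        pe.process(len(pes))
    communicate(pes)        # doubles as the synchronization point between iterations


pes = [PE(0, [1, 2]), PE(1, [3]), PE(2, [4, 5])]
run_iteration(pes)
print([pe.inbox for pe in pes])   # [[9], [3], [3]]
```

Keeping the processing phase free of cross-PE accesses is what lets the tasks run in parallel; all sharing is deferred to the exchange step.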

Vault Logic
▪ Handles task control and data communication
▪ Allows the use of reconfigurable or custom PEs
[Figure: vault logic block diagram]
Components: Global Task Controller (Host Processor), Queue, Generic Load/Store Unit, Mem Ctrl (to local memory), Router (to remote memory), DMA Ctrl, Scratchpad Buffer, PEs (fixed logic or user-defined logic), Output Queues
Figure annotations:
• Input & output data info (op, addr, sz) from host or other PEs
• Streaming data (e.g., graph edge list)
• Cached data (e.g., graph vertices)
• Output write queues: separate queue for each consumer
• Coalesce writebacks
• In-place combine ops
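
The output-queue behavior named on this slide (a separate queue per consumer, coalesced writebacks, in-place combine ops) can be illustrated with a small software model. This is a sketch under assumptions, not the authors' hardware design: the `OutputQueues` class, its methods, and the addition used as the combine operation are all hypothetical.

```python
# Sketch of per-consumer output queues with in-place combining.
# OutputQueues and its methods are hypothetical names, not the authors' design.
from collections import OrderedDict
from typing import Callable


class OutputQueues:
    def __init__(self, combine: Callable[[float, float], float]):
        self.combine = combine
        # consumer id -> ordered map of address -> pending value
        self.queues: dict[int, OrderedDict[int, float]] = {}

    def write(self, consumer: int, addr: int, value: float) -> None:
        queue = self.queues.setdefault(consumer, OrderedDict())
        if addr in queue:
            # In-place combine: merge with the pending entry instead of enqueueing again.
            queue[addr] = self.combine(queue[addr], value)
        else:
            queue[addr] = value

    def drain(self, consumer: int) -> list[tuple[int, float]]:
        """Return the coalesced writes for one consumer and empty its queue."""
        return list(self.queues.pop(consumer, OrderedDict()).items())


queues = OutputQueues(combine=lambda old, new: old + new)
queues.write(consumer=1, addr=0x40, value=0.25)
queues.write(consumer=1, addr=0x80, value=0.5)
queues.write(consumer=1, addr=0x40, value=0.5)   # same address: combined in place
queues.write(consumer=2, addr=0x40, value=1.0)   # different consumer: separate queue
drained = queues.drain(1)
print(drained)   # [(64, 0.75), (128, 0.5)]
```

Combining at the producer reduces the number of writes that cross the memory controller or router, which is the motivation usually given for coalescing.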

HRL Features
▪ Fine-grained + coarse-grained reconfigurable blocks (area-efficiency and flexibility)
  o LUTs for flexible control
  o ALUs for efficient arithmetic
▪ Static interconnects (power-efficiency)
  o Wide network for data
  o Separate and narrow network for control
▪ Special blocks for branches & irregular data layout (flexibility)
Compute throughput per Watt: 2.2x over FPGA, 1.7x over CGRA

HRL Array: Logic Blocks
FPGA-style configurable block (CLB)
• LUTs for embedded control logic
• Special functions: sigmoid, tanh, etc.
CGRA-style functional unit (FU)
• Efficient 48-bit arithmetic/logic ops
• Registers for pipelining and retiming
Flexible IO Interface
• Simple but flexible alignment
Output MUX Block (OMB)
• Configurable MUXes (tree, cascading, parallel)
• Put close to output, low cost and flexible
[Figure: array of CLB, FU, and OMB blocks with 1-bit control routing tracks and 16-bit data routing tracks; control and data inputs at the top, control and data outputs at the bottom]
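
The slide names three MUX configurations for the output MUX block without defining them. The sketch below shows one plausible software reading of each, built from a 2:1 MUX. The function names and the exact semantics of "tree", "cascading", and "parallel" are assumptions for illustration, not details from the presentation.

```python
# Hypothetical software model of three MUX arrangements. The hardware semantics
# of "tree", "cascading" and "parallel" are assumed here, not taken from the slides.

def mux2(sel: int, a: int, b: int) -> int:
    """2:1 MUX: returns b when sel is 1, otherwise a."""
    return b if sel else a


def mux_tree(sel_lo: int, sel_hi: int, inputs: list[int]) -> int:
    """4:1 selection built as a balanced tree of 2:1 MUXes."""
    low = mux2(sel_lo, inputs[0], inputs[1])
    high = mux2(sel_lo, inputs[2], inputs[3])
    return mux2(sel_hi, low, high)


def mux_cascade(conds: list[int], values: list[int], default: int) -> int:
    """Priority chain, like if / elif / else: the first true condition wins."""
    result = default
    # Build the chain from the lowest-priority condition up to the highest.
    for cond, value in zip(reversed(conds), reversed(values)):
        result = mux2(cond, result, value)
    return result


def mux_parallel(sels: list[int], pairs: list[tuple[int, int]]) -> list[int]:
    """Independent 2:1 MUXes, one per output lane."""
    return [mux2(s, a, b) for s, (a, b) in zip(sels, pairs)]


print(mux_tree(1, 1, [10, 20, 30, 40]))          # 40
print(mux_cascade([0, 1], [7, 8], default=9))    # 8
print(mux_parallel([0, 1], [(1, 2), (3, 4)]))    # [1, 4]
```

In this reading, a branch is handled by computing both sides in the FUs and letting a 1-bit control signal pick the result near the output.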

HRL Array: Routing
[Figure: array of CLB, FU, and OMB blocks with 1-bit control routing tracks and 16-bit data routing tracks]

HRL Array: Routing
• 1-bit control net: few tracks
• 16-bit data net: bus-based
• No switches between the two networks
• Fully static for low power
• Separate data and control to reduce area
[Figure: connection box and switch box detail, showing block output to routing track connections, routing switch boxes, and block input MUX connections for FU, CLB, and MXB blocks]

HRL Array: IO
[Figure: array of CLB, FU, and OMB blocks with 1-bit control routing tracks and 16-bit data routing tracks]

HRL Array: IO
Data IO
• Connects to data net
• Simple 16-bit chunk alignment
Control IO
• Connects to control tracks
• Same as FPGA
Low cost and sufficiently flexible even for irregular data
[Figure: 48-bit fixed-point, 32-bit int, and 16-bit short values mapped onto 16-bit data tracks through Data IO; Ctrl IO connects to 1-bit control tracks]
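
The idea of 16-bit chunk alignment can be illustrated by splitting 16-, 32-, and 48-bit values into 16-bit pieces, one per data track, and reassembling them. This is a sketch under assumptions: the function names and the least-significant-chunk-first order are hypothetical, and the slides do not specify how chunks are ordered on the tracks.

```python
# Sketch: splitting values into 16-bit chunks for 16-bit data tracks and back.
# Function names and the least-significant-first chunk order are assumptions.
CHUNK_BITS = 16
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def to_chunks(value: int, width_bits: int) -> list[int]:
    """Split a width_bits-wide value into 16-bit chunks, least significant first."""
    if width_bits <= 0 or width_bits % CHUNK_BITS != 0:
        raise ValueError("width must be a positive multiple of 16 bits")
    value &= (1 << width_bits) - 1   # keep the two's-complement bit pattern
    return [(value >> shift) & CHUNK_MASK for shift in range(0, width_bits, CHUNK_BITS)]


def from_chunks(chunks: list[int], signed: bool = True) -> int:
    """Reassemble chunks into one integer, sign-extending if requested."""
    if not chunks:
        raise ValueError("need at least one chunk")
    width_bits = len(chunks) * CHUNK_BITS
    value = 0
    for i, chunk in enumerate(chunks):
        value |= (chunk & CHUNK_MASK) << (i * CHUNK_BITS)
    if signed and value >> (width_bits - 1):
        value -= 1 << width_bits
    return value


print([hex(c) for c in to_chunks(0x12345678, 32)])   # ['0x5678', '0x1234']
print(to_chunks(-1, 48))                             # [65535, 65535, 65535]
print(from_chunks(to_chunks(-42, 48)))               # -42
```

Because every supported width is a multiple of 16 bits, the IO only needs to steer whole chunks onto tracks, with no bit-level shifting.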

Methodology
▪ Workloads
  o 3 analytics frameworks: MapReduce, graph, DNN
  o 9 representative applications, 11 kernel circuits (KCs)
▪ Technology
  o 45 nm area and power model
  o CGRA: DySER as in NDA [HPCA’11, HPCA’15]
  o FPGA: Xilinx Virtex-6
▪ Tools
  o Synthesize, place & route by Yosys + VTR
  o System simulation by zsim

Array Area
Coarse-grained FUs improve area efficiency
[Figure: single array area (mm²) for FPGA, CGRA, and HRL, stacked by logic, routing, data routing, and ctrl routing]
▪ Same logic capacity for each type of array

Array Power
HRL: static but flexible routing
CGRA: high power on circuit-switched network
FPGA: half the frequency
[Figure: power (mW) of the FPGA, CGRA, and HRL arrays for KC1 to KC11 and the average]

Vault Power Efficiency
HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt
[Figure: normalized perf/Watt of FPGA, CGRA, and HRL for KC1 to KC11 and the average]

Overall Performance
▪ ASIC represents the upper bound of efficiency
▪ Cores, FPGA, CGRA only match 30% to 80% of ASIC
▪ HRL has 92% of ASIC performance on average
[Figure: normalized performance of Cores, FPGA, CGRA, HRL, and ASIC for GroupBy, Hist, LinReg, PageRank, SSSP, CC, ConvNet, MLP, and dA; the applications are grouped into "memory bandwidth not saturated" and "memory bandwidth saturated"]

Conclusions
▪ NDP logic requirements: area + power efficiency, flexibility
▪ Heterogeneous reconfigurable logic (HRL)
  o Fine-grained + coarse-grained logic blocks
  o Static and separate data and control networks
  o Special blocks for branching and layout management
  o Vault logic handles communication and control
▪ HRL for in-memory analytics
  o 2.2x performance/Watt over FPGA and 1.7x over CGRA
  o 92% of ASIC performance on average

Thanks! Questions?

