ASIC accelerators



  1. ASIC accelerators
     Homework 3: Part 2 (serial codes) is out; Part 1 is due tomorrow, 11:59 PM. Questions?
     A note on quoting papers: some paper reviews have been copying phrases from papers. You must make it obvious you are doing so; this will get you in tons of trouble later if you don't. I didn't look closely enough at paper reviews earlier in the semester. A consistent style is easier to read even if your grammar is poor, and you are usually better off rewriting completely; have good habits.
     To read more. This day's papers: Shao et al., "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures"; Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network"; Reagen et al., "Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators". Supplementary reading: Shao et al., "The Aladdin Approach to Accelerator Design and Modeling" (Computer magazine version).

  2. Accelerator motivation: the end of transistor scaling; specialization as a way to further improve performance, especially performance per watt. Key challenge: how do we design/test custom chips quickly?
     Behavioral high-level synthesis: take C-like code, produce HW. Problem (according to the Aladdin paper): requires lots of tuning to make memory accesses/etc. efficient and to handle/eliminate dependencies.
     Data flow graphs: int sum_ab = a + b; int sum_cd = c + d; int result = sum_ab + sum_cd; each operation becomes a node whose inputs are the values it depends on.
     DFG scheduling: with two add functional units, sum_ab and sum_cd can be computed in the same cycle and result in the next; with one add functional unit, the three additions are serialized. (The same example is written out as a runnable C program below.)
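     The slide's DFG example, written as a complete C program. The variable names come from the slide; main() and the printed result are additions just so the snippet compiles and runs.

     /* Data-flow-graph example: sum_ab and sum_cd do not depend on each
      * other, so a schedule with two add functional units can issue them in
      * the same cycle and finish in 2 cycles; with one adder the three
      * additions are serialized over 3 cycles. */
     #include <stdio.h>

     int main(void) {
         int a = 1, b = 2, c = 3, d = 4;

         int sum_ab = a + b;             /* independent add */
         int sum_cd = c + d;             /* independent add */
         int result = sum_ab + sum_cd;   /* depends on both sums */

         printf("result = %d\n", result);
         return 0;
     }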

  3. DFG realization (data path): the DFG maps onto ADD functional units with MUXes on their inputs, registers for sum_ab, sum_cd, and result, and control logic (selectors for the MUXes, write enables for the registers).
     Full synthesis requires tuning: you need to figure out memory/register connections and actually need to make working control logic.
     Aladdin trick: instead of full synthesis, assume someone will figure out the scheduling HW and work from a Dynamic Data Dependency Graph (DDG), i.e. use dynamic (runtime) dependencies. (A sketch of the idea follows.)
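     A minimal sketch of the dynamic-DDG idea in C: record, for every executed operation, which earlier operations produced its inputs, then schedule that trace with a limited number of adders. The data structures and the hard-coded trace are made up for illustration; this is not Aladdin's actual implementation.

     /* Dynamic data dependency graph (DDG) sketch: a trace of executed
      * operations with their runtime producers, scheduled ASAP under a
      * functional-unit budget.  Try ADDERS = 1 vs 2 to see the schedule
      * change for the sum_ab / sum_cd / result example. */
     #include <stdio.h>

     #define NOPS   3
     #define ADDERS 1

     typedef struct {
         int dep[2];   /* indices of producer ops, -1 = live-in value */
     } Op;

     int main(void) {
         /* sum_ab = a + b; sum_cd = c + d; result = sum_ab + sum_cd; */
         Op trace[NOPS] = {
             { { -1, -1 } },   /* op 0: sum_ab */
             { { -1, -1 } },   /* op 1: sum_cd */
             { {  0,  1 } },   /* op 2: result */
         };

         int cycle_of[NOPS];
         int used[64] = {0};  /* adders busy per cycle (assume < 64 cycles) */

         for (int i = 0; i < NOPS; i++) {
             /* earliest cycle after all producers have finished */
             int earliest = 0;
             for (int j = 0; j < 2; j++)
                 if (trace[i].dep[j] >= 0 && cycle_of[trace[i].dep[j]] + 1 > earliest)
                     earliest = cycle_of[trace[i].dep[j]] + 1;
             /* then delay until an adder is free (resource constraint) */
             while (used[earliest] >= ADDERS)
                 earliest++;
             used[earliest]++;
             cycle_of[i] = earliest;
             printf("op %d scheduled in cycle %d\n", i, earliest);
         }
         return 0;
     }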

  4. Aladdin area/power modeling: a library of functional units; functional unit power/area + memory power/area; tested via microbenchmarks.
     Missing area/power modeling: wire lengths, etc., etc.; control logic accounting; the memory model.
     Pareto-optimum: can't make anything better without making something worse.
     Tuning: select latency, number of ports (read/write units); watch for false dependencies: "the reason is that when striding over a partitioned array being read from and written to in the same cycle, though accessing different elements of the array, the HLS compiler conservatively adds loop-carried dependences." (A C illustration of this pattern follows.)
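     An illustration of the quoted false-dependency problem: the loop below reads and writes different elements of the same array each iteration, so iterations are actually independent, but an HLS compiler that cannot prove the indices never collide may conservatively add a loop-carried dependence and serialize the loop. The function name, array size, and stride are invented for the example.

     /* Each iteration reads buf[i + N/2] and writes buf[i]: the read and
      * write always touch different halves of the array, but a conservative
      * HLS tool may still assume a loop-carried dependence. */
     #include <stdio.h>

     #define N 8

     void shift_halves(int buf[N]) {
         for (int i = 0; i < N / 2; i++)
             buf[i] = buf[i + N / 2] + 1;
     }

     int main(void) {
         int buf[N] = {0, 1, 2, 3, 4, 5, 6, 7};
         shift_halves(buf);
         for (int i = 0; i < N; i++)
             printf("%d ", buf[i]);
         printf("\n");
         return 0;
     }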

  5. Design space example (GEMM): [design-space figure].
     Neural networks: inputs I1..I4 feed intermediate values (the a_i's, b_i's, c_1) which feed the output. Real world: out_real = F(I1, I2, I3, I4); we compute a differentiable approximation out_pred ≈ F̂(I1, I2, I3, I4) using the intermediate values a_i's and b_i's.
     Neuron: a1 = K(w_{a1,1} I1 + w_{a1,2} I2 + ··· + w_{a1,4} I4); similarly b1 = K(w_{b1,1} a1 + w_{b1,2} a2 + w_{b1,3} a3).
     K(x) is the activation function, e.g. K(x) = 1 / (1 + e^{-x}): close to 0 as x approaches -∞, close to 1 as x approaches +∞.
     The w's are weights, selected by training. (A small C version of one neuron follows.)
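     The neuron equation written out in C, using the sigmoid activation mentioned on the slide. The specific weight and input values are made up; in practice the weights come from training.

     /* One neuron: a1 = K(w_{a1,1}*I1 + ... + w_{a1,4}*I4), with the
      * sigmoid activation K(x) = 1 / (1 + e^-x). */
     #include <math.h>
     #include <stdio.h>

     static double K(double x) {          /* activation function */
         return 1.0 / (1.0 + exp(-x));
     }

     int main(void) {
         double I[4]    = {0.5, -1.0, 2.0, 0.0};   /* inputs I1..I4 (example values) */
         double w_a1[4] = {0.1, 0.4, -0.3, 0.8};   /* weights into neuron a1 (example values) */

         double sum = 0.0;
         for (int i = 0; i < 4; i++)               /* weighted sum of the inputs */
             sum += w_a1[i] * I[i];
         double a1 = K(sum);

         printf("a1 = %f\n", a1);
         return 0;
     }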

  6. Minerva's problem: evaluating neural networks; goal: a low-power, low-cost (≈ area) ASIC. Example: a handwriting recognizer; train the model once, deploy in portable devices.
     High-level design: [figure].
     Tradeoffs: design of the neural network (mathematical); precision of calculations (mathematical); size of memory and number of calculations (hardware); amount of inter-neuron parallelism, approx. cores (hardware); amount of intra-neuron parallelism, i.e. pipeline depth (hardware).
     Neural network parameters: size of memory, number of calculations.

  7. "Intrinsic inaccuracy" assumption: we don't care if the precision variation is similar to the training variation. Sensible?
     HW tradeoffs (1) and (2): [figures].

  8. Parameters varied: functional unit placement (in the pipeline); number of lanes.
     HW pipeline: [figure].
     Decreasing precision (1) and (2): figures from another neural network ASIC accelerator paper. (A generic fixed-point quantization sketch follows.)
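     To make "decreasing precision" concrete, here is a generic round-to-nearest fixed-point quantizer in C. It is an illustrative sketch, not necessarily the quantization scheme used in the paper those figures came from.

     /* Quantize x to a signed fixed-point value with frac_bits fractional
      * bits and total_bits total bits, then convert back to double so the
      * rounding error can be printed. */
     #include <stdio.h>

     static double quantize(double x, int total_bits, int frac_bits) {
         double scale = (double)(1 << frac_bits);
         long   max_q = (1L << (total_bits - 1)) - 1;
         long   min_q = -(1L << (total_bits - 1));

         long q = (long)(x * scale + (x >= 0 ? 0.5 : -0.5));  /* round to nearest */
         if (q > max_q) q = max_q;                            /* saturate */
         if (q < min_q) q = min_q;
         return (double)q / scale;
     }

     int main(void) {
         double w = 0.30103;   /* example weight value */
         for (int bits = 16; bits >= 4; bits -= 4) {
             double wq = quantize(w, bits, bits - 2);
             printf("%2d-bit weight: %f (error %g)\n", bits, wq, w - wq);
         }
         return 0;
     }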

  9. SRAM danger zone: [figure].
     Traditional reliability techniques: redundancy (error correcting codes); don't run at low voltage/etc.
     Algorithmic fault handling: calculations are approximate anyways; there is already "noise" from imprecise training data, rounding, etc.; physical faults can just be more noise.
     Pruning (short-circuit calculations close to zero): dynamically, compute 0 if the input is near zero, without checking the weights; statically, remove neurons whose weights are almost all zero. (A small C sketch of both follows.)
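      A small C sketch of the two pruning styles named above; the threshold and the tiny layer are made up for illustration. Dynamic skipping checks only the input activation, and the all-zero row in the weight matrix stands in for a neuron that static pruning would remove before deployment.

      /* Dynamic zero-skipping: near-zero inputs contribute nothing, and
       * their weight reads and multiplies are skipped entirely. */
      #include <math.h>
      #include <stdio.h>

      #define NIN  4
      #define NOUT 3
      #define ZERO_THRESH 0.01

      static void layer(const double in[NIN], const double w[NOUT][NIN],
                        double out[NOUT]) {
          for (int o = 0; o < NOUT; o++) {
              out[o] = 0.0;
              for (int i = 0; i < NIN; i++) {
                  if (fabs(in[i]) < ZERO_THRESH)
                      continue;              /* skip: treat this input as exactly 0 */
                  out[o] += w[o][i] * in[i];
              }
          }
      }

      int main(void) {
          double in[NIN] = {0.9, 0.002, -0.5, 0.0};
          double w[NOUT][NIN] = {
              {0.3, 0.1, -0.2, 0.4},
              {0.0, 0.0,  0.0, 0.0},   /* a neuron like this would be removed statically */
              {0.5, -0.1, 0.2, 0.1},
          };
          double out[NOUT];

          layer(in, w, out);
          for (int o = 0; o < NOUT; o++)
              printf("out[%d] = %f\n", o, out[o]);
          return 0;
      }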

  10. Round-down on faults.
      Note: other papers on this topic. EIE (same conference): solved the general sparse matrix-vector multiply problem; omitted zero weights in a more compact way; noted lots of tricky branching on GPUs/CPUs. (A generic sparse matrix-vector multiply sketch follows.)
      Design exploration, a huge number of variations: size of models, width of computations/storage, amount of parallel computations; looking for the best power per accuracy.
      Next time: Warehouse-Scale Computers, AKA datacenters, the most common modern supercomputer; design tradeoffs in these huge systems. Reading on the schedule: Barroso et al., The Datacenter as a Computer, chapters 1, 3, and 6. No paper review.
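      A generic compressed (CSR-style) sparse matrix-vector multiply in C, to show why omitting zero weights saves both storage and work. This is not EIE's actual encoding, which also compresses indices and shares weight values; the tiny matrix is made up for illustration.

      /* CSR sparse matrix-vector multiply: only nonzero weights are stored,
       * fetched, and multiplied. */
      #include <stdio.h>

      #define ROWS 3
      #define COLS 4

      int main(void) {
          /* dense matrix {{0,2,0,0},{1,0,0,3},{0,0,0,0}} in CSR form */
          double vals[]    = {2.0, 1.0, 3.0};  /* nonzero weights only */
          int    cols[]    = {1, 0, 3};        /* column index of each nonzero */
          int    row_ptr[] = {0, 1, 3, 3};     /* row r's nonzeros: [row_ptr[r], row_ptr[r+1]) */

          double x[COLS] = {1.0, 2.0, 3.0, 4.0};
          double y[ROWS];

          for (int r = 0; r < ROWS; r++) {
              y[r] = 0.0;
              for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
                  y[r] += vals[k] * x[cols[k]];
              printf("y[%d] = %f\n", r, y[r]);
          }
          return 0;
      }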

  11. Next week: security. General areas of HW security: protect programs from each other (page tables, kernel mode, etc.); protect programs from adversaries (bounds checking, etc.); protect programs from people manipulating the hardware. Next week's paper: the last category.
