
Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters (ARITH 2016, Silicon Valley, July 10-13, 2016)


  1. Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters
     ARITH 2016, Silicon Valley, July 10-13, 2016
     Michael Schaffner (1), Michael Gautschi (1), Frank K. Gürkaynak (1), Prof. Luca Benini (1,2)
     (1) Integrated Systems Laboratory, (2) Università di Bologna

  2. Advanced Processing in IoT
     [Diagram labels: Sense; Analyze and Classify; Transmit; Low Power Processing System]
     • Complex preprocessing close to the sensor, e.g., feature extraction, regression, classification, compression, sensor fusion

  3. Arithmetic with High Dynamic Range (HDR) Desirable
     [Diagram labels: Low Power Processing System; Fixed-Point; power figures: 100 µW - 2 mW; 1 - 10 mW; Idle: ~1 µW; Active: ~50 mW]
     • Fixed-point: labor intensive, error-prone, quality losses

  4. Arithmetic with High Dynamic Range (HDR) Desirable
     [Diagram labels: Low Power Processing System; HDR Arithmetic; power figures: 100 µW - 2 mW; 1 - 10 mW; Idle: ~1 µW; Active: ~50 mW]
     • Fixed-point: labor intensive, error-prone, quality losses
     • Energy-efficient, low-cost HDR arithmetic desirable

  5. Logarithmic Number System (LNS)
     [Figure: word layouts with bit fields 1 | 8 | 23. FP: integer exponent, integer mantissa. LNS: fixed-point exponent.]
     • Efficient MUL, DIV, SQRT
       – MUL: $c = \log_2(2^a \cdot 2^b) = \log_2(2^{a+b}) = a + b$
       – DIV: $c = \log_2(2^a / 2^b) = \log_2(2^{a-b}) = a - b$
       – SQRT: $c = \log_2(\sqrt{2^a}) = \log_2(2^{0.5a}) = 0.5a$, i.e., a >> 1
       → Simple integer operations!
     • Nonlinear ADD, SUB, I2F, F2I
       – These need a function interpolator → large LNS unit (LNU)
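The integer-only MUL, DIV and SQRT on slide 5 can be sketched in a few lines. This is a minimal illustration, not the LNU hardware: it assumes positive values only (the sign bit is ignored), an 8.23 fixed-point exponent stored as a plain Python integer, and the helper names `lns_encode`, `lns_decode`, `lns_mul`, `lns_div` and `lns_sqrt` are hypothetical.

```python
import math

FRAC_BITS = 23            # fractional bits of the fixed-point exponent (8.23 format)
SCALE = 1 << FRAC_BITS    # one unit in the exponent equals SCALE integer steps


def lns_encode(x: float) -> int:
    """Map a positive real x to its fixed-point log2 exponent (sign bit ignored)."""
    return round(math.log2(x) * SCALE)


def lns_decode(e: int) -> float:
    """Map a fixed-point exponent back to the real value 2^(e / SCALE)."""
    return 2.0 ** (e / SCALE)


def lns_mul(a: int, b: int) -> int:
    """Multiplication is an integer addition of exponents."""
    return a + b


def lns_div(a: int, b: int) -> int:
    """Division is an integer subtraction of exponents."""
    return a - b


def lns_sqrt(a: int) -> int:
    """Square root halves the exponent: an arithmetic right shift by one."""
    return a >> 1


if __name__ == "__main__":
    three, five = lns_encode(3.0), lns_encode(5.0)
    print(lns_decode(lns_mul(three, five)))      # close to 15
    print(lns_decode(lns_div(three, five)))      # close to 0.6
    print(lns_decode(lns_sqrt(lns_encode(2.0)))) # close to 1.41421
```

Overflow handling, zero, and the sign bit are left out on purpose; a real 32-bit format needs all three.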

  6. Precision & Approximation
     • Bilateral filter example: LNS 8.23 (0.5 ulp, "precise") vs. LNS 8.17 (16 ulp, "approximate")
     • Error-tolerant applications
     • Full precision not always required → additional tuning knob

  7. Contributions
     • Generator framework for automatic generation of "precise" (0.5 ulp) and "approximate" (> 0.5 ulp) LNU instances.
     • Design space exploration of precise / approximate LNUs.
     • 33%-71% smaller LNU (precise) with more functionality than previous designs [8,9,27].
     • Case study: accuracy/performance tradeoffs of a shared LNU in a 65 nm CMOS multicore cluster.
     [8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
     [9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
     [27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

  8. Problematic LNS Additions/Subtractions
     • $C = A \pm B$ with $A = 2^a$, $B = 2^b$, $C = 2^c$
     • Easy case (ADD): $c = \log_2(2^a + 2^b) = \max(a,b) + f_+(|a-b|)$
     • Hard case (SUB): $c = \log_2(2^a - 2^b) = \max(a,b) + f_-(|a-b|)$
     [Figure annotation: critical region]
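The ADD/SUB formulas on slide 8 can be checked with a small floating-point reference model. The slide does not spell out f+ and f−; the sketch below assumes the standard definitions that follow from factoring out 2^max(a,b), namely f+(r) = log2(1 + 2^−r) and f−(r) = log2(1 − 2^−r) with r = |a − b|. It is a software reference only, not the interpolator-based hardware datapath, and the function names are hypothetical.

```python
import math

LN2 = math.log(2.0)


def f_plus(r: float) -> float:
    """f+(r) = log2(1 + 2^-r): smooth and bounded for r >= 0 (easy case)."""
    return math.log1p(2.0 ** -r) / LN2


def f_minus(r: float) -> float:
    """f-(r) = log2(1 - 2^-r): diverges to -inf as r -> 0 (critical region)."""
    if r <= 0.0:
        raise ValueError("f_minus is undefined for r <= 0 (exact cancellation)")
    # 1 - 2^-r is computed as -expm1(-r * ln 2) to stay accurate for small r.
    return math.log(-math.expm1(-r * LN2)) / LN2


def lns_add(a: float, b: float) -> float:
    """Exponent of 2^a + 2^b."""
    return max(a, b) + f_plus(abs(a - b))


def lns_sub(a: float, b: float) -> float:
    """Exponent of |2^a - 2^b|; the sign of the result is handled separately."""
    return max(a, b) + f_minus(abs(a - b))


if __name__ == "__main__":
    print(lns_add(math.log2(3.0), math.log2(5.0)))  # close to 3, since 3 + 5 = 8
    print(lns_sub(math.log2(8.0), math.log2(6.0)))  # close to 1, since 8 - 6 = 2
    for r in (1.0, 1e-2, 1e-4, 1e-6):
        print(r, f_minus(r))  # grows without bound in magnitude as r shrinks
```

The last loop shows why subtraction is the hard case: near r = 0 the function has a singularity, so a table or low-order polynomial cannot cover that region directly.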

  9. Critical Region Decomposition
     • Analytic transformation of $f_-$ into subfunctions
     • Literature (ASIC complexity for 8.23 bit, 0.5 ulp, from synthesis, where given):
       – Coleman (1995) [5]
       – Arnold (1998) [4]
       – Vouzis (2007) [7]
       – Coleman (2008) [8]: 94 kGE
       – Ismail (2011) [9]: 63 kGE
       – Gautschi, Popoff (2016) [27,11]: 40 kGE
       – This work, using Paliouras (1996) [3]: 27 kGE

  10. Critical Region Decomposition
      With $r = |a - b|$:
      $c = \max(a,b) + f_-(r)$
      $c = \max(a,b) + \underbrace{\log_2\!\left(\frac{1 - 2^{-r}}{r}\right)}_{\mathrm{cotrans}(r)} + \log_2(r)$
      [Figure annotation: critical region]
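A short numerical check makes the decomposition on slide 10 concrete. It assumes f−(r) = log2(1 − 2^−r) and cotrans(r) = log2((1 − 2^−r) / r), which is the form consistent with the expression c = max(a,b) + cotrans(r) + log2(r) used on the architecture slides. The function names are hypothetical.

```python
import math

LN2 = math.log(2.0)


def f_minus(r: float) -> float:
    """f-(r) = log2(1 - 2^-r), singular at r = 0."""
    return math.log(-math.expm1(-r * LN2)) / LN2


def cotrans(r: float) -> float:
    """log2((1 - 2^-r) / r): stays bounded and tends to log2(ln 2) as r -> 0."""
    return math.log(-math.expm1(-r * LN2) / r) / LN2


def f_minus_decomposed(r: float) -> float:
    """Smooth part plus log2(r); the singularity is carried by log2(r) alone."""
    return cotrans(r) + math.log2(r)


if __name__ == "__main__":
    for r in (1e-6, 1e-3, 0.1, 1.0):
        print(r, f_minus(r), f_minus_decomposed(r), cotrans(r))
```

The point of the split is that cotrans(r) is well behaved near zero and suits a small interpolator, while log2(r) can reuse a logarithm block.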

  11. Function Approximation
      Different methods (e.g., for 8.23 LNS):
      1) LUT only (very large!)
      2) High-order polynomial
         • Often high order required
         • Large interpolator delay
      3) LUT + piecewise polynomial
         • Tradeoff: precomputation vs. interpolation
         • Half precision to single precision: 1st to 2nd order
      [Figures: f(r) over r with 1st-, 2nd- and 3rd-order fits and the interpolation error; f(r) over r split into segments of width d]
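Method 3 on slide 11 (LUT + piecewise polynomial) can be illustrated with a small experiment. The sketch below is an assumption-laden stand-in: it stores equally spaced sample values per segment and interpolates through them, whereas the generator framework fits piecewise minimax polynomials. The target function f+(r) = log2(1 + 2^−r), the range [0, 16), the segment count and all function names are choices made for this example only.

```python
import math


def f_plus(r: float) -> float:
    """Example target function: log2(1 + 2^-r)."""
    return math.log2(1.0 + 2.0 ** -r)


def build_lut(f, r_max: float, n_seg: int, order: int):
    """Precomputation: store order+1 equally spaced samples for each segment."""
    d = r_max / n_seg  # segment width
    lut = [[f(i * d + k * d / order) for k in range(order + 1)]
           for i in range(n_seg)]
    return d, lut


def evaluate(r: float, d: float, lut, order: int) -> float:
    """Interpolation: select a segment, then evaluate a 1st- or 2nd-order polynomial."""
    i = min(int(r / d), len(lut) - 1)
    t = (r - i * d) / d * order  # local coordinate, nodes sit at t = 0..order
    y = lut[i]
    if order == 1:
        return y[0] + (y[1] - y[0]) * t
    # Order 2: Newton forward-difference form through the three stored nodes.
    d1 = y[1] - y[0]
    d2 = y[2] - 2.0 * y[1] + y[0]
    return y[0] + d1 * t + d2 * t * (t - 1.0) / 2.0


def max_error(f, r_max: float, n_seg: int, order: int, n_test: int = 20000) -> float:
    """Largest absolute error on a dense test grid over [0, r_max)."""
    d, lut = build_lut(f, r_max, n_seg, order)
    return max(abs(evaluate(r_max * k / n_test, d, lut, order) - f(r_max * k / n_test))
               for k in range(n_test))


if __name__ == "__main__":
    for order in (1, 2):
        for n_seg in (16, 64, 256):
            print(order, n_seg, max_error(f_plus, 16.0, n_seg, order))
```

Running it shows the tradeoff named on the slide: for a given error target, a higher order needs fewer segments (smaller tables) but more arithmetic per evaluation.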

  12. LNU Generator Framework
      • Specs: bitwidth, accuracy, order
      • Iterative fitting heuristic (similar to [30])
      • Piecewise minimax polynomials (using Sollya [29])
      [29] Chevillard et al., "Sollya: An Environment for the Development of Numerical Codes", ICMS 2010
      [30] De Dinechin et al., "Automatic Generation of Polynomial-Based Hardware Architectures for Function Evaluation", ASAP 2010

  13. Architecture Template
      [Block diagram: Preprocessing Block; Main Interpolator Block; Log/Exp Block; Postprocessing Block]

  14. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
      [Block diagram, build step: Main Interpolator Block; Log/Exp Block; Postprocessing Block]

  15. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
      [Block diagram, build step: LUTs; Nth-order interpolator; Log/Exp Block; Postprocessing Block]

  16. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
      [Block diagram, build step: Postprocessing Block]

  17. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$

  18. "Precise" 32bit LNU: Features & Comparison

      Metric                 ELM [8]    ROM-less [9]   ISSCC'16 [27]                  This Work
      Functionality          ADD, SUB   ADD, SUB       F2I, I2F, EXP, LOG, ADD, SUB   F2I, I2F, EXP, LOG, ADD, SUB
      Max error [ulp]        0.454      0.498          0.479                          0.45
      LUT size [Kbit]        256.4      183.3          113.1                          64.2
      Technology             180 nm     180 nm         65 nm                          65 nm
      Area [µm²]             904'943    589'357        57'264                         38'592
      Post-synthesis [kGE]   97         63             40                             26.8
      Min delay [ns]         11.74      7.10           6                              4.5
      Max delay [ns]         13.15      14.79          6                              4.5

      [8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
      [9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
      [27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

  19. Design Space: Precision vs. Area
      [Figure: LNU area vs. precision (ulp in the LNS domain) at 4.5 ns delay in umc65, post-synthesis. Annotations: -40%; tipping point 1st → 2nd order]

  20. Case Study: HW Platform
      • Parallel Ultra-Low-Power (PULP) Platform [31], www.pulp-platform.org
        – 4x 32b OpenRISC cores (in-order)
        – 16 kByte shared L1 (TCDM), 16 kByte L2 memory
      • Configurations:
        – 1 shared LNU (Precise, Approx1, Approx2)
          • 4, 3 or 2 pipeline registers
          • Fair round-robin arbiter
        – 4 private FPUs (reference)
          • Directly integrated into cores
          • 2 pipeline registers
      [Diagrams: PE0 to PE3 around one shared LNU; PE0 to PE3 each with its own FPU]
      [31] M. Gautschi et al., "Tailoring Instruction-Set Extensions for an Ultra-Low Power Tightly-Coupled Cluster of OpenRISC Cores", in VLSI-SoC, 2015
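The fair round-robin arbitration in front of the shared LNU can be modeled behaviorally as below. This is a hypothetical cycle-level sketch for illustration, not the PULP RTL: it assumes one grant per cycle, a request vector that is re-presented every cycle, and it ignores pipeline depth and stalls. The names `round_robin_grant` and `simulate` are invented for this example.

```python
from typing import List, Optional


def round_robin_grant(requests: List[bool], last: int) -> Optional[int]:
    """Grant the first requesting core after `last`, wrapping around.

    Starting the search one past the previous winner gives every core
    a turn before any core is served twice.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last + offset) % n
        if requests[idx]:
            return idx
    return None  # no core requested the LNU in this cycle


def simulate(request_trace: List[List[bool]], n_cores: int = 4) -> List[Optional[int]]:
    """Return the granted core index (or None) for each cycle of the trace."""
    last = n_cores - 1  # so that core 0 has priority in the first cycle
    grants: List[Optional[int]] = []
    for requests in request_trace:
        granted = round_robin_grant(requests, last)
        grants.append(granted)
        if granted is not None:
            last = granted
    return grants


if __name__ == "__main__":
    # All four cores request in every cycle: grants rotate through the cores.
    print(simulate([[True] * 4] * 8))
    # Only cores 1 and 3 request: they alternate.
    print(simulate([[False, True, False, True]] * 4))
```

With all four cores contending, each core gets the unit every fourth cycle, which is the worst case for a shared LNU; contention is lower when only a fraction of the instructions need the unit.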

  21. Chip Complexities

      Name                     FPU       Precise   Approx1   Approx2
      Format                   IEEE754   LNS       LNS       LNS
      Bitwidth                 8.23      8.23      8.20      8.17
      Precision                0.5 ulp   0.5 ulp   4 ulp*    16 ulp*
      Order                    -         2         2         1
      Pipeline stages          2         4         3         2
      FPU/LNU [kGE]            4x11      36        27        23
      Total complexity [kGE]   720       718       708       704

      * In the LNS domain

  22. Kernel Level Results
      [Figure: kernel-level results, umc65, post-layout]
      • Pipeline depth is the relevant factor!
      • Energy efficiency gains mainly due to corresponding speedup!

  23. Conclusions
      • Generator framework for precise and approximate LNUs
      • Very compact 8.23 bit LNU (33%-71% smaller)
      • Shared setting attractive for LNU
      • Up to 4.2x more energy efficient than private FPU baseline
      • Approximation:
        – Additional gains in area, speedup and energy efficiency
        – Energy-efficiency gains mainly due to lower latency and speedup
        – Less time is needed to complete a task → lower system energy consumption

  24. Outlook
      • Vectorization and trigonometric extensions
      • Optimization opportunities for many algorithms to leverage LNS and approximation
      PULP Platform: Looking for Collaborators!
      • OpenRISC / RISC-V ISA
      • Open source, silicon proven
      • Extending DSP capabilities…
      • www.pulp-platform.org, pulp@pulp.ethz.ch

  25. Q&A
      Acknowledgements: Nano Tera IcySoC project

  26. Backup Slides

  27. Outline
      • Motivation
      • Preliminaries: LNS Add/Sub and Interpolation
      • LNU Architecture and Generator Framework
      • Multicore Hardware Platform
      • Results
      • Conclusion
      • Q&A

  28. Private FPUs
      [Diagram: Core 0 to Core 3 handle INT operations; each core has a private FPU for HDR-ADD/SUB/MUL. Annotation: 50%]

  29. Private LNUs
      [Diagram: Core 0 to Core 3 handle INT operations; the private FPUs are replaced by private LNUs for HDR MUL/DIV/SQRT and ADD/SUB]
      • Area: 1 LNU < 4 × standard IEEE-compliant FPU (no DIV)
      • Poor LNU utilization: ~0.2

  30. Shared LNU
      [Diagram: Core 0 to Core 3 handle INT operations and HDR-MUL/DIV/SQRT; an interconnect with arbiter connects them to one shared LNU for HDR-ADD/SUB/I2F/F2I]

  31. Design Space Exploration
      • Bitwidth:
        – Half to single precision: 5.10 to 8.23
      • Accuracy:
        – Precise (0.5 ulp) and approximate (up to 16 ulp)
      • Order:
        – 1st/2nd order interpolation

  32. Design Space: Area vs. Delay
      [Figure: LNU area vs. delay with the Precise, Approx1 and Approx2 instances marked]
      * Required number of pipeline stages for 500 MHz target

  33. Kernels
      • Linear Algebra: AXPY, GEMM, GEMV, DotP
      • Matrix Factorizations: Chol, QR
      • Geometry: Homographies, Distances, Projection Errors
      • Image: Gradient Magnitude, Bilateral, FIR
      • Audio: Butterworth, Sine, DCT-II
      • Other: Radial Basis Functions
      [Figure annotations: 50%, 25%]
