Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs

Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs. UCHPC - UnConventional High performance Computing Workshop. Outline: Motivation, Architecture, Experiments, Dot-product performance model, Conclusions, Future work.


  1. Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs. UCHPC - UnConventional High performance Computing Workshop, Europar 2010. Manfred Mücke, Bernd Lesser, Wilfried N. Gansterer, {manfred.muecke | bernd.lesser | wilfried.gansterer}@univie.ac.at. Research Lab Computational Technologies and Applications, University of Vienna, http://rlcta.univie.ac.at. August 30th, 2010. Bernd Lesser, RLCTA.

  2. Outline: 1 Motivation, 2 Architecture, 3 Experiments, 4 Dot-product performance model, 5 Conclusions, 6 Future work.

  3. Motivation
     - Accelerating scientific applications, for instance accelerating linear solvers, accelerating matrix operations, ...
     - A central part of many scientific computing applications is the dot-product operation.
     - Our work deals with the performance analysis of custom-precision dot-product architectures on FPGAs.

  4. Why on FPGAs?
     - There are applications that do not require double-precision data types: keep the double-precision range (11-bit exponent) and reduce the mantissa (mantissa bit width ≤ 52).
     - On CPUs / GPUs: a speedup can only be achieved if the mantissa bit width is 23 bits (single precision) or 10 bits (half precision).
     - On FPGAs: FPGAs are the only hardware platform that can benefit from bit width reduction at a fine-grained level.
     - Lower precision translates directly into increased parallelism → throughput → speedup.
     - Larger FPGAs translate into increased parallelism → throughput → speedup.
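To make the "keep the exponent, reduce the mantissa" idea concrete, the following is a minimal software sketch that emulates a reduced mantissa on an ordinary double. The function name `truncate_mantissa` is hypothetical, and truncation (rather than rounding) is an assumption made for simplicity; the slides do not state which rounding mode the hardware operators use.

```python
import struct


def truncate_mantissa(x: float, p: int) -> float:
    """Keep sign, 11-bit exponent and the top p mantissa bits of a double.

    Assumption: the dropped bits are simply cleared (truncation toward zero).
    Special values such as NaN are not handled specially in this sketch.
    """
    if not 0 <= p <= 52:
        raise ValueError("p must be between 0 and 52")
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    drop = 52 - p  # number of low mantissa bits to clear
    mask = (0xFFFFFFFFFFFFFFFF >> drop) << drop
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y


if __name__ == "__main__":
    for p in (52, 23, 10, 4):
        print(p, truncate_mantissa(0.1, p))
```

Because the exponent field is untouched, very large and very small magnitudes survive the reduction; only the relative precision changes with `p`.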

  5. Dot-product
     - Our observation: the maximum size of a parallel floating-point dot-product on FPGAs scales superlinearly with decreasing mantissa bit width.
     - Question: how much more performance can we gain?
     - Goal: give a quantitative model for the performance improvement as a function of the mantissa bit width.

  6. Architecture
     - Canonical dot-product for real-valued input vectors $a$, $b$: $\langle a, b \rangle = a^T b = \sum_{i=1}^{n} a_i \cdot b_i$.
     - There are different possibilities to implement a dot-product in hardware. Our choice: a binary-tree based dot-product architecture.
     - Thus: $m$ parallel multipliers and $m - 1$ adders.
     - [Figure: $m$ multipliers (∗) feeding a binary tree of $m - 1$ adders (+) that produces the result.]
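The binary-tree structure can be illustrated in software. The sketch below is not the authors' hardware description; it is a small model, with the hypothetical name `tree_dot`, that multiplies all pairs and then sums neighbouring values level by level while counting the operators a hardware tree would need. Carrying an unpaired value to the next level when a level has odd length is an assumption of this sketch.

```python
def tree_dot(a, b):
    """Dot product via pairwise (binary-tree) reduction.

    Returns (result, multipliers, adders, tree_depth).
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    # One multiplier per operand pair.
    level = [x * y for x, y in zip(a, b)]
    multipliers = len(level)
    adders = 0
    depth = 0
    while len(level) > 1:
        # Add neighbouring pairs; each addition corresponds to one adder.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired value passes through
        level = nxt
        depth += 1
    return level[0], multipliers, adders, depth


if __name__ == "__main__":
    print(tree_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

For $m$ operand pairs the counters come out as $m$ multipliers and $m - 1$ adders, matching the slide.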

  7. Splitting (arbitrarily long) vectors:
     $\langle a, b \rangle = \sum_{i=1}^{n} a_i \cdot b_i = \sum_{j=0}^{\lfloor n/m \rfloor - 1} \sum_{i=1}^{m} a_{i + j \cdot m} \cdot b_{i + j \cdot m} + \sum_{i = \lfloor n/m \rfloor \cdot m + 1}^{n} a_i \cdot b_i$
     - We investigate (our focus): a custom dot-product operator accepting a maximum input vector length $m$, for different floating-point mantissa bit widths.
     - [Figure: blocks of $m$ elements of $a$ and $b$ feeding the multiplier (∗) and adder (+) tree, followed by a summation ($\Sigma$) stage; the $m$-input operator is marked as our focus.]
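The splitting formula can be checked with a short software sketch. The names `block_dot` and `split_dot` are hypothetical; `block_dot` stands in for the $m$-input hardware operator and is modelled here as a plain sum of products, and the indices are 0-based rather than 1-based as on the slide.

```python
def block_dot(a_block, b_block):
    """Stand-in for the m-input dot-product operator."""
    return sum(x * y for x, y in zip(a_block, b_block))


def split_dot(a, b, m):
    """Dot product of length-n vectors using blocks of at most m elements."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    if m < 1:
        raise ValueError("m must be at least 1")
    n = len(a)
    full_blocks = n // m  # floor(n / m)
    total = 0.0
    # Double sum of the formula: floor(n/m) full blocks of m products each.
    for j in range(full_blocks):
        total += block_dot(a[j * m:(j + 1) * m], b[j * m:(j + 1) * m])
    # Remainder term: the last n - floor(n/m) * m elements (may be empty).
    total += block_dot(a[full_blocks * m:], b[full_blocks * m:])
    return total


if __name__ == "__main__":
    v = list(range(1, 8))
    print(split_dot(v, v, 3))
```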

  8. Given an FPGA of a certain size, we want to know the peak performance as a function of the mantissa bit width used.
     - Dot-product architecture: peak performance depends on the number of parallel multipliers $m_{max}$ and the maximum frequency $f_{max}$.
     - Thus, we need an implementation for each mantissa bit width, and we need to measure its hardware resource usage.

  9. Experiments
     - Implementation issues:
       - We implemented a generic dot-product architecture for arbitrary vector lengths.
       - Standard IEEE 754 floating-point format.
       - Arbitrary-precision floating-point modules: the chosen library is FPLibrary (Arenaire project, at ENS Lyon), http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/; combinatorial operators are used.
     - Measurement issues:
       - Synthesis tool used: Quartus II (Altera).
       - Automated measurements using the Tcl scripting language: set generics, synthesize the implementation, record the hardware resource usage.

  10. Our implementation:
     - Accepts generic parameters: mantissa bit width, exponent bit width, $m$.
     - $m$ parallel multipliers (accepts $2m$ input operands).
     - Binary adder tree of depth $\lceil \log_2 m \rceil$.
     - Stages are pipelined (registers).
     - Total latency: $\lceil \log_2 m \rceil + 3$.
     - Peak performance: $P = (2m - 1) \cdot f_{max}$ [Flop/s].
     - [Figure: example with four multipliers (inputs $a_0 \dots a_3$, $b_0 \dots b_3$) feeding an adder tree that produces the result.]
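The latency and peak-performance formulas are simple enough to evaluate directly. The sketch below uses hypothetical function names, and the values of `m` and `f_max` in the usage lines are illustrative inputs, not measurements from the talk.

```python
def adder_tree_depth(m: int) -> int:
    """ceil(log2(m)) computed with integer arithmetic, for m >= 1."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return (m - 1).bit_length()


def total_latency(m: int) -> int:
    """Total latency as given on the slide: ceil(log2(m)) + 3."""
    return adder_tree_depth(m) + 3


def peak_performance(m: int, f_max_hz: float) -> float:
    """Peak performance in Flop/s: m multiplications plus m - 1 additions
    per clock cycle, i.e. (2m - 1) * f_max."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return (2 * m - 1) * f_max_hz


if __name__ == "__main__":
    m, f_max = 4, 100e6  # illustrative values only
    print(total_latency(m), peak_performance(m, f_max))
```

The factor $2m - 1$ follows from the architecture: once the pipeline is full, all $m$ multipliers and $m - 1$ adders complete one operation per cycle.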

  11. Methodology:
     - First: we perform measurements on the largest Cyclone II FPGA device (EP2C70).
     - Then: develop a model that best approximates these original measurements.
     - Finally: verify the model class with the measurements obtained from two more recent devices.

     Table: Hardware resources of used FPGAs.
     FPGA device   FPGA family   Logic elements   DSP blocks [9x9-bit blocks]   Emb. memory [kbits]
     EP2C70        Cyclone II    68,416           300                           1,125
     EP3C80        Cyclone III   81,264           488                           2,745
     EP3SL70       Stratix III   67,500           576                           2,214

  12. FPGA peak performance vs. mantissa bit width
     - [Figure: "Maximum Dot Product Size" for the EP2C70, showing maximum input operand pairs and maximum clock frequency $f_{max}$ [MHz] over mantissa bit width [Bits], 4 to 52.]
     - [Figure: "Dot Product Peak Performance" for the EP2C70, showing peak performance [GFlop/s] over mantissa bit width [Bits], 4 to 52.]
     - Measure the maximum dot-product size $m_{max}$ and the maximum frequency $f_{max}$.
     - Mantissa sizes: 52 down to 4.
     - Calculate peak performance $P = (2m - 1) \cdot f_{max}$ [Flop/s].
     - Observation: peak performance grows exponentially.

  13. Dot-product performance model
     - Model fit: fractional polynomial of the form $P(p) = c_1 + c_2 \cdot p^{c_3}$, with $c_1, c_2, c_3 \in \mathbb{Q}$, where $p$ is the mantissa bit width.
     - EP2C70: $P(p) = -7.37 + 32.16 \cdot p^{-0.35}$.
     - $P$ := measured value, calculated as $P = (2m - 1) \cdot f_{max}$ [Flop/s]; $\hat{P}$ := modelled value.
     - Relative error: $Error_{rel} = \frac{P - \hat{P}}{P} \cdot 100$ [%].
     - Absolute error: $Error_{abs} = P - \hat{P}$ [Flop/s].
     - [Figure: "Dot Product Peak Performance Model" for the EP2C70, with three panels over mantissa bit width [Bits], 4 to 52: peak performance [GFlop/s] (measured values and fit), relative error [%], and absolute error [GFlop/s].]
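As a sketch of how the model and the two error measures fit together, the code below evaluates the EP2C70 fit and defines the errors exactly as written on the slide. The function names are hypothetical, and the result is assumed to be in GFlop/s because that is the unit on the plot axis. No measured data is included; the error functions take the measured value as an argument.

```python
def model_peak_gflops(p, c1=-7.37, c2=32.16, c3=-0.35):
    """Modelled peak performance P(p) = c1 + c2 * p**c3.

    Defaults are the EP2C70 constants from the slide; p is the mantissa
    bit width (the fit covers 4 to 52 bits). Unit assumed: GFlop/s.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    return c1 + c2 * p ** c3


def relative_error_percent(measured, modelled):
    """Error_rel = (P - P_hat) / P * 100 [%]."""
    return (measured - modelled) / measured * 100.0


def absolute_error(measured, modelled):
    """Error_abs = P - P_hat, in the same unit as the inputs."""
    return measured - modelled


if __name__ == "__main__":
    for p in (4, 10, 23, 52):
        print(p, round(model_peak_gflops(p), 2))
```

Since $c_3$ is negative, the modelled performance falls as the mantissa widens, which matches the shape of the measured curve.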

  14. Verify observations on more recent FPGA devices (families):
     - EP3SL70: $P(p) = -19.68 + 60.90 \cdot p^{-0.26}$.
     - EP3C80: $P(p) = -10.31 + 43.29 \cdot p^{-0.33}$.
     - EP2C70: $P(p) = -7.37 + 32.16 \cdot p^{-0.35}$.
     - Given appropriate constants, peak performance as a function of mantissa bit width can be modelled quite accurately.
     - Maximum absolute error: 1 GFlop/s.
     - Average relative error: ≈ 5 to 7 %.
     - [Figure: "Dot Product Peak Performance Model" for the EP3SL70, EP3C80 and EP2C70, with three panels over mantissa bit width [Bits], 4 to 52: peak performance [GFlop/s] (measured values and fits), relative error [%], and absolute error [GFlop/s].]
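One way to read the three fits is as a modelled speedup relative to a full 52-bit mantissa. The sketch below is an illustration built only from the fitted constants on the slide; the names `FITS`, `model_peak_gflops` and `modelled_speedup` are hypothetical, and the ratios are model estimates, not measured speedups.

```python
# Fitted constants (c1, c2, c3) per device, as given on the slide.
FITS = {
    "EP2C70": (-7.37, 32.16, -0.35),
    "EP3C80": (-10.31, 43.29, -0.33),
    "EP3SL70": (-19.68, 60.90, -0.26),
}


def model_peak_gflops(device, p):
    """Modelled peak performance P(p) = c1 + c2 * p**c3 for one device."""
    if not 4 <= p <= 52:
        raise ValueError("the fits cover mantissa bit widths 4 to 52 only")
    c1, c2, c3 = FITS[device]
    return c1 + c2 * p ** c3


def modelled_speedup(device, p, reference_p=52):
    """Ratio of modelled peak performance at p bits to that at reference_p."""
    return model_peak_gflops(device, p) / model_peak_gflops(device, reference_p)


if __name__ == "__main__":
    for device in FITS:
        ratios = [round(modelled_speedup(device, p), 1) for p in (52, 23, 10, 4)]
        print(device, ratios)
```

The range check is deliberate: the fits were obtained for mantissa widths 4 to 52, so evaluating them outside that range would be extrapolation.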
