
DYNAMIC PRECISION NUMERICS USING A VARIABLE-PRECISION UNUM TYPE I HW COPROCESSOR



  1. DYNAMIC PRECISION NUMERICS USING A VARIABLE-PRECISION UNUM TYPE I HW COPROCESSOR ARITH’26 | BOCCO Andrea | 11 June 2019

  2. INTRODUCTION: STATE OF THE ART
     ➢ Variable Precision (VP) computing has been investigated to improve the convergence of algorithms. It has been investigated in:
        ▪ Software (SW): GMP [2] and MPFR [3]
           ▪ Slow; they might not meet the requirements of high-speed applications
        ▪ Hardware (HW):
           ▪ Kulisch [4]: large fixed-point accumulator
           ▪ Schulte and Swartzlander [5]: mantissas divided into multiple words
     ➢ None of the previous works shows how to store VP Floating Point (FP) numbers efficiently in main memory
        ▪ They support the IEEE 754 FP format in main memory
     [1] IEEE. 2008. IEEE Standard for Floating-Point Arithmetic. IEEE 754-2008. https://doi.org/10.1109/IEEESTD.2008.4610935
     [2] Torbjörn Granlund and the GMP development team. 2012. GNU MP: The GNU Multiple Precision Arithmetic Library. https://gmplib.org/
     [3] Laurent Fousse, et al. MPFR: A Multiple-Precision Binary Floating-Point Library with Correct Rounding. https://doi.org/10.1145/1236463.1236468
     [4] Ulrich Kulisch. 2013. Computer Arithmetic and Validity: Theory, Implementation, and Applications.
     [5] M. J. Schulte and E. E. Swartzlander. 2000. A family of variable precision interval arithmetic processors. https://doi.org/10.1109/12.859535

  3. INTRODUCTION: MY WORK
     Our previous work [6]: a VP FP hardware accelerator that
        • Supports the UNUM type I format in main memory
        • Does computation internally with another (hardware-friendly) FP format
        • Supports Interval Arithmetic (IA)
     [Figure: block diagram of the Rocket tile (RISC-V Rocket core, FPU, LSU, L1 caches, RoCC interface) with the UNUM co-processor, its own LSU and scratchpad, and the RAM; numbered labels 1 to 5.]
     This work:
        ▪ Refines the UNUM type I FP format.
        ▪ Proposes a new VP FP architecture.
        ▪ Proposes a new programming model.
        ▪ Benchmarks our system.
     [6] A. Bocco, Y. Durand, F. Dinechin. 2019. SMURF: Scalar Multiple-precision UNUM RISC-V Floating-point Accelerator for Scientific Computing.

  4. OUTLINE
     • Choice of the memory format: the UNUM type I
     • Refinements on the UNUM type I FP format
     • The adopted VP FP architecture
     • The programming model
     • System benchmark: Gauss elimination solver
     • Conclusions


  6. CHOICE OF THE MEMORY FORMAT: THE UNUM TYPE I
     We decided to use the UNUM type I FP format in main memory.
     • It is a self-descriptive FP format with 6 sub-fields, 3 more than conventional IEEE 754 FP numbers:
          field   meaning         width
          s       sign            1 bit
          e       exponent        es bits
          f       fraction        fs bits
          u       ubit            1 bit
          es-1    exponent size   fixed
          fs-1    fraction size   fixed
     • WHY?
        • UNUM is a VP FP format
        • It self-encodes the exponent and fraction field lengths
     However, UNUM type I has some peculiarities to be fixed:
        • How to organize UNUM arrays in main memory
        • How to organize the UNUM fields in memory
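
To make the self-descriptive layout concrete, here is a minimal Python sketch that decodes a single UNUM type I value. It assumes the field order listed on this slide (s, e, f, u, es-1, fs-1 from MSB to LSB) and takes the widths of the two size fields as parameters named `ess` and `fss`; those names, the function name, and the returned tuple are illustrative choices, not part of the presentation. Special values (infinities, NaNs) and the interval meaning of the ubit are deliberately ignored.

```python
from fractions import Fraction

def decode_unum(bits: int, ess: int, fss: int):
    """Decode one UNUM type I value held in an integer.

    Assumed layout, MSB to LSB: s | e | f | u | es-1 | fs-1.
    ess and fss are the (fixed) widths of the es-1 and fs-1 fields.
    Returns (value, ubit, total_bit_length). Special values are not handled.
    """
    # The size fields sit at the LSB end, so they can be read first.
    fs = (bits & ((1 << fss) - 1)) + 1   # field stores fs-1
    bits >>= fss
    es = (bits & ((1 << ess) - 1)) + 1   # field stores es-1
    bits >>= ess
    ubit = bits & 1
    bits >>= 1
    # Only now are the widths of f and e known.
    f = bits & ((1 << fs) - 1)
    bits >>= fs
    e = bits & ((1 << es) - 1)
    bits >>= es
    s = bits & 1

    bias = (1 << (es - 1)) - 1
    hidden = 1 if e != 0 else 0          # subnormals have no hidden bit
    exponent = e - bias + (1 - hidden)
    magnitude = (Fraction(hidden) + Fraction(f, 1 << fs)) * Fraction(2) ** exponent
    value = -magnitude if s else magnitude
    return value, ubit, 1 + es + fs + 1 + ess + fss

# 1.5 encoded with es = 2, fs = 1 and 2-bit size fields: 0 01 1 0 01 00
value, ubit, length = decode_unum(0b001100100, ess=2, fss=2)
```

The point of the sketch is that the decoder cannot locate `e` and `f` until it has read the two size fields, which is why the position of those fields in memory matters.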


  8. REFINEMENTS ON THE UNUM TYPE I FP FORMAT: UNUM FIELD ORGANIZATION
     For a UNUM/ubound which spans multiple addresses in main memory, it is important to have the descriptor fields present in the lower addresses.
     ➢ We have re-organized the order of the fields for UNUM and ubound (LSB to MSB):
        1) UNUM:   s | u | es-1 | fs-1 | e | f
        2) ubound: s u es-1 fs-1 (left) | s u es-1 fs-1 (right) | e (left) | e (right) | f (left) | f (right)
     [Figure: memory maps from address 00--00 to FF--FF, p bits wide, showing U1 at address @1' and U2 at address @2' with the descriptor fields in the lowest-addressed word.]
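
One consequence of placing the descriptor first can be sketched as follows: the first p-bit word alone is enough to know how many more words belong to the number. The Python sketch below assumes the reordered UNUM layout with s at bit 0, u at bit 1, then es-1 and fs-1; the exact bit positions, the parameter names `ess` and `fss`, and the function name are assumptions made for illustration only.

```python
def unum_words_needed(first_word: int, ess: int, fss: int, p: int = 64) -> int:
    """Number of p-bit words occupied by a UNUM whose descriptor
    (s, u, es-1, fs-1) sits in the low bits of its first word.

    Assumed bit positions: s = bit 0, u = bit 1, es-1 = next ess bits,
    fs-1 = next fss bits. These positions are an illustrative assumption.
    """
    es = ((first_word >> 2) & ((1 << ess) - 1)) + 1
    fs = ((first_word >> (2 + ess)) & ((1 << fss) - 1)) + 1
    total_bits = 2 + ess + fss + es + fs
    return -(-total_bits // p)  # ceiling division

# es-1 = 15 and fs-1 = 99 with a 4-bit and a 9-bit size field:
# 2 + 4 + 9 + 16 + 100 = 131 bits, so three 64-bit words.
first_word = (15 << 2) | (99 << 6)
words = unum_words_needed(first_word, ess=4, fss=9, p=64)
```

With the original field order the size fields would sit in the last word instead, so a load unit would not know where that word is before reading it.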

  9. REFINEMENTS ON THE UNUM TYPE I FP FORMAT: UNUM ARRAY ORGANIZATION
     Handling a two-element UNUM array in main memory with p-bit parallelism.
     [Figure: U1 occupies two p-bit words (U1_0, U1_1) and U2 occupies three (U2_0, U2_1, U2_2). Two memory maps from 00--00 to FF--FF show the result U3 = U1*U2 (three words: U3_0, U3_1, U3_2) being written at address @1'. In layout 1, U2 starts right after U1 (address @2'') and the figure marks a conflict (!) between U3_2 and U2_0. In layout 2, U2 starts at address @2' and U3 fits without conflict.]
     Array support: guarantee an affine addressing scheme.


  11. THE ADOPTED VP FP ARCHITECTURE
      • 1 integer register file (iRF): 32 integer general-purpose registers (GPRs) + pc, in the main processor.
      • 1 g-bound register file (gRF): 32 entries, in the co-processor.
      • UNUMs/u-bounds are strictly considered as memory formats:
         • Load operations: load UNUMs/u-bounds from main memory and convert them into internal g-bounds.
         • Store operations: convert internal g-bounds (entries of the internal gRF) into u-bounds and store the latter in main memory.
      • The coprocessor's internal parallelism is fixed to 64 bits.
      • Coprocessor's status registers:
         • DUE
         • SUE
         • MBB (NEW!)
         • WGP
      [Figure: block diagram of the Rocket tile (RISC-V Rocket core, FPU, LSU, L1 caches, RoCC interface) with the UNUM co-processor, its own LSU and scratchpad, and the RAM; numbered labels 1 to 5.]

  12. THE MBB: MAXIMUM BYTE BUDGET
      The UNUM format is variable length (up to a maximum length).
         ▪ Compacted arrays of variable-length elements cannot offer random access to their elements.
      ➢ We define the Maximum Byte Budget (MBB) as the maximum length that a UNUM number can have in main memory.
      [Figure: in the LSU, g-bounds g0..g4 are converted to UNUMs u0..u4 (G2U) and then to the bounded memory format (BMF) values u'0..u'4, each occupying one MBB-sized slot.]
      ➢ The user can address VP FP numbers by specifying their length with byte granularity.
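
The addressing argument above can be illustrated with a few lines of Python. With a fixed byte budget per element, the address of element i is an affine function of i; with compact variable-length packing, it depends on the lengths of all preceding elements. The function names and the example numbers are hypothetical and only serve the illustration.

```python
def mbb_element_address(base: int, index: int, mbb: int) -> int:
    """Address of element `index` when every element owns an MBB-byte slot."""
    return base + index * mbb

def compact_element_address(base: int, index: int, lengths: list) -> int:
    """Address of element `index` in a compact array of variable-length
    elements: every preceding length must be known (a prefix sum)."""
    return base + sum(lengths[:index])

def pack_with_mbb(elements: list, mbb: int) -> bytes:
    """Place each encoded element in its own MBB-byte slot, zero-padding
    the unused bytes. Elements longer than MBB are rejected here; how the
    hardware shortens them is outside the scope of this sketch."""
    out = bytearray()
    for data in elements:
        if len(data) > mbb:
            raise ValueError("element exceeds the byte budget")
        out += data + bytes(mbb - len(data))
    return bytes(out)

addr = mbb_element_address(0x1000, 3, mbb=12)               # 0x1000 + 3 * 12
packed = pack_with_mbb([b"\x01\x02", b"\x03\x04\x05"], mbb=4)
```

The affine form costs unused padding bytes, which is the trade-off the MBB lets the user tune per array.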

  13. THE BMF: BOUNDED MEMORY FORMAT
      Each value occupies a slot of MBB*8 bits (bit length axis from 0 to MBB*8); bits not used by the encoding are UNUSED BITS.

      Case a) MBB >= max UNUM length (field widths: ess', fss', es_max, fs_max)
         row   s  u  es-1     fs-1     e        f          meaning
         1a)   0  1  1...1    1...1    1...1    1...1      qNaN
         2a)   1  1  1...1    1...1    1...1    1...1      sNaN
         3a)   0  0  1...1    1...1    1...1    1...1      +∞
         4a)   1  0  1...1    1...1    1...1    1...1      -∞
         5a)   0  1  1...1    1...1    1...1    1...10     +∞) right
         6a)   1  1  1...1    1...1    1...1    1...10     (-∞ left
         7a)   0  1  es-1     fs-1     1...1    1...1      +∞) right
         8a)   1  1  es-1     fs-1     1...1    1...1      (-∞ left
         9a)   s  u  es-1     fs-1     e        f          x

      Case b) MBB < max UNUM length (field widths: ess'', fss'', es, fs)
         row   s  u  es-1     fs-1     e        f          meaning
         1b)   0  1  1...1    1...1                        qNaN
         2b)   1  1  1...1    1...1                        sNaN
         3b)   0  0  1...1    1...1                        +∞
         4b)   1  0  1...1    1...1                        -∞
         5b)   0  1  es-1     fs-1     1...1    1...1      +∞) right
         6b)   1  1  es-1     fs-1     1...1    1...1      (-∞ left
         7b)   s  u  es-1     fs-1     e        f          x


  15. THE COPROCESSOR PROGRAMMING MODEL
      Our hardware is best suited for VP kernels which exploit three different storage types:
         • The external (main memory) storage
         • The intermediate (L1 cache) storage
         • The internal (register-level) storage
      Example kernel, solving Ax = b iteratively (legend: outermost loop, intermediate loop, innermost loop):
         k = 0
         while convergence not reached do
            for i := 1:n do
               σ = 0
               for j := 1:n do
                  if j ≠ i then
                     σ += a_ij · x_j^(k)
                  end
               end
               x_i^(k+1) = (1 / a_ii) · (b_i − σ)
            end
            k = k + 1
         end
      [Figure: Rocket tile block diagram (RISC-V core, FPU, LSU, L1 cache, RoCC interface, UNUM co-processor with LSU and scratchpad, RAM), with labels 1, 2 and 3 relating the loops to the storage levels.]
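
The kernel on this slide has the shape of a Jacobi iteration. As a software stand-in only, the sketch below runs the same three nested loops in Python with the standard `decimal` module, where the working precision is chosen at run time. It says nothing about how the coprocessor executes the kernel; the function name, the `digits` and `tol_exp` parameters, and the stopping test are assumptions made for this example.

```python
from decimal import Decimal, localcontext

def jacobi(a, b, digits=50, tol_exp=40, max_iter=1000):
    """Jacobi iteration for a x = b at `digits` decimal digits of precision.

    Stops when two successive iterates differ by less than 10**-tol_exp.
    Returns (solution, iterations_used).
    """
    with localcontext() as ctx:
        ctx.prec = digits                       # variable working precision
        n = len(b)
        a = [[Decimal(v) for v in row] for row in a]
        b = [Decimal(v) for v in b]
        x = [Decimal(0)] * n
        tol = Decimal(10) ** -tol_exp
        for k in range(max_iter):               # outermost loop
            x_new = []
            for i in range(n):                  # intermediate loop
                sigma = Decimal(0)
                for j in range(n):              # innermost loop
                    if j != i:
                        sigma += a[i][j] * x[j]
                x_new.append((b[i] - sigma) / a[i][i])
            if max(abs(p - q) for p, q in zip(x_new, x)) < tol:
                return x_new, k + 1
            x = x_new
        return x, max_iter

# Diagonally dominant system with exact solution x = (1, 2).
solution, iterations = jacobi([[4, 1], [2, 5]], [6, 12])
```

Jacobi converges for strictly diagonally dominant matrices such as the one used here; for other matrices the loop may simply run until `max_iter`.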


  17. SYSTEM BENCHMARK: GAUSS ELIMINATION SOLVER
      Our system, benchmarked with a Gauss elimination solver in both UNUM (scalar) and ubound (interval) modes, showed:
         • A gain of up to 65 decimal digits over IEEE double
            • The precision of the result is constrained by the precision adopted in memory.
            • Intervals do not always converge, but they are useful for estimating the computational error (Ax-b).
         • A speed-up of 4-10x with respect to the MPFR software library
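
For readers who want to reproduce the flavour of this benchmark in software, the following Python sketch solves a linear system by Gauss elimination with partial pivoting at a user-chosen decimal precision and evaluates the residual Ax-b. It relies on the standard `decimal` module as a stand-in for variable-precision arithmetic; it is not the benchmark code from the presentation, and the function names and the `digits` parameter are assumptions.

```python
from decimal import Decimal, localcontext

def gauss_solve(a, b, digits):
    """Solve a x = b by Gauss elimination with partial pivoting,
    using `digits` decimal digits of working precision."""
    with localcontext() as ctx:
        ctx.prec = digits
        n = len(b)
        # Augmented matrix [a | b] in Decimal.
        m = [[Decimal(v) for v in row] + [Decimal(rhs)] for row, rhs in zip(a, b)]
        for col in range(n):
            # Partial pivoting: bring the largest entry of the column up.
            pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
            m[col], m[pivot] = m[pivot], m[col]
            for r in range(col + 1, n):
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
        # Back substitution.
        x = [Decimal(0)] * n
        for i in reversed(range(n)):
            s = sum((m[i][j] * x[j] for j in range(i + 1, n)), Decimal(0))
            x[i] = (m[i][n] - s) / m[i][i]
        return x

def residual(a, x, b, digits):
    """Component-wise residual a x - b evaluated at `digits` digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return [sum((Decimal(v) * xj for v, xj in zip(row, x)), Decimal(0)) - Decimal(rhs)
                for row, rhs in zip(a, b)]

x = gauss_solve([[2, 1], [1, 3]], [3, 5], digits=50)
r = residual([[2, 1], [1, 3]], x, [3, 5], digits=100)
```

Evaluating the residual at a higher precision than the solve, as in the last line, is one common way to estimate how many digits of the solution can be trusted.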

