
ECE232: Hardware Organization and Design, Lecture 9: Floating Point



  1. ECE232: Hardware Organization and Design
     Lecture 9: Floating Point
     Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

  2. Floating Point
     • Representation for non-integral numbers, including very small and very large numbers
     • Like scientific notation:
       • $-2.34 \times 10^{56}$ (normalized)
       • $+0.002 \times 10^{-4}$ (not normalized)
       • $+987.02 \times 10^{9}$ (not normalized)
     • In binary, the normalized form is $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
     • Types float and double in C
     ECE232: Floating Point 2

  3. Floating Point Numbers
     • The largest 32-bit unsigned integer is
       1111 1111 1111 1111 1111 1111 1111 1111 = 4,294,967,295
     • What if we want to encode the approximate age of the Earth in years?
       4,600,000,000, or $4.6 \times 10^{9}$
     • Or the mass in kg of one a.m.u. (atomic mass unit)?
       0.000 000 000 000 000 000 000 000 001 66, or $1.66 \times 10^{-27}$
     • Neither value can be encoded in a 32-bit integer: the first is too large, and the second is a fraction.
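The limit described on this slide can be checked with a minimal Python sketch. It is not part of the original lecture; `fits_in_uint32` is a hypothetical helper name, and the standard `struct` module is used only to mimic fixed 32-bit storage.

```python
import struct

UINT32_MAX = 2**32 - 1  # 4,294,967,295, the largest 32-bit unsigned integer


def fits_in_uint32(n):
    """Return True if the integer n can be stored as a 32-bit unsigned value."""
    try:
        struct.pack("<I", n)  # "<I" is a 4-byte unsigned integer
        return True
    except struct.error:
        return False


age_of_earth = 4_600_000_000
print(fits_in_uint32(UINT32_MAX))    # the largest value still fits
print(fits_in_uint32(age_of_earth))  # the age of the Earth does not

# The same 32 bits, interpreted as a single-precision float ("<f"),
# can hold both quantities, at least approximately.
as_float = struct.unpack("<f", struct.pack("<f", 4.6e9))[0]
amu_float = struct.unpack("<f", struct.pack("<f", 1.66e-27))[0]
print(as_float, amu_float)
```

The float round trip trades exactness for range: only about 7 decimal digits survive, but the exponent field covers both the very large and the very small value.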

  4. Exponential Notation
     The following are equivalent representations of 1,234:
     • $123,400.0 \times 10^{-2}$
     • $12,340.0 \times 10^{-1}$
     • $1,234.0 \times 10^{0}$
     • $123.4 \times 10^{1}$
     • $12.34 \times 10^{2}$
     • $1.234 \times 10^{3}$
     • $0.1234 \times 10^{4}$
     • $0.01234 \times 10^{5}$
     The representations differ in that the decimal place (the "point") "floats" to the left or right, with the appropriate adjustment in the exponent.

  5. Parts of a Floating Point Number
     Example: $-0.9876 \times 10^{-3}$
     • Sign of mantissa: $-$
     • Mantissa (also called significand): 9876
     • Location of decimal point: to the left of the mantissa digits (0.9876)
     • Base: 10
     • Sign of exponent: $-$
     • Exponent: 3

  6. Single Precision Format
     32 bits in total:
     • S: Sign of mantissa (1 bit)
     • E: Exponent (8 bits)
     • M: Mantissa (23 bits)
     Note that the exponent has no explicit sign bit.
     Base? It is not stored; the base is implicitly 2.

  7. Normalization
     • The mantissa M is a normalized fraction
       • Has an implied binary point on its left
       • Has an implied (hidden) "1" to the left of the binary point
     • E.g., Fraction = 10100000000000000000000 represents $1.101_2 = 1.625_{10}$
     • The significand $1.f$ is in the range $[1, 2 - \mathrm{ulp}]$
       • ulp: unit in the last position
     • $F = (-1)^{S} \times 1.f \times 2^{E - \mathrm{Bias}}$

  8. IEEE Floating-Point Format

     | Format | S     | Exponent | Fraction |
     | ------ | ----- | -------- | -------- |
     | single | 1 bit | 8 bits   | 23 bits  |
     | double | 1 bit | 11 bits  | 52 bits  |

     $x = (-1)^{S} \times (1 + \mathrm{Fraction}) \times 2^{(\mathrm{Exponent} - \mathrm{Bias})}$
     • S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
     • Normalize significand: $1.0 \le |\mathrm{significand}| < 2.0$
       • Always has a leading pre-binary-point 1 bit, so there is no need to represent it explicitly (hidden bit)
       • Significand is Fraction with the "1." restored
     • Exponent: excess representation, stored value = actual exponent + Bias
       • Ensures the stored exponent is unsigned
       • Single: Bias = 127; Double: Bias = 1023
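As an illustration of the format above, here is a small Python sketch that splits a value into its three single-precision fields and then re-evaluates the formula. The function names `fields` and `value` are made up for this example; `struct` is Python's standard module for packing raw bytes, and the sketch handles normalized numbers only.

```python
import struct


def fields(x):
    """Round x to single precision and return (S, Exponent, Fraction) as integers."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased
    fraction = bits & 0x7FFFFF       # 23 bits, hidden 1 not stored
    return sign, exponent, fraction


def value(sign, exponent, fraction, bias=127, frac_bits=23):
    """Evaluate (-1)^S * (1 + Fraction) * 2^(Exponent - Bias) for a normalized number."""
    return (-1) ** sign * (1 + fraction / 2**frac_bits) * 2.0 ** (exponent - bias)


s, e, f = fields(6.5)
print(s, e, f)         # sign, biased exponent, raw fraction bits
print(value(s, e, f))  # reconstructs the original value
```

Passing `bias=1023, frac_bits=52` to `value` evaluates the same formula for double-precision fields.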

  9. Single-Precision Range
     • Exponents 00000000 and 11111111 are reserved
     • Smallest value
       • Exponent: 00000001 ⇒ actual exponent = 1 − 127 = −126
       • Fraction: 000…00 ⇒ significand = 1.0
       • $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
     • Largest value
       • Exponent: 11111110 ⇒ actual exponent = 254 − 127 = +127
       • Fraction: 111…11 ⇒ significand ≈ 2.0
       • $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$
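The two bit patterns on this slide can be turned into values directly. The following is a sketch, not the lecture's own code; `from_bits` is a hypothetical helper that reinterprets a 32-bit integer as a single-precision float.

```python
import struct


def from_bits(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]


# S = 0, Exponent = 00000001, Fraction = 000...00
smallest_normalized = from_bits(0x00800000)
# S = 0, Exponent = 11111110, Fraction = 111...11
largest_normalized = from_bits(0x7F7FFFFF)

print(smallest_normalized)  # equals 2**-126
print(largest_normalized)   # equals (2 - 2**-23) * 2**127
```

Note that the largest significand is slightly below 2.0, which is why the slide writes "≈ 2.0".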

  10. Floating-Point Example
      • Represent −0.75
        • $-0.75 = (-1)^{1} \times 1.1_2 \times 2^{-1}$
        • S = 1
        • Fraction = $1000\ldots00_2$
        • Exponent = −1 + Bias
          • Single: −1 + 127 = 126 = $01111110_2$
          • Double: −1 + 1023 = 1022 = $01111111110_2$
      • Single: 1 01111110 1000…00
      • Double: 1 01111111110 1000…00
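The encoding of −0.75 can be cross-checked against what the machine actually stores. This is a minimal sketch using Python's standard `struct` module; `single_bits` and `double_bits` are illustrative names, not part of the lecture.

```python
import struct


def single_bits(x):
    """Return the single-precision encoding of x as 'S EEEEEEEE FFF...F'."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"


def double_bits(x):
    """Return the double-precision encoding of x as 'S EEEEEEEEEEE FFF...F'."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = f"{bits:064b}"
    return f"{s[0]} {s[1:12]} {s[12:]}"


print(single_bits(-0.75))
print(double_bits(-0.75))
```

Both outputs should start with sign 1, then the biased exponent (126 or 1022), then a fraction whose only set bit is the leading one.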

  11. Floating-Point Example
      • What number is represented by the single-precision float
        1 10000001 01000…00
        • S = 1
        • Fraction = $01000\ldots00_2$
        • Exponent = $10000001_2 = 129$
      • $x = (-1)^{1} \times (1 + 0.01_2) \times 2^{(129 - 127)} = (-1) \times 1.25 \times 2^{2} = -5.0$
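Going the other way, a bit pattern written out as on this slide can be decoded mechanically. The sketch below assumes the pattern is given as a string of 0s and 1s (spaces allowed); `decode_single` is a hypothetical name.

```python
import struct


def decode_single(bit_string):
    """Decode a 32-bit pattern such as '1 10000001 0100...0' as an IEEE single."""
    bits = bit_string.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


# S = 1, Exponent = 10000001, Fraction = 01 followed by 21 zeros
pattern = "1 10000001 01" + "0" * 21
print(decode_single(pattern))
```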

  12. Floating-Point Addition
      • Consider a 4-digit decimal example
        • $9.999 \times 10^{1} + 1.610 \times 10^{-1}$
      • 1. Align decimal points
        • Shift the number with the smaller exponent
        • $9.999 \times 10^{1} + 0.016 \times 10^{1}$
      • 2. Add significands
        • $9.999 \times 10^{1} + 0.016 \times 10^{1} = 10.015 \times 10^{1}$
      • 3. Normalize result and check for over/underflow
        • $1.0015 \times 10^{2}$
      • 4. Round and renormalize if necessary
        • $1.002 \times 10^{2}$
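The decimal example can be reproduced with Python's standard `decimal` module, configured for 4 significant digits. This is only a cross-check under one assumption: `decimal` computes the exact sum and rounds once (round-half-even here), whereas the slide drops a digit during alignment. For these operands both routes give the same answer.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

four_digits = Context(prec=4, rounding=ROUND_HALF_EVEN)

a = Decimal("9.999E1")   # 9.999 x 10^1
b = Decimal("1.610E-1")  # 1.610 x 10^-1

exact = a + b                    # default context keeps 28 digits
rounded = four_digits.add(a, b)  # result limited to 4 significant digits

print(exact)
print(rounded)
```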

  13. Floating-Point Addition
      • Now consider a 4-digit binary example
        • $1.000_2 \times 2^{-1} + (-1.110_2 \times 2^{-2})$, i.e. $0.5 + (-0.4375)$
      • 1. Align binary points
        • Shift the number with the smaller exponent
        • $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1})$
      • 2. Add significands
        • $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1}) = 0.001_2 \times 2^{-1}$
      • 3. Normalize result and check for over/underflow
        • $1.000_2 \times 2^{-4}$, with no over/underflow
      • 4. Round and renormalize if necessary
        • $1.000_2 \times 2^{-4}$ (no change) $= 0.0625$

  14. Steps in Addition/Subtraction
      • Step 1: Calculate the difference d of the two exponents, $d = |E_1 - E_2|$
      • Step 2: Shift the significand of the smaller number by d positions to the right
      • Step 3: Add the aligned significands and set the exponent of the result to the exponent of the larger operand
      • Step 4: Normalize the resultant significand and adjust the exponent if necessary
      • Step 5: Round the resultant significand and adjust the exponent if necessary
      Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
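The five steps can be sketched on a toy format with a 4-bit significand, matching the 4-digit binary example (0.5 + (−0.4375)). This is an illustrative model, not a description of real hardware: the tuple layout `(sign, sig, exp)` and the names `fp_add` and `to_float` are assumptions made for this example, bits shifted out are simply dropped, and Step 5 is reduced to truncation (no guard bits, no overflow or underflow checks).

```python
def fp_add(a, b, p=4):
    """Add two toy floating-point numbers given as (sign, sig, exp).

    sign is +1 or -1; sig is a p-bit integer holding the significand 1.xxx
    (0b1000 means 1.000 when p = 4); the value is sign * sig * 2**(exp - (p - 1)).
    """
    (sa, ma, ea), (sb, mb, eb) = a, b
    # Step 1: exponent difference; let `a` be the operand with the larger exponent.
    if ea < eb:
        (sa, ma, ea), (sb, mb, eb) = (sb, mb, eb), (sa, ma, ea)
    d = ea - eb
    # Step 2: shift the other significand right by d positions.
    mb >>= d
    # Step 3: add the signed significands; the result takes the larger exponent.
    total = sa * ma + sb * mb
    if total == 0:
        return (1, 0, 0)  # toy encoding of zero
    sign, sig, exp = (-1 if total < 0 else 1), abs(total), ea
    # Step 4: normalize so the leading 1 sits in bit position p - 1.
    while sig >= (1 << p):
        sig >>= 1  # Step 5 stand-in: the dropped bit is truncated, not rounded
        exp += 1
    while sig < (1 << (p - 1)):
        sig <<= 1
        exp -= 1
    return (sign, sig, exp)


def to_float(x, p=4):
    """Convert a toy (sign, sig, exp) triple to a Python float."""
    sign, sig, exp = x
    return sign * sig * 2.0 ** (exp - (p - 1))


half = (1, 0b1000, -1)    # 1.000_2 x 2^-1 = 0.5
other = (-1, 0b1110, -2)  # -1.110_2 x 2^-2 = -0.4375
result = fp_add(half, other)
print(result, to_float(result))
```

For these operands the aligned difference is 0.001 in binary, so normalization shifts left three times and lowers the exponent from −1 to −4.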

  15. Example: Single precision
      0 10000010 11010000000000000000000
      • Sign: 0 = positive mantissa
      • Exponent: $10000010_2 = 130$, and 130 − 127 = 3
      • Mantissa: $1.1101_2$ (hidden 1 restored)
      • Value: $+1.1101_2 \times 2^{3} = 1110.1_2 = 14.5_{10}$

  16. Converting to IEEE format
      • Example, decimal number: $-3.154 \times 10^{0}$
        • What is the sign?
        • What is the exponent?
        • What is the mantissa?
      • Converting Mixed Numbers: Decimal to Binary
        • $456.78_{10} = 4 \times 10^{2} + 5 \times 10^{1} + 6 \times 10^{0} + 7 \times 10^{-1} + 8 \times 10^{-2}$
        • $1011.11_2 = 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2}$
          $= 8 + 0 + 2 + 1 + 1/2 + 1/4 = 11 + 0.5 + 0.25 = 11.75_{10}$
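The positional expansion used for both mixed numbers can be written as one small routine. This is a sketch with a made-up name, `positional_value`; it uses exact rational arithmetic from the standard `fractions` module so that the decimal example is not distorted by binary rounding, and it only handles bases up to 10.

```python
from fractions import Fraction


def positional_value(digits, base):
    """Evaluate a digit string such as '1011.11' in the given base (2 to 10)."""
    whole, _, frac = digits.partition(".")
    total = Fraction(0)
    # Digits left of the point carry weights base^0, base^1, ...
    for i, ch in enumerate(reversed(whole)):
        total += int(ch) * Fraction(base) ** i
    # Digits right of the point carry weights base^-1, base^-2, ...
    for i, ch in enumerate(frac, start=1):
        total += int(ch) * Fraction(base) ** -i
    return total


print(float(positional_value("1011.11", 2)))
print(positional_value("456.78", 10))
```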

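  17. How to convert whole Decimal to Binary
      • Successive division by 2: the remainders, read from the last division back to the first, are the bits from most significant to least significant
      • $57143_{10} = 1101111100110111_2$

      | Value | Remainder when divided by 2 |
      | ----- | --------------------------- |
      | 57143 | 1 |
      | 28571 | 1 |
      | 14285 | 1 |
      | 7142  | 0 |
      | 3571  | 1 |
      | 1785  | 1 |
      | 892   | 0 |
      | 446   | 0 |
      | 223   | 1 |
      | 111   | 1 |
      | 55    | 1 |
      | 27    | 1 |
      | 13    | 1 |
      | 6     | 0 |
      | 3     | 1 |
      | 1     | 1 |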
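Successive division by 2 translates almost line for line into code. A minimal sketch, with `to_binary` as an illustrative name:

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by successive division by 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)  # quotient feeds the next step, remainder is one bit
        remainders.append(str(r))
    # The first remainder is the least significant bit, so reverse the list.
    return "".join(reversed(remainders))


print(to_binary(57143))
```

Python's built-in `bin(57143)` gives the same digits with a `0b` prefix, which makes a convenient cross-check.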

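  18. Converting fractional Decimal to Binary
      • Successive multiplication by 2: at each step, double the fractional part; the integer part of the product is the next bit
      • Decimal 0.154 = .0010 0111 0110 1100 1000 101 (first 23 bits)

      | Step | Product | Bit |
      | ---- | ------- | --- |
      | 0    | 0.154   |     |
      | 1    | 0.308   | 0   |
      | 2    | 0.616   | 0   |
      | 3    | 1.232   | 1   |
      | 4    | 0.464   | 0   |
      | 5    | 0.928   | 0   |
      | 6    | 1.856   | 1   |
      | 7    | 1.712   | 1   |
      | 8    | 1.424   | 1   |
      | 9    | 0.848   | 0   |
      | 10   | 1.696   | 1   |
      | 11   | 1.392   | 1   |
      | 12   | 0.784   | 0   |
      | 13   | 1.568   | 1   |
      | 14   | 1.136   | 1   |
      | 15   | 0.272   | 0   |
      | 16   | 0.544   | 0   |
      | 17   | 1.088   | 1   |
      | 18   | 0.176   | 0   |
      | 19   | 0.352   | 0   |
      | 20   | 0.704   | 0   |
      | 21   | 1.408   | 1   |
      | 22   | 0.816   | 0   |
      | 23   | 1.632   | 1   |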
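Successive multiplication by 2 can be sketched the same way. The helper name `fraction_to_binary` is made up for this example, and exact rationals from the standard `fractions` module are used so that 0.154 is treated as the exact decimal value rather than its nearest binary double.

```python
from fractions import Fraction


def fraction_to_binary(x, n_bits):
    """Return the first n_bits binary digits of a fraction 0 <= x < 1."""
    x = Fraction(x)  # accepts strings such as "0.154" exactly
    if not 0 <= x < 1:
        raise ValueError("x must satisfy 0 <= x < 1")
    bits = []
    for _ in range(n_bits):
        x *= 2
        bit = int(x >= 1)  # the integer part of the product is the next bit
        bits.append(str(bit))
        x -= bit           # keep only the fractional part for the next step
    return "".join(bits)


print(fraction_to_binary("0.154", 23))
```

Most decimal fractions, 0.154 included, have no finite binary expansion, so the caller has to choose how many bits to generate; 23 matches the width of the single-precision fraction field.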

  19. Floating Point Special Representations
      • Normalized numbers: $F = (-1)^{S} \times 1.f \times 2^{E - 127}$
      • There are two zeros, $\pm 0$, and two infinities, $\pm\infty$
      • NaN (Not-a-Number) may have either sign and always has a non-zero fraction; it is used for program diagnostics
      • NaNs and infinities have all 1s in the Exp field, E = 255
      • $F + \infty = \infty$, $F / \infty = 0$
      Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
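These special values can be observed from Python, whose `float` type is an IEEE double on common platforms (an assumption about the platform, and double rather than single precision). The variable names below are illustrative.

```python
import math

inf = float("inf")
F = 1.5  # any finite value

print(F + inf)  # stays infinite
print(F / inf)  # collapses to zero

# The two zeros compare equal, but the sign bit is still there.
print(0.0 == -0.0)
print(math.copysign(1.0, -0.0))

# An invalid operation such as infinity minus infinity yields NaN,
# and a NaN does not compare equal even to itself.
nan = inf - inf
print(math.isnan(nan), nan == nan)
```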

  20. Floating Point Special Representations
      • $F = (-1)^{S} \times 1.f \times 2^{E - 127}$ for $1 \le E \le 254$

      | Single: Exponent | Single: Fraction | Double: Exponent | Double: Fraction | Object represented      |
      | ---------------- | ---------------- | ---------------- | ---------------- | ----------------------- |
      | 0                | 0                | 0                | 0                | 0                       |
      | 0                | nonzero          | 0                | nonzero          | ± denormalized number   |
      | 1-254            | anything         | 1-2046           | anything         | ± floating point number |
      | 255              | 0                | 2047             | 0                | ± infinity              |
      | 255              | nonzero          | 2047             | nonzero          | NaN (not a number)      |
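The single-precision half of this table maps onto a short classification routine. This is a sketch: `classify_single` and its return strings are names chosen for this example, and the input is assumed to be the raw 32-bit pattern as an integer.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by its Exponent and Fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"  # exponent in 1..254, any fraction


for pattern in (0x00000000, 0x00400000, 0x3F800000, 0x7F800000, 0x7FA00000):
    print(f"{pattern:#010x}", classify_single(pattern))
```

The sign bit is deliberately ignored, since every row of the table applies to both signs.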

  21. Smallest & Largest Numbers
      • The smallest non-zero positive and largest non-zero negative normalized numbers (represented by 1 in the Exp field and 0…0 in the Fraction field) are
        $\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$
      • The smallest non-zero positive and largest non-zero negative denormalized numbers (represented by all 0s in the Exp field and 0…01 in the Fraction field) are
        $\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$
      • The largest finite positive and smallest finite negative numbers (represented by 254 in the Exp field and 1…1 in the Fraction field) are
        $\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 3.40 \times 10^{38}$
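The three magnitudes follow from the field values by short arithmetic. The worked form below assumes the usual IEEE 754 convention that a denormalized single-precision number has the value $0.f \times 2^{-126}$ (no hidden 1), which the slides do not state explicitly.

```latex
\begin{aligned}
\text{smallest normalized:}   \quad & 1.0 \times 2^{1 - 127} = 2^{-126} \approx 1.175 \times 10^{-38} \\
\text{smallest denormalized:} \quad & 2^{-23} \times 2^{-126} = 2^{-149} \approx 1.401 \times 10^{-45} \\
\text{largest finite:}        \quad & (2 - 2^{-23}) \times 2^{254 - 127} = (2 - 2^{-23}) \times 2^{127} \approx 3.403 \times 10^{38}
\end{aligned}
```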

  22. FP Adder Hardware
      [Figure: FP adder hardware diagram, with the parts labeled Step 1, Step 2, Step 3, and Step 4.]

  23. Single Precision Summary

      | Type                       | Exponent  | Mantissa                     | Value                  |
      | -------------------------- | --------- | ---------------------------- | ---------------------- |
      | Zero                       | 0000 0000 | 000 0000 0000 0000 0000 0000 | 0                      |
      | One                        | 0111 1111 | 000 0000 0000 0000 0000 0000 | 1                      |
      | Denormalized number        | 0000 0000 | 100 0000 0000 0000 0000 0000 | $5.9 \times 10^{-39}$  |
      | Largest normalized number  | 1111 1110 | 111 1111 1111 1111 1111 1111 | $3.4 \times 10^{38}$   |
      | Smallest normalized number | 0000 0001 | 000 0000 0000 0000 0000 0000 | $1.18 \times 10^{-38}$ |
      | Infinity                   | 1111 1111 | 000 0000 0000 0000 0000 0000 | Infinity               |
      | NaN                        | 1111 1111 | 010 0000 0000 0000 0000 0000 | NaN                    |

  24. Summary
      • Floating-point numbers can represent very large values as well as fractions
      • The number format differs from 2's complement
        • Requires some memorization
      • Addition requires aligning, adding, and then realigning (normalizing) the result
      • Do examples!
        • The best way to learn floating-point operations
