ECE232: Hardware Organization and Design
Lecture 9: Floating Point
Adapted from Computer Organization and Design, Patterson & Hennessy, UCB
Floating Point
• Representation for non-integral numbers
  • Including very small and very large numbers
• Like scientific notation
  • $-2.34 \times 10^{56}$
  • $+0.002 \times 10^{-4}$
  • $+987.02 \times 10^{9}$
• In binary, normalized: $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
• Types float and double in C
ECE232: Floating Point 2
Floating Point Numbers
• The largest 32-bit unsigned integer is 1111 1111 1111 1111 1111 1111 1111 1111 = 4,294,967,295
• What if we want to encode the approximate age of the earth in years? 4,600,000,000, or $4.6 \times 10^{9}$
• Or the mass in kg of one a.m.u. (atomic mass unit)? 0.00000000000000000000000000166, or about $1.66 \times 10^{-27}$
• Neither value can be encoded in a 32-bit integer: the first is too large, and the second is a tiny fraction
Exponential Notation
• The following are equivalent representations of 1,234:
  • $123,400.0 \times 10^{-2}$
  • $12,340.0 \times 10^{-1}$
  • $1,234.0 \times 10^{0}$
  • $123.4 \times 10^{1}$
  • $12.34 \times 10^{2}$
  • $1.234 \times 10^{3}$
  • $0.1234 \times 10^{4}$
  • $0.01234 \times 10^{5}$
• The representations differ in that the decimal place (the "point") "floats" to the left or right, with the appropriate adjustment in the exponent
Parts of a Floating Point Number
• Example: $-0.9876 \times 10^{-3}$
  • Sign of mantissa: the leading "-"
  • Mantissa: 0.9876, including the location of the decimal point
  • Base: 10
  • Sign of exponent: "-"
  • Exponent: 3
• The mantissa is also called the significand
Single Precision Format
• 32 bits in total, in the order S | E | M
  • S: sign of mantissa (1 bit)
  • E: exponent (8 bits)
  • M: mantissa (23 bits)
• Note that the exponent has no explicit sign bit
• Base? It is not stored; base 2 is implied
Normalization
• The mantissa M is a normalized fraction
  • Has an implied binary point on the left
  • Has an implied (hidden) "1" to the left of the binary point
• E.g., fraction 10100000000000000000000 represents $1.101_2 = 1.625_{10}$
• The significand $1.f$ is in the range $[1, 2 - \text{ulp}]$
  • ulp: unit in the last position
• $F = (-1)^S \times 1.f \times 2^{E - \text{Bias}}$
IEEE Floating-Point Format
• Fields, in order: S | Exponent | Fraction
  • Exponent: 8 bits (single), 11 bits (double)
  • Fraction: 23 bits (single), 52 bits (double)
• $x = (-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
• S: sign bit (0 means non-negative, 1 means negative)
• Normalize significand: $1.0 \le |\text{significand}| < 2.0$
  • Always has a leading pre-binary-point 1 bit, so there is no need to represent it explicitly (hidden bit)
  • Significand is Fraction with the "1." restored
• Exponent: excess representation: actual exponent + Bias
  • Ensures the stored exponent is unsigned
  • Single: Bias = 127; Double: Bias = 1023
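The formula on this slide can be checked with a short Python sketch. The helper names `decode_single` and `value_from_fields` are illustrative, not part of the lecture, and the sketch assumes a normalized number (exponent field between 1 and 254).

```python
import struct

def decode_single(x):
    """Split a value, rounded to single precision, into its (S, Exponent, Fraction) fields."""
    # ">f" packs x as a big-endian IEEE 754 single; ">I" rereads the same 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with bias 127
    fraction = bits & 0x7FFFFF        # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction, bias=127, frac_bits=23):
    """Apply x = (-1)^S * (1 + Fraction) * 2^(Exponent - Bias); normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2 ** frac_bits) * 2.0 ** (exponent - bias)

s, e, f = decode_single(6.5)          # 6.5 = 1.101 (binary) * 2^2
print(s, e, f)
print(value_from_fields(s, e, f))
```

Passing `bias=1023, frac_bits=52` to `value_from_fields` applies the same formula to double-precision fields.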
Single-Precision Range
• Exponents 00000000 and 11111111 are reserved
• Smallest value
  • Exponent: 00000001, so actual exponent = 1 - 127 = -126
  • Fraction: 000…00, so significand = 1.0
  • $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
• Largest value
  • Exponent: 11111110, so actual exponent = 254 - 127 = +127
  • Fraction: 111…11, so significand ≈ 2.0
  • $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$
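As a quick check of these limits, the two bit patterns can be reinterpreted as single-precision values in Python. This is a minimal sketch; `single_from_bits` is a hypothetical helper name, and the hex constants are simply the patterns from the slide written out with sign bit 0.

```python
import struct

def single_from_bits(bits):
    """Interpret a 32-bit unsigned integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = single_from_bits(0x00800000)  # exponent 00000001, fraction 000...00
largest_normal = single_from_bits(0x7F7FFFFF)   # exponent 11111110, fraction 111...11

print(smallest_normal == 2.0 ** -126)
print(largest_normal == (2 - 2.0 ** -23) * 2.0 ** 127)
print(smallest_normal, largest_normal)
```

The second comparison uses the exact largest significand, $2 - 2^{-23}$, which the slide rounds to 2.0.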
Floating-Point Example
• Represent -0.75
  • $-0.75 = (-1)^1 \times 1.1_2 \times 2^{-1}$
  • S = 1
  • Fraction = $1000\ldots00_2$
  • Exponent = -1 + Bias
    • Single: -1 + 127 = 126 = $01111110_2$
    • Double: -1 + 1023 = 1022 = $01111111110_2$
• Single: 1 01111110 1000…00
• Double: 1 01111111110 1000…00
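The encoding steps for -0.75 can be reproduced in a few lines of Python. The functions `encode_fields` and `as_bit_string` are illustrative names introduced here; the sketch assumes the input is normalized and exactly representable, so it does no rounding.

```python
import math
import struct

def encode_fields(x, exp_bits, frac_bits):
    """Return (S, biased exponent, fraction) for a normalized, exactly representable x."""
    bias = 2 ** (exp_bits - 1) - 1           # 127 for single, 1023 for double
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))                # abs(x) = m * 2**e with 0.5 <= m < 1
    significand, exponent = m * 2, e - 1     # rescale so that 1 <= significand < 2
    fraction = int((significand - 1) * 2 ** frac_bits)  # drop the hidden 1
    return sign, exponent + bias, fraction

def as_bit_string(fields, exp_bits, frac_bits):
    """Format the three fields as 'S EEEE FFFF' with fixed field widths."""
    sign, exponent, fraction = fields
    return f"{sign} {exponent:0{exp_bits}b} {fraction:0{frac_bits}b}"

single = encode_fields(-0.75, 8, 23)
double = encode_fields(-0.75, 11, 52)
print(as_bit_string(single, 8, 23))
print(as_bit_string(double, 11, 52))

# Cross-check against the encodings produced by the struct module.
print(struct.pack(">f", -0.75).hex(), struct.pack(">d", -0.75).hex())
```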
Floating-Point Example
• What number is represented by the single-precision float 1 10000001 01000…00?
  • S = 1
  • Fraction = $01000\ldots00_2$
  • Exponent = $10000001_2 = 129$
• $x = (-1)^1 \times (1 + .01_2) \times 2^{(129 - 127)} = (-1) \times 1.25 \times 2^{2} = -5.0$
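Going the other way, a bit string can be decoded field by field. The sketch below uses a hypothetical helper `decode_bits` and assumes a normalized single-precision pattern; the second print reinterprets the same 32 bits with `struct` as an independent check.

```python
import struct

def decode_bits(bit_string):
    """Decode a 32-bit pattern (spaces allowed) as a normalized single-precision number."""
    bits = bit_string.replace(" ", "")
    sign = int(bits[0], 2)
    exponent = int(bits[1:9], 2)              # biased exponent
    fraction = int(bits[9:], 2) / 2 ** 23     # 23 fraction bits as a value in [0, 1)
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

pattern = "1 10000001 01" + "0" * 21          # the pattern from the slide
print(decode_bits(pattern))                   # -5.0

raw = int(pattern.replace(" ", ""), 2)
print(struct.unpack(">f", raw.to_bytes(4, "big"))[0])
```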
Floating-Point Addition
• Consider a 4-digit decimal example: $9.999 \times 10^{1} + 1.610 \times 10^{-1}$
• 1. Align decimal points
  • Shift the number with the smaller exponent
  • $9.999 \times 10^{1} + 0.016 \times 10^{1}$
• 2. Add significands
  • $9.999 \times 10^{1} + 0.016 \times 10^{1} = 10.015 \times 10^{1}$
• 3. Normalize result & check for over/underflow
  • $1.0015 \times 10^{2}$
• 4. Round and renormalize if necessary
  • $1.002 \times 10^{2}$
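Python's `decimal` module can imitate 4-digit decimal arithmetic, which gives a way to check the result above. Note one difference, stated as an assumption of this sketch: `decimal` computes the exact sum and rounds once, whereas the slide truncates the shifted operand to 0.016 before adding. For this example both approaches agree.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A context that keeps 4 significant decimal digits, like the slide's format.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

a = Decimal("9.999E+1")
b = Decimal("1.610E-1")
total = ctx.add(a, b)      # exact sum 100.151, rounded to 4 digits

print(total)               # 100.2, i.e. 1.002 x 10^2
```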
Floating-Point Addition
• Now consider a 4-digit binary example: $1.000_2 \times 2^{-1} + (-1.110_2 \times 2^{-2})$, i.e. 0.5 + (-0.4375)
• 1. Align binary points
  • Shift the number with the smaller exponent
  • $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1})$
• 2. Add significands
  • $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1}) = 0.001_2 \times 2^{-1}$
• 3. Normalize result & check for over/underflow
  • $1.000_2 \times 2^{-4}$, with no over/underflow
• 4. Round and renormalize if necessary
  • $1.000_2 \times 2^{-4}$ (no change) = 0.0625
Steps in Addition/Subtraction
• Step 1: Calculate the difference d of the two exponents: $d = |E_1 - E_2|$
• Step 2: Shift the significand of the smaller number by d positions to the right
• Step 3: Add the aligned significands and set the exponent of the result to the exponent of the larger operand
• Step 4: Normalize the resultant significand and adjust the exponent if necessary
• Step 5: Round the resultant significand and adjust the exponent if necessary
Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
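The five steps can be sketched in Python for a toy format with 4-bit significands, matching the binary example from the previous slide. Everything here is an illustrative simplification rather than the hardware algorithm: the tuple layout `(sign, significand, exponent)`, the names `fp_add` and `to_fraction`, the three guard bits, and the round-half-up rule are all assumptions of this sketch (IEEE 754 defaults to round-to-nearest-even, and exponent overflow/underflow is not checked).

```python
from fractions import Fraction

P = 4  # significand width in bits; 0b1110 stands for 1.110 in binary

def fp_add(a, b, guard=3):
    """Add two numbers given as (sign, significand, exponent), sign being +1 or -1."""
    (sa, ma, ea), (sb, mb, eb) = a, b
    if ea < eb:                                # make a the operand with the larger exponent
        (sa, ma, ea), (sb, mb, eb) = (sb, mb, eb), (sa, ma, ea)
    d = ea - eb                                # Step 1: exponent difference
    ma <<= guard                               # append guard bits to both significands
    mb = (mb << guard) >> d                    # Step 2: shift the smaller operand right by d
    total = sa * ma + sb * mb                  # Step 3: add; result exponent is ea
    sign, mag, exp = (-1 if total < 0 else 1), abs(total), ea
    if mag == 0:
        return (1, 0, 0)
    top = P - 1 + guard                        # bit position where the leading 1 belongs
    while mag >= 1 << (top + 1):               # Step 4: normalize (carry out of the top)
        mag >>= 1
        exp += 1
    while mag < 1 << top:                      # Step 4: normalize (leading zeros)
        mag <<= 1
        exp -= 1
    mag = (mag + (1 << (guard - 1))) >> guard  # Step 5: round half up to P bits
    if mag == 1 << P:                          # rounding overflowed: renormalize
        mag >>= 1
        exp += 1
    return (sign, mag, exp)

def to_fraction(x):
    """Exact value of a (sign, significand, exponent) triple."""
    sign, mag, exp = x
    return sign * Fraction(mag, 2 ** (P - 1)) * Fraction(2) ** exp

result = fp_add((1, 0b1000, -1), (-1, 0b1110, -2))   # 0.5 + (-0.4375)
print(result, float(to_fraction(result)))
```

For the slide's operands the sketch returns `(1, 8, -4)`, that is $1.000_2 \times 2^{-4} = 0.0625$.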
Example: Single precision
• Bit pattern: 0 10000010 11010000000000000000000
• Sign: 0 = positive mantissa
• Exponent: 130 - 127 = 3
• Significand: $1.1101_2$
• Value: $+1.1101_2 \times 2^{3} = 1110.1_2 = 14.5_{10}$
Converting to IEEE format
• Example, decimal number: $-3.154 \times 10^{0}$
  • What is the sign?
  • What is the exponent?
  • What is the mantissa?
Converting Mixed Numbers: Decimal to Binary
• $456.78_{10} = 4 \times 10^{2} + 5 \times 10^{1} + 6 \times 10^{0} + 7 \times 10^{-1} + 8 \times 10^{-2}$
• $1011.11_2 = 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2}$
  $= 8 + 0 + 2 + 1 + 1/2 + 1/4 = 11 + 0.5 + 0.25 = 11.75_{10}$
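The positional expansion above works the same way in any base, which a short Python sketch makes concrete. The function name `positional_value` is hypothetical, and the sketch assumes digits 0 to 9 only (bases 2 through 10). Exact fractions are used so that no floating-point rounding hides in the check.

```python
from fractions import Fraction

def positional_value(digits, base):
    """Evaluate a digit string such as '1011.11' in the given base (2 to 10)."""
    whole, _, frac = digits.partition(".")
    value = Fraction(0)
    for i, d in enumerate(reversed(whole)):    # weights base^0, base^1, ...
        value += int(d) * Fraction(base) ** i
    for i, d in enumerate(frac, start=1):      # weights base^-1, base^-2, ...
        value += int(d) * Fraction(base) ** -i
    return value

print(float(positional_value("1011.11", 2)))   # 11.75
print(positional_value("456.78", 10))
```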
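How to convert whole Decimal to Binary
• Successive division by 2: start from the bottom row and halve repeatedly, recording each remainder
• Reading the remainders from top to bottom gives $57143_{10} = 1101111100110111_2$

  Remainder  Value
  1          1
  1          3
  0          6
  1          13
  1          27
  1          55
  1          111
  1          223
  0          446
  0          892
  1          1785
  1          3571
  0          7142
  1          14285
  1          28571
  1          57143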
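The successive-division procedure translates directly into a loop. This is a minimal sketch with an illustrative function name, `to_binary`; Python's built-in `bin` is used only as a cross-check.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)
        bits.append(str(remainder))    # remainders arrive least significant bit first
    return "".join(reversed(bits))     # so reverse them to read most significant first

print(to_binary(57143))                # 1101111100110111
print(to_binary(57143) == bin(57143)[2:])
```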
Converting fractional Decimal to Binary
• Successive multiplication by 2: the integer part of each product is the next bit, and only the fractional part is carried into the next step

  Step  Product  Bit
  0     0.154
  1     0.308    0
  2     0.616    0
  3     1.232    1
  4     0.464    0
  5     0.928    0
  6     1.856    1
  7     1.712    1
  8     1.424    1
  9     0.848    0
  10    1.696    1
  11    1.392    1
  12    0.784    0
  13    1.568    1
  14    1.136    1
  15    0.272    0
  16    0.544    0
  17    1.088    1
  18    0.176    0
  19    0.352    0
  20    0.704    0
  21    1.408    1
  22    0.816    0
  23    1.632    1

• Decimal 0.154 ≈ .0010 0111 0110 1100 1000 101 in binary (first 23 bits)
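The successive-multiplication table can be regenerated with a few lines of Python. The name `fraction_bits` is illustrative. The sketch uses exact rational arithmetic (`fractions.Fraction`) on the assumption that the decimal input should be treated exactly, since a binary float cannot hold 0.154 without error.

```python
from fractions import Fraction

def fraction_bits(x, n_bits):
    """Return the first n_bits binary digits of a fraction 0 <= x < 1 by repeated doubling."""
    x = Fraction(x)            # exact value, e.g. Fraction("0.154") == 77/500
    bits = []
    for _ in range(n_bits):
        x *= 2
        bit = int(x)           # integer part (0 or 1) is the next binary digit
        bits.append(str(bit))
        x -= bit               # carry only the fractional part forward
    return "".join(bits)

print(fraction_bits("0.154", 23))   # 00100111011011001000101
```

The digits are truncated, not rounded, after `n_bits` steps, just as in the table.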
Floating Point Special Representations
• $F = (-1)^S \times 1.f \times 2^{E - 127}$
• There are two zeroes, $\pm 0$, and two infinities, $\pm\infty$
• NaN (Not-a-Number) may have a sign and has a non-zero fraction; used for program diagnostics
• NaNs and infinities have all 1s in the Exp field, E = 255
• $F + \infty = \infty$, $F / \infty = 0$
Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
Floating Point Special Representations
• $F = (-1)^S \times 1.f \times 2^{E - 127}$, for $1 \le E \le 254$

  Single precision        Double precision        Object represented
  Exponent  Fraction      Exponent  Fraction
  0         0             0         0             0
  0         nonzero       0         nonzero       ± denormalized number
  1-254     anything      1-2046    anything      ± floating point number
  255       0             2047      0             ± infinity
  255       nonzero       2047      nonzero       NaN (not a number)
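The single-precision half of this table amounts to a small decision procedure on the exponent and fraction fields. A minimal Python sketch follows; the function name `classify_single` and the returned labels are illustrative choices, and the input is assumed to fit in single precision (`struct.pack` raises `OverflowError` for finite values that are too large).

```python
import struct

def classify_single(x):
    """Classify a value by the exponent and fraction fields of its single-precision encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:                  # all 0s: zero or denormalized
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:                # all 1s: infinity or NaN
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"                # 1 to 254: ordinary floating point number

for sample in (0.0, -0.0, 1e-40, 1.0, float("inf"), float("nan")):
    print(sample, classify_single(sample))
```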
Smallest & Largest Numbers
• The smallest non-zero positive and largest non-zero negative normalized numbers (represented by 1 in the Exp field and 0…0 in the Fraction field) are
  • $\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$
• The smallest non-zero positive and largest non-zero negative denormalized numbers (represented by all 0s in the Exp field and 0…01 in the Fraction field) are
  • $\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$
• The largest finite positive and smallest finite negative numbers (represented by 254 in the Exp field and 1…1 in the Fraction field) are
  • $\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 3.40 \times 10^{38}$
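The denormalized values can be checked numerically. The slides do not spell out the denormal formula, so the sketch below relies on the IEEE 754 rule that a denormal has no hidden 1 and a fixed exponent of -126, giving $0.f \times 2^{-126}$. The helper names `single_from_bits` and `denormal_value` are illustrative.

```python
import struct

def single_from_bits(bits):
    """Interpret a 32-bit unsigned integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def denormal_value(fraction):
    """Value of a positive single-precision denormal: 0.f x 2^-126 (no hidden 1)."""
    return (fraction / 2 ** 23) * 2.0 ** -126

smallest_denormal = single_from_bits(0x00000001)   # Exp field all 0s, fraction 0...01
print(smallest_denormal == 2.0 ** -149)
print(denormal_value(0x000001) == smallest_denormal)
print(single_from_bits(0x007FFFFF) < 2.0 ** -126)  # largest denormal is below the smallest normal
```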
FP Adder Hardware
[Figure: floating-point adder datapath, with its stages labeled Step 1 through Step 4]
Single Precision Summary

  Type                        Exponent    Mantissa                       Value
  Zero                        0000 0000   000 0000 0000 0000 0000 0000   0
  One                         0111 1111   000 0000 0000 0000 0000 0000   1
  Denormalized number         0000 0000   100 0000 0000 0000 0000 0000   $5.9 \times 10^{-39}$
  Largest normalized number   1111 1110   111 1111 1111 1111 1111 1111   $3.4 \times 10^{38}$
  Smallest normalized number  0000 0001   000 0000 0000 0000 0000 0000   $1.18 \times 10^{-38}$
  Infinity                    1111 1111   000 0000 0000 0000 0000 0000   Infinity
  NaN                         1111 1111   010 0000 0000 0000 0000 0000   NaN
Summary
• Floating point numbers represent very large and very small numbers, including numbers with fractions
• Number formats are different from 2's complement
  • Requires some memorization
• Addition requires aligning, adding, and then normalizing the result
• Do examples!
  • The best way to learn floating point operations