Floating point representation
(Unsigned) Fixed-point representation
Numbers are stored with a fixed number of bits for the integer part and a fixed number of bits for the fractional part. Suppose we have 8 bits to store a real number, where 5 bits store the integer part and 3 bits store the fractional part:

  1 0 1 1 1 . 0 1 1
with bit weights
  2^4 2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3

Smallest number: 00000.001_2 = 0.125
Largest number: 11111.111_2 = 31.875
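The decoding above can be sketched in Python (the helper name `fixed_to_decimal` is illustrative, not standard library code):

```python
def fixed_to_decimal(bits, int_bits, frac_bits):
    """Decode an unsigned fixed-point bit string (e.g. '10111011')
    with int_bits integer bits followed by frac_bits fractional bits."""
    assert len(bits) == int_bits + frac_bits
    value = int(bits, 2)             # read all bits as one integer
    return value / (2 ** frac_bits)  # shift the binary point left

# 10111.011_2 from the slide: 16 + 4 + 2 + 1 + 0.25 + 0.125 = 23.375
print(fixed_to_decimal("10111011", 5, 3))   # 23.375
print(fixed_to_decimal("00000001", 5, 3))   # smallest: 0.125
print(fixed_to_decimal("11111111", 5, 3))   # largest: 31.875
```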
(Unsigned) Fixed-point representation
Suppose we have 64 bits to store a real number, where 32 bits store the integer part and 32 bits store the fractional part:

  a_31 ... a_2 a_1 a_0 . b_1 b_2 b_3 ... b_32

x = Σ_{k=0}^{31} a_k × 2^k + Σ_{k=1}^{32} b_k × 2^{-k}
  = a_31 × 2^31 + a_30 × 2^30 + ... + a_0 × 2^0 + b_1 × 2^{-1} + b_2 × 2^{-2} + ... + b_32 × 2^{-32}

Smallest number: a_k = 0 for all k, b_1, b_2, ..., b_31 = 0 and b_32 = 1 → 2^{-32} ≈ 10^{-10}
Largest number: a_k = 1 for all k and b_k = 1 for all k → 2^31 + ... + 2^0 + 2^{-1} + ... + 2^{-32} ≈ 10^9
(Unsigned) Fixed-point representation
[Number-line figure: with 32 integer bits and 32 fractional bits, the representable positive numbers span roughly 10^{-10} (smallest) to 10^9 (largest) around 0.]
(Unsigned) Fixed-point representation
Range: the difference between the largest and smallest numbers possible. More bits for the integer part → increase range.
Precision: the smallest possible difference between any two numbers. More bits for the fractional part → increase precision.

  b_2 b_1 b_0 . b_-1 b_-2 b_-3   OR   b_1 b_0 . b_-1 b_-2 b_-3 b_-4

Wherever we put the binary point, there is a trade-off between the amount of range and precision. It can be hard to decide how much you need of each!
Fix: let the binary point "float".
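The trade-off can be made concrete: for a fixed total of 8 bits, each split of integer vs. fractional bits gives a different (precision, range) pair. A small Python sketch (the helper name `fixed_range` is illustrative):

```python
def fixed_range(int_bits, frac_bits):
    """Precision (smallest gap) and largest value of an unsigned
    fixed-point format with the given bit split."""
    step = 2.0 ** -frac_bits          # precision: gap between neighbors
    largest = 2.0 ** int_bits - step  # range: all bits set to 1
    return step, largest

print(fixed_range(5, 3))   # (0.125, 31.875)   -- more range
print(fixed_range(3, 5))   # (0.03125, 7.96875) -- more precision
```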
Floating-point numbers
A floating-point number can represent numbers of different orders of magnitude (very large and very small) with the same fixed number of digits. In general, in the binary system, a floating-point number can be expressed as

  x = ± q × 2^m

where q is the significand, normally a value in the range [1.0, 2.0), and m is the exponent.
Floating-point numbers
Numerical form:

  x = ± q × 2^m = ± b_0 . b_1 b_2 b_3 ... b_n × 2^m

Fractional part of significand: n digits, b_i ∈ {0, 1}
Exponent range: m ∈ [L, U]
Precision: p = n + 1
"Floating" the binary point

  1011.1_2 = 1×8 + 0×4 + 1×2 + 1×1 + 1×(1/2) = 11.5_10
  10111_2 = 1×16 + 0×8 + 1×4 + 1×2 + 1×1 = 23_10 = 1011.1_2 × 2^1
  101.11_2 = 1×4 + 0×2 + 1×1 + 1×(1/2) + 1×(1/4) = 5.75_10 = 1011.1_2 × 2^{-1}

Move the binary point to the left by one bit position: divide the decimal number by 2.
Move the binary point to the right by one bit position: multiply the decimal number by 2.
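The shift rule is easy to verify in Python, treating the bit string and the position of the binary point separately (a minimal sketch):

```python
# 1011.1_2: read the bits as an integer, then divide by 2^1
# because the binary point sits one place from the right.
x = int("10111", 2) / 2**1
print(x)        # 11.5  (= 1011.1_2)
print(x * 2)    # 23.0  (point moved right: 10111_2)
print(x / 2)    # 5.75  (point moved left: 101.11_2)
```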
Converting floating points
Convert (39.6875)_10 = 100111.1011_2 into floating-point representation:

  (39.6875)_10 = 100111.1011_2 = 1.001111011_2 × 2^5
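The normalization step can be sketched in Python (the function name `normalize` and the 10-digit cutoff are illustrative choices; exact for this example):

```python
import math

def normalize(x, frac_bits=10):
    """Rewrite x > 0 as 1.f x 2^m; frac_bits limits the printed digits."""
    m = math.floor(math.log2(x))   # exponent that puts x in [1, 2)
    q = x / 2**m                   # significand in [1, 2)
    frac = int((q - 1) * 2**frac_bits)  # fractional bits of q
    return f"1.{frac:0{frac_bits}b}_2 x 2^{m}"

print(normalize(39.6875))   # 1.0011110110_2 x 2^5
```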
Normalized floating-point numbers
Normalized floating-point numbers are expressed as

  x = ± 1.b_1 b_2 b_3 ... b_n × 2^m = ± 1.f × 2^m

where f is the fractional part of the significand, m is the exponent, and b_i ∈ {0, 1}.

Hidden bit representation: the first bit to the left of the binary point, b_0 = 1, does not need to be stored, since its value is fixed. This representation "adds" 1 bit of precision (we will show some exceptions later, including the representation of the number zero).
Iclicker question
Determine the normalized floating-point representation 1.f × 2^m of the decimal number x = 47.125 (f in binary representation and m in decimal).

A) 1.01110001_2 × 2^4
B) 1.01110001_2 × 2^5
C) 1.01111001_2 × 2^4
D) 1.01111001_2 × 2^5
Normalized floating-point numbers

  x = ± q × 2^m = ± 1.b_1 b_2 b_3 ... b_n × 2^m = ± 1.f × 2^m

• Exponent range: [L, U]
• Precision: p = n + 1
• Smallest positive normalized FP number: UFL = 2^L
• Largest positive normalized FP number: OFL = 2^{U+1} (1 − 2^{-p})
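These two formulas are easy to check numerically; a small sketch (the helper names `ufl` and `ofl` are illustrative), tested on a small system with n = 2 fractional bits and m ∈ [−4, 4]:

```python
def ufl(L):
    """Smallest positive normalized number: 1.00...0 x 2^L."""
    return 2.0 ** L

def ofl(U, p):
    """Largest normalized number: all-ones significand at the top exponent."""
    return 2.0 ** (U + 1) * (1 - 2.0 ** -p)

# system with n = 2 fractional bits (p = 3) and exponent range [-4, 4]
print(ufl(-4))     # 0.0625
print(ofl(4, 3))   # 28.0
```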
Normalized floating point number scale
[Number-line figure: representable numbers from −∞ to +∞, with a gap around 0.]
Floating-point numbers: Simple example
A "toy" number system can be represented as x = ±1.b_1 b_2 × 2^m for m ∈ [−4, 4] and b_i ∈ {0, 1}.

  1.00_2 × 2^0 = 1.0      1.00_2 × 2^1 = 2.0      1.00_2 × 2^2 = 4.0
  1.01_2 × 2^0 = 1.25     1.01_2 × 2^1 = 2.5      1.01_2 × 2^2 = 5.0
  1.10_2 × 2^0 = 1.5      1.10_2 × 2^1 = 3.0      1.10_2 × 2^2 = 6.0
  1.11_2 × 2^0 = 1.75     1.11_2 × 2^1 = 3.5      1.11_2 × 2^2 = 7.0

  1.00_2 × 2^3 = 8.0      1.00_2 × 2^4 = 16.0     1.00_2 × 2^{-1} = 0.5
  1.01_2 × 2^3 = 10.0     1.01_2 × 2^4 = 20.0     1.01_2 × 2^{-1} = 0.625
  1.10_2 × 2^3 = 12.0     1.10_2 × 2^4 = 24.0     1.10_2 × 2^{-1} = 0.75
  1.11_2 × 2^3 = 14.0     1.11_2 × 2^4 = 28.0     1.11_2 × 2^{-1} = 0.875

  1.00_2 × 2^{-2} = 0.25     1.00_2 × 2^{-3} = 0.125     1.00_2 × 2^{-4} = 0.0625
  1.01_2 × 2^{-2} = 0.3125   1.01_2 × 2^{-3} = 0.15625   1.01_2 × 2^{-4} = 0.078125
  1.10_2 × 2^{-2} = 0.375    1.10_2 × 2^{-3} = 0.1875    1.10_2 × 2^{-4} = 0.09375
  1.11_2 × 2^{-2} = 0.4375   1.11_2 × 2^{-3} = 0.21875   1.11_2 × 2^{-4} = 0.109375

The same steps are performed to obtain the negative numbers. For simplicity, we show only the positive numbers in this example.
x = ±1.b_1 b_2 × 2^m for m ∈ [−4, 4] and b_i ∈ {0, 1}
• Smallest normalized positive number: 1.00_2 × 2^{-4} = 0.0625
• Largest normalized positive number: 1.11_2 × 2^4 = 28.0
• Any number x closer to zero than 0.0625 would UNDERFLOW to zero.
• Any number x outside the range −28.0 and +28.0 would OVERFLOW to infinity.
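The toy system is small enough to enumerate exhaustively; a minimal Python sketch:

```python
# Every positive machine number 1.b1 b2 x 2^m with m in [-4, 4]
values = sorted(
    (1 + b1 / 2 + b2 / 4) * 2.0**m
    for b1 in (0, 1) for b2 in (0, 1) for m in range(-4, 5)
)
print(len(values))   # 36 positive machine numbers
print(values[0])     # 0.0625 (UFL)
print(values[-1])    # 28.0   (OFL)
```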
Machine epsilon
• Machine epsilon (ε_m) is defined as the distance (gap) between 1 and the next larger floating-point number.

x = ±1.b_1 b_2 × 2^m for m ∈ [−4, 4] and b_i ∈ {0, 1}

  1.00_2 × 2^0 = 1
  1.01_2 × 2^0 = 1.25
  ε_m = 0.01_2 × 2^0 = 0.25
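For this toy system the gap follows directly from the number of significand digits; a tiny sketch:

```python
# Flipping the last of the n = 2 significand bits moves
# 1.00_2 x 2^0 to 1.01_2 x 2^0 = 1.25, so eps_m = 2^-n.
n = 2
eps_m = 2.0 ** -n
print(eps_m)       # 0.25
print(1 + eps_m)   # 1.25, the next machine number after 1
```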
Machine numbers: how are floating-point numbers stored?
Floating-point number representation
What do we need to store when representing floating-point numbers in a computer?

  x = ± 1.f × 2^m  →  store the sign (±), the exponent (m), and the significand (f)

Initially, different floating-point representations were used in computers, generating inconsistent program behavior across different machines. Around the 1980s, computer manufacturers started adopting a standard representation for floating-point numbers: the IEEE (Institute of Electrical and Electronics Engineers) 754 standard.
Floating-point number representation
Numerical form: x = (−1)^s 1.f × 2^m

Representation in memory:

  | s (sign) | c (exponent) | f (significand) |

where the stored exponent is shifted: c = m + shift.
Precisions (finite representation: not all numbers can be represented exactly!):

IEEE-754 single precision (32 bits):
  | sign (1-bit) | exponent c = m + 127 (8-bit) | significand f (23-bit) |

IEEE-754 double precision (64 bits):
  | sign (1-bit) | exponent c = m + 1023 (11-bit) | significand f (52-bit) |
Special values: x = (−1)^s 1.f × 2^m, stored as | s | c | f |

1) Zero: | s | 000...000 | 0000......0000 |
2) Infinity: +∞ (s = 0) and −∞ (s = 1): | s | 111...111 | 0000......0000 |
3) NaN (results from operations with undefined results): | s | 111...111 | anything ≠ 00...00 |

Note that the exponents c = 000...000 and c = 111...111 are reserved for these special cases, which limits the exponent range for the other numbers.
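These reserved patterns can be inspected directly with Python's struct module; a short sketch (the helper name `bits32` is illustrative):

```python
import struct

def bits32(x):
    """IEEE-754 single-precision bits of x as 'sign exponent fraction'."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{b:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(bits32(0.0))            # 0 00000000 000...0
print(bits32(float("inf")))   # 0 11111111 000...0
print(bits32(float("nan")))   # exponent all ones, nonzero fraction
```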
IEEE-754 Single Precision (32-bit)
x = (−1)^s 1.f × 2^m, stored as | sign (1-bit) | exponent c = m + 127 (8-bit) | significand f (23-bit) |

s = 0: positive sign, s = 1: negative sign
Reserved exponent values for special cases: c = 11111111_2 = 255 and c = 00000000_2 = 0, therefore 0 < c < 255.
The largest exponent is U = 254 − 127 = 127
The smallest exponent is L = 1 − 127 = −126
IEEE-754 Single Precision (32-bit)
x = (−1)^s 1.f × 2^m
Example: represent the number x = −67.125 using the IEEE single-precision standard.

  67.125 = 1000011.001_2 = 1.000011001_2 × 2^6
  c = 6 + 127 = 133 = 10000101_2

  | 1 | 10000101 | 00001100100000...000 |
   (1-bit) (8-bit)  (23-bit)
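We can sanity-check this bit pattern with Python's struct module (a quick sketch):

```python
import struct

# Pack -67.125 as a big-endian IEEE-754 single and inspect its 32 bits
(b,) = struct.unpack(">I", struct.pack(">f", -67.125))
s = f"{b:032b}"
print(s[0])     # sign: 1
print(s[1:9])   # exponent: 10000101  (133 = 6 + 127)
print(s[9:])    # significand: 00001100100000000000000
```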
IEEE-754 Single Precision (32-bit)
x = (−1)^s 1.f × 2^m, with c = m + 127

• Machine epsilon (ε_m): the distance (gap) between 1 and the next larger floating-point number.
    1       = 0 01111111 00000000000000000000000
    1 + ε_m = 0 01111111 00000000000000000000001
    ε_m = 2^{-23} ≈ 1.2 × 10^{-7}
• Smallest positive normalized FP number: UFL = 2^L = 2^{-126} ≈ 1.2 × 10^{-38}
• Largest positive normalized FP number: OFL = 2^{U+1} (1 − 2^{-p}) = 2^{128} (1 − 2^{-24}) ≈ 3.4 × 10^{38}
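These three constants follow directly from p = 24, L = −126, U = 127; a quick numerical check:

```python
# IEEE-754 single precision: p = 24 (23 stored fraction bits),
# exponent range L = -126, U = 127
eps = 2.0 ** -23                      # machine epsilon
ufl = 2.0 ** -126                     # smallest positive normalized
ofl = 2.0 ** 128 * (1 - 2.0 ** -24)   # largest finite value
print(eps)   # ~1.19e-07
print(ufl)   # ~1.18e-38
print(ofl)   # ~3.40e+38
```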