Floating point representation



  1. Floating point representation

  2. (Unsigned) Fixed-point representation: The numbers are stored with a fixed number of bits for the integer part and a fixed number of bits for the fractional part. Suppose we have 8 bits to store a real number, where 5 bits store the integer part and 3 bits store the fractional part, with bit weights $2^4\ 2^3\ 2^2\ 2^1\ 2^0 \,.\, 2^{-1}\ 2^{-2}\ 2^{-3}$. For example, $10111.011_2 = 16 + 4 + 2 + 1 + 0.25 + 0.125 = 23.375$. Smallest positive number: $00000.001_2 = 0.125$. Largest number: $11111.111_2 = 31.875$.
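
As a quick check of the arithmetic above, here is a minimal Python sketch (my own addition, not part of the slides; the helper name fixed_point_value is made up) that decodes an 8-bit unsigned fixed-point pattern with 5 integer bits and 3 fractional bits:

```python
def fixed_point_value(bits: str, frac_bits: int = 3) -> float:
    # Interpret the bit string as an integer, then shift the binary point
    # left by frac_bits positions (i.e. divide by 2**frac_bits).
    return int(bits, 2) / 2**frac_bits

print(fixed_point_value("10111011"))  # 10111.011_2 = 23.375
print(fixed_point_value("00000001"))  # smallest positive value: 0.125
print(fixed_point_value("11111111"))  # largest value: 31.875
```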

  3. (Unsigned) Fixed-point representation: Suppose we have 64 bits to store a real number, where 32 bits store the integer part and 32 bits store the fractional part: $a_{31} \ldots a_2 a_1 a_0 \,.\, b_1 b_2 b_3 \ldots b_{32} = \sum_{i=0}^{31} a_i 2^i + \sum_{i=1}^{32} b_i 2^{-i} = a_{31} \times 2^{31} + \cdots + a_0 \times 2^0 + b_1 \times 2^{-1} + \cdots + b_{32} \times 2^{-32}$. Smallest positive number: $a_i = 0$ for all $i$, $b_1 = b_2 = \cdots = b_{31} = 0$ and $b_{32} = 1$, giving $2^{-32} \approx 10^{-10}$. Largest number: $a_i = 1$ and $b_i = 1$ for all $i$, giving $2^{31} + \cdots + 2^0 + 2^{-1} + \cdots + 2^{-32} \approx 10^{9}$.
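
A one-line sanity check of these bounds (my own aside, not from the slides):

```python
smallest = 2.0**-32           # only b_32 is set
largest = 2.0**32 - 2.0**-32  # all 64 bits set: 2^31 + ... + 2^0 + 2^-1 + ... + 2^-32
print(smallest)               # 2.3283064365386963e-10, about 1e-10
print(largest)                # prints 4294967296.0: the exact value (~4.3e9) is not
                              # representable in double precision and rounds up
```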

  4. (Unsigned) Fixed-point representation: With 64 bits split into 32 integer bits and 32 fractional bits, $a_{31} \ldots a_1 a_0 \,.\, b_1 b_2 \ldots b_{32} = \sum_{i=0}^{31} a_i 2^i + \sum_{i=1}^{32} b_i 2^{-i}$. Smallest positive number $\approx 10^{-10}$, largest number $\approx 10^{9}$. [Figure: number line from 0 to $\infty$ marking the representable range.]

  5. (Unsigned) Fixed-point representation: Range is the difference between the largest and smallest numbers that can be represented; more bits for the integer part increase the range. Precision is the smallest possible difference between any two numbers; more bits for the fractional part increase the precision. For example, with 6 bits we could use $a_2 a_1 a_0 \,.\, b_1 b_2 b_3$ or $a_1 a_0 \,.\, b_1 b_2 b_3 b_4$. Wherever we put the binary point, there is a trade-off between the amount of range and precision, and it can be hard to decide how much you need of each! Fix: let the binary point "float".
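
The trade-off can be made concrete by comparing the two 6-bit layouts mentioned above; this comparison is my own illustration, not slide material:

```python
def largest_and_step(int_bits: int, frac_bits: int):
    step = 2.0**-frac_bits          # precision: gap between adjacent values
    largest = 2.0**int_bits - step  # range: all bits set
    return largest, step

print(largest_and_step(3, 3))  # (7.875, 0.125)   more range, less precision
print(largest_and_step(2, 4))  # (3.9375, 0.0625) less range, more precision
```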

  6. Floating-point numbers: A floating-point number can represent numbers of different orders of magnitude (very large and very small) with the same number of fixed digits. In general, in the binary system, a floating point number can be expressed as $x = \pm q \times 2^m$, where $q$ is the significand, normally a value in the range $[1.0, 2.0)$, and $m$ is the exponent.
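
The same significand-times-power-of-two structure can be inspected for Python floats with the standard library. Note that math.frexp normalizes the significand to $[0.5, 1.0)$ rather than $[1.0, 2.0)$, so doubling it and decrementing the exponent converts between the two conventions (an added aside, not from the slides):

```python
import math

q, m = math.frexp(39.6875)  # 39.6875 == q * 2**m with 0.5 <= q < 1.0
print(q, m)                 # 0.6201171875 6
print(2 * q, m - 1)         # 1.240234375 5, i.e. significand in [1.0, 2.0)
```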

  7. Floating-point numbers: Numerical form: $x = \pm q \times 2^m = \pm b_0 . b_1 b_2 b_3 \ldots b_n \times 2^m$, where the fractional part of the significand has $n$ digits and $b_i \in \{0, 1\}$. Exponent range: $m \in [L, U]$. Precision: $p = n + 1$.

  8. "Floating" the binary point: $1011.1_2 = 1 \times 8 + 0 \times 4 + 1 \times 2 + 1 \times 1 + 1 \times \tfrac{1}{2} = 11.5_{10}$; $10111_2 = 1 \times 16 + 0 \times 8 + 1 \times 4 + 1 \times 2 + 1 \times 1 = 23_{10} = 1011.1_2 \times 2^{1}$; $101.11_2 = 1 \times 4 + 0 \times 2 + 1 \times 1 + 1 \times \tfrac{1}{2} + 1 \times \tfrac{1}{4} = 5.75_{10} = 1011.1_2 \times 2^{-1}$. Moving the binary point one position to the left divides the decimal number by 2; moving it one position to the right multiplies the decimal number by 2.
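
A quick check of the multiply/divide-by-2 rule in Python (my own addition):

```python
x = 0b10111 / 2  # 1011.1_2, the binary point sits one place left of 10111_2
print(x)         # 11.5
print(x * 2)     # 23.0  -> point moved one position to the right (10111_2)
print(x / 2)     # 5.75  -> point moved one position to the left (101.11_2)
```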

  9. Converting floating points: Convert $(39.6875)_{10} = (100111.1011)_2$ into floating point representation: $(39.6875)_{10} = (100111.1011)_2 = 1.001111011_2 \times 2^{5}$.
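
Python's float.hex shows the same normalized form, with the significand written in hexadecimal and the binary exponent after the 'p' (an added check, not from the slides):

```python
print((39.6875).hex())  # '0x1.3d80000000000p+5'
# 0x3d8 = 0011 1101 1000_2, so this is 1.001111011_2 x 2**5, as above
```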

  10. Normalized floating-point numbers: Normalized floating point numbers are expressed as $x = \pm 1.b_1 b_2 b_3 \ldots b_n \times 2^m = \pm 1.f \times 2^m$, where $f$ is the fractional part of the significand, $m$ is the exponent and $b_i \in \{0, 1\}$. Hidden bit representation: the first bit to the left of the binary point, $b_0 = 1$, does not need to be stored, since its value is fixed. This representation "adds" one bit of precision (we will show some exceptions later, including the representation of the number zero).

  11. iClicker question: Determine the normalized floating point representation $1.f \times 2^m$ of the decimal number $x = 47.125$ ($f$ in binary representation and $m$ in decimal). A) $1.01110001_2 \times 2^{5}$; B) $1.01110001_2 \times 2^{4}$; C) $1.01111001_2 \times 2^{5}$; D) $1.01111001_2 \times 2^{4}$

  12. Normalized floating-point numbers: $x = \pm q \times 2^m = \pm 1.b_1 b_2 b_3 \ldots b_n \times 2^m = \pm 1.f \times 2^m$. Exponent range: $m \in [L, U]$. Precision: $p = n + 1$. Smallest positive normalized FP number: $\mathrm{UFL} = 2^{L}$. Largest positive normalized FP number: $\mathrm{OFL} = 2^{U+1}(1 - 2^{-p})$.
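
Both formulas are easy to sanity-check numerically; the sketch below (my addition) evaluates them for the "toy" system introduced two slides down, which has $p = 3$ and $m \in [-4, 4]$:

```python
def ufl(L: int) -> float:
    # smallest positive normalized number: 1.00...0_2 x 2**L
    return 2.0**L

def ofl(U: int, p: int) -> float:
    # largest normalized number: 1.11...1_2 (p significant bits) x 2**U
    return 2.0**(U + 1) * (1 - 2.0**-p)

print(ufl(-4), ofl(4, 3))  # 0.0625 28.0, matching the toy system below
```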

  13. Normalized floating point number scale. [Figure: number line from $-\infty$ to $+\infty$ with 0 marked, showing where the normalized numbers lie.]

  14. Floating-point numbers: Simple example. A "toy" number system can be represented as $x = \pm 1.b_1 b_2 \times 2^m$ for $m \in [-4, 4]$ and $b_i \in \{0, 1\}$. The positive numbers are:

         m    1.00_2    1.01_2      1.10_2    1.11_2
        -4    0.0625    0.078125    0.09375   0.109375
        -3    0.125     0.15625     0.1875    0.21875
        -2    0.25      0.3125      0.375     0.4375
        -1    0.5       0.625       0.75      0.875
         0    1.0       1.25        1.5       1.75
         1    2.0       2.5         3.0       3.5
         2    4.0       5.0         6.0       7.0
         3    8.0       10.0        12.0      14.0
         4    16.0      20.0        24.0      28.0

      The same steps are performed to obtain the negative numbers. For simplicity, we will show only the positive numbers in this example.
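
The table above can be reproduced by enumerating the system programmatically; this short sketch is my own addition:

```python
# All positive numbers of the form 1.b1b2_2 x 2**m with m in [-4, 4].
values = sorted((1 + b1 / 2 + b2 / 4) * 2.0**m
                for m in range(-4, 5)
                for b1 in (0, 1)
                for b2 in (0, 1))
print(len(values))            # 36 positive machine numbers
print(values[0], values[-1])  # 0.0625 28.0
```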

  15. $x = \pm 1.b_1 b_2 \times 2^m$ for $m \in [-4, 4]$ and $b_i \in \{0, 1\}$. Smallest normalized positive number: $1.00_2 \times 2^{-4} = 0.0625$. Largest normalized positive number: $1.11_2 \times 2^{4} = 28.0$. Any number $x$ closer to zero than 0.0625 would UNDERFLOW to zero. Any number $x$ outside the range $-28.0$ to $+28.0$ would OVERFLOW to infinity.

  16. Machine epsilon: Machine epsilon ($\epsilon_m$) is defined as the distance (gap) between 1 and the next larger floating point number. In the toy system $x = \pm 1.b_1 b_2 \times 2^m$ for $m \in [-4, 4]$ and $b_i \in \{0, 1\}$: $1.00_2 \times 2^0 = 1$ and the next larger number is $1.01_2 \times 2^0 = 1.25$, so $\epsilon_m = 0.01_2 \times 2^0 = 0.25$.
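
For IEEE double precision, which is what a Python float uses, the same quantity is exposed in the standard library (an added aside, not slide material):

```python
import sys

print(sys.float_info.epsilon)              # 2.220446049250313e-16, i.e. 2**-52
print(1.0 + sys.float_info.epsilon > 1.0)  # True: 1 + eps is the next float after 1.0
```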

  17. Machine numbers: how are floating point numbers stored?

  18. Floating-point number representation: What do we need to store when representing floating point numbers in a computer? For $x = \pm 1.f \times 2^m$ we need the sign, the exponent $m$ and the significand $f$. Initially, different floating-point representations were used in computers, generating inconsistent program behavior across different machines. In the 1980s, computer manufacturers started adopting a standard representation for floating-point numbers: the IEEE (Institute of Electrical and Electronics Engineers) 754 standard.

  19. Floating-point number representation: Numerical form: $x = \pm 1.f \times 2^m$. Representation in memory: three fields $|\,s\,|\,c\,|\,f\,|$ (sign, exponent, significand), interpreted as $x = (-1)^{s}\, 1.f \times 2^{c - \mathrm{shift}}$; that is, the exponent is stored in biased form, $m = c - \mathrm{shift}$.

  20. Precisions (finite representation: not all numbers can be represented exactly!). IEEE-754 single precision (32 bits): sign (1 bit), exponent $c$ (8 bits), significand $f$ (23 bits), with $c = m + 127$. IEEE-754 double precision (64 bits): sign (1 bit), exponent $c$ (11 bits), significand $f$ (52 bits), with $c = m + 1023$.
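
The three fields can be pulled out of an actual 32-bit float with the standard struct module; the sketch below is my own illustration of the single-precision layout and the bias of 127 (the helper name float32_fields is made up):

```python
import struct

def float32_fields(x: float):
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    s = bits >> 31           # 1-bit sign
    c = (bits >> 23) & 0xFF  # 8-bit stored (biased) exponent
    f = bits & 0x7FFFFF      # 23-bit fractional significand
    return s, c, f

s, c, f = float32_fields(39.6875)
print(s, c, c - 127, format(f, "023b"))  # 0 132 5 00111101100000000000000
```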

  21. Special values: for $x = (-1)^{s}\, 1.f \times 2^{m}$ stored as $|\,s\,|\,c\,|\,f\,|$:
      1) Zero: $c = 000\ldots000$ and $f = 0000\ldots0000$ (either sign $s$).
      2) Infinity: $c = 111\ldots111$ and $f = 0000\ldots0000$, giving $+\infty$ ($s = 0$) and $-\infty$ ($s = 1$).
      3) NaN (results from operations with undefined results): $c = 111\ldots111$ and $f \neq 00\ldots00$ (any nonzero fraction).
      Note that the exponents $c = 000\ldots000$ and $c = 111\ldots111$ are reserved for these special cases, which limits the exponent range for the other numbers.
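
These reserved bit patterns are easy to see by printing the raw fields of a few 32-bit floats; this sketch is my own aside (the helper name bits32 is made up):

```python
import struct

def bits32(x: float) -> str:
    # 32-bit pattern split as sign | exponent | fraction
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{b >> 31} {(b >> 23) & 0xFF:08b} {b & 0x7FFFFF:023b}"

print(bits32(0.0))            # 0 00000000 00000000000000000000000
print(bits32(float("inf")))   # 0 11111111 00000000000000000000000
print(bits32(float("-inf")))  # 1 11111111 00000000000000000000000
print(bits32(float("nan")))   # 0 11111111 <nonzero fraction>
```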

  22. IEEE-754 Single Precision (32-bit): $x = (-1)^{s}\, 1.f \times 2^{m}$, stored with a 1-bit sign, an 8-bit exponent $c = m + 127$ and a 23-bit significand. $s = 0$: positive sign, $s = 1$: negative sign. Reserved exponents for the special cases: $c = 11111111_2 = 255$ and $c = 00000000_2 = 0$; therefore $0 < c < 255$. The largest exponent is $U = 254 - 127 = 127$ and the smallest exponent is $L = 1 - 127 = -126$.

  23. IEEE-754 Single Precision (32-bit): $x = (-1)^{s}\, 1.f \times 2^{m}$. Example: represent the number $x = -67.125$ using the IEEE single-precision standard. $67.125 = 1000011.001_2 = 1.000011001_2 \times 2^{6}$, so $c = 6 + 127 = 133 = 10000101_2$. The stored fields are the sign bit $1$, the 8-bit exponent $10000101$ and the 23-bit significand $00001100100000 \ldots 000$.
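
The worked example can be verified directly against the packed bytes (my own check, not from the slides):

```python
import struct

(bits,) = struct.unpack(">I", struct.pack(">f", -67.125))
print(format(bits, "032b"))
# 11000010100001100100000000000000
# = 1 | 10000101 | 00001100100000000000000 (sign, exponent 133, significand)
```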

  24. IEEE-754 Single Precision (32-bit): $x = (-1)^{s}\, 1.f \times 2^{m}$, with $c = m + 127$ stored as $|\,s\,|\,c\,|\,f\,|$. Machine epsilon ($\epsilon_m$) is defined as the distance (gap) between 1 and the next larger floating point number: $1_{10}$ = 0 01111111 00000000000000000000000 and $1_{10} + \epsilon_m$ = 0 01111111 00000000000000000000001, so $\epsilon_m = 2^{-23} \approx 1.2 \times 10^{-7}$. Smallest positive normalized FP number: $\mathrm{UFL} = 2^{L} = 2^{-126} \approx 1.2 \times 10^{-38}$. Largest positive normalized FP number: $\mathrm{OFL} = 2^{U+1}(1 - 2^{-p}) = 2^{128}(1 - 2^{-24}) \approx 3.4 \times 10^{38}$.
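
These three single-precision constants can also be read off programmatically, assuming NumPy is available (an added aside):

```python
import numpy as np

info = np.finfo(np.float32)
print(info.eps)   # 1.1920929e-07, i.e. 2**-23
print(info.tiny)  # 1.1754944e-38, i.e. 2**-126 (UFL)
print(info.max)   # 3.4028235e+38 (OFL)
```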
