Mathematical Preliminaries and Error Analysis Instructor: Wei-Cheng Wang 1 Department of Mathematics National TsingHua University Fall 2011 1These slides are based on Prof. Tsung-Ming Huang(NTNU)’s original slides
Error Algorithms and Convergence Outline Round-off errors and computer arithmetic 1 IEEE standard floating-point format Absolute and Relative Errors Machine Epsilon Loss of Significance Algorithms and Convergence 2 Algorithm Stability Rate of convergence
Error Algorithms and Convergence IEEE standard floating-point format Terminologies binary: 二 進 位 , decimal: 十 進 位 , hexadecimal: 十 六 進 位 exponent: 指 數 , mantissa: 尾 數 floating point numbers: 浮 點 數 chopping: 無 條 件 捨 去 , rounding: 四 捨 五 入 (X 捨 Y 入 ) single precision: 單 精 度 , double precision: 雙 精 度 round-off error: 捨 入 誤 差 significant digits: 有 效 位 數 loss of significance: 有 效 位 數 喪 失
Error Algorithms and Convergence IEEE standard floating-point format Example What is the binary representation of 2 3 ? Solution: To determine the binary representation for 2 3 , we write 2 3 = (0 .a 1 a 2 a 3 . . . ) 2 . Multiply by 2 to obtain 4 3 = ( a 1 .a 2 a 3 . . . ) 2 . Therefore, we get a 1 = 1 by taking the integer part of both sides.
Error Algorithms and Convergence IEEE standard floating-point format Subtracting 1, we have 1 3 = (0 .a 2 a 3 a 4 . . . ) 2 . Repeating the previous step, we arrive at 2 3 = (0 . 101010 . . . ) 2 .
Error Algorithms and Convergence IEEE standard floating-point format In the computational world, each representable number has only a fixed and finite number of digits. For any real number x , let x = ± 1 .a 1 a 2 · · · a t a t +1 a t +2 · · · × 2 m , denote the normalized scientific binary representation of x . In 1985, the IEEE (Institute for Electrical and Electronic Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985 . In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufactures to design floating-point hardware.
Error Algorithms and Convergence IEEE standard floating-point format Single precision The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ± q × 2 m as shown in the following figure. sign of mantissa 8 bits exponent 23 bits normalized mantissa 0 1 8 9 31 The first bit is a sign indicator, denoted s . It is followed by an 8-bit exponent c and a 23-bit mantissa f . The base for the exponent and mantissa is 2, and the actual exponent is c − 127 . The value of c is restricted to 0 ≤ c ≤ 255 .
Error Algorithms and Convergence IEEE standard floating-point format The actual exponent of the number is restricted to − 127 ≤ c − 127 ≤ 128 . A normalization is imposed so that the leading digit in fraction be 1, and this digit ”1” is not stored as part of the 23-bit mantissa f . The resulting floating-point number takes the form ( − 1) s 2 c − 127 (1 + f ) .
Error Algorithms and Convergence IEEE standard floating-point format Example What is the decimal number of the machine number 01000000101000000000000000000000? The leftmost bit is zero, which indicates that the number is 1 positive. The next 8 bits, 10000001 , are equivalent to 2 c = 1 · 2 7 + 0 · 2 6 + · · · + 0 · 2 1 + 1 · 2 0 = 129 . The exponential part of the number is 2 129 − 127 = 2 2 . The final 23 bits specify the mantissa: 3 f = 0 · (2) − 1 + 1 · (2) − 2 + 0 · (2) − 3 + · · · + 0 · (2) − 23 = 0 . 25 . Consequently, this machine number precisely represents 4 the decimal number ( − 1) s 2 c − 127 (1 + f ) = 2 2 · (1 + 0 . 25) = 5 .
Error Algorithms and Convergence IEEE standard floating-point format Example What is the decimal number of the machine number 01000000100111111111111111111111? The final 23 bits specify that the mantissa is 1 0 · (2) − 1 + 0 · (2) − 2 + 1 · (2) − 3 + · · · + 1 · (2) − 23 f = = 0 . 2499998807907105 . Consequently, this machine number precisely represents 2 the decimal number 2 2 · (1 + 0 . 2499998807907105) ( − 1) s 2 c − 127 (1 + f ) = = 4 . 999999523162842 .
Error Algorithms and Convergence IEEE standard floating-point format Example What is the decimal number of the machine number 01000000101000000000000000000001? The final 23 bits specify that the mantissa is 1 0 · 2 − 1 + 1 · 2 − 2 + 0 · 2 − 3 + · · · + 0 · 2 − 22 + 1 · 2 − 23 f = = 0 . 2500001192092896 . Consequently, this machine number precisely represents 2 the decimal number 2 2 · (1 + 0 . 2500001192092896) ( − 1) s 2 c − 127 (1 + f ) = = 5 . 000000476837158 .
Error Algorithms and Convergence IEEE standard floating-point format Summary Above three examples 01000000100111111111111111111111 ⇒ 4 . 999999523162842 01000000101000000000000000000000 ⇒ 5 01000000101000000000000000000001 ⇒ 5 . 000000476837158 Only a relatively small subset of the real number system is used for the representation of all the real numbers. This subset, which are called the floating-point numbers , contains only rational numbers, both positive and negative. When a number can not be represented exactly with the fixed finite number of digits in a computer, a near-by floating-point number is chosen for approximate representation.
Error Algorithms and Convergence IEEE standard floating-point format The smallest (normalized) positive number Let s = 0 , c = 1 and f = 0 . This corresponds to 2 − 126 · (1 + 0) ≈ 1 . 175 × 10 − 38 The largest number Let s = 0 , c = 254 and f = 1 − 2 − 23 which is equivalent to 2 127 · (2 − 2 − 23 ) ≈ 3 . 403 × 10 38 Definition If a number x with | x | < 2 − 126 · (1 + 0) , then we say that an underflow has occurred and is generally set to zero. It is sometimes referred to as an IEEE ’subnormal’ or ’denormal’ and corresponds to c = 0 . If | x | > 2 127 · (2 − 2 − 23 ) , then we say that an overflow has occurred.
Error Algorithms and Convergence IEEE standard floating-point format Double precision A floating point number in double precision IEEE standard format uses two words (64 bits) to store the number as shown in the following figure. sign of mantissa 1 11-bit exponent mantissa 0 1 11 12 52-bit normalized mantissa 63 The first bit is a sign indicator, denoted s . It is followed by an 11-bit exponent c and a 52-bit mantissa f . The actual exponent is c − 1023 .
Error Algorithms and Convergence IEEE standard floating-point format Format of floating-point number ( − 1) s × (1 + f ) × 2 c − 1023 The smallest (normalized) positive number Let s = 0 , c = 1 and f = 0 which is equivalent to 2 − 1022 · (1 + 0) ≈ 2 . 225 × 10 − 308 . The largest number Let s = 0 , c = 2046 and f = 1 − 2 − 52 which is equivalent to 2 1023 · (2 − 2 − 52 ) ≈ 1 . 798 × 10 308 .
Error Algorithms and Convergence IEEE standard floating-point format Chopping and rounding For any real number x , let x = ± 1 .a 1 a 2 · · · a t a t +1 a t +2 · · · × 2 m , denote the normalized scientific binary representation of x . chopping: simply discard the excess bits a t +1 , a t +2 , . . . to 1 obtain fl ( x ) = ± 1 .a 1 a 2 · · · a t × 2 m . rounding: add ± 2 − ( t +1) × 2 m to x and then chop the 2 excess bits to obtain a number of the form fl ( x ) = ± 1 .δ 1 δ 2 · · · δ t × 2 m . In this method, if a t +1 = 1 , we add 1 to a t to obtain fl ( x ) , and if a t +1 = 0 , we merely chop off all but the first t digits.
Error Algorithms and Convergence Absolute and Relative Errors Definition (Round-off error) The error resulting from replacing a number with its floating-point form is called round-off error or rounding error . Definition (Absolute Error and Relative Error) If x ⋆ is an approximation to the exact value x , the absolute error is | x ⋆ − x | and the relative error is | x ⋆ − x | , provided that x � = 0 . | x | Example (a) If x = 0 . 3000 × 10 − 3 and x ∗ = 0 . 3100 × 10 − 3 , then the absolute error is 0 . 1 × 10 − 4 and the relative error is 0 . 3333 × 10 − 1 . (b) If x = 0 . 3000 × 10 4 and x ∗ = 0 . 3100 × 10 4 , then the absolute error is 0 . 1 × 10 3 and the relative error is 0 . 3333 × 10 − 1 .
Error Algorithms and Convergence Absolute and Relative Errors Remark As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful. Definition In decimal expressions, the number x ∗ is said to approximate x to t significant digits if t is the largest non-negative integer for which | x − x ∗ | ≤ 5 × 10 − t . | x |
Error Algorithms and Convergence Absolute and Relative Errors In binary expressions, if the floating-point representation fl chop ( x ) for the number x is obtained by t digits chopping, then the relative error is | 0 . 00 · · · 0 a t +1 a t +2 · · · × 2 m | | x − fl chop ( x ) | = | x | | 1 .a 1 a 2 · · · a t a t +1 a t +2 · · · × 2 m | | 0 .a t +1 a t +2 · · · | | 1 .a 1 a 2 · · · a t a t +1 a t +2 · · · | × 2 − t . = The minimal value of the denominator is 1 . The numerator is bounded above by 1. As a consequence � � x − fl chop ( x ) � ≤ 2 − t . � � � � x �
Recommend
More recommend