I believe I should add a hardware designer's perspective, since I design and build floating point hardware. Knowing the source of the error may help understand what's going on in the software, and ultimately, I hope this helps explain why floating point errors occur and seem to accumulate over time.
1. Overview
From an engineering perspective, most floating point operations will have some error, because the hardware doing the floating point calculations only needs to have an error of less than one-half of a unit in the last bit. Therefore, many hardware will stop at a precision that only requires an error of less than one unit on the last bit of a single operation, which is especially problematic in floating point division. What constitutes a single operation depends on how many operands are required for that unit. For most, it's two, but some units require 3 or more operands. Therefore, there is no guarantee that repeated operations will result in the desired error, as errors accumulate over time. 2. standard
Most processors follow the
IEEE-754
standard, but some processors use denormalized or different standards
. For example, there is a denormalization mode in IEEE-754 that allows very small floating point numbers to be represented at the expense of precision. However, the standardized mode of IEEE-754 is described below, which is the typical mode of operation. In the IEEE-754 standard, hardware designers are allowed to use any error/epsilon value as long as it is less than one-half of a unit of the last digit, and the result only needs to be less than one-half of a unit. At the last position of an operation. This explains why errors accumulate when operations are repeated. For IEEE-754 double precision, this is bit 54 because 53 bits are used to represent the numeric portion (normalized) of the floating point number, also known as the mantissa (e.g. 5.3 in 5.3e5). The following sections go into more detail about the causes of hardware errors in various floating-point operations.
3. Causes of division rounding errors
The main cause of floating point division errors is the division algorithm used to calculate the quotient. Most computer systems use inverse multiplication to calculate divisions, primarily
Z=X/Y
, Z = X * (1/Y). The division is calculated iteratively, i.e. a number of bits of the quotient are calculated every cycle until the required precision is reached, which for IEEE-754 is any value where the last bit error is less than one unit. The reciprocal table of Y (1/Y) is called a quotient selection table (QST) in slow division. The size of the quotient selection table (in bits) is usually the width of the base, or the number of digits in the base. The quotient is calculated in each iteration, plus some guard bits. For the IEEE-754 standard, double precision (64 bits), it is the size of the divider base, plus some guard bits k, where k>=2. For example, a typical quotient selection table for a divider that computes 2-bit quotients (base 4) at a time would be 2 2= 4 bits (plus some optional bits).
3.1 Division rounding error: Approximation of reciprocal
The reciprocals in the quotient selection table depend on the division: slow division like SRT division, or fast division like Goldschmidt division; each entry is modified according to the division algorithm to try to produce the lowest possible mistake. Regardless, all reciprocals are approximations of the actual reciprocals and will introduce some element of error. Both slow division and fast division calculate the quotient iteratively, i.e. each step calculates a number of digits in the quotient and then subtracts the result from the dividend. The divider repeats these steps until the error is less than one-half unit in the last place. Slow division methods compute the quotient of a fixed number of digits in each step and are generally less expensive to build, whereas fast division methods compute a variable number of digits in each step and are generally more expensive to build. The most important part about division is that most of them rely on repeated multiplication of a approximation of the reciprocal, so it's easy to make mistakes.
4. Rounding errors in other operations: Truncation
Another cause of rounding errors in all operations is the different truncation modes of the final answer allowed by IEEE-754. There are truncate, round toward zero, round to nearest (default), round-down, round up. For a single operation, all methods introduce an error element of less than one unit at the end. Truncation also cumulatively increases the final error over time and with repeated operations. This truncation error is particularly problematic when exponentiation involves some form of repeated multiplication.
5. Repeat operation
Because the hardware that performs floating point calculations only needs to produce a result with an error of less than one-half in the last bit in one operation, if you are not careful, the error will increase with repeated operations. This is why in calculations that require bounded errors, mathematicians use methods such as rounding to the nearest Interval arithmetic is combined with IEEE's variant 754 Rounding Mode to predict rounding errors and correct them. Rounding to the nearest even digit (last digit) is the default rounding mode for IEEE-754 due to the lower relative error compared to other rounding modes.
In short, the root cause of errors in floating point operations is a combination of hardware truncation and reciprocal truncation during division. Since the IEEE-754 standard only requires that the error in the last bit of a single operation be less than one-half, floating-point errors from repeated operations will accumulate unless corrected.
binaryfloating pointThe math goes like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the matter is that numbers are represented in this format as integers multiplied by powers of 2; rational numbers whose denominators are not powers of 2 (such as 0.1, which is 1/10) cannot be accurately represented .
For the standard binary64 format 0.1, its representation can be written exactly as
0.10000000000000000055511151231257827021181583404541015625 (decimal), or
In contrast, the rational number 0.1, which is 1/10, can be written exactly as
0.1 (decimal), or
0x1.99999999999999...p-4 is similar to C99 hexadecimal floating point notation, where ... represents an endless sequence of 9s.
The constants 0.2 and 0.3 in the program will also be approximations of their true values. As it happens, the double closest to 0.2 is larger than the rational number 0.2, but the closest double code>0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 ends up being greater than the rational number 0.3 and is therefore inconsistent with the constants in the code.
The same problem exists with plain old decimal (base 10) numbers, which is why a number like 1/3 ends up as 0.333333333...
You have just stumbled upon a number (3/10) that is easily represented in decimal, but not in the binary system. It also goes both ways (to a certain extent): 1/16 is an ugly number in decimal (0.0625), but in binary it looks like the 10,000th decimal (0.0001)** - if we're in due We have a habit of using a base 2 number system in our daily lives, and you'll even see the number and instinctively understand that you can reach that number by halving something, and halving it again, and again and again.
Of course, this is not exactly how floating point numbers are stored in memory (they use scientific notation). However, it does illustrate that binary floating point precision errors tend to arise because the "real world" numbers we are usually interested in are usually powers of ten - but that's only because we use the decimal number system today. This is why we say 71% instead of "5 out of every 7" (71% is an approximation because 5/7 cannot be represented exactly by any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other N-based number system :)
In practice, this precision issue means that you need to use rounding functions to round floating point numbers to the number of decimal places you are interested in before displaying them.
You also need to replace the equality test with a comparison that allows some degree of tolerance, which means:
Don'tDon'tDoif (x == y) { ... }
Instead execute if (abs(x - y) .
Where abs is the absolute value. myToleranceValue The choice needs to be based on your specific application - it has a lot to do with the "wiggle room" you are prepared to allow, and what the largest number you are comparing might be (due to precision loss issues). Note the "epsilon" style constants in the language of your choice. These can be used as tolerance values, but their effectiveness depends on the size (size) of the numbers you are working with, since calculations on large numbers may exceed the epsilon threshold.
Hardware Designer’s Perspective
I believe I should add a hardware designer's perspective, since I design and build floating point hardware. Knowing the source of the error may help understand what's going on in the software, and ultimately, I hope this helps explain why floating point errors occur and seem to accumulate over time.
1. Overview
From an engineering perspective, most floating point operations will have some error, because the hardware doing the floating point calculations only needs to have an error of less than one-half of a unit in the last bit. Therefore, many hardware will stop at a precision that only requires an error of less than one unit on the last bit of a single operation, which is especially problematic in floating point division. What constitutes a single operation depends on how many operands are required for that unit. For most, it's two, but some units require 3 or more operands. Therefore, there is no guarantee that repeated operations will result in the desired error, as errors accumulate over time. 2. standard
Most processors follow the
IEEE-754standard, but some processors use denormalized or different standards . For example, there is a denormalization mode in IEEE-754 that allows very small floating point numbers to be represented at the expense of precision. However, the standardized mode of IEEE-754 is described below, which is the typical mode of operation. In the IEEE-754 standard, hardware designers are allowed to use any error/epsilon value as long as it is less than one-half of a unit of the last digit, and the result only needs to be less than one-half of a unit. At the last position of an operation. This explains why errors accumulate when operations are repeated. For IEEE-754 double precision, this is bit 54 because 53 bits are used to represent the numeric portion (normalized) of the floating point number, also known as the mantissa (e.g. 5.3 in 5.3e5). The following sections go into more detail about the causes of hardware errors in various floating-point operations.
3. Causes of division rounding errors
The main cause of floating point division errors is the division algorithm used to calculate the quotient. Most computer systems use inverse multiplication to calculate divisions, primarily
Z=X/Y,
3.1 Division rounding error: Approximation of reciprocalZ = X * (1/Y)
. The division is calculated iteratively, i.e. a number of bits of the quotient are calculated every cycle until the required precision is reached, which for IEEE-754 is any value where the last bit error is less than one unit. The reciprocal table of Y (1/Y) is called a quotient selection table (QST) in slow division. The size of the quotient selection table (in bits) is usually the width of the base, or the number of digits in the base. The quotient is calculated in each iteration, plus some guard bits. For the IEEE-754 standard, double precision (64 bits), it is the size of the divider base, plus some guard bits k, wherek>=2
. For example, a typical quotient selection table for a divider that computes 2-bit quotients (base 4) at a time would be2 2= 4
bits (plus some optional bits).The reciprocals in the quotient selection table depend on the division: slow division like SRT division, or fast division like Goldschmidt division; each entry is modified according to the division algorithm to try to produce the lowest possible mistake. Regardless, all reciprocals are approximations of the actual reciprocals and will introduce some element of error. Both slow division and fast division calculate the quotient iteratively, i.e. each step calculates a number of digits in the quotient and then subtracts the result from the dividend. The divider repeats these steps until the error is less than one-half unit in the last place. Slow division methods compute the quotient of a fixed number of digits in each step and are generally less expensive to build, whereas fast division methods compute a variable number of digits in each step and are generally more expensive to build. The most important part about division is that most of them rely on repeated multiplication of a approximation of the reciprocal, so it's easy to make mistakes.
4. Rounding errors in other operations: Truncation
Another cause of rounding errors in all operations is the different truncation modes of the final answer allowed by IEEE-754. There are truncate, round toward zero, round to nearest (default), round-down, round up. For a single operation, all methods introduce an error element of less than one unit at the end. Truncation also cumulatively increases the final error over time and with repeated operations. This truncation error is particularly problematic when exponentiation involves some form of repeated multiplication.
5. Repeat operation
Because the hardware that performs floating point calculations only needs to produce a result with an error of less than one-half in the last bit in one operation, if you are not careful, the error will increase with repeated operations. This is why in calculations that require bounded errors, mathematicians use methods such as rounding to the nearest Interval arithmetic is combined with IEEE's variant 754 Rounding Mode to predict rounding errors and correct them. Rounding to the nearest even digit (last digit) is the default rounding mode for IEEE-754 due to the lower relative error compared to other rounding modes.
Please note that the default rounding mode is to round to the nearest even number a> of the last digit, ensuring that the error of the last digit in one operation is less than half. Truncation, rounding up, and rounding down alone may result in errors greater than one-half of the last digit but less than one unit of the last digit, so these modes are not recommended except for interval arithmetic.
6. Summary
In short, the root cause of errors in floating point operations is a combination of hardware truncation and reciprocal truncation during division. Since the IEEE-754 standard only requires that the error in the last bit of a single operation be less than one-half, floating-point errors from repeated operations will accumulate unless corrected.
binaryfloating pointThe math goes like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the matter is that numbers are represented in this format as integers multiplied by powers of 2; rational numbers whose denominators are not powers of 2 (such as
0.1
, which is1/10
) cannot be accurately represented .For the standard
binary64
format0.1
, its representation can be written exactly as0.10000000000000000055511151231257827021181583404541015625
(decimal), or0x1.999999999999ap-4
C99 hexadecimal floating point representation.In contrast, the rational number
0.1
, which is1/10
, can be written exactly as0.1
(decimal), or0x1.99999999999999...p-4
is similar to C99 hexadecimal floating point notation, where...
represents an endless sequence of 9s.The constants
0.2
and0.3
in the program will also be approximations of their true values. As it happens, thedouble
closest to0.2
is larger than the rational number0.2
, but the closestdouble
code>0.3 is smaller than the rational number0.3
. The sum of0.1
and0.2
ends up being greater than the rational number0.3
and is therefore inconsistent with the constants in the code.A fairly comprehensive treatment of floating-point arithmetic problems is Every computer scientist should know about floating-point arithmetic. For a more understandable explanation, see floating-point-gui.de.
The same problem exists with plain old decimal (base 10) numbers, which is why a number like 1/3 ends up as 0.333333333...
You have just stumbled upon a number (3/10) that is easily represented in decimal, but not in the binary system. It also goes both ways (to a certain extent): 1/16 is an ugly number in decimal (0.0625), but in binary it looks like the 10,000th decimal (0.0001)** - if we're in due We have a habit of using a base 2 number system in our daily lives, and you'll even see the number and instinctively understand that you can reach that number by halving something, and halving it again, and again and again.
Of course, this is not exactly how floating point numbers are stored in memory (they use scientific notation). However, it does illustrate that binary floating point precision errors tend to arise because the "real world" numbers we are usually interested in are usually powers of ten - but that's only because we use the decimal number system today. This is why we say 71% instead of "5 out of every 7" (71% is an approximation because 5/7 cannot be represented exactly by any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other N-based number system :)
In practice, this precision issue means that you need to use rounding functions to round floating point numbers to the number of decimal places you are interested in before displaying them.
You also need to replace the equality test with a comparison that allows some degree of tolerance, which means:
Don'tDon'tDo
if (x == y) { ... }
Instead execute
if (abs(x - y) .
Where
abs
is the absolute value.myToleranceValue
The choice needs to be based on your specific application - it has a lot to do with the "wiggle room" you are prepared to allow, and what the largest number you are comparing might be (due to precision loss issues). Note the "epsilon" style constants in the language of your choice. These can be used as tolerance values, but their effectiveness depends on the size (size) of the numbers you are working with, since calculations on large numbers may exceed the epsilon threshold.