Is there a problem with floating point calculations?
P粉071602406 · 2023-08-20 16:48:26
Consider the following code:

```js
0.1 + 0.2 == 0.3 -> false
```

```js
0.1 + 0.2 -> 0.30000000000000004
```

Why do these inaccuracies occur?

Replies (2)
P粉347804896

Hardware Designer’s Perspective

I thought I should add a hardware designer's perspective, since I design and build floating point hardware. Knowing where the error originates may help in understanding what is happening in the software, and ultimately I hope this explains why floating point errors occur and seem to accumulate over time.

1. Overview

From an engineering perspective, most floating point operations will have some element of error, because the hardware that performs floating point computations is only required to produce an error of less than one half of one unit in the last place for a single operation. Therefore, much hardware stops at a precision that is only sufficient to yield an error of less than one half of one unit in the last place for a single operation, which is especially problematic in floating point division. What constitutes a single operation depends on how many operands the unit takes. For most units this is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will produce a desirably small error, since the errors add up over time.

2. Standard

Most processors follow the IEEE-754 standard, but some use denormalized, or different, standards. For example, IEEE-754 has a denormalized mode that allows the representation of very small floating point numbers at the expense of precision. The following, however, covers the normalized mode of IEEE-754, which is the typical mode of operation.

In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it is less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for a single operation. This explains why errors add up when operations are repeated. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized) of the floating point number, also called the significand or mantissa (e.g. the 5.3 in 5.3e5). The following sections discuss the causes of hardware error in various floating point operations in more detail.
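To make the 53-bit significand concrete, here is a small JavaScript sketch (my own illustration, not part of the answer above): doubles can represent every integer up to 2^53 exactly, and Number.EPSILON is the 2^-52 gap between 1 and the next representable double.

```js
// Doubles carry a 53-bit significand, so 2^53 is the first integer
// whose successor cannot be represented exactly.
console.log(2 ** 53 === 2 ** 53 + 1);     // true: 2^53 + 1 rounds back down to 2^53
console.log(2 ** 53 - 1 === 2 ** 53 - 2); // false: below 2^53, every integer is exact

// Number.EPSILON is 2^-52, the gap between 1 and the next larger double.
console.log(Number.EPSILON === 2 ** -52); // true
```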

3. Causes of rounding errors in division

The main cause of error in floating point division is the division algorithm used to calculate the quotient. Most computer systems compute division using multiplication by an inverse, i.e. for Z = X/Y they compute Z = X * (1/Y). The division is computed iteratively: each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division. The size in bits of the quotient selection table is usually the width of the radix, i.e. the number of bits of the quotient computed in each iteration, plus a few guard bits. For IEEE-754 double precision (64 bits), it would be the size of the radix of the divider plus a few guard bits k, where k >= 2. So, for example, a typical quotient selection table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2 + 2 = 4 bits (plus a few optional bits).
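A trace of this reciprocal-based approach is visible even at the language level: IEEE-754 requires X/Y to be correctly rounded, while X * (1/Y) rounds twice. A small JavaScript sketch (my own illustration, assuming a conforming engine):

```js
// x / y must be correctly rounded (error < 0.5 ULP), but x * (1 / y)
// rounds twice: once when computing 1/y, once more for the product.
console.log(5 / 3);       // 1.6666666666666667
console.log(5 * (1 / 3)); // 1.6666666666666665 (off by one ULP from the true quotient)
console.log(5 / 3 === 5 * (1 / 3)); // false
```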

3.1 Division rounding error: Approximation of reciprocal

The reciprocals in the quotient selection table depend on the division method: slow division (such as SRT division) or fast division (such as Goldschmidt division); each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow and fast division methods compute the quotient iteratively, i.e. some number of quotient digits is computed at each step, the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods compute a fixed number of quotient digits per step and are usually cheaper to build, whereas fast division methods compute a variable number of digits per step and are usually more expensive. The most important point about division methods is that most of them rely on repeated multiplication by an approximation of a reciprocal, and are therefore error-prone.

4. Rounding errors in other operations: truncation

Another cause of rounding errors in all operations is the different truncation modes for the final answer that IEEE-754 allows: truncate, round towards zero, round to nearest (the default), round down, and round up. All of these methods introduce an error of less than one unit in the last place for a single operation. Over time and with repeated operations, truncation also adds cumulatively to the resulting error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.
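JavaScript does not expose IEEE-754's rounding-mode switch, but the error introduced by rounding can still be observed by rounding to a narrower format: Math.fround rounds a double to the nearest single-precision value. A small sketch of my own, purely for illustration:

```js
// Math.fround rounds a double to the nearest single-precision value,
// so the stored number drifts further from the true 1/10.
console.log(Math.fround(0.1));         // 0.10000000149011612
console.log(Math.fround(0.1) === 0.1); // false: the rounding lost precision
```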

5. Repeated operations

Because the hardware that performs floating point computations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if left unchecked. This is why, in computations that require a bounded error, mathematicians use methods such as IEEE-754's round-to-nearest-even digit in the last place, because over time the errors are more likely to cancel each other out, and combine interval arithmetic with variations of the IEEE-754 rounding modes to predict rounding errors and correct them. Because of its low relative error compared to the other rounding modes, round-to-nearest-even digit (in the last place) is the default rounding mode of IEEE-754. Note that the default rounding mode, round-to-nearest-even digit in the last place, guarantees an error of less than one half of one unit in the last place for a single operation. Using truncation, round-up, or round-down alone may produce an error that is greater than one half of one unit in the last place (though still less than one unit in the last place), so these modes are not recommended unless they are used in interval arithmetic.
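As a quick illustration of how these per-operation errors accumulate, here is a small JavaScript sketch (mine, not part of the answer): summing 0.1 ten times does not land exactly on 1.

```js
// Each addition is individually rounded to within half a unit in the
// last place, but the small per-step errors accumulate rather than cancel.
let sum = 0;
for (let i = 0; i < 10; i++) {
  sum += 0.1;
}
console.log(sum);       // 0.9999999999999999
console.log(sum === 1); // false
```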

P粉790187507

This is how binary floating point math works. In most programming languages it is based on the IEEE 754 standard. The crux of the matter is that numbers are represented in this format as an integer times a power of two; rational numbers whose denominator is not a power of two (such as 0.1, which is 1/10) cannot be represented exactly.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • Decimal: 0.1000000000000000055511151231257827021181583404541015625
  • C99 hexfloat notation: 0x1.999999999999ap-4

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • Decimal: 0.1
  • In an analogue of C99 hexfloat notation: 0x1.99999999999999...p-4, where the ... represents an unending sequence of 9s
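If you want to verify that hexfloat yourself, the raw bits of the double can be read out with a DataView. A small JavaScript sketch of my own:

```js
// Read the raw IEEE-754 bit pattern of a double as 16 hex digits.
function bitsOf(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return view.getBigUint64(0).toString(16).padStart(16, "0");
}

console.log(bitsOf(0.1)); // "3fb999999999999a"
// The exponent field 0x3fb encodes 2^-4, and the significand field
// 999999999999a is the digit string from the hexfloat 0x1.999999999999ap-4.
```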

The constants 0.2 and 0.3 in your program will also be approximations of their true values. It happens that the double closest to 0.2 is larger than the rational number 0.2, while the double closest to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3, and hence disagrees with the constant in your code.
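You can make these approximations visible by printing more digits than JavaScript shows by default. A small sketch of my own; 17 significant digits are enough to distinguish any two doubles:

```js
// Print 17 significant digits to expose the stored approximations.
console.log((0.1).toPrecision(17));       // 0.10000000000000001 (above 1/10)
console.log((0.2).toPrecision(17));       // 0.20000000000000001 (above 1/5)
console.log((0.3).toPrecision(17));       // 0.29999999999999999 (below 3/10)
console.log((0.1 + 0.2).toPrecision(17)); // 0.30000000000000004 (above 3/10)
```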

A fairly comprehensive treatment of floating point arithmetic is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For a more easily digested explanation, see floating-point-gui.de.

The same problem exists with regular decimal (base 10) numbers, which is why a number like 1/3 ends up being 0.333333333...

You just encountered a number (3/10) that is easy to represent in the decimal system but cannot be represented in the binary system. It also goes both ways (to a small degree): 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as one ten-thousandth does in decimal (0.0001). If we were in the habit of using a base-2 number system in our daily lives, you would even look at that number and instinctively understand that you could arrive at it by halving something again and again.

Of course, this is not how floating point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the problem that binary floating point precision errors tend to arise because the "real world" numbers we are usually interested in tend to be powers of ten - but that's only because we use the decimal number system on a day-to-day basis. This is why we say 71% instead of "5 out of every 7" (71% is an approximation because 5/7 cannot be represented exactly by any decimal number).

So no, there's nothing wrong with binary floating point numbers, they're just imperfect like any other base N number system :)

In practice, this precision issue means that you need to use rounding functions to round floating point numbers to the number of decimal places you are interested in before displaying them.
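For example, in JavaScript (a sketch of my own):

```js
// Round to two decimal places for display only; the stored value is unchanged.
const x = 0.1 + 0.2;
console.log(x);            // 0.30000000000000004
console.log(x.toFixed(2)); // "0.30" (a string, suitable for display)
```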

You also need to replace the equality test with a comparison that allows a certain tolerance, which means:

Don’t use if (x == y) { ... }

Instead use if (abs(x - y) < myToleranceValue) { ... }.

Where abs is the absolute value function (Math.abs in JavaScript). myToleranceValue needs to be chosen for your specific application: it depends heavily on how much "wiggle room" you are prepared to allow, and on the magnitude of the numbers you are going to compare (due to loss-of-precision issues). Be aware of "epsilon"-style constants in your chosen language. These can be used as tolerance values, but their effectiveness depends on the magnitude of the numbers you are working with, since calculations with large numbers may exceed the epsilon threshold.
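One common approach is to make the tolerance relative, scaling Number.EPSILON by the larger operand so the test stays meaningful at any magnitude. A sketch of my own; nearlyEqual is a hypothetical helper name, not a built-in:

```js
// Relative tolerance: scale Number.EPSILON (2^-52) by the larger operand,
// so the allowed "wiggle room" grows with the magnitude of the inputs.
function nearlyEqual(a, b, relTol = 4 * Number.EPSILON) {
  return Math.abs(a - b) <= relTol * Math.max(Math.abs(a), Math.abs(b));
}

console.log(0.1 + 0.2 === 0.3);           // false
console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true
```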
