Why do double floating point operations lose precision?
Preface: At work, whenever addition, subtraction, multiplication or division on decimal values comes up, most developers reach for BigDecimal. Yet many are unsure why double or float loses precision in the first place, and how BigDecimal solves the problem. Without further ado, let's get started.
1. What is a floating point number?
A floating point number is a data type that computers use to represent numbers with a fractional part, using a binary form of scientific notation. In Java, double is a double-precision, 64-bit floating point number whose default value is 0.0d; float is a single-precision, 32-bit floating point number whose default value is 0.0f.
Storage in memory
float: sign bit (1 bit), exponent (8 bits), mantissa (23 bits)
double: sign bit (1 bit), exponent (11 bits), mantissa (52 bits)
The exponent field of a float occupies 8 bits. It does not store the exponent directly but a biased (offset) exponent: if the true exponent is e and the stored exponent code is E, then E = e + (2^(n-1) - 1), where n is the number of exponent bits and 2^(n-1) - 1 is the bias specified by the IEEE 754 standard. For float this gives 2^(8-1) - 1 = 127, and for double 2^(11-1) - 1 = 1023. The all-zeros and all-ones exponent codes are reserved (for zero and subnormal numbers, and for infinity and NaN), so the true exponent of a normalized float ranges from -126 to 127, and that of a normalized double from -1022 to 1023. The negative end of the exponent range determines the non-zero number with the smallest absolute value that a floating point number can express, while the positive end determines the number with the largest absolute value, and therefore the value range of the type.
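The layout and the bias formula can be checked directly in Java. The following is a minimal sketch rather than a definitive implementation: the class name FloatBits and its helper methods are made up for illustration, while Float.floatToIntBits is the standard library call that exposes the raw 32 bits of a float.

```java
class FloatBits {
    // Bias defined by IEEE 754 for an exponent field of n bits: 2^(n-1) - 1.
    static int bias(int exponentBits) {
        return (1 << (exponentBits - 1)) - 1;
    }

    // Sign bit of a float: the top bit (bit 31).
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Stored (biased) exponent E: bits 30..23.
    static int storedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // True exponent e = E - bias (meaningful for normalized values only).
    static int trueExponent(float f) {
        return storedExponent(f) - bias(8);
    }

    // Mantissa (fraction) field: the low 23 bits, without the implicit leading 1.
    static int mantissa(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float f = -6.5f; // -6.5 = -1.101 (binary) * 2^2
        System.out.println("bias(8)  = " + bias(8));            // 127
        System.out.println("bias(11) = " + bias(11));           // 1023
        System.out.println("sign     = " + sign(f));            // 1
        System.out.println("stored E = " + storedExponent(f));  // 129
        System.out.println("true e   = " + trueExponent(f));    // 2
        System.out.println("mantissa = " + Integer.toBinaryString(mantissa(f)));
    }
}
```

For a double the same idea applies with Double.doubleToLongBits, an 11-bit exponent field and a 52-bit mantissa field.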
The range of float is just under -2^128 ~ 2^128 in magnitude, that is, about -3.40E+38 ~ 3.40E+38;
The range of double is just under -2^1024 ~ 2^1024 in magnitude, that is, about -1.79E+308 ~ 1.79E+308.
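These limits are exposed as constants on Float and Double, so they can be printed rather than memorized. In the sketch below, the class FloatRange and the helper largestFinite are hypothetical names introduced for illustration; the helper assumes the largest finite value has an all-ones mantissa and the maximum true exponent.

```java
class FloatRange {
    // Largest finite value of a format with the given number of mantissa bits
    // and maximum true exponent: (2 - 2^-mantissaBits) * 2^maxExponent.
    static double largestFinite(int mantissaBits, int maxExponent) {
        double significand = 2.0 - Math.scalb(1.0, -mantissaBits);
        return Math.scalb(significand, maxExponent);
    }

    public static void main(String[] args) {
        // True exponent range of normalized values.
        System.out.println(Float.MIN_EXPONENT + " .. " + Float.MAX_EXPONENT);   // -126 .. 127
        System.out.println(Double.MIN_EXPONENT + " .. " + Double.MAX_EXPONENT); // -1022 .. 1023

        // Largest finite magnitudes.
        System.out.println(Float.MAX_VALUE);   // 3.4028235E38
        System.out.println(Double.MAX_VALUE);  // 1.7976931348623157E308

        // Smallest positive normalized magnitudes.
        System.out.println(Float.MIN_NORMAL);
        System.out.println(Double.MIN_NORMAL);

        // The formula reproduces the constants exactly.
        System.out.println(largestFinite(23, 127) == Float.MAX_VALUE);
        System.out.println(largestFinite(52, 1023) == Double.MAX_VALUE);
    }
}
```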
2. Sources of distortion: scientific notation
Let's talk about scientific notation first. Scientific notation is a shorthand for writing numbers, used to represent approximately a very large or very small number that has many digits. It offers no advantage for values with few digits, but for values with many digits the benefit is obvious. For example, the speed of light is about 300000000 meters/second, and the world's population is approximately 6100000000. Numbers like these are inconvenient to read and write, so the speed of light can be written as 3*10^8 and the world's population as 6.1*10^9. A calculator shows the same values as 3E8 and 6.1E9.
When we were kids, many of us liked to play with a calculator, adding or subtracting like crazy until the display ended up showing something like the picture below. That is a result displayed in scientific notation.
The real value in the picture is -4.86*10^11 = -486000000000. Decimal scientific notation requires the integer part of the significand to be a single digit in the interval [1, 9].
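Java uses the same E notation when it prints large double values. The sketch below is only an illustration: the class SciNotation and its two helpers are hypothetical names, wrapping Double.toString and BigDecimal.toPlainString from the standard library.

```java
import java.math.BigDecimal;

class SciNotation {
    // Java's default text form of a double; large values come out in E notation.
    static String compact(double value) {
        return Double.toString(value);
    }

    // Expands a number written in E notation into plain decimal digits.
    static String plain(String scientific) {
        return new BigDecimal(scientific).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(compact(300000000.0));   // 3.0E8
        System.out.println(compact(6100000000.0));  // 6.1E9
        System.out.println(plain("-4.86E11"));      // -486000000000
    }
}
```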
3. Sources of distortion: precision
When computers process data, they perform conversions and all kinds of complex operations, such as conversions between units and between number bases (for example binary and decimal). Many divisions never terminate: 10÷3 = 3.3333... goes on forever, but the available precision is finite, so 3.3333333 × 3 does not equal 10. Fractional data that has passed through such processing is therefore not exact; the more digits of precision are kept, the closer the result is to the true value.
The precision of float and double is determined by the number of mantissa bits. The integer part of a normalized significand is always an implicit "1"; because it never changes, it does not need to be stored. float: 2^23 = 8388608, which has seven digits. Counting the implicit leading bit, the significand effectively has 24 bits, and 2 × 8388608 = 2^24 = 16777216 has eight digits. Not every 8-digit (or even every 7-digit) decimal value can be represented exactly, however, so the precision of float is usually quoted as 6~7 significant decimal digits. double: 2^52 = 4503599627370496, which has 16 digits, and with the implicit bit 2^53 = 9007199254740992; similarly, the precision of double is usually quoted as 15~16 significant decimal digits.
Once a value becomes large (or small) enough, it is displayed in scientific notation with only as many significant figures as the precision allows, so the result is an approximation whose exponent is an integer. In addition, some decimal fractions cannot be expressed exactly in binary. Since only a limited number of bits is available, an error may already be introduced when the value is stored. To convert a decimal fraction to binary, use the multiply-by-2 method: multiply the fraction by 2, take the integer part as the next binary digit, and keep multiplying the remaining fractional part by 2 until it becomes 0.
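The limit of the mantissa can be observed at 2^24 for float and 2^53 for double: from there on, neighbouring values are 2 apart, so adding 1 no longer changes the number. This is a small sketch with a hypothetical class name PrecisionLimits and a hypothetical helper absorbsOne; Math.ulp is the standard method that returns the gap to the next representable value.

```java
class PrecisionLimits {
    // True when adding 1 leaves the value unchanged, i.e. the gap between
    // neighbouring floats at this magnitude is larger than 1.
    static boolean absorbsOne(float f) {
        return f + 1f == f;
    }

    // Same check for double.
    static boolean absorbsOne(double d) {
        return d + 1d == d;
    }

    public static void main(String[] args) {
        float f = 16777216f;           // 2^24
        double d = 9007199254740992d;  // 2^53

        System.out.println(absorbsOne(16777215f)); // false: 2^24 - 1 is still exact
        System.out.println(absorbsOne(f));         // true
        System.out.println(absorbsOne(d));         // true

        System.out.println(Math.ulp(f));  // 2.0
        System.out.println(Math.ulp(d));  // 2.0
    }
}
```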
Consider the double expression 0.3 - 0.1, whose output is 0.19999999999999998. To carry out the operation, 0.3 first has to be converted to binary:
0.3 * 2 = 0.6 => .0 (.6) take 0, leave 0.6
0.6 * 2 = 1.2 => .01 (.2) take 1, leave 0.2
0.2 * 2 = 0.4 => .010 (.4) take 0, leave 0.4
0.4 * 2 = 0.8 => .0100 (.8) take 0, leave 0.8
0.8 * 2 = 1.6 => .01001 (.6) take 1, leave 0.6
............. (the remainder 0.6 has already appeared, so the digits 1001 repeat forever)
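The multiply-by-2 steps above can be sketched in a few lines of Java. The class BinaryFraction and the method toBinary are hypothetical names; the sketch uses BigDecimal for the intermediate arithmetic so that the demonstration itself is not affected by binary rounding, and it stops after a fixed number of bits because the expansion of 0.3 never ends.

```java
import java.math.BigDecimal;

class BinaryFraction {
    // Returns up to `bits` binary digits of a decimal fraction in [0, 1),
    // produced by repeatedly multiplying by 2 and taking the integer part.
    static String toBinary(String decimalFraction, int bits) {
        BigDecimal value = new BigDecimal(decimalFraction);
        BigDecimal two = BigDecimal.valueOf(2);
        StringBuilder digits = new StringBuilder("0.");
        for (int i = 0; i < bits && value.signum() != 0; i++) {
            value = value.multiply(two);
            if (value.compareTo(BigDecimal.ONE) >= 0) {
                digits.append('1');                     // take 1, keep the fraction
                value = value.subtract(BigDecimal.ONE);
            } else {
                digits.append('0');                     // take 0
            }
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary("0.3", 13));  // 0.0100110011001
        System.out.println(toBinary("0.25", 13)); // 0.01 (terminates)
        System.out.println(0.3 - 0.1);            // 0.19999999999999998
    }
}
```

A fraction such as 0.25 terminates after two bits, while 0.3 keeps cycling through 1001, which is why it has to be cut off at the 52 mantissa bits of a double.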
4. Summary
After reading the above, it should be clear why floating point numbers have precision problems. Simply put, the float and double types are designed mainly for scientific and engineering calculations. They perform binary floating point arithmetic, which is carefully designed to provide fast and fairly accurate approximations over a wide range of values. However, they do not provide exact results and should not be used where exact results are required. Floating point numbers beyond a certain size are displayed in scientific notation, and such a representation is only an approximation of the real number, not equal to it. Converting a decimal fraction to binary may also produce an infinitely repeating expansion, or one that is longer than the floating point mantissa, and has to be cut off.
5. So how do we use BigDecimal to solve it?
Look at the two outputs below
0.3
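As a minimal sketch of the usual approach, assuming the comparison in question is between building a BigDecimal from a double and building it from a String: the class BigDecimalDemo and the helper subtract below are hypothetical names, while the BigDecimal constructors, valueOf, subtract and divide are standard library calls.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalDemo {
    // Exact decimal subtraction: both operands are parsed from their text form.
    static BigDecimal subtract(String a, String b) {
        return new BigDecimal(a).subtract(new BigDecimal(b));
    }

    public static void main(String[] args) {
        // double arithmetic carries the binary rounding error.
        System.out.println(0.3 - 0.1);               // 0.19999999999999998

        // BigDecimal built from strings works on the decimal digits as written.
        System.out.println(subtract("0.3", "0.1"));  // 0.2

        // The double constructor copies the binary approximation exactly,
        // so it prints a long expansion slightly below 0.3.
        System.out.println(new BigDecimal(0.3));

        // The String constructor and valueOf both give exactly 0.3.
        System.out.println(new BigDecimal("0.3"));   // 0.3
        System.out.println(BigDecimal.valueOf(0.3)); // 0.3

        // Non-terminating divisions need an explicit scale and rounding mode.
        System.out.println(new BigDecimal("10")
                .divide(new BigDecimal("3"), 10, RoundingMode.HALF_UP)); // 3.3333333333
    }
}
```

The design point is to create BigDecimal values from strings (or BigDecimal.valueOf) rather than from a double literal; otherwise the error that was already introduced by the double is carried into the BigDecimal.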