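Floating-point numbers are laid out according to the IEEE 754 standard. In single precision, the format consists of 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction. The exponent is stored with a bias of 127: a stored value E stands for 2^(E-127), so 127 represents 2^0, 2 represents 2^-125, and 1 represents 2^-126. The stored value 0 is reserved for zero and subnormal numbers, and 255 is reserved for infinities and NaN.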
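As a minimal sketch of this layout, the following C helper splits a float into its three raw fields. The names f32_fields and f32_decode are made up for illustration, and the code assumes that float is a 32-bit IEEE 754 single-precision value on the target platform.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw bit fields of a single-precision value. */
struct f32_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored (biased) value */
    uint32_t fraction;  /* 23 bits */
};

/* Split a float into its raw fields. Assumes float is IEEE 754 binary32. */
struct f32_fields f32_decode(float x)
{
    uint32_t bits;
    struct f32_fields f;

    memcpy(&bits, &x, sizeof bits);      /* copy the bytes instead of casting pointers */
    f.sign     = bits >> 31;             /* top bit */
    f.exponent = (bits >> 23) & 0xFFu;   /* next 8 bits */
    f.fraction = bits & 0x7FFFFFu;       /* low 23 bits */
    return f;
}
```

Under that assumption, decoding 1.0f gives a stored exponent of 127 and a fraction of 0, while any subnormal input gives a stored exponent of 0 and a non-zero fraction.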
The "leading bit convention" assumes that every number except 0.0 begins with a 1 in binary. Because that leading 1 is implied, it does not have to be stored, and the format gains one bit of precision. However, it creates an exception for 0.0, which is encoded with both exponent and fraction bits equal to 0.
With this convention alone, the smallest non-zero number that could be represented would be 1.0 × 2^-126. To represent even smaller numbers, engineers introduced subnormal numbers: values whose exponent field is 0 and whose fraction is non-zero, interpreted with a leading bit of 0 and a fixed exponent of -126.
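The largest subnormal number is 0.FFFFFE × 2^-126 (significand written in hexadecimal), which lies just below the smallest normal number, 1.0 × 2^-126. The smallest non-zero subnormal number is 0.000002 × 2^-126 (again in hexadecimal), which equals 2^-149 and is the closest representable value to 0.0.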
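These boundary values can be built directly from their bit patterns. The sketch below uses a hypothetical helper, f32_from_bits, and again assumes that float is IEEE 754 single precision and that the program is not compiled in a flush-to-zero mode.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float. Assumes IEEE 754 binary32. */
float f32_from_bits(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Exponent field 0, fraction field all ones: the largest subnormal. */
float f32_largest_subnormal(void)  { return f32_from_bits(0x007FFFFFu); }

/* Exponent field 0, fraction field 1: the smallest positive subnormal. */
float f32_smallest_subnormal(void) { return f32_from_bits(0x00000001u); }

/* Exponent field 1, fraction field 0: the smallest positive normal. */
float f32_smallest_normal(void)    { return f32_from_bits(0x00800000u); }
```

C99 hexadecimal float literals allow the values from the text to be written almost verbatim, for example 0x0.FFFFFEp-126f and 0x0.000002p-126f, so the results of these functions can be compared against them directly.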
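Subnormal numbers trade precision for range: they extend the representable magnitudes toward zero, but the closer a subnormal is to zero, the fewer significant bits it carries. For example, the smallest non-zero subnormal has a precision of only 1 bit, so dividing it by 2 results in 0.0 exactly under the default round-to-nearest mode.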
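The loss of precision near zero can be checked with a short sketch. The macro SMALLEST_SUBNORMAL and the function halve are illustrative names, and the result assumes IEEE 754 single precision with the default rounding mode and no flush-to-zero.

```c
/* Smallest positive subnormal in single precision, 2^-149 (C99 hex literal). */
#define SMALLEST_SUBNORMAL 0x1p-149f

/* Halve a float. The volatile temporary forces the result to be stored
   as a float rather than kept in a wider intermediate type. */
float halve(float x)
{
    volatile float h = x / 2.0f;
    return h;
}
```

Halving 2^-148 still yields the representable value 2^-149, but halving 2^-149 falls exactly between 0 and 2^-149 and rounds to 0.0.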
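In a visualization of the number line, subnormal numbers double the length of the exponent 0 range, from [2^-127, 2^-126) to [0, 2^-126), while keeping the same number of points. The part [2^-127, 2^-126) therefore holds half as many points as it would in a system without subnormals, and the other half are spread over [0, 2^-127). The points are evenly spaced across the whole subnormal range, with the same spacing as in [2^-126, 2^-125), which fills the gap that would otherwise lie between 0.0 and the smallest normal number.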
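One way to inspect how densely a range is populated is to measure the distance from a value to the next representable one. The function gap_above below is a hypothetical helper that does this by incrementing the raw bit pattern; it assumes IEEE 754 single precision and is only meaningful for non-negative finite inputs below the largest float.

```c
#include <stdint.h>
#include <string.h>

/* Distance from a non-negative finite float to the next larger float,
   found by adding 1 to the raw bit pattern. Assumes IEEE 754 binary32. */
float gap_above(float x)
{
    uint32_t bits;
    float next;

    memcpy(&bits, &x, sizeof bits);
    bits += 1;                          /* next bit pattern = next larger float */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

Under these assumptions the gap is 2^-149 everywhere from 0.0 up to 2^-125, and it doubles to 2^-148 at 2^-125.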
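In C, the float data type corresponds to single-precision IEEE 754 floating-point numbers on most platforms, although the C standard does not require it. The isnormal() macro from <math.h> returns false for subnormal numbers and true for normal numbers, but it also returns false for zero, infinity, and NaN, so it cannot identify subnormals on its own. To test specifically for a subnormal, compare the result of fpclassify() with FP_SUBNORMAL.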
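A minimal sketch of such a check, using the standard fpclassify() macro and FP_SUBNORMAL constant from <math.h>, could look as follows. The wrapper name is_subnormal is an illustrative choice, not a standard function.

```c
#include <math.h>
#include <stdbool.h>

/* True only for subnormal values; zero, normal numbers, infinity,
   and NaN all return false. */
bool is_subnormal(float x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}
```

For 0.0f both isnormal() and is_subnormal() return false, which is why isnormal() alone is not enough to single out subnormals.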