
How Can We Efficiently Implement log2(__m256d) in AVX2 for Both Intel and AMD Processors?


Efficient Implementation of log2(__m256d) in AVX2

SVML's __m256d _mm256_log2_pd (__m256d a) is only available with Intel compilers and is reported to run slower on AMD processors. Alternative implementations exist, but they tend to target SSE rather than AVX2. This discussion aims to provide an efficient log2() for vectors of four doubles (__m256d) that works with mainstream compilers and performs well on both AMD and Intel processors.

Traditional Strategy

The usual approach leverages the identity log2(a*b) = log2(a) + log2(b). For a double x = mantissa * 2^exponent, this gives log2(x) = exponent + log2(mantissa). The mantissa is confined to the range [1.0, 2.0), which makes it well suited to a polynomial approximation of log2(mantissa).
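As a concrete starting point, here is a minimal sketch of that decomposition with AVX2 intrinsics, assuming finite, positive, normal inputs (no handling of zero, negatives, NaN, infinities, or denormals); the function name is illustrative. AVX2 has no packed int64-to-double conversion, so the exponent is converted with the usual 2^52 bit trick.

#include <immintrin.h>

// Sketch: split x into exponent (as double) and mantissa in [1.0, 2.0).
// Assumes finite, positive, normal inputs.
static inline void extract_exp_mant(__m256d x, __m256d *exponent, __m256d *mantissa)
{
    const __m256i bits = _mm256_castpd_si256(x);

    // Biased exponent: shift the exponent field down to the low 11 bits
    // (the sign bit is zero because inputs are positive).
    const __m256i biased = _mm256_srli_epi64(bits, 52);

    // Convert the small 64-bit integers to double with the 2^52 trick:
    // OR them into the mantissa of 2^52, reinterpret as double, subtract 2^52.
    const __m256d magic = _mm256_set1_pd(4503599627370496.0);   // 2^52
    const __m256d e = _mm256_sub_pd(
        _mm256_castsi256_pd(_mm256_or_si256(biased, _mm256_castpd_si256(magic))),
        magic);
    *exponent = _mm256_sub_pd(e, _mm256_set1_pd(1023.0));        // remove the IEEE-754 bias

    // Mantissa: keep the 52 fraction bits and force the exponent field of 1.0,
    // which yields a value in [1.0, 2.0).
    const __m256i mant_mask = _mm256_set1_epi64x(0x000FFFFFFFFFFFFFLL);
    const __m256i one_bits  = _mm256_castpd_si256(_mm256_set1_pd(1.0));
    *mantissa = _mm256_castsi256_pd(
        _mm256_or_si256(_mm256_and_si256(bits, mant_mask), one_bits));
}

Compile with -mavx2 (the polynomial sketches further down also need -mfma).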

Accuracy and Precision

The desired accuracy and the range of inputs influence which implementation makes sense. Agner Fog's VCL (Vector Class Library) aims for high precision, using techniques that avoid accumulating rounding error. For a faster, lower-precision approximate float log(), consider JRF's polynomial implementation (http://jrfonseca.blogspot.ca/2008/09/fast-sse2-pow-tables-or-polynomials.html).

VCL Algorithm

VCL's log() functions for float and double follow a two-part approach:

  1. Extract exponent and mantissa: The exponent bits are converted back to floating point, and the mantissa is adjusted based on a comparison against SQRT2*0.5; 1.0 is then subtracted from the mantissa.
  2. Polynomial approximation: A polynomial approximation is applied to the adjusted mantissa to calculate log(x) around x=1.0. For double precision, VCL uses a ratio of two 5th-order polynomials.

The final result is obtained by adding the exponent to the polynomial approximation. VCL includes extra steps to minimize rounding error.
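The sketch below shows only the shape of that computation, not VCL's actual code: the range-reduction compare is done here against sqrt(2) on a [1.0, 2.0) mantissa (equivalent in effect to VCL's SQRT2*0.5 check on its own fraction representation), and the polynomial is a short Horner chain with truncated-series stand-in coefficients rather than VCL's ratio of minimax polynomials, so its accuracy is far lower. It reuses extract_exp_mant() from the earlier sketch.

// Structural sketch of range reduction + polynomial + exponent; the coefficients are a
// truncated series for log2(1+f), NOT VCL's coefficients, so accuracy is limited.
static inline __m256d log2_pd_sketch(__m256d x)
{
    __m256d exponent, mantissa;
    extract_exp_mant(x, &exponent, &mantissa);

    // Range reduction: if mantissa >= sqrt(2), halve it and bump the exponent,
    // so f = mantissa - 1 stays in roughly [-0.29, 0.41], centered near zero.
    const __m256d sqrt2 = _mm256_set1_pd(1.4142135623730951);
    const __m256d big = _mm256_cmp_pd(mantissa, sqrt2, _CMP_GE_OQ);
    mantissa = _mm256_blendv_pd(mantissa,
                                _mm256_mul_pd(mantissa, _mm256_set1_pd(0.5)), big);
    exponent = _mm256_add_pd(exponent, _mm256_and_pd(big, _mm256_set1_pd(1.0)));

    const __m256d f = _mm256_sub_pd(mantissa, _mm256_set1_pd(1.0));

    // log2(1+f) ~= (f - f^2/2 + f^3/3 - f^4/4) / ln(2), evaluated with Horner + FMA.
    const double inv_ln2 = 1.4426950408889634;
    __m256d p = _mm256_set1_pd(-inv_ln2 / 4.0);
    p = _mm256_fmadd_pd(p, f, _mm256_set1_pd( inv_ln2 / 3.0));
    p = _mm256_fmadd_pd(p, f, _mm256_set1_pd(-inv_ln2 / 2.0));
    p = _mm256_fmadd_pd(p, f, _mm256_set1_pd( inv_ln2));
    p = _mm256_mul_pd(p, f);

    return _mm256_add_pd(p, exponent);        // log2(x) = exponent + log2(mantissa)
}

Adding the integer-valued exponent at the end is exact; the extra error-reduction steps VCL takes are discussed below.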

Alternative Polynomial Approximations

For increased accuracy, you can use VCL directly. However, for a faster approximate log2() implementation for float, consider porting JRF's SSE2 function to AVX2 with FMA.
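A skeleton of such a port might look like the sketch below. This is not JRF's code and not his fitted coefficients; it only mirrors the structure (integer exponent extraction plus a polynomial in the mantissa) with AVX2/FMA intrinsics for eight floats, reusing the <immintrin.h> include from the earlier sketch and using truncated-series placeholder coefficients that you would replace with a proper minimax polynomial over [1.0, 2.0).

// Sketch of an AVX2+FMA approximate log2 for 8 floats; placeholder coefficients only.
static inline __m256 log2_approx_ps(__m256 x)
{
    const __m256i bits = _mm256_castps_si256(x);

    // Exponent: (bits >> 23) - 127; int32 -> float conversion is a single instruction.
    const __m256i e_int = _mm256_sub_epi32(_mm256_srli_epi32(bits, 23),
                                           _mm256_set1_epi32(127));
    const __m256 e = _mm256_cvtepi32_ps(e_int);

    // Mantissa in [1.0, 2.0): keep the fraction bits, force the exponent field of 1.0f.
    const __m256 m = _mm256_castsi256_ps(_mm256_or_si256(
        _mm256_and_si256(bits, _mm256_set1_epi32(0x007FFFFF)),
        _mm256_set1_epi32(0x3F800000)));

    const __m256 f = _mm256_sub_ps(m, _mm256_set1_ps(1.0f));

    // Placeholder polynomial for log2(1+f): truncated series / ln(2), Horner with FMA.
    const float inv_ln2 = 1.44269504f;
    __m256 p = _mm256_set1_ps( inv_ln2 / 3.0f);
    p = _mm256_fmadd_ps(p, f, _mm256_set1_ps(-inv_ln2 / 2.0f));
    p = _mm256_fmadd_ps(p, f, _mm256_set1_ps( inv_ln2));
    p = _mm256_mul_ps(p, f);

    return _mm256_add_ps(p, e);               // log2(x) = exponent + log2(mantissa)
}

With a polynomial fitted over the whole [1.0, 2.0) mantissa range, no sqrt(2)-style range reduction is needed.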

Avoiding Rounding Error

VCL uses various techniques to reduce rounding error. These include:

  • Splitting ln2 into smaller constants (ln2_lo and ln2_hi), so the exponent's contribution to a natural log is added with minimal rounding error (illustrated in the sketch after this list)
  • Adding the line res = nmul_add(x2, 0.5, x); (i.e. x - 0.5*x2) to the polynomial evaluation, so the two leading terms of the series are computed with a single FMA
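The ln2 split matters for natural log, where the exponent's contribution is fe*ln(2) and ln(2) is not exactly representable; a pure log2() can add the exponent exactly and does not need it. The sketch below shows the idea with a commonly used hi/lo split of ln(2) (not necessarily VCL's exact constants); poly and fe are hypothetical names for the polynomial result and the exponent-as-double from the earlier sketches.

// Illustrative only (not VCL's exact code): fold the exponent term fe*ln(2) into the
// result using a hi/lo split of ln(2). ln2_hi has only a few significand bits set, so
// fe*ln2_hi is exact over the exponent range of IEEE doubles; only the tiny ln2_lo
// term contributes rounding error.
static inline __m256d add_exponent_times_ln2(__m256d poly, __m256d fe)
{
    const __m256d ln2_hi = _mm256_set1_pd( 0.693359375);                 // coarse part of ln(2)
    const __m256d ln2_lo = _mm256_set1_pd(-2.121944400546905827679e-4);  // ln(2) - ln2_hi
    const __m256d res = _mm256_fmadd_pd(fe, ln2_lo, poly);  // small correction first
    return _mm256_fmadd_pd(fe, ln2_hi, res);                // exact large term last
}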

Stripping Unnecessary Steps

If your values are known to be finite, positive, and normalized, you can significantly improve performance by commenting out the checks for underflow, overflow, and denormals.

Further Reading

  • [Polynomial approximation with minimax error](http://gallium.inria.fr/blog/fast-vectorizable-math-approx/)
  • [Fast approximate logarithm using bit manipulation](http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html)
