SVML's __m256d _mm256_log2_pd (__m256d a) is not available outside Intel's compilers and is reportedly slower on AMD processors. Alternative implementations exist, but they often target SSE rather than AVX2. This discussion aims to provide an efficient implementation of log2() for vectors of four doubles that works with various compilers and performs well on both AMD and Intel processors.
Traditional Strategy
The usual approach leverages the identity log2(a*b) = log2(a) + log2(b), which for a double simplifies to exponent + log2(mantissa). The mantissa has a limited range, from 1.0 up to (but not including) 2.0, making it suitable for a polynomial approximation of log2(mantissa).
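The decomposition above can be sketched with AVX2 integer operations on the raw bits of each double. This is a minimal illustration, not code from VCL or SVML: the names split_pd, split4 and AVX2_FN are hypothetical, and the sketch assumes every input is positive, finite and normal (no zero, denormal, infinity or NaN). The exponent is converted to double with the common "OR into 2^52, then subtract" trick, because AVX2 has no direct 64-bit-integer-to-double conversion.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang per-function target attribute, so the file builds
   without -mavx2. With other compilers, enable AVX2 on the command line. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Split four positive, finite, normal doubles so that x = m * 2^e,
   with m in [1.0, 2.0) and e returned as a double. */
AVX2_FN static inline void split_pd(__m256d x, __m256d *e, __m256d *m)
{
    const __m256i mant_mask  = _mm256_set1_epi64x(0x000FFFFFFFFFFFFFLL);
    const __m256i one_bits   = _mm256_set1_epi64x(0x3FF0000000000000LL); /* 1.0  */
    const __m256i magic_bits = _mm256_set1_epi64x(0x4330000000000000LL); /* 2^52 */
    const __m256d magic_bias = _mm256_set1_pd(4503599627370496.0 + 1023.0);

    __m256i xi = _mm256_castpd_si256(x);

    /* Keep the 52 fraction bits and force the exponent field to that of 1.0. */
    *m = _mm256_castsi256_pd(
        _mm256_or_si256(_mm256_and_si256(xi, mant_mask), one_bits));

    /* Biased exponent as an integer (sign bit assumed 0). OR-ing it into the
       bit pattern of 2^52 yields the double 2^52 + biased; subtracting
       2^52 + 1023 leaves the unbiased exponent. */
    __m256i biased = _mm256_srli_epi64(xi, 52);
    *e = _mm256_sub_pd(
        _mm256_castsi256_pd(_mm256_or_si256(biased, magic_bits)), magic_bias);
}

/* Plain-array wrapper so callers do not need vector types. */
AVX2_FN void split4(const double x[4], double e[4], double m[4])
{
    __m256d ve, vm;
    split_pd(_mm256_loadu_pd(x), &ve, &vm);
    _mm256_storeu_pd(e, ve);
    _mm256_storeu_pd(m, vm);
}
```

For example, 10.0 splits into exponent 3 and mantissa 1.25, since 10 = 1.25 * 2^3.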
Accuracy and Precision
The desired accuracy and range of inputs influence the implementation. Agner Fog's VCL aims for high precision using error avoidance techniques. However, for faster approximate float log(), consider JRF's polynomial implementation (found here: http://jrfonseca.blogspot.ca/2008/09/fast-sse2-pow-tables-or-polynomials.html).
VCL Algorithm
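VCL's float and double log functions follow a two-part approach: separate the input into its exponent and mantissa, then approximate the logarithm of the mantissa with a polynomial.
The final result is obtained by adding the exponent to the polynomial approximation. VCL includes extra steps to minimize rounding error.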
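The two-part structure can be sketched end to end as follows. This is not VCL's algorithm or its polynomial: to avoid inventing fitted coefficients, the sketch substitutes a truncated atanh series, log2(m) = (2/ln 2) * (s + s^3/3 + s^5/5 + ...) with s = (m-1)/(m+1), whose coefficients are exact known constants. With five terms the truncation error on [1, 2) should stay below roughly 2e-6, far less precise than VCL, and the division makes it slower than a well-fitted polynomial. The names log2_approx_pd, log2_approx4 and AVX2_FMA_FN are hypothetical, and inputs are assumed positive, finite and normal.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang per-function target attribute; with other
   compilers, enable AVX2 and FMA on the command line instead. */
#if defined(__GNUC__)
#define AVX2_FMA_FN __attribute__((target("avx2,fma")))
#else
#define AVX2_FMA_FN
#endif

/* Approximate log2 of four positive, finite, normal doubles. */
AVX2_FMA_FN static inline __m256d log2_approx_pd(__m256d x)
{
    const __m256i mant_mask  = _mm256_set1_epi64x(0x000FFFFFFFFFFFFFLL);
    const __m256i one_bits   = _mm256_set1_epi64x(0x3FF0000000000000LL);
    const __m256i magic_bits = _mm256_set1_epi64x(0x4330000000000000LL); /* 2^52 */
    const __m256d magic_bias = _mm256_set1_pd(4503599627370496.0 + 1023.0);
    const __m256d one        = _mm256_set1_pd(1.0);

    __m256i xi = _mm256_castpd_si256(x);

    /* Part 1: unbiased exponent as a double. */
    __m256d e = _mm256_sub_pd(
        _mm256_castsi256_pd(
            _mm256_or_si256(_mm256_srli_epi64(xi, 52), magic_bits)),
        magic_bias);

    /* Part 2: mantissa in [1.0, 2.0), then s = (m - 1) / (m + 1). */
    __m256d m  = _mm256_castsi256_pd(
        _mm256_or_si256(_mm256_and_si256(xi, mant_mask), one_bits));
    __m256d s  = _mm256_div_pd(_mm256_sub_pd(m, one), _mm256_add_pd(m, one));
    __m256d s2 = _mm256_mul_pd(s, s);

    /* Horner evaluation in s^2 of 1 + s2/3 + s2^2/5 + s2^3/7 + s2^4/9. */
    __m256d p = _mm256_set1_pd(1.0 / 9.0);
    p = _mm256_fmadd_pd(p, s2, _mm256_set1_pd(1.0 / 7.0));
    p = _mm256_fmadd_pd(p, s2, _mm256_set1_pd(1.0 / 5.0));
    p = _mm256_fmadd_pd(p, s2, _mm256_set1_pd(1.0 / 3.0));
    p = _mm256_fmadd_pd(p, s2, one);

    /* log2(m) = (2 / ln 2) * s * p; the final result adds the exponent. */
    __m256d log2m = _mm256_mul_pd(_mm256_mul_pd(s, p),
                                  _mm256_set1_pd(2.8853900817779268));
    return _mm256_add_pd(e, log2m);
}

/* Plain-array wrapper so callers do not need vector types. */
AVX2_FMA_FN void log2_approx4(const double x[4], double out[4])
{
    _mm256_storeu_pd(out, log2_approx_pd(_mm256_loadu_pd(x)));
}
```

Exact powers of two come out exact here, because the mantissa is 1.0 and the series term is zero; other inputs carry the truncation error noted above.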
Alternative Polynomial Approximations
For increased accuracy, you can use VCL directly. However, for a faster approximate log2() implementation for float, consider porting JRF's SSE2 function to AVX2 with FMA.
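A port of an SSE2 polynomial routine to AVX2 with FMA mostly amounts to widening __m128 to __m256 (eight floats per vector) and fusing each multiply-add pair into a single _mm256_fmadd_ps. The sketch below shows only that evaluation step, using Horner's rule. The coefficients are supplied by the caller and are placeholders, not JRF's fitted values; poly_horner_ps, poly_horner8 and AVX2_FMA_FN are hypothetical names.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang per-function target attribute; with other
   compilers, enable AVX2 and FMA on the command line instead. */
#if defined(__GNUC__)
#define AVX2_FMA_FN __attribute__((target("avx2,fma")))
#else
#define AVX2_FMA_FN
#endif

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) for eight floats at once.
   Requires n >= 1. Each Horner step is one fused multiply-add. */
AVX2_FMA_FN static inline __m256 poly_horner_ps(__m256 x, const float *c, int n)
{
    __m256 acc = _mm256_set1_ps(c[n - 1]);
    for (int i = n - 2; i >= 0; --i)
        acc = _mm256_fmadd_ps(acc, x, _mm256_set1_ps(c[i])); /* acc*x + c[i] */
    return acc;
}

/* Plain-array wrapper so callers do not need vector types. */
AVX2_FMA_FN void poly_horner8(const float x[8], const float *c, int n,
                              float out[8])
{
    _mm256_storeu_ps(out, poly_horner_ps(_mm256_loadu_ps(x), c, n));
}
```

In a real port the loop would normally be unrolled with the fitted coefficients as constants, and the polynomial would be applied to the mantissa before adding the exponent.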
Avoiding Rounding Error
VCL uses several techniques to reduce rounding error.
Stripping Unnecessary Steps
If your values are known to be finite and positive, you can significantly improve performance by commenting out the checks for underflow, overflow, and denormals.
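To make the trade-off concrete, here is a sketch of the kind of special-case layer that can be dropped when inputs are known to be finite and positive. It is not VCL's code: log2_fixup_pd, log2_fixup4 and AVX_FN are hypothetical names, and y_fast stands for the output of whatever fast log2 kernel is in use. The layer patches zero, negative, NaN and infinite inputs to the results the C standard specifies for log2(); denormal inputs are not handled.

```c
#include <immintrin.h>
#include <math.h>   /* INFINITY and NAN macros only */

/* Assumption: GCC/Clang per-function target attribute; with other
   compilers, enable AVX on the command line instead. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Overwrite lanes of y_fast whose input x needs a special result:
   x == 0 gives -inf, x == +inf gives +inf, x < 0 or NaN gives NaN.
   Lanes with ordinary positive inputs pass through unchanged. */
AVX_FN static inline __m256d log2_fixup_pd(__m256d x, __m256d y_fast)
{
    const __m256d zero = _mm256_setzero_pd();
    const __m256d inf  = _mm256_set1_pd(INFINITY);

    __m256d is_zero = _mm256_cmp_pd(x, zero, _CMP_EQ_OQ);
    __m256d is_inf  = _mm256_cmp_pd(x, inf, _CMP_EQ_OQ);
    /* "Not greater-or-equal, unordered": true for x < 0 and for NaN. */
    __m256d is_bad  = _mm256_cmp_pd(x, zero, _CMP_NGE_UQ);

    __m256d y = _mm256_blendv_pd(y_fast, _mm256_set1_pd(-INFINITY), is_zero);
    y = _mm256_blendv_pd(y, inf, is_inf);
    y = _mm256_blendv_pd(y, _mm256_set1_pd(NAN), is_bad);
    return y;
}

/* Plain-array wrapper so callers do not need vector types. */
AVX_FN void log2_fixup4(const double x[4], const double y_fast[4],
                        double out[4])
{
    _mm256_storeu_pd(out, log2_fixup_pd(_mm256_loadu_pd(x),
                                        _mm256_loadu_pd(y_fast)));
}
```

Each input vector costs three compares and three blends in this sketch, which is the work saved by skipping the checks.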