Left Packing Problem
Consider an input array and an output array, where only the elements that satisfy a condition should be written to the output, stored contiguously from the start with no gaps ("left packed"). What is the most efficient way to do this with AVX2?
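As a baseline for what any vectorized version has to reproduce, here is a minimal scalar sketch. The function name left_pack_scalar and the predicate (x > threshold) are assumptions chosen for illustration; the article does not specify the condition.

```c
#include <stddef.h>

/* Scalar reference for left packing.
   Copies every element of src that satisfies the (hypothetical) predicate
   x > threshold to the front of dst, preserving order.
   Returns the number of elements written. dst must hold at least n floats. */
size_t left_pack_scalar(const float *src, size_t n, float threshold, float *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        if (src[i] > threshold) {
            dst[out++] = src[i];
        }
    }
    return out;
}
```

The data-dependent increment of out is what makes the loop hard to vectorize directly: each output position depends on how many earlier elements passed the test.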
SSE Approach
The SSE approach uses _mm_movemask_ps to reduce the comparison mask to a 4-bit integer, then uses that integer as an index to load a shuffle control vector from a look-up table (LUT) with _mm_load_si128. Finally, _mm_shuffle_epi8 (an SSSE3 instruction) permutes the register so that the valid elements sit at the front. This works well for 4-wide vectors because the LUT needs only 16 entries.
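The LUT scheme described above might look roughly like the following. This is a sketch under assumptions: the names shuffle_lut, init_shuffle_lut and left_pack4_ssse3 are hypothetical, the predicate x > threshold is only an example, the table is built at run time rather than hard-coded, and the target attribute and __builtin_popcount are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* 16 shuffle-control vectors, one per possible 4-bit mask (hypothetical name). */
static _Alignas(16) uint8_t shuffle_lut[16][16];

/* Build the table once: for each mask, list the byte offsets of the kept
   lanes in order, then fill the tail with 0x80 so pshufb writes zeros. */
void init_shuffle_lut(void)
{
    for (int mask = 0; mask < 16; ++mask) {
        int out = 0;
        for (int lane = 0; lane < 4; ++lane) {
            if (mask & (1 << lane)) {
                for (int b = 0; b < 4; ++b)
                    shuffle_lut[mask][4 * out + b] = (uint8_t)(4 * lane + b);
                ++out;
            }
        }
        for (int b = 4 * out; b < 16; ++b)
            shuffle_lut[mask][b] = 0x80;
    }
}

/* Left-pack one group of 4 floats. Always stores 4 floats to dst;
   only the first `return value` of them are meaningful. */
__attribute__((target("ssse3")))
size_t left_pack4_ssse3(const float *src, float threshold, float *dst)
{
    __m128 v = _mm_loadu_ps(src);
    __m128 keep = _mm_cmpgt_ps(v, _mm_set1_ps(threshold));
    int mask = _mm_movemask_ps(keep);                 /* 4-bit lane mask */
    __m128i ctrl = _mm_load_si128((const __m128i *)shuffle_lut[mask]);
    __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(v), ctrl);
    _mm_storeu_ps(dst, _mm_castsi128_ps(packed));
    return (size_t)__builtin_popcount((unsigned)mask);
}
```

A caller would run init_shuffle_lut() once, then advance the output pointer by the returned count after each group of four inputs.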
AVX Limitations
However, for 8-wide AVX vectors the same LUT would need 256 entries of 32 bytes each, or 8 KiB in total, which takes up a sizeable share of a typical L1 data cache. It is surprising that AVX offers no instruction to simplify this, such as a masked store with packing.
AVX2 Solution
Despite the lack of a dedicated instruction, efficient left packing is possible on AVX2 by combining an AVX2 lane permute with BMI2 bit-manipulation instructions.
Algorithm
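The algorithm computes the shuffle indices at run time instead of loading them from a table: the comparison mask is reduced to an 8-bit integer, BMI2 instructions such as pext turn that integer into a packed list of the selected element indices, and vpermps moves those elements to the front of the register.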
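The article does not spell out the individual steps, so the following is only one plausible formulation that is consistent with the instructions it names (pext and vpermps), plus pdep to widen the mask. The function name left_pack_avx2 and the predicate x > threshold are assumptions for illustration, and the target attribute and __builtin_popcount are GCC/Clang extensions. It requires a 64-bit x86 CPU with AVX2 and BMI2.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Left-pack src into dst using AVX2 + BMI2 (hypothetical helper).
   dst must hold at least n floats; each vector store writes 8 floats,
   of which only the first popcount(mask) are kept. */
__attribute__((target("avx2,bmi2")))
size_t left_pack_avx2(const float *src, size_t n, float threshold, float *dst)
{
    const __m256 limit = _mm256_set1_ps(threshold);
    size_t out = 0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);
        __m256 keep = _mm256_cmp_ps(v, limit, _CMP_GT_OQ);

        /* One bit per lane. */
        unsigned mask = (unsigned)_mm256_movemask_ps(keep);

        /* pdep puts each mask bit in the low bit of its own byte;
           multiplying by 0xFF widens each set bit to a full 0xFF byte. */
        uint64_t byte_mask = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;

        /* pext pulls the byte-sized lane indices 0..7 of the kept lanes
           down to the low end, in order. Unused high bytes become 0. */
        uint64_t idx_bytes = _pext_u64(0x0706050403020100ULL, byte_mask);

        /* vpmovzxbd: zero-extend the 8 byte indices to 8 dword indices. */
        __m256i idx = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)idx_bytes));

        /* vpermps: gather the kept lanes to the front of the register. */
        __m256 packed = _mm256_permutevar8x32_ps(v, idx);

        _mm256_storeu_ps(dst + out, packed);
        out += (size_t)__builtin_popcount(mask);
    }

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; ++i) {
        if (src[i] > threshold)
            dst[out++] = src[i];
    }
    return out;
}
```

Because the full 8-float store overlaps the next iteration's output, the lanes past the kept count are simply overwritten later; this avoids a masked store but means the destination must not be shorter than the input.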
Conclusion
This approach gives an efficient solution for left packing in AVX2. By using pext and other BMI2 instructions to compute the shuffle indices and vpermps to apply them, data can be packed according to a mask with little overhead and without an 8 KiB look-up table.