Using AVX2 and BMI2 for Efficient Left Packing Based on a Mask
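Left packing (also called compressing or filtering) moves the elements of a vector that are selected by a mask to the low end of the vector, keeping their order. AVX2 has no single instruction for this. However, vpermps (_mm256_permutevar8x32_ps) performs a lane-crossing variable shuffle: each 32-bit element of the result can be taken from any of the eight source elements, as chosen by a vector of indices. BMI2 adds pext (parallel bits extract) and its inverse, pdep (parallel bits deposit). pext collects the bits of a 64-bit integer selected by a mask and packs them into the low bits of the result. Together, pdep and pext turn an 8-bit element mask into the index vector that vpermps needs.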
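To make the two BMI2 operations concrete, here is a portable bit-by-bit model of what pext and pdep compute. The names pext_model and pdep_model are hypothetical. These loops are far slower than the single hardware instructions and only illustrate their semantics.

```c
#include <stdint.h>

/* Model of BMI2 pext: walk the set bits of mask from LSB to MSB and copy
   the matching bit of src into the next free low bit of the result. */
static uint64_t pext_model(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    int out = 0;                          /* next result bit to fill */
    for (int i = 0; i < 64; i++) {
        if (mask & (1ULL << i)) {
            if (src & (1ULL << i))
                result |= 1ULL << out;
            out++;
        }
    }
    return result;
}

/* Model of BMI2 pdep (the inverse scatter): consecutive low bits of src are
   deposited, in order, at the positions of the set bits of mask. */
static uint64_t pdep_model(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    int in = 0;                           /* next source bit to consume */
    for (int i = 0; i < 64; i++) {
        if (mask & (1ULL << i)) {
            if (src & (1ULL << in))
                result |= 1ULL << i;
            in++;
        }
    }
    return result;
}
```

For example, pext_model(0x0706050403020100, 0xFF00FF0000000000) returns 0x0705: bytes 5 and 7 are selected and end up packed into the two lowest bytes of the result.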
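Algorithm:
1. Start from an 8-bit mask with one bit per float element (bit i set means keep element i), for example the result of _mm256_movemask_ps applied to a comparison.
2. Expand each mask bit to a full byte. _pdep_u64(mask, 0x0101010101010101) deposits bit i of the mask into bit 0 of byte i. Multiplying by 0xFF then turns every 0x01 byte into 0xFF. No carry crosses a byte boundary, because every byte is 0 or 1.
3. Apply that byte mask with _pext_u64 to the identity constant 0x0706050403020100, whose byte i holds the value i. pext keeps exactly the index bytes of the selected elements and packs them, in order, into the low bytes of the result, with zeros above.
4. Move the eight index bytes into a vector register (_mm_cvtsi64_si128) and zero-extend each byte to a 32-bit element (_mm256_cvtepu8_epi32, i.e. vpmovzxbd).
5. Use that vector as the control operand of _mm256_permutevar8x32_ps (vpermps), which moves the selected elements to the low lanes.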
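The index bytes that the pdep/pext sequence produces can be cross-checked against a plain-C model that needs no special instructions. In this sketch, wanted_indices_ref builds the same packed index bytes with a simple loop, and compress8_ref applies them the way vpermps does. Both names are hypothetical. This is a testing reference, not a replacement for the vector code.

```c
#include <stdint.h>

/* Byte k of the result holds the index of the k-th set bit of the 8-bit
   mask. Unused high bytes stay 0, matching the zeros that pext shifts in. */
static uint64_t wanted_indices_ref(unsigned mask) {
    uint64_t indices = 0;
    int k = 0;                            /* elements selected so far */
    for (int i = 0; i < 8; i++) {
        if (mask & (1u << i)) {
            indices |= (uint64_t)i << (8 * k);
            k++;
        }
    }
    return indices;
}

/* Scalar model of the last two steps: zero-extend each index byte (like
   vpmovzxbd) and gather src[idx] into every output lane (like vpermps).
   Returns the number of selected elements. */
static int compress8_ref(const float src[8], unsigned mask, float dst[8]) {
    uint64_t indices = wanted_indices_ref(mask);
    int count = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned idx = (unsigned)((indices >> (8 * lane)) & 0xFF);
        dst[lane] = src[idx];
        if (mask & (1u << lane))
            count++;
    }
    return count;
}
```

With mask 0xA5 (bits 0, 2, 5 and 7), wanted_indices_ref returns 0x07050200. The first four output lanes receive src[0], src[2], src[5] and src[7], and the remaining lanes receive src[0] because their index bytes are 0.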
Code Implementation:
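```c
#include <stdint.h>
#include <immintrin.h>

// Needs AVX2 + BMI2 (e.g. -mavx2 -mbmi2 with GCC/Clang) and 64-bit mode.
// mask: bit i set means keep element i of src.
__m256 compress256(__m256 src, unsigned int mask)
{
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // mask bit i -> bit 0 of byte i
    expanded_mask *= 0xFF;                                         // 0x01 -> 0xFF in each selected byte

    const uint64_t identity_indices = 0x0706050403020100;          // byte i holds i
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);  // pack selected indices low

    __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);           // move to a vector register
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);              // zero-extend bytes to 32-bit indices
    return _mm256_permutevar8x32_ps(src, shufmask);                // lane k = src[shufmask[k]]
}
```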
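A usage sketch, assuming GCC or Clang on x86-64, that filters an array and keeps the elements greater than zero. The mask comes from _mm256_movemask_ps on a vector comparison, and the output position advances by the popcount of the mask. The helpers filter_positive and filter_positive_avx2 are hypothetical. The per-function target attribute and the __builtin_cpu_supports dispatch are GCC/Clang extensions, used here instead of global -mavx2 -mbmi2 flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* compress256 from above, enabled per function for AVX2 + BMI2. */
__attribute__((target("avx2,bmi2")))
static __m256 compress256(__m256 src, unsigned int mask) {
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;
    uint64_t wanted_indices = _pext_u64(0x0706050403020100ULL, expanded_mask);
    __m256i shufmask = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)wanted_indices));
    return _mm256_permutevar8x32_ps(src, shufmask);
}

/* Copies the floats of in[0..n) that are > 0 to out, in order, and returns
   how many were copied. out needs room for n floats: written <= i and
   i + 8 <= n, so every full 8-float store stays in bounds. */
__attribute__((target("avx2,bmi2")))
static size_t filter_positive_avx2(const float *in, size_t n, float *out) {
    const __m256 zero = _mm256_setzero_ps();
    size_t written = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);
        /* Bit k of mask is set when in[i + k] > 0. */
        unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_cmp_ps(v, zero, _CMP_GT_OQ));
        _mm256_storeu_ps(out + written, compress256(v, mask));
        written += (size_t)__builtin_popcount(mask);  /* advance by the number kept */
    }
    for (; i < n; i++)                                /* scalar tail: last n % 8 elements */
        if (in[i] > 0.0f)
            out[written++] = in[i];
    return written;
}

/* Uses the AVX2 + BMI2 path only when the running CPU supports both. */
static size_t filter_positive(const float *in, size_t n, float *out) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi2"))
        return filter_positive_avx2(in, n, out);
    size_t written = 0;
    for (size_t i = 0; i < n; i++)
        if (in[i] > 0.0f)
            out[written++] = in[i];
    return written;
}
```

Each full-vector store also writes garbage lanes past the kept elements. This is harmless because the next store, or the final count, overwrites or excludes them. It is also why out must be a separate buffer of at least n floats.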
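Advantages:
- No lookup table is needed. The vpermps control vector is computed from the mask on the fly, so there is no table to keep in cache.
- The sequence is short and branch-free. On CPUs with fast pdep/pext (Intel since Haswell, AMD since Zen 3), its cost does not depend on which mask bits are set.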
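Drawbacks:
- It requires both AVX2 and BMI2, so code that must also run on older CPUs needs a fallback path and runtime dispatch.
- On AMD CPUs before Zen 3, pdep and pext are microcoded, with high latency that depends on the mask, which makes this approach slow there.
- _mm_cvtsi64_si128 moves a 64-bit integer into a vector register, so this version only builds for 64-bit mode.
- Lanes past the selected elements are not zeroed. Their index bytes are 0, so they receive copies of src[0].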