Efficient AVX2 Implementation for Packing Left Based on a Mask
Unlike AVX-512, which provides vcompressps, AVX2 has no dedicated instruction for packing elements to the left based on a mask. However, a combination of AVX2 and BMI2 instructions can achieve this efficiently.
Using AVX2 and BMI2
The approach relies on vpermps (_mm256_permutevar8x32_ps), a lane-crossing variable shuffle, together with two BMI2 instructions: pdep (_pdep_u64), which deposits (scatters) bits into the positions selected by a mask, and pext (_pext_u64), which extracts (gathers) the bits selected by a mask.
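To make the two BMI2 operations concrete, here is a minimal portable sketch of what they compute. The names pext_model and pdep_model are hypothetical and exist only for illustration; real code would call _pext_u64 and _pdep_u64, which do the same work in a single instruction.

```c
#include <stdint.h>

/* Portable model of BMI2 pext: gather the bits of src selected by mask
   into the low bits of the result, preserving their order. */
uint64_t pext_model(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    unsigned out = 0;                 /* next free bit in the result */
    for (unsigned bit = 0; bit < 64; ++bit) {
        if ((mask >> bit) & 1) {
            result |= ((src >> bit) & 1) << out;
            ++out;
        }
    }
    return result;
}

/* Portable model of BMI2 pdep: scatter the low bits of src to the
   positions selected by mask, preserving their order. */
uint64_t pdep_model(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    unsigned in = 0;                  /* next unread bit of src */
    for (unsigned bit = 0; bit < 64; ++bit) {
        if ((mask >> bit) & 1) {
            result |= ((src >> in) & 1) << bit;
            ++in;
        }
    }
    return result;
}
```

With the mask 0x0101010101010101, pdep_model places bit i of its input at the bottom of byte i, which is exactly the "one bit per byte" expansion the approach needs.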
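Algorithm Steps
1. Expand each bit of the 8-bit mask into its own byte with pdep.
2. Multiply by 0xFF so that each set bit fills its whole byte.
3. Use pext with that byte mask to extract the wanted indices from the packed identity shuffle 0x0706050403020100.
4. Zero-extend the packed byte indices to 32-bit elements (vpmovzxbd, _mm256_cvtepu8_epi32).
5. Shuffle the source vector with vpermps using those indices.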
Implementation Details
The code below provides an implementation using AVX2 and BMI2:
#include <stdint.h>
#include <immintrin.h>

// mask holds one bit per element, e.g. the result of _mm256_movemask_ps
__m256 compress256(__m256 src, unsigned int mask)
{
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
    expanded_mask *= 0xFF;  // mask |= mask<<1 | mask<<2 | ... | mask<<7;
    // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

    // the identity shuffle for vpermps, packed to one index per byte
    const uint64_t identity_indices = 0x0706050403020100;
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

    __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);  // zero-extend bytes to 32-bit indices

    return _mm256_permutevar8x32_ps(src, shufmask);
}
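When testing compress256, it can help to compare it against a plain scalar version. The following is a minimal sketch of such a reference; the name compress8_ref and its array-based signature are assumptions made for illustration, not part of the original code. It fills the unused tail with src[0], because pext leaves zeros in the unused upper index bytes and index 0 therefore selects the first element.

```c
#include <stddef.h>

/* Hypothetical scalar reference for compress256. Copies the elements of
   src whose mask bit is set to the front of dst, in order. The remaining
   slots get src[0], mirroring the zero indices left in the unused part of
   the shuffle control. dst must not alias src. Returns the packed count. */
size_t compress8_ref(const float src[8], unsigned mask, float dst[8])
{
    size_t count = 0;
    for (unsigned i = 0; i < 8; ++i) {
        if ((mask >> i) & 1u) {
            dst[count++] = src[i];   /* keep selected element, in order */
        }
    }
    for (size_t i = count; i < 8; ++i) {
        dst[i] = src[0];             /* tail: what index 0 would select */
    }
    return count;
}
```

In a filtering loop, the mask typically comes from _mm256_movemask_ps applied to a comparison result, and only the first popcount(mask) output elements are meaningful.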
Performance Analysis
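This implementation takes 6 uops with a latency of about 16 cycles, since each instruction depends on the result of the previous one. When successive iterations are independent, several can be in flight at once, so it can potentially sustain a throughput of one iteration per 4 cycles.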
Alternative Approaches
For AMD CPUs prior to Zen 3, pext/pdep are very slow, so alternative approaches may be preferable. For 16-bit elements, a 128-bit vector approach could be employed. For 8-bit elements, a different technique involving multiple overlapping chunks can be used.
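The text does not say which alternative it has in mind for CPUs with slow pext/pdep. One commonly used option, shown here purely as an assumption, is a lookup table indexed by the 8-bit mask that stores the packed byte indices pext would have produced. The names compress_lut and init_compress_lut are hypothetical.

```c
#include <stdint.h>

/* Hypothetical table: entry m holds the packed byte indices that
   _pext_u64(0x0706050403020100, expanded_mask) would yield for mask m. */
uint64_t compress_lut[256];

/* Fill the table once at startup using plain scalar code. */
void init_compress_lut(void)
{
    for (unsigned mask = 0; mask < 256; ++mask) {
        uint64_t indices = 0;
        unsigned out = 0;                          /* next output byte */
        for (unsigned i = 0; i < 8; ++i) {
            if ((mask >> i) & 1u) {
                indices |= (uint64_t)i << (8 * out);
                ++out;
            }
        }
        compress_lut[mask] = indices;
    }
}
```

With this table, wanted_indices becomes compress_lut[mask], and the rest of the function (_mm_cvtsi64_si128, _mm256_cvtepu8_epi32, _mm256_permutevar8x32_ps) stays the same. The trade-off is a 2 KiB table and a memory load in place of the pdep, multiply, and pext sequence.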