How Can AVX2 and BMI2 Instructions Efficiently Implement Left Packing Based on a Mask?

Susan Sarandon

Released: 2024-12-29 19:34:11

Efficient AVX2 Implementation for Packing Left Based on a Mask

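AVX-512 provides dedicated compress instructions such as vcompressps for packing left based on a mask, but AVX2 has no equivalent. A combination of AVX2 and BMI2 instructions can still achieve this efficiently. "Packing left" here means moving the elements whose mask bit is set to the low end of the vector while keeping their original order.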
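For reference, the operation can be written as a plain scalar loop. The sketch below only illustrates the intended semantics for 8 floats and an 8-bit mask; the name left_pack_scalar is hypothetical, and zero-filling the unused tail is an arbitrary choice made for this example.

```cpp
#include <cstddef>

// Hypothetical scalar reference: copy the elements of src whose mask bit is
// set to the front of dst, preserving their order. Returns how many were kept.
std::size_t left_pack_scalar(const float src[8], unsigned mask, float dst[8])
{
    std::size_t kept = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        if (mask & (1u << i)) {
            dst[kept++] = src[i];
        }
    }
    for (std::size_t i = kept; i < 8; ++i) {
        dst[i] = 0.0f;  // tail is zero-filled here (an arbitrary choice)
    }
    return kept;
}
```

With mask 0xB2 (bits 1, 4, 5 and 7 set), elements 1, 4, 5 and 7 end up in the first four output slots.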

Using AVX2 and BMI2

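The approach relies on two kinds of instructions. From AVX2, vpermps (_mm256_permutevar8x32_ps) performs a lane-crossing variable shuffle of 32-bit elements. From BMI2, pdep (_pdep_u64) deposits contiguous low bits into the positions selected by a mask, and pext (_pext_u64) does the reverse, extracting the selected bits and packing them contiguously at the low end.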
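The two BMI2 operations are easy to confuse, so a bit-by-bit model may help. The functions below are portable stand-ins written for illustration only; pext_model and pdep_model are hypothetical names, not intrinsics, and they are far slower than the hardware instructions.

```cpp
#include <cstdint>

// Portable model of pext: gather the bits of src selected by mask and pack
// them contiguously into the low bits of the result.
std::uint64_t pext_model(std::uint64_t src, std::uint64_t mask)
{
    std::uint64_t result = 0;
    int out = 0;  // next free bit position in the result
    for (int bit = 0; bit < 64; ++bit) {
        if ((mask >> bit) & 1u) {
            result |= ((src >> bit) & 1u) << out;
            ++out;
        }
    }
    return result;
}

// Portable model of pdep: take the low bits of src one at a time and
// scatter them to the positions where mask has a set bit.
std::uint64_t pdep_model(std::uint64_t src, std::uint64_t mask)
{
    std::uint64_t result = 0;
    int in = 0;  // next bit of src to consume
    for (int bit = 0; bit < 64; ++bit) {
        if ((mask >> bit) & 1u) {
            result |= ((src >> in) & 1u) << bit;
            ++in;
        }
    }
    return result;
}
```

For example, depositing the mask 0xB2 into 0x0101010101010101 places one mask bit in the low bit of each byte, which is the first thing compress256 does.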

Algorithm Steps

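Use pdep to spread the 8 mask bits to one bit per byte, then multiply by 0xFF so that each selected byte becomes all ones.
Create a constant holding the identity permutation, packed to one index per byte (0x0706050403020100).
Use pext with the expanded mask to extract the indices of the selected elements and pack them at the low end.
Zero-extend the packed byte indices to one 32-bit index per element, giving the control vector for vpermps.
Perform the variable shuffle using vpermps.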
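The steps can be traced in plain scalar code. The sketch below is a hypothetical model, not the vector implementation: left_pack_indices and permute8 are illustrative names, and the model assumes one index per byte, as the constant 0x0706050403020100 in the implementation does.

```cpp
#include <cstdint>

// Hypothetical scalar model: fills idx[0..7] with the per-element source
// indices that the shuffle would receive for the given 8-bit mask.
void left_pack_indices(unsigned mask, std::uint32_t idx[8])
{
    // One mask bit per byte, then widen each set bit to a full 0xFF byte.
    std::uint64_t expanded = 0;
    for (int i = 0; i < 8; ++i) {
        expanded |= static_cast<std::uint64_t>((mask >> i) & 1u) << (8 * i);
    }
    expanded *= 0xFF;  // no carries: every byte is 0x00 or 0x01 beforehand

    // Identity permutation, one index per byte.
    const std::uint64_t identity = 0x0706050403020100ULL;

    // Keep only the selected bytes, packed toward the low end
    // (what pext does when the mask selects whole bytes).
    std::uint64_t wanted = 0;
    int out = 0;
    for (int i = 0; i < 8; ++i) {
        if ((expanded >> (8 * i)) & 0xFF) {
            wanted |= ((identity >> (8 * i)) & 0xFF) << (8 * out);
            ++out;
        }
    }

    // Zero-extend each byte to a 32-bit index.
    for (int i = 0; i < 8; ++i) {
        idx[i] = static_cast<std::uint32_t>((wanted >> (8 * i)) & 0xFF);
    }
}

// Scalar model of the final shuffle: dst[i] = src[idx[i]].
void permute8(const float src[8], const std::uint32_t idx[8], float dst[8])
{
    for (int i = 0; i < 8; ++i) {
        dst[i] = src[idx[i] & 7u];  // only the low 3 bits of each index matter
    }
}
```

For mask 0xB2 the indices come out as 1, 4, 5, 7 followed by zeros, so the unused upper elements of the result are copies of element 0 rather than zeros.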

Implementation Details

The code below implements this with AVX2 and BMI2 intrinsics. It uses the 64-bit forms of pdep and pext, so it requires an x86-64 target:

#include <immintrin.h>
#include <stdint.h>   // uint64_t

__m256 compress256(__m256 src, unsigned int mask)
{
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);
}

Performance Analysis

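On Intel Haswell, this implementation is 6 uops with 16 cycles of latency from the mask being ready to the shuffled result, because the instructions form a single dependency chain with no instruction-level parallelism. In a loop where the result is not part of a loop-carried dependency, several iterations can be in flight at once, so the limit becomes throughput, potentially about one iteration per 4 cycles.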

Alternative Approaches

On AMD CPUs before Zen 3, pext and pdep are very slow, so alternative approaches may be preferable there. For 16-bit elements, a 128-bit vector approach could be employed. For 8-bit elements, a different technique involving multiple overlapping chunks can be used.
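One commonly suggested substitute for pext on such CPUs is a lookup table indexed by the 8-bit mask. The sketch below is an assumption about how such a table could be built, not something the text above prescribes; build_index_table and its layout (one index per byte in a 64-bit entry, 256 entries, 2 KiB in total) are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: for every 8-bit mask, precompute the packed byte
// indices that the pdep/multiply/pext sequence would produce at run time.
std::array<std::uint64_t, 256> build_index_table()
{
    std::array<std::uint64_t, 256> table{};
    for (unsigned mask = 0; mask < 256; ++mask) {
        std::uint64_t packed = 0;
        int out = 0;  // next free byte slot in the packed entry
        for (unsigned i = 0; i < 8; ++i) {
            if (mask & (1u << i)) {
                packed |= static_cast<std::uint64_t>(i) << (8 * out);
                ++out;
            }
        }
        table[mask] = packed;
    }
    return table;
}
```

An entry could then be passed to _mm_cvtsi64_si128 and _mm256_cvtepu8_epi32 in the same way wanted_indices is in compress256, trading the three BMI2/multiply instructions for a table load. Whether that is faster depends on cache behaviour and would need measuring.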
