How Can AVX2 and BMI2 Instructions Efficiently Implement Left Packing Based on a Mask?

Susan Sarandon
Release: 2024-12-29 19:34:11

Efficient AVX2 Implementation for Packing Left Based on a Mask

Unlike AVX-512, which provides the vcompressps instruction, AVX2 has no dedicated instruction for left-packing the elements of a vector that are selected by a mask. A combination of AVX2 and BMI2 instructions can nevertheless do this efficiently.

Using AVX2 and BMI2

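The approach relies on vpermps (_mm256_permutevar8x32_ps), a lane-crossing variable shuffle from AVX2, together with two BMI2 instructions: pdep (_pdep_u64), which deposits the low bits of a source into the bit positions selected by a mask, and pext (_pext_u64), which extracts the bits selected by a mask and packs them contiguously at the low end of the result.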
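To make the two BMI2 operations concrete, here is a minimal portable sketch of what pdep and pext compute. The names pdep_ref and pext_ref are hypothetical helpers introduced for illustration only; they model the instructions bit by bit and are not meant to be fast.

```cpp
#include <cstdint>

// Hypothetical reference model of pdep: the low bits of `src` are scattered,
// lowest first, into the positions of the set bits of `mask`.
std::uint64_t pdep_ref(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        std::uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit of mask
        if (src & bit) result |= lowest;            // deposit next source bit there
        mask &= mask - 1;                           // clear that mask bit
    }
    return result;
}

// Hypothetical reference model of pext: the bits of `src` at the positions of
// the set bits of `mask` are gathered and packed at the low end of the result.
std::uint64_t pext_ref(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        std::uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit of mask
        if (src & lowest) result |= bit;            // extract it into the next low bit
        mask &= mask - 1;                           // clear that mask bit
    }
    return result;
}
```

For a mask that selects elements 0 and 2 (binary 00000101), depositing into 0x0101010101010101 and multiplying by 0xFF gives the byte mask 0x0000000000FF00FF, and extracting the identity indices 0x0706050403020100 through that byte mask gives 0x0200, that is, index 0 in byte 0 and index 2 in byte 1.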

Algorithm Steps

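  1. Use pdep to spread the 8 mask bits to the low bit of each byte of a 64-bit integer, then multiply by 0xFF so that every selected byte becomes 0xFF.
  2. Use pext with that byte mask to extract, from a constant holding the identity indices 0..7 one per byte, the indices of the selected elements, packed toward the low end.
  3. Move the packed byte indices into a vector register.
  4. Zero-extend each byte to 32 bits to form the control vector for vpermps.
  5. Perform the variable shuffle using vpermps.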
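The result these steps are meant to produce can be stated as a short scalar reference, which is handy as a test oracle for a SIMD version. This is a sketch only: left_pack_ref is a hypothetical name, and filling the unused lanes with src[0] is an assumption that mirrors the zero indices pext leaves in the unused bytes. Callers should treat those lanes as unspecified.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar reference for left packing 8 floats.
// Elements whose mask bit is set move to the front, in their original order.
// Unused trailing lanes are filled with src[0] (assumption, see lead-in).
std::array<float, 8> left_pack_ref(const std::array<float, 8>& src, unsigned mask) {
    std::array<float, 8> out;
    out.fill(src[0]);
    std::size_t next = 0;                 // next free output lane
    for (std::size_t i = 0; i < 8; ++i) {
        if (mask & (1u << i)) {
            out[next] = src[i];
            ++next;
        }
    }
    return out;
}
```

With src = {10, 11, 12, 13, 14, 15, 16, 17} and mask 0xB2 (bits 1, 4, 5 and 7 set), the first four output lanes are 11, 14, 15 and 17.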

Implementation Details

The code below implements this with AVX2 and BMI2 intrinsics for a vector of eight 32-bit floats:

#include <immintrin.h>  // AVX2 and BMI2 intrinsics
#include <stdint.h>     // uint64_t

__m256 compress256(__m256 src, unsigned int mask)
{
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);
}
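A typical use of such a routine is filtering an array: compare, take the movemask, left-pack, store, and advance the output by the number of set mask bits. The sketch below is one possible way to do that, not part of the original article. The names compress256_sketch, filter_greater_avx2 and filter_greater are hypothetical, and the code assumes GCC or Clang on x86-64 (it uses the target attribute, __builtin_popcount and __builtin_cpu_supports, which are compiler extensions).

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Same steps as the routine above. The target attribute (GCC/Clang extension)
// enables AVX2 and BMI2 for this function only.
__attribute__((target("avx2,bmi2")))
static __m256 compress256_sketch(__m256 src, unsigned int mask) {
    std::uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;
    std::uint64_t wanted = _pext_u64(0x0706050403020100ULL, expanded);
    __m128i bytevec = _mm_cvtsi64_si128(static_cast<long long>(wanted));
    return _mm256_permutevar8x32_ps(src, _mm256_cvtepu8_epi32(bytevec));
}

// Hypothetical helper: copies the elements of in[0..n) that are greater than
// `threshold` to `out`, in order. `out` must have room for n floats.
__attribute__((target("avx2,bmi2")))
static std::size_t filter_greater_avx2(const float* in, std::size_t n,
                                       float threshold, float* out) {
    const __m256 limit = _mm256_set1_ps(threshold);
    std::size_t written = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);
        unsigned mask = static_cast<unsigned>(
            _mm256_movemask_ps(_mm256_cmp_ps(v, limit, _CMP_GT_OQ)));
        // Full 8-float store: only the first popcount(mask) lanes are kept,
        // the rest are overwritten by later stores or never counted.
        _mm256_storeu_ps(out + written, compress256_sketch(v, mask));
        written += static_cast<std::size_t>(__builtin_popcount(mask));
    }
    for (; i < n; ++i)  // scalar tail for the last n % 8 elements
        if (in[i] > threshold) out[written++] = in[i];
    return written;
}

// Runtime dispatch: plain scalar loop when the CPU lacks AVX2 or BMI2.
std::size_t filter_greater(const float* in, std::size_t n,
                           float threshold, float* out) {
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi2"))
        return filter_greater_avx2(in, n, threshold, out);
    std::size_t written = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (in[i] > threshold) out[written++] = in[i];
    return written;
}
```

Storing a full vector and then advancing by the popcount keeps the loop branch-free. It is safe here because the number of elements written so far never exceeds the number read, so each 8-float store stays inside an output buffer of n floats.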

Performance Analysis

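This implementation takes 6 uops, with a latency of about 16 cycles from mask to result. These figures assume a CPU with fast pdep/pext. Because the work for different vectors is independent, several iterations can be in flight at once, so a throughput of roughly one vector per 4 cycles is achievable.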

Alternative Approaches

On AMD CPUs before Zen 3, pext and pdep are microcoded and very slow, so a different way of generating the shuffle indices may be preferable there. For 16-bit elements, an approach based on 128-bit vectors can be used. For 8-bit elements, a different technique that processes multiple overlapping chunks is needed.
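One commonly used alternative on CPUs with slow pdep/pext, not spelled out in this article, is a lookup table indexed by the 8-bit mask. The sketch below builds such a table; build_pack_index_table is a hypothetical name. Each entry holds the same packed byte indices that the pdep, multiply and pext sequence would compute for that mask.

```cpp
#include <array>
#include <cstdint>

// Hypothetical table builder. Entry m holds the indices of the set bits of m,
// one index per byte, packed toward the low end (unused bytes are zero).
std::array<std::uint64_t, 256> build_pack_index_table() {
    std::array<std::uint64_t, 256> table{};
    for (unsigned m = 0; m < 256; ++m) {
        std::uint64_t packed = 0;
        unsigned next = 0;                        // next free byte in `packed`
        for (unsigned i = 0; i < 8; ++i) {
            if (m & (1u << i)) {
                packed |= static_cast<std::uint64_t>(i) << (8 * next);
                ++next;
            }
        }
        table[m] = packed;
    }
    return table;
}
```

A table entry would be loaded in place of the pdep/pext result and then fed through the same zero-extension and vpermps steps. The table occupies 2 KiB (256 entries of 8 bytes), so the trade-off is a cache-resident load instead of the bit-manipulation instructions.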


source:php.cn