How to Load 8 Chars into an __m256 Variable as Packed Single Precision Floats?

Patricia Arquette

Release： 2024-11-03 13:21:30

Loading 8 Chars from Memory into an __m256 Variable as Packed Single Precision Floats

When optimizing a Gaussian blur, you may want to replace a float buffer with __m256 variables loaded directly from the 8-bit image data. The question is which instructions load 8 chars from memory and convert them to 8 packed single-precision floats most efficiently.

Instructions for AVX2 Architecture:

Use VPMOVZXBD to zero-extend the 8 chars into 32-bit integers in a 256-bit register.
Convert to float in place with VCVTDQ2PS.

; rsi = new_image
VPMOVZXBD   ymm0,  [rsi]   ; or SX to sign-extend  (Byte to DWord)
VCVTDQ2PS   ymm0, ymm0     ; convert to packed float
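
The two instructions above can be written with intrinsics roughly as follows. This is a minimal sketch, not a definitive implementation: the names load8_u8_as_ps, u8_to_float and the AVX2_FN macro are hypothetical and chosen for illustration. It assumes GCC or Clang, where the target attribute enables AVX2 for the marked functions only; with other compilers, drop the macro and enable AVX2 on the command line.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang. Enables AVX2 per function so no -mavx2 flag is needed.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical helper: 8 unsigned chars -> 8 packed single-precision floats.
AVX2_FN static inline __m256 load8_u8_as_ps(const std::uint8_t* p) {
    // 64-bit load into the low half of an xmm register.
    __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
    // vpmovzxbd: zero-extend 8 bytes to 8 dwords
    // (use _mm256_cvtepi8_epi32 instead to sign-extend).
    __m256i dwords = _mm256_cvtepu8_epi32(bytes);
    // vcvtdq2ps: convert 8 int32 values to 8 floats.
    return _mm256_cvtepi32_ps(dwords);
}

// Hypothetical driver: converts n chars (n must be a multiple of 8).
AVX2_FN void u8_to_float(const std::uint8_t* src, float* dst, std::size_t n) {
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(dst + i, load8_u8_as_ps(src + i));
}
```

In a blur kernel you would normally keep the __m256 returned by load8_u8_as_ps in a register and use it directly; the store in u8_to_float is only there to make the result easy to inspect.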

Additional Strategies:

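If you are converting a lot of data, consider a 128-bit broadcast load (vbroadcasti128) that feeds both vpmovzxbd ymm,xmm for the low 64 bits and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, so that one 16-byte load yields two vectors of 8 dwords. This can reduce the uop count and may also be beneficial on Ryzen CPUs. It needs a shuffle-control vector in a register, so it pays off only when that constant is reused across many iterations.
Do not load a wider vector and then use extra shuffles to feed further vpmovzx instructions: vpmovzx is itself a shuffle, so shuffle throughput is likely to be the bottleneck already.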
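
The broadcast-load strategy might look like the following with intrinsics. This is a sketch under assumptions rather than a tuned implementation: u8_to_float_bcast, hi_ctrl and AVX2_FN are hypothetical names, GCC or Clang is assumed for the target attribute, and whether the compiler actually emits vbroadcasti128 with a memory operand should be checked in the generated asm.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang. Enables AVX2 per function so no -mavx2 flag is needed.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical driver: converts n chars (n must be a multiple of 16),
// 16 chars per iteration from a single broadcast load.
AVX2_FN void u8_to_float_bcast(const std::uint8_t* src, float* dst, std::size_t n) {
    assert(n % 16 == 0);
    // Shuffle control, built once and reused. vpshufb works within each
    // 128-bit lane; a control byte with its high bit set (-1) writes zero.
    // Low lane selects bytes 8..11, high lane selects bytes 12..15.
    const __m256i hi_ctrl = _mm256_setr_epi8(
         8, -1, -1, -1,   9, -1, -1, -1,  10, -1, -1, -1,  11, -1, -1, -1,
        12, -1, -1, -1,  13, -1, -1, -1,  14, -1, -1, -1,  15, -1, -1, -1);
    for (std::size_t i = 0; i < n; i += 16) {
        // The same 16 bytes end up in both 128-bit lanes.
        __m256i both = _mm256_broadcastsi128_si256(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i)));
        // Bytes 0..7: vpmovzxbd ymm,xmm on the low lane.
        __m256i lo = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(both));
        // Bytes 8..15: vpshufb zero-extends them in place.
        __m256i hi = _mm256_shuffle_epi8(both, hi_ctrl);
        _mm256_storeu_ps(dst + i,     _mm256_cvtepi32_ps(lo));
        _mm256_storeu_ps(dst + i + 8, _mm256_cvtepi32_ps(hi));
    }
}
```

The vpshufb step works because the broadcast puts all 16 source bytes in each lane, so every lane can reach the bytes it needs without a lane-crossing shuffle.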

Instructions for AVX1 Architecture:

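With AVX1 only, VPMOVZXBD is limited to 128-bit xmm destinations, so zero-extend each group of 4 chars separately and merge the two halves before converting: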

VPMOVZXBD   xmm0,  [rsi]
VPMOVZXBD   xmm1,  [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1   ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS   ymm0, ymm0     ; convert to packed float
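
One possible intrinsics rendering of the AVX1 sequence above is sketched below. The names load4_u8_as_epi32, load8_u8_as_ps_avx1, u8_to_float_avx1 and AVX_FN are hypothetical, GCC or Clang is assumed for the target attribute, and it is not guaranteed that the compiler folds each 4-byte load into the memory operand of vpmovzxbd.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: GCC/Clang. Enables AVX (which implies SSE4.1) per function.
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Hypothetical helper: 4 chars -> 4 zero-extended dwords in an xmm register.
AVX_FN static inline __m128i load4_u8_as_epi32(const std::uint8_t* p) {
    std::int32_t raw;
    std::memcpy(&raw, p, sizeof raw);                  // unaligned, aliasing-safe 32-bit load
    return _mm_cvtepu8_epi32(_mm_cvtsi32_si128(raw));  // vpmovzxbd xmm
}

// Hypothetical helper: 8 chars -> 8 packed floats without AVX2.
AVX_FN static inline __m256 load8_u8_as_ps_avx1(const std::uint8_t* p) {
    __m128i lo = load4_u8_as_epi32(p);
    __m128i hi = load4_u8_as_epi32(p + 4);
    // vinsertf128: place the second group in the high 128 bits.
    __m256i v = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    return _mm256_cvtepi32_ps(v);                      // vcvtdq2ps ymm
}

// Hypothetical driver: converts n chars (n must be a multiple of 8).
AVX_FN void u8_to_float_avx1(const std::uint8_t* src, float* dst, std::size_t n) {
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(dst + i, load8_u8_as_ps_avx1(src + i));
}
```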

Intrinsics Considerations:

GCC and MSVC may need special handling to generate optimal code for VPMOVZXBD ymm,[mem] from intrinsics, because there is no intrinsic that takes a memory operand directly: the source of _mm256_cvtepu8_epi32 is an __m128i.
The usual idiom is to load the 8 bytes with _mm_loadl_epi64 and pass the result to _mm256_cvtepu8_epi32. GCC 9 and later fold that load into the memory operand at -O3, which gives the optimal asm.
For AVX1 only, writing the intrinsics version is a tedious exercise, because _mm256_cvtepu8_epi32 requires AVX2.
