
Why is 64-bit Popcount Slower Than 32-bit on Intel CPUs Due to False Data Dependencies?

Susan Sarandon
Release: 2024-12-09 22:19:11

False Data Dependency Impacts Popcount Performance on Intel CPUs

Issue:

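You observed a large performance gap in a popcount loop on Intel CPUs depending on whether the loop counter was 32-bit (unsigned) or 64-bit (uint64_t). Throughput dropped by roughly 50% with the 64-bit counter, which was initially attributed to a compiler bug. The counter type is not the real cause: it only changes how the compiler allocates registers, and that in turn determines how hard the loop is hit by the false dependency explained in the next section.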

Explanation: False Data Dependency

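On the affected Intel microarchitectures (reported for Sandy Bridge, Ivy Bridge, and Haswell), the popcnt instruction has a false dependency on its destination register. popcnt only writes the destination, but the CPU still waits for the previous value of that register to be ready before it executes the instruction. In a tight loop, this can link otherwise independent popcnt operations into one chain that is carried from one iteration to the next, which prevents the processor from overlapping iterations. How long that chain is depends on which registers the compiler happens to assign, and this explains the performance variations.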

Consequences of the Dependency:

  • Different Registers: When each popcnt writes to its own register (for example, the same register it reads from), the false dependency does not link the popcnt operations to one another, and the loop runs at close to full speed.
  • Same Register: When all popcnt operations write to the same destination register, each one must wait for the previous one, and the chain carries over into the next iteration. This is the slow case.
  • Broken Dependency Chain: Zeroing the shared destination register (e.g., an xor of the register with itself) before each popcnt breaks the chain and restores most of the performance, because the processor can again overlap the operations and the loop iterations.
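To make the three cases concrete, here is a minimal sketch of the kind of unrolled loop in which the effect shows up. It is not the code from the original question: the function name count_bits_unrolled is hypothetical, it uses the GCC/Clang builtin __builtin_popcountll, and it assumes the element count is a multiple of 4. In the C++ source the four counts are independent. Whether the compiled loop is fast or slow depends on which destination registers the compiler assigns to the four popcnt instructions, which the source does not control.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: counts the set bits in buf[0..n), four words per iteration.
// Assumption: n is a multiple of 4 (keeps the sketch short).
// Assumption: GCC or Clang, which provide __builtin_popcountll.
std::uint64_t count_bits_unrolled(const std::uint64_t* buf, std::size_t n) {
    // Four independent accumulators: nothing in the source ties one
    // popcount to another, so in principle they can run in parallel.
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        c0 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 0]));
        c1 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 1]));
        c2 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 2]));
        c3 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 3]));
    }
    return c0 + c1 + c2 + c3;
}
```

If the compiler reuses one destination register for several of these popcnt instructions, the affected CPUs treat them as a serial chain even though the C++ code expresses no such ordering. Inspecting the generated assembly is the only reliable way to tell which case you got.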

Compiler Behavior:

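At the time of the original analysis, none of the major compilers (GCC, Clang, Visual Studio, ICC) were aware of this false dependency, so performance depended unpredictably on register allocation. Newer GCC releases reportedly compensate for it when optimizations are enabled, so check the generated assembly for your compiler version rather than assuming either behavior.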

AMD Performance:

AMD processors do not appear to have this false dependency, so the same popcount loops are not subject to this particular slowdown there.

Mitigations:

  • Inline Assembly: Writing the hot loop in inline assembly gives you control over register assignment, which works around the compiler's unawareness of the dependency.
  • Breaking the Dependency Chain: Zeroing the destination register (e.g., with xor) before each popcnt operation breaks the false dependency and improves performance.
  • Using Different Registers: Giving consecutive popcnt operations different destination registers mitigates the issue, but the compiler's register allocator makes that choice, so it may not always be possible from C++ source.
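The first two mitigations can be combined in a small wrapper. The following is a sketch, not a definitive implementation: the name popcnt_no_false_dep is hypothetical, the inline assembly assumes GCC or Clang on x86-64 with the default AT&T syntax, and other targets fall back to the compiler builtin. The xor is a zeroing idiom, which the CPU recognizes as independent of the register's previous value, so the popcnt that follows no longer waits on whatever last wrote that register.

```cpp
#include <cstdint>

// Hypothetical wrapper: popcnt preceded by a dependency-breaking xor.
// Assumptions: GCC/Clang extended asm, x86-64, CPU with popcnt support.
inline std::uint64_t popcnt_no_false_dep(std::uint64_t x) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::uint64_t result;
    __asm__(
        "xor %k0, %k0\n\t"  // zero the destination (32-bit form clears all 64 bits)
        "popcnt %1, %0"     // AT&T order: source first, destination second
        : "=&r"(result)     // early-clobber: written before the input is read
        : "r"(x)
        : "cc");            // both instructions modify the flags
    return result;
#else
    // Fallback for other targets; no dependency-breaking is attempted here.
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#endif
}
```

The early-clobber constraint matters: without it the compiler could place the input and the output in the same register, and the xor would destroy the input before popcnt reads it. Whether the wrapper helps on a given CPU and compiler is something to measure, since a compiler that already breaks the dependency gains nothing from the extra xor.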


source:php.cn