False Data Dependency Impacts Popcount Performance on Intel CPUs
Issue:
You observed that a tight popcount loop on Intel CPUs ran at very different speeds depending on whether its loop counter was 32-bit or 64-bit. Throughput dropped by about 50% with the 64-bit counter, which was initially attributed to a compiler bug.
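A minimal sketch of the kind of loop in question, assuming a buffer of 64-bit words processed four at a time. The function name count_bits, the Counter template parameter, and the unroll factor are illustrative choices, not taken from the original code. It relies on the GCC/Clang builtin __builtin_popcountll; building with -mpopcnt (or -march=native) should let that builtin compile to a popcnt instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Counts the set bits in buf, using Counter as the loop index type.
// Counter is the only thing that differs between the two variants.
template <typename Counter>
std::uint64_t count_bits(const std::vector<std::uint64_t>& buf) {
    // Assumptions of this sketch: the length is a multiple of the unroll
    // factor and fits in the chosen counter type.
    assert(buf.size() % 4 == 0);
    assert(buf.size() <= std::numeric_limits<Counter>::max());

    const Counter n = static_cast<Counter>(buf.size());
    std::uint64_t total = 0;
    for (Counter i = 0; i < n; i += 4) {
        // Four independent popcounts per iteration.
        total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 0]));
        total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 1]));
        total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 2]));
        total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 3]));
    }
    return total;
}
```

Instantiating it as count_bits<std::uint32_t> and count_bits<std::uint64_t> gives the two variants. Both return the same count, so any speed difference comes from the machine code the compiler generates for each.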
Explanation: False Data Dependency
On affected Intel CPUs, popcnt has a false dependency on its destination register. The instruction only writes the destination, but the CPU treats it as if it also read it, so a popcnt cannot execute until the previous write to that register has completed. In a tight loop this dependency can carry over from one iteration to the next. How many popcnt instructions end up chained together depends on which registers the compiler assigns to them, and that is what produces the performance variations.
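The effect of register assignment can be illustrated with a toy model. This is only a rough sketch, not a CPU simulator: it assumes a popcnt latency of 3 cycles and one popcnt issued per cycle, ignores every other instruction in the loop, and the function name cycles_per_iteration is made up for this example. Each entry in dest_regs is the (arbitrary) number of the register that one popcnt in the loop body writes to.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Toy estimate of cycles per loop iteration, looking at popcnt only.
// latency and issue_cycles are assumed figures, not measured ones.
int cycles_per_iteration(const std::vector<int>& dest_regs,
                         int latency = 3, int issue_cycles = 1) {
    assert(!dest_regs.empty());

    // With the false dependency, all popcnt instructions that write the
    // same register form one serial chain, which also wraps around into
    // the next iteration.
    std::map<int, int> writes_per_reg;
    for (int reg : dest_regs) {
        ++writes_per_reg[reg];
    }
    int longest_chain = 0;
    for (const auto& entry : writes_per_reg) {
        longest_chain = std::max(longest_chain, entry.second);
    }

    const int throughput_bound =
        static_cast<int>(dest_regs.size()) * issue_cycles;
    const int dependency_bound = longest_chain * latency;
    return std::max(throughput_bound, dependency_bound);
}
```

Under these assumptions, four popcnt instructions that all write the same register are limited by the dependency chain, while four that write different registers are limited only by issue throughput.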
Consequences of the Dependency:
Compiler Behavior:
Neither GCC nor Visual Studio is aware of this false dependency, so performance varies unpredictably with how registers happen to be allocated. Other compilers, such as Clang and ICC, also lack this knowledge.
AMD Performance:
AMD processors do not appear to have this false dependency, contributing to their higher performance in popcount operations.
Mitigations:
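One possible workaround, sketched here under the assumption of GCC or Clang on x86-64, is to zero the destination register with xor immediately before each popcnt, so the instruction never waits on an older value of that register. The helper names popcnt_no_false_dep and count_bits_fixed are hypothetical, and the inline assembly is one way to express the idea rather than a recommended final form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: popcount with the destination register zeroed first.
static inline std::uint64_t popcnt_no_false_dep(std::uint64_t x) {
#if defined(__x86_64__)
    std::uint64_t result;
    // "=&r" (early clobber) keeps result in a different register from x,
    // because the xor writes result before x is read.
    __asm__("xorl %k0, %k0\n\t"   // zeroing idiom, breaks the dependency
            "popcntq %1, %0"      // result = number of set bits in x
            : "=&r"(result)
            : "r"(x)
            : "cc");
    return result;
#else
    // Fallback for other targets, where this false dependency is not the issue.
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#endif
}

// Sums the set bits of n words using the helper above.
std::uint64_t count_bits_fixed(const std::uint64_t* words, std::size_t n) {
    assert(words != nullptr || n == 0);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += popcnt_no_false_dep(words[i]);
    }
    return total;
}
```

Whether this helps in a given program depends on the CPU and on the code the compiler would otherwise generate, so it is worth benchmarking both versions before adopting it.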