Why does changing a loop counter from `unsigned` to `uint64_t` significantly impact the performance of `_mm_popcnt_u64` on x86 CPUs, and how do compiler optimization and variable declaration affect this performance difference?

Exploring the unusual performance difference between a u64 loop counter and _mm_popcnt_u64 on x86 CPUs

Introduction

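I was looking for the fastest way to popcount large arrays of data when I ran into a very strange effect: changing the loop variable from unsigned to uint64_t made performance drop by 50% on my PC.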

The benchmark

#include <iostream>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstdlib>   // atol, rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // allocated with new[], so release with delete[] rather than free()
}

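As you can see, we create a buffer of random data whose size is x megabytes, where x is read from the command line. We then iterate over the buffer and perform the popcount with an unrolled version of the x86 popcount intrinsic. To get a more precise result, we do the popcount 10,000 times and measure the time it takes. In the first case the inner loop variable is unsigned; in the second case it is uint64_t. I thought this should make no difference, but the opposite is the case.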
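
The timing pattern described above (take a time point, run the work, take another time point, convert the difference to nanoseconds) can be factored into a small helper. The following is only a sketch: `time_ns` is a hypothetical name that does not appear in the benchmark, and it uses `std::chrono::steady_clock` instead of the benchmark's `system_clock`, on the assumption that a monotonic clock is preferable for measuring intervals.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of the original benchmark): runs `work` once
// and returns the elapsed time in nanoseconds.
// Assumption: steady_clock is swapped in for system_clock because it is
// monotonic and cannot jump if the system time is adjusted mid-measurement.
template <typename F>
std::uint64_t time_ns(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
}
```

Each of the two measured blocks could then be wrapped in a lambda and passed to `time_ns`, which keeps the bookkeeping identical for both loop variants.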

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50GHz, running test 1 (so 1 MB of random data):

unsigned 41959360000 0.401554 sec 26.113 GB/s
uint64_t 41959360000 0.759822 sec 13.8003 GB/s

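As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? At first I suspected a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):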

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Results of test 1:

unsigned 41959360000 0.398293 sec 26.3267 GB/s
uint64_t 41959360000 0.680954 sec 15.3986 GB/s

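So the result is almost the same, and still strange. But now it gets really strange. I replaced the buffer size that was read from input with a constant 1, so I changed: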

uint64_t size = atol(argv[1]) << 20;

to:

uint64_t size = 1 << 20;

The compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

unsigned 41959360000 0.509156 sec 20.5944 GB/s
uint64_t 41959360000 0.508673 sec 20.6139 GB/s

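Now both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 GB/s to 20 GB/s, so replacing a non-constant with a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version: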

uint64_t size = atol(argv[1]) << 20;

to:

uint64_t size = 1 << 20;

Results:

unsigned 41959360000 0.677009 sec 15.4884 GB/s
uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what? Now both versions dropped to the slow figure of 15 GB/s. So with Clang, replacing a non-constant with a constant value even led to slow code in both cases!

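I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so this does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.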
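
Since the difference appears to come from the generated assembly, one way to compare the two loops is to move each into its own non-inlined function and inspect the output of the same compile command with `-S` added. The following is a sketch under assumptions rather than the original code: the function names are hypothetical, `__attribute__((noinline))` is GCC/Clang-specific, and `__builtin_popcountll` is used in place of `_mm_popcnt_u64` so the sketch also builds without popcnt support enabled. With `-march=native` on a CPU that has the instruction, both should compile to `popcnt`.

```cpp
#include <cstdint>

// Hypothetical helpers; the original benchmark keeps both loops inside main().
// Like the original loops, both assume n is a multiple of 4.

// Variant with a 32-bit unsigned index. Note: i wraps at 2^32, so this is
// only valid for n below that limit.
__attribute__((noinline))
std::uint64_t popcount_unsigned_index(const std::uint64_t* buffer, std::uint64_t n) {
    std::uint64_t count = 0;
    for (unsigned i = 0; i < n; i += 4) {
        count += __builtin_popcountll(buffer[i]);
        count += __builtin_popcountll(buffer[i + 1]);
        count += __builtin_popcountll(buffer[i + 2]);
        count += __builtin_popcountll(buffer[i + 3]);
    }
    return count;
}

// Variant with a 64-bit index; the loop body is otherwise identical.
__attribute__((noinline))
std::uint64_t popcount_uint64_index(const std::uint64_t* buffer, std::uint64_t n) {
    std::uint64_t count = 0;
    for (std::uint64_t i = 0; i < n; i += 4) {
        count += __builtin_popcountll(buffer[i]);
        count += __builtin_popcountll(buffer[i + 1]);
        count += __builtin_popcountll(buffer[i + 2]);
        count += __builtin_popcountll(buffer[i + 3]);
    }
    return count;
}
```

Keeping the two loops in separate functions makes each one show up under its own label in the assembly listing, so register allocation and instruction order can be compared side by side. Isolating the loops may itself change what the compiler generates, so the timings should be rechecked rather than assumed to match the original.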

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

#include <iostream>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstdlib>   // atol, rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    static uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // allocated with new[], so release with delete[] rather than free()
}

Here are my results with g++: