c++ - How to speed up reading 50 million doubles from a file and storing them in a vector using C++?
PHP中文网 2017-05-31 10:36:40

I need to read 50 million doubles from a txt file and store them in a vector. I initially assumed file I/O would be the bottleneck, so I used file memory mapping to read the file contents into memory in blocks and then push_back the values into the vector one by one. But reading the values directly from the file one at a time takes only 3 minutes, while my "optimized" version takes 5 minutes.

My optimization plan was to read the entire file into a char* buffer in memory, and to call vec_name.reserve(50000000); to reserve capacity for 50 million elements up front and avoid repeated reallocation, but it seems to have had no effect.

Is the time mainly being spent in push_back?

Is there any good optimization method? Thank you all!
The key optimized code is as follows (it takes five minutes to read all the data into the vector):

        
        ifstream iVecSim("input.txt");

        iVecSim.seekg(0, iVecSim.end);
        long long file_size = iVecSim.tellg(); // file size in bytes
        iVecSim.seekg(0, iVecSim.beg);

        char *buffer = new char[file_size];
        iVecSim.read(buffer, file_size);

        string input(buffer, file_size); // pass the length: buffer is not NUL-terminated
        delete[] buffer;

        istringstream ss_sim(input); // string stream over the whole file

        string fVecSim;
        vec_similarity.reserve(50000000);
        while (ss_sim >> fVecSim) { // extract tokens from the string stream into the vector
            vec_similarity.push_back(atof(fVecSim.c_str()));
        }
Replies (4)
漂亮男人

Benchmarking in debug mode is meaningless. Running your code in release mode takes only about 14 seconds on my machine.

To solve a problem, first locate it. I modified the code as follows to find out where the time is actually spent:

    // GetTickCount() is a Windows API (#include <windows.h>) returning milliseconds
    std::cout << "Start" << std::endl;
    auto n1 = ::GetTickCount();
    auto n2 = 0;
    auto n3 = 0;
    auto n4 = 0;

    while (ss_sim.good())
    {
        auto n = ::GetTickCount();
        ss_sim >> fVecSim;
        n2 += (::GetTickCount() - n);

        n = ::GetTickCount();
        auto v = atof(fVecSim.c_str());
        n3 += (::GetTickCount() - n);

        n = ::GetTickCount();
        vec_similarity.push_back(v);
        n4 += (::GetTickCount() - n);
    }
    n1 = ::GetTickCount() - n1;

    std::cout << "ss_sim >> fVecSim:" << n2 << "ms" << std::endl;
    std::cout << "atof:" << n3 << "ms" << std::endl;
    std::cout << "push_back:" << n4 << "ms" << std::endl;
    std::cout << "Total:" << n1 << "ms" << std::endl;

So the bottleneck is the line "ss_sim >> fVecSim"; atof itself is fast enough.

So my conclusion is: the ultimate optimization is to change the storage format and store your data as binary instead of text. That avoids both the string I/O and the conversion functions, and really does get the load time down to seconds.

phpcn_u1582

The most efficient way is still to use streams, but your code shows that you read the entire file into the buffer at once, which is not ideal. It is better to read a fixed-size chunk each time, e.g. buffer[1024] (1 KiB) or some other size; after each read, advance through the data and continue reading until EOF.

Peter_Zhu

1. If there are no dependencies between the data, you can try reading the file in blocks with multiple threads;
2. Also, a vector's memory is contiguous. If the later traversal is not random access, a list can be noticeably more efficient.

Peter_Zhu

You could try switching to C-style scanf.


Wow, why is my answer being treated like this? To the user who reported me: what exactly is wrong with this answer?
