I have several hundred megabytes of CSV files, and I often use pandas and matplotlib to read and plot the data. Before plotting, I usually have to preprocess, slice, and otherwise clean the data. Because I frequently need to interact with the figures and present them, I work in Jupyter Notebook with %matplotlib notebook. Should I save the intermediate data derived from the original files as CSV, so that I can read it back directly the next time I need to display it, or should I save it with pickle, since reading a pickle is faster for later use?
CSV is the safe choice. It seems that a pickle written under one Python version may fail to load under another, so pickle is not a portable format. At a few hundred megabytes, reading CSV is not actually slow. There is also HDF5; these are proper data exchange formats.
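To check the speed claim on your own data, here is a minimal sketch that writes the same frame as CSV and as pickle and times reading each back. The frame, the file names, and the `timed_read` helper are all made up for illustration; swap in your real intermediate data.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for the preprocessed intermediate data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "t": np.arange(100_000),
    "value": rng.normal(size=100_000),
})


def timed_read(reader, path):
    """Load a file with `reader` and return (frame, elapsed seconds)."""
    start = time.perf_counter()
    frame = reader(path)
    return frame, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "intermediate.csv"   # hypothetical file name
    pkl_path = Path(tmp) / "intermediate.pkl"   # hypothetical file name

    # Save the same intermediate result in both formats.
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    # Read each one back and measure how long it takes.
    from_csv, csv_seconds = timed_read(pd.read_csv, csv_path)
    from_pkl, pkl_seconds = timed_read(pd.read_pickle, pkl_path)

print(f"read_csv:    {csv_seconds:.3f} s")
print(f"read_pickle: {pkl_seconds:.3f} s")
```

The timings depend on the machine and on the shape of the data, so treat them as a rough guide only. If the CSV read is already fast enough for interactive use, the portability argument above favors keeping CSV.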
CSV is enough. If you find it is not fast enough, you can try an HDF5 file.
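If you want to try HDF5 from pandas, a rough sketch could look like the following. It assumes the optional PyTables package (`tables`) is installed, which pandas needs for HDF5. The `roundtrip_hdf` helper, the key name, and the file name are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def roundtrip_hdf(frame, path, key="intermediate"):
    """Write `frame` to an HDF5 file and read it back.

    Returns None when PyTables is not installed, because pandas
    raises ImportError in that case.
    """
    try:
        frame.to_hdf(str(path), key=key, mode="w")
    except ImportError:
        return None
    return pd.read_hdf(str(path), key=key)


# Hypothetical stand-in for the preprocessed intermediate data.
df = pd.DataFrame({"t": [0, 1, 2], "value": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    loaded = roundtrip_hdf(df, Path(tmp) / "intermediate.h5")

if loaded is None:
    print("PyTables is not installed, so HDF5 is unavailable here")
else:
    print(loaded)
```

Unlike CSV, HDF5 stores column dtypes, so the frame you read back does not need to be re-parsed or re-typed.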