Splitting Large Pandas DataFrames
When working with large datasets in Pandas, it is often necessary to split a dataframe into smaller chunks for processing or distribution. However, np.split raises a ValueError when the number of rows cannot be divided evenly by the requested number of sections.
Using np.array_split
The np.array_split function provides a more flexible approach for splitting arrays, including dataframes, into sections. Unlike np.split, it allows the number of sections to be an integer that does not evenly divide the axis.
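The difference between the two functions is easiest to see on a plain NumPy array. The following is a minimal sketch; the 10-element array and the variable names are illustrative assumptions, not part of the original example:

```python
import numpy as np

# Illustrative data: 10 elements cannot be divided evenly into 4 sections.
arr = np.arange(10)

# np.split insists on equal sections and raises ValueError otherwise.
try:
    np.split(arr, 4)
    split_error = None
except ValueError as exc:
    split_error = exc

# np.array_split accepts the same arguments and returns near-equal sections.
parts = np.array_split(arr, 4)
sizes = [len(p) for p in parts]
print(sizes)  # [3, 3, 2, 2]
```

The first two sections receive the extra elements, so the section sizes differ by at most one.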
Consider a dataframe with 423244 rows that we wish to split into 4 groups. The small 4-row dataframe below stands in for it so that the example is easy to run:
<code class="python">In [1]: import numpy as np
   ...: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'A': ['foo', 'bar', 'foo', 'bar'],
   ...:     'B': ['one', 'one', 'two', 'three'],
   ...:     'C': np.random.rand(4),  # 4 random floats in [0, 1)
   ...:     'D': np.random.rand(4)
   ...: })

In [3]: print(df)</code>
To split the dataframe into 4 groups, pass it to np.array_split together with the number of sections:
<code class="python">In [4]: import numpy as np

In [5]: sections = np.array_split(df, 4)</code>
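The sections variable now contains a list of 4 dataframes. With the 4-row sample above, each holds 1 row; for the 423244-row dataframe, each would hold exactly 105811 rows, since 423244 is evenly divisible by 4.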
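An alternative that avoids handing the dataframe itself to NumPy is to split the row positions and then select each group with iloc. This is a sketch under stated assumptions: split_frame is a hypothetical helper name, and the 10-row dataframe is made-up data. It may be worth considering because some recent pandas releases reportedly emit a deprecation warning when a dataframe is passed straight to np.array_split.

```python
import numpy as np
import pandas as pd


def split_frame(df, n_sections):
    """Hypothetical helper: split df row-wise into n_sections near-equal pieces."""
    # Split the row positions 0..len(df)-1 rather than the dataframe itself.
    index_groups = np.array_split(np.arange(len(df)), n_sections)
    # Select each group of positions; the original index labels are preserved.
    return [df.iloc[idx] for idx in index_groups]


# Illustrative data: 10 rows, which 4 does not divide evenly.
df = pd.DataFrame({"A": range(10), "B": list("abcdefghij")})
sections = split_frame(df, 4)
section_sizes = [len(s) for s in sections]
print(section_sizes)  # [3, 3, 2, 2]
```

Concatenating the pieces with pd.concat(sections) gives back the original rows in their original order.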
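When dealing with large dataframes, it is important to consider the computational cost and memory requirements of different splitting methods. Note that np.array_split does not produce arbitrary section sizes: for an axis of length l split into n sections, the first l % n sections contain l // n + 1 rows and the remaining sections contain l // n rows, so sizes differ by at most one.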
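When the goal is to process pieces one at a time rather than hold a full list of sections, a generator over fixed-size row ranges is one option. This is only a sketch: iter_row_chunks and the chunk size are hypothetical, and it fixes the number of rows per chunk instead of the number of sections as np.array_split does.

```python
import pandas as pd


def iter_row_chunks(df, chunk_rows):
    """Hypothetical helper: yield consecutive slices of at most chunk_rows rows."""
    for start in range(0, len(df), chunk_rows):
        # Positional slicing; the last chunk may be shorter than chunk_rows.
        yield df.iloc[start:start + chunk_rows]


# Illustrative data: 10 rows processed in chunks of 4.
df = pd.DataFrame({"A": range(10)})
chunk_sizes = [len(chunk) for chunk in iter_row_chunks(df, 4)]
print(chunk_sizes)  # [4, 4, 2]
```

Because the chunks are produced lazily, each one can be processed and discarded before the next is created.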