Pandas users commonly encounter the need to create new columns based on existing ones. Two popular methods for this task are Pandas' apply function and NumPy's vectorize. However, the speed difference between these approaches is a question that has not been thoroughly examined.
Based on observations and experiments, it is expected that np.vectorize is significantly faster than df.apply, particularly for larger datasets.
The primary reason for the performance gap lies in the nature of each approach.
When called with axis=1, df.apply iterates over each row in the DataFrame and evaluates the given function on it. Each row is passed to the function as a Pandas Series object, and building these objects carries significant overhead because every one has its own index, values, and attributes.
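To see where that overhead comes from, the following minimal sketch runs a small function through df.apply with axis=1 and records the type of the argument it receives on each call. The DataFrame, the column names A and B, and the function divide_or_zero are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data; the column names "A" and "B" are assumptions for illustration.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 2, 6, 8]})

seen_types = set()

def divide_or_zero(row):
    # With axis=1, each call receives one row packaged as a pandas Series.
    seen_types.add(type(row).__name__)
    return row["A"] / row["B"] if row["B"] != 0 else 0.0

# One Series is built, and one Python call is made, per row.
df["C"] = df.apply(divide_or_zero, axis=1)

print(seen_types)
print(df["C"].tolist())
```

Every call receives a full Series, so the per-row cost covers constructing that object as well as running the function.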
On the other hand, np.vectorize wraps the input function so that it accepts NumPy arrays and follows NumPy's broadcasting rules. It still calls the Python function once per element, but it passes plain values taken from the arrays rather than building a Series for each row, which removes most of the Pandas overhead.
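As a rough sketch of this usage, the example below wraps a hypothetical scalar function with np.vectorize and calls it on whole columns. The data, the column names, and divide_or_zero are assumptions for illustration. The otypes argument is set explicitly so that the output dtype does not depend on whatever the first call happens to return.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names "A" and "B" are assumptions for illustration.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 2, 6, 8]})

seen_types = set()

def divide_or_zero(a, b):
    # Here the function receives individual values, not pandas Series objects.
    seen_types.add(type(a).__name__)
    return a / b if b != 0 else 0.0

# Wrap the scalar function so it can be called on whole arrays.
vectorized_divide = np.vectorize(divide_or_zero, otypes=[float])

df["C"] = vectorized_divide(df["A"].to_numpy(), df["B"].to_numpy())

print(seen_types)
print(df["C"].tolist())
```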
The experiment in the original question demonstrates the speed advantage of np.vectorize over df.apply across varying dataset sizes. For a DataFrame with 1 million rows, np.vectorize was found to be over 25 times faster.
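A comparison of this kind can be reproduced with a small timing harness. The sketch below is not the original experiment: the function time_both, the random integer data, and the row count are assumptions, and the timings it prints will vary by machine and library version.

```python
import time

import numpy as np
import pandas as pd

def divide_or_zero(a, b):
    # Hypothetical scalar function used for both approaches.
    return a / b if b != 0 else 0.0

def time_both(n_rows, seed=0):
    """Return (apply_seconds, vectorize_seconds) for a DataFrame of n_rows."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "A": rng.integers(0, 10, n_rows),
        "B": rng.integers(0, 10, n_rows),
    })

    start = time.perf_counter()
    via_apply = df.apply(lambda row: divide_or_zero(row["A"], row["B"]), axis=1)
    apply_seconds = time.perf_counter() - start

    start = time.perf_counter()
    via_vectorize = np.vectorize(divide_or_zero, otypes=[float])(
        df["A"].to_numpy(), df["B"].to_numpy()
    )
    vectorize_seconds = time.perf_counter() - start

    # Both approaches should produce the same values.
    assert np.allclose(via_apply.to_numpy(dtype=float), via_vectorize)
    return apply_seconds, vectorize_seconds

# A modest size keeps the run short; raise n_rows to approach larger datasets.
apply_s, vec_s = time_both(20_000)
print(f"df.apply:     {apply_s:.4f} s")
print(f"np.vectorize: {vec_s:.4f} s")
```

Calling time_both with several row counts shows how the gap between the two approaches changes with dataset size.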
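While np.vectorize is generally faster than df.apply, one caveat is important: the NumPy documentation describes it as a convenience rather than a performance tool, since its implementation is essentially a for loop.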
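Where the logic can be expressed with native NumPy array operations, those avoid per-element Python calls altogether. The sketch below is an illustrative alternative and not part of the experiment described above; the data and column names are hypothetical. It computes a division that falls back to zero using np.divide with a mask.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names "A" and "B" are assumptions for illustration.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 2, 6, 8]})

a = df["A"].to_numpy(dtype=float)
b = df["B"].to_numpy(dtype=float)

# Divide only where b is non-zero; entries where b == 0 keep the 0.0 from `out`.
df["C"] = np.divide(a, b, out=np.zeros_like(a), where=b != 0)

print(df["C"].tolist())
```

Here the whole column is computed in a single array operation, with no Python function called per row.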