Performant Cartesian Product (CROSS JOIN) with Pandas
Computing the Cartesian product (cross join) of two DataFrames is a common operation in Pandas. The many-to-many JOIN trick (merging both frames on a temporary constant key) works reasonably well for small DataFrames, but its performance degrades as the data grows.
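For reference, the many-to-many JOIN trick can be sketched as follows. The sample frames and the temporary column name _key are made up for illustration, and the how="cross" variant assumes pandas 1.2 or newer.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article
left = pd.DataFrame({"col1": ["A", "B", "C"], "col2": [1, 2, 3]})
right = pd.DataFrame({"col1": ["X", "Y"], "col2": [20, 30]})


def cross_join_via_key(left, right):
    # Give every row the same temporary key so that each left row
    # matches each right row, then drop the key again.
    # "_key" is a hypothetical name; it must not clash with a real column.
    return (
        left.assign(_key=1)
        .merge(right.assign(_key=1), on="_key", suffixes=("_x", "_y"))
        .drop(columns="_key")
    )


result = cross_join_via_key(left, right)

# Assumes pandas >= 1.2, where merge accepts how="cross"
result_builtin = left.merge(right, how="cross")

print(result.shape)  # (6, 4): 3 left rows times 2 right rows
```

Both variants go through the general merge machinery, which is the overhead the NumPy-based functions try to avoid.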
Fast Implementation Using NumPy
A faster implementation builds on a NumPy function that computes the Cartesian product of 1D arrays:
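    import numpy as np

    def cartesian_product(*arrays):
        la = len(arrays)
        dtype = np.result_type(*arrays)
        # One axis per input array, plus a trailing axis that holds each combination
        arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
        # np.ix_ reshapes each array so that it broadcasts along its own axis
        for i, a in enumerate(np.ix_(*arrays)):
            arr[..., i] = a
        # Flatten the grid into one row per combination
        return arr.reshape(-1, la)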
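As a quick check of what this function returns, here is a small usage sketch. The input arrays are arbitrary illustrative values, and the function is repeated so the snippet runs on its own.

```python
import numpy as np


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


# Illustrative inputs: 2 values times 3 values gives 6 combinations
pairs = cartesian_product(np.array([1, 2]), np.array([3, 4, 5]))
print(pairs.tolist())
# [[1, 3], [1, 4], [1, 5], [2, 3], [2, 4], [2, 5]]
```

The first array varies slowest, so the rows come out in the same order a nested loop over the inputs would produce.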
Generalized Solutions for Different DataFrames
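The NumPy-based approach works for DataFrames with non-mixed scalar dtypes. With mixed dtypes, the .values attribute returns an array upcast to a common dtype (typically object), so use it at your own risk.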
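The dtype caveat can be seen directly by inspecting the .values attribute. The two frames below are hypothetical examples, one with a single dtype and one mixing strings and floats.

```python
import pandas as pd

# Hypothetical frames for illustration
uniform = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
mixed = pd.DataFrame({"name": ["A", "B"], "score": [1.5, 2.5]})

# A frame with one dtype converts to a NumPy array of that dtype
print(uniform.values.dtype.kind)  # i (an integer dtype)

# Mixed columns are upcast to a common dtype, here object
print(mixed.values.dtype)  # object
```

An object array keeps the values intact but loses the per-column dtypes, so the resulting DataFrame may need its columns converted back afterwards.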
Generalizing to Uniquely Indexed DataFrames:
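    import pandas as pd

    def cartesian_product_generalized(left, right):
        la, lb = len(left), len(right)
        # Cartesian product of row positions: column 0 indexes left, column 1 indexes right
        idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
        # The result carries default integer column labels
        return pd.DataFrame(
            np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))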
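Because the function builds its result from raw arrays, the original column names are not carried over. A minimal sketch of calling it and restoring the labels, with made-up integer frames and the helper functions repeated so the snippet is self-contained:

```python
import numpy as np
import pandas as pd


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))


# Hypothetical single-dtype frames
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"c": [10, 20, 30]})

out = cartesian_product_generalized(left, right)
# Restore the labels; this assumes the two frames share no column names
out.columns = list(left.columns) + list(right.columns)

print(out.iloc[0].tolist())  # [1, 3, 10]
print(out.iloc[5].tolist())  # [2, 4, 30]
```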
Multiple DataFrames:
The same positional-indexing approach extends to any number of DataFrames:
    def cartesian_product_multi(*dfs):
        # One column of row positions per input DataFrame
        idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
        return pd.DataFrame(
            np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))
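A short sketch of the multi-frame version with three hypothetical integer frames. The row count of the result should equal the product of the input lengths.

```python
import numpy as np
import pandas as pd


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))


# Hypothetical inputs with 2, 3 and 2 rows
first = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
second = pd.DataFrame({"c": [10, 20, 30]})
third = pd.DataFrame({"d": [100, 200]})

combined = cartesian_product_multi(first, second, third)
print(combined.shape)  # (12, 4): 2 * 3 * 2 rows, 2 + 1 + 1 columns
```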
Simplified Solution for Two DataFrames
When dealing with just two DataFrames, a simpler approach can be used:
    def cartesian_product_simplified(left, right):
        la, lb = len(left), len(right)
        # Broadcast the open grid into two full (la, lb) arrays of row positions
        ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
        return pd.DataFrame(
            np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
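To make the indexing step in the simplified version concrete, the sketch below prints the position arrays it produces for a 2-row left frame and a 3-row right frame. The sizes are chosen only for illustration.

```python
import numpy as np

la, lb = 2, 3  # illustrative row counts for left and right

# np.ogrid yields an open grid: one (la, 1) array and one (1, lb) array
row_grid, col_grid = np.ogrid[:la, :lb]

# Broadcasting expands both to the full (la, lb) shape without a Python loop
ia2, ib2 = np.broadcast_arrays(row_grid, col_grid)

left_positions = ia2.ravel()
right_positions = ib2.ravel()

print(left_positions.tolist())   # [0, 0, 0, 1, 1, 1]
print(right_positions.tolist())  # [0, 1, 2, 0, 1, 2]
```

Each left row position is repeated once per right row, while the right positions cycle, which is exactly the row pairing of a cross join.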
Performance Comparison
Benchmarking the solutions showed that the NumPy-based cartesian_product_generalized is the fastest, followed by cartesian_product_simplified for two DataFrames.
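To reproduce such a comparison locally, a rough timing harness might look like the sketch below. The frame sizes, the random data, the _key column name and the repeat counts are all arbitrary assumptions, and the resulting timings will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))


def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))


def cross_join_via_key(left, right):
    # Baseline: the temporary-key merge trick ("_key" is a hypothetical name)
    return (left.assign(_key=1)
            .merge(right.assign(_key=1), on="_key")
            .drop(columns="_key"))


# Arbitrary test data: two 200-row integer frames (40,000 result rows)
rng = np.random.default_rng(0)
left = pd.DataFrame(rng.integers(0, 100, size=(200, 2)), columns=["a", "b"])
right = pd.DataFrame(rng.integers(0, 100, size=(200, 2)), columns=["c", "d"])

candidates = {
    "cross_join_via_key": cross_join_via_key,
    "cartesian_product_generalized": cartesian_product_generalized,
    "cartesian_product_simplified": cartesian_product_simplified,
}

# Best of 3 repeats, 5 calls per repeat
timings = {
    name: min(timeit.repeat(lambda f=func: f(left, right), number=5, repeat=3))
    for name, func in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s for 5 calls")
```

Running the harness over a range of frame sizes gives a fuller picture than a single measurement, since the relative ranking can shift as the inputs grow.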