Improving Performance for Multiple Substring Filtering in Pandas Series
When filtering rows whose string column contains at least one substring from a given list, the conventional approach builds one mask per substring with str.contains(regex=False) and combines the masks with np.logical_or.reduce(). This can be slow on large datasets, because the whole column is scanned once for every substring. This article explores an alternative that uses a single regular expression to improve performance.
Proposed Solution
Instead of using regex=False in str.contains(), we employ regular expressions after properly escaping the provided substrings using re.escape(). This ensures literal matches rather than regex interpretation. The escaped substrings are then combined into a single pattern using a regex pipe (|).
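As a minimal sketch of the pattern-building step, the snippet below escapes each substring and joins the results with a pipe. The list name substrings and its contents are hypothetical, chosen only because they contain regex metacharacters.

```python
import re

# Hypothetical substrings; several contain regex metacharacters
# that would change the meaning of the pattern if left unescaped.
substrings = ["a.b", "c+d", "(x)"]

# Escape each substring so it is matched literally, then join the
# escaped pieces with "|" so the pattern matches any one of them.
pattern = "|".join(re.escape(s) for s in substrings)
```

Without re.escape(), the "." in "a.b" would match any character, so a string such as "aXb" would be accepted by mistake. Note also that an empty list produces an empty pattern, which matches every string, so that case is worth guarding against separately.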
Masking Process
The masking stage then becomes a single call that checks each string in the series once against the combined pattern. Note that case=False makes the match case-insensitive:
df[col].str.contains(pattern, case=False)
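Putting the two steps together, an end-to-end sketch might look like the following. The DataFrame, the column name "text", and the substrings are made up for illustration; na=False is an added assumption so that missing values produce False instead of a missing entry in the mask.

```python
import re

import pandas as pd

# Hypothetical data: one string column that includes a missing value.
df = pd.DataFrame({"text": ["apple pie", "Banana split", "cherry tart", None]})
substrings = ["banana", "tart"]

# Build one literal-matching pattern from all substrings.
pattern = "|".join(re.escape(s) for s in substrings)

# case=False ignores letter case; na=False treats missing values as non-matches.
mask = df["text"].str.contains(pattern, case=False, na=False)

# Keep only the rows where at least one substring was found.
filtered = df[mask]
```

Here "Banana split" is kept even though the substring is written in lowercase, because the match is case-insensitive. Drop case=False if the filter should respect case.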
Performance Comparison
Using a sample dataset with 100 substrings of length 5 and 50,000 strings of length 20, the proposed method took approximately 1 second. The original method took around 5 seconds for the same data.
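A comparison along these lines can be reproduced with a sketch such as the one below. The random-data helper, the fixed seed, and the function names are assumptions made for illustration, and the timings will vary with hardware and pandas version. The data is all lowercase, so the case-sensitive original method and the case-insensitive regex method should select the same rows.

```python
import random
import re
import string
import time

import numpy as np
import pandas as pd

rng = random.Random(0)  # fixed seed so the run is repeatable


def random_string(length):
    """Return a random lowercase string of the given length."""
    return "".join(rng.choices(string.ascii_lowercase, k=length))


# Sizes taken from the article: 100 substrings of length 5,
# 50,000 strings of length 20.
substrings = [random_string(5) for _ in range(100)]
series = pd.Series([random_string(20) for _ in range(50_000)])


def original_method():
    # One full pass over the series per substring, then OR the masks together.
    masks = [series.str.contains(s, regex=False) for s in substrings]
    return np.logical_or.reduce(masks)


def regex_method():
    # One pass over the series with a single combined pattern.
    pattern = "|".join(re.escape(s) for s in substrings)
    return series.str.contains(pattern, case=False)


start = time.perf_counter()
baseline = original_method()
print(f"original method: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
regex_mask = regex_method()
print(f"regex method:    {time.perf_counter() - start:.2f} s")
```

With random data of this shape, very few strings contain any of the substrings, so the run is close to the no-match scenario.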
Note
These timings reflect a "worst-case" scenario in which there are no substring matches, so every substring has to be checked against every string. When matches are present, the timings improve. In either case, the regex approach reduces the number of checks required per row compared with the original method, because the column is scanned once instead of once per substring.