How Can I Efficiently Filter a Pandas Series for Multiple Substrings?

Linda Hamilton
Release: 2024-11-23 18:17:20

Efficient Pandas Filtering for Multiple Substrings in a Series

Determining whether each string in a pandas Series contains any of several substrings is a common task in data analysis. Combining individual str.contains calls with logical OR is a straightforward solution, but it can be inefficient for long lists of substrings and large dataframes.
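For reference, the logical-OR baseline described above can be sketched as follows. The sample data and the names col, lst, and masks are illustrative assumptions, not taken from the original question.

```python
import operator
from functools import reduce

import pandas as pd

# Hypothetical sample data, for illustration only
col = pd.Series(["apple pie", "banana split", "cherry tart", "plain toast"])
lst = ["apple", "cherry"]

# One literal (non-regex) str.contains call per substring
masks = [col.str.contains(s, regex=False) for s in lst]

# Combine the boolean masks element-wise with logical OR
mask = reduce(operator.or_, masks)

print(col[mask].tolist())  # ['apple pie', 'cherry tart']
```

Each substring triggers its own pass over the whole Series, so the work grows with the length of the substring list.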

To optimize this task, consider a regular expression (regex) approach. Combining the substrings into a single regex pattern lets one str.contains call test for all of them at once. Because the substrings may contain characters that have a special meaning in a regex (such as . or +), escape them first with re.escape, then join the escaped substrings with the pipe character (|), which means "or" in a regex:

import re

esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

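With this pattern, str.contains returns a boolean mask that can be used to filter the series; passing case=False makes the match case-insensitive. Here df is a dataframe and col is the name of the column to search: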

df[col].str.contains(pattern, case=False)
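As a small end-to-end illustration of the steps above, the sketch below uses made-up strings that include regex metacharacters, which shows why the escaping step matters. The data and variable names are assumptions chosen for the example.

```python
import re

import pandas as pd

# Hypothetical data containing regex metacharacters ($, ., +)
col = pd.Series(["Price: $5.00", "C++ primer", "plain text", "c+ grade"])
lst = ["$5", "c++"]

# Escape each substring so it is matched literally, then join with |
pattern = "|".join(re.escape(s) for s in lst)

# Case-insensitive match against the combined pattern
mask = col.str.contains(pattern, case=False)

print(col[mask].tolist())  # ['Price: $5.00', 'C++ primer']
```

Without re.escape, a substring such as "c++" would not be a valid regex, and "$5" would be read as an end-of-string anchor followed by 5 rather than as literal text.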

This approach offers improved performance, especially for large dataframes. Consider the following example:

import re
from random import randint, seed

import pandas as pd

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
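To compare the two approaches on your own machine, a rough timing sketch along the following lines may help. It is not the article's benchmark: it assumes a smaller dataset of lowercase ASCII letters so that it runs quickly and the results are easy to inspect, and it uses case-sensitive matching in both approaches so the two masks are directly comparable. Timings vary with hardware and pandas version.

```python
import operator
import random
import re
import string
import time
from functools import reduce

import pandas as pd

random.seed(321)
letters = string.ascii_lowercase

# Assumed sizes for a quick run: 20 substrings of 3 characters,
# 5000 strings of 20 characters
lst = ["".join(random.choices(letters, k=3)) for _ in range(20)]
strings = ["".join(random.choices(letters, k=20)) for _ in range(5000)]
col = pd.Series(strings)

# Approach 1: a single regex built from the escaped substrings
start = time.perf_counter()
pattern = "|".join(re.escape(s) for s in lst)
regex_mask = col.str.contains(pattern)
regex_seconds = time.perf_counter() - start

# Approach 2: one literal str.contains call per substring, OR-ed together
start = time.perf_counter()
or_mask = reduce(operator.or_, (col.str.contains(s, regex=False) for s in lst))
or_seconds = time.perf_counter() - start

print(f"regex: {regex_seconds:.4f}s, OR-chain: {or_seconds:.4f}s")
print("same result:", bool((regex_mask == or_mask).all()))
```

Checking that both masks agree before comparing timings guards against measuring two operations that do not compute the same thing.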

With this approach, filtering 50,000 rows against 100 substrings takes approximately 1 second, significantly faster than combining individual str.contains calls with logical OR. The difference becomes even more pronounced for larger dataframes and longer lists of substrings.
