What is the Fastest Way to Remove Punctuation from a Pandas DataFrame?

Susan Sarandon
Release: 2024-11-19 06:45:03

Fast Punctuation Removal with Pandas

Punctuation removal is a common text-cleaning task. The pandas Series.str.replace method is the usual choice, but it can be too slow for large datasets.
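For reference, the str.replace baseline might look like the sketch below. The sample DataFrame and the column name 'text' are assumptions made for illustration; the pattern is the same one used in the regex example later in this article.

```python
import pandas as pd

# Hypothetical sample data; the column name 'text' is an assumption.
df = pd.DataFrame({'text': ['Hello, world!', 'a.b;c', 'no punctuation']})

# Baseline: vectorized regex replacement through the .str accessor.
# regex=True is passed explicitly because the default became False in pandas 2.0.
cleaned = df['text'].str.replace(r'[^\w\s]+', '', regex=True)

if __name__ == '__main__':
    print(cleaned.tolist())
```

Because \w matches the underscore, this pattern leaves underscores in place.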

Alternatives to str.replace:

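  • regex.sub: Calls the sub method of a pattern compiled with the re module (re.compile(...).sub) inside a list comprehension. This is faster than str.replace.
  • str.translate: Uses Python's built-in str.translate, which is implemented in C. All rows are joined into one string, translated in a single call, and split back into rows, which gives the largest speedup.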

Benchmarks:

  • str.translate exhibits the best performance, followed by regex.sub and then str.replace.
  • The gap in performance widens with increasing dataset size.
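To check this ordering on your own data, a small timing harness along the following lines may help. It is only a sketch: the function names, the sample strings, the sizes, and the '|' separator are all assumptions, and the absolute timings will vary by machine and pandas version.

```python
import re
import string
import timeit

import pandas as pd

PATTERN = re.compile(r'[^\w\s]+')
SEP = '|'  # assumed never to occur in the data
# Delete all ASCII punctuation except the separator itself.
TRANSTAB = str.maketrans('', '', string.punctuation.replace(SEP, ''))


def with_str_replace(s):
    return s.str.replace(r'[^\w\s]+', '', regex=True)


def with_re_sub(s):
    return pd.Series([PATTERN.sub('', x) for x in s.tolist()], index=s.index)


def with_translate(s):
    joined = SEP.join(s.tolist()).translate(TRANSTAB)
    return pd.Series(joined.split(SEP), index=s.index)


def benchmark(sizes=(1_000, 10_000), repeats=3):
    """Return {size: {function name: best time in seconds}}."""
    base = pd.Series(['Hello, world!', 'a.b;c', "it's (fine)"])
    results = {}
    for n in sizes:
        copies = n // len(base) + 1
        s = pd.concat([base] * copies, ignore_index=True).iloc[:n]
        results[n] = {
            f.__name__: min(timeit.repeat(lambda: f(s), number=1, repeat=repeats))
            for f in (with_str_replace, with_re_sub, with_translate)
        }
    return results


if __name__ == '__main__':
    for size, timings in benchmark().items():
        print(size, timings)
```

Taking the minimum over several repeats reduces noise from other processes; larger sizes can be passed to see how the gap changes.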

Considerations:

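  • regex.sub and str.translate, as written in the code below, raise a TypeError on NaN values, so missing values must be filled or filtered out first.
  • str.translate joins the rows with a separator character (| in the code below). That character must not occur in the data and must not be in the set of punctuation being removed; otherwise the split no longer lines up with the original rows.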
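One way to address both considerations is sketched below. The helper name remove_punct_translate and its sep parameter are hypothetical, not part of pandas: the function translates only the non-missing rows and refuses to run if the separator already appears in the data.

```python
import string

import pandas as pd


def remove_punct_translate(series, sep='|'):
    """Strip ASCII punctuation from a Series, leaving missing values untouched.

    `sep` is a single character assumed not to occur in the data.
    """
    mask = series.notna()
    values = series[mask].astype(str).tolist()
    if any(sep in v for v in values):
        raise ValueError(f'separator {sep!r} occurs in the data; choose another')

    # Keep the separator out of the deletion set so split() still works.
    table = str.maketrans('', '', string.punctuation.replace(sep, ''))

    out = series.copy()
    if values:  # ''.split(sep) would return [''] for an empty selection
        out.loc[mask] = sep.join(values).translate(table).split(sep)
    return out


if __name__ == '__main__':
    s = pd.Series(['Hello, world!', None, 'a.b;c'])
    print(remove_punct_translate(s).tolist())
```

Raising an error on a separator collision is a deliberately conservative choice; picking a different separator automatically would also work.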

Code:

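import re
import pandas as pd

# df is assumed to be an existing DataFrame with a string column 'text' (no NaN).

# regex.sub: compile the pattern once, then apply it in a list comprehension.
# Note: \w matches '_', so this pattern keeps underscores.
pattern = re.compile(r'[^\w\s]+')
df['text'] = [pattern.sub('', x) for x in df['text'].tolist()]

# str.translate: join all rows with a separator, translate once, then split.
# '|' is the separator, so it is left out of punct and must not occur in the data.
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'
transtab = str.maketrans(dict.fromkeys(punct, ''))
df['text'] = '|'.join(df['text'].tolist()).translate(transtab).split('|')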
