Removing Non-Printable Characters from Strings in Python
Question:
In Perl, non-printable characters can be removed with the substitution s/[^[:print:]]//g. Python's re module, however, does not support the [:print:] POSIX character class. How can we achieve similar functionality in Python in a way that handles both ASCII and Unicode characters?
Answer:
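Because Python's re module has no equivalent of [:print:], we can build our own character class with the unicodedata module. The code below collects every character whose Unicode general category is Cc (control characters) and compiles them into a single regular expression.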
<code class="python">import re
import sys
import unicodedata

# Generate every code point, including the last one (sys.maxunicode itself)
all_chars = (chr(i) for i in range(sys.maxunicode + 1))

# Unicode general categories to strip; 'Cc' is the control-character category
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)

# Build a character class, escaping characters that are special in a regex
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    """Return s with every control character removed."""
    return control_char_re.sub('', s)</code>
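As a quick check of the approach above, here is a minimal usage sketch. It rebuilds the same character class in compact form so that it runs on its own; the sample string is a made-up example, not taken from the original answer.

```python
import re
import sys
import unicodedata

# Same construction as above: a character class of all 'Cc' code points.
control_chars = ''.join(
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == 'Cc'
)
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

# Hypothetical sample: NUL, BEL, tab and newline are all category 'Cc'.
sample = 'Hel\x00lo,\x07 w\törld\n'
cleaned = remove_control_chars(sample)
print(repr(cleaned))
```

Note that tab and newline are also control characters, so they are removed as well, while non-ASCII letters such as "ö" are kept. If whitespace should survive, it would have to be excluded from the character class before compiling.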
For Python 2:
<code class="python">import re
import sys
import unicodedata

# Generate every code point, including the last one (sys.maxunicode itself)
all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))

# Unicode general categories to strip; 'Cc' is the control-character category
categories = {'Cc'}
control_chars = u''.join(c for c in all_chars if unicodedata.category(c) in categories)

# Build a character class, escaping characters that are special in a regex
control_char_re = re.compile(u'[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    """Return s with every control character removed."""
    return control_char_re.sub(u'', s)</code>
Extended Option:
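For more comprehensive removal, additional Unicode general categories can be added to the categories set, for example Cf (format), Cs (surrogate), Co (private use), and Cn (unassigned). Each extra category enlarges the character class, which can slow down building and compiling the regular expression.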
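A minimal sketch of the extended option, assuming Python 3. The helper name build_remover and the choice of Cc plus Cf are illustrative assumptions, not part of the original answer.

```python
import re
import sys
import unicodedata

def build_remover(categories):
    """Return a function that strips every character in the given categories.

    build_remover is a hypothetical helper name used only for this sketch.
    """
    chars = ''.join(
        chr(i) for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)) in categories
    )
    pattern = re.compile('[%s]' % re.escape(chars))
    return lambda s: pattern.sub('', s)

# 'Cc' (control) plus 'Cf' (format: zero-width space, soft hyphen, BOM, ...).
# 'Cs', 'Co' and 'Cn' could be added the same way, at a higher build cost.
remove_invisible = build_remover({'Cc', 'Cf'})

print(repr(remove_invisible('a\u200bb\u00adc\x00d')))
```

The pattern is built once and reused, so the cost of scanning all code points is paid only when the remover is created, not on every call.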
Character Categories and Counts:
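The counts can be computed directly rather than quoted, since most of them depend on the version of the Unicode database bundled with the interpreter. The following is a sketch, assuming Python 3, that tallies every code point by general category and prints the "Other" (C*) categories.

```python
import sys
import unicodedata
from collections import Counter

# Tally every code point by its two-letter general category.
counts = Counter(
    unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1)
)

# Show only the "Other" categories (names starting with 'C').
for name in sorted(counts):
    if name.startswith('C'):
        print(name, counts[name])
```

Cc always contains 65 characters (U+0000 to U+001F, U+007F, and U+0080 to U+009F), and Cs covers the 2048 surrogate code points. The Cf and Cn totals change as new Unicode versions assign more characters.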