How to Convert XML/HTML Entities to Unicode Strings in Python?

Susan Sarandon

Release: 2024-11-04 06:36:02

Converting XML/HTML Entities to Unicode Strings in Python

When scraping web pages, you will frequently find XML/HTML entities standing in for non-ASCII characters. To decode these entities in Python and obtain the corresponding Unicode characters, use the unescape() function from the standard library: html.unescape() in Python 3.4 and later, or the unescape() method of the HTMLParser class in older versions.
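As a minimal sketch of what unescape() does, assuming Python 3.4 or later (where it is exposed by the html module), the snippet below decodes the same character written three ways: as a named entity, as a decimal numeric reference, and as a hexadecimal numeric reference. The sample strings and variable names are made up for illustration.

```python
import html

# Three spellings of the same character, U+00E9 (é).
named = html.unescape("caf&eacute;")        # named entity
decimal = html.unescape("caf&#233;")        # decimal numeric reference
hexadecimal = html.unescape("caf&#xe9;")    # hexadecimal numeric reference

print(named, decimal, hexadecimal)

# All three decode to the same Unicode string.
print(named == decimal == hexadecimal == "caf\u00e9")
```

Text without entities passes through unchanged, so the call should be safe to apply to every string extracted from a page.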

Example:

Suppose you have the following entity:

&#x01ce;

which represents "ǎ", an "a" with a tone mark. The 01ce in the entity is the character's hexadecimal code point (U+01CE), which fits in 16 bits. To convert entities like this one into Unicode strings (here, u'\u01ce'):

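Python 2 (on Python 3.0 to 3.3, use "from html.parser import HTMLParser" instead):

import HTMLParser  # Python 2 module name
h = HTMLParser.HTMLParser()
# unescape() is a method of the parser instance; it was removed in Python 3.9
unicode_string = h.unescape('&copy; 2010')  # u'\xa9 2010'
unicode_string = h.unescape('&#169; 2010')  # u'\xa9 2010'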

Python 3.4 and later:

import html
# Named and numeric entities both decode to the copyright sign, U+00A9
unicode_string = html.unescape('&copy; 2010')  # '\xa9 2010'
unicode_string = html.unescape('&#169; 2010')  # '\xa9 2010'

The resulting unicode_string contains the original text with each entity replaced by the Unicode character it represents.
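If the same script may have to run on several Python versions, one possible approach is to pick the available unescape() at import time and then apply it to the entity from the start of the example. This is only a sketch: the fallback chain and the name decoded are illustrative choices, not part of the original answer.

```python
try:
    # Python 3.4+: unescape() is a module-level function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3: the parser class lives in html.parser.
        from html.parser import HTMLParser
    except ImportError:
        # Python 2: the parser class lives in the HTMLParser module.
        from HTMLParser import HTMLParser
    # On these older versions unescape() is a method of a parser instance.
    unescape = HTMLParser().unescape

decoded = unescape("&#x01ce;")

# The decoded string is the single character U+01CE.
print(len(decoded))
print(hex(ord(decoded)))
```

On current Python versions only the first import is ever used, so new code can simply import html directly.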
