Converting XML/HTML Entities to Unicode Strings in Python
In web scraping, HTML/XML entities are frequently used to represent non-ASCII characters. To decode these entities in Python and obtain the corresponding Unicode string, you can use the standard library: the unescape() method of HTMLParser.HTMLParser in Python 2, or the html.unescape() function in Python 3.4 and later.
Example:
Suppose you have the following entity:
&#x01ce;
which represents an "ǎ" with a tone mark. The hexadecimal code point of this character is 01ce (it fits in 16 bits). To convert this entity into the Unicode string u'\u01ce':
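Python 2 (in Python 3.0 to 3.3 the same class is in html.parser; the unescape() method was deprecated in 3.4 and removed in 3.9):
    import HTMLParser

    h = HTMLParser.HTMLParser()
    unicode_string = h.unescape('&copy; 2010')  # u'\xa9 2010'
    unicode_string = h.unescape('&#169; 2010')  # u'\xa9 2010'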
Python 3.4 and later:
    import html

    unicode_string = html.unescape('&copy; 2010')  # '\xa9 2010'
    unicode_string = html.unescape('&#169; 2010')  # '\xa9 2010'
The resulting unicode_string is a Unicode string in which each entity has been replaced by the character it represents.
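To tie this back to the example entity, here is a minimal sketch that decodes &#x01ce; along with its decimal form and the copyright examples. The helper name decode_entities and the samples list are illustrative assumptions, not part of any library; the import fallback assumes that only Python 3.4+ or Python 2 needs to be supported.

```python
try:
    # Python 3.4+: module-level function in the html module
    from html import unescape
except ImportError:
    # Python 2 fallback (assumption: Python 3.0 to 3.3 is not targeted here)
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Hypothetical helper: replace character references with their characters."""
    return unescape(text)


# Hex, decimal, and named forms; 0x01ce and 462 are the same code point
samples = ["&#x01ce;", "&#462;", "&copy; 2010", "&#169; 2010"]
decoded = [decode_entities(s) for s in samples]

for raw, out in zip(samples, decoded):
    print(repr(raw), "->", repr(out))
```

Wrapping the call in one small function keeps the version-specific import in a single place, so the rest of a scraper can call decode_entities() without caring which Python it runs on.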