Converting XML/HTML Entities to Unicode Strings in Python
Scraped web pages often use XML/HTML entities to represent non-ASCII characters. To decode these entities in Python and obtain the corresponding Unicode string, use unescape() from the standard library: html.unescape() in Python 3.4 and later, or the unescape() method of an HTMLParser instance in older versions.
Example:
Suppose you have the following entity:
&#x01ce;
which represents the character "ǎ" (a lowercase "a" with a caron, used as a tone mark in pinyin). Its Unicode code point is U+01CE, written in hexadecimal as 01ce, which fits in 16 bits. To convert this entity into the Unicode string u'\u01ce':
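To make the link between the entity's digits and the character concrete, here is a minimal sketch for Python 3. The variable names are illustrative only; the point is that the hexadecimal digits inside the numeric entity are simply the character's code point.

```python
# The hex digits inside "&#x01ce;" are the Unicode code point of the character.
code_point = int("01ce", 16)      # parse the hexadecimal digits
character = chr(code_point)       # build the one-character string

print(code_point)                      # 462
print(character == "\u01ce")           # True
print(format(code_point, "016b"))      # 0000000111001110 (the 16-bit binary form)
```

This only illustrates the arithmetic; decoding entities found in real text is better left to unescape(), which also handles named entities.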
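Python 3.3 and earlier (including Python 2):
Create an HTMLParser instance and call its unescape() method on the string. The class lives in the HTMLParser module in Python 2 and in html.parser in Python 3. This method was deprecated in Python 3.4 and removed in Python 3.9.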
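The original listing for older versions is not preserved here, so the following is a reconstruction sketch rather than the author's exact code. It is written as a compatibility shim that prefers html.unescape() when available and falls back to the HTMLParser method otherwise; the name unicode_string is chosen to match the text.

```python
try:
    # Python 3.4+: unescape() is a module-level function.
    from html import unescape
except ImportError:
    try:
        from html.parser import HTMLParser   # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser    # Python 2
    # On these older versions, unescape() is a method of a parser instance.
    unescape = HTMLParser().unescape

unicode_string = unescape("&#x01ce;")

# The result is a single character whose code point is U+01CE.
print(ord(unicode_string) == 0x01CE)   # True
```

Wrapping the import this way keeps calling code identical across versions, which may be useful if a script still has to run on a legacy interpreter.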
Python 3.4 and later:
Import the html module and call html.unescape() directly on the string.
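A minimal sketch of the Python 3.4+ approach follows. The second sample string is an illustrative assumption, added to show numeric and named entities being decoded in the same call.

```python
import html

# Decode the single numeric entity from the example.
unicode_string = html.unescape("&#x01ce;")
print(unicode_string == "\u01ce")   # True

# Numeric and named entities can be mixed in ordinary text.
mixed = html.unescape("P&#x01ce;o &amp; caf&eacute;")
print(mixed)                        # Pǎo & café
```

Because html.unescape() is a plain function, no parser object needs to be created.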
The returned string is the Unicode form of the input, with each entity replaced by the character it represents.