Converting XML/HTML Entities to Unicode in Python
Challenge:
In web scraping, HTML entities are commonly used to represent non-ASCII characters. Python needs a utility that can convert a string with these entities into a Unicode type.
Solution:
The Python standard library's HTMLParser possesses an undocumented function, unescape(), which can fulfill this requirement effectively.
Implementation:
For Python 3.4 and earlier:
<code class="python">import HTMLParser h = HTMLParser.HTMLParser() result = h.unescape('&copy; 2010') # u'\xa9 2010'</code>
For Python 3.4 and later:
<code class="python">import html result = html.unescape('&copy; 2010') # u'\xa9 2010'</code>
Example:
Consider the HTML entity ǎ, which corresponds to an "ǎ" with a tone mark in binary. Using unescape(), you can convert it to the Unicode value u'u01ce':
<code class="python">result = h.unescape('&#x01ce;') # u'\u01ce'</code>
The above is the detailed content of How to Convert XML/HTML Entities to Unicode in Python?. For more information, please follow other related articles on the PHP Chinese website!