How to Convert XML/HTML Entities to Unicode in Python?-Python Tutorial-php.cn

How to Convert XML/HTML Entities to Unicode in Python?

Barbara Streisand

Release： 2024-11-04 00:06:30

Converting XML/HTML Entities to Unicode in Python

Challenge:

In web scraping, the retrieved text often contains HTML entities (such as &copy;) in place of non-ASCII characters. The task is to convert a string containing these entities into a Unicode string.

Solution:

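The Python standard library can do this. In Python 3.4 and later, the documented function html.unescape() converts named and numeric character references to the corresponding Unicode characters. Older versions expose an undocumented unescape() method on HTMLParser, which was deprecated in Python 3.4 and removed in Python 3.9.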

Implementation:

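For Python 2 (in Python 3.0 to 3.3 the same method is available on html.parser.HTMLParser):

<code class="python">import HTMLParser  # Python 2 module name; renamed to html.parser in Python 3

h = HTMLParser.HTMLParser()
result = h.unescape('&copy; 2010')  # u'\xa9 2010'</code>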

For Python 3.4 and later:

<code class="python">import html

result = html.unescape('&copy; 2010')  # '\xa9 2010', i.e. '© 2010'</code>
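
If the same code has to run on several Python versions, the two approaches above can be combined with a fallback import. The following is a minimal sketch, not a definitive implementation; the wrapper name entities_to_unicode is hypothetical and chosen only for illustration.

```python
# Pick whichever unescape implementation the running interpreter provides.
try:
    from html import unescape  # Python 3.4 and later
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape


def entities_to_unicode(text):
    """Hypothetical helper: decode character references in text."""
    return unescape(text)


result = entities_to_unicode('&copy; 2010')
print(repr(result))
```

On a current Python 3 interpreter only the first import is ever used, so the fallback branches matter only for legacy code.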

Example:

Consider the numeric character reference &#x01ce;, which encodes "ǎ" (the letter a with a caron, used as a tone mark) in hexadecimal. Using unescape(), you can convert it to the Unicode string u'\u01ce':

<code class="python">result = h.unescape('&#x01ce;')  # u'\u01ce' (h is the HTMLParser instance created above)</code>
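
For Python 3.4 and later, the same conversion can be sketched with html.unescape(). The sample strings below are illustrative assumptions, not taken from a real scraped page. They cover a named entity, decimal and hexadecimal references, and a double-escaped input, which needs two passes because unescape() decodes only one level at a time.

```python
import html

# Illustrative inputs: named, decimal, and hexadecimal character references.
samples = {
    "named": "&copy; 2010",
    "decimal": "&#462;",      # 462 == 0x01ce
    "hex": "&#x01ce;",
    "mixed": "caf&eacute; &amp; bar",
}

decoded = {name: html.unescape(text) for name, text in samples.items()}

for name, text in decoded.items():
    print(name, repr(text))

# Double-escaped input: the first pass only turns "&amp;" into "&".
once = html.unescape("&amp;copy; 2010")   # '&copy; 2010'
twice = html.unescape(once)               # '\xa9 2010'
print(repr(once), repr(twice))
```

Strings scraped from pages that were escaped twice therefore still contain entity text after one call, which is a common source of confusion.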
