How to Convert XML/HTML Entities to Unicode in Python?

Barbara Streisand
Release: 2024-11-04 00:06:30

Converting XML/HTML Entities to Unicode in Python

Challenge:

Scraped web pages often use HTML entities such as &copy; or &#x01ce; to represent non-ASCII characters. The task is to convert a string containing these entities into a Unicode string.

Solution:

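In Python 2, the standard library's HTMLParser class has an undocumented method, unescape(), that does this. Python 3.4 and later provide a documented function, html.unescape(), for the same purpose; the old HTMLParser.unescape() method was deprecated in Python 3.4 and removed in Python 3.9.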

Implementation:

For Python 2 (the HTMLParser module does not exist under that name in Python 3):

<code class="python">import HTMLParser

h = HTMLParser.HTMLParser()
result = h.unescape('&copy; 2010')  # u'\xa9 2010'</code>

For Python 3.4 and later:

<code class="python">import html

result = html.unescape('&copy; 2010')  # '\xa9 2010', i.e. '© 2010'</code>
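If the same code has to run on both interpreter lines, the two approaches above can be combined behind one name. The following is only a sketch: the helper name decode_entities is made up for illustration, and the fallback branch assumes a Python 2 interpreter where the HTMLParser module is importable.

```python
try:
    # Python 3.4 and later: documented function in the html module.
    from html import unescape
except ImportError:
    # Assumed Python 2 fallback: undocumented method on HTMLParser.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Return text with HTML/XML entities replaced by Unicode characters.

    decode_entities is a hypothetical wrapper name, not a library function.
    """
    return unescape(text)


if __name__ == "__main__":
    print(decode_entities('&copy; 2010'))
```

Wrapping the call in one function keeps the version check in a single place, so the rest of a scraper does not need to know which interpreter it runs on.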

Example:

Consider the numeric character reference &#x01ce;, which encodes the character "ǎ" (a with caron, U+01CE) in hexadecimal. Using the parser object h created above, unescape() converts it to the Unicode string u'\u01ce':

<code class="python">result = h.unescape('&#x01ce;')  # u'\u01ce'</code>
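The same conversion applies to named entities and to decimal as well as hexadecimal character references. As a rough illustration for Python 3.4 and later, the sketch below runs html.unescape() over a few sample strings; the helper decode_all and the sample list are assumptions chosen for this example, not part of any library.

```python
import html


def decode_all(strings):
    """Apply html.unescape() to each string (hypothetical helper)."""
    return [html.unescape(s) for s in strings]


# Illustrative inputs: a named entity, decimal and hexadecimal
# references to the same character, the reference discussed above,
# and escaped markup.
samples = ['&copy;', '&#169;', '&#xa9;', '&#x01ce;', '&lt;b&gt;']

if __name__ == "__main__":
    for raw, decoded in zip(samples, decode_all(samples)):
        print(repr(raw), '->', repr(decoded))
```

The first three samples all decode to the same copyright sign, which shows that the spelling of the reference does not matter to the result.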

source:php.cn