How to Convert XML/HTML Entities to Unicode Strings in Python?

Susan Sarandon
Release: 2024-11-04 06:36:02

Converting XML/HTML Entities to Unicode Strings in Python

Scraped HTML and XML often use entities (character references) such as &copy; or &#169; to represent non-ASCII characters. To decode them in Python and get the corresponding Unicode string, use the unescape() function from the standard library: html.unescape() in Python 3.4 and later, or the unescape() method of HTMLParser.HTMLParser in Python 2.

Example:

Suppose you have the following entity:

&#x01ce;

which represents the character "ǎ" (an "a" with a caron, used here as a tone mark). The value 01ce is its code point in hexadecimal (U+01CE), not a binary number. To convert this entity into the Unicode string u'\u01ce':

Python 2 (in Python 3.3 and earlier the same method lives in html.parser.HTMLParser):

import HTMLParser
h = HTMLParser.HTMLParser()
unicode_string = h.unescape('&copy; 2010') # u'\xa9 2010'
unicode_string = h.unescape('&#169; 2010') # u'\xa9 2010'

Python 3.4 and later:

import html
unicode_string = html.unescape('&copy; 2010') # '\xa9 2010'
unicode_string = html.unescape('&#169; 2010') # '\xa9 2010'

In the resulting unicode_string, each entity has been replaced by the Unicode character it refers to.
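As a minimal sketch for Python 3.4 and later, the following applies html.unescape() to a named, a decimal, and a hexadecimal reference, including the &#x01ce; entity from the example above. The names samples and decoded are illustrative only and do not come from the original article.

```python
import html

# One named, one decimal, and one hexadecimal character reference.
samples = ["&copy; 2010", "&#169; 2010", "&#x01ce;"]

# html.unescape() replaces every reference in a string with its character.
decoded = [html.unescape(s) for s in samples]

# The hexadecimal reference &#x01ce; names code point U+01CE.
assert decoded[2] == "\u01ce" == chr(0x01CE)

for raw, text in zip(samples, decoded):
    print(raw, "->", text)
```

Both the named and the decimal form of the copyright sign decode to the same one-character prefix, '\xa9', so the first two results compare equal.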
