How Python handles HTML escape characters

高洛峰

Released: 2017-03-01 13:27:57

This article shows, with examples, how to handle HTML escape characters in Python. The details are as follows:

While processing web page data with Python recently, I often ran into HTML escape characters (also called HTML character entities), such as &nbsp;, &lt; and &gt;. Character entities are generally used to represent reserved characters in web pages. For example, > is written as &gt; so that the browser does not mistake it for part of a tag. For details, see w3school's page on HTML character entities. Useful as they are, they get in the way when parsing web data. There are the following ways to handle them:

1. Use HTMLParser

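# Python 2 only: in Python 3 the module is html.parser and the function is html.unescape
import HTMLParser
html_cont = "&nbsp;asdfg&gt;123&lt;"
html_parser = HTMLParser.HTMLParser()
new_cont = html_parser.unescape(html_cont)
print new_cont  # prints " asdfg>123<" (the first character is a non-breaking space, U+00A0)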
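
The HTMLParser module used above exists only in Python 2. On Python 3.4 and later, the standard library offers the same conversion as html.unescape. The following is a minimal sketch of that equivalent, assuming a Python 3 interpreter; the sample string is the one this article uses throughout.

```python
import html

# Sample string with three entities: &nbsp;, &gt; and &lt;.
html_cont = "&nbsp;asdfg&gt;123&lt;"

# html.unescape converts named and numeric character references to characters.
new_cont = html.unescape(html_cont)

# repr() makes the non-breaking space (U+00A0) visible.
print(repr(new_cont))  # '\xa0asdfg>123<'
```

Note that &nbsp; becomes a non-breaking space (U+00A0), not an ordinary space, so a later comparison against " " or a plain str.strip-based cleanup may need to account for it.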

To convert back (note that the non-breaking space is not converted back to &nbsp;):

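# Python 2 only: cgi.escape was removed in Python 3.8; use html.escape there
import cgi
new_cont = cgi.escape(new_cont)
print new_cont  # prints " asdfg&gt;123&lt;" (the non-breaking space is left as-is)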
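
On Python 3 the escaping step is available as html.escape. Like cgi.escape, it leaves the non-breaking space untouched, so the sketch below restores &nbsp; with an explicit replace. This is one possible approach under the assumption that the only non-ASCII character of interest is U+00A0; the variable names escaped and restored are illustrative.

```python
import html

# Unescaped text; \xa0 is the non-breaking space produced from &nbsp;.
new_cont = "\xa0asdfg>123<"

# html.escape converts &, < and > (and quotes, unless quote=False is passed).
escaped = html.escape(new_cont)

# Restore &nbsp; only after escaping; doing it first would turn
# the "&" of "&nbsp;" into "&amp;".
restored = escaped.replace("\xa0", "&nbsp;")

print(restored)  # &nbsp;asdfg&gt;123&lt;
```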

2. Replace

html_cont = "&nbsp;asdfg&gt;123&lt;"
new_cont = html_cont.replace('&nbsp;', ' ')
print new_cont  # new_cont = " asdfg&gt;123&lt;"
new_cont = new_cont.replace('&gt;', '>')
print new_cont  # new_cont = " asdfg>123&lt;"
new_cont = new_cont.replace('&lt;', '<')
print new_cont  # new_cont = " asdfg>123<"

This simply replaces each entity directly. I don't know if there is a better way.
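
One way to tidy up the chained replace calls is to drive them from a table. The sketch below is only an illustration, written for Python 3: ENTITY_MAP and replace_entities are hypothetical names, and the table covers just a handful of entities, so it is not a substitute for a full unescape function.

```python
# Hypothetical mapping; extend it with whatever entities the pages contain.
ENTITY_MAP = [
    ("&nbsp;", " "),
    ("&gt;", ">"),
    ("&lt;", "<"),
    ("&quot;", '"'),
    ("&amp;", "&"),  # keep last so "&amp;lt;" becomes "&lt;", not "<"
]


def replace_entities(text):
    """Replace each entity listed in ENTITY_MAP with its plain character."""
    for entity, char in ENTITY_MAP:
        text = text.replace(entity, char)
    return text


html_cont = "&nbsp;asdfg&gt;123&lt;"
new_cont = replace_entities(html_cont)
print(new_cont)  # " asdfg>123<" with an ordinary leading space
```

The order of the table matters: replacing &amp; first would turn text such as &amp;lt; into &lt; and then into <, which decodes it twice.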

In addition, Stack Overflow has an answer on handling escape characters in XML: "What's the best way to handle &nbsp;-like entities in XML documents with lxml?"
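
The problem in the XML case is that XML predefines only five entities (&amp;, &lt;, &gt;, &quot; and &apos;), so a parser rejects &nbsp; unless a DTD declares it. The sketch below is not the approach from the linked answer and does not use lxml. It is a standard-library illustration, assuming Python 3, that rewrites HTML named entities as numeric character references before parsing; replace_html_entities is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities every XML parser already understands; leave these untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_html_entities(xml_text):
    """Rewrite HTML named entities (e.g. &nbsp;) as numeric references."""
    def _sub(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep as-is; unknown names still fail in the parser
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", _sub, xml_text)


doc = "<p>a&nbsp;b &lt; c</p>"
fixed = replace_html_entities(doc)  # "<p>a&#160;b &lt; c</p>"
root = ET.fromstring(fixed)
print(repr(root.text))  # 'a\xa0b < c'
```

Numeric references are valid in any XML document, so the rewritten text parses without a DTD while &lt; and the other predefined entities keep their escaping.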
