This article shows how to handle HTML escape characters in Python. It is shared here for reference; the details are as follows:
When processing web page data with Python recently, I often ran into HTML escape characters (also called HTML character entities), such as &nbsp;, &lt; and &gt;. Character entities are generally used to represent reserved characters in web pages. For example, > is written as &gt; so that the browser does not mistake it for part of a tag. For details, please refer to w3school's HTML character entities. Although useful, these entities get in the way when parsing web data. There are the following ways to handle them:
1. Use HTMLParser (Python 2)
import HTMLParser
html_cont = "&nbsp;asdfg&gt;123&lt;"
html_parser = HTMLParser.HTMLParser()
new_cont = html_parser.unescape(html_cont)
print new_cont  # new_cont = " asdfg>123<" (the leading character is a non-breaking space)
To convert back (note that the non-breaking space is not converted back to &nbsp;):
import cgi
new_cont = cgi.escape(new_cont)
print new_cont  # new_cont = " asdfg&gt;123&lt;"
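The snippets above use Python 2 syntax (`import HTMLParser`, the `print` statement). As a rough sketch of the same round trip on Python 3, assuming only the standard library's `html` module, something like the following should work. The variable name `escaped` is introduced here for illustration and is not part of the original example.

```python
import html

html_cont = "&nbsp;asdfg&gt;123&lt;"

# Decode character entities into the characters they stand for.
new_cont = html.unescape(html_cont)
print(repr(new_cont))  # '\xa0asdfg>123<' (\xa0 is the non-breaking space)

# Escape the reserved characters again; the non-breaking space is left
# as a literal character rather than turned back into &nbsp;.
escaped = html.escape(new_cont)
print(escaped)
```

As in the Python 2 version, escaping only covers reserved characters such as `&`, `<` and `>`, so the round trip does not restore `&nbsp;`.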
2. Replace
html_cont = "&nbsp;asdfg&gt;123&lt;"
new_cont = html_cont.replace('&nbsp;', ' ')
print new_cont  # new_cont = " asdfg&gt;123&lt;"
new_cont = new_cont.replace('&gt;', '>')
print new_cont  # new_cont = " asdfg>123&lt;"
new_cont = new_cont.replace('&lt;', '<')
print new_cont  # new_cont = " asdfg>123<"
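The chained replace calls can also be driven from a small table, which makes it easier to add entities later. The sketch below is only an illustration: the names `REPLACEMENTS` and `unescape_basic` are hypothetical, and the `&amp;` entry is an addition that does not appear in the example above. It is placed last so that text such as `&amp;lt;` decodes to `&lt;` rather than being unescaped twice.

```python
# Hypothetical table of (entity, replacement) pairs, applied in order.
REPLACEMENTS = [
    ("&nbsp;", " "),  # replaced with a plain space, as in the example above
    ("&gt;", ">"),
    ("&lt;", "<"),
    ("&amp;", "&"),   # assumption: added here; kept last to avoid double unescaping
]


def unescape_basic(text):
    """Replace a handful of named entities with plain characters."""
    for entity, char in REPLACEMENTS:
        text = text.replace(entity, char)
    return text


print(unescape_basic("&nbsp;asdfg&gt;123&lt;"))
```

This approach only handles the entities listed in the table; numeric references such as `&#62;` and any other named entities pass through unchanged.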