Handling HTML escape characters in Python

高洛峰

Published: 2017-03-01 13:27:57


This article shows, by example, how to handle HTML escape characters in Python. It is shared here for reference:

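While processing web page data with Python recently, I kept running into HTML escape characters (also called HTML character entities), such as &lt;, &gt; and &nbsp;. Character entities generally exist to represent reserved characters in a page. For example, > is written as &gt; so that the browser does not mistake it for part of a tag (see the HTML character entities reference on w3school for details). Useful as they are, they seriously get in the way of parsing page data. The following approaches handle them: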

1. Using HTMLParser

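# Python 2 only: uses the HTMLParser module and the print statement
import HTMLParser
html_cont = "&nbsp;asdfg&gt;123&lt;"
html_parser = HTMLParser.HTMLParser()
new_cont = html_parser.unescape(html_cont)  # decode entities into characters
print new_cont  # new_cont = u"\xa0asdfg>123<" (&nbsp; becomes U+00A0)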
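
The snippet above targets Python 2. On Python 3 the unescape method of HTMLParser is no longer available (it was removed in Python 3.9), and the standard library provides html.unescape instead. A minimal sketch of the same conversion, assuming Python 3.4 or later:

```python
import html

html_cont = "&nbsp;asdfg&gt;123&lt;"

# html.unescape decodes named and numeric character references.
new_cont = html.unescape(html_cont)
print(repr(new_cont))  # '\xa0asdfg>123<'
```

The leading \xa0 is the non-breaking space that &nbsp; stands for, not an ordinary ASCII space.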


Converting back (although the space does not convert back to &nbsp;):

import cgi
new_cont = cgi.escape(new_cont)  # escapes only &, < and >
print new_cont  # new_cont = u"\xa0asdfg&gt;123&lt;" (U+00A0 is not turned back into &nbsp;)
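
cgi.escape was removed in Python 3.8, and html.escape is its standard-library replacement. The sketch below, assuming Python 3, repeats the round trip and shows one possible way to restore &nbsp; afterward. The extra replace call is an addition for illustration and is not part of the method described above:

```python
import html

unescaped = html.unescape("&nbsp;asdfg&gt;123&lt;")

# html.escape handles &, < and > (plus quotes by default),
# so the non-breaking space U+00A0 passes through unchanged.
escaped = html.escape(unescaped)
print(repr(escaped))  # '\xa0asdfg&gt;123&lt;'

# Assumption: every non-breaking space should be written back as &nbsp;.
restored = escaped.replace("\xa0", "&nbsp;")
print(restored)  # &nbsp;asdfg&gt;123&lt;
```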


2. Replacing the entities one by one

html_cont = "&nbsp;asdfg&gt;123&lt;"
new_cont = html_cont.replace('&nbsp;', ' ')  # start from html_cont, not new_cont
print new_cont  # new_cont = " asdfg&gt;123&lt;"
new_cont = new_cont.replace('&gt;', '>')
print new_cont  # new_cont = " asdfg>123&lt;"
new_cont = new_cont.replace('&lt;', '<')
print new_cont  # new_cont = " asdfg>123<"
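
The chained replace calls can also be folded into a loop over a small table. The following is a sketch in Python 3 syntax; ENTITY_TABLE and replace_entities are hypothetical names introduced here, and the table covers only a handful of entities. &amp; is handled last so that text such as &amp;lt; is not decoded twice:

```python
# Hypothetical lookup table: (entity, replacement character).
ENTITY_TABLE = [
    ("&nbsp;", " "),
    ("&gt;", ">"),
    ("&lt;", "<"),
    ("&quot;", '"'),
    ("&amp;", "&"),  # must stay last to avoid double decoding
]

def replace_entities(text):
    """Replace each entity listed in ENTITY_TABLE, in table order."""
    for entity, char in ENTITY_TABLE:
        text = text.replace(entity, char)
    return text

print(repr(replace_entities("&nbsp;asdfg&gt;123&lt;")))  # ' asdfg>123<'
print(replace_entities("&amp;lt;"))  # &lt;
```

Any entity missing from the table is left untouched, which is the main limitation of this approach compared with a full decoder.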


I am not sure whether there is a better approach.

In addition, Stack Overflow has an answer on handling escape characters in XML: python - What's the best way to handle &nbsp;-like entities in XML documents with lxml? - Stack Overflow.
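
XML parsers recognise only five predefined entities (&amp;, &lt;, &gt;, &quot;, &apos;), so an HTML entity such as &nbsp; makes a plain XML parse fail. The sketch below is not the approach from the linked answer and does not use lxml. It is a standard-library illustration, assuming Python 3, that rewrites HTML named entities as numeric character references before parsing. The names XML_PREDEFINED and named_to_numeric are hypothetical:

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML itself defines; these must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(text):
    """Rewrite HTML named entities (e.g. &nbsp;) as numeric references (&#160;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML's own or unknown entities
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

doc = "<p>a&nbsp;b &lt; c</p>"
root = ET.fromstring(named_to_numeric(doc))
print(repr(root.text))  # 'a\xa0b < c'
```

Converting to numeric references rather than calling html.unescape on the whole document keeps &lt; and &amp; escaped, so the markup stays well-formed.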
