Garbled text when Python fetches and saves HTML pages


高洛峰

Released: 2017-03-01 13:25:22


When you fetch HTML pages with Python and save them, the saved content often comes out garbled. There are two usual causes: either the encoding settings in your own code are wrong, or your settings are correct but the page's actual encoding does not match the encoding it declares. The declared encoding normally appears in a meta tag in the page's head, either as a charset attribute or as the charset parameter of an http-equiv Content-Type declaration.
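As a rough illustration of where a declared encoding can be read from, the sketch below pulls the charset out of an HTTP Content-Type header value and out of a meta tag in the raw page bytes. It assumes Python 3 and uses only the standard library; the helper names `charset_from_header` and `charset_from_meta` and the regular expression are illustrative choices, not part of the original article.

```python
import re
from email.message import Message

# Hypothetical pattern: matches both <meta charset="..."> and the
# charset=... parameter inside an http-equiv Content-Type meta tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(raw_html):
    """Return the charset declared in a meta tag of the raw bytes, or None."""
    # Assumption: the declaration sits near the top of the document.
    match = META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode('ascii').lower() if match else None
```

Either value is only a claim made by the server or the page author, which is why it may disagree with the bytes actually sent.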

Here is a simple solution: use chardet to detect the page's actual encoding, and read the declared encoding from the info returned by the URL request. If the two differ, parse the content with the bs4 module, decoding it as GB18030 (a superset of GB2312 and GBK), and write the result; if they match, write the content to the file directly (the system default encoding is set to utf-8 here).

import urllib2
import sys
import bs4
import chardet

# Python 2 only: make utf-8 the default encoding for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding('utf-8')

def download(url):
  htmlfile = open('test.html', 'w')
  try:
    result = urllib2.urlopen(url)
    content = result.read()
    info = result.info()
    result.close()
  except Exception, e:
    print 'download error!!!'
    print e
  else:
    if content is not None:
      charset1 = chardet.detect(content)['encoding']  # actual (detected) encoding
      charset2 = info.getparam('charset')  # declared encoding
      print charset1, ' ', charset2
      # case 1: both encodings are known and they differ
      if charset1 is not None and charset2 is not None and charset1.lower() != charset2.lower():
        newcont = bs4.BeautifulSoup(content, from_encoding='GB18030')  # decode as GB18030
        for cont in newcont:
          htmlfile.write('%s\n' % cont)
      # case 2: either encoding is unknown, or the two are the same
      else:
        htmlfile.write(content)  # bytes written as received
  htmlfile.close()

if __name__ == "__main__":
  url = 'http://www.php.cn'
  download(url)
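The script above targets Python 2 (urllib2, reload(sys), print statements). A possible Python 3 version of the same idea might look like the sketch below. It is not the author's code: it assumes Python 3, swaps the BeautifulSoup step for a plain `bytes.decode('gb18030')`, treats chardet as optional, and adds a `path` parameter and a return value for convenience.

```python
import urllib.request

try:
    import chardet  # third-party, as in the original script
except ImportError:  # assumption: without chardet, skip detection entirely
    chardet = None


def download(url, path='test.html'):
    """Fetch url and save it to path, re-encoding to UTF-8 on a charset mismatch."""
    with urllib.request.urlopen(url) as result:
        content = result.read()
        declared = result.info().get_content_charset()  # from Content-Type
    detected = chardet.detect(content)['encoding'] if chardet else None

    if declared and detected and declared.lower() != detected.lower():
        # Case 1: the two disagree; decode as GB18030 and store as UTF-8.
        # errors='replace' is an assumption so that stray bytes do not abort the save.
        text = content.decode('gb18030', errors='replace')
        with open(path, 'w', encoding='utf-8') as htmlfile:
            htmlfile.write(text)
    else:
        # Case 2: an encoding is unknown or both agree; keep the bytes as received.
        with open(path, 'wb') as htmlfile:
            htmlfile.write(content)
    return declared, detected
```

Calling `download('http://www.php.cn')` would then write test.html in the current directory, as in the original script. No default-encoding override is needed here, because the output encoding is passed to `open` explicitly.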


Opening the resulting test.html file shows that it is stored as UTF-8 without a BOM, which is the default encoding we set.
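One way to check that claim without a text editor is to inspect the saved bytes directly. The sketch below assumes Python 3; the function name `describe_utf8` is made up for illustration. It reports whether a file decodes cleanly as UTF-8 and whether it starts with the UTF-8 byte order mark.

```python
import codecs


def describe_utf8(path):
    """Return (is_valid_utf8, has_bom) for the file at path."""
    with open(path, 'rb') as f:
        data = f.read()
    has_bom = data.startswith(codecs.BOM_UTF8)  # bytes EF BB BF
    try:
        data.decode('utf-8')
        is_valid = True
    except UnicodeDecodeError:
        is_valid = False
    return is_valid, has_bom
```

For a file like the one described above, `describe_utf8('test.html')` would be expected to return `(True, False)`.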
