Web crawler - How can I tell whether a page fetched with Python is compressed?
黄舟 2017-04-17 15:37:41

I tried to crawl Qiushibaike today. After pressing F12, I saw Accept-Encoding: gzip, deflate, sdch in the Request Headers, so I assumed the page was compressed. But then I ran this:

response=urllib.request.urlopen(Request
print(response.info().get('Content-Encoding'))

It returned None. How can I actually determine whether the response is compressed?


All replies (1)
洪涛

Your crawler has to send an Accept-Encoding header in its own request before the server will compress the response.

In the browser, Accept-Encoding: gzip, deflate, sdch is a request header. It tells the website that the browser can handle three compression methods: gzip, deflate, and sdch. In other words, it describes what the browser supports, not what the website supports or what the website actually used.

The website picks one of the methods the client offered (or none) and reports its choice in the Content-Encoding response header. The browser then selects the matching decompression method based on that value. If the response has no Content-Encoding header, the body was sent uncompressed.
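urllib does not decompress the body for you, so a crawler has to do what the browser does. A minimal sketch of that step, assuming only gzip and deflate need handling; decode_body is a hypothetical helper name, not part of urllib:

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding) -> bytes:
    """Decompress a response body according to its Content-Encoding value.

    decode_body is a hypothetical helper for illustration only.
    """
    # A missing header (None) means the body was sent as-is.
    encoding = (content_encoding or "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Standard form: a zlib-wrapped deflate stream.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding == "identity":
        return raw
    raise ValueError("unsupported Content-Encoding: " + encoding)
```

With a urllib response object res, this would be called as decode_body(res.read(), res.info().get('Content-Encoding')).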

Qiushibaike supports gzip, but if the request does not set Accept-Encoding, the response is not compressed. That is why your script printed None.

#!/usr/bin/env python3
from urllib import request

USER_AGENT = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36'

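# Advertise gzip support so the server is allowed to compress the response.
req = request.Request(r'http://www.qiushibaike.com/', headers={'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip'})
res = request.urlopen(req)

# Content-Encoding names the compression the server actually applied (None if uncompressed).
print(res.info().get('Content-Encoding'))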

The output of the above script is:

gzip
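If you do not want to rely on the header alone, the body itself can be inspected: a gzip stream always starts with the two bytes 0x1f 0x8b. A small sketch of that check; looks_gzipped is a hypothetical name, and the check covers gzip only, not deflate or sdch:

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(body: bytes) -> bool:
    """Return True if the raw bytes start with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


# Local demonstration without any network access.
compressed = gzip.compress(b"<html>hello</html>")
print(looks_gzipped(compressed))
print(looks_gzipped(b"<html>hello</html>"))
```

This is only a sanity check on the bytes returned by res.read(); the Content-Encoding header remains the authoritative signal.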