python - How can I monitor a web page for changes without refreshing it?

Question

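I'm using Python to monitor a web page that is updated at irregular intervals, and I need to look for a keyword such as 'ABC'. The rough structure of the program is as follows. The basic approach is to fetch the page source with selenium, parse it with BeautifulSoup, and then match against the result, looping once every 2 seconds. {code...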

黄舟 · Answer

HTTP Last-Modified

1) What is "Last-Modified"?
When the browser requests a URL for the first time, the server responds with status 200, the body is the requested resource, and a Last-Modified header records when the file was last modified on the server. It looks like this:
Last-Modified: Fri, 12 May 2006 18:53:33 GMT
When the client requests the same URL a second time, the browser follows the HTTP protocol and sends an If-Modified-Since header to ask the server whether the file has been modified since that time:
If-Modified-Since: Fri, 12 May 2006 18:53:33 GMT
If the resource on the server has not changed, the server returns HTTP 304 (Not Modified) with an empty body, which saves the data that would otherwise be transmitted. When the server-side code changes or the server is restarted, the resource is sent again and the response looks like the one for the first request. This avoids sending the same resource to the client repeatedly while still letting the client get the latest version when the server's copy changes.

Request with the 'If-Modified-Since' header

Status Code: 304 Not Modified

Status code 304 means the page has not changed since the given time.

>>> import requests as req
>>> url='http://www.guancha.cn/'
>>> rsp=req.head(url,headers={'If-Modified-Since':'Sun, 05 Feb 2017 05:39:11 GMT'})
>>> rsp
<Response [304]>
>>> rsp.headers
{'Server': 'NWS_TCloud_S1', 'Content-Type': 'text/html', 'Date': 'Sun, 05 Feb 2017 05:45:20 GMT', 'Cache-Control': 'max-age=60', 'Expires': 'Sun, 05 Feb 2017 05:46:20 GMT', 'Content-Length': '0', 'Connection': 'keep-alive'}

With the time changed to yesterday (the 4th), the server returns status code 200, and the response includes 'Last-Modified': 'Sun, 05 Feb 2017 06:00:03 GMT', which is the time of the last modification.

>>> hds={'If-Modified-Since':'Sat, 04 Feb 2017 05:39:11 GMT'} # change the time to yesterday (the 4th)
>>> rsp=req.head(url,headers=hds)
>>> rsp
<Response [200]>
>>> rsp.headers
{'Last-Modified': 'Sun, 05 Feb 2017 06:00:03 GMT', 'Date': 'Sun, 05 Feb 2017 06:04:59 GMT', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'X-Daa-Tunnel': 'hop_count=2', 'X-Cache-Lookup': 'Hit From Disktank3 Gz, Hit From Inner Cluster, Hit From Upstream', 'Server': 'nws_ocmid_hy', 'Content-Type': 'text/html', 'Expires': 'Sun, 05 Feb 2017 06:05:59 GMT', 'Cache-Control': 'max-age=60', 'Content-Length': '62608'}
>>>
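
Putting the two requests above into the 2-second loop from the question might look roughly like the sketch below. It is only a sketch under some assumptions: the server honors If-Modified-Since, and the keyword appears in the raw HTML rather than being injected by JavaScript. The names `check_once` and `monitor` are made up for illustration, and GET is used instead of HEAD because the body is needed for the keyword search.

```python
import time


def check_once(get, url, keyword, last_modified=None):
    """Send one conditional GET and return (found, last_modified).

    `get` is any callable with the same signature as requests.get.
    `found` is None when the server answers 304, i.e. nothing to re-scan.
    """
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    rsp = get(url, headers=headers, timeout=10)
    if rsp.status_code == 304:
        return None, last_modified
    rsp.raise_for_status()
    # Remember the new timestamp for the next request, if the server sent one.
    new_last_modified = rsp.headers.get("Last-Modified", last_modified)
    return keyword in rsp.text, new_last_modified


def monitor(url, keyword, interval=2.0):
    """Poll `url` every `interval` seconds until `keyword` shows up."""
    import requests  # imported here so check_once can be tested without it

    last_modified = None
    while True:
        found, last_modified = check_once(requests.get, url, keyword, last_modified)
        if found:
            print("keyword found:", keyword)
            return
        time.sleep(interval)
```

Calling `monitor('http://www.guancha.cn/', 'ABC')` would then only download and scan the full page when the server reports a change; every other round costs one small 304 response.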

伊谢尔伦 · Answer

Either way, you have to visit the source site to get the data. If you don't fetch the data, how would you know whether anything has changed?
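
If the page does have to be fetched every round, one possible way to tell whether it changed is to compare a hash of the body with the hash from the previous round, and only re-parse when they differ. This is a minimal sketch of that idea and not something from the original posts; `body_digest` and `has_changed` are illustrative names.

```python
import hashlib


def body_digest(body: bytes) -> str:
    """Return a SHA-256 hex digest of the raw response body."""
    return hashlib.sha256(body).hexdigest()


def has_changed(body: bytes, previous_digest):
    """Compare the body with the previous digest.

    Returns (changed, digest) so the caller can store the new digest
    and pass it back in on the next round.
    """
    digest = body_digest(body)
    return digest != previous_digest, digest
```

Pages that embed timestamps or random tokens will look changed on every request, so in practice it may be necessary to hash only the part of the page that matters.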

大家讲道理 · Answer

This kind of update is probably done with ajax. Personally, I would look at the site's JS code to find the request URL and its parameters. If that works, why not send the request directly?
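
Assuming the page's JavaScript turns out to load its updates from a JSON endpoint, requesting it directly could be sketched as follows. The endpoint URL, its parameters, and the shape of the JSON are all unknown here, so `fetch_updates` and `contains_keyword` are hypothetical helpers and the real values would have to be read from the browser's network panel.

```python
def contains_keyword(data, keyword):
    """Recursively search decoded JSON (dicts, lists, strings) for `keyword`."""
    if isinstance(data, str):
        return keyword in data
    if isinstance(data, dict):
        return any(contains_keyword(v, keyword) for v in data.values())
    if isinstance(data, list):
        return any(contains_keyword(v, keyword) for v in data)
    return False  # numbers, booleans and null cannot contain the keyword


def fetch_updates(get, api_url, params=None):
    """Call the (hypothetical) JSON endpoint that the page's JS uses.

    `get` is any callable with the same signature as requests.get.
    """
    rsp = get(api_url, params=params, timeout=10)
    rsp.raise_for_status()
    return rsp.json()
```

With requests installed, `contains_keyword(fetch_updates(requests.get, api_url), 'ABC')` would replace the selenium and BeautifulSoup steps, since there is no HTML left to render or parse.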