Table of Contents
回复内容:
Home Backend Development Python Tutorial 如何使用爬虫监控一系列网站的更新情况?

如何使用爬虫监控一系列网站的更新情况?

Jun 06, 2016 pm 04:22 PM
html

我现在想到的方法只有每天自动把网站爬下来 然后对比新旧网站的HTML文件 才决定有没有更新

回复内容:

1 第一次先请求某个网页,抓取到本地,假设文件名为 a.html。这时文件系统有个文件的修改时间。

2 第二次访问网页,如果发现本地已经有了 a.html,则向服务器发送一个 If-Modified-Since 的请求(w3.org/Protocols/rfc261)。 把 a.html 的修改时间写到请求里。

3 如果网页更新了,服务器会返回一个 200 的应答,这时就重新抓取网页,更新本地文件。

4 如果网页没有更新,服务器会返回一个304的应答。这时就不需要更新文件了。 这个问题已经有人做出现成产品了,你可以看一下:
sleepingspider.com
注册成为用户后,可以选择需要关注的网页,如有更新会收到邮件提醒。还有一些高级的设置,没用过,你可以看看 我的本科毕设就是这个。。
当时做了一套监控果库、想去、花瓣市集、暖岛的服务。

实现方式:
1. crontab 定时任务
2. node 读取配置并调用 phantomjs(内存型浏览器) 访问各链接并存图。
3. 所有图片用日期分文件夹命名,用 Bootstrap 做个对比显示。

如果有这样一套服务,我觉得挺好的。
不过付费率可能是个问题。 也许用git对扒下来网页做版本控制也行吧? 我歪个楼
chrome有个Page Monitor的插件 使用MD5数字签名
每次下载网页时,把服务器返回的数据流ResponseStream先放在内存缓冲区,然后对
ResponseStream生成MD5数字签名S1,下次下载同样生成签名S2,比较S2和S1,如果相同,则页面没有
跟新,否则网页就有跟新。 可以使用网站资讯监控工具,非常符合你的要求
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Table Border in HTML Table Border in HTML Sep 04, 2024 pm 04:49 PM

Guide to Table Border in HTML. Here we discuss multiple ways for defining table-border with examples of the Table Border in HTML.

HTML margin-left HTML margin-left Sep 04, 2024 pm 04:48 PM

Guide to HTML margin-left. Here we discuss a brief overview on HTML margin-left and its Examples along with its Code Implementation.

Nested Table in HTML Nested Table in HTML Sep 04, 2024 pm 04:49 PM

This is a guide to Nested Table in HTML. Here we discuss how to create a table within the table along with the respective examples.

HTML Table Layout HTML Table Layout Sep 04, 2024 pm 04:54 PM

Guide to HTML Table Layout. Here we discuss the Values of HTML Table Layout along with the examples and outputs n detail.

HTML Input Placeholder HTML Input Placeholder Sep 04, 2024 pm 04:54 PM

Guide to HTML Input Placeholder. Here we discuss the Examples of HTML Input Placeholder along with the codes and outputs.

HTML Ordered List HTML Ordered List Sep 04, 2024 pm 04:43 PM

Guide to the HTML Ordered List. Here we also discuss introduction of HTML Ordered list and types along with their example respectively

Moving Text in HTML Moving Text in HTML Sep 04, 2024 pm 04:45 PM

Guide to Moving Text in HTML. Here we discuss an introduction, how marquee tag work with syntax and examples to implement.

HTML onclick Button HTML onclick Button Sep 04, 2024 pm 04:49 PM

Guide to HTML onclick Button. Here we discuss their introduction, working, examples and onclick Event in various events respectively.

See all articles