Home Backend Development PHP Tutorial 一个采集得到信息不全的有关问题

一个采集得到信息不全的有关问题

Jun 13, 2016 am 10:17 AM
http referer request

求助一个采集得到信息不全的问题
我要采集这个网站
http://www.tvmao.com/drama/MGxYWA==/episode/0

刚开始的时候,得到的信息是全的,

当采集到一定时候的时候,采集得到的信息只有半了,少了一些文字。

(我然后拿到其它地方用IE打开看的时候,发现先加载了一半文字,过一小会,在加载一半的文字)
(用本地浏览器打开,只有一半的文字)
还请问一下,怎么处理一下。才能获取全部信息。
















------解决方案--------------------
有可能这个网站作了防采集处理,同一IP如果访问过频,针对此IP就启动防采集了,这也符合你说的刚开始可以完整采集,时间一长就不行的情况。不过这个还好了,有的网站变态到每次1K字节的间隔输出呢
------解决方案--------------------

探讨

这样啊,我该怎么做一下,才能不被防采集呢?
引用:

有可能这个网站作了防采集处理,同一IP如果访问过频,针对此IP就启动防采集了,这也符合你说的刚开始可以完整采集,时间一长就不行的情况。不过这个还好了,有的网站变态到每次1K字节的间隔输出呢

------解决方案--------------------
防止采集:
1:用户登录才能访问网站内容
2:利用脚本语言做分页(隐藏分页)
3:防盗链办法(只许可通过本站页面连接查看,如:Request.ServerVariables(“HTTP_REFERER“) )
4:全flash、图片或者pdf来浮现网站内容
5:网站随机接纳不同模版
6:接纳动态不规则的html标签
一旦要同时搜索引擎爬虫和采集器,这是很让人无奈的工作,因为搜索引擎第一步就是采集目标网页内容,这跟采集器原理同样,所以很多防止采集的方法同时也阻碍了搜索引擎对网站的收录,无奈,是吧?以上10条建议虽然不能百分之百防采集,可是几种方法一起适用已经拒绝了一大部分采集器了。
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

What does http status code 520 mean? What does http status code 520 mean? Oct 13, 2023 pm 03:11 PM

HTTP status code 520 means that the server encountered an unknown error while processing the request and cannot provide more specific information. Used to indicate that an unknown error occurred when the server was processing the request, which may be caused by server configuration problems, network problems, or other unknown reasons. This is usually caused by server configuration issues, network issues, server overload, or coding errors. If you encounter a status code 520 error, it is best to contact the website administrator or technical support team for more information and assistance.

Understand common application scenarios of web page redirection and understand the HTTP 301 status code Understand common application scenarios of web page redirection and understand the HTTP 301 status code Feb 18, 2024 pm 08:41 PM

Understand the meaning of HTTP 301 status code: common application scenarios of web page redirection. With the rapid development of the Internet, people's requirements for web page interaction are becoming higher and higher. In the field of web design, web page redirection is a common and important technology, implemented through the HTTP 301 status code. This article will explore the meaning of HTTP 301 status code and common application scenarios in web page redirection. HTTP301 status code refers to permanent redirect (PermanentRedirect). When the server receives the client's

How to use Nginx Proxy Manager to implement automatic jump from HTTP to HTTPS How to use Nginx Proxy Manager to implement automatic jump from HTTP to HTTPS Sep 26, 2023 am 11:19 AM

How to use NginxProxyManager to implement automatic jump from HTTP to HTTPS. With the development of the Internet, more and more websites are beginning to use the HTTPS protocol to encrypt data transmission to improve data security and user privacy protection. Since the HTTPS protocol requires the support of an SSL certificate, certain technical support is required when deploying the HTTPS protocol. Nginx is a powerful and commonly used HTTP server and reverse proxy server, and NginxProxy

What is http status code 403? What is http status code 403? Oct 07, 2023 pm 02:04 PM

HTTP status code 403 means that the server rejected the client's request. The solution to http status code 403 is: 1. Check the authentication credentials. If the server requires authentication, ensure that the correct credentials are provided; 2. Check the IP address restrictions. If the server has restricted the IP address, ensure that the client's IP address is restricted. Whitelisted or not blacklisted; 3. Check the file permission settings. If the 403 status code is related to the permission settings of the file or directory, ensure that the client has sufficient permissions to access these files or directories, etc.

Quick Application: Practical Development Case Analysis of PHP Asynchronous HTTP Download of Multiple Files Quick Application: Practical Development Case Analysis of PHP Asynchronous HTTP Download of Multiple Files Sep 12, 2023 pm 01:15 PM

Quick Application: Practical Development Case Analysis of PHP Asynchronous HTTP Download of Multiple Files With the development of the Internet, the file download function has become one of the basic needs of many websites and applications. For scenarios where multiple files need to be downloaded at the same time, the traditional synchronous download method is often inefficient and time-consuming. For this reason, using PHP to download multiple files asynchronously over HTTP has become an increasingly common solution. This article will analyze in detail how to use PHP asynchronous HTTP through an actual development case.

How to use the urllib.request.urlopen() function to send a GET request in Python 3.x How to use the urllib.request.urlopen() function to send a GET request in Python 3.x Jul 30, 2023 am 11:28 AM

How to use the urllib.request.urlopen() function in Python3.x to send a GET request. In network programming, we often need to obtain data from a remote server by sending an HTTP request. In Python, we can use the urllib.request.urlopen() function in the urllib module to send an HTTP request and get the response returned by the server. This article will introduce how to use

http request 415 error solution http request 415 error solution Nov 14, 2023 am 10:49 AM

Solution: 1. Check the Content-Type in the request header; 2. Check the data format in the request body; 3. Use the appropriate encoding format; 4. Use the appropriate request method; 5. Check the server-side support.

Common network communication and security problems and solutions in C# Common network communication and security problems and solutions in C# Oct 09, 2023 pm 09:21 PM

Common network communication and security problems and solutions in C# In today's Internet era, network communication has become an indispensable part of software development. In C#, we usually encounter some network communication problems, such as data transmission security, network connection stability, etc. This article will discuss in detail common network communication and security issues in C# and provide corresponding solutions and code examples. 1. Network communication problems Network connection interruption: During the network communication process, the network connection may be interrupted, which may cause

See all articles