Home php教程 php手册 PHP 实现小偷程序

PHP 实现小偷程序

Jun 06, 2016 pm 07:41 PM
php Why use accomplish crawl program Remotely

为什么使用“小偷程序”? 远程抓取文章资讯或商品信息是很多企业要求程序员实现的功能,也就是俗说的 小偷程序 。其最主要的优点是:解决了公司网编繁重的工作,大大提高了效率。只需要一运行就能快速的抓取别人网站的信息。 “小偷程序”在哪里运行? “小

为什么使用“小偷程序”?

        远程抓取文章资讯或商品信息是很多企业要求程序员实现的功能,也就是俗说的小偷程序。其最主要的优点是:解决了公司网编繁重的工作,大大提高了效率。只需要一运行就能快速的抓取别人网站的信息。

“小偷程序”在哪里运行?

        “小偷程序” 应该在 Windows 下的 DOS(参考文章:http://blog.csdn.net/liruxing1715/article/details/7079488) 或 Linux 下通过 PHP 命令运行为最佳,因为,网页运行会超时。

        比如图(Windows 下 DOS 为例):

        PHP 实现小偷程序

“小偷程序”的实现

        这里主要通过一个实例来讲解,我们来抓取下“华强电子网”的资讯信息,请先看观察这个链接 http://www.hqew.com/info-c10.html,当您打开这个页面的时候发现这个页面会发现一些现象:

        1、资讯列表有 500 页(2012-01-03);

        2、每页的 url 链接都有规律,比如:第1页为http://www.hqew.com/info-c10-1.html;第2页为http://www.hqew.com/info-c10-2.html;……第500页为http://www.hqew.com/info-c10-500.html;

        3、由第二点就可以知道,“华强电子网” 的资讯是伪静态或者是生成的静态页面

        其实,基本上大部分的网站都有这样的规律,比如:中关村在线、慧聪网、新浪、淘宝……。

        这样,我们可以通过这样的思路来实现页面内容的抓取:

        1、先获取文章列表页内容;

        2、根据文章列表页内容循环获取文章的 url 地址;

        3、根据文章的 url 地址获取文章的详细内容

        这里,我们主要抓取资讯页里面的:标题(title)、发布如期(date)、作者(author)、来源(source)、内容(content)

“华强电子网”资讯抓取

        首先,先建数据表结构,如下所示:

CREATE TABLE `article`.`article` (
`id` MEDIUMINT( 8 ) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`title` VARCHAR( 255 ) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL ,
`date` VARCHAR( 50 ) NOT NULL ,
`author` VARCHAR( 100 ) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL ,
`source` VARCHAR( 100 ) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL ,
`content` TEXT NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;
Copy after login

        抓取程序:
Copy after login

通过上面的程序,就可以实现抓取华强电子网的资讯信息。

入口方法 init($min, $max) 如果想抓取 1-500 页面内容,那么 init(1, 500) 即可!这样,用不了多长时间,华强电子网的资讯就会全部抓取到数据库里面了。^_^

执行界面:

PHP 实现小偷程序

数据库:

PHP 实现小偷程序

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

What is Cross-Site Request Forgery (CSRF) and how do you implement CSRF protection in PHP? What is Cross-Site Request Forgery (CSRF) and how do you implement CSRF protection in PHP? Apr 07, 2025 am 12:02 AM

In PHP, you can effectively prevent CSRF attacks by using unpredictable tokens. Specific methods include: 1. Generate and embed CSRF tokens in the form; 2. Verify the validity of the token when processing the request.

How can you prevent a class from being extended or a method from being overridden in PHP? (final keyword) How can you prevent a class from being extended or a method from being overridden in PHP? (final keyword) Apr 08, 2025 am 12:03 AM

In PHP, the final keyword is used to prevent classes from being inherited and methods being overwritten. 1) When marking the class as final, the class cannot be inherited. 2) When marking the method as final, the method cannot be rewritten by the subclass. Using final keywords ensures the stability and security of your code.

Explain strict types (declare(strict_types=1);) in PHP. Explain strict types (declare(strict_types=1);) in PHP. Apr 07, 2025 am 12:05 AM

Strict types in PHP are enabled by adding declare(strict_types=1); at the top of the file. 1) It forces type checking of function parameters and return values ​​to prevent implicit type conversion. 2) Using strict types can improve the reliability and predictability of the code, reduce bugs, and improve maintainability and readability.

Unable to log in to mysql as root Unable to log in to mysql as root Apr 08, 2025 pm 04:54 PM

The main reasons why you cannot log in to MySQL as root are permission problems, configuration file errors, password inconsistent, socket file problems, or firewall interception. The solution includes: check whether the bind-address parameter in the configuration file is configured correctly. Check whether the root user permissions have been modified or deleted and reset. Verify that the password is accurate, including case and special characters. Check socket file permission settings and paths. Check that the firewall blocks connections to the MySQL server.

The Future of PHP: Adaptations and Innovations The Future of PHP: Adaptations and Innovations Apr 11, 2025 am 12:01 AM

The future of PHP will be achieved by adapting to new technology trends and introducing innovative features: 1) Adapting to cloud computing, containerization and microservice architectures, supporting Docker and Kubernetes; 2) introducing JIT compilers and enumeration types to improve performance and data processing efficiency; 3) Continuously optimize performance and promote best practices.

PHP vs. Python: Understanding the Differences PHP vs. Python: Understanding the Differences Apr 11, 2025 am 12:15 AM

PHP and Python each have their own advantages, and the choice should be based on project requirements. 1.PHP is suitable for web development, with simple syntax and high execution efficiency. 2. Python is suitable for data science and machine learning, with concise syntax and rich libraries.

The relationship between Bootstrap Table garbled and page encoding The relationship between Bootstrap Table garbled and page encoding Apr 07, 2025 pm 12:03 PM

Bootstrap Table garbled is usually because the page encoding is inconsistent with the table data encoding. To solve this problem, you need to make sure they are consistent. The specific steps include: checking page and table data encoding, setting page encoding, and verifying the encoding. If UTF-8 is used, the server should also support it. If it cannot be resolved, try using the JavaScript encoding library.

How to view database password in Navicat for MariaDB? How to view database password in Navicat for MariaDB? Apr 08, 2025 pm 09:18 PM

Navicat for MariaDB cannot view the database password directly because the password is stored in encrypted form. To ensure the database security, there are three ways to reset your password: reset your password through Navicat and set a complex password. View the configuration file (not recommended, high risk). Use system command line tools (not recommended, you need to be proficient in command line tools).

See all articles