Summary of PHP crawler technology knowledge points
There are many crawler frameworks. The more popular ones are based on Python, Node.js, Java, C#, and PHP, with Python-based crawlers being the most popular. There are also ready-made point-and-click tools that need no programming, such as Octopus and Locomotive.
Today we implement a crawler in PHP. We first practice without a crawler framework in order to understand how crawlers work, and then practice with PHP's libraries, frameworks, and extensions.
## 1. A simple PHP crawler: prototype
How a crawler works:
- Start from a given seed URL;
- Fetch the page behind the link and extract content from it according to the configured regular expressions;
- Some crawlers then update the seed URL with newly discovered links, analyze those links for specific content, and repeat the cycle;
- Save the extracted content in a database (MySQL) or a local file.
The following example, found on the Internet, is listed here for analysis. Start reading from the main function.
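As a stand-in for the principle described above (this is not the listing the text refers to), here is a minimal sketch of a crawl loop with a main function as its entry point. The names extractLinks, crawl, and main, the link regular expression, and the in-memory three-page "site" are all assumptions made for illustration. The fetch step is passed in as a callable so the sketch runs without network access.

```php
<?php
// Illustrative sketch only: function names and the in-memory "site" are assumptions.

/** Link analysis: pull absolute http(s) links out of an HTML string with a regular expression. */
function extractLinks(string $html): array
{
    preg_match_all('/<a\s[^>]*href=["\'](https?:\/\/[^"\'#\s]+)/i', $html, $matches);
    return array_values(array_unique($matches[1]));
}

/**
 * Start from $seed, then keep feeding newly found links back into the queue.
 * $fetch is any callable mapping a URL to its HTML, or null on failure.
 */
function crawl(string $seed, callable $fetch, int $maxPages = 50): array
{
    $queue   = [$seed];
    $seen    = [$seed => true];   // URLs already queued, to avoid loops
    $visited = [];

    while ($queue && count($visited) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue;             // unreachable page: skip it
        }
        $visited[] = $url;
        foreach (extractLinks($html) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    return $visited;
}

function main(): void
{
    // A fake three-page site so the sketch runs offline.
    $pages = [
        'http://www.example.com/'  => '<a href="http://www.example.com/a">A</a> <a href="http://www.example.com/b">B</a>',
        'http://www.example.com/a' => '<a href="http://www.example.com/b">B</a> <a href="http://www.example.com/">Home</a>',
        'http://www.example.com/b' => '<a href="http://www.example.com/missing">Broken</a>',
    ];
    $fetch = fn(string $url): ?string => $pages[$url] ?? null;

    $urls = crawl('http://www.example.com/', $fetch);

    // Save the result to a local file.
    $out = sys_get_temp_dir() . '/crawled_urls.txt';
    file_put_contents($out, implode(PHP_EOL, $urls) . PHP_EOL);
    echo count($urls), " URLs saved to $out", PHP_EOL;
}

main();
```

To crawl real pages, the same crawl function could be given a callable built on file_get_contents or cURL instead of the in-memory array.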
## 2. Using the cURL library
cURL is a relatively mature library that handles errors, HTTP headers, POST requests, and similar concerns well. Just as importantly, writing the crawler in PHP makes it easy to store the results in MySQL. For details on cURL, see the official PHP documentation. Issuing parallel requests with cURL (curl_multi), however, is more cumbersome.
Enabling cURL
On Windows:
- In php.ini, uncomment the following line by removing the leading semicolon:
extension=php_curl.dll
- Copy libeay32.dll, ssleay32.dll, libssh2.dll, and the php_curl file under php/ext to windows/system32.
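One way to confirm that the extension is actually active after editing php.ini is to ask PHP directly. The helper name curlStatus below is an assumption made for illustration; extension_loaded() and curl_version() are standard PHP functions.

```php
<?php
/** Report whether the cURL extension is loaded (curlStatus is a hypothetical helper name). */
function curlStatus(): string
{
    if (!extension_loaded('curl')) {
        return 'cURL extension is NOT loaded; check the extension line in php.ini';
    }
    $info = curl_version();   // array describing the linked libcurl build
    return 'cURL ' . $info['version'] . ' loaded, SSL library: ' . $info['ssl_version'];
}

echo curlStatus(), PHP_EOL;
```

Running `php -m` on the command line and looking for curl in the module list gives the same information.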
Steps for crawling with cURL:
- First, call curl_init() to initialize a cURL session;
- Then set all the options you need with curl_setopt();
- Then call curl_exec() to execute the session;
- When the session is finished, call curl_close() to close it.
Example
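<?php
// Fetch a page and write the response body to a local file.
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);    // write the body to $fp instead of printing it
curl_setopt($ch, CURLOPT_HEADER, 0);    // leave the response headers out of the output

curl_exec($ch);
curl_close($ch);
fclose($fp);
?>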
A more complete example:
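To support HTTPS, add the following settings to the _getUrlContent function:
// HTTP Basic authentication; only needed if the site asks for a username and password.
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, "username:password");
// Forces SSLv3, which many servers no longer accept; drop this line to let cURL negotiate the protocol version.
curl_setopt($ch, CURLOPT_SSLVERSION, 3);
// Disables certificate verification; acceptable for a quick test only, since it removes protection against man-in-the-middle attacks.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
// Check that the host name in the certificate matches the requested host.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);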
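The _getUrlContent function itself is not shown here, so the following is only a guess at what such a helper might look like, not the author's code. The name getUrlContent, the timeout, the redirect limit, and the user-agent string are assumptions. Unlike the settings above, this sketch leaves certificate verification switched on and lets cURL pick the TLS version.

```php
<?php
/**
 * Hypothetical fetch helper: returns the response body, or null on any failure.
 * Not the article's _getUrlContent; all option values here are assumptions.
 */
function getUrlContent(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects...
        CURLOPT_MAXREDIRS      => 5,      // ...but not forever
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_USERAGENT      => 'demo-crawler/0.1',
        CURLOPT_SSL_VERIFYPEER => true,   // verify the server certificate
        CURLOPT_SSL_VERIFYHOST => 2,      // and that it matches the host name
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // Transport-level failure: DNS, connection, TLS, unsupported scheme, timeout...
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
        $body = null;
    } elseif (curl_getinfo($ch, CURLINFO_HTTP_CODE) >= 400) {
        $body = null;                     // treat HTTP errors as "no content"
    }
    curl_close($ch);

    return $body;
}
```

A call such as `getUrlContent('https://www.example.com/')` would then return the page as a string, or null if the request failed.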
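A puzzling result:
The results from parts 1 and 2 differ greatly: part 1 yields more than four thousand URLs, while part 2 consistently yields only 45.
In addition, the URLs we collect may contain duplicates. The code that handles this is on my GitHub, in demo2-01.php or demo2-02.php.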
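The demo files mentioned above are not reproduced here. As a rough idea of how duplicate URLs could be removed, the sketch below normalizes each URL and uses array keys as a set. The function names and the normalization rules (lower-case scheme and host, empty path treated as "/", fragment dropped) are assumptions, not the author's implementation.

```php
<?php
/** Normalize a URL so that trivially different spellings compare equal (assumed rules). */
function normalizeUrl(string $url): string
{
    $url   = trim($url);
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return $url;                        // not an absolute URL: leave it alone
    }
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host   = strtolower($parts['host']);
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path   = $parts['path'] ?? '/';        // "http://host" and "http://host/" are the same page
    $query  = isset($parts['query']) ? '?' . $parts['query'] : '';

    return "$scheme://$host$port$path$query";   // the #fragment is dropped on purpose
}

/** Remove duplicates while keeping first-seen order. */
function dedupeUrls(array $urls): array
{
    $seen = [];
    foreach ($urls as $url) {
        $seen[normalizeUrl($url)] = true;   // array keys are unique, so repeats collapse
    }
    return array_keys($seen);
}

print_r(dedupeUrls([
    'http://Example.com',
    'http://example.com/',
    'http://example.com/#top',
    'http://example.com/a?x=1',
]));
```

When results are stored in MySQL, a unique index on the URL column is another common way to get the same effect.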
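## 3. file_get_contents / stream_get_contents compared with cURL
### 3.1 file_get_contents vs. stream_get_contents
stream_get_contents: reads the remainder of a stream into a string
It is the same as file_get_contents(), except that stream_get_contents() operates on a stream resource that is already open and returns its contents as a string.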
$handle = fopen($url, "r");
// Read the rest of the stream into a string. The second argument is the maximum number of bytes to read; -1 (or null as of PHP 8.0) reads all remaining data.
$content = stream_get_contents($handle, -1);
fclose($handle);
file_get_contents: reads an entire file into a string
// The maximum length is the fifth argument; the second one is use_include_path, so passing 1024 * 1024 there would not limit the read.
$content = file_get_contents($url, false, null, 0, 1024 * 1024);
[Note] To open a URL that contains special characters (such as spaces), encode it with urlencode() first.
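To make the difference concrete, the following sketch reads the same data both ways with a byte limit and then builds an encoded URL. It uses a local temporary file rather than a remote URL so that it runs offline. The file contents and the example URL are made up for illustration. It uses rawurlencode() for the path segment and http_build_query() for the query string, which is one common way to apply URL encoding and not the only one.

```php
<?php
// Demo on a local temp file so that no network access is needed.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, 'hello crawler');

// file_get_contents: opens, reads, and closes in a single call.
// Arguments: filename, use_include_path, context, offset, length.
$viaFile = file_get_contents($path, false, null, 0, 5);

// stream_get_contents: works on a handle that is already open.
$handle    = fopen($path, 'r');
$viaStream = stream_get_contents($handle, 5);   // at most 5 bytes
fclose($handle);
unlink($path);

// Encode only the parts of a URL that may contain special characters.
$url = 'http://www.example.com/search/' . rawurlencode('php crawler')
     . '?' . http_build_query(['q' => 'a b&c']);

echo $viaFile, PHP_EOL;     // hello
echo $viaStream, PHP_EOL;   // hello
echo $url, PHP_EOL;         // http://www.example.com/search/php%20crawler?q=a+b%26c
```

Encoding a whole URL with urlencode() would also escape the "://" and "/" characters, which is why only the individual components are encoded here.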
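### 3.2 file_get_contents / stream_get_contents compared with cURL
- fopen / file_get_contents performs a new DNS lookup for every request and does not cache DNS information, whereas cURL caches it automatically. Requests for pages or images under the same domain then need only one DNS lookup, which greatly reduces the number of lookups, so cURL performs considerably better than fopen / file_get_contents.
- For HTTP requests, fopen / file_get_contents uses http_fopen_wrapper, which does not keep connections alive, while cURL can. cURL is therefore more efficient when requesting many links in succession.
- fopen / file_get_contents is affected by the allow_url_fopen setting in php.ini: if it is turned off, these functions can no longer open URLs. cURL is not affected by this setting.
- cURL can simulate many kinds of requests, such as POSTing data or submitting forms, and lets users tailor requests to their needs. By default, fopen / file_get_contents only issues GET requests; other methods require a stream context created with stream_context_create(), which is less convenient.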
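The DNS-cache and keep-alive advantages mentioned above only come into play when the same cURL handle is reused across requests. The sketch below shows one way to do that, including an optional POST. The function name fetchMany and the request array layout ('url' plus optional 'post') are assumptions made for illustration.

```php
<?php
/**
 * Send several requests over ONE cURL handle so that libcurl can reuse
 * its DNS cache and open connections. fetchMany is a hypothetical helper.
 *
 * Each request is ['url' => string, 'post' => array (optional form fields)].
 * Returns [url => body or null on failure].
 */
function fetchMany(array $requests, int $timeout = 10): array
{
    $ch      = curl_init();
    $results = [];

    foreach ($requests as $request) {
        // curl_reset() clears the options but keeps live connections and the DNS cache.
        curl_reset($ch);
        curl_setopt_array($ch, [
            CURLOPT_URL            => $request['url'],
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        if (!empty($request['post'])) {
            // Form submission: send the fields as an urlencoded POST body.
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($request['post']));
        }
        $body = curl_exec($ch);
        $results[$request['url']] = ($body === false) ? null : $body;
    }
    curl_close($ch);

    return $results;
}

// Example call (needs network access, so it is left commented out):
// $pages = fetchMany([
//     ['url' => 'http://www.example.com/'],
//     ['url' => 'http://www.example.com/login', 'post' => ['user' => 'demo']],
// ]);
```

Creating a fresh handle with curl_init() for every URL would still work, but each request would then start with an empty DNS cache and a new connection.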
The above is the detailed content of Summary of PHP crawler technology knowledge points. For more information, please follow other related articles on the PHP Chinese website!
