There are many crawler frameworks. The popular ones are built on Python, Node.js, Java, C#, and PHP, with Python-based crawlers being the most widely used. There are also foolproof point-and-click tools that require no programming, such as Octopus and Locomotive.
Today we will implement a crawler in PHP. We first practice without a crawler framework to understand how a crawler works, and then practice with PHP's libraries, frameworks, and extensions.
How a crawler works:
Start from a given seed URL;
Fetch the page behind the link and extract content from it using the configured regular expressions;
Some crawlers add the newly found links to the URL list before continuing; those links are then analyzed for specific content, and the cycle repeats;
Save the extracted content in a database (MySQL) or a local file.
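The cycle described above can be sketched as a queue plus a set of visited URLs. This is only a minimal illustration under stated assumptions: `crawlLoop`, `$fetch`, `$extractLinks`, and the `a.test` pages are hypothetical names made up for this sketch, and the "site" is an in-memory array so that no real HTTP request is made.

```php
<?php
/**
 * Breadth-first crawl loop. $fetch and $extractLinks are hypothetical
 * callables supplied by the caller; they are not part of any library.
 *
 * @param string   $startUrl     seed URL
 * @param callable $fetch        fn(string $url): string|false, returns page content
 * @param callable $extractLinks fn(string $content): string[], returns absolute URLs
 * @param int      $maxPages     safety limit so the loop always terminates
 * @return string[] URLs in the order they were visited
 */
function crawlLoop(string $startUrl, callable $fetch, callable $extractLinks, int $maxPages = 100): array
{
    $queue   = [$startUrl]; // URLs waiting to be fetched
    $visited = [];          // URL => true, used as a set

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;       // already processed
        }
        $visited[$url] = true;

        $content = $fetch($url);
        if ($content === false) {
            continue;       // fetch failed, move on to the next URL
        }
        // Saving $content to MySQL or a local file would happen here.
        foreach ($extractLinks($content) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

// Usage with an in-memory "site" instead of real HTTP requests.
$pages = [
    'http://a.test/'  => 'http://a.test/1 http://a.test/2',
    'http://a.test/1' => 'http://a.test/2 http://a.test/',
    'http://a.test/2' => '',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};
$extractLinks = function (string $content): array {
    return preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY);
};
$order = crawlLoop('http://a.test/', $fetch, $extractLinks);
print_r($order);
```

Passing the fetcher and the link extractor in as callables keeps the loop itself independent of whether pages are read with fopen or cURL.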
The following example comes from the Internet. Let's list it here and analyze it.
Start from the main function.
<?php
/**
 * Crawler prototype.
 * Fetch the HTML content of the given URL.
 * @param string $url
 * @return string|false
 */
function _getUrlContent($url) {
    $handle = fopen($url, "r");
    if ($handle) {
        // Read the stream into a string. The second parameter is the maximum
        // number of bytes to read; -1 reads all remaining data.
        $content = stream_get_contents($handle, -1);
        // $content = file_get_contents($url, false, null, 0, 1024 * 1024);
        fclose($handle);
        return $content;
    } else {
        return false;
    }
}

/**
 * Extract links from the HTML content.
 * @param string $web_content
 * @return array
 */
function _filterUrl($web_content) {
    // [aA] replaces the original [a|A], which also matched a literal "|"
    $reg_tag_a = '/<[aA].*?href=[\'\"]{0,1}([^>\'\"\ ]*).*?>/';
    $result = preg_match_all($reg_tag_a, $web_content, $match_result);
    if ($result) {
        return $match_result[1];
    }
    return array();
}

/**
 * Fix up relative paths.
 * @param string $base_url
 * @param array $url_list
 * @return array
 */
function _reviseUrl($base_url, $url_list) {
    $url_info = parse_url($base_url); // split the URL into its components
    $base_url = $url_info["scheme"] . '://';
    if (!empty($url_info["user"]) && !empty($url_info["pass"])) {
        $base_url .= $url_info["user"] . ":" . $url_info["pass"] . "@";
    }
    $base_url .= $url_info["host"];
    if (!empty($url_info["port"])) {
        $base_url .= ":" . $url_info["port"];
    }
    if (!empty($url_info["path"])) {
        $base_url .= $url_info["path"];
    }
    $result = array();
    if (is_array($url_list)) {
        foreach ($url_list as $url_item) {
            if (preg_match('/^http/', $url_item)) {
                // already a complete URL
                $result[] = $url_item;
            } else {
                // incomplete URL: prefix it with the base URL
                $real_url = $base_url . '/' . $url_item;
                $result[] = $real_url;
            }
        }
    }
    return $result;
}

/**
 * Crawler.
 * @param string $url
 * @return array
 */
function crawler($url) {
    $content = _getUrlContent($url);
    if ($content) {
        $url_list = _reviseUrl($url, _filterUrl($content));
        if ($url_list) {
            return $url_list;
        }
    }
    return array();
}

/**
 * Main program for testing.
 */
function main() {
    $file_path = "url-01.txt";
    $current_url = "http://www.baidu.com/"; // seed URL
    if (file_exists($file_path)) {
        unlink($file_path);
    }
    $fp_puts = fopen($file_path, "ab"); // appends discovered URLs to the list
    $fp_gets = fopen($file_path, "r");  // reads the next URL back from the list
    do {
        // trim() removes the "\r\n" that fgets() leaves at the end of each line
        $result_url_arr = crawler(trim($current_url));
        if ($result_url_arr) {
            foreach ($result_url_arr as $url) {
                fputs($fp_puts, $url . "\r\n");
            }
        }
    } while ($current_url = fgets($fp_gets, 1024)); // keep fetching URLs
    fclose($fp_gets);
    fclose($fp_puts);
}

main();
?>
cURL is a relatively mature library that handles errors, HTTP headers, POST requests, and so on well. Just as importantly, working in PHP makes it convenient to store the crawled data in MySQL. For details on cURL, see the official PHP documentation. Multi-threaded use of cURL (curl_multi), however, is more cumbersome.
Enabling cURL
On Windows:
- In php.ini, uncomment the following line (remove the leading semicolon):
extension=php_curl.dll
- Copy libeay32.dll, ssleay32.dll, libssh2.dll, and the php_curl file under php/ext to windows/system32
Steps for using cURL in a crawler:
- The basic idea of using the cURL functions is to first initialize a cURL session with curl_init();
- Then you can set all the options you need through curl_setopt();
- Then use curl_exec() to execute the session;
- When the session is finished, use curl_close() to close the session.
Example
<?php
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);

curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
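The example above writes the response straight to a file. A crawler usually wants the body as a string and needs to know when a request failed, so the same four steps can be wrapped as in the sketch below. `fetchWithCurl` is a hypothetical helper name, and the timeout and redirect limits are arbitrary assumptions; all options used are standard cURL constants.

```php
<?php
/**
 * Fetch a URL with cURL and return the body as a string.
 * fetchWithCurl() is a hypothetical helper; the limits below are arbitrary.
 *
 * @return array with keys 'body' (string|null), 'status' (int), 'error' (string)
 */
function fetchWithCurl(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init();                    // 1. initialize the session
    curl_setopt_array($ch, [              // 2. set the options
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_HEADER         => false,  // keep response headers out of the body
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body   = curl_exec($ch);             // 3. execute the session
    $error  = curl_error($ch);            // empty string when no error occurred
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);                      // 4. close the session

    return [
        'body'   => $body === false ? null : $body,
        'status' => $status,
        'error'  => $error,
    ];
}

// Usage (commented out so the sketch makes no network request when loaded):
// $r = fetchWithCurl('http://www.example.com/');
// if ($r['body'] !== null && $r['status'] === 200) {
//     echo strlen($r['body']), " bytes\n";
// } else {
//     echo "failed: ", $r['error'], "\n";
// }
```

Returning the status code and error text alongside the body lets the caller decide which failures to skip and which to retry.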
A more complete example:
<?php
/**
 * demo1-01 converted to a cURL-based crawler.
 * Crawler prototype.
 * Fetch the HTML content of the given URL.
 * @param string $url
 * @return string|false
 */
function _getUrlContent($url) {
    $ch = curl_init(); // initialize a cURL session
    /* curl_setopt sets one option for the cURL transfer */
    // URL to fetch
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    // Browser-like request headers.
    // Note: the Host header is hard-coded, so it is only correct for www.baidu.com.
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        "Host: www.baidu.com",
        "Connection: keep-alive",
        "Accept: text/html, application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Upgrade-Insecure-Requests: 1",
        "DNT:1",
        "Accept-Language: zh-CN,zh;q=0.8,en-GB;q=0.6,en;q=0.4,en-US;q=0.2",
    ));
    $result = curl_exec($ch); // execute the cURL session
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // last HTTP status code received
    curl_close($ch); // close the session before returning (the original closed it after the return)
    if ($code != '404' && $result) {
        return $result;
    }
    return false;
}

/**
 * Extract links from the HTML content.
 * @param string $web_content
 * @return array
 */
function _filterUrl($web_content) {
    // [aA] replaces the original [a|A], which also matched a literal "|"
    $reg_tag_a = '/<[aA].*?href=[\'\"]{0,1}([^>\'\"\ ]*).*?>/';
    $result = preg_match_all($reg_tag_a, $web_content, $match_result);
    if ($result) {
        return $match_result[1];
    }
    return array();
}

/**
 * Fix up relative paths.
 * @param string $base_url
 * @param array $url_list
 * @return array
 */
function _reviseUrl($base_url, $url_list) {
    $url_info = parse_url($base_url); // split the URL into its components
    $base_url = $url_info["scheme"] . '://';
    if (!empty($url_info["user"]) && !empty($url_info["pass"])) {
        $base_url .= $url_info["user"] . ":" . $url_info["pass"] . "@";
    }
    $base_url .= $url_info["host"];
    if (!empty($url_info["port"])) {
        $base_url .= ":" . $url_info["port"];
    }
    if (!empty($url_info["path"])) {
        $base_url .= $url_info["path"];
    }
    $result = array();
    if (is_array($url_list)) {
        foreach ($url_list as $url_item) {
            if (preg_match('/^http/', $url_item)) {
                // already a complete URL
                $result[] = $url_item;
            } else {
                // incomplete URL: prefix it with the base URL
                $real_url = $base_url . '/' . $url_item;
                $result[] = $real_url;
            }
        }
    }
    return $result;
}

/**
 * Crawler.
 * @param string $url
 * @return array
 */
function crawler($url) {
    $content = _getUrlContent($url);
    if ($content) {
        $url_list = _reviseUrl($url, _filterUrl($content));
        if ($url_list) {
            return $url_list;
        }
    }
    return array();
}

/**
 * Main program for testing.
 */
function main() {
    $file_path = "./url-03.txt";
    if (file_exists($file_path)) {
        unlink($file_path);
    }
    $current_url = "http://www.baidu.com"; // seed URL
    // Records the URL list. "ab": open a binary file for appending; data is written at the end of the file
    $fp_puts = fopen($file_path, "ab");
    // Reads the URL list back. "r": open read-only with the file pointer at the start of the file
    $fp_gets = fopen($file_path, "r");
    do {
        // trim() removes the "\r\n" that fgets() leaves at the end of each line
        $current_url = trim($current_url);
        $result_url_arr = crawler($current_url);
        echo "<p>$current_url</p>";
        if ($result_url_arr) {
            foreach ($result_url_arr as $url) {
                fputs($fp_puts, $url . "\r\n");
            }
        }
    } while ($current_url = fgets($fp_gets, 1024)); // keep fetching URLs
    fclose($fp_gets);
    fclose($fp_puts);
}

main();
?>
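Both listings resolve relative links in _reviseUrl by appending the link to the full base path, so a link such as `../up.html` or `/top.html` found on `http://example.com/dir/page.html` produces a URL that does not exist. The sketch below shows one possible way to resolve links more carefully. `resolveUrl` is a hypothetical helper, not part of the original code, and it is a simplification of the RFC 3986 rules: query-only links such as `?a=1` and user:password parts in the base URL are not handled.

```php
<?php
/**
 * Resolve $link against $base (simplified; see the assumptions in the text).
 * Returns null for links a crawler cannot fetch, such as "#top" or "mailto:".
 */
function resolveUrl(string $base, string $link): ?string
{
    $link = trim($link);
    if ($link === '' || $link[0] === '#') {
        return null;                                  // nothing to fetch
    }
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $link, $m)) {
        // The link has its own scheme: keep http(s), drop mailto:, javascript:, etc.
        return in_array(strtolower($m[1]), ['http', 'https'], true) ? $link : null;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return null;                                  // the base itself is unusable
    }
    if (strpos($link, '//') === 0) {
        return $b['scheme'] . ':' . $link;            // protocol-relative link
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    if ($link[0] === '/') {
        $path = $link;                                // root-relative link
    } else {
        // Relative to the directory of the base path: drop the last segment.
        $dir  = preg_replace('#/[^/]*$#', '/', $b['path'] ?? '/');
        $path = $dir . $link;
    }
    // Keep the query/fragment aside while collapsing "." and ".." segments.
    $cut  = strcspn($path, '?#');
    $tail = (string) substr($path, $cut);
    $path = substr($path, 0, $cut);
    if (preg_match('#/\.\.?$#', $path)) {
        $path .= '/';                                 // "/a/.." resolves to a directory
    }
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);                      // never climb above the root
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $root . implode('/', $out) . $tail;
}

echo resolveUrl('http://example.com/dir/page.html', '../up.html'), "\n";
// prints http://example.com/up.html
```

A function like this could replace the `else` branch inside _reviseUrl, with null results skipped instead of being written to the URL list.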
To support HTTPS, add the following settings to the _getUrlContent function:
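// Only needed when the site requires HTTP basic authentication
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, "username:password");
// The original forced SSLv3 here, which most servers no longer accept;
// leaving the option unset lets cURL negotiate the protocol version.
// curl_setopt($ch, CURLOPT_SSLVERSION, 3);
// Disables certificate verification; acceptable for testing only
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);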
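A puzzling result:
The results from part 1 and part 2 differ greatly: part 1 yields more than four thousand URLs, while part 2 always yields 45.
The URLs we collect may also contain duplicates. The code that handles this is on my GitHub, in demo2-01.php or demo2-02.php.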
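The author's demo2-01.php is not reproduced here. As a rough illustration only, duplicates can be removed by keeping a set of URLs that have already been seen. `normalizeUrl` and `dedupeUrls` are hypothetical helpers, and the normalization policy (trim whitespace, drop the `#fragment`, lowercase the scheme and host) is an assumption of this sketch. It also assumes the URLs carry no user:password part.

```php
<?php
/**
 * Light normalization so that trivially different spellings compare equal.
 * Assumed policy: trim, drop "#fragment", lowercase "scheme://host".
 */
function normalizeUrl(string $url): string
{
    $url = trim($url);
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $url = substr($url, 0, $hashPos);
    }
    // Lowercase only the "scheme://host" prefix; paths can be case-sensitive.
    return preg_replace_callback('#^[a-z][a-z0-9+.\-]*://[^/?]*#i', function ($m) {
        return strtolower($m[0]);
    }, $url);
}

/**
 * Remove duplicate URLs while keeping first-seen order.
 * @param string[] $urls
 * @return string[] normalized, unique URLs
 */
function dedupeUrls(array $urls): array
{
    $seen = [];   // normalized URL => true, used as a set
    $out  = [];
    foreach ($urls as $url) {
        $key = normalizeUrl($url);
        if ($key === '' || isset($seen[$key])) {
            continue;
        }
        $seen[$key] = true;
        $out[] = $key;
    }
    return $out;
}

$unique = dedupeUrls([
    "http://A.test/x\r\n",
    "http://a.test/x#top",
    "HTTP://a.test/X",
    "",
    "http://a.test/x",
]);
print_r($unique);
```

In the crawler above, the same set could be consulted before each fputs() call so that a URL is written to the list file only once.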
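stream_get_contents: reads the remainder of a stream into a string
It behaves like file_get_contents(), except that stream_get_contents() operates on a stream resource that is already open and returns the remaining contents as a string.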
$handle = fopen($url, "r");
$content = stream_get_contents($handle, -1);
// Reads the stream into a string. The second parameter is the maximum number of bytes to read; -1 reads all remaining data.
file_get_contents: reads an entire file into a string
// The maximum length is the fifth parameter; the second one is use_include_path.
$content = file_get_contents($url, false, null, 0, 1024 * 1024);
Note: to open a URL that contains special characters (such as spaces), encode it first with urlencode().
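The difference between the two functions is easiest to see on a local file. The sketch below is only an illustration: the temporary file and its contents are made up for the example, and no network access is involved.

```php
<?php
// Write a small local file so the example needs no network access.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, 'Hello, crawler!');

// file_get_contents() takes a path or URL and opens it itself.
$whole = file_get_contents($path);

// stream_get_contents() needs a handle that is already open and reads
// from the current position of that handle.
$handle    = fopen($path, 'r');
$firstFive = stream_get_contents($handle, 5); // at most 5 bytes
$rest      = stream_get_contents($handle);    // everything that remains
fclose($handle);
unlink($path);

echo $whole, "\n";      // Hello, crawler!
echo $firstFive, "\n";  // Hello
echo $rest, "\n";       // , crawler!
```

Because stream_get_contents() continues from wherever the handle currently points, the second call returns only what the first call did not consume.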
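- fopen / file_get_contents perform a new DNS lookup for every request and do not cache DNS information, whereas cURL caches DNS information automatically. Requests for pages or images under the same domain then need only one DNS lookup, which greatly reduces the number of lookups. As a result, cURL performs much better than fopen / file_get_contents.
- For HTTP requests, fopen / file_get_contents use http_fopen_wrapper, which does not keep connections alive, whereas cURL can. cURL is therefore more efficient when many links are requested in a row.
- fopen / file_get_contents are affected by the allow_url_fopen option in php.ini. If that option is turned off, these functions can no longer open URLs. cURL is not affected by that option.
- cURL can simulate many kinds of requests, such as POSTing data and submitting forms, and users can customize requests to their needs. fopen / file_get_contents issue plain GET requests unless a custom stream context is supplied.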
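Following the comparison above, a crawler can check at startup which of the two approaches is available in the current PHP configuration. This is only a sketch: `pickTransport` is a hypothetical helper, and its two optional arguments exist purely so the decision can be exercised without editing php.ini.

```php
<?php
/**
 * Decide which transport a crawler can use in the current PHP setup.
 * pickTransport() is a hypothetical helper name.
 *
 * @param bool|null $hasCurl  override for "is the cURL extension loaded?"
 * @param bool|null $urlFopen override for "is allow_url_fopen enabled?"
 * @return string 'curl', 'stream', or 'none'
 */
function pickTransport(?bool $hasCurl = null, ?bool $urlFopen = null): string
{
    $hasCurl = $hasCurl ?? function_exists('curl_init');
    // ini_get() returns a string such as "1", "" or "On"; convert it to a bool.
    $urlFopen = $urlFopen ?? filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);

    if ($hasCurl) {
        return 'curl';    // preferred: DNS caching, keep-alive, custom requests
    }
    if ($urlFopen) {
        return 'stream';  // fopen() / file_get_contents() can open URLs
    }
    return 'none';        // neither approach can make HTTP requests
}

echo pickTransport(), "\n";
```

Preferring cURL whenever it is loaded reflects the performance points listed above; the stream functions remain a fallback for installations without the extension.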