Home php教程 php手册 php结合curl实现多线程抓取

php结合curl实现多线程抓取

Jun 06, 2016 pm 07:51 PM
php

PHP利用 Curl可以完成各种传送文件操作,比如模拟浏览器发送GET,POST请求等等,然而因为php语言本身不支持多线程,所以开发爬虫程序效率并不高,因此经常需要借

php结合curl实现多线程抓取

$url){ $conn[$k]=curl_init($url); curl_setopt($conn[$k], CURLOPT_TIMEOUT, $timeout);//设置超时时间 curl_setopt($conn[$k], CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MSIE 5.01; Windows NT 5.0)'); curl_setopt($conn[$k], CURLOPT_MAXREDIRS, 7);//HTTp定向级别 curl_setopt($conn[$k], CURLOPT_HEADER, 0);//这里不要header,加块效率 curl_setopt($conn[$k], CURLOPT_FOLLOWLOCATION, 1); // 302 redirect curl_setopt($conn[$k],CURLOPT_RETURNTRANSFER,1); curl_multi_add_handle ($mh,$conn[$k]); } //防止死循环耗死cpu 这段是根据网上的写法 do { $mrc = curl_multi_exec($mh,$active);//当无数据,active=true } while ($mrc == CURLM_CALL_MULTI_PERFORM);//当正在接受数据时 while ($active and $mrc == CURLM_OK) {//当无数据时或请求暂停时,active=true if (curl_multi_select($mh) != -1) { do { $mrc = curl_multi_exec($mh, $active); } while ($mrc == CURLM_CALL_MULTI_PERFORM); } } foreach ($array as $k => $url) { curl_error($conn[$k]); $res[$k]=curl_multi_getcontent($conn[$k]);//获得返回信息 $header[$k]=curl_getinfo($conn[$k]);//返回头信息 curl_close($conn[$k]);//关闭语柄 curl_multi_remove_handle($mh , $conn[$k]); //释放资源 } curl_multi_close($mh); $endtime = getmicrotime(); $diff_time = $endtime - $startime; return array('diff_time'=>$diff_time, 'return'=>$res, 'header'=>$header ); } //计算当前时间 function getmicrotime() { list($usec, $sec) = explode(" ",microtime()); return ((float)$usec + (float)$sec); } //测试一下,curl 三个网址 $array = array( "http://www.weibo.com/", "http://www.renren.com/", "http://www.qq.com/" ); $data = Curl_http($array,'10');//调用 var_dump($data);//输出 //如果POST的数据大于1024字节,curl并不会直接就发起POST请求 //发送请求时,header中包含一个空的Expect。curl_setopt($ch, CURLOPT_HTTPHEADER, array("Expect:")); ?>

我们再来看几个例子

(1)下面这段代码是实现抓取多个URL,,然后将抓取的URL的页面代码写入指定的文件

$urls = array( 'http://www.jb51.net/', 'http://www.google.com/', 'http://www.example.com/' ); // 设置要抓取的页面URL $save_to='/test.txt'; // 把抓取的代码写入该文件 $st = fopen($save_to,"a"); $mh = curl_multi_init(); foreach ($urls as $i => $url) { $conn[$i] = curl_init($url); curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"); curl_setopt($conn[$i], CURLOPT_HEADER ,0); curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT,60); curl_setopt($conn[$i], CURLOPT_FILE,$st); // 将爬取的代码写入文件 curl_multi_add_handle ($mh,$conn[$i]); } // 初始化 do { curl_multi_exec($mh,$active); } while ($active); // 执行 foreach ($urls as $i => $url) { curl_multi_remove_handle($mh,$conn[$i]); curl_close($conn[$i]); } // 结束清理 curl_multi_close($mh); fclose($st);

(2)下面这段代码和上面差不多意思,只不过这个地方是将获得的代码先放入变量,然后再将获取到的内容写入指定的文件

$urls = array( 'http://www.jb51.net/', 'http://www.google.com/', 'http://www.example.com/' ); $save_to='/test.txt'; // 把抓取的代码写入该文件 $st = fopen($save_to,"a"); $mh = curl_multi_init(); foreach ($urls as $i => $url) { $conn[$i] = curl_init($url); curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"); curl_setopt($conn[$i], CURLOPT_HEADER ,0); curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT,60); curl_setopt($conn[$i],CURLOPT_RETURNTRANSFER,true); // 不将爬取代码写到浏览器,而是转化为字符串 curl_multi_add_handle ($mh,$conn[$i]); } do { curl_multi_exec($mh,$active); } while ($active); foreach ($urls as $i => $url) { $data = curl_multi_getcontent($conn[$i]); // 获得爬取的代码字符串 fwrite($st,$data); // 将字符串写入文件 } // 获得数据变量,并写入文件 foreach ($urls as $i => $url) { curl_multi_remove_handle($mh,$conn[$i]); curl_close($conn[$i]); } curl_multi_close($mh); fclose($st);

(3)下面这段代码实现的是利用 PHP 的 Curl Functions 实现并发多线程下载文件

$urls=array( 'http://www.jb51.net/5w.zip', 'http://www.jb51.net/5w.zip', 'http://www.jb51.net/5w.zip' ); $save_to='./home/'; $mh=curl_multi_init(); foreach($urls as $i=>$url){ $g=$save_to.basename($url); if(!is_file($g)){ $conn[$i]=curl_init($url); $fp[$i]=fopen($g,"w"); curl_setopt($conn[$i],CURLOPT_USERAGENT,"Mozilla/4.0(compatible; MSIE 7.0; Windows NT 6.0)"); curl_setopt($conn[$i],CURLOPT_FILE,$fp[$i]); curl_setopt($conn[$i],CURLOPT_HEADER ,0); curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60); curl_multi_add_handle($mh,$conn[$i]); } } do{ $n=curl_multi_exec($mh,$active); }while($active); foreach($urls as $i=>$url){ curl_multi_remove_handle($mh,$conn[$i]); curl_close($conn[$i]); fclose($fp[$i]); } curl_multi_close($mh);$urls=array( 'http://www.jb51.net/5w.zip', 'http://www.jb51.net/5w.zip', 'http://www.jb51.net/5w.zip' ); $save_to='./home/'; $mh=curl_multi_init(); foreach($urls as $i=>$url){ $g=$save_to.basename($url); if(!is_file($g)){ $conn[$i]=curl_init($url); $fp[$i]=fopen($g,"w"); curl_setopt($conn[$i],CURLOPT_USERAGENT,"Mozilla/4.0(compatible; MSIE 7.0; Windows NT 6.0)"); curl_setopt($conn[$i],CURLOPT_FILE,$fp[$i]); curl_setopt($conn[$i],CURLOPT_HEADER ,0); curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60); curl_multi_add_handle($mh,$conn[$i]); } } do{ $n=curl_multi_exec($mh,$active); }while($active); foreach($urls as $i=>$url){ curl_multi_remove_handle($mh,$conn[$i]); curl_close($conn[$i]); fclose($fp[$i]); } curl_multi_close($mh);

以上所述就是本文的全部内容了,希望大家能够喜欢。

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

CakePHP Project Configuration CakePHP Project Configuration Sep 10, 2024 pm 05:25 PM

In this chapter, we will understand the Environment Variables, General Configuration, Database Configuration and Email Configuration in CakePHP.

PHP 8.4 Installation and Upgrade guide for Ubuntu and Debian PHP 8.4 Installation and Upgrade guide for Ubuntu and Debian Dec 24, 2024 pm 04:42 PM

PHP 8.4 brings several new features, security improvements, and performance improvements with healthy amounts of feature deprecations and removals. This guide explains how to install PHP 8.4 or upgrade to PHP 8.4 on Ubuntu, Debian, or their derivati

CakePHP Date and Time CakePHP Date and Time Sep 10, 2024 pm 05:27 PM

To work with date and time in cakephp4, we are going to make use of the available FrozenTime class.

CakePHP File upload CakePHP File upload Sep 10, 2024 pm 05:27 PM

To work on file upload we are going to use the form helper. Here, is an example for file upload.

CakePHP Routing CakePHP Routing Sep 10, 2024 pm 05:25 PM

In this chapter, we are going to learn the following topics related to routing ?

Discuss CakePHP Discuss CakePHP Sep 10, 2024 pm 05:28 PM

CakePHP is an open-source framework for PHP. It is intended to make developing, deploying and maintaining applications much easier. CakePHP is based on a MVC-like architecture that is both powerful and easy to grasp. Models, Views, and Controllers gu

CakePHP Working with Database CakePHP Working with Database Sep 10, 2024 pm 05:25 PM

Working with database in CakePHP is very easy. We will understand the CRUD (Create, Read, Update, Delete) operations in this chapter.

CakePHP Creating Validators CakePHP Creating Validators Sep 10, 2024 pm 05:26 PM

Validator can be created by adding the following two lines in the controller.

See all articles