Sharing code for multi-threaded web page crawling in PHP

WBOY
Release: 2016-07-25 08:57:16
This article shares PHP code for crawling several web pages concurrently. Readers who need it can use it as a reference.

In PHP, you can use Curl to perform many kinds of transfers, such as simulating a browser that sends GET and POST requests. The PHP language itself does not support multi-threading, so a crawler that requests pages one after another is slow. The Curl Multi Functions work around this limitation: they do not create threads, but they let a single script run requests to several URLs concurrently.
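To make the "simulating a browser" part concrete, here is a minimal sketch of the options involved. The helper name `browser_request_options()` and the form fields used in the note below are hypothetical and chosen only for illustration; the `CURLOPT_*` constants are the standard ones from PHP's curl extension. Passing form fields switches the request from GET to POST.

```php
<?php
/**
 * Build the curl options for a browser-like request.
 * Hypothetical helper for illustration: GET by default,
 * POST when $postFields is given.
 */
function browser_request_options($url, $postFields = null)
{
    $options = array(
        CURLOPT_URL            => $url,
        // Identify as a browser, using the same string as the examples in this article.
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        CURLOPT_RETURNTRANSFER => true, // return the body as a string instead of printing it
        CURLOPT_CONNECTTIMEOUT => 60,   // seconds to wait while connecting
    );

    if ($postFields !== null) {
        // Send the fields as a URL-encoded form body.
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }

    return $options;
}
```

The returned array can be applied to a handle with `curl_setopt_array($ch, browser_request_options($url))`, after which `curl_exec($ch)` would perform the request.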

For the basics of curl, you can refer to the following articles: "php curl application example analysis", "Example code of php curl usage", and "php curl learning summary".

This section gives two examples that use the Curl Multi Functions to download several pages concurrently.

Example 1, get the content and write it directly to the file

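<?php
/**
 * Fetch several web pages concurrently
 * edit by bbs.it-home.org
 */
$urls = array(
 'http://www.sina.com.cn/',
 'http://www.sohu.com/',
 'http://bbs.it-home.org/'
); // URLs of the pages to fetch

$save_to = '/test.txt'; // file that receives the fetched content

$st = fopen($save_to, "a");
if ($st === false) {
  die("Cannot open $save_to for writing");
}
$mh = curl_multi_init();
$conn = array();

foreach ($urls as $i => $url) {
  $conn[$i] = curl_init($url);
  curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
  curl_setopt($conn[$i], CURLOPT_HEADER, 0);
  curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
  // Write the fetched page straight to the file. All handles share $st,
  // so chunks from different pages may be interleaved in the output.
  curl_setopt($conn[$i], CURLOPT_FILE, $st);
  curl_multi_add_handle($mh, $conn[$i]);
} // set up one handle per URL

do {
  $status = curl_multi_exec($mh, $active);
  if ($active) {
    curl_multi_select($mh); // wait for network activity instead of spinning the CPU
  }
} while ($active && $status == CURLM_OK); // run all transfers

foreach ($urls as $i => $url) {
  curl_multi_remove_handle($mh, $conn[$i]);
  curl_close($conn[$i]);
} // clean up

curl_multi_close($mh);
fclose($st);
?>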
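Because every handle in Example 1 writes to the same file, the pages can end up mixed together. One possible way around this, sketched here under assumptions, is to give each URL its own output file. The helper name `output_path_for()`, the `.html` suffix, and the naming scheme (host plus an MD5 of the URL) are hypothetical choices for illustration, not part of the original code.

```php
<?php
/**
 * Derive a distinct output path for a URL inside $dir.
 * Hypothetical helper: the name and the naming scheme are
 * assumptions made for this sketch.
 */
function output_path_for($url, $dir)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        $host = 'unknown-host'; // malformed URL or no host part
    }

    // Keep only characters that are safe in a file name.
    $safeHost = preg_replace('/[^A-Za-z0-9.-]/', '_', $host);

    // The MD5 of the full URL keeps pages from the same host apart.
    return rtrim($dir, '/') . '/' . $safeHost . '-' . md5($url) . '.html';
}
```

Inside the setup loop, each handle would then get its own stream, for example `$files[$i] = fopen(output_path_for($url, '/tmp'), 'w');` followed by `curl_setopt($conn[$i], CURLOPT_FILE, $files[$i]);`, and each stream would be closed with `fclose()` after the transfers finish.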

Example 2, get the content into a variable and then write it to a file

<?php
$urls = array(
 'http://www.sina.com.cn/',
 'http://www.sohu.com/',
 'http://bbs.it-home.org/'
);

$save_to = '/test.txt'; // file that receives the fetched content
$st = fopen($save_to, "a");
if ($st === false) {
  die("Cannot open $save_to for writing");
}

$mh = curl_multi_init();
$conn = array();
foreach ($urls as $i => $url) {
  $conn[$i] = curl_init($url);
  curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
  curl_setopt($conn[$i], CURLOPT_HEADER, 0);
  curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
  curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
  curl_multi_add_handle($mh, $conn[$i]);
}

do {
  $status = curl_multi_exec($mh, $active);
  if ($active) {
    curl_multi_select($mh); // wait for network activity instead of spinning the CPU
  }
} while ($active && $status == CURLM_OK);

foreach ($urls as $i => $url) {
  $data = curl_multi_getcontent($conn[$i]); // the fetched page as a string
  fwrite($st, (string) $data); // write the string to the file; storing it in a database would also work
} // collect each page and write it to the file

foreach ($urls as $i => $url) {
  curl_multi_remove_handle($mh, $conn[$i]);
  curl_close($conn[$i]);
}

curl_multi_close($mh);
fclose($st);
?>
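Neither example checks whether a transfer actually succeeded. A minimal sketch of one way to do that uses `curl_multi_info_read()`, which reports the result of each finished transfer on a multi handle. The helper name `failed_transfers()` is hypothetical; it should be called after the `curl_multi_exec()` loop and before the handles are removed.

```php
<?php
/**
 * Return an error message for every finished transfer that failed,
 * keyed by URL. Hypothetical helper for illustration.
 */
function failed_transfers($mh)
{
    $failures = array();

    // Each message describes one completed transfer on the multi handle;
    // curl_multi_info_read() returns false once the queue is empty.
    while (($info = curl_multi_info_read($mh)) !== false) {
        if ($info['msg'] === CURLMSG_DONE && $info['result'] !== CURLE_OK) {
            $url = curl_getinfo($info['handle'], CURLINFO_EFFECTIVE_URL);
            $failures[$url] = curl_strerror($info['result']);
        }
    }

    return $failures;
}
```

In Example 2, the returned array could be used to skip writing the content of failed pages or to log which URLs need a retry.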


source:php.cn