PHP multi-threaded web page crawling implementation code

Released: 2016-07-21 15:35:58

PHP has no built-in multi-threading, so a crawler that fetches pages one at a time is slow. The usual workaround is the cURL multi functions (curl_multi_*), which let a single script run requests to several URLs concurrently. Strictly speaking this is concurrent I/O inside one thread rather than true multi-threading, but for a crawler the effect is similar. Since the cURL multi functions are this capable, can they also be used to download several pages to a file concurrently? Of course. My code is given below:

Code 1: Write the fetched content directly to a file

The code is as follows:

 
<?php
// URLs of the pages to crawl
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
);

$save_to = '/test.txt'; // File that receives the crawled content

$st = fopen($save_to, "a");
$mh = curl_multi_init();
$conn = array();

// Initialization: one easy handle per URL, registered with the multi handle
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    // Write the response body straight to the file. All handles share $st,
    // so chunks from different pages may interleave in the output.
    curl_setopt($conn[$i], CURLOPT_FILE, $st);
    curl_multi_add_handle($mh, $conn[$i]);
}

// Execute: run all transfers until none is active.
// curl_multi_select() waits for socket activity so the loop does not spin the CPU.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh, 1.0);
    }
} while ($active && $status == CURLM_OK);

// Cleanup
foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}

curl_multi_close($mh);
fclose($st);
?>
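Code 1 points every handle at the same file, so the pages end up mixed together in one output. If each page should be kept separately, one option is to give every handle its own file. The following is only a minimal sketch of that idea: the function name fetch_to_files, the $dir parameter, and the file naming scheme (index plus an md5 of the URL) are assumptions made for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper: download each URL into its own file under $dir.
// Returns an array mapping each URL to the path it was saved to.
function fetch_to_files(array $urls, string $dir): array
{
    $mh = curl_multi_init();
    $handles = array();
    $files = array();
    $paths = array();

    foreach ($urls as $i => $url) {
        // Naming scheme is an assumption: "<index>_<md5 of URL>.html"
        $paths[$url] = $dir . '/' . $i . '_' . md5($url) . '.html';
        $files[$i] = fopen($paths[$url], 'w');
        $handles[$i] = curl_init($url);
        curl_setopt($handles[$i], CURLOPT_HEADER, 0);
        curl_setopt($handles[$i], CURLOPT_CONNECTTIMEOUT, 60);
        curl_setopt($handles[$i], CURLOPT_FILE, $files[$i]); // One file per handle
        curl_multi_add_handle($mh, $handles[$i]);
    }

    // Drive all transfers, waiting for socket activity between iterations
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 1.0);
        }
    } while ($active && $status == CURLM_OK);

    // Release the handles first, then close (and flush) each file
    foreach ($handles as $i => $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        fclose($files[$i]);
    }
    curl_multi_close($mh);

    return $paths;
}
```

Called as fetch_to_files($urls, '/tmp/pages'), with a directory that already exists and is writable, this would return the saved path for each URL. The sketch does no error handling for failed transfers or unwritable files.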

Code 2: Store the fetched content in a variable first, then write it to a file

The code is as follows:

 
<?php
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
);

$save_to = '/test.txt'; // File that receives the crawled content
$st = fopen($save_to, "a");

$mh = curl_multi_init();
$conn = array();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true); // Return the response as a string instead of printing it
    curl_multi_add_handle($mh, $conn[$i]);
}

// Run all transfers; curl_multi_select() waits for socket activity so the loop does not spin the CPU
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh, 1.0);
    }
} while ($active && $status == CURLM_OK);

// Read each response into a variable and write it to the file
foreach ($urls as $i => $url) {
    $data = curl_multi_getcontent($conn[$i]); // Response body as a string
    fwrite($st, (string) $data); // The string could equally be stored elsewhere, such as in a database
}

// Cleanup
foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}

curl_multi_close($mh);
fclose($st);
?>
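Code 2 writes whatever curl_multi_getcontent() returns, without checking whether each transfer actually succeeded. When the results are headed for a database rather than a file, it may help to collect them per URL together with any error. The following is a hedged sketch of that idea using curl_multi_info_read(); the function name fetch_all and the shape of the returned array are assumptions made for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: fetch all URLs concurrently and return
// array(url => array('body' => string|null, 'error' => string|null)).
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $i => $url) {
        $handles[$i] = curl_init($url);
        curl_setopt($handles[$i], CURLOPT_HEADER, 0);
        curl_setopt($handles[$i], CURLOPT_CONNECTTIMEOUT, 60);
        curl_setopt($handles[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $handles[$i]);
    }

    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 1.0);
        }
    } while ($active && $status == CURLM_OK);

    // Collect the result code that the multi handle reports for each transfer
    $codes = array();
    while ($info = curl_multi_info_read($mh)) {
        $i = array_search($info['handle'], $handles, true);
        if ($i !== false) {
            $codes[$i] = $info['result'];
        }
    }

    $results = array();
    foreach ($handles as $i => $ch) {
        $ok = isset($codes[$i]) && $codes[$i] === CURLE_OK;
        $results[$urls[$i]] = array(
            'body'  => $ok ? curl_multi_getcontent($ch) : null,
            'error' => $ok ? null
                : (isset($codes[$i]) ? curl_strerror($codes[$i]) : 'no result reported'),
        );
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

The caller can then loop over the returned array, store each 'body' wherever it is needed, and log or retry the URLs whose 'error' is not null. HTTP status codes are not inspected here; a 404 page still counts as a successful transfer.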