
Parallel Crawling of Web Pages with a Single PHP Thread


This tutorial shows how to fetch information from multiple pages in parallel; the key point is that the parallelism is achieved from a single PHP thread.

Normally, when people write a program that grabs information from several pages, they do it serially, but then the total fetch time is far too long to be practical. My first thought was to use curl to fetch the pages in parallel, only to discover that the virtual server I was using had no curl at all, which was rather frustrating. So I changed my approach: use a single thread to get the effect of multiple threads. Anyone who knows a little about network programming will know the concept of I/O multiplexing, and PHP supports it too; in fact it is built in and requires no extension.

Even people with many years of programming experience may not know much about PHP's stream functions. In PHP, compressed file streams, ordinary file streams, and TCP connections are all wrapped up as streams, so reading a local file is no different from reading a file over the network.
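As a quick illustration of that point (this snippet is not part of the crawler; it assumes allow_url_fopen is enabled and uses a made-up local path), the exact same read loop works for a local file and for a remote page:

<?php
// Minimal sketch: both targets are just streams, so the read loop is identical.
function read_stream($target)
{
    $fp = fopen($target, 'r');           // local path or http:// URL
    if (!$fp) {
        return false;
    }
    $data = '';
    while (!feof($fp)) {
        $data .= fread($fp, 8192);       // same call in both cases
    }
    fclose($fp);
    return $data;
}

$local  = read_stream('/tmp/example.txt');     // hypothetical local file
$remote = read_stream('http://www.qq.com/');   // http:// stream wrapper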

Having said all that, I think the idea is basically clear, so let's get straight to the code. The code is fairly rough; to use it for real you would still need to take care of a number of details, one of which is sketched just below.
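For instance, the http_get_open() function in the code below always connects to port 80 and ignores any port given in the URL. If you needed that, one possible adjustment (a sketch, not part of the original code) would be to replace the stream_socket_client() line with something like:

// Hypothetical adjustment: honour an explicit port from parse_url(), default 80.
$port = empty($url['port']) ? 80 : $url['port'];
$fp = stream_socket_client("tcp://{$host}:{$port}", $errno, $errstr, 30);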

Code

<?php
// Open a TCP connection to the URL's host and send an HTTP/1.0 GET request.
// Returns the stream resource, or false on failure.
function http_get_open($url)
{
    $url = parse_url($url);
    if (empty($url['host'])) {
        return false;
    }

    $host = $url['host'];
    if (empty($url['path'])) {
        $url['path'] = "/";
    }
    $get = $url['path'] . "?" . @$url['query'];

    $fp = stream_socket_client("tcp://{$host}:80", $errno, $errstr, 30);
    if (!$fp) {
        echo "$errstr ($errno)\n";
        return false;
    } else {
        // Fire off the request now; the response is read later via stream_select().
        fwrite($fp, "GET {$get} HTTP/1.0\r\nHost: {$host}\r\nAccept: */*\r\n\r\n");
    }

    return $fp;
}

// Fetch all URLs "in parallel" from a single thread: open one socket per URL,
// then use stream_select() to read from whichever sockets have data ready.
function http_multi_get($urls)
{
    $result = array();
    $fps = array();

    foreach ($urls as $key => $url) {
        $fp = http_get_open($url);
        if ($fp === false) {
            $result[$key] = false;
        } else {
            $result[$key] = '';
            $fps[$key] = $fp;
        }
    }

    while (1) {
        $reads = $fps;
        if (empty($reads)) {
            break;
        }

        $write = null;
        $except = null;
        if (($num = stream_select($reads, $write, $except, 30)) === false) {
            echo "error";
            return false;
        } else if ($num > 0) { // at least one socket is readable
            foreach ($reads as $value) {
                $key = array_search($value, $fps);
                if (!feof($value)) {
                    $result[$key] .= fread($value, 128);
                } else {
                    unset($fps[$key]); // this response is complete
                }
            }
        } else { // timed out
            echo "timeout";
            return false;
        }
    }

    // Split each raw response into headers and body.
    foreach ($result as $key => &$value) {
        if ($value) {
            $value = explode("\r\n\r\n", $value, 2);
        }
    }

    return $result;
}

$urls = array();
$urls[] = "http://www.qq.com";
$urls[] = "http://www.sina.com.cn";
$urls[] = "http://www.sohu.com";
$urls[] = "http://www.blue1000.com";

// Parallel crawling
$t1 = microtime(true);
$result = http_multi_get($urls);
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);

// Serial crawling, for comparison
$t1 = microtime(true);
foreach ($urls as $value) {
    file_get_contents($value);
}
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);
?>

The final running result:

string 'cost: 3.2403128147125' (length=21)
string 'cost: 6.2333900928497' (length=21)

That is roughly twice as fast. Of course, I noticed that Sina was very slow, taking about 2.5s, which basically dragged the whole parallel run down to its speed, while 360 took only about 0.2s.

If every site responded at a similar speed and more requests were run in parallel, the gap would be even bigger: serially the total cost is roughly the sum of all the individual request times, while in parallel it is roughly the time of the slowest single request.
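Finally, a note on using what http_multi_get() returns: each successful entry is the two-element array produced by explode("\r\n\r\n", ...), i.e. the raw response headers followed by the body. A small, untested sketch of how the $result from the script above might be consumed:

// Sketch: pull the status line and body out of http_multi_get()'s result.
foreach ($result as $key => $value) {
    if (!is_array($value) || count($value) < 2) {
        echo "{$urls[$key]}: request failed or response not understood\n";
        continue;
    }
    list($headers, $body) = $value;          // headers and body, already split
    $statusLine = strtok($headers, "\r\n");  // e.g. "HTTP/1.0 200 OK"
    echo "{$urls[$key]}: {$statusLine}, " . strlen($body) . " bytes\n";
}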
