This PHP tutorial walks through fetching several pages in parallel. The key point is that the parallelism happens inside a single thread.
Normally, people write a serial loop when they need to grab data from several pages, but then the total time is the sum of every request and quickly becomes impractical. My first idea was to crawl in parallel with curl, but it turned out the virtual server had no curl at all, which was rather frustrating. So I changed my approach and used a single thread to get the effect of multiple threads. Anyone with a bit of network programming experience will know the concept of IO multiplexing, and PHP supports it out of the box: it is built in and needs no extension.
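Before the full crawler, here is a minimal sketch of what IO multiplexing with stream_select() looks like on two sockets. The hosts are just placeholders and error handling is omitted, so treat it as an illustration rather than finished code:
<?php
// Hypothetical sketch: watch two sockets at once with stream_select().
$a = stream_socket_client("tcp://example.com:80", $errno, $errstr, 30);
$b = stream_socket_client("tcp://example.org:80", $errno, $errstr, 30);
fwrite($a, "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n");
fwrite($b, "GET / HTTP/1.0\r\nHost: example.org\r\n\r\n");

$streams = array($a, $b);
while ($streams) {
    $read = $streams;          // stream_select() rewrites this array in place
    $write = null;
    $except = null;
    if (stream_select($read, $write, $except, 30) < 1) {
        break;                 // timeout or error
    }
    foreach ($read as $k => $fp) {
        if (feof($fp)) {       // this response is complete
            fclose($fp);
            unset($streams[$k]);
        } else {
            fread($fp, 1024);  // data is ready, so this read will not block
        }
    }
}
?>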
Even programmers with years of experience may not be very familiar with PHP's stream functions. In PHP, compressed files, local files, and TCP connections are all wrapped in the same stream abstraction, so reading a network resource is no different from reading a local file.
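As a quick illustration (not part of the original listing, and the paths and URL below are placeholders), the same calls read a local file, a gzip-compressed file, and a remote page, because each one is just a different stream wrapper:
<?php
// The same stream API works no matter what is behind it.
echo file_get_contents("/tmp/local.txt");            // plain local file (placeholder path)
echo file_get_contents("compress.zlib:///tmp/a.gz"); // compressed file via the zlib wrapper
echo file_get_contents("http://www.example.com/");   // remote page via the http wrapper

// fopen()/fread() are just as indifferent to the wrapper behind the stream
$fp = fopen("http://www.example.com/", "r");
while (!feof($fp)) {
    echo fread($fp, 1024);
}
fclose($fp);
?>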
Enough talk; below is the actual crawler. The code is fairly rough, and to use it for real you would still need to handle a few details.
Code
<?php
// Open one page: parse the URL, connect over plain TCP and send a
// minimal HTTP/1.0 GET request. Returns the stream, or false on failure.
function http_get_open($url)
{
    $url = parse_url($url);
    if (empty($url['host'])) {
        return false;
    }
    $host = $url['host'];
    if (empty($url['path'])) {
        $url['path'] = "/";
    }
    $get = $url['path'] . "?" . @$url['query'];

    $fp = stream_socket_client("tcp://{$host}:80", $errno, $errstr, 30);
    if (!$fp) {
        echo "$errstr ($errno)\n";
        return false;
    }

    // HTTP/1.0, so the server closes the connection once the body is sent
    fwrite($fp, "GET {$get} HTTP/1.0\r\nHost: {$host}\r\nAccept: */*\r\n\r\n");
    return $fp;
}
// Fetch all URLs at once: open every connection first, then multiplex
// the reads with stream_select() until every stream reaches EOF.
function http_multi_get($urls)
{
    $result = array();
    $fps = array();

    foreach ($urls as $key => $url) {
        $fp = http_get_open($url);
        if ($fp === false) {
            $result[$key] = false;
        } else {
            $result[$key] = '';
            $fps[$key] = $fp;
        }
    }

    while (1) {
        $reads = $fps;
        if (empty($reads)) {
            break;
        }

        // stream_select() needs real variables for the write/except sets
        $write = null;
        $except = null;
        $num = stream_select($reads, $write, $except, 30);
        if ($num === false) {
            echo "error";
            return false;
        } else if ($num > 0) { // at least one stream is readable
            foreach ($reads as $value) {
                $key = array_search($value, $fps);
                if (!feof($value)) {
                    $result[$key] .= fread($value, 128);
                } else {
                    unset($fps[$key]); // finished, drop it from the set
                }
            }
        } else { // timed out
            echo "timeout";
            return false;
        }
    }

    // Split each raw response into headers and body
    foreach ($result as $key => &$value) {
        if ($value) {
            $value = explode("\r\n\r\n", $value, 2);
        }
    }
    return $result;
}
$urls = array();
$urls[] = "http://www.qq.com";
$urls[] = "http://www.sina.com.cn";
$urls[] = "http://www.sohu.com";
$urls[] = "http://www.blue1000.com";

// Parallel fetch
$t1 = microtime(true);
$result = http_multi_get($urls);
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);

// Serial fetch, for comparison
$t1 = microtime(true);
foreach ($urls as $value) {
    file_get_contents($value);
}
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);
?>
The final running result:
string 'cost: 3.2403128147125' (length=21)
string 'cost: 6.2333900928497' (length=21)
That is roughly a twofold speedup. I did notice that Sina was very slow, around 2.5s, and it dragged the parallel run down with it, while 360 took only about 0.2s. Since the parallel time is bounded by the slowest site whereas the serial time is the sum of all of them, the gap widens when the sites respond at similar speeds and the number of parallel requests grows.
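For completeness, here is a small usage sketch, not from the original post, of how the array returned by http_multi_get() might be consumed; it assumes the functions above are already loaded, and that each successful response was split into headers and body by the explode() call.
<?php
// Assumes http_multi_get() from the listing above has already been defined.
$pages = http_multi_get(array("http://www.qq.com", "http://www.sohu.com"));
foreach ($pages as $key => $page) {
    if (!is_array($page)) {                      // connection failed or nothing was read
        echo "request {$key} failed\n";
        continue;
    }
    $headers = $page[0];
    $body = isset($page[1]) ? $page[1] : '';     // body may be empty if no blank line was found
    echo "page {$key}: " . strlen($body) . " bytes of body\n";
}
?>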