The first article used get_html() to implement simple data collection. Because pages are fetched one at a time, the total transfer time is the sum of every page's download time: if one page takes 1 second, 10 pages take 10 seconds. Fortunately, curl also provides parallel processing capabilities.
To write a parallel collection function, you first have to understand what kinds of pages you want to collect and what requests those pages need; only then can you write a reasonably general function.
When writing get_html(), we learned that extra curl parameters can be passed through an options array, so a function that collects multiple pages at once must keep this feature.
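For reference, get_html() itself is not reproduced in this post; a minimal sketch of the shape it presumably has (single page, with the options array merged in) would be:
// A minimal sketch of the first article's single-page get_html()
// (an assumption -- the original function is not shown here):
function get_html($url, $options = array()){
    $ch = curl_init($url);
    $options[CURLOPT_RETURNTRANSFER] = true; // return the body instead of printing it
    $options[CURLOPT_TIMEOUT] = 5;           // give up after 5 seconds
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}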
Whether you are requesting the HTML of a web page or calling a web API, GET and POST requests usually target the same page or interface and differ only in their parameters. So the parameter type becomes: $options is a two-dimensional array, with one array of parameters per page.
That seems to settle the problem. But after searching the entire curl manual, there is no option for passing GET parameters (they live in the URL itself), so the only choice is to pass $url as an array and add a $method parameter.
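Concretely, this design leads to the two calling conventions used later in this post (get_htmls() itself is defined next):
// GET: an array of URLs, one shared option set
$htmls = get_htmls($urls, $options);
// POST: a single URL, $options as a two-dimensional array
// (one option set per request), plus the method flag
$htmls = get_htmls($url, $options, 'post');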
function get_htmls($urls, $options = array(), $method = 'get'){
    $mh = curl_multi_init();
    if($method == 'get'){ // GET is the most common way to pass values
        foreach($urls as $key => $url){
            $ch = curl_init($url);
            $options[CURLOPT_RETURNTRANSFER] = true; // return the body instead of printing it
            $options[CURLOPT_TIMEOUT] = 5;
            curl_setopt_array($ch, $options);
            $curls[$key] = $ch;
            curl_multi_add_handle($mh, $curls[$key]);
        }
    }elseif($method == 'post'){ // POST: one option set per request
        foreach($options as $key => $option){
            $ch = curl_init($urls); // the same URL for every request
            $option[CURLOPT_RETURNTRANSFER] = true;
            $option[CURLOPT_TIMEOUT] = 5;
            $option[CURLOPT_POST] = true;
            curl_setopt_array($ch, $option);
            $curls[$key] = $ch;
            curl_multi_add_handle($mh, $curls[$key]);
        }
    }else{
        exit("Parameter error!\n");
    }
    do{
        $mrc = curl_multi_exec($mh, $active);
        curl_multi_select($mh); // reduces CPU load; commenting it out drives CPU usage up
    }while($active);
    foreach($curls as $key => $ch){
        $html = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        $htmls[$key] = $html;
    }
    curl_multi_close($mh);
    return $htmls;
}
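One caveat: if a request times out, curl_multi_getcontent() simply yields an empty result and the failure is silent. A possible refinement (a sketch, not part of the function above) is to check each handle in the final foreach before it is closed:
// Sketch: inside the final foreach of get_htmls(), before curl_close($ch)
if(curl_errno($ch)){ // non-zero means this particular request failed
    // curl_error($ch) gives a readable reason, e.g. a timeout
    $htmls[$key] = false;
}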
Common GET requests are implemented by varying URL parameters, and since our function is aimed at data collection, which usually proceeds by category, the URLs look something like this:
http://www.baidu.com/s?wd=shili&pn=0&ie=utf-8
http://www.baidu.com/s?wd=shili&pn=10&ie=utf-8
http://www.baidu.com/s?wd=shili&pn=20&ie=utf-8
http://www.baidu.com/s?wd=shili&pn=30&ie=utf-8
http://www.baidu.com/s?wd=shili&pn=40&ie=utf-8
The above five pages are very regular, and only the value of pn changes.
The code is as follows:
$urls = array();
for($i = 1; $i <= 5; $i++){
    // pn goes 0, 10, 20, 30, 40 -- one URL per results page
    $urls[] = 'http://www.baidu.com/s?wd=shili&pn='.(($i-1)*10).'&ie=utf-8';
}
$options = array();
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
$htmls = get_htmls($urls, $options);
foreach($htmls as $html){
    echo $html; // the raw HTML is available here for data processing
}
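As a sketch of what that data processing might look like, the snippet below pulls the <title> out of each collected page (real collection code would use a proper HTML parser rather than a regex):
foreach($htmls as $html){
    // a crude regex grab of the page title, good enough for a demo
    if(preg_match('/<title>(.*?)<\/title>/is', $html, $m)){
        echo trim($m[1])."\n";
    }
}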
To simulate a common POST request, first write a post.php file as follows:
The code is as follows:
<?php
if(isset($_POST['username']) && isset($_POST['password'])){
    echo 'The username is: '.$_POST['username'].' The password is: '.$_POST['password'];
}else{
    echo 'Request error!';
}
Then call as follows:
The code is as follows:
$url = 'http://localhost/yourpath/post.php'; // adjust to your own path
$options = array();
for($i = 1; $i <= 5; $i++){
    $option[CURLOPT_POSTFIELDS] = 'username=user'.$i.'&password=pass'.$i;
    $options[] = $option; // one option set per request
}
$htmls = get_htmls($url, $options, 'post');
foreach($htmls as $html){
    echo $html; // the raw HTML is available here for data processing
}
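Since get_htmls() stores every result under the same $key as the option set that produced it, each response can also be matched back to its request:
foreach($htmls as $key => $html){
    // $key 0 came from user1/pass1, $key 1 from user2/pass2, and so on
    echo 'Response '.($key + 1).': '.$html."\n";
}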
With this, the get_htmls() function can handle basic data collection tasks.
That's it for today's sharing. If anything is poorly written or unclear, suggestions are welcome.