Off-site data collection with PHP curl (recommended)

PHPz
Release: 2023-03-07 09:32:01

curl is a library built specifically for network interaction. It offers many configurable options for handling different environments, and it is generally more stable than file_get_contents for fetching remote content.

Why choose curl

Here is an easy-to-understand comparison of curl and file_get_contents:
file_get_contents is essentially a convenience wrapper around a series of built-in file functions such as file_exists, fopen, fread, and fclose. It is aimed mainly at local files; support for network files was added for convenience;
curl is a library designed specifically for network interaction. It provides many custom options for dealing with different environments, so it is generally more stable than file_get_contents for network requests.

How to use

1. Enable curl support

curl support is not enabled by default after PHP is installed, so you need to edit the php.ini file: find the line ;extension=php_curl.dll, remove the leading semicolon, and restart the service. (On PHP 7.2 and later the line is written as ;extension=curl.)
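
To confirm that the change took effect, you can ask PHP whether the extension is loaded. The following is a minimal sketch; curl_is_available is a hypothetical helper name, not a built-in function.

```php
<?php
// Hypothetical helper: reports whether the curl extension is usable.
function curl_is_available(): bool
{
    // extension_loaded() checks the list of loaded extensions;
    // function_exists() is a second sanity check that curl_init is defined.
    return extension_loaded('curl') && function_exists('curl_init');
}

echo curl_is_available()
    ? "curl is enabled\n"
    : "curl is NOT enabled; check php.ini and restart the service\n";
```

Running this from the command line and through the web server can give different answers, because the two may read different php.ini files.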

2. Use curl to capture data

The code is as follows:

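// Initialize a cURL handle
$curl = curl_init();
// Set the URL to fetch
curl_setopt($curl, CURLOPT_URL, 'http://www.cmx8.cn');
// Include the response headers in the output
curl_setopt($curl, CURLOPT_HEADER, 1);
// Return the result as a string instead of printing it to the screen
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
// Run cURL and request the page; curl_exec returns false on failure
$data = curl_exec($curl);
if ($data === false) {
    die('cURL error: ' . curl_error($curl));
}
// Close the cURL handle
curl_close($curl);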
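
Because CURLOPT_HEADER is set to 1, the returned string contains the response headers followed by the body. One way to separate them is to cut the string at the header size that curl_getinfo($curl, CURLINFO_HEADER_SIZE) reports. The sketch below uses a hand-written sample response instead of a live request; split_http_response is a hypothetical helper name.

```php
<?php
// Hypothetical helper: splits a raw response (headers + body) at the
// header size reported by curl_getinfo($curl, CURLINFO_HEADER_SIZE).
function split_http_response(string $raw, int $headerSize): array
{
    return [
        'headers' => substr($raw, 0, $headerSize),
        'body'    => substr($raw, $headerSize),
    ];
}

// Hand-written sample response, used here instead of a live request.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>";
// With a real handle, take this value from CURLINFO_HEADER_SIZE instead.
$headerSize = strpos($raw, "\r\n\r\n") + 4;

$parts = split_http_response($raw, $headerSize);
echo $parts['body'], "\n";
```

In a real request, read CURLINFO_HEADER_SIZE before calling curl_close, since the handle is no longer usable afterwards.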

3. Extract key data with regular expressions

The code is as follows:

// $data is the value returned by curl_exec, i.e. the fetched content
preg_match_all("/<li class=\"item\">(.*?)<\/li>/", $data, $out, PREG_SET_ORDER);
foreach ($out as $key => $value) {
    // $value is an array: index 0 holds the full match, index 1 holds the captured group
    echo 'Full match: ' . $value[0] . "\n";
    echo 'Captured group: ' . $value[1] . "\n";
}
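
To see what PREG_SET_ORDER produces without making a request, the same pattern can be run against a small sample string. This is a sketch only: extract_items is a hypothetical helper, the sample HTML is made up, and the "s" modifier is an addition so that items spanning several lines are also matched.

```php
<?php
// Hypothetical helper: returns the inner text of every <li class="item">.
function extract_items(string $html): array
{
    // The "s" modifier lets "." match newlines, so an item that spans
    // several lines is still captured.
    preg_match_all('/<li class="item">(.*?)<\/li>/s', $html, $out, PREG_SET_ORDER);

    $items = [];
    foreach ($out as $match) {
        // $match[0] is the whole <li>...</li>; $match[1] is the captured group.
        $items[] = trim($match[1]);
    }
    return $items;
}

// Made-up sample markup for illustration.
$sample = "<ul><li class=\"item\">first</li>\n<li class=\"item\">second\nline</li></ul>";
print_r(extract_items($sample));
```

Regular expressions work for simple, predictable markup like this; pages with nested or irregular HTML are usually easier to handle with a DOM parser.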

Tips

1. Timeout settings

You can configure timeouts with curl_setopt($ch, option, value). The main options are:

CURLOPT_TIMEOUT The maximum number of seconds cURL is allowed to execute.
CURLOPT_TIMEOUT_MS The maximum number of milliseconds cURL is allowed to execute. Added in cURL 7.16.2. Available from PHP 5.2.3.
CURLOPT_CONNECTTIMEOUT The number of seconds to wait while trying to connect. If set to 0, it waits indefinitely.
CURLOPT_CONNECTTIMEOUT_MS The number of milliseconds to wait while trying to connect. If set to 0, it waits indefinitely. Added in cURL 7.16.2. Available from PHP 5.2.3.
CURLOPT_DNS_CACHE_TIMEOUT The number of seconds to keep DNS information in memory. The default is 120 seconds.

The code is as follows:

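curl_setopt($ch, CURLOPT_TIMEOUT, 60);      // a timeout in whole seconds is all that is needed here
curl_setopt($ch, CURLOPT_NOSIGNAL, 1);      // note: this must be set when using millisecond timeouts
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200);  // timeout in milliseconds; added in cURL 7.16.2, available from PHP 5.2.3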
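
The timeout options are easier to reuse when they are collected in one place and applied with curl_setopt_array. The sketch below also shows one way to tell a timeout apart from other failures. Both function names and the default values of 5 and 30 seconds are assumptions for illustration.

```php
<?php
// Hypothetical helper: builds timeout options for curl_setopt_array().
function build_timeout_options(int $connectSeconds, int $totalSeconds): array
{
    return [
        CURLOPT_CONNECTTIMEOUT => $connectSeconds, // limit for the connection phase
        CURLOPT_TIMEOUT        => $totalSeconds,   // limit for the whole transfer
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
    ];
}

// Hypothetical helper: fetches a URL and reports timeouts separately.
function fetch_with_timeout(string $url, int $connectSeconds = 5, int $totalSeconds = 30): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, build_timeout_options($connectSeconds, $totalSeconds));

    $result = curl_exec($ch);
    if ($result === false) {
        // CURLE_OPERATION_TIMEDOUT is the error number cURL reports for a timeout.
        $reason = curl_errno($ch) === CURLE_OPERATION_TIMEDOUT
            ? 'timed out'
            : curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("Request to $url failed: $reason");
    }

    curl_close($ch);
    return $result;
}
```

A caller would wrap fetch_with_timeout($url) in a try/catch block and decide whether to retry when the message says the request timed out.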

2. Submit data via POST and keep cookies

The code is as follows:

// The following example is borrowed from elsewhere, for study and reference:
// Simulated login to a Discuz forum with curl, suitable for DZ 7.0
!extension_loaded('curl') && die('The curl extension is not loaded.');
$discuz_url = 'http://www.lxvoip.com'; // forum address
$login_url = $discuz_url . '/logging.php?action=login'; // login page address
$get_url = $discuz_url . '/my.php?item=threads'; // "my threads" page
$post_fields = array();
// The following two fields do not need to be changed
$post_fields['loginfield'] = 'username';
$post_fields['loginsubmit'] = 'true';
// Username and password, required
$post_fields['username'] = 'lxvoip';
$post_fields['password'] = '88888888';
// Security question
$post_fields['questionid'] = 0;
$post_fields['answer'] = '';
// @todo verification code (CAPTCHA)
$post_fields['seccodeverify'] = '';
// Fetch the login form to get its FORMHASH
$ch = curl_init($login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$contents = curl_exec($ch);
curl_close($ch);
preg_match('/<input\s*type="hidden"\s*name="formhash"\s*value="(.*?)"\s*\/>/i', $contents, $matches);
if (!empty($matches)) {
    $formhash = $matches[1];
} else {
    die('formhash not found.');
}
// Assumption: the extracted formhash is meant to be submitted with the form
$post_fields['formhash'] = $formhash;
// POST the data and store the returned COOKIE
$cookie_file = dirname(__FILE__) . '/cookie.txt';
//$cookie_file = tempnam('/tmp', 'cookie');
$ch = curl_init($login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_exec($ch);
curl_close($ch);
// Use the COOKIE obtained above to fetch a page that requires login
$ch = curl_init($get_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
// Return the page as a string so that $contents holds it for var_dump
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
$contents = curl_exec($ch);
curl_close($ch);
var_dump($contents);
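
One detail worth knowing about CURLOPT_POSTFIELDS: when it is given an array, cURL sends the data as multipart/form-data, while a URL-encoded string is sent as application/x-www-form-urlencoded, which is what an ordinary HTML login form submits. If the target site expects the latter, the fields can be encoded with http_build_query first. The sketch below uses placeholder field values, not real credentials.

```php
<?php
// Placeholder form fields for illustration; not real credentials.
$fields = [
    'loginfield' => 'username',
    'username'   => 'demo user',
    'questionid' => 0,
    'answer'     => '',
];

// Encode the fields the way a browser encodes an ordinary form.
$body = http_build_query($fields, '', '&');
echo $body, "\n";

// With a cURL handle $ch, the string would be passed like this:
// curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
```

Whether a given forum accepts multipart data for its login form depends on the server, so this is something to try if the array form does not work.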


source:php.cn