A simple web page scraper in PHP using fopen


WBOY

Jun 02, 2016 am 09:13 AM

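This scraper is a very simple program. In my opinion it is not suited to collecting large amounts of data, although a single page is no problem, because fopen handles remote files and multi-threaded use very poorly. The author wrote it for fun. The code is as follows: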

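<?php
// These methods are private and are called through $this, so they belong inside a class.
// The class name PageCollector is illustrative; the original shows only the methods.
class PageCollector
{
    /**
     * Fetch the content of a web page by URL.
     * Opening an http(s) URL with fopen requires allow_url_fopen to be enabled.
     *
     * @param string $url Link address
     * @return string Page content, or '' if the URL cannot be opened
     */
    private function fetchbyurl($url) {
        $handle = fopen($url, 'r');
        if ($handle === false) {
            return '';
        }
        $content = '';
        while (!feof($handle)) {
            $content .= fgets($handle, 10000);
        }
        fclose($handle);
        return $content;
        // For GBK-encoded pages, return $this->utf8_iconv($content) instead.
    }

    /**
     * Convert GBK-encoded content to UTF-8.
     *
     * @param string $content Content
     * @return string
     */
    private function utf8_iconv($content) {
        return iconv('GBK', 'UTF-8', $content);
    }

    /**
     * Get all matching content.
     *
     * @param string $str Content
     * @param string $start Start delimiter
     * @param string $end End delimiter
     * @return array
     */
    private function strCutAll($str, $start, $end) {
        $content = explode($start, $str);
        $matches = array();
        $sum = count($content);
        // $content[0] is the text before the first $start, so begin at 1.
        for ($i = 1; $i < $sum; $i++) {
            $tmp = explode($end, $content[$i]);
            $matches[] = $tmp[0];
        }
        return $matches;
    }

    /**
     * Get the first matching content.
     *
     * @param string $str Content
     * @param string $start Start delimiter
     * @param string $end End delimiter
     * @return string The first match, or '' if either delimiter is missing
     */
    private function strCut($str, $start, $end) {
        $content = strstr($str, $start);
        if ($content === false) {
            return '';
        }
        // Look for $end only after the start delimiter itself.
        $stop = strpos($content, $end, strlen($start));
        if ($stop === false) {
            return '';
        }
        return substr($content, strlen($start), $stop - strlen($start));
    }
}
?>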
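
To see how the two delimiter helpers behave, here is a minimal standalone sketch that can be run without a class or a network connection. The function names cutAll and cutFirst and the sample HTML string are assumptions made for illustration; they mirror strCutAll and strCut above but are not the author's code.

```php
<?php
// Standalone versions of the two helpers, so the sketch runs outside a class.
// cutAll/cutFirst and the sample HTML are illustrative names and data.

/** Return every substring found between $start and $end. */
function cutAll(string $str, string $start, string $end): array
{
    $parts = explode($start, $str);
    $matches = [];
    // $parts[0] is the text before the first $start, so skip it.
    for ($i = 1, $n = count($parts); $i < $n; $i++) {
        $matches[] = explode($end, $parts[$i])[0];
    }
    return $matches;
}

/** Return the first substring between $start and $end, or '' if absent. */
function cutFirst(string $str, string $start, string $end): string
{
    $from = strpos($str, $start);
    if ($from === false) {
        return '';
    }
    $from += strlen($start);
    $to = strpos($str, $end, $from);
    return $to === false ? '' : substr($str, $from, $to - $from);
}

$html = '<ul><li>one</li><li>two</li></ul><title>Demo</title>';
print_r(cutAll($html, '<li>', '</li>'));            // array with "one" and "two"
echo cutFirst($html, '<title>', '</title>'), "\n";  // Demo
```

Splitting on literal delimiters is fast and easy to follow, but it only works when the markers appear exactly as written in the page source, including attribute order and quoting.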


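/* Scraper usage. This runs inside a method of the class that holds the helpers above. */
/* dump() is a framework debugging helper; in plain PHP, var_dump() can be used instead. */
header("Content-Type: text/html; charset=utf-8");
// $nr = file_get_contents('/webback/php/php-yi-ju-hua-hou-men-zhuan');
$nr = $this->fetchbyurl('/webback/php/php-yi-ju-hua-hou-men-zhuan');
// Recommended; curl can also be used.
dump($this->strCut($nr, '<div class="context">', '<div class="betterrelated">'));
// Gets the content. It needs further filtering (with preg_match_all).
dump($this->strCutAll($nr, '<title>', '</title>'));
// Gets the title.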
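
The comment about further filtering with preg_match_all can be sketched as follows. This is only one possible approach: the function name extractParagraphs, the choice of `<p>` elements as the target, and the sample markup are all assumptions, since the article does not say what the filtering should keep.

```php
<?php
/**
 * Return the text of every <p> element in an HTML fragment.
 * extractParagraphs is a hypothetical helper, not part of the original code.
 */
function extractParagraphs(string $html): array
{
    // 's' lets '.' match newlines, 'i' ignores tag case, '.*?' is non-greedy.
    // '\b' stops the pattern from matching tags such as <pre>.
    preg_match_all('#<p\b[^>]*>(.*?)</p>#si', $html, $m);

    // $m[1] holds the first capture group of each match; drop nested tags.
    return array_map('trim', array_map('strip_tags', $m[1]));
}

// Made-up sample standing in for the block returned by strCut.
$html = "<div class=\"context\"><p>First</p>\n<p class=\"x\">Second <b>bold</b> text</p></div>";
print_r(extractParagraphs($html));
```

Regular expressions are adequate for small, predictable fragments like this one. For irregular or deeply nested markup, a real HTML parser such as PHP's DOMDocument is usually more reliable.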

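
Since the usage comments mention curl as an alternative to fopen, here is a hedged sketch of what a curl-based fetch might look like. The function name fetchWithCurl and the timeout values are assumptions, and the code needs PHP's curl extension. Unlike the fopen version, it sets explicit timeouts, which matters when a remote server is slow or unreachable.

```php
<?php
/**
 * Fetch a URL with the curl extension; returns null on failure.
 * fetchWithCurl and its default timeouts are illustrative assumptions.
 */
function fetchWithCurl(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,        // seconds to wait for the connection
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole transfer
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage from the command line: php fetch.php <url>
// Nothing is fetched unless a URL is passed as the first argument.
$url = $argv[1] ?? null;
if ($url !== null) {
    $page = fetchWithCurl($url);
    echo $page === null ? "fetch failed\n" : strlen($page) . " bytes\n";
}
```

The returned string can be passed to the same delimiter helpers as the output of fetchbyurl.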
