PHP Scraping and Content Collection - PHP Tutorial - php.cn

WBOY

Jun 23, 2016 pm 02:30 PM

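Today I was assigned to scrape the news section of the Sohu homepage. It should have been simple, but the page I fetched from Sohu kept coming back as garbled text, and no conversion I tried fixed it. I had to dig deeper, learned quite a lot along the way, and am writing it down here to share.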

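1. What is a PHP collection program?

2. Why collect?

3. What to collect?

4. How to collect?

5. The collection workflow

6. Sample collection programs

7. Lessons learned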

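What is a PHP collection program?

A PHP collection program (a scraper), sometimes nicknamed a "PHP thief", is a web program written in PHP that automatically gathers specific content from web pages and runs on any platform that supports PHP. "Automatically gathers" may remind you of Baidu or Google and of what search engines do. A PHP collection program does a similar job.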

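Why collect?

The Internet is growing at great speed, and the volume of web data grows geometrically every day. Faced with that much data, how do you, as a site administrator, gather the information you need? In particular, when you need a large amount of information from one site or from several similar sites to fill out your own, do you really have to live by copy and paste? Must an administrator spend huge amounts of time on original content and still fall behind the pace at which information on the Internet grows? There is one answer to all of these questions: collection. A program that automatically or semi-automatically collects the specific content you need and keeps your site up to date is probably what you have been wishing for. That is why collection programs exist.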

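What to collect?

That depends on the kind of site you run. For an image site, collect images; for a music site, collect mp3 files; for a news site, collect news; and so on. Everything follows from the content structure of your site. Once you have decided what to collect, you can write the matching collection program.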

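How to collect?

A collection program is normally written with a specific target in mind. You need target sites: find the sites that carry the content you want, analyze the HTML of each one, look for regular patterns, and write PHP code for the specific content you are after. Once you have collected what you need, store it however suits you: generate HTML pages directly, put it in a database for further processing, or save it in some other form for later use.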

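The collection workflow

The idea behind a collection program is simple and can be broken into roughly these steps:

1. Fetch the source of the remote file (with file_get_contents or fopen).

2. Parse the source to extract the content you want (regular expressions are used here; typically this first yields the pagination links).

3. Act on the extracted content: download it, store it in the database, and so on.

Step 2 may have to be repeated several times. For example, you first parse the pagination URLs and then parse the content of each inner page to reach what you actually want.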
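Step 2 can be sketched as follows. This is only a minimal illustration under my own assumptions: the helper name extractLinks, the regular expression, and the sample markup are made up for the example, and a real target page will need a pattern fitted to its own HTML.

```php
<?php
// Hypothetical helper: collect the href values of anchor tags in an HTML string.
function extractLinks(string $html): array
{
    $links = array();
    // Match <a ... href="..."> or href='...'; the "i" flag ignores case.
    if (preg_match_all('/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1/i', $html, $matches)) {
        // Drop duplicates and reindex from 0.
        $links = array_values(array_unique($matches[2]));
    }
    return $links;
}

// Sample markup standing in for a fetched list page (not real data).
$html = '<ul>'
      . '<li><a href="/news/1.html">First</a></li>'
      . '<li><a class="x" href=\'/news/2.html\'>Second</a></li>'
      . '<li><a href="/news/1.html">First again</a></li>'
      . '</ul>';

print_r(extractLinks($html));
```

The same function can be run again on each inner page, which is the "repeat step 2" case described above.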

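<?php
/* Common ways to fetch the source of a remote file (four are shown below) */
/* Method 1: fopen() with stream_context_create() */
$opts = array(
  'http' => array(
    'method' => "GET",
    'header' => "Accept-language: en\r\n" .
                "Cookie: foo=bar\r\n"
  )
);
$context = stream_context_create($opts);
$fp = fopen('http://www.example.com', 'r', false, $context);
if ($fp === false) {
    die("Open http://www.example.com failed");
}
fpassthru($fp);   // send the response body straight to the output
fclose($fp);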

/* Method 2: a raw socket */
function get_content_by_socket($url, $host){
    $fp = fsockopen($host, 80) or die("Open ". $url ." failed");
    $header = "GET /".$url ." HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    // No Accept-Encoding header is sent: asking for gzip returns a compressed
    // body, which this function does not decode and which looks like garbage.
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: ". $host ."\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n";
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    // Raw response: status line and headers first, then the body
    // (under HTTP/1.1 the body may use chunked transfer encoding).
    $contents = '';
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

/* Method 3: file_get_contents() with stream_context_create() */
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Content-Type: text/html; charset=utf-8"
    )
);
$context = stream_context_create($opts);
$file = file_get_contents('http://www.sohu.com/', false, $context);

/* Method 4: PHP cURL, see http://www.chinaz.com/program/2010/0119/104346.shtml */
// 1. Initialize the cURL handle
$ch = curl_init();
// 2. Set the options, including the URL
curl_setopt($ch, CURLOPT_URL, "http://www.sohu.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml; charset=utf-8", "Expect: 100-continue"));
// 3. Execute the request and get the HTML document
$output = curl_exec($ch);
var_dump($output);
// 4. Release the cURL handle
curl_close($ch);

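/* Notes
1. file_get_contents and fopen require allow_url_fopen to be enabled on the host. To enable it, edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, neither fopen nor file_get_contents can open remote files.
2. cURL requires the curl extension to be enabled on the host. On Windows, edit php.ini and remove the semicolon in front of extension=php_curl.dll; you also need to copy ssleay32.dll and libeay32.dll into C:/WINDOWS/system32. On Linux, install the curl extension.
*/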
?>
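On the garbled text mentioned at the start: two common causes are a body that arrives gzip-compressed and a page whose charset differs from the one you display it in. The sketch below is not necessarily what went wrong in my Sohu case; it assumes the source is either UTF-8 already or GBK, that the mbstring and zlib extensions are loaded, and the helper name toUtf8 is made up for the example.

```php
<?php
// Hypothetical helper: return $raw as UTF-8, assuming it is either UTF-8 already
// or encoded in $fallback (GBK here, an assumption about the source site).
function toUtf8(string $raw, string $fallback = 'GBK'): string
{
    // A gzip-compressed body starts with the magic bytes 0x1f 0x8b.
    if (strncmp($raw, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($raw);
        if ($decoded !== false) {
            $raw = $decoded;
        }
    }
    // Leave valid UTF-8 untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// The two characters "中文" encoded as GBK, standing in for a fetched page.
$gbk = "\xD6\xD0\xCE\xC4";

echo toUtf8($gbk), "\n";            // plain GBK input
echo toUtf8(gzencode($gbk)), "\n";  // the same bytes, gzip-compressed first
```

If the page declares its charset in a meta tag or in the Content-Type response header, reading it from there is more reliable than a fixed fallback.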

Sample collection programs

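/* An image download function */
function getimg($url, $filename){
    /* Stop if the image URL is empty */
    if ($url == "") {
        return false;
    }
    /* Take the extension (everything from the last dot) into $ext */
    $ext = strrchr($url, ".");
    if ($ext === false) {
        return false;
    }
    $ext = strtolower($ext);
    /* Accept only known image file types */
    if ($ext != ".gif" && $ext != ".jpg") {
        return false;
    }
    /* Read the image */
    $img = file_get_contents($url);
    if ($img === false) {
        return false;
    }
    /* Open the target file in binary write mode */
    $fp = @fopen($filename . $ext, "wb");
    if ($fp === false) {
        return false;
    }
    /* Write the image to the target file */
    fwrite($fp, $img);
    /* Close the file */
    fclose($fp);
    /* Return the new file name of the image */
    return $filename . $ext;
}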
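Taking the extension with strrchr works for plain image URLs but not for URLs that carry a query string. A possible alternative, sketched under my own assumptions (the helper names imageExt and isAllowedImage and the allowed list are made up for the example), is to look only at the path part of the URL:

```php
<?php
// Hypothetical helper: lower-case extension of the path part of a URL, '' if none.
function imageExt(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

// Hypothetical helper: true when the URL path ends in one of the allowed extensions.
function isAllowedImage(string $url, array $allowed = array('gif', 'jpg', 'jpeg', 'png')): bool
{
    return in_array(imageExt($url), $allowed, true);
}

var_dump(isAllowedImage('http://www.example.com/a/b.JPG?size=large'));   // bool(true)
var_dump(isAllowedImage('http://www.example.com/download.php?f=x.jpg')); // bool(false)
```

An extension check only filters by name. Whether the downloaded bytes really are an image is a separate question that this sketch does not answer.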

A PHP image collection program

/**
*  PHP image collection program
*
*  Copyright(c) 2008 by 小超(ccxxcc) All rights reserved
*
*  To contact the author write to {@link mailto:ucitmc@163.com}
*
* @author ccxxcc
* @version $Id: {filename},v 1.0 {time} $
* @package system
*/

set_time_limit(0);
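/**
* Write a file
* @param    string  $file   file path
* @param    string  $str    content to write
* @param    string  $mode   write mode
* @return   bool    true on success, false if the file cannot be opened
*/
function wfile($file, $str, $mode = 'w')
{
    $oldmask = @umask(0);
    $fp = @fopen($file, $mode);
    if (!$fp)
    {
        @umask($oldmask);
        return false;
    }
    @flock($fp, LOCK_EX);   // hold an exclusive lock while writing
    @fwrite($fp, $str);
    @flock($fp, LOCK_UN);
    @fclose($fp);
    @umask($oldmask);
    return true;
}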

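/**
* Copy a remote (or local) file to a local path
* @return   int   1 on success, 0 if the target cannot be opened, -1 if the source cannot be opened
*/
function savetofile($path_get, $path_save)
{
    $hdl_read = @fopen($path_get, 'rb');
    if ($hdl_read === false)
    {
        echo("$path_get can not get");
        return -1;
    }
    $hdl_write = @fopen($path_save, 'wb');
    if ($hdl_write === false)
    {
        fclose($hdl_read);
        return 0;
    }
    // Copy in 8 KB chunks so large files are never held in memory at once
    while (!feof($hdl_read))
    {
        fwrite($hdl_write, fread($hdl_read, 8192));
    }
    fclose($hdl_write);
    fclose($hdl_read);
    return 1;
}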

function getExt($path)
{
    $info = pathinfo($path);
    // A path without a file extension yields an empty string
    return isset($info['extension']) ? strtolower($info['extension']) : '';
}

/**
* Create the directories along the given path
*
* @param    string     $path    path
*/
function mkDirs($path)
{
    $adir = explode('/', $path);
    $dirlist = '';
    $rootdir = array_shift($adir);
    if (($rootdir != '.' && $rootdir != '..') && !file_exists($rootdir))
    {
        @mkdir($rootdir);
    }
    foreach ($adir as $key => $val)
    {
        // Skip empty segments (for example from a trailing slash) and dot entries
        if ($val != '' && $val != '.' && $val != '..')
        {
            $dirlist .= "/" . $val;
            $dirpath = $rootdir . $dirlist;
            if (!file_exists($dirpath))
            {
                @mkdir($dirpath);
                @chmod($dirpath, 0777);
            }
        }
    }
}

/**
* Read a text file into a one-dimensional array (one non-empty line per element)
*
* @param    string     $file_path    path of the text file
*/
function getFileListData($file_path)
{
    $arr = @file($file_path);
    $data = array();
    if (is_array($arr) && !empty($arr))
    {
        foreach ($arr as $val)
        {
            $item = trim($val);
            if (!empty($item))
            {
                $data[] = $item;
            }
        }
    }
    return $data;
}

// Start collecting

// Pass in the name of a text file that lists the image URLs to collect, one URL per line.
// basename() keeps the request from pointing at files outside txt/.
$url_file = isset($_GET['file']) && !empty($_GET['file']) ? basename($_GET['file']) : null;
$txt_url = "txt/" . $url_file;

$urls = array_unique(getFileListData($txt_url));
if (empty($urls))
{
    echo('

No link addresses

');
    die();
}
$save_url = "images/" . date("y_m_d", time()) . "/";
mkDirs($save_url);   // create a folder named after the date
$i = 1;
if (is_array($urls) && count($urls))
{
    foreach ($urls as $val)
    {
        savetofile($val, $save_url . date("His", time()) . "_" . $i . "." . getExt($val));
        echo($i . "." . getExt($val) . " got\n");
        $i++;
    }
}

echo('

finish

');

?>
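For the dated folder, a shorter route than walking the path segment by segment is the recursive flag of mkdir(). A minimal sketch, assuming the helper name ensureDir and a throwaway directory under the system temp folder (both made up for the example):

```php
<?php
// Hypothetical helper: create $path and any missing parents;
// returns true if the directory exists afterwards.
function ensureDir(string $path, int $mode = 0777): bool
{
    if (is_dir($path)) {
        return true;
    }
    // The third argument asks mkdir() to create nested directories in one call.
    // The is_dir() fallback covers another process creating it at the same moment.
    return @mkdir($path, $mode, true) || is_dir($path);
}

// Demo location under the system temp directory, so nothing real is touched.
$base = sys_get_temp_dir() . '/collect_demo_' . uniqid();
$dir  = $base . '/images/' . date('y_m_d');

var_dump(ensureDir($dir));
```

The effective permissions still depend on the process umask, which is why the program above also calls chmod on each directory it creates.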

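Besides the methods above, Snoopy is another good option.

What is Snoopy?

Snoopy is a PHP class that imitates the behavior of a web browser. It can fetch page content and submit forms.

Some of Snoopy's features:

* Fetches the content of a web page easily

* Fetches the text of a web page easily (with the HTML tags stripped)

* Fetches the links of a web page easily

* Supports proxy hosts

* Supports basic username/password authentication

* Supports setting user_agent, referer, cookies and header content

* Follows browser redirects and lets you limit the redirect depth

* Expands the links in a page into fully qualified URLs (default)

* Submits data and retrieves the response easily

* Follows HTML frames (added in v0.92)

* Passes cookies along on redirects (added in v0.92)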

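Lessons learned

Here are some lessons from my own collection work:

1. Skip sites that use hotlink protection. You can fake the referer, but collecting from such sites costs too much.

2. Collect from sites that respond quickly, and preferably run the collection locally.

3. It often helps to store part of the data in the database first and process it in a later step.

4. Always handle errors during collection. I usually skip an item after three failed attempts. I used to get stuck retrying forever because of a single item that could not be fetched.

5. Always validate before storing: check that the content is legitimate and filter out unnecessary strings.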
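Tips 4 and 5 can be sketched roughly like this. The helper names fetchWithRetry and cleanText, and the stand-in fetcher that fails twice, are my own assumptions for the example; in real use the fetcher could wrap file_get_contents or cURL.

```php
<?php
// Hypothetical helper: call $fetch($url) up to $maxTries times.
// Returns the fetched string, or null to signal "skip this item".
function fetchWithRetry(callable $fetch, string $url, int $maxTries = 3): ?string
{
    for ($try = 1; $try <= $maxTries; $try++) {
        $result = $fetch($url);
        if (is_string($result) && $result !== '') {
            return $result;
        }
    }
    return null;
}

// Hypothetical cleaner to run before storing: strip tags, collapse whitespace, trim.
function cleanText(string $html): string
{
    $text = strip_tags($html);
    // preg_replace() returns null on invalid UTF-8; fall back to the stripped text.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($text);
}

// Stand-in fetcher that fails twice and then succeeds, so no network is needed.
$calls = 0;
$flaky = function (string $url) use (&$calls) {
    $calls++;
    return $calls < 3 ? false : "<p>Hello,\n  <b>world</b></p>";
};

$page = fetchWithRetry($flaky, 'http://www.example.com/item/1');
echo $page === null ? "skipped\n" : cleanText($page) . "\n";
```

Returning null instead of looping forever lets the caller log the URL and move on to the next item.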
