Home > Backend Development > PHP Tutorial > How to implement a simple crawler in PHP, implement a crawler in PHP_PHP tutorial

How to implement a simple crawler in PHP, implement a crawler in PHP_PHP tutorial

WBOY
Release: 2016-07-13 09:45:47
Original
890 people have browsed it

How to implement a simple crawler in PHP, how to implement a crawler in PHP

The example in this article describes how to implement a simple crawler in PHP. Share it with everyone for your reference. The details are as follows:

<&#63;php
/**
 * 爬虫程序 -- 原型
 *
 * 从给定的url获取html内容
 * 
 * @param string $url 
 * @return string 
 */
function _getUrlContent($url) {
  $handle = fopen($url, "r");
  if ($handle) {
    $content = stream_get_contents($handle, 1024 * 1024);
    return $content;
  } else {
    return false;
  } 
} 
/**
 * 从html内容中筛选链接
 * 
 * @param string $web_content 
 * @return array 
 */
function _filterUrl($web_content) {
  $reg_tag_a = '/<[a|A].*&#63;href=[\'\"]{0,1}([^>\'\"\ ]*).*&#63;>/';
  $result = preg_match_all($reg_tag_a, $web_content, $match_result);
  if ($result) {
    return $match_result[1];
  } 
} 
/**
 * 修正相对路径
 * 
 * @param string $base_url 
 * @param array $url_list 
 * @return array 
 */
function _reviseUrl($base_url, $url_list) {
  $url_info = parse_url($base_url);
  $base_url = $url_info["scheme"] . '://';
  if ($url_info["user"] && $url_info["pass"]) {
    $base_url .= $url_info["user"] . ":" . $url_info["pass"] . "@";
  } 
  $base_url .= $url_info["host"];
  if ($url_info["port"]) {
    $base_url .= ":" . $url_info["port"];
  } 
  $base_url .= $url_info["path"];
  print_r($base_url);
  if (is_array($url_list)) {
    foreach ($url_list as $url_item) {
      if (preg_match('/^http/', $url_item)) {
        // 已经是完整的url
        $result[] = $url_item;
      } else {
        // 不完整的url
        $real_url = $base_url . '/' . $url_item;
        $result[] = $real_url;
      } 
    } 
    return $result;
  } else {
    return;
  } 
} 
/**
 * 爬虫
 * 
 * @param string $url 
 * @return array 
 */
function crawler($url) {
  $content = _getUrlContent($url);
  if ($content) {
    $url_list = _reviseUrl($url, _filterUrl($content));
    if ($url_list) {
      return $url_list;
    } else {
      return ;
    } 
  } else {
    return ;
  } 
} 
/**
 * 测试用主程序
 */
function main() {
  $current_url = "http://hao123.com/"; //初始url
  $fp_puts = fopen("url.txt", "ab"); //记录url列表
  $fp_gets = fopen("url.txt", "r"); //保存url列表
  do {
    $result_url_arr = crawler($current_url);
    if ($result_url_arr) {
      foreach ($result_url_arr as $url) {
        fputs($fp_puts, $url . "\r\n");
      } 
    } 
  } while ($current_url = fgets($fp_gets, 1024)); //不断获得url
} 
main();
&#63;>

Copy after login

I hope this article will be helpful to everyone’s PHP programming design.

www.bkjia.comtruehttp: //www.bkjia.com/PHPjc/1039190.htmlTechArticleHow to implement a simple crawler in PHP, how to implement a crawler in PHP. This article describes how to implement a simple crawler in PHP. Share it with everyone for your reference. The details are as follows: php/** * Crawler program -- prototype...
Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template