Example of how to use PHP to crawl Baidu Reading

黄舟

Released: 2023-03-05 22:40:02

Preface

This article shows how to use PHP to crawl Baidu Reading. Without further ado, let's take a look.

The crawling method is as follows

First, open the reading page in a browser and view the source. The text of the novel is not written directly into the page, which means it is loaded asynchronously.

So I switched Chrome's developer tools to the Network tab and refreshed the reading page, focusing on two categories of requests: XHR and Script.

After some investigation, I found a jsonp request under the Script category that looked like the content of the novel. The requested address was http://www.php.cn/ and the response was a jsonp string. I then found that if you remove callback=wenku7 from the address, a json string is returned instead. That is much easier to parse, because it can be converted directly to an array in PHP.
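The two ways of getting from jsonp to json can be sketched as follows. This is only an illustration: the helper names stripCallbackParam and unwrapJsonp, the example.com address and the sample payload are made up for the example and are not taken from Baidu's actual responses.

```php
<?php
// Hypothetical helper: drop the "callback" query parameter from a URL so that
// the server (as observed above) answers with plain JSON instead of JSONP.
// Note: any #fragment in the URL is discarded by this sketch.
function stripCallbackParam(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // no query string, nothing to strip
    }
    parse_str($parts['query'], $query);
    unset($query['callback']);

    $base = strstr($url, '?', true); // everything before the query string
    return $query ? $base . '?' . http_build_query($query, '', '&') : $base;
}

// Hypothetical helper: if the body is wrapped as name(...), decode only the
// part between the parentheses; otherwise treat it as plain JSON.
function unwrapJsonp(string $body)
{
    if (preg_match('/^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$/s', $body, $m)) {
        return json_decode($m[1], true);
    }
    return json_decode($body, true);
}

// Made-up address and payload, for illustration only.
echo stripCallbackParam('http://example.com/content/abc/?m=token&type=json&cn=1&callback=wenku7'), "\n";
var_dump(unwrapJsonp('wenku7({"t":"p","c":"hello"})'));
```

Removing the parameter is the simpler of the two, since the response can then go straight into json_decode.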

Let's analyze the structure of the returned data. The returned json string has a tree-like structure. Each node has a t attribute and a c attribute. The t attribute gives the tag of the node, such as h2 or p. The c attribute is the content, which takes one of two forms: a string, or an array in which each element is itself a node.

This kind of structure is easy to parse: a single recursive function is enough.
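To make the structure concrete, here is a small hand-written tree in the same t/c shape, with a minimal recursive walk over it. The sample data and the function name flattenNode are invented for illustration; real responses contain more node types than this.

```php
<?php
// Hand-written sample in the t/c shape described above (not real Baidu data).
$sample = [
    't' => 'div',
    'c' => [
        ['t' => 'h2', 'c' => 'Chapter 1'],
        ['t' => 'p', 'c' => [
            ['t' => 'span', 'c' => 'Hello, '],
            ['t' => 'span', 'c' => 'world.'],
        ]],
    ],
];

// Concatenate the text of a node and all of its descendants.
function flattenNode(array $node): string
{
    $content = $node['c'] ?? '';
    if (is_string($content)) {
        return $content;              // leaf: c is the text itself
    }
    if (!is_array($content)) {
        return '';                    // unexpected type: contribute nothing
    }
    $text = '';
    foreach ($content as $child) {    // branch: c is a list of child nodes
        $text .= flattenNode($child);
    }
    return $text;
}

echo flattenNode($sample), "\n"; // prints "Chapter 1Hello, world."
```

This walk ignores the t attribute entirely, so the output has no line breaks. Adding them per tag is a matter of switching on t after the children have been processed.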

The final code is as follows:

<?php
class BaiduYuedu {
    protected $bookId;
    protected $bookToken;
    protected $cookie;
    protected $result;

    public function __construct($bookId, $bookToken, $cookie){
        $this->bookId = $bookId;
        $this->bookToken = $bookToken;
        $this->cookie = $cookie;
    }

    // Recursively convert a node and its children to plain text.
    public static function parseNode($node){
        $str = '';
        // c is either a string or an array of child nodes; it may be absent.
        $content = isset($node['c']) ? $node['c'] : null;
        if(is_string($content)){
            $str .= $content;
        }else if(is_array($content)){
            foreach($content as $d){
                $str .= self::parseNode($d);
            }
        }
        // Append line breaks according to the tag of the node.
        switch($node['t']){
            case 'h2':
                $str .= "\n\n";
                break;
            case 'br':
            case 'p':
                $str .= "\n";
                break;
            case 'img':
            case 'span':
                break;
            case 'obj':
                $tmp = '(' . self::parseNode($node['data'][0]) . ')';
                $str .= str_replace("\n", '', $tmp);
                break;
            default:
                trigger_error('Unknown type:'.$node['t'], E_USER_WARNING);
                break;
        }
        return $str;
    }

    // Fetch one page, then recurse to the next until an empty response is returned.
    public function get($page = 1){
        echo "getting page {$page}...\n";
        $ch = curl_init();
        $url = sprintf('http://wenku.baidu.com/content/%s/?m=%s&type=json&cn=%d', $this->bookId, $this->bookToken, $page);
        curl_setopt_array($ch, array(
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_HEADER         => 0,
            CURLOPT_HTTPHEADER     => array('Cookie: '. $this->cookie)
        ));
        $ret = json_decode(curl_exec($ch), true);
        curl_close($ch);
        $str = '';
        if(!empty($ret)){
            $str .= self::parseNode($ret);
            $str .= $this->get($page + 1);
        }
        return $str;
    }

    public function start(){
        $this->result = $this->get();
    }

    public function getResult(){
        return $this->result;
    }

    public function saveTo($path){
        if(empty($this->result)){
            trigger_error('Result is empty', E_USER_ERROR);
            return;
        }
        file_put_contents($path, $this->result);
        echo "save to {$path}\n";
    }
}

// Usage example
$yuedu = new BaiduYuedu('49422a3769eae009581becba', '8ed1dedb240b11bf0731336eff95093f', 'your Baidu domain cookie');
$yuedu->start();
$yuedu->saveTo('result.txt');

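The get() method recurses once per page, so its call depth grows with the length of the book. As a possible alternative, the same stop-on-empty logic can be written as a loop. The sketch below is not part of the original class: collectPages, its two callables and the $maxPages cap are all hypothetical, and the fetch step is replaced by fake in-memory data so that the example runs without network access.

```php
<?php
// Collect pages until the fetcher returns an empty result.
// $fetchPage stands in for the cURL + json_decode step, $parse for parseNode;
// $maxPages is an added safety cap, not something the original code has.
function collectPages(callable $fetchPage, callable $parse, int $maxPages = 1000): string
{
    $text = '';
    for ($page = 1; $page <= $maxPages; $page++) {
        $node = $fetchPage($page);
        if (empty($node)) {
            break; // same stop condition as get(): an empty response ends the book
        }
        $text .= $parse($node);
    }
    return $text;
}

// Fake two-page "book" for illustration only.
$pages = [
    1 => ['t' => 'p', 'c' => 'one'],
    2 => ['t' => 'p', 'c' => 'two'],
];

$result = collectPages(
    function (int $page) use ($pages) { return $pages[$page] ?? null; },
    function (array $node): string { return $node['c'] . "\n"; }
);

echo $result; // prints "one" and "two" on separate lines
```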

The first two parameters of this class can be obtained from the introduction page of the novel. The first parameter, bookId, is the string that follows ebook in the url. The second parameter, bookToken, is found by searching the page source for bdjsonUrl: it is the string after the corresponding parameter in that address, which the class sends as the m query parameter.
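Reading bookId out of the url can also be done in code. The following is a minimal sketch under one assumption: that the introduction page address contains a path segment ebook followed by the id. The function name extractBookId and the example.com address are hypothetical; only the id itself is taken from the usage example.

```php
<?php
// Hypothetical helper: return the path segment that follows "ebook" in a URL,
// or null if the URL has no such segment.
function extractBookId(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // malformed URL or no path component
    }
    if (preg_match('~/ebook/([^/]+)~', $path, $m)) {
        return $m[1];
    }
    return null;
}

// Assumed address layout, for illustration only.
var_dump(extractBookId('http://example.com/ebook/49422a3769eae009581becba?fr=index'));
```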

Note: if the Baidu cookie is not passed in, or is invalid, only the free-to-read part can be captured. To capture the complete content, make sure the cookie is valid.

Summary

The above is an example of how to use PHP to crawl Baidu Reading. For more related content, see the PHP Chinese website (www.php.cn).