Example of how to use PHP to crawl Baidu Reading

黄舟
Release: 2023-03-05 22:40:02

Preface

This article shows how to use PHP to crawl Baidu Reading. Let's get straight to it.

The crawling method is as follows

First, open the reading page in the browser and view the source code. The content of the novel is not written directly into the page; in other words, it is loaded asynchronously.

So I switched Chrome's developer tools to the Network tab and refreshed the reading page, focusing on the XHR and script categories.

After some investigation, I found a jsonp request under the script category that looked like the content of the novel. The requested address was
http://www.php.cn/
The response was a jsonp string. I then found that if you remove callback=wenku7 from the address, a json string is returned instead, which is much easier to parse and can be converted directly to an array in PHP.
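
As a rough illustration of the step above, the sketch below strips the callback parameter from a URL and decodes a JSON body into an array. The helper name removeQueryParam, the example.com address, and the sample JSON are all made up for illustration; only the callback=wenku7 parameter comes from the article.

```php
<?php
// Hypothetical helper (not from the original article): drop one query
// parameter, such as callback, from an absolute URL.
// Simplification: port, user info and fragment are not preserved.
function removeQueryParam(string $url, string $name): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'], $parts['query'])) {
        return $url; // nothing we can safely strip
    }
    parse_str($parts['query'], $query);
    unset($query[$name]);

    $base = $parts['scheme'] . '://' . $parts['host'] . ($parts['path'] ?? '');
    return $query ? $base . '?' . http_build_query($query) : $base;
}

// Made-up address, used only to show the transformation.
$jsonpUrl = 'http://example.com/content/abc/?m=123&type=json&cn=1&callback=wenku7';
$jsonUrl = removeQueryParam($jsonpUrl, 'callback');
echo $jsonUrl, "\n";

// Without the callback wrapper the body is plain JSON, so json_decode()
// with true as the second argument returns an associative array.
$data = json_decode('{"t":"p","c":"hello"}', true);
echo $data['c'], "\n";
```

Decoding to an associative array (rather than objects) keeps the later recursive parsing to plain array access.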

Let's analyze the structure of the returned data. The returned json string is a tree-like structure. Each node has a t attribute and a c attribute. The t attribute indicates the tag of the node, such as h2 or p. The c attribute is the content, which takes one of two forms: a string, or an array in which each element is itself a node.

This kind of structure is easy to parse; a single recursive function handles it.
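
To make the shape concrete, here is a small sketch with made-up sample data that follows the t/c description above. The nodeText helper and the div tag are assumptions for illustration, not part of the real response; the recursion simply concatenates text depth-first.

```php
<?php
// Made-up sample data: each node has a tag 't' and content 'c' that is
// either a string or a list of child nodes.
$tree = [
    't' => 'div',
    'c' => [
        ['t' => 'h2', 'c' => 'Chapter 1'],
        ['t' => 'p', 'c' => [
            ['t' => 'span', 'c' => 'Hello, '],
            ['t' => 'span', 'c' => 'world.'],
        ]],
    ],
];

// Hypothetical helper: collect the text of a node depth-first.
function nodeText(array $node): string
{
    $content = $node['c'] ?? '';
    if (is_string($content)) {
        return $content;               // leaf: the content is the text
    }
    $text = '';
    if (is_array($content)) {
        foreach ($content as $child) { // branch: recurse into each child
            $text .= nodeText($child);
        }
    }
    return $text;
}

echo nodeText($tree), "\n"; // prints: Chapter 1Hello, world.
```

The class in the final code follows the same recursion and additionally appends line breaks depending on the tag.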

The final code is as follows:

<?php
class BaiduYuedu {
    protected $bookId;
    protected $bookToken;
    protected $cookie;
    protected $result;

    public function __construct($bookId, $bookToken, $cookie){
        $this->bookId = $bookId;
        $this->bookToken = $bookToken;
        $this->cookie = $cookie;
    }

    // Recursively flatten one node of the returned tree into plain text.
    public static function parseNode($node){
        $str = '';
        if(is_string($node['c'])){
            // Leaf node: the content is the text itself.
            $str .= $node['c'];
        }else if(is_array($node['c'])){
            // Branch node: every element is a child node.
            foreach($node['c'] as $d){
                $str .= self::parseNode($d);
            }
        }
        // Append line breaks according to the node's tag.
        switch($node['t']){
            case 'h2':
                $str .= "\n\n";
                break;
            case 'br':
            case 'p':
                $str .= "\n";
                break;
            case 'img':
            case 'span':
                break;
            case 'obj':
                $tmp = '(' . self::parseNode($node['data'][0]) . ')';
                $str .= str_replace("\n", '', $tmp);
                break;
            default:
                trigger_error('Unknown type:'.$node['t'], E_USER_WARNING);
                break;
        }
        return $str;
    }

    // Fetch one page, then recurse to the next until a response is empty.
    public function get($page = 1){
        echo "getting page {$page}...\n";
        $ch = curl_init();
        $url = sprintf('http://wenku.baidu.com/content/%s/?m=%s&type=json&cn=%d', $this->bookId, $this->bookToken, $page);
        curl_setopt_array($ch, array(
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_HEADER         => 0,
            CURLOPT_HTTPHEADER     => array('Cookie: '. $this->cookie)
        ));
        // Decode the JSON body into an associative array.
        $ret = json_decode(curl_exec($ch), true);
        curl_close($ch);
        $str = '';
        if(!empty($ret)){
            $str .= self::parseNode($ret);
            $str .= $this->get($page + 1);
        }
        return $str;
    }

    public function start(){
        $this->result = $this->get();
    }

    public function getResult(){
        return $this->result;
    }

    public function saveTo($path){
        if(empty($this->result)){
            trigger_error('Result is empty', E_USER_ERROR);
            return;
        }
        file_put_contents($path, $this->result);
        echo "save to {$path}\n";
    }
}

// Usage example
$yuedu = new BaiduYuedu('49422a3769eae009581becba', '8ed1dedb240b11bf0731336eff95093f', 'your Baidu domain cookie');
$yuedu->start();
$yuedu->saveTo('result.txt');



The first two parameters of this class can be obtained from the novel's introduction page. The first parameter, bookId, is the string that follows ebook in the url. The second parameter, bookToken, is found by searching the page source code for bdjsonUrl; it is the string after the m parameter.
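
A minimal sketch of pulling both values out with regular expressions is shown below. The helper names, the example.com addresses, and the exact shape of the page source around bdjsonUrl are assumptions for illustration; the real pages may format these differently, so the patterns would likely need adjusting.

```php
<?php
// Hypothetical helpers; the URL and page-source shapes used here are
// assumed for illustration and may not match the real pages.
function extractBookId(string $url): ?string
{
    // Take the path segment that follows "ebook/".
    return preg_match('#/ebook/([0-9a-zA-Z]+)#', $url, $m) ? $m[1] : null;
}

function extractBookToken(string $html): ?string
{
    // Find bdjsonUrl, then the value of the first m= parameter after it.
    return preg_match('/bdjsonUrl.*?[?&]m=([0-9a-zA-Z]+)/s', $html, $m) ? $m[1] : null;
}

// Made-up inputs that reuse the sample values from the usage example.
$url = 'http://example.com/ebook/49422a3769eae009581becba?fr=index';
$html = '<script>var bdjsonUrl = "http://example.com/content/x/?m=8ed1dedb240b11bf0731336eff95093f&type=json";</script>';

var_dump(extractBookId($url));
var_dump(extractBookToken($html));
```

Both helpers return null when nothing matches, so a caller can stop early instead of requesting pages with an empty id or token.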

Note: If the Baidu cookie is not passed in, or the cookie is invalid, only the free-to-read portion can be captured. To capture the complete content, make sure the cookie is valid.

Summary

The above is an example of how to use PHP to crawl Baidu Reading. For more related content, see the PHP Chinese website (www.php.cn).

source:php.cn