Preface
This article shows how to use PHP to scrape the text of a novel from Baidu Reading.
The crawling method is as follows
First, open the reading page in a browser and view its source. The novel's text is not written directly into the page, which means the content is loaded asynchronously.
So I switched Chrome's developer tools to the Network tab and refreshed the reading page, focusing on the XHR and Script categories.
Under the Script category there was a jsonp request whose response looked like the content of the novel. The requested address was
http://www.php.cn/
The response was a jsonp string. I then found that if you remove callback=wenku7 from the address, a plain json string is returned instead. That is much easier to parse, because PHP can convert it directly to an array.
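As a rough illustration of that step, the sketch below drops a query parameter from a URL and decodes a JSON body into a PHP array. The helper name removeQueryParam, the example.com URL, and the sample JSON are assumptions made for this example; they are not the real endpoint or a real response.

```php
<?php
// Hypothetical helper: remove one query parameter (e.g. "callback") from a URL.
function removeQueryParam(string $url, string $name): string
{
    $parts = parse_url($url);
    $query = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }
    unset($query[$name]);

    // Rebuild scheme, host and path, then re-append the remaining parameters.
    $result = $parts['scheme'] . '://' . $parts['host'] . ($parts['path'] ?? '');
    if (!empty($query)) {
        $result .= '?' . http_build_query($query);
    }
    return $result;
}

// Placeholder URL, not the real request address.
$url = 'http://example.com/content/abc/?m=123&type=json&cn=1&callback=wenku7';
echo removeQueryParam($url, 'callback'), "\n";

// Made-up JSON body; passing true as the second argument yields an array.
$node = json_decode('{"t":"p","c":"Hello"}', true);
echo $node['t'], ': ', $node['c'], "\n";
```

Passing true as the second argument of json_decode is what produces associative arrays rather than objects.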
Now let's analyze the structure of the returned data. The returned json string describes a tree. Each node has a t attribute and a c attribute. The t attribute gives the node's tag, such as h2 or p. The c attribute is the content, which is either a string or an array whose elements are themselves nodes.
A structure like this is easy to parse; a single recursive function is enough.
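To make the recursion concrete, here is a minimal sketch that only collects text and ignores the tags. The function name nodeText and the sample tree (including the div tag) are made up for illustration; real responses may contain other tags and attributes.

```php
<?php
// Collect the text of a node and all of its descendants.
function nodeText(array $node): string
{
    $content = $node['c'] ?? '';
    if (is_string($content)) {
        // Leaf node: c is the text itself.
        return $content;
    }
    $text = '';
    if (is_array($content)) {
        // Inner node: c is a list of child nodes, so recurse into each one.
        foreach ($content as $child) {
            $text .= nodeText($child);
        }
    }
    return $text;
}

// Made-up sample tree using the t / c layout described above.
$sample = [
    't' => 'div',
    'c' => [
        ['t' => 'h2', 'c' => 'Chapter 1'],
        ['t' => 'p', 'c' => [
            ['t' => 'span', 'c' => 'Hello, '],
            ['t' => 'span', 'c' => 'world.'],
        ]],
    ],
];
echo nodeText($sample), "\n"; // prints "Chapter 1Hello, world."
```

The output has no line breaks because this sketch ignores t entirely; adding a newline per tag is a separate step layered on top of the same recursion.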
The final code is as follows:
<?php
class BaiduYuedu
{
    protected $bookId;
    protected $bookToken;
    protected $cookie;
    protected $result;

    public function __construct($bookId, $bookToken, $cookie)
    {
        $this->bookId = $bookId;
        $this->bookToken = $bookToken;
        $this->cookie = $cookie;
    }

    // Recursively convert a node and its children to plain text.
    public static function parseNode($node)
    {
        $str = '';
        if (is_string($node['c'])) {
            $str .= $node['c'];
        } else if (is_array($node['c'])) {
            foreach ($node['c'] as $d) {
                $str .= self::parseNode($d);
            }
        }
        // Append line breaks depending on the node's tag.
        switch ($node['t']) {
            case 'h2':
                $str .= "\n\n";
                break;
            case 'br':
            case 'p':
                $str .= "\n";
                break;
            case 'img':
            case 'span':
                break;
            case 'obj':
                $tmp = '(' . self::parseNode($node['data'][0]) . ')';
                $str .= str_replace("\n", '', $tmp);
                break;
            default:
                trigger_error('Unknown type:' . $node['t'], E_USER_WARNING);
                break;
        }
        return $str;
    }

    // Fetch one page, then recurse to the next until an empty response comes back.
    public function get($page = 1)
    {
        echo "getting page {$page}...\n";
        $ch = curl_init();
        // Fixed: the property is $bookToken, not $token.
        $url = sprintf('http://wenku.baidu.com/content/%s/?m=%s&type=json&cn=%d', $this->bookId, $this->bookToken, $page);
        curl_setopt_array($ch, array(
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_HEADER => 0,
            CURLOPT_HTTPHEADER => array('Cookie: ' . $this->cookie)
        ));
        $ret = json_decode(curl_exec($ch), true);
        curl_close($ch);
        $str = '';
        if (!empty($ret)) {
            $str .= self::parseNode($ret);
            $str .= $this->get($page + 1);
        }
        return $str;
    }

    public function start()
    {
        $this->result = $this->get();
    }

    public function getResult()
    {
        return $this->result;
    }

    public function saveTo($path)
    {
        if (empty($this->result)) {
            trigger_error('Result is empty', E_USER_ERROR);
            return;
        }
        file_put_contents($path, $this->result);
        echo "save to {$path}\n";
    }
}

// Usage example
$yuedu = new BaiduYuedu('49422a3769eae009581becba', '8ed1dedb240b11bf0731336eff95093f', 'your Baidu domain cookie');
$yuedu->start();
$yuedu->saveTo('result.txt');
The first two parameters of this class can be obtained from the novel's introduction page. The first parameter, bookId, is the string that follows ebook in the page url. The second parameter, bookToken, is found by searching the page source for bdjsonUrl; it is the value of the m parameter there.
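The lookup described above could be automated along these lines. This is only a sketch: the function names, the example.com addresses, and the exact shape of the url and of the bdjsonUrl line are assumptions, so the patterns may need adjusting against the real page (for example, if the page source escapes & or / characters).

```php
<?php
// Hypothetical helper: take the path segment that follows "ebook/" in the url.
function extractBookId(string $introUrl): ?string
{
    if (preg_match('#/ebook/([0-9a-zA-Z]+)#', $introUrl, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: find "bdjsonUrl", then the first m= parameter after it.
// Assumes the parameter appears literally as "?m=" or "&m=" in the page source.
function extractBookToken(string $html): ?string
{
    if (preg_match('/bdjsonUrl.*?[?&]m=([0-9a-zA-Z]+)/s', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Made-up inputs; only the id and token values come from the usage example.
$introUrl = 'http://example.com/ebook/49422a3769eae009581becba';
$html = 'var bdjsonUrl = "http://example.com/content/x/?m=8ed1dedb240b11bf0731336eff95093f&type=json";';

echo extractBookId($introUrl), "\n";
echo extractBookToken($html), "\n";
```

Both helpers return null when nothing matches, so a caller can stop early instead of sending a request with an empty id or token.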
Note: if no Baidu cookie is passed in, or the cookie is invalid, only the free-to-read part can be captured. To capture the complete content, make sure the cookie is valid.
Summary
The above is an example of how to use PHP to crawl Baidu Reading.