ThinkPHP5 and QueryList implement the page collection function (crawler)

藏色散人

Released: 2020-01-28 13:57:27

What is QueryList?

QueryList is a PHP toolkit for content collection (web scraping). It takes a more modern design approach, with simple, elegant syntax and strong extensibility. Where traditional collection code relies on obscure regular expressions, QueryList selects content with CSS selectors. This lowers the barrier to collecting content in PHP and makes the collection code easier to read and maintain.
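The difference between the two approaches can be seen without installing anything. The following is a minimal sketch, not QueryList code: the HTML snippet is made up, and because PHP's built-in DOM extension does not accept CSS selectors, an XPath query stands in for the selector `h1`.

```php
<?php
// Sketch only: the HTML below is invented for illustration, and XPath
// (built into PHP) stands in for the CSS selectors that QueryList accepts.
$html = '<html><body><div class="content">'
      . '<h1 class="title">Hello <em>QueryList</em></h1>'
      . '</div></body></html>';

// Regex approach: the pattern has to anticipate attributes and nested tags.
preg_match('/<h1[^>]*>(.*?)<\/h1>/s', $html, $m);
$viaRegex = strip_tags($m[1]);

// DOM approach: parse once, then ask for the element by name.
libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$viaDom = $xpath->query('//h1')->item(0)->textContent;
```

Both variables end up holding the heading text, but only the regex version needs to change when the markup around the heading changes.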

QueryList provides a complete content collection solution:

● DOM content selection: CSS selectors

● HTTP client: GuzzleHTTP

● Content filtering: CSS selectors

● Garbled characters (encoding problems): several built-in solutions

● Additional features: a rich set of extension plug-ins

Prerequisites

The project is built on the ThinkPHP5 framework and relies mainly on two files, `QueryList.php` and `phpQuery.php`. Switch to the project directory, create a new directory named QL under extend, and then run the following composer command inside the QL directory to install QueryList:

composer require jaeger/querylist

Then add `use QL\QueryList;` to the controller that needs it and write the collection code in that controller. The following is an example:

// Target page to collect
$page = 'http://cms.querylist.cc/news/566.html';
// Collection rules
$reg = array(
   // Collect the article title
   'title' => array('h1','text'),
   // Collect the publication date; QueryList's filter feature removes the span and a tags
   'date' => array('.pt_info','text','-span -a',function($content){
       // Use a callback to extract just the date
       $arr = explode(' ',$content);
       return $arr[0];
   }),
   // Collect the article body; the filter strips hyperlinks but keeps their text,
   // and removes the copyright block, JS code, and other useless content
   'content' => array('.post_content','html','a -.content_copyright -script',function($content){
       // Use a callback to download the images in the article and replace their paths with local paths
       // To run this example, make sure the current directory contains an image folder with write permission
       // QueryList is built on phpQuery, so phpQuery can be used anywhere; regular expressions or other approaches would work here too

       $doc=\phpQuery::newDocumentHTML($content);
       $imgs = pq($doc)->find('img');
       foreach ($imgs as $img) {
           $src = 'http://cms.querylist.cc'.pq($img)->attr('src');
           // Save into the image folder mentioned above
           $localSrc = 'image/'.md5($src).'.jpg';
           $stream = file_get_contents($src);
           file_put_contents($localSrc,$stream);
           pq($img)->attr('src',$localSrc);
       }
       return $doc->htmlOuter();
   })
);
// Region selector: the rules are applied inside each matching element
$rang = '.content';
$ql = QueryList::Query($page,$reg,$rang);
$data = $ql->getData();
// Print the result
print_r($data);
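
The two callbacks in the rules above can be pulled out and exercised offline, which makes them easier to check before running a real collection. The following is a rough sketch under stated assumptions: `extractDate()` and `localizeImages()` are hypothetical names, PHP's built-in DOM extension is swapped in for phpQuery, the sample text and `http://example.test` are made up, and the download step is passed in as a callable so nothing touches the network.

```php
<?php
// Sketch only: extractDate() and localizeImages() are hypothetical names, and
// PHP's built-in DOM extension is swapped in for phpQuery.

// Same logic as the 'date' callback: keep the text before the first space.
function extractDate(string $content): string
{
    $parts = explode(' ', $content);
    return $parts[0];
}

// Same idea as the 'content' callback: point every <img> at a local file name.
// $save receives (remote URL, local name) and is expected to store the file.
function localizeImages(string $html, string $baseUrl, callable $save): string
{
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $doc = new DOMDocument();
    // Wrap the fragment so it has a single root element. This sketch assumes
    // ASCII input; loadHTML needs an encoding hint for other charsets.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $baseUrl . $img->getAttribute('src');
        $localSrc = md5($src) . '.jpg'; // md5-based naming, as in the example
        $save($src, $localSrc);
        $img->setAttribute('src', $localSrc);
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage with a recording callback instead of a real download.
$saved = [];
$result = localizeImages(
    '<p>Intro <img src="/uploads/a.png"></p>',
    'http://example.test',
    function (string $src, string $localSrc) use (&$saved) {
        $saved[] = [$src, $localSrc];
    }
);
```

In real use the callable would wrap `file_get_contents()` and `file_put_contents()`, as the example above does inline.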


Note:

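When using the phpQuery class, prefix it with a backslash (`\phpQuery`). phpQuery.php does not declare a namespace, so the class lives in the global namespace and must be referenced from there. A namespace cannot simply be added to phpQuery.php, because QueryList.php would then no longer be able to use the phpQuery class.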
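
The name-resolution rule behind this note can be reproduced in a single file. The following is a small sketch, assuming a controller namespace of `app\index\controller`; `GlobalParser` is a hypothetical stand-in for the phpQuery class, and the function names are invented for illustration.

```php
<?php
// Sketch only: GlobalParser is a hypothetical stand-in for the phpQuery class,
// and app\index\controller is an assumed ThinkPHP5 controller namespace.

namespace {
    // Declared without a namespace, as phpQuery.php is.
    class GlobalParser
    {
        public static function describe(): string
        {
            return 'resolved ' . __CLASS__;
        }
    }
}

namespace app\index\controller {
    // With the leading backslash the name is looked up in the global namespace.
    function callWithBackslash(): string
    {
        return \GlobalParser::describe();
    }

    // Without it, PHP resolves the name relative to the current namespace,
    // i.e. app\index\controller\GlobalParser, which is not defined.
    function unqualifiedClassExists(): bool
    {
        return class_exists(__NAMESPACE__ . '\\GlobalParser', false);
    }
}
```

Calling `GlobalParser::describe()` without the backslash inside the controller namespace would fail with a class-not-found error, which is the situation the note warns about.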
