


Thinkphp5 and QueryList implement the page collection function (crawler)
What is QueryList?
QueryList is a set of PHP tools for content collection, which uses a more modern Development ideas, simple and elegant syntax, and strong scalability. Compared with the traditional use of obscure regular expressions for collection, QueryList uses a more powerful and elegant CSS selector for collection, which greatly lowers the threshold for PHP collection, and also makes the collection code easy to read and maintain, allowing you to Say goodbye to obscure and difficult-to-maintain regular expressions.
QueryList provides a complete set of content collection solutions
● DOM content selection: CSS selector
● HTTP client Terminal: GuzzleHTTP
● Content filtering: CSS selector
● Solving garbled characters: Built-in multiple sets of garbled code solutions
● Additional features: Rich extension plug-ins
Premise
The project mainly uses the thinkphp5 framework, and mainly uses the two files `QueryList.php` and `phpQuery.php`. We can switch to the project directory, create a new QL in extend, and then execute the composer command in the QL directory to install QueryList:
composer require jaeger/querylist
Then add use QL\QueryList to the controller that needs to be used; and then in the controller The code has been written. The following is an example
//需要采集的目标页面 $page = 'http://cms.querylist.cc/news/566.html'; //采集规则 $reg = array( //采集文章标题 'title' => array('h1','text'), //采集文章发布日期,这里用到了QueryList的过滤功能,过滤掉span标签和a标签 'date' => array('.pt_info','text','-span -a',function($content){ //用回调函数进一步过滤出日期 $arr = explode(' ',$content); return $arr[0]; }), //采集文章正文内容,利用过滤功能去掉文章中的超链接,但保留超链接的文字,并去掉版权、JS代码等无用信息 'content' => array('.post_content','html','a -.content_copyright -script',function($content){ //利用回调函数下载文章中的图片并替换图片路径为本地路径 //使用本例请确保当前目录下有image文件夹,并有写入权限 //由于QueryList是基于phpQuery的,所以可以随时随地使用phpQuery,当然在这里也可以使用正则或者其它方式达到同样的目的 $doc=\phpQuery::newDocumentHTML($content); $imgs = pq($doc)->find('img'); foreach ($imgs as $img) { $src = 'http://cms.querylist.cc'.pq($img)->attr('src'); $localSrc = md5($src).'.jpg'; $stream = file_get_contents($src); file_put_contents($localSrc,$stream); pq($img)->attr('src',$localSrc); } return $doc->htmlOuter(); }) ); $rang = '.content'; $ql = QueryList::Query($page,$reg,$rang); $data = $ql->getData(); //打印结果 print_r($data);
Note:
needs to be added in front when using the phpQuery class on \, because the namespace is not used in phpQuery.php, because after using the namespace, QueryList.php cannot use the phpQuery class.
For more related ThinkPHP knowledge, please visit ThinkPHP Tutorial!
The above is the detailed content of Thinkphp5 and QueryList implement the page collection function (crawler). For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



Solution to the error reported when deploying thinkphp5 in Pagoda: 1. Open the Pagoda server, install the php pathinfo extension and enable it; 2. Configure the ".access" file with the content "RewriteRule ^(.*)$ index.php?s=/$1 [QSA ,PT,L]”; 3. In website management, just enable thinkphp’s pseudo-static.

Solution to thinkphp5 url rewriting not working: 1. Check whether the mod_rewrite.so module is loaded in the httpd.conf configuration file; 2. Change None in AllowOverride None to All; 3. Modify the Apache configuration file .htaccess to "RewriteRule ^ (.*)$ index.php [L,E=PATH_INFO:$1]" and save it.

thinkphp5 post cannot get a value because TP5 uses the strpos function to find the app/json string in the content-type value of the Header. The solution is to set the content-type value of the Header to app/json.

Methods for thinkphp5 to obtain the requested URL: 1. Use the "$request = Request::instance();" method of the "\think\Request" class to obtain the current URL information; 2. Use the built-in helper function "$request-> url()" to obtain the complete URL address including the domain name.

How to remove the thinkphp5 title bar icon: 1. Find the favicon.ico file under the thinkphp5 framework public; 2. Delete the file or choose another picture to rename it to favicon.ico and replace the original favicon.ico file.

Solution to thinkphp5 prompting that the controller does not exist: 1. Check whether the namespace in the corresponding controller is written correctly and change it to the correct namespace; 2. Open the corresponding tp file and modify the class name.

How to query yesterday's data in ThinkPHP5: 1. Open ThinkPHP5 related files; 2. Query yesterday's data through the expression "db('table')->whereTime('c_time', 'yesterday')->select();" Can.

How to set error prompts in thinkphp5: 1. Enter the public folder in the project root directory and open the index.php entry file; 2. View the comments on the debug mode switch; 3. Adjust the value of the "APP_DEBUG" constant to true to display Error message prompt.
