Thinkphp5 and QueryList implement the page collection function (crawler)

Jan 28, 2020 pm 01:57 PM
querylist thinkphp5

What is QueryList?

QueryList is a PHP toolkit for content collection (web scraping). It takes a more modern approach, with a simple, elegant syntax and strong extensibility. Where traditional collection code relies on hard-to-read regular expressions, QueryList selects content with CSS selectors. This lowers the barrier to collecting content in PHP and keeps the collection code easy to read and maintain.
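To illustrate why structure-aware selection tends to hold up better than regular expressions, here is a minimal sketch. It does not use QueryList: XPath from PHP's built-in DOM extension stands in for a CSS selector, and the sample markup and class names are made up for the illustration.

```php
<?php
// Sample markup (hypothetical): the h1 carries an extra class and a nested tag.
$html = '<div class="content"><h1 class="title main">Hello <em>QueryList</em></h1></div>';

// Regex approach: the pattern expects exactly class="title", so it misses this h1.
preg_match('/<h1 class="title">(.*?)<\/h1>/s', $html, $m);
$viaRegex = $m[1] ?? null;

// DOM approach: select by document structure instead of by exact text.
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($dom);
$node = $xpath->query('//div[@class="content"]//h1')->item(0);

// textContent flattens nested tags such as <em> into plain text.
$viaDom = $node ? trim($node->textContent) : null;

var_dump($viaRegex, $viaDom);
```

The regex fails on a harmless markup change, while the structural query still finds the heading. A CSS selector such as `.content h1` expresses the same structural query more compactly.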

QueryList provides a complete set of content collection solutions:

● DOM content selection: CSS selector

● HTTP client: GuzzleHTTP

● Content filtering: CSS selector

● Fixing garbled characters: several built-in solutions for encoding problems

● Additional features: a rich set of extension plug-ins

Prerequisites

The project uses the thinkphp5 framework and relies mainly on two files, `QueryList.php` and `phpQuery.php`. Switch to the project directory, create a new directory named QL under extend, and then run the following composer command in the QL directory to install QueryList:

composer require jaeger/querylist

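Then add `use QL\QueryList;` to the controller that needs it and write the collection code in that controller. Note that the example calls the static `QueryList::Query()` method, which belongs to the phpQuery-based QueryList V3; later major versions use a different call style, so check which version composer installed. The following is an example: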

// Target page to collect
$page = 'http://cms.querylist.cc/news/566.html';
// Collection rules
$reg = array(
   // Article title
   'title' => array('h1','text'),
   // Publication date; QueryList's filter feature strips the span and a tags
   'date' => array('.pt_info','text','-span -a',function($content){
       // The callback narrows the remaining text down to the date
       $arr = explode(' ',$content);
       return $arr[0];
   }),
   // Article body; the filter removes hyperlinks but keeps their text, and drops the copyright notice, JS code and other unneeded content
   'content' => array('.post_content','html','a -.content_copyright -script',function($content){
       // The callback downloads the images in the article and replaces their paths with local paths
       // Make sure an image folder exists in the current directory and is writable
       // QueryList is built on phpQuery, so phpQuery can be used anywhere; regular expressions or other approaches would also work here

       $doc=\phpQuery::newDocumentHTML($content);
       $imgs = pq($doc)->find('img');
       foreach ($imgs as $img) {
           $src = 'http://cms.querylist.cc'.pq($img)->attr('src');
           // Save into the image folder mentioned above
           $localSrc = 'image/'.md5($src).'.jpg';
           $stream = file_get_contents($src);
           file_put_contents($localSrc,$stream);
           pq($img)->attr('src',$localSrc);
       }
       return $doc->htmlOuter();
   })
);
// Range selector: the page region the rules are applied to
$rang = '.content';
$ql = QueryList::Query($page,$reg,$rang);
$data = $ql->getData();
// Print the result
print_r($data);
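
The image-handling callback above can also be tried in isolation. The following is a minimal sketch that swaps phpQuery for PHP's built-in DOM extension and takes the download step as a callback, so it runs without network access. The function name `localizeImages`, the `$save` callback, and the sample markup are hypothetical, and the sketch assumes ASCII input whose fragment starts with an element.

```php
<?php
/**
 * Rewrite every <img src> in an HTML fragment to a local file name.
 * $save is a hypothetical callback (remote URL, local path) that would
 * download and store the image in a real collector.
 */
function localizeImages(string $html, string $baseUrl, callable $save): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Prefix site-relative paths with the base URL; keep absolute URLs as-is.
        $remote = preg_match('#^https?://#i', $src)
            ? $src
            : rtrim($baseUrl, '/') . '/' . ltrim($src, '/');
        // Derive a stable local name from the remote URL.
        $local = 'image/' . md5($remote) . '.jpg';
        $save($remote, $local);
        $img->setAttribute('src', $local);
    }

    // loadHTML wraps the fragment in <html><body>; emit only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Usage: record what would be downloaded instead of fetching it.
$saved = [];
$result = localizeImages(
    '<p>Intro <img src="/uploads/a.png" alt="a"></p>',
    'http://cms.querylist.cc',
    function (string $remote, string $local) use (&$saved): void {
        $saved[$remote] = $local;
    }
);
echo $result, PHP_EOL;
```

In a real run, `$save` could wrap `file_get_contents` and `file_put_contents` and should check the download result before writing, since `file_get_contents` returns false on failure.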

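Note:

A leading backslash (`\`) is required when using the phpQuery class in the controller, because phpQuery.php does not declare a namespace and the class therefore lives in the global namespace. Adding a namespace to phpQuery.php is not an option, because QueryList.php would then no longer be able to use the phpQuery class.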
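
The name-resolution rule behind this note can be sketched without phpQuery. In the example below, the built-in global class `ArrayObject` stands in for phpQuery, and `app\index\controller` is an assumed ThinkPHP5 controller namespace; both function names are hypothetical.

```php
<?php
namespace app\index\controller;

// Inside a namespace, an unqualified class name is resolved relative to
// that namespace, so "ArrayObject" would mean app\index\controller\ArrayObject.
function unqualifiedNameExists(): bool
{
    return class_exists(__NAMESPACE__ . '\\ArrayObject');
}

// A leading backslash forces resolution from the global namespace.
function countWithGlobalClass(): int
{
    $list = new \ArrayObject([1, 2, 3]);
    return count($list);
}

var_dump(unqualifiedNameExists());
var_dump(countWithGlobalClass());
```

The same rule applies to `\phpQuery::newDocumentHTML()` in the collection example: without the backslash, PHP would look for a phpQuery class inside the controller's namespace.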
