
Web crawling: A summary of ways to implement web crawlers in PHP

WBOY
Release: 2016-07-13 10:14:55


Source: http://www.ido321.com/1158.html

To capture content from a web page, we need to parse its DOM tree, find the target node, and then extract the content we need. Doing this by hand is somewhat tedious, so I have summarized several commonly used, easy-to-use web scraping libraries. If you are familiar with jQuery selectors, these libraries will feel quite simple.
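As a minimal sketch of the parse, find, extract sequence described above, the following uses PHP's built-in DOM extension (DOMDocument and DOMXPath) rather than any of the libraries covered in this article. The sample markup and the variable names are made up for illustration, and the DOM extension is assumed to be enabled.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical content).
$page = <<<HTML
<html><body>
  <div class="focus"><h2>First post</h2></div>
  <div class="sidebar"><h2>Ignored</h2></div>
  <div class="focus"><h2>Second post</h2></div>
</body></html>
HTML;

// Step 1: parse the markup into a DOM tree. Real-world HTML is often
// invalid, so libxml warnings are collected instead of printed.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();

// Step 2: locate the nodes of interest, here every h2 directly
// inside a div whose class attribute equals "focus".
$xpath = new DOMXPath($dom);

// Step 3: extract the content we need from each matching node.
$titles = [];
foreach ($xpath->query('//div[@class="focus"]/h2') as $node) {
    $titles[] = trim($node->textContent);
}

print_r($titles);
```

The libraries in the following sections wrap these same three steps behind jQuery-style selectors, which is usually shorter to write than raw XPath.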

1. Ganon

Project address: http://code.google.com/p/ganon/

Documentation: http://code.google.com/p/ganon/w/list

Test: Grab all the div elements whose class attribute value is focus on the homepage of my website, and output the class value

<?php
// Load the Ganon library
include 'ganon.php';
// Fetch the page and parse it into a DOM
$html = file_get_dom('http://www.ido321.com/');
// Select every div whose class attribute equals "focus"
foreach ($html('div[class="focus"]') as $element) {
    echo $element->class, "<br>\n";
}
?>


2. phpQuery

Project address: http://code.google.com/p/phpquery/

Documentation: https://code.google.com/p/phpquery/wiki/Manual

Test: Grab the article tag element on the homepage of my website, and then print the html value of the h2 tag below it

<?php
include 'phpQuery/phpQuery.php';
// Load the page and make it the default document for pq()
phpQuery::newDocumentFile('http://www.ido321.com/');
// Select all <article> elements
$artlist = pq("article");
foreach ($artlist as $title) {
    // Print the inner HTML of the h2 inside each article
    echo pq($title)->find('h2')->html() . "<br/>";
}
?>


3. Simple-Html-Dom

Project address: http://simplehtmldom.sourceforge.net/

Documentation: http://simplehtmldom.sourceforge.net/manual.htm

Test: crawl all links on the homepage of my website

<?php
include 'simple_html_dom.php';
// A DOM can be created from either a URL or a file
$html = file_get_html('http://www.ido321.com/');

// Find all images
// foreach ($html->find('img') as $element)
//     echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
?>


4. Snoopy

Project address: http://sourceforge.net/projects/snoopy/

Documentation: the README shipped with the Snoopy download

Test: crawl my website homepage

<?php
include("Snoopy.class.php");
$url = "http://www.ido321.com";
$snoopy = new Snoopy;
$snoopy->fetch($url);   // fetch the whole page
echo $snoopy->results;  // display the result
// $snoopy->fetchtext($url);   // fetch text only (HTML stripped); output is in $snoopy->results
// $snoopy->fetchlinks($url);  // fetch the links; $snoopy->results is then an array
// $snoopy->fetchform($url);   // fetch the forms
?>


5. Manually write crawlers

If your coding skills are good, you can write a web crawler by hand. There are countless articles online that cover this approach, so I won't go into detail here. If you want to know more, search Baidu for "php web crawling".
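For orientation, here is one possible shape of a handwritten crawler, offered as a sketch rather than a reference implementation: a breadth-first loop that fetches a page, extracts its links with PHP's built-in DOMDocument, and queues the ones it has not seen yet. The function names `extract_links` and `crawl`, the `example.test` URLs, and the in-memory `$pages` array (which stands in for the network so the example runs offline) are all hypothetical.

```php
<?php
// Return the unique, non-empty href values of all <a> elements in $html.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    libxml_use_internal_errors(true);   // tolerate invalid markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

// Breadth-first crawl from $start, visiting at most $maxPages URLs.
// $fetch is any callable mapping a URL to an HTML string, or null on failure.
function crawl(string $start, callable $fetch, int $maxPages = 10): array
{
    $queue = [$start];
    $visited = [];
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;
        $html = $fetch($url);
        if ($html === null) {
            continue;
        }
        foreach (extract_links($html) as $link) {
            // This sketch follows absolute http(s) links only and
            // ignores relative ones instead of resolving them.
            if (preg_match('#^https?://#i', $link) && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

// In-memory stand-in for the network (hypothetical pages).
$pages = [
    'http://example.test/'  => '<a href="http://example.test/a">A</a><a href="http://example.test/b">B</a>',
    'http://example.test/a' => '<a href="http://example.test/">home</a><a href="/relative">skipped</a>',
    'http://example.test/b' => '<p>no links here</p>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? null;
};

$order = crawl('http://example.test/', $fetch);
print_r($order);
```

To crawl a live site, `$fetch` could instead wrap `file_get_contents()` and return null when it yields false. A real crawler would also need to resolve relative URLs, respect robots.txt, and throttle its requests, none of which this sketch attempts.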

P.S. Resource sharing

For common open source crawler projects, please visit: http://blog.chinaunix.net/uid-22414998-id-3774291.html




Related question: Collecting part of a website's content with a PHP web crawler

You can use the simple_html_dom class to collect the data. If you know jQuery, its usage should be clear at a glance.

Related question: Extracting keywords and abstracts from crawled pages for search

strip_tags($string)
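The one-line `strip_tags()` answer can be fleshed out roughly as follows. This is only a sketch under stated assumptions: `make_abstract` and `top_keywords` are hypothetical helper names, the sample HTML is invented, the keyword ranking is a naive word count, and the `mb_*` calls assume the mbstring extension is installed.

```php
<?php
// Build a plain-text abstract of at most $maxLength characters.
function make_abstract(string $html, int $maxLength = 80): string
{
    // strip_tags() removes the tags but keeps the code between
    // <script>/<style> pairs, so drop those blocks first.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before every tag so adjacent blocks do not run together.
    $html = str_replace('<', ' <', $html);
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/u', ' ', $text));
    if (mb_strlen($text) <= $maxLength) {
        return $text;
    }
    return mb_substr($text, 0, $maxLength) . '...';
}

// Naive keyword extraction: the most frequent words longer than 3 characters.
function top_keywords(string $text, int $limit = 3): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_filter($words, function ($w) {
        return mb_strlen($w) > 3;
    });
    $counts = array_count_values($words);
    arsort($counts);   // highest count first
    return array_slice(array_keys($counts), 0, $limit);
}

// Invented sample page.
$html = '<html><head><style>p{color:red}</style></head><body>'
      . '<h1>PHP crawler</h1>'
      . '<p>A crawler fetches pages &amp; a crawler parses them.</p>'
      . '<script>var x = 1;</script></body></html>';

$abstract = make_abstract($html);
$keywords = top_keywords($abstract, 1);

echo $abstract, "\n";
print_r($keywords);
```

A production search index would normally add stop-word filtering and, for Chinese text, a word segmenter, since splitting on non-letter characters does not separate Chinese words.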
