Source: http://www.ido321.com/1158.html
To scrape content from a web page, we need to parse its DOM tree, locate the target node, and then extract the content we want. Done by hand, this process is a bit cumbersome, so I have collected several commonly used, easy-to-use PHP scraping libraries here. If you are familiar with jQuery selectors, these libraries will feel quite simple.
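To make the "parse, locate, extract" steps concrete, here is a minimal sketch that uses PHP's built-in DOM extension instead of any of the libraries below. The `$markup` string is a made-up stand-in for a downloaded page, and the `focus` class simply mirrors the selector used in the first test; neither comes from a real site.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$markup = '<html><body>'
        . '<div class="focus">First</div>'
        . '<div class="other">Skip me</div>'
        . '<div class="focus">Second</div>'
        . '</body></html>';

// Step 1: parse the markup into a DOM tree.
// Real-world HTML is often malformed, so collect libxml warnings quietly.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($markup);
libxml_clear_errors();

// Step 2: locate the target nodes with an XPath query.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@class="focus"]');

// Step 3: extract the content we want from each node.
$texts = [];
foreach ($nodes as $node) {
    $texts[] = trim($node->textContent);
}

print_r($texts);
```

The XPath expression plays the role that a CSS selector such as `div[class="focus"]` plays in the libraries below; having to write XPath by hand is a large part of why the selector-based libraries are more convenient.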
1. Ganon
Project address: http://code.google.com/p/ganon/
Documentation: http://code.google.com/p/ganon/w/list
Test: select every div element on my site's homepage whose class attribute is "focus", and print its class value
<?php
include 'ganon.php';

$html = file_get_dom('http://www.ido321.com/');

foreach ($html('div[class="focus"]') as $element) {
    echo $element->class, "<br>\n";
}
?>
2. phpQuery
Project address: http://code.google.com/p/phpquery/
Documentation: https://code.google.com/p/phpquery/wiki/Manual
Test: select the article elements on my site's homepage, then print the inner HTML of the h2 element inside each one
<?php
include 'phpQuery/phpQuery.php';

phpQuery::newDocumentFile('http://www.ido321.com/');

$artlist = pq("article");
foreach ($artlist as $title) {
    echo pq($title)->find('h2')->html() . "<br/>";
}
?>
3. Simple-Html-Dom
Project address: http://simplehtmldom.sourceforge.net/
Documentation: http://simplehtmldom.sourceforge.net/manual.htm
Test: extract all links from my site's homepage
<?php
include 'simple_html_dom.php';

// A DOM can be created from either a URL or a local file
$html = file_get_html('http://www.ido321.com/');

// Find all images
// foreach ($html->find('img') as $element)
//     echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
?>
4. Snoopy
Project address: http://sourceforge.net/projects/snoopy/
Documentation: see the README file bundled with the Snoopy download
Test: fetch my site's homepage
<?php
include("Snoopy.class.php");

$url = "http://www.ido321.com";
$snoopy = new Snoopy;

$snoopy->fetch($url);   // fetch the full page content
echo $snoopy->results;  // print the result

// The other fetch methods also store their output in $snoopy->results:
// $snoopy->fetchtext($url);   // text content with the HTML stripped
// $snoopy->fetchlinks($url);  // links on the page
// $snoopy->fetchform($url);   // forms on the page
?>
5. Manually write crawlers
If you are comfortable writing code, you can write a crawler by hand. There are countless articles online that cover this approach, so I won't go into detail here. If you want to know more, search Baidu for "PHP web crawler".
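As a rough illustration of the hand-written approach, the sketch below uses only built-in PHP functions. The helper names `fetchPage` and `extractLinks`, the `ExampleCrawler/0.1` user agent string, and the sample markup are all hypothetical, chosen for this example. `fetchPage` assumes `allow_url_fopen` is enabled and is defined but not called, so the demo runs without network access.

```php
<?php
/**
 * Download a page body, or return null on failure.
 * Hypothetical helper; requires allow_url_fopen to be enabled.
 */
function fetchPage(string $url): ?string
{
    $context = stream_context_create([
        'http' => ['timeout' => 10, 'user_agent' => 'ExampleCrawler/0.1'],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

/**
 * Return the unique, non-empty href values of all <a> elements,
 * in the order they first appear in the markup.
 */
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate malformed markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[$href] = true; // array keys de-duplicate the links
        }
    }
    return array_keys($links);
}

// Demo on an inline snippet so the example runs offline.
$sample = '<p><a href="/a.html">A</a> <a href="/b.html">B</a> '
        . '<a href="/a.html">A again</a> <a>no href</a></p>';
print_r(extractLinks($sample));
```

To run it against a live page, pass the result of `fetchPage($url)` to `extractLinks()` after checking that it is not null. A real crawler would also need to resolve relative URLs, keep track of visited pages, and respect robots.txt, none of which this sketch attempts.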
P.S. Further resources
For a list of common open source crawler projects, see: http://blog.chinaunix.net/uid-22414998-id-3774291.html