Crawling with PHP: A Comprehensive Guide
PHP offers several ways to extract data from a web page that contains multiple links. Regular expressions are one option, but they are fragile when applied to HTML, so avoid relying on them alone for parsing.
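As a minimal sketch of the DOM-based alternative to regular expressions, the snippet below collects the href of every anchor in an HTML string. The function name extract_links and the sample markup are assumptions made for illustration; only DOMDocument and the libxml error functions come from PHP itself.

```php
<?php
// Hypothetical helper: return the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them,
    // because real-world markup is often malformed.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {          // skip anchors without an href
            $links[] = $href;
        }
    }
    return $links;
}

// Sample markup (an assumption, not taken from any real page).
$html = '<p><a href="/about">About</a> <a href="https://example.com/">Home</a> <a>no link</a></p>';
print_r(extract_links($html));
```

Unlike a regular expression, the parser copes with attribute order, quoting style, and nested markup without any extra pattern work.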
DOM-Based Crawler Implementation
Tatu's DOM-based crawler provides a reliable alternative. Here's an improved version:
    function crawl_page($url, $depth = 5)
    {
        // Remember visited URLs across recursive calls.
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }

        $seen[$url] = true;

        $dom = new DOMDocument('1.0');
        // Suppress warnings raised by malformed HTML.
        @$dom->loadHTMLFile($url);

        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            // Links that do not start with "http" are resolved against $url.
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                // http_build_url() only exists in pecl_http 1.x, so test for the function itself.
                if (function_exists('http_build_url')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    // Replace the last path segment; fall back to "/" when the URL has no path.
                    $base = isset($parts['path']) ? $parts['path'] : '/';
                    $href .= rtrim(dirname($base, 1), '/\\') . $path;
                }
            }
            crawl_page($href, $depth - 1);
        }

        echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
    }
This improved version handles several URL scenarios, including HTTPS, embedded credentials (user and pass), and an explicit port.
Enhancements
George pointed out a bug in the original version: relative URLs were appended to the end of the URL path instead of overwriting its last segment. This has been fixed, so relative URLs now resolve as expected.
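To make the difference between appending and overwriting concrete, here is a simplified sketch of resolving a relative link against a base URL. The name resolve_relative is hypothetical, and the helper deliberately ignores "../" segments, query strings, and fragments; it only illustrates that the last path segment of the base is replaced.

```php
<?php
// Hypothetical helper: resolve a document-relative link against a base URL
// by replacing the last path segment of the base.
function resolve_relative(string $base, string $relative): string
{
    $parts = parse_url($base);

    $href = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $href .= $parts['host'];
    if (isset($parts['port'])) {
        $href .= ':' . $parts['port'];
    }

    // Directory of the base path: everything up to and including the last "/".
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $href . $dir . ltrim($relative, '/');
}

// "index.html" is replaced by "page2.html" rather than having it appended.
echo resolve_relative('https://example.com:8080/docs/index.html', 'page2.html'), PHP_EOL;
```

Appending would instead have produced a path ending in index.html/page2.html, which is the behavior the bug report describes.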
Saving Output
The modified version of the crawler echoes its output to STDOUT, allowing you to conveniently redirect it to a file of your choice.
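If shell redirection is not convenient, the echoed output can also be captured inside PHP with output buffering. The sketch below is one possible approach, not part of the original crawler: save_output is a hypothetical helper, the temporary file name is an assumption, and the closure merely stands in for a call such as crawl_page().

```php
<?php
// Hypothetical helper: run a callable that echoes and write what it printed to a file.
function save_output(callable $producer, string $file): int
{
    ob_start();                      // start capturing echoed output
    try {
        $producer();
    } finally {
        $captured = ob_get_clean();  // always close the buffer, even on error
    }

    $written = file_put_contents($file, $captured);
    if ($written === false) {
        throw new RuntimeException("Could not write to $file");
    }
    return $written;                 // number of bytes written
}

// Assumed target file; the closure is a stand-in for the crawler call.
$file  = sys_get_temp_dir() . '/crawl-output.txt';
$bytes = save_output(function () {
    echo "URL:", 'https://example.com/', PHP_EOL;
}, $file);

echo "Wrote $bytes bytes to $file", PHP_EOL;
```

Buffering holds the whole output in memory, so for large crawls redirecting STDOUT at the shell remains the lighter option.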
By incorporating these enhancements, this DOM-based crawler provides a robust solution for extracting data from web pages with multiple links in PHP.