How to Build a PHP Web Crawler to Gather Data from Multiple Links?

Susan Sarandon
Release: 2024-11-08 06:50:02

PHP Web Crawler: Harvesting Data from Multiple Links

Question:

Create a PHP script to retrieve data from multiple links on a web page and store it in a local file.

Answer:

Using DOM and Depth Control:

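function crawl_page($url, $depth = 5)
{
    // URLs already visited; static so the list persists across recursive calls
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    // @ silences warnings about malformed HTML; skip pages that cannot be loaded
    if (!@$dom->loadHTMLFile($url)) {
        return;
    }

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        // Skip empty links, same-page fragments, and non-HTTP schemes
        if ($href === '' || $href[0] === '#' || preg_match('/^(mailto|javascript|tel):/i', $href)) {
            continue;
        }
        // Handle relative URLs
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            // http_build_url() only exists in pecl_http 1.x, so test for the function itself
            if (function_exists('http_build_url')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                // 'path' is absent for URLs such as http://example.com
                $dir = isset($parts['path']) ? dirname($parts['path']) : '';
                // Trim the trailing separator so the result has no double slash
                $href .= rtrim($dir, '/\\') . $path;
            }
        }
        crawl_page($href, $depth - 1);
    }

    // Output data
    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}

// Usage
crawl_page("http://hobodave.com", 2);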
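
The relative-URL branch in crawl_page treats every relative link the same way. As a rough sketch of how the common cases could be told apart, the helper below resolves a link against the page it was found on. The name resolve_url and the example hosts are assumptions made for illustration, not part of the original answer, and this is not a complete RFC 3986 implementation (query strings and fragments get no special treatment).

```php
<?php
// Hypothetical helper: turn $href, found on page $base, into an absolute URL.
function resolve_url(string $base, string $href): string
{
    // Already absolute: return unchanged
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';

    // Protocol-relative link (//host/path): reuse the scheme of the base URL
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }

    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    if (strpos($href, '/') === 0) {
        // Root-relative link (/path): replaces the whole path of the base URL
        $path = $href;
    } else {
        // Document-relative link: append to the directory part of the base path
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Collapse "." and ".." segments, remembering whether the path ended in "/"
    $keepSlash = (substr($path, -1) === '/');
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    $resolved = '/' . implode('/', $out);
    if ($keepSlash && $resolved !== '/') {
        $resolved .= '/';
    }

    return $origin . $resolved;
}

// Usage: a parent-directory link found on a nested page
echo resolve_url('http://example.com/a/b/c.html', '../d.html'), PHP_EOL;
```

Inside crawl_page, a call such as resolve_url($url, $href) could stand in for the whole relative-URL branch, so the optional http extension would no longer be needed.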

Notes:

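  • This version parses pages with DOMDocument, which copes with malformed or unusual markup better than regular expressions do.
  • It converts relative links to absolute URLs based on the current page. The approach is deliberately simple and does not cover every case, such as ../ segments or protocol-relative //host/path links.
  • A depth limit bounds the recursion, and the static $seen array ensures no URL is fetched twice. Together they prevent infinite looping.
  • The output is echoed to STDOUT, so you can redirect it to a file from the command line, for example: php crawler.php > pages.txt (assuming the script is saved as crawler.php).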
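
Since the question asks for the data to be stored in a local file, here is a minimal sketch of writing each page directly from PHP instead of relying on shell redirection. The helper name save_page, the temporary file, and the placeholder HTML are assumptions made for illustration; the record layout mirrors what crawl_page echoes.

```php
<?php
// Hypothetical helper: append one crawled page to a local file,
// using the same URL/CONTENT layout that crawl_page echoes.
function save_page(string $file, string $url, string $html): bool
{
    $record = 'URL:' . $url . PHP_EOL
            . 'CONTENT:' . PHP_EOL
            . $html . PHP_EOL . PHP_EOL;

    // FILE_APPEND keeps earlier pages; LOCK_EX guards against concurrent writers
    return file_put_contents($file, $record, FILE_APPEND | LOCK_EX) !== false;
}

// Usage with placeholder data, so no network access is needed
$file = tempnam(sys_get_temp_dir(), 'crawl_');
save_page($file, 'http://example.com/', '<html><body>Home</body></html>');
save_page($file, 'http://example.com/about', '<html><body>About</body></html>');
echo file_get_contents($file);
unlink($file);
```

In the crawler itself, the echo line could be swapped for a call like save_page('crawl_output.txt', $url, $dom->saveHTML()), where crawl_output.txt is an arbitrary file name chosen for this example.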

source:php.cn