How do I build a simple PHP crawler to extract links and content from a website?

Linda Hamilton
Release: 2024-11-07 19:04:02

Creating a Simple PHP Crawler

Crawling websites and extracting data is a common task in web programming. PHP's DOM extension can load a remote page by URL and parse its HTML, which is enough to build a basic crawler without third-party libraries.

The following function collects the links and content of a given web page and follows those links recursively up to a fixed depth:

Using a DOM Parser:

<?php
function crawl_page($url, $depth = 5)
{
    // Prevent endless recursion and circular references
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    // Mark the URL as seen
    $seen[$url] = true;

    // Load the web page using DOM
    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    // Iterate over all anchor tags (<a>)
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
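        $href = trim($element->getAttribute('href'));

        // Skip empty links, in-page fragments, and non-HTTP schemes (mailto:, javascript:, tel:)
        if ($href === '' || $href[0] === '#' || preg_match('/^(mailto|javascript|tel):/i', $href)) {
            continue;
        }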

        // Convert relative URLs to absolute URLs
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
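                // The base URL may have no path (e.g. "http://example.com"); avoid a double slash at the root
                $href .= rtrim(dirname($parts['path'] ?? '/'), '/\\') . $path;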
            }
        }

        // Recursively crawl the linked page
        crawl_page($href, $depth - 1);
    }

    // Output the crawled page's URL and content
    echo "URL: " . $url . PHP_EOL . "CONTENT: " . PHP_EOL . $dom->saveHTML() . PHP_EOL . PHP_EOL;
}
crawl_page("http://example.com", 2);
?>

This crawler loads each page into a DOMDocument, finds every anchor tag, and recursively follows its href until the depth limit is reached. For each page it visits, it prints the URL and the parsed HTML to standard output, so you can redirect that output to a text file to save the collected data locally. Note that it does not restrict itself to the starting host: links to other sites are followed as well.
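If you want the links and the readable text rather than the raw HTML dump, the extraction step can be separated from the crawling. The following is a minimal sketch, assuming the HTML is already available as a string; `extract_links_and_text()` is a hypothetical helper name, not part of the crawler above.

```php
<?php
/**
 * Extract link targets and text content from an HTML string.
 * Hypothetical helper for illustration; not part of the original crawler.
 */
function extract_links_and_text(string $html): array
{
    if (trim($html) === '') {
        return ['links' => [], 'text' => ''];
    }

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them;
    // real-world pages often contain malformed markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Gather every non-empty href, exactly as written in the page.
    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    // textContent concatenates all text nodes under <body>;
    // collapse runs of whitespace into single spaces.
    $body = $dom->getElementsByTagName('body')->item(0);
    $text = $body ? trim(preg_replace('/\s+/', ' ', $body->textContent)) : '';

    return ['links' => $links, 'text' => $text];
}

$page = extract_links_and_text(
    '<html><body><p>Read the <a href="/docs">docs</a> and the <a href="faq.html">FAQ</a>.</p></body></html>'
);
print_r($page['links']);
echo $page['text'], PHP_EOL;
```

The returned links are still relative where the page wrote them that way, and the text includes the contents of any script or style elements inside the body, so a real crawler would likely filter those out first.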

Additional Features:

  • Prevents crawling the same URL multiple times.
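  • Converts relative URLs to absolute ones, resolved against the directory of the current page.
  • Preserves the scheme (HTTP or HTTPS), credentials, and port number of the current page when building absolute URLs, using http_build_url() from the http PECL extension when that extension is loaded.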
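The relative-URL handling listed above covers simple cases only. As a rough sketch of a fuller resolver, the hypothetical helper `resolve_url()` below (an assumed name, not part of the original code) also handles root-relative paths, `../` segments, query-only links, and protocol-relative links, and returns null for links a crawler would normally skip. It ignores credentials in the base URL and is not a complete RFC 3986 implementation.

```php
<?php
/**
 * Resolve $href against $base for http(s) links.
 * Hypothetical helper; returns null for links that should not be crawled.
 */
function resolve_url(string $base, string $href): ?string
{
    $href = trim($href);
    // Skip empty links and in-page fragments.
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    // Anything with a scheme is already absolute; keep only http and https.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return preg_match('#^https?://#i', $href) ? $href : null;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    // Protocol-relative link: reuse the scheme of the base URL.
    if (strncmp($href, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $href;
    }

    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';
    $href = explode('#', $href, 2)[0]; // drop any fragment

    if ($href[0] === '/') {
        $path = $href;                  // root-relative
    } elseif ($href[0] === '?') {
        $path = $basePath . $href;      // query-only: keep the base path
    } else {
        // Relative to the directory of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Set the query string aside so dot-segment removal only touches the path.
    $query = '';
    if (($pos = strpos($path, '?')) !== false) {
        $query = substr($path, $pos);
        $path = substr($path, 0, $pos);
    }

    // Remove "." and ".." segments.
    $raw = explode('/', $path);
    $segments = [];
    foreach ($raw as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $trailing = ($segments && in_array(end($raw), ['', '.', '..'], true)) ? '/' : '';

    return $origin . '/' . implode('/', $segments) . $trailing . $query;
}

echo resolve_url('http://example.com/blog/post.html', '../img/a.png'), PHP_EOL;
```

Because fragments are dropped and dot segments are removed, two links that point at the same page are more likely to produce the same string, which also makes the "already seen" check more effective.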


source:php.cn