How to quickly build your own web crawler system using PHP and Selenium-PHP Tutorial-php.cn

王林

Release: 2023-06-16 06:14:01

In recent years, as the Internet has become ubiquitous, web crawlers have become one of the main means of collecting information. Conventional crawlers, however, tend to be unstable and hard to maintain, and crawlers that only issue plain HTTP requests can work with static pages only. Combining PHP with Selenium makes it possible to crawl dynamic pages as well, because every page is rendered in a real browser before it is read. This gives more stable and more complete data collection, and the approach is widely used in crawler development. This article introduces how to quickly build your own web crawler system using PHP and Selenium.

1. Installation of Selenium and ChromeDriver

Selenium is a browser automation tool that was originally built for automated testing of web applications. It drives a real browser independently of the operating system, so pages are rendered, including their JavaScript, without any code being injected into them. ChromeDriver is the driver through which Selenium controls the Chrome browser, which allows Selenium to operate Chrome directly and crawl dynamic pages.

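First, install the Chrome browser and a PHP environment locally. Next, install the Selenium client library for PHP with Composer by entering the following command in the command line. The package was originally published as facebook/webdriver and has since been renamed php-webdriver/webdriver; the Facebook\WebDriver namespace used in the code below is unchanged:

composer require php-webdriver/webdriver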

Then download the ChromeDriver build that matches your local Chrome version, place the binary in a directory on the system PATH, and start it. PHP can then connect to the running ChromeDriver as follows:

// ChromeDriver must already be running; it listens on port 9515 by default.
$webdriver = \Facebook\WebDriver\Remote\RemoteWebDriver::create(
    'http://localhost:9515', \Facebook\WebDriver\Remote\DesiredCapabilities::chrome()
);
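
Before connecting, it can help to confirm that ChromeDriver is actually listening. The following is a minimal sketch rather than part of the original setup: the helper name isPortOpen is hypothetical, port 9515 is assumed to match the URL used above, and the check only shows that something accepts TCP connections on that port, not that it is ChromeDriver.

```php
<?php
// Hypothetical helper: returns true if a TCP connection to $host:$port succeeds.
// It does not verify that the listener is really ChromeDriver.
function isPortOpen(string $host, int $port, float $timeoutSeconds = 1.0): bool
{
    // The @ suppresses the warning that fsockopen() emits when the connection fails.
    $socket = @fsockopen($host, $port, $errorCode, $errorMessage, $timeoutSeconds);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}

// Assumed port: 9515, as in the connection URL used by the tutorial.
if (!isPortOpen('localhost', 9515)) {
    echo "ChromeDriver does not seem to be running on port 9515\n";
}
```

Failing early with a clear message is usually easier to debug than the connection exception that RemoteWebDriver::create() raises when nothing is listening.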

2. Build a wrapper class for Selenium and ChromeDriver

The Selenium wrapper class holds a single shared Selenium and ChromeDriver session, so that the browser is not created and destroyed repeatedly. The code is as follows:

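use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

class Selenium
{
    private static $driver;
    private static $selenium;

    // Returns the shared instance and starts one headless Chrome session on first use.
    public static function getInstance()
    {
        if (null === self::$selenium) {
            $options = new ChromeOptions();
            $options->addArguments(['--no-sandbox','--disable-extensions','--headless','--disable-gpu']);
            $capabilities = DesiredCapabilities::chrome();
            $capabilities->setCapability(ChromeOptions::CAPABILITY, $options);
            self::$driver = RemoteWebDriver::create('http://localhost:9515', $capabilities);
            self::$selenium = new self();
        }

        return self::$selenium;
    }

    // Ends the browser session when the shared instance is destroyed, normally at script shutdown.
    public function __destruct()
    {
        if (null !== self::$driver) {
            self::$driver->quit();
            self::$driver = null;
        }
    }

    public function getDriver()
    {
        return self::$driver;
    }
}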

Note that the ChromeOptions arguments are mainly there so that Chrome runs stably without a GUI (graphical interface): --headless starts Chrome without a window, and --no-sandbox prevents startup errors on Linux systems, which typically occur when Chrome runs as root, for example inside a container.

3. Create a web page source code parsing class

The core of the crawler system is parsing non-static pages. Here you need to create a source code parsing class that locates the target nodes and obtains their information. Regular expressions or XPath expressions can be used for this; the class below uses XPath. The code is as follows:

class PageParser
{
    private $pageSource;

    public function __construct(string $pageSource)
    {
        $this->pageSource = $pageSource;
    }

    // Returns an array of matches when $list is true, otherwise the first match as a string.
    public function parse(string $expression, $list = false)
    {
        if ($list) {
            return $this->parseList($expression);
        }
        return $this->parseSingle($expression);
    }

    private function parseList(string $expression)
    {
        $result = [];
        foreach ($this->createXPath()->query($expression) as $item) {
            $result[] = trim($item->nodeValue);
        }
        return $result;
    }

    private function parseSingle(string $expression)
    {
        $item = $this->createXPath()->query($expression)->item(0);
        return $item ? trim($item->nodeValue) : '';
    }

    // loadHTML() is an instance method; calling it statically throws an Error on PHP 8.
    private function createXPath()
    {
        $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
        $document = new DOMDocument();
        $document->loadHTML($this->pageSource);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        return new DOMXPath($document);
    }
}

The DOMXPath and DOMDocument classes are used here to parse the HTML nodes of the page. The parseList method returns the content of all matching nodes, and the parseSingle method returns the content of the first matching node, or an empty string if nothing matches.
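
To try the two lookup modes without starting a browser, the sketch below runs the same kind of DOMXPath query against a small hard-coded HTML string. The markup and the function name queryTexts are made up for illustration, and the code is a standalone approximation of the parsing step rather than the class itself.

```php
<?php
// Hypothetical helper: returns the trimmed text of every node matching $expression.
function queryTexts(string $html, string $expression): array
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ((new DOMXPath($document))->query($expression) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Made-up sample markup standing in for a rendered page.
$html = '<html><body><h1> News </h1><ul><li>First</li><li>Second</li></ul></body></html>';

$items = queryTexts($html, '//li');           // every match, as in the list mode
$title = queryTexts($html, '//h1')[0] ?? '';  // first match or '', as in the single mode

print_r($items);
echo $title, "\n";
```

Testing expressions this way against a saved copy of a page is a quick way to debug XPath before involving Selenium at all.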

4. Create a crawler class

Finally, we need to build a crawler class that specifically crawls page content. The code is as follows:

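class Spider
{
    private $selenium;
    private $url;

    public function __construct($url)
    {
        $this->selenium = Selenium::getInstance();
        $this->url = $url;
        $this->selenium->getDriver()->get($url);
        sleep(1); // crude fixed pause so that scripts on the page can finish rendering
    }

    public function __destruct()
    {
        // The browser session is shared, so it is not closed here; Selenium::__destruct() quits it.
        $this->selenium = null;
    }

    // Parses the page currently loaded in the shared browser, so call this
    // before another Spider navigates to a different URL.
    public function getContent($expression, $list = false)
    {
        $pageSource = $this->selenium->getDriver()->getPageSource();
        $parser = new PageParser($pageSource);
        return $parser->parse($expression, $list);
    }
}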

The getContent method of this class receives two parameters: the XPath expression of the target node, and a flag that indicates whether to return multiple results. The constructor requests the URL, and getContent parses the rendered page source to obtain the required content. The browser is closed when the objects are destroyed at the end of the script.
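
The fixed sleep(1) in the constructor is a guess: a slow page may need longer, and a fast one wastes time. A common alternative is to poll until a condition holds or a timeout expires. The sketch below shows that idea in plain PHP. The helper waitUntil is hypothetical, as are its default timeout and interval, and the counter-based condition only simulates a page that becomes ready after a few checks. php-webdriver also provides its own waiting mechanism through $driver->wait(), which is usually the better choice inside a real crawler.

```php
<?php
// Hypothetical helper: calls $condition repeatedly until it returns a truthy value
// or $timeoutSeconds have passed. Returns that value, or null on timeout.
function waitUntil(callable $condition, float $timeoutSeconds = 10.0, int $intervalMs = 250)
{
    $deadline = microtime(true) + $timeoutSeconds;
    do {
        $result = $condition();
        if ($result) {
            return $result;
        }
        usleep($intervalMs * 1000); // pause between attempts
    } while (microtime(true) < $deadline);

    return null;
}

// Stand-in condition: becomes truthy on the third call, as if the page had finished rendering.
$calls = 0;
$value = waitUntil(function () use (&$calls) {
    return ++$calls >= 3 ? 'ready' : null;
}, 2.0, 50);

echo $value, ' after ', $calls, " checks\n";
```

With the crawler class, the condition could be a closure that calls getContent with the XPath expression you are waiting for, since an empty array or empty string counts as "not ready yet".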

5. Usage Examples

Finally, we use practical examples to illustrate how to use this crawler class. Suppose we need to crawl the href attribute and text information in the a tag from a web page with multiple a tags. We can achieve this through the following code:

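$spider = new Spider('https://www.example.com');
// getContent() returns strings, so the href values and the link texts are queried separately.
// Both expressions match the same a tags in document order, so the indexes line up.
$hrefs = $spider->getContent('//a[@href]/@href', true);
$texts = $spider->getContent('//a[@href]', true);
foreach ($hrefs as $i => $href) {
    echo "$href -> {$texts[$i]}\n";
}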

In the above code, the Spider class first loads the page and obtains its rendered source. XPath expressions then select the a tags, and the href attribute and text of each a tag are printed one per line.
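
If you would rather read the href and the text from the same node, you can work with the DOMElement objects directly instead of the strings that getContent returns. The following sketch is one possible way to do that and is not part of the classes above: extractLinks is a hypothetical function, and the HTML string stands in for what getPageSource() would return in a real crawl.

```php
<?php
// Hypothetical helper: returns one ['href' => ..., 'text' => ...] entry per a tag that has an href.
function extractLinks(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ((new DOMXPath($document))->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Made-up markup standing in for the page source of a real crawl.
$html = '<p><a href="/docs">Docs</a> <a name="top">no link</a> <a href="https://www.example.com/"> Home </a></p>';

foreach (extractLinks($html) as $link) {
    echo $link['href'], ' -> ', $link['text'], "\n";
}
```

Keeping each href and its text in one record avoids having to rely on two separate result lists staying aligned.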

6. Summary

To sum up, this article introduces how to use PHP and Selenium to build a web crawler system, and uses a practical example to illustrate how to obtain node information from a page. A crawler built this way is stable, accurate, and comprehensive in the data it collects, which gives it practical application value. At the same time, when crawling data you need to pay attention to legality and ethics, and comply with the relevant laws and regulations.
