With the development of the Internet, crawlers have become one of the main means of obtaining data. Among the many languages used for crawling, the combination of PHP and Selenium has become a solution that attracts a lot of attention. This article will show you how to use PHP and Selenium to build a reliable and efficient web crawler.
1. Introduction to Selenium
Selenium is a web automation testing framework that drives a real browser and provides bindings for multiple languages (such as Java, Python, and PHP); the PHP binding is called php-webdriver. Selenium's main role is automated testing, but it can also be used for web crawling. Compared with traditional crawling libraries (such as requests or Scrapy), Selenium handles JavaScript-heavy and dynamically rendered pages much better, which makes the crawler more reliable and stable.
2. Selenium installation
1. Install Selenium WebDriver
First you need to install the browser driver. Visit the Selenium official website at http://www.seleniumhq.org/download/ and download the driver for your browser; this article uses Chrome (ChromeDriver) as an example.
After downloading, place the driver executable in a directory on your system PATH and start it. By default, ChromeDriver listens on http://localhost:9515, which is the address the examples below connect to.
2. Install php-webdriver
You can use Composer to install php-webdriver by executing the following command:
composer require facebook/webdriver
3. Simple example
After the installation is completed, you can use php-webdriver to perform simple operations, such as opening a website and getting the page title:
<?php
require_once('vendor/autoload.php');

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

$host = 'http://localhost:9515'; // default ChromeDriver address
$driver = RemoteWebDriver::create($host, DesiredCapabilities::chrome());
$driver->get('http://github.com');
echo "Page title: " . $driver->getTitle() . PHP_EOL;
$driver->quit();
3. Crawler implementation
1. Log in to the website
Some websites require logging in before data can be obtained. Taking GitHub as an example, the crawler first logs in through the browser and then keeps the authenticated session for subsequent operations:
<?php
require_once('vendor/autoload.php');

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;

// Replace the following with your own GitHub username and password
$username = 'yourusername';
$password = 'yourpassword';

// Start the browser and log in
$host = 'http://localhost:9515'; // default ChromeDriver address
$driver = RemoteWebDriver::create($host, DesiredCapabilities::chrome());
$driver->get('http://github.com/login');
$driver->findElement(WebDriverBy::cssSelector('input[name="login"]'))->sendKeys($username);
$driver->findElement(WebDriverBy::cssSelector('input[name="password"]'))->sendKeys($password);
$driver->findElement(WebDriverBy::cssSelector('input[type="submit"]'))->click();

// Check whether the login succeeded
$cookies = $driver->manage()->getCookies();
if (count($cookies) == 0) {
    echo "Login failed" . PHP_EOL;
    exit;
}
echo "Login succeeded" . PHP_EOL;
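The cookie check above is only a rough heuristic, and the script continues immediately after clicking the submit button, before the next page may have finished loading. A more reliable pattern is to add an explicit wait for something that only exists after a successful sign-in. The sketch below assumes GitHub exposes a meta[name="user-login"] tag containing the signed-in username; treat that selector as an illustration that may need adjusting:

<?php
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

// $driver and $username come from the login example above.
// Wait up to 10 seconds, polling every 500 ms, for an element that is only
// present for signed-in users; an exception is thrown if the wait times out.
$driver->wait(10, 500)->until(
    WebDriverExpectedCondition::presenceOfElementLocated(
        WebDriverBy::cssSelector('meta[name="user-login"][content="' . $username . '"]')
    )
);
echo "Login confirmed" . PHP_EOL;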
2. Obtain data
After logging in and navigating to the target page, you can locate elements with CSS or XPath selectors. For example, to get the number of stars of a repository:
<?php
use Facebook\WebDriver\WebDriverBy;

// Get the star count of a repository ($driver is the logged-in instance from above)
$driver->get('https://github.com/twbs/bootstrap');
$starText = $driver->findElement(WebDriverBy::cssSelector('.js-social-count'))->getText();
$starCount = (int)str_replace(',', '', $starText);
echo "Star count: " . $starCount . PHP_EOL;
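Selectors such as .js-social-count are tied to GitHub's current markup and can stop matching when the site changes. If an element cannot be found, findElement throws a NoSuchElementException, so for a long-running crawler it can be worth catching it instead of letting the script die. A minimal sketch, reusing the $driver instance from above:

<?php
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\Exception\NoSuchElementException;

// $driver is the RemoteWebDriver instance created earlier.
try {
    $starText = $driver->findElement(WebDriverBy::cssSelector('.js-social-count'))->getText();
    echo "Star count: " . (int)str_replace(',', '', $starText) . PHP_EOL;
} catch (NoSuchElementException $e) {
    // The selector no longer matches anything; the page layout has probably changed.
    echo "Star counter not found, the selector may be outdated" . PHP_EOL;
}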
If you need to get multiple elements, you can use the findElements method, which returns an array of WebDriverElement objects. For example, to count the starred repositories listed on the current page of a user's stars tab:
<?php
use Facebook\WebDriver\WebDriverBy;

// Count the starred repositories listed on the current page
$driver->get('https://github.com/yourusername?tab=stars');
$stars = $driver->findElements(WebDriverBy::cssSelector('.col-12.d-inline-block>a'));
echo "Star count: " . count($stars) . PHP_EOL;
3. Paging through results
If the data is displayed across multiple pages, you need to page through it. You can keep a page counter and move forward by simulating a click on the "next page" button:
<?php
use Facebook\WebDriver\WebDriverBy;

// Page through a user's starred repositories on GitHub
$driver->get('https://github.com/yourusername?tab=stars');
$pageNum = 1;
while (true) {
    echo "Page {$pageNum}:" . PHP_EOL;
    $pageStars = $driver->findElements(WebDriverBy::cssSelector('.col-12.d-inline-block>a'));
    foreach ($pageStars as $star) {
        echo $star->getText() . PHP_EOL;
    }
    $nextPageBtn = $driver->findElement(WebDriverBy::cssSelector('.pagination>button:last-child'));
    if ($nextPageBtn->getAttribute('disabled') == 'true') {
        break;
    }
    $nextPageBtn->click();
    $pageNum++;
}
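One caveat with the loop above: clicking the next-page button starts a new page load, and reading elements before the new page has rendered can return stale or incomplete results. A common remedy is an explicit wait between the click and the next read. The sketch below, which assumes the whole list is replaced on navigation, waits until an element from the old page has become stale:

<?php
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

// $driver is the RemoteWebDriver instance created earlier.
// Keep a handle to an element on the current page ...
$firstStar = $driver->findElement(WebDriverBy::cssSelector('.col-12.d-inline-block>a'));

// ... click "next page" ...
$driver->findElement(WebDriverBy::cssSelector('.pagination>button:last-child'))->click();

// ... and wait up to 10 seconds until that element is detached from the DOM,
// i.e. the old page has been replaced, before scraping the new list.
$driver->wait(10)->until(WebDriverExpectedCondition::stalenessOf($firstStar));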
4. Summary
By combining PHP and Selenium, JavaScript-heavy and dynamic web pages can be handled much better, improving crawler reliability and stability. Selenium also provides a rich API that makes operations such as logging in and paging easy to implement. Of course, Selenium has drawbacks as well, such as high resource consumption and relatively slow speed, so which solution to use should be decided based on your specific needs.