This tutorial demonstrates how to parse HTML with an open-source parser instead of brittle regular expressions. As an example, we'll scrape Envato Tuts+ and extract article titles and descriptions. This is for illustrative purposes only; always obtain permission before scraping a website.
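Begin by installing Composer, a PHP package manager, which simplifies library installation. With Composer available, the parser used here can be added to a project by running composer require voku/simple_html_dom. The remaining steps are detailed below.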
Comprehensive documentation is available on the project's official GitHub repository.
---
Let's create a script to extract article titles and descriptions from Envato Tuts+. This is a demonstration only: scraping can overload servers, so do not run it against a site without permission.
The core code snippet:
    use voku\helper\HtmlDomParser;

    require_once 'vendor/autoload.php';

    $articles = [];

    getArticles('https://code.tutsplus.com/tutorials');

This loads the library through Composer's autoloader and initializes an array to store article data. The getArticles function, which the script defines separately, fetches and processes each page.
The heart of the script extracts article information:
    $items = $html->find('article');

    foreach ($items as $post) {
        $articles[] = [
            /* title */
            $post->findOne(".posts__post-title")->firstChild()->text(),
            /* description */
            $post->findOne(".posts__post-teaser")->text()
        ];
    }

This iterates through each article element (<article>) and extracts the title and description using CSS selectors. Each $articles entry will contain a title and description pair. For example:
    $articles[0][0] = "My Article Name Here";
    $articles[0][1] = "This is my article description";
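Positional indexes such as [0] and [1] are easy to mix up once the array is passed around. As a minimal sketch, the pairs can be re-keyed after scraping; the key names title and description and the $named variable are illustrative choices, not part of the original script.

```php
<?php
// Sample data in the positional shape shown above.
$articles = [
    ['My Article Name Here', 'This is my article description'],
];

// Re-key each [title, description] pair into an associative array.
// The key names are hypothetical and chosen only for readability.
$named = array_map(
    fn (array $row): array => ['title' => $row[0], 'description' => $row[1]],
    $articles
);

foreach ($named as $article) {
    echo $article['title'], ': ', $article['description'], PHP_EOL;
}
```

The arrow function syntax requires PHP 7.4 or later; on older versions a regular anonymous function does the same job.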
To handle multiple pages, we identify the "next" page link. The relevant HTML:
<a aria-label="next" class="pagination__button pagination__next-button" href="https://www.php.cn/link/a3cdf7cabc49ea4612b126ae2a30ecbf" rel="next"><i class="fa fa-angle-right"></i></a>
The script finds this link, extracts the href attribute, and recursively calls getArticles() for subsequent pages. Crucially, the $html object is cleared to prevent memory exhaustion.
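The full getArticles() function is not reproduced in this text, so the following is only a sketch of how the steps described above might fit together. It makes several assumptions: the library is installed via Composer as voku/simple_html_dom; the next link's href is an absolute URL; findOne() returns a placeholder element whose getAttribute() yields an empty string when nothing matches; and the extractPage() helper is a hypothetical name introduced here so the parsing step can be tried on a string. Releasing the parser with unset() stands in for the clearing step.

```php
<?php
use voku\helper\HtmlDomParser;

require_once 'vendor/autoload.php';

$articles = [];

/**
 * Hypothetical helper: pull [title, description] pairs and the next-page URL
 * out of an already parsed document.
 */
function extractPage($html): array
{
    $rows = [];
    foreach ($html->find('article') as $post) {
        $rows[] = [
            $post->findOne('.posts__post-title')->firstChild()->text(),
            $post->findOne('.posts__post-teaser')->text(),
        ];
    }

    // Assumed behaviour: an empty string here means there is no next link.
    $nextUrl = $html->findOne('a.pagination__next-button')->getAttribute('href');

    return [$rows, $nextUrl];
}

/**
 * Fetch one listing page, store its articles, then recurse into the next page.
 */
function getArticles(string $page): void
{
    global $articles;

    $html = HtmlDomParser::file_get_html($page);
    [$rows, $nextUrl] = extractPage($html);
    $articles = array_merge($articles, $rows);

    // Release the parsed document before loading the next page,
    // so that only one page is held in memory at a time.
    unset($html);

    if ($nextUrl !== '') {
        getArticles($nextUrl);
    }
}
```

With permission from the site owner, calling getArticles('https://code.tutsplus.com/tutorials') and then print_r($articles) would run the crawl. For a site with many pages, a loop in place of recursion would avoid a deep call stack.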
Parsing large websites can be time-consuming. This tutorial provides a foundation for HTML parsing using a user-friendly library. While this library is convenient, remember that other methods, such as PHP's built-in DOM manipulation with XPath, exist. Always prioritize obtaining permission before scraping any website.
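To illustrate the built-in alternative mentioned above, here is a minimal sketch using DOMDocument and DOMXPath on an inline HTML string. The sample markup and the classQuery() helper are assumptions made for this example; the class names simply mirror the selectors used earlier.

```php
<?php
// Inline sample markup, so the example runs without any network access.
$markup = <<<HTML
<article>
  <h1 class="posts__post-title"><a href="#">My Article Name Here</a></h1>
  <div class="posts__post-teaser">This is my article description</div>
</article>
HTML;

// Hypothetical helper: build an XPath expression that matches a whole class
// name, which is roughly what the CSS selector ".name" does.
function classQuery(string $class): string
{
    return sprintf(
        './/*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
}

$doc = new DOMDocument();
// Collect parser warnings internally; some libxml versions warn about HTML5 tags.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$articles = [];

foreach ($xpath->query('//article') as $post) {
    // The second argument scopes each query to the current <article>.
    $title = $xpath->query(classQuery('posts__post-title'), $post)->item(0);
    $teaser = $xpath->query(classQuery('posts__post-teaser'), $post)->item(0);

    if ($title !== null && $teaser !== null) {
        $articles[] = [trim($title->textContent), trim($teaser->textContent)];
    }
}

print_r($articles);
```

The XPath expressions are more verbose than CSS selectors, which is the main convenience the parser library offers, but this route needs no third-party dependency.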