HTML Parsing and Screen Scraping With the Simple HTML DOM Library - PHP Tutorial - PHP中文網

Lisa Kudrow

Published: 2025-02-28 10:50:16

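This tutorial shows how to parse HTML efficiently with an open-source parser, avoiding the complexity of regular expressions. As a worked example, we will extract article titles and descriptions from a listing page. This is for illustration only; always get permission before scraping a website.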

Setup

First install Composer, the PHP package manager, to simplify installing the library.

The remaining steps are detailed below.

Documentation

Comprehensive documentation is available in the project's official GitHub repository.

---

The core setup snippet below loads the library and initializes an array to store the article data:

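use voku\helper\HtmlDomParser;
// Load Composer's autoloader so the parser class is available.
require_once 'vendor/autoload.php';

// Collected [title, description] pairs, filled in by getArticles().
$articles = [];
getArticles('https://code.tutsplus.com/tutorials');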

The getArticles function (defined later) fetches and processes a web page.

It iterates over each <article> element and uses CSS selectors to extract the title and description:

$items = $html->find('article');
foreach($items as $post) {
    $articles[] = [
        /* title */ $post->findOne(".posts__post-title")->firstChild()->text(),
        /* description */ $post->findOne(".posts__post-teaser")->text()
    ];
}

Each entry in $articles will contain a title and description pair. For example:

$articles[0][0] = "My Article Name Here";
$articles[0][1] = "This is my article description";
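
Because each entry is a positional pair, downstream code has to remember that index 0 is the title and index 1 is the description. The following is a minimal sketch of consuming that structure with array destructuring. The sample data and the formatArticles() helper are hypothetical and exist only for illustration; they are not part of the library or the original script.

```php
<?php
// Hypothetical sample data in the same shape the scraper builds:
// index 0 is the title, index 1 is the description.
$articles = [
    ['My Article Name Here', 'This is my article description'],
    ['Another Article', 'Another description'],
];

/**
 * Turn positional [title, description] pairs into readable lines.
 * Illustrative helper only; not part of the Simple HTML DOM library.
 */
function formatArticles(array $articles): array
{
    $lines = [];
    // Destructure each pair so the two positions get meaningful names.
    foreach ($articles as [$title, $description]) {
        $lines[] = $title . ': ' . $description;
    }
    return $lines;
}

echo implode(PHP_EOL, formatArticles($articles)), PHP_EOL;
```

If the positional layout becomes hard to follow, storing each entry with 'title' and 'description' keys instead is a reasonable alternative.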

Handling pagination

To handle multiple pages, we identify the "next" page link. The relevant HTML:

<a aria-label="next" class="pagination__button pagination__next-button" href="https://www.php.cn/link/a3cdf7cabc49ea4612b126ae2a30ecbf" rel="next"><i class="fa fa-angle-right"></i></a>

The script finds this link, extracts its href attribute, and then calls getArticles recursively to fetch subsequent pages. It is crucial to clear the parsed page's DOM object before recursing, to prevent memory exhaustion.
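
The tutorial refers to getArticles without showing its full body, so the following is only a sketch of how the pieces described above might fit together, not the author's actual implementation. It makes several assumptions: results are written to the global $articles array, the "next" link is matched by its pagination__next-button class, and references to the parsed page are released with unset() before recursing. It also relies on findOne() returning a blank placeholder element (whose attributes read as empty strings) when nothing matches, which is my understanding of the voku library's behavior.

```php
<?php
use voku\helper\HtmlDomParser;

require_once 'vendor/autoload.php';

$articles = [];

/**
 * Sketch: scrape one listing page, then follow the "next" link recursively.
 */
function getArticles(string $url): void
{
    global $articles; // assumption: results go into the global array above

    $html = HtmlDomParser::file_get_html($url);

    foreach ($html->find('article') as $post) {
        $articles[] = [
            /* title */ $post->findOne('.posts__post-title')->firstChild()->text(),
            /* description */ $post->findOne('.posts__post-teaser')->text(),
        ];
    }

    // Assumed selector for the "next" link. When no link matches, the href
    // is expected to come back as an empty string.
    $nextUrl = $html->findOne('a.pagination__next-button')->getAttribute('href');

    // Release our references to the parsed page before recursing, so the
    // DOM for this page is not kept alive by this stack frame.
    unset($post, $html);

    if ($nextUrl !== '') {
        getArticles($nextUrl);
    }
}
```

Calling getArticles() with the listing URL then fills $articles page by page. This sketch assumes the href is an absolute URL, as in the markup shown above; a relative href would need to be resolved against the current page first.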