# Extracting Text from HTML Content in Python: A Simple Solution with `HTMLParser`

Patricia Arquette

Published: 2024-12-10 11:04:16


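## Introduction

When working with HTML data, you often need to strip the tags and keep only the plain text. Whether for data analysis, automation, or simply making content readable, this is a common task for developers.

In this article, I'll show you how to create a simple Python class that extracts plain text from HTML using `HTMLParser`, a class from Python's built-in `html.parser` module.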

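## Why Use HTMLParser?

`HTMLParser` ships with Python's standard library (in the `html.parser` module) and lets you parse HTML documents by reacting to tags and text as the parser encounters them. Unlike external libraries such as BeautifulSoup, it needs no installation, which makes it a good fit for simple tasks like stripping HTML tags.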
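
To make the callback model concrete, here is a minimal sketch that records each event the parser reports. The class name `EventLogger` and its `events` list are illustrative assumptions; only `HTMLParser` and its `handle_*` methods come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Illustrative helper: records every callback the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()  # flush anything the parser may still be buffering

for event in logger.events:
    print(event)
# ('start', 'p')
# ('data', 'Hello ')
# ('start', 'b')
# ('data', 'world')
# ('end', 'b')
# ('end', 'p')
```

Extracting plain text then comes down to keeping the `data` events and ignoring the rest.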

## The Solution: A Simple Python Class

### Step 1: Create the HTMLTextExtractor Class

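from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    """Class for extracting plain text from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []  # raw text fragments, in document order

    def handle_data(self, data):
        # Keep the original whitespace so that words on either side of
        # an inline tag (such as <strong>) do not get glued together.
        self.text.append(data)

    def get_text(self):
        # Join the fragments, then collapse runs of whitespace into single spaces.
        return ' '.join(''.join(self.text).split())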

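This class does three main things:

- Initializes the list `self.text` to store the extracted text.
- Uses the `handle_data` method to capture all the plain text found between HTML tags.
- Uses the `get_text` method to combine all the text fragments.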

### Step 2: Use the Class to Extract Text

Here is how to use the class to clean up HTML:

raw_description = """
<div>
    <h1>Welcome to our website!</h1>
    <p>We offer <strong>exceptional services</strong> for our customers.</p>
    <p>Contact us at: <a href="mailto:contact@example.com">contact@example.com</a></p>
</div>
"""

extractor = HTMLTextExtractor()
extractor.feed(raw_description)
description = extractor.get_text()

print(description)

Output:

Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com

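## Adding Support for Attributes

If you want to capture additional information, such as the links inside tags, here is an enhanced version of the class: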

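class HTMLTextExtractor(HTMLParser):
    """Class for extracting plain text and links from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.href = None  # target of the <a> tag currently open, if any

    def handle_data(self, data):
        self.text.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Remember the target; it is written out after the link text.
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a' and self.href:
            self.text.append(f" (link: {self.href})")
            self.href = None

    def get_text(self):
        # Join the fragments, then collapse runs of whitespace into single spaces.
        return ' '.join(''.join(self.text).split())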

Enhanced output:

Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com (link: mailto:contact@example.com)

## Use Cases

- **SEO**: Clean HTML tags to analyze the plain text content of a webpage.
- **Emails**: Transform HTML emails into plain text for basic email clients.
- **Scraping**: Extract important data from web pages for analysis or storage.
- **Automated Reports**: Simplify API responses containing HTML into readable text.
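
As an illustration of the email use case, the sketch below turns block-level tags into line breaks so that paragraphs stay on separate lines in the plain-text version. The class name `BlockAwareExtractor`, the helper `html_to_text`, and the contents of `BLOCK_TAGS` are assumptions made for this example; a real converter would likely need a longer tag list.

```python
from html.parser import HTMLParser

# Assumed set of tags that should start a new line; adjust as needed.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li", "tr"}


class BlockAwareExtractor(HTMLParser):
    """Extracts text and puts each block-level element on its own line."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        # Normalize spaces within each line, then drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = BlockAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


email_html = (
    "<h1>Order shipped</h1>"
    "<p>Hi <b>Ana</b>,</p>"
    "<p>Your parcel is on its way.<br>Thanks!</p>"
)
print(html_to_text(email_html))
# Order shipped
# Hi Ana,
# Your parcel is on its way.
# Thanks!
```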

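## Advantages of This Approach

- **Lightweight**: No external libraries are required; it is built on Python's native `HTMLParser`.
- **Easy to use**: The logic is encapsulated in a simple, reusable class.
- **Customizable**: It is easy to extend to capture specific information such as attributes or additional tag data.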
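
One customization that often comes up is ignoring the contents of `<script>` and `<style>` elements, which `handle_data` would otherwise receive like any other text. The following is a minimal sketch under that assumption; the name `VisibleTextExtractor` and the `SKIPPED_TAGS` set are hypothetical and not part of the class shown earlier.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Extracts text while ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}  # assumed list of tags to ignore

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text.append(data)

    def get_text(self):
        return " ".join("".join(self.text).split())


page = (
    "<style>p { color: red; }</style>"
    "<p>Visible text</p>"
    "<script>console.log('hidden');</script>"
)
extractor = VisibleTextExtractor()
extractor.feed(page)
extractor.close()
print(extractor.get_text())  # Visible text
```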

## Limitations and Alternatives

While `HTMLParser` is simple and efficient, it has some limitations:

- **Complex HTML**: It may struggle with very complex or poorly formatted HTML documents.
- **Limited Features**: It doesn't provide advanced parsing features like CSS selectors or DOM tree manipulation.
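
To show what the missing DOM means in practice, here is a sketch of collecting the text of every `<li>` element. With a selector-based library this would be a one-line query; with `HTMLParser` the state has to be tracked by hand. The class name `ListItemCollector` is illustrative, and the sketch assumes list items are not nested.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text of each <li> element (nested lists not handled)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # fragments of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            # Normalize whitespace and store the finished item.
            self.items.append(" ".join("".join(self._current).split()))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


collector = ListItemCollector()
collector.feed(
    "<ul><li>Fast <em>setup</em></li><li>No dependencies</li></ul><p>Footer</p>"
)
collector.close()
print(collector.items)  # ['Fast setup', 'No dependencies']
```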

### Alternatives

If you need more robust features, consider using these libraries:

- **BeautifulSoup**: Excellent for complex HTML parsing and manipulation.
- **lxml**: Known for its speed and support for both XML and HTML parsing.
