
How Can I Efficiently Extract Text Content from HTML Strings in Python?

Mary-Kate Olsen
Release: 2024-12-05 07:41:09


Extracting Content from HTML Strings in Python

When working with HTML data in Python, it's often desirable to strip out the formatting tags and retain only the text content. This simplified view of the data can be useful for summarizing text, performing natural language processing, and other tasks.

One way to do this is to define a small MLStripper class that subclasses HTMLParser from Python's standard library and collects only the text that the parser passes to handle_data.

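# For Python 3+
from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    """Collect only the text nodes of an HTML string."""
    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text reaches handle_data
        super().__init__(convert_charrefs=True)
        self.text = StringIO()
    def handle_data(self, d):
        # Called by the parser for each run of text between tags
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # process any text the parser is still buffering
    return s.get_data()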
# For Python 2
from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

By passing HTML content to the strip_tags function, you can easily extract only the text portions of the HTML:

cleaned_content = strip_tags("<b>Hello</b> world")
print(cleaned_content)  # Prints "Hello world"
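To see how this approach behaves on slightly messier input, here is a minimal, self-contained sketch that restates the Python 3 stripper in compact form. The class name TextCollector is hypothetical and used only for this illustration. It shows two things: character references are decoded, and tags are dropped without inserting any separator in their place.

```python
from io import StringIO
from html.parser import HTMLParser

class TextCollector(HTMLParser):  # hypothetical name, same idea as MLStripper
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        # Receives text between tags, with entities already decoded
        self.text.write(data)

def strip_tags(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text.getvalue()

print(strip_tags("<p>Fish &amp; <em>chips</em></p>"))  # Fish & chips
print(strip_tags("<a href='/x'>link</a><br>next"))     # linknext
```

The second call suggests a practical caveat: because a tag such as <br> is simply removed, adjacent text runs are joined directly. If word boundaries matter, one option is to collect the pieces in a list and join them with a space instead.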

Together, the MLStripper class and the strip_tags function rely only on the standard library and return the text of an HTML string with all tags removed.
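One limitation worth knowing: HTMLParser also passes the contents of script and style elements to handle_data, so a stripper like the one above keeps JavaScript and CSS source in its output. The following is a hedged sketch of one way to skip that content. The names VisibleTextStripper and strip_visible_text are assumptions introduced for this example, not part of the original code.

```python
from html.parser import HTMLParser

class VisibleTextStripper(HTMLParser):  # hypothetical name
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>
        if not self.skip_depth:
            self.parts.append(data)

def strip_visible_text(html):
    parser = VisibleTextStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

page = "<style>p {color: red}</style><p>Hello</p><script>alert('x')</script>"
print(strip_visible_text(page))  # Hello
```

The depth counter is a simple design choice that keeps the handler logic to a few lines. For heavily malformed markup, a dedicated HTML library may be more robust than this sketch.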


source:php.cn