How Can I Efficiently Strip HTML Tags from Text in Python?

Linda Hamilton
Release: 2024-12-19 22:42:16

Stripping HTML Tags in Python for a Pristine Textual Representation

Processing HTML responses often means extracting the text content and discarding the formatting tags. Stripping the HTML tags leaves you with plain text.

Achieving Text-Only Extraction with an HTMLParser Subclass (MLStripper)

To streamline the stripping process, you can define a small helper class, commonly named MLStripper, on top of the standard library's HTMLParser. MLStripper is not itself part of the standard library. It subclasses HTMLParser, lets the parser consume the markup, and keeps only the text that the parser passes to handle_data.
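To see why collecting the handle_data output is enough to drop the markup, the following sketch logs the callbacks HTMLParser fires for a small fragment. EventLogger and the sample string are hypothetical names used only for illustration. They are not part of the article's code or of the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback as it fires."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Only this callback ever receives the text between tags.
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()

for event in logger.events:
    print(event)

# Keeping just the "data" events reproduces the tag-free text.
text_only = "".join(value for kind, value in logger.events if kind == "data")
print(text_only)  # Hello world
```

Tags arrive through handle_starttag and handle_endtag, so a subclass that overrides only handle_data never sees them.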

Implementation for Python 3 and 2

Depending on your Python version, use one of the following code snippets:

Python 3:

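from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # HTMLParser.__init__ already calls reset(), and the old `strict`
        # flag no longer exists in current Python 3 releases.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()  # accumulates the text found between tags
    def handle_data(self, d):
        # Called for each run of text; entities arrive already decoded.
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()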

Python 2:

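from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()  # initializes the parser state
        self.text = StringIO()  # accumulates the text found between tags
    def handle_data(self, d):
        # Note: Python 2 reports entities such as &amp; through
        # handle_entityref/handle_charref, which this class does not
        # override, so they are dropped from the output.
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # process any input the parser is still buffering
    return s.get_data()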

Usage:

Call the strip_tags function with the HTML input as a string argument. It returns the text with all HTML tags removed.
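A minimal usage sketch, assuming the Python 3 version of strip_tags. The class is repeated so the example runs on its own, the sample strings are made up, and the close() call is included here on the assumption that you want any buffered trailing text flushed.

```python
from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # assumption: flush text the parser may still be holding
    return s.get_data()


# Hypothetical sample input: tags are removed and &amp; is decoded to &.
sample = '<p>Fish &amp; <a href="/chips">chips</a></p>'
print(strip_tags(sample))  # Fish & chips

# Text outside any tag is kept as-is.
print(strip_tags("<b>bold</b> and plain"))  # bold and plain
```

Because convert_charrefs is True, character references are decoded before they reach handle_data, so the result is readable text rather than raw entities.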

This technique is useful whenever you need clean, plain text extracted from HTML sources.
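One caveat: HTMLParser also passes the contents of script and style elements to handle_data, so a stripper that only collects data keeps JavaScript and CSS source in its output. If that is unwanted, one possible variant is sketched below. VisibleTextStripper and strip_visible are hypothetical names introduced for this example, not part of the original code, and the approach assumes reasonably well-formed input.

```python
from io import StringIO
from html.parser import HTMLParser


class VisibleTextStripper(HTMLParser):
    """Hypothetical variant that ignores text inside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, d):
        # Keep text only when we are not inside <script> or <style>.
        if not self.skip_depth:
            self.text.write(d)


def strip_visible(html):
    s = VisibleTextStripper()
    s.feed(html)
    s.close()
    return s.text.getvalue()


page = "<style>p{color:red}</style><p>Hi</p><script>alert(1)</script>"
print(strip_visible(page))  # Hi
```

Neither version sanitizes untrusted HTML. Both only extract text, so output that is placed back into a web page still needs proper escaping.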
