Stripping HTML from Strings in Python
When working with HTML content, you often need to separate the meaningful text from the markup tags for further processing or analysis. Here's how to do this in Python without third-party dependencies.
To strip HTML tags from a string, use the HTMLParser class from the Python standard library. The parser walks through the markup and calls a handler method for each piece it finds, so a subclass can collect the text and ignore the tags.
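To make the parsing model concrete, here is a minimal sketch that records which handler methods fire for a small input. The class name EventLogger and its events list are hypothetical names used only for this illustration; the handler methods themselves are part of html.parser.HTMLParser.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback as a tuple."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for an opening tag such as <p> or <em>.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for a closing tag such as </p>.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text between tags.
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello, <em>world</em>!</p>")
logger.close()
for event in logger.events:
    print(event)
```

Only the "data" events carry text, which is why a tag stripper needs to override handle_data and nothing else.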
For Python 3, use the following code:

    from io import StringIO
    from html.parser import HTMLParser


    class TagStripper(HTMLParser):
        def __init__(self):
            # convert_charrefs=True decodes entities such as &amp; before
            # they reach handle_data(); it is the default since Python 3.5.
            super().__init__(convert_charrefs=True)
            self.text = StringIO()

        def handle_data(self, data):
            # Called for each run of text between tags.
            self.text.write(data)

        def get_data(self):
            return self.text.getvalue()


    def strip_html(html):
        stripper = TagStripper()
        stripper.feed(html)
        stripper.close()  # flush any text the parser is still buffering
        return stripper.get_data()
For Python 2, use the following code:
    from HTMLParser import HTMLParser
    from StringIO import StringIO


    class TagStripper(HTMLParser):
        def __init__(self):
            self.reset()
            self.text = StringIO()

        def handle_data(self, data):
            self.text.write(data)

        def get_data(self):
            return self.text.getvalue()


    def strip_html(html):
        stripper = TagStripper()
        stripper.feed(html)
        return stripper.get_data()
Now, let's illustrate its usage:
    html = "<p>Hello, <em>world</em>!</p>"
    stripped_text = strip_html(html)
    print(stripped_text)
    # Output: Hello, world!
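One caveat: HTMLParser also passes the contents of <script> and <style> elements to handle_data, so a stripper that only overrides handle_data keeps JavaScript and CSS source as if it were text. If that is unwanted, one possible approach is to track whether the parser is currently inside such an element. The sketch below is a hypothetical variant, not part of the code above; the names VisibleTextStripper, SKIPPED_TAGS, skip_depth, and strip_visible_text are assumptions made for this example.

```python
from html.parser import HTMLParser


class VisibleTextStripper(HTMLParser):
    """Hypothetical variant that ignores <script> and <style> bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)

    def get_data(self):
        return "".join(self.parts)


def strip_visible_text(html):
    stripper = VisibleTextStripper()
    stripper.feed(html)
    stripper.close()  # flush any text the parser is still buffering
    return stripper.get_data()


print(strip_visible_text("<p>Hi</p><script>var x = 1;</script> there"))
print(strip_visible_text("<p>Fish &amp; chips</p>"))
```

The second call shows that character references are decoded along the way, because convert_charrefs is enabled.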