


How to Extract Visible Text from Webpages with BeautifulSoup?
Nov 17, 2024, 07:43 AM
Preserving Visible Text from Webpages with BeautifulSoup
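Extracting the visible text from a webpage is harder than it looks, because scripts, style sheets, comments, and other non-rendered elements are mixed in with the content. BeautifulSoup's find_all() method (findAll() is its older alias) can collect every text node in the document, which you can then filter down to what a reader would actually see.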
Identifying Visible Text
To target visible text, apply the following criteria to each text node:
- Ignore text nodes whose parent is <style>, <script>, <head>, <title>, <meta>, or [document] (the name BeautifulSoup gives to the top-level document object).
- Filter out Comment objects, which represent HTML comments.
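To make the two criteria concrete, here is a minimal sketch that lists every text node together with its parent tag name and whether it is a Comment. The sample markup and the names SAMPLE_HTML and nodes are assumptions made up for illustration; they are not part of the original article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample markup, chosen only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Hello</p><!-- hidden note --><script>var x = 1;</script></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect (parent tag name, is-comment flag, text) for every text node.
nodes = [
    (node.parent.name, isinstance(node, Comment), str(node))
    for node in soup.find_all(string=True)
]

for parent_name, is_comment, text in nodes:
    print(parent_name, is_comment, repr(text))
```

Only the node whose parent is p passes both criteria here: the others either sit inside title, style, or script, or are a Comment.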
Implementing the Solution
- Define a Visibility Filter:
    from bs4.element import Comment

    def tag_visible(element):
        # Text inside these tags is never rendered to the reader.
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        # HTML comments are not displayed either.
        if isinstance(element, Comment):
            return False
        return True
- Extract Visible Text:
    from bs4 import BeautifulSoup
    import urllib.request

    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        # string=True matches every text node; it is the current name for the older text=True argument.
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        return " ".join(t.strip() for t in visible_texts)
- Sample Usage:
    html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
    print(text_from_html(html))
Output:
Running this code prints the visible text of the specified webpage, with scripts, styles, comments, and other non-rendered content excluded.
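Because the sample usage depends on a live URL, an offline check can be useful. The sketch below runs the same filter against an inline HTML string. The markup, the names SAMPLE_HTML, HIDDEN_PARENTS, and result, and the extra step that drops whitespace-only nodes are assumptions added for illustration; the article's version joins every stripped node, including empty ones.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tag names whose text is not shown to the reader (same list as the article).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def tag_visible(element):
    """Return True if a text node would be visible on the rendered page."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    return not isinstance(element, Comment)


def text_from_html(body):
    soup = BeautifulSoup(body, "html.parser")
    texts = soup.find_all(string=True)
    stripped = (t.strip() for t in texts if tag_visible(t))
    # Variation on the article's version (an assumption, not the original):
    # skip whitespace-only nodes so the result has no runs of spaces.
    return " ".join(s for s in stripped if s)


# Hypothetical sample page, invented for this example.
SAMPLE_HTML = """<html>
<head><title>Storm report</title><style>body {margin: 0}</style></head>
<body>
<h1>Storm</h1>
<!-- editor note -->
<p>Snow fell <b>overnight</b>.</p>
<script>track();</script>
</body>
</html>"""

result = text_from_html(SAMPLE_HTML)
print(result)
```

Note that each text node is stripped and then joined with a single space, so inline tags such as <b> can leave a space before the punctuation that follows them. If that matters, joining without stripping, or normalizing whitespace afterwards, may suit your data better.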