How to Extract Only Visible Text from Web Pages Using BeautifulSoup?

Susan Sarandon
Release: 2024-11-14 18:56:02

Webpage Text Extraction with BeautifulSoup: Extracting Visible Text Exclusively

Web scraping often means pulling the readable text out of a web page. BeautifulSoup, a widely used HTML parsing library, exposes every text node in a document, including the contents of scripts, stylesheets, and comments. Extracting only the text a visitor would actually see therefore requires filtering those nodes out.

Identifying Visible Text

To decide whether a text node should count as visible, the example below defines a small helper named tag_visible. It is not part of BeautifulSoup itself. The helper returns False in two cases: when the node's parent tag is one of style, script, head, title, meta, or [document] (the name BeautifulSoup gives the root object, which covers top-level nodes such as the doctype declaration), and when the node is an HTML comment. In every other case it returns True and the text is kept.
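To make the two checks concrete, here is a minimal sketch that runs the same helper over a tiny page. The SAMPLE_HTML string and the EXCLUDED_PARENTS name are made up for illustration; only the tag list itself comes from the article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Hello</p><!-- hidden note --><script>var x = 1;</script></body></html>"
)

# Same tag names as the article's helper, stored in a set (assumed name).
EXCLUDED_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}

def tag_visible(element):
    """Return False for comments and for text inside excluded tags."""
    if element.parent.name in EXCLUDED_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
nodes = soup.find_all(string=True)

# Show what each check sees: parent tag, comment or not, and the verdict.
for node in nodes:
    print(node.parent.name, isinstance(node, Comment), tag_visible(node))

visible = [str(node) for node in nodes if tag_visible(node)]
print(visible)  # ['Hello']
```

Note that the comment sits directly inside body, so only the isinstance check rejects it; the parent-name check alone would let it through.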

Extracting Visible Text

To extract the visible text from a web page, follow these steps:

  1. Create a BeautifulSoup object from the page's HTML.
  2. Collect every text node with find_all(string=True). The older spelling findAll(text=True) does the same thing but is a legacy form.
  3. Filter the text nodes with the tag_visible function to drop the unwanted ones.
  4. Strip leading and trailing whitespace from each remaining string and join the results with spaces.
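Step 4 has one wrinkle worth seeing in isolation: whitespace-only text nodes between tags become empty strings after strip(), so a plain join leaves repeated spaces in the result. The sketch below uses a hypothetical list of strings standing in for the filtered text nodes, and shows one possible way to skip the empty pieces.

```python
# Hypothetical stand-ins for the strings that survive the visibility filter.
chunks = ["  Hello ", "\n", "world\n\n", "   "]

# Plain join, as in step 4: empty strings still contribute a separator.
naive = " ".join(t.strip() for t in chunks)
print(repr(naive))    # 'Hello  world '

# Variant that drops strings which are empty after stripping.
stripped = (t.strip() for t in chunks)
compact = " ".join(s for s in stripped if s)
print(repr(compact))  # 'Hello world'
```

Whether the extra spaces matter depends on what happens to the text afterwards; the compact form is simply easier to read and compare.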

Example Usage

The code below demonstrates how to use these techniques to extract visible text from a web page:

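import urllib.request

from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Text inside these tags is not rendered as page content.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are never displayed.
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node; it is the current spelling of findAll(text=True).
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    # Trim each string, then join the pieces with single spaces.
    return " ".join(t.strip() for t in visible_texts)

# Fetch the raw page bytes (requires network access) and print the visible text.
html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))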

This approach removes text that comes from scripts, stylesheets, comments, and document metadata. It works on the markup alone, so text that is hidden only by CSS (for example with display: none) is still returned.
