How to Extract Visible Text from Webpages with BeautifulSoup?

Nov 17, 2024, 07:43 AM

Preserving Visible Text from Webpages with BeautifulSoup

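Extracting the visible text from a webpage is harder than it looks, because the markup also contains scripts, style rules, comments, and metadata that a browser never renders as page content. With BeautifulSoup, one approach is to collect every text node with findAll(text=True) (spelled find_all(string=True) in current releases) and then filter out the nodes that are not visible.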

Identifying Visible Text

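A text node counts as visible when it passes both of the following checks:

Its parent is not a <style>, <script>, <head>, <title>, or <meta> element, and not the [document] root (the name BeautifulSoup gives its top-level object, which directly holds nodes such as the doctype).
It is not a Comment object.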
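To make the two checks concrete, the following minimal sketch lists every text node of a small page together with its parent's name and whether it is a Comment. The SAMPLE_HTML string and the nodes variable are made up for illustration and are not part of the original article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = (
    "<!DOCTYPE html>"
    "<html><head><title>Demo</title></head>"
    "<body><!-- a note --><p>Hello</p></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Pair each text node with its parent's name and a flag for Comment nodes.
nodes = [
    (node.parent.name, isinstance(node, Comment), str(node))
    for node in soup.find_all(string=True)
]

for parent_name, is_comment, text in nodes:
    print(parent_name, is_comment, repr(text))
```

In this sample only the "Hello" node has an ordinary parent (p) and is not a comment, so it is the only one the criteria above would keep.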

Implementing the Solution

Define a Visibility Filter:

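from bs4.element import Comment

def tag_visible(element):
    # Text inside these parents is not rendered as page content.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are text nodes too, but they are never displayed.
    if isinstance(element, Comment):
        return False
    return True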

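As a quick check of the filter, the sketch below runs it over a tiny hand-written page. The filter is repeated so the example runs on its own; SAMPLE_HTML and kept are hypothetical names chosen for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def tag_visible(element):
    # Same filter as in the article, repeated so this sketch is self-contained.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


# Hypothetical sample markup: a title, a comment, a paragraph, and a script.
SAMPLE_HTML = (
    "<head><title>Demo</title></head>"
    "<body><!-- hidden --><p>Shown</p><script>var x = 1;</script></body>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the text nodes that pass the visibility filter.
kept = [str(node) for node in soup.find_all(string=True) if tag_visible(node)]
print(kept)  # ['Shown']
```

The title, the comment, and the script body are all rejected, which leaves the paragraph text.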

Extract Visible Text:

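from bs4 import BeautifulSoup
import urllib.request

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # Collect every text node; findAll(text=True) is the legacy spelling.
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    # Strip each node and join the results with single spaces.
    return " ".join(t.strip() for t in visible_texts)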

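Because text_from_html joins every visible node, including nodes that contain only whitespace, its result can contain runs of spaces. One possible variation, sketched below under the hypothetical name compact_text_from_html, drops nodes that are empty after stripping. It is an optional refinement, not part of the original solution.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def tag_visible(element):
    # Same filter as in the article, repeated so this sketch is self-contained.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def compact_text_from_html(body):
    """Hypothetical variant: like text_from_html, but skips empty strings."""
    soup = BeautifulSoup(body, "html.parser")
    visible = filter(tag_visible, soup.find_all(string=True))
    stripped = (t.strip() for t in visible)
    # Discard nodes that were only whitespace before joining.
    return " ".join(s for s in stripped if s)


sample = "<body>\n<p>One</p>\n\n<p>Two</p>\n<script>x()</script></body>"
compact = compact_text_from_html(sample)
print(repr(compact))  # 'One Two'
```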

Sample Usage:

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

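A live request can fail or hang, so a slightly more defensive version of the sample usage may be useful. The sketch below wraps the fetch in a hypothetical helper, visible_text_from_url, with a timeout and basic error handling. The helper name, the 10-second default, and the data: URL used for the offline demo are assumptions made for illustration.

```python
import urllib.error
import urllib.request

from bs4 import BeautifulSoup
from bs4.element import Comment


def tag_visible(element):
    # Same filter as in the article, repeated so this sketch is self-contained.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    return " ".join(t.strip() for t in filter(tag_visible, texts))


def visible_text_from_url(url, timeout=10):
    """Hypothetical helper: return the visible text at url, or None on failure."""
    try:
        # urlopen returns bytes from read(); BeautifulSoup accepts bytes directly.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP failures land here too.
        print(f"Could not fetch {url}: {exc}")
        return None
    return text_from_html(body)


# A data: URL keeps this demo offline; it encodes <p>Hello</p><script>x()</script>.
demo_url = "data:text/html,%3Cp%3EHello%3C%2Fp%3E%3Cscript%3Ex()%3C%2Fscript%3E"
result = visible_text_from_url(demo_url)
print(result)
```

Passing a regular http or https address works the same way. Note that a timeout during the read itself may surface as a different exception than URLError, so production code may need broader handling.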

Output:

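This code prints the visible text of the specified page as a single string, with the contents of scripts, style rules, comments, and the document head left out. Whitespace-only text nodes are not dropped, so the output may contain runs of spaces.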
