Extracting Visible Webpage Text with BeautifulSoup
Many web-scraping tasks involve retrieving only the visible text of a webpage, excluding content from elements such as scripts, CSS styles, and HTML comments. With BeautifulSoup, this is straightforward once you filter out the right nodes.
A common pitfall is that findAll(text=True) returns every text node in the document, including those buried inside script and style tags and inside HTML comments. To address this, we can define a custom filter that excludes text nodes belonging to those elements.
The following code exemplifies this approach:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    # Reject text nodes that live inside non-visible elements.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Reject HTML comments.
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
The tag_visible function rejects a text node if its parent element is one of the undesirable tags or if the node itself is a comment. The nodes that pass the filter are stripped of surrounding whitespace and combined into a single string with u" ".join(t.strip() for t in visible_texts).
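As a quick sanity check that needs no network access, the same helpers can be applied to a small inline snippet (the sample HTML below is illustrative and not part of the original article):

sample = """
<html>
  <head><title>Page title</title><style>body { color: red; }</style></head>
  <body>
    <script>console.log('hidden');</script>
    <!-- an HTML comment -->
    <p>Visible paragraph text.</p>
  </body>
</html>
"""
# Only the paragraph text survives the filter; the title, style rules,
# script code, and the comment are all dropped (some joining whitespace remains).
print(text_from_html(sample))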
This approach effectively extracts only the visible text from a webpage, leaving out unnecessary elements like scripts and comments.
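If you prefer not to filter individual text nodes, a roughly equivalent sketch is to remove the unwanted tags in place and let get_text() handle the joining. The helper name visible_text and the exact tag list below are assumptions for illustration, not part of the original code; the list mirrors the filter above (head and [document] are omitted because they contain only whitespace directly).

from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Remove non-visible tags together with everything inside them.
    for tag in soup(['script', 'style', 'title', 'meta']):
        tag.decompose()
    # Remove HTML comments, which are text nodes rather than tags.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Join the remaining text, stripping whitespace around each fragment.
    return soup.get_text(separator=' ', strip=True)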