Table of Contents
Install
Install parser
Objects in Beautiful Soup
Get the title, headings, and links
Handling multi-valued and duplicate attributes
Navigating the DOM
Parsing only part of a document
Final thoughts

Using Beautiful Soup for web scraping in Python: basic knowledge exploration

Sep 02, 2023 am 10:49 AM


In a previous tutorial, I showed you how to access a web page through Python using the Requests module. That tutorial covered topics such as making GET/POST requests and programmatically downloading content such as images or PDFs. One thing it did not cover was how to scrape the pages you fetch to extract the information you need.

In this tutorial, you will learn about Beautiful Soup, a Python library for extracting data from HTML files. This tutorial focuses on learning the basics of the library, with the next tutorial covering more advanced topics. Please note that all examples in this tutorial use Beautiful Soup 4.

Install

You can install Beautiful Soup 4 using pip. The package name is beautifulsoup4. Current releases require Python 3; version 4.9.3 was the last release to support Python 2.

$ pip install beautifulsoup4

If pip is not installed on your system, you can directly download the Beautiful Soup 4 source code tarball and install it using setup.py.

$ python setup.py install

Beautiful Soup was originally packaged as Python 2 code. When you install it for use with Python 3, it is automatically converted to Python 3 code. The code is not converted unless you install the package. Here are some common errors you may notice:

  • When you run the Python 2 version of the code under Python 3, a "No module named HTMLParser" ImportError appears.
  • When you run the Python 3 version of the code under Python 2, a "No module named html.parser" ImportError appears.

Both of the above errors can be corrected by uninstalling and reinstalling Beautiful Soup.
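After reinstalling, a quick way to confirm that the package imports cleanly, and to see which release you have, is to print its version string. This is only a minimal sanity check; the variable name installed_version is an illustrative choice.

```python
import bs4

# __version__ is a plain string such as "4.12.3".
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

The version number matters later in this tutorial, because some constructor arguments are only available in newer releases.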

Install parser

Before discussing the differences between the different parsers that Beautiful Soup can use, let's write the code to create a soup.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser")

The BeautifulSoup constructor accepts two arguments. The first is the markup itself and the second is the parser you want to use. The available parsers are html.parser, lxml, and html5lib. The lxml library provides two parsers: an HTML parser and an XML parser.

html.parser is a built-in parser that doesn't work well in older versions of Python. You can install additional parsers using the following command:

$ pip install lxml
$ pip install html5lib

The lxml parser is very fast and can be used to quickly parse the given HTML. On the other hand, the html5lib parser is very slow, but also very lenient. Here are examples using each parser:

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser")
print(soup)
# <html><p>This is <b>invalid HTML</b></p></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "lxml")
print(soup)
# <html><body><p>This is <b>invalid HTML</b></p></body></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "xml")
print(soup)
# <?xml version="1.0" encoding="utf-8"?>
# <html><p>This is <b>invalid HTML</b></p></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html5lib")
print(soup)
# <html><head></head><body><p>This is <b>invalid HTML</b></p></body></html>

The differences outlined in the example above only matter if you are parsing invalid HTML. However, most HTML on the web is malformed, and understanding these differences will help you debug some parsing errors and decide which parser to use in your project. In general, the lxml parser is a very good choice.
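If you want to prefer lxml but cannot be sure it is installed everywhere your code runs, one option is to fall back to the built-in parser. The sketch below assumes a helper named make_soup, which is not part of Beautiful Soup; it relies on the FeatureNotFound exception that the library raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Build a soup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><p>This is <b>invalid HTML</p></html>")
print(soup.b.string)
```

Keep in mind that the two parsers can repair broken markup differently, so a fallback like this may change the shape of the resulting tree.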

Objects in Beautiful Soup

Beautiful Soup parses the given HTML document into a tree of Python objects. There are four main Python objects you need to know: Tag, NavigableString, BeautifulSoup, and Comment.

A Tag object corresponds to an actual XML or HTML tag in the document. You can access the name of a tag using tag.name. You can also set the tag's name to something else. The name change will be visible in the markup generated by Beautiful Soup.

You can access different attributes, such as the tag's class and id, using tag['class'] and tag['id'] respectively. You can also access the entire attribute dictionary using tag.attrs. You can also add, delete, or modify a tag's attributes. Attributes like an element's class can take multiple values and are stored as a list.
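The operations described above can be tried on a small, made-up fragment. The markup and attribute values here are purely illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="note big" id="intro">Hello</b>', "html.parser")
tag = soup.b

print(tag.name)      # b
print(tag["class"])  # ['note', 'big']
print(tag["id"])     # intro

tag.name = "strong"     # rename the tag
tag["id"] = "greeting"  # modify an existing attribute
tag["lang"] = "en"      # add a new attribute
del tag["class"]        # delete an attribute

print(tag.attrs)  # {'id': 'greeting', 'lang': 'en'}
print(tag)        # <strong id="greeting" lang="en">Hello</strong>
```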

The text within a tag is stored in Beautiful Soup as a NavigableString. It has some useful methods such as replace_with("string") to replace text within a tag. You can also use str() (unicode() in Python 2) to convert a NavigableString to a plain Unicode string.
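A short sketch of both operations, using a made-up paragraph:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>Old text</p>", "html.parser")
text = soup.p.string
print(type(text).__name__)  # NavigableString

# Convert to a plain Python string that carries no reference to the tree.
plain = str(text)

# Replace the text inside the tag.
text.replace_with("New text")
print(soup.p)  # <p>New text</p>
```

Converting to a plain string is useful when you want to keep the text around after you are done with the soup, since a NavigableString keeps a reference to the whole parse tree.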

Beautiful Soup also allows you to access comments in web pages. These comments are stored as a Comment object, which is also basically a NavigableString.
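As a minimal illustration with a made-up fragment, a comment shows up in the tree as a Comment object, and you can collect every comment by filtering strings on their type:

```python
from bs4 import BeautifulSoup, Comment

markup = "<p><!-- reviewed by editor -->Visible text</p>"
soup = BeautifulSoup(markup, "html.parser")

first = soup.p.contents[0]
print(type(first).__name__)  # Comment

# Collect every comment in the document by checking the type of each string.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))  # 1
```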

You already learned about the BeautifulSoup object in the previous section. It is used to represent the entire document. Since it doesn't correspond to an actual tag, it has no attributes, and its name is the special value "[document]".
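You can check this on any soup. The fragment below is a made-up one-liner:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi</p>", "html.parser")

print(soup.name)              # [document]
print(soup.attrs)             # {}
print(soup.p.parent is soup)  # True
```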

Get the title, headings, and links

You can easily extract page titles and other such data using Beautiful Soup. Let's scrape the Wikipedia page about Python. First, fetch the page markup with the Requests module, as described in the earlier tutorial.

import requests
from bs4 import BeautifulSoup

req = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
soup = BeautifulSoup(req.text, "lxml")

Now that you have created the soup, you can get the title of the web page using the following code:

soup.title
# <title>Python (programming language) - Wikipedia</title>

soup.title.name
# 'title'

soup.title.string
# 'Python (programming language) - Wikipedia'

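You can also scrape the page for other information, such as the main heading or the first paragraph, along with their class or id attributes.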

soup.h1
# <h1 class="firstHeading" id="firstHeading" lang="en">Python (programming language)</h1>

soup.h1.string
# 'Python (programming language)'

soup.h1['class']
# ['firstHeading']

soup.h1['id']
# 'firstHeading'

soup.h1.attrs
# {'class': ['firstHeading'], 'id': 'firstHeading', 'lang': 'en'}

soup.h1['class'] = 'firstHeading, mainHeading'
soup.h1.string.replace_with("Python - Programming Language")
del soup.h1['lang']
del soup.h1['id']

soup.h1
# <h1 class="firstHeading, mainHeading">Python - Programming Language</h1>

Similarly, you can iterate over all the links or subheadings in a document using the following code:

for sub_heading in soup.find_all('h2'):
    print(sub_heading.text)
    
# all the sub-headings like Contents, History[edit]...
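The same pattern works for links. The sketch below uses a small, made-up fragment instead of the live Wikipedia page so that it runs without a network connection; passing href=True to find_all() skips anchors that have no href attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
markup = """
<h2>History</h2>
<p><a href="/wiki/History">history</a> and <a name="anchor-only">an anchor</a></p>
<h2>Syntax</h2>
<p><a href="/wiki/Indentation">indentation</a></p>
"""

soup = BeautifulSoup(markup, "html.parser")

headings = [h2.get_text() for h2 in soup.find_all("h2")]
links = [a["href"] for a in soup.find_all("a", href=True)]

print(headings)  # ['History', 'Syntax']
print(links)     # ['/wiki/History', '/wiki/Indentation']
```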

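Handling multi-valued and duplicate attributes

Different elements in an HTML document use various attributes for different purposes. For example, you can add class or id attributes to style, group, or identify elements. Similarly, you can use data attributes to store any additional information. Not all attributes can accept multiple values, but a few can. The HTML specification has a clear set of rules for these situations, and Beautiful Soup tries to follow all of them. However, it also lets you specify how to handle the data returned by multi-valued attributes. This feature was added in version 4.8, so make sure you have the right version installed before using it.

By default, attributes like class that can have multiple values return a list, while attributes like id return a single string value. You can pass an argument called multi_valued_attributes to the BeautifulSoup constructor and set its value to None. This ensures that the value returned by every attribute is a string.

Here is an example: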

from bs4 import BeautifulSoup

markup = '''
<a class="notice light" id="recent-posts" data-links="1 5 20" href="/recent-posts/">Recent Posts</a>
'''

soup = BeautifulSoup(markup, 'html.parser')
print(soup.a['class'])
print(soup.a['id'])
print(soup.a['data-links'] + "\n")
''' 
Output:
['notice', 'light']
recent-posts
1 5 20
'''


soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=None)

print(soup.a['class'])
print(soup.a['id'])
print(soup.a['data-links'] + "\n")
'''
Output:
notice light
recent-posts
1 5 20
'''
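If you would rather always get a list back, whatever the attribute, the get_attribute_list() method of a tag is one option. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

markup = '<a class="notice light" id="recent-posts">Recent Posts</a>'
soup = BeautifulSoup(markup, "html.parser")

# Both calls return a list, even though id holds a single value.
class_values = soup.a.get_attribute_list("class")
id_values = soup.a.get_attribute_list("id")

print(class_values)  # ['notice', 'light']
print(id_values)     # ['recent-posts']
```

This keeps downstream code simple, because it never has to check whether a value is a string or a list.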

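There is no guarantee that the HTML you get from different websites will always be completely valid. It can have many different problems, such as duplicate attributes. Starting with version 4.9.1, Beautiful Soup lets you specify what should happen in such cases by setting a value for the on_duplicate_attribute argument. Different parsers handle this issue differently, and you will need to use the built-in html.parser to force a specific behavior.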

from bs4 import BeautifulSoup

markup = '''
<a class="notice light" href="/recent-posts/" class="important dark">Recent Posts</a>
'''

soup = BeautifulSoup(markup, 'lxml')
print(soup.a['class'])
# ['notice', 'light']

soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
print(soup.a['class'])
# ['notice', 'light']

soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
print(soup.a['class'])
# ['important', 'dark']
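Besides 'ignore' and 'replace', the on_duplicate_attribute argument can also be given a function, which lets you keep every value instead of choosing one. The sketch below assumes Beautiful Soup 4.9.1 or later and the built-in html.parser; the function name accumulate and the markup are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="/first/" href="/second/">Recent Posts</a>'


def accumulate(attributes_so_far, key, value):
    """Collect every value of a repeated attribute into a list."""
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)


soup = BeautifulSoup(markup, "html.parser", on_duplicate_attribute=accumulate)
print(soup.a["href"])  # ['/first/', '/second/']
```

The function is called only when an attribute name appears a second time on the same tag, so attributes that occur once keep their usual string values.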

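Navigating the DOM

You can navigate the DOM tree using regular tag names. Chaining those tag names lets you navigate deeper into the tree. For example, you can get the first link in the first paragraph of the given Wikipedia page using soup.p.a. All the links in the first paragraph can be accessed using soup.p.find_all('a').

You can also access all the children of a tag as a list using tag.contents. To get the child at a specific index, use tag.contents[index]. You can also iterate over a tag's children using the .children attribute.

Both .children and .contents are useful only when you want to access the direct, or first-level, descendants of a tag. To get all the descendants, use the .descendants attribute.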

print(soup.p.contents)
# [<b>Python</b>, ' is a widely used ',.....the full list]

print(soup.p.contents[10])
# <a href="/wiki/Readability" title="Readability">readability</a>

for child in soup.p.children:
    print(child.name)
# b
# None
# a
# None
# a
# None
# ... and so on.
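To see how .descendants differs from .children, it helps to use a fragment small enough to count by hand. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>One <b>bold</b> word</p></div>", "html.parser")
div = soup.div

# .children yields only the direct children of the tag.
children = list(div.children)
print(len(children))  # 1

# .descendants walks the whole subtree, including text nodes.
descendants = list(div.descendants)
names = [node.name for node in descendants]
print(names)  # ['p', None, 'b', None, None]
```

The None entries are text nodes: 'One ', 'bold', and ' word'.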

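You can also access the parent of an element using the .parent attribute. Similarly, you can access all the ancestors of an element using the .parents attribute. The parent of the top-level <html> tag is the BeautifulSoup object itself, and its parent is None.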

print(soup.p.parent.name)
# div

for parent in soup.p.parents:
    print(parent.name)
# div
# div
# div
# body
# html
# [document]

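You can access the previous and next siblings of an element using the .previous_sibling and .next_sibling attributes.

For two elements to be siblings, they must have the same parent. This means that the first child of an element has no previous sibling. Similarly, the last child of an element has no next sibling. In actual web pages, the previous and next siblings of an element are most likely to be newline characters.

You can also iterate over all the siblings of an element using .previous_siblings and .next_siblings.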

soup.head.next_sibling
# '\n'

soup.p.a.next_sibling
# ' for '

soup.p.a.previous_sibling
# ' is a widely used '

print(soup.p.b.previous_sibling)
# None
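Iterating over siblings is easiest to follow on markup without whitespace between the tags, so that every sibling is an element. The list below is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")

first = soup.li
last = soup.find_all("li")[-1]

# Siblings that come after the first item, in document order.
following = [sib.get_text() for sib in first.next_siblings]
print(following)  # ['two', 'three']

# Siblings that come before the last item, nearest first.
preceding = [sib.get_text() for sib in last.previous_siblings]
print(preceding)  # ['two', 'one']
```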

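You can go to the element that comes immediately after the current element using the .next_element attribute. To access the element that comes immediately before the current element, use the .previous_element attribute.

Similarly, you can iterate over all the elements that come before and after the current element using .previous_elements and .next_elements respectively.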
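The difference between .next_element and .next_sibling is that the former follows parsing order, so it steps into a tag's contents first. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>link</a> tail</p>", "html.parser")
a = soup.a

# The next sibling is the text that follows the closing </a>.
print(repr(a.next_sibling))  # ' tail'

# The next element is the text inside the <a> tag, since it was parsed next.
print(repr(a.next_element))  # 'link'

# Everything parsed after the opening <p> tag, in order.
elements = [str(e) for e in soup.p.next_elements]
print(elements)  # ['<a>link</a>', 'link', ' tail']
```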

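Parsing only part of a document

Let's say you need to process a large amount of data while looking for something specific, and saving some processing time or memory matters to you. In that case, you can take advantage of the SoupStrainer class in Beautiful Soup. This class lets you focus on specific elements and ignore the rest of the document. For example, you can use it to ignore everything on a web page except images by passing the appropriate selectors to the SoupStrainer constructor.

Keep in mind that SoupStrainer does not work with the html5lib parser. However, you can use it with both lxml and the built-in parser. Here is an example where we parse the Wikipedia page for the United States and get all the images with the class thumbimage.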

import requests
from bs4 import BeautifulSoup, SoupStrainer

req = requests.get('https://en.wikipedia.org/wiki/United_States')

thumb_images = SoupStrainer(class_="thumbimage")

soup = BeautifulSoup(req.text, "lxml", parse_only=thumb_images)

for image in soup.find_all("img"):
    print(image['src'])
'''
Output:
//upload.wikimedia.org/wikipedia/commons/thumb/7/7b/Mesa_Verde_National_Park_-_Cliff_Palace.jpg/220px-Mesa_Verde_National_Park_-_Cliff_Palace.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Map_of_territorial_growth_1775.svg/260px-Map_of_territorial_growth_1775.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Declaration_of_Independence_%281819%29%2C_by_John_Trumbull.jpg/220px-Declaration_of_Independence_%281819%29%2C_by_John_Trumbull.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/9/94/U.S._Territorial_Acquisitions.png/310px-U.S._Territorial_Acquisitions.png
...and many more images
'''

Note that I used class_ instead of class to get these elements because class is a reserved keyword in Python.
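To see what a strainer leaves behind without making a network request, you can try it on a small fragment. The markup below is made up, and this sketch filters by tag name alone rather than by class.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup used only for illustration.
markup = """
<p>Intro text <a href="/about/">About</a></p>
<img src="/a.jpg" alt="first">
<div><img src="/b.png" alt="second"></div>
"""

only_images = SoupStrainer("img")
soup = BeautifulSoup(markup, "html.parser", parse_only=only_images)

sources = [img["src"] for img in soup.find_all("img")]
print(sources)         # ['/a.jpg', '/b.png']
print(soup.find("p"))  # None
```

Because the paragraph, link, and div are never turned into objects, the resulting soup contains only the two images.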

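Final thoughts

After completing this tutorial, you should have a good understanding of the main differences between the different HTML parsers. You should also be able to navigate a web page and extract important data. This can be helpful when you want to analyze all the headings or links on a given website.

In the next part of the series, you will learn how to use the Beautiful Soup library to search and modify the DOM.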
