
Crawler analysis method 2: Beautifulsoup

Jun 05, 2019, 1:25 PM
beautifulsoup python crawler

Many languages can be used to write crawlers, but Python-based crawlers are more concise and convenient, and crawling has become an essential part of the Python ecosystem. There are likewise many ways to parse the pages a crawler fetches.

By now you have probably mastered the Requests library. But once Requests has fetched a page's HTML, how do we grab the information we actually want? You have likely tried several approaches, such as the string find method or, more advanced, regular expressions. Regular expressions can match the information we need, but anyone who has tweaked a matching pattern over and over just to capture one string knows how frustrating that gets.

Naturally, we wonder whether there is a more convenient tool. The answer is yes: we have a powerful tool called BeautifulSoup, which makes it easy to extract the content of HTML or XML tags. In this article, let us walk through BeautifulSoup's most common methods.
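As a quick preview of the workflow this article builds toward, here is a minimal sketch (the URL is just a placeholder) that fetches a page with Requests and hands the HTML to BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'http://example.com/'  # placeholder; substitute the page you want to crawl
response = requests.get(url)
response.encoding = response.apparent_encoding  # guard against mis-detected encodings

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.string)          # text of the <title> tag
for link in soup.find_all('a'):   # every anchor tag on the page
    print(link.get('href'))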

The previous article covered crawler analysis method 1: JSON parsing; this article brings you BeautifulSoup parsing.

What is BeautifulSoup?

Web page parsing in Python can be done with regular expressions alone, but then we have to write a matching rule for every piece we want, and the overall implementation gets very complicated. BeautifulSoup, by contrast, is a convenient and efficient web page parsing library that supports multiple parsers. In most cases, we can use it to extract web page information easily, without writing any regular expressions.

Official Document

Installation: $ pip install beautifulsoup4

BeautifulSoup is a web page parsing library that supports many parsers, but two are the most mainstream: Python's standard library parser and the lxml HTML parser. Their usage is similar:

from bs4 import BeautifulSoup
 
# Python's standard library
BeautifulSoup(html, 'html.parser')
 
# lxml
BeautifulSoup(html, 'lxml')

The execution speed of Python's built-in standard library parser is average, and on older Python versions its tolerance for malformed Chinese content is relatively poor. The lxml HTML parser is fast, but it requires installing C-language dependencies.
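If you are not sure whether lxml is available on a given machine, one defensive pattern (just a sketch, not something BeautifulSoup requires) is to try lxml first and fall back to the standard library:

from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    # Prefer lxml for speed; fall back to the built-in parser if lxml is absent.
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        return BeautifulSoup(html, 'html.parser')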

Installation of lxml

Since lxml depends on C libraries, installing it on Windows can surface all kinds of strange errors. If you are lucky, a plain pip install lxml succeeds on the first try, but many people stumble here.

It is recommended to install lxml from a .whl file instead. First we need the wheel library, which enables installing .whl files: pip install wheel

Then download the lxml .whl file matching your system and Python version from the official site.

If you do not know your system and Python version, open the command prompt (CMD) or Python's IDLE and run the following code:

import pip

# Note: this relies on pip internals that newer pip releases removed.
print(pip.pep425tags.get_supported())

This prints the supported platform tags, including the Python version information.
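Note that pep425tags was removed from pip's internals in later releases, so the snippet above only works with older pip versions. On a recent pip, a rough equivalent is the command pip debug --verbose, which prints the Python version and the list of compatible wheel tags.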
After downloading the lxml file, note its location, then open the command prompt again and install it with pip: pip install <full name of the .whl file>

After the installation completes, enter Python and import lxml; if no error is reported, congratulations, the installation succeeded.
If this feels like too much trouble, I recommend installing Anaconda instead (if the download is slow, look for a domestic mirror); Google it if you have not heard of it. With Anaconda, the pip installation errors that plague Windows simply go away.


BeautifulSoup’s basic tag selection method

Although Python's built-in standard library parser is not bad, I still recommend lxml because it is fast enough. The demonstrations below all use the lxml parser.
Let's first load the example from the official documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""

From this HTML code we can construct a BeautifulSoup object and print it in a standard indented structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

Notice that the HTML above is incomplete. The parser completes the missing pieces of the tree, and the prettify() method prints it in a neat, indented format. The commented part is the output:

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Get tag

print(soup.title)
# <title>The Dormouse's story</title>

From the output we can see that the obtained content is, in fact, the title tag from the HTML code.
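Keep in mind that this attribute-style access returns only the first matching tag in the document. A quick check with the same soup:

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>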

Get the name

print(soup.title.name)
# 'title'

This is simply the name of the tag.

Get attributes

print(soup.p.attrs['class'])
# ['title']

print(soup.p['class'])
# ['title']

To get a tag's attributes, we can use the attrs dictionary and index it with the attribute name. The results also show that indexing the tag directly with the attribute name works just as well. (Because class is a multi-valued attribute, its value comes back as a list.)
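If you want every attribute at once, accessing attrs without a key returns the whole dictionary. A quick illustration with the same soup:

print(soup.p.attrs)
# {'class': ['title']}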

Get content

print(soup.title.string)
# "The Dormouse's story"

We can also nest selections; for example, to get the content of the p tag inside the body tag:

print(soup.body.p.string)
# "The Dormouse's story"
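One caveat: .string only works when a tag contains a single string. For the story paragraph, which mixes text with <a> tags, .string returns None, and get_text() is the right tool:

story = soup.find('p', class_='story')
print(story.string)      # None - the tag has multiple children
print(story.get_text())  # concatenates all the text inside the tag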

Common usage

Standard selectors

BeautifulSoup's basic usage (tag access and content access) can parse some HTML, but for many complex pages those methods are insufficient, or at best tedious, because a tag can carry several attributes (class, id, and so on).

Fortunately, BeautifulSoup provides convenient standard selectors as API methods. Two deserve special attention: find() and find_all(). The other methods take similar parameters and work the same way, so you can generalize from these.

find_all()

find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content.
find_all() works much like a regular expression: it finds every result that satisfies the matching pattern and returns them as a list.
Again using the example from the documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
 
soup = BeautifulSoup(html_doc, 'lxml')

Filters

Documentation reference
Before introducing the find_all() method, it helps to look at the types of filters. Filters are used only as arguments when searching the document (so "argument types" might be the more accurate name), and they run through the entire search API. A filter can be applied to a tag's name, to a node's attributes, to a string, or to any mix of these.

The find_all() method searches all tag children of the current tag and checks each against the filter conditions. Here are a few examples:

soup.find_all("title")
# [<title>The Dormouse's story</title>]
 
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse&#39;s story</b></p>]
 
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Some of these calls look alike, and some are new. What do the string and id arguments mean? Why does find_all("p", "title") return tags whose CSS class is "title"? Let us look at find_all()'s parameters more closely:

The name parameter

The name parameter finds every tag whose name matches; string objects (plain text nodes) are automatically ignored.

soup.find_all("title")
# [<title>The Dormouse's story</title>]

The value of the name parameter can be any type of filter: a string, a regular expression, a list, a function, or True.
Most commonly, name is simply the tag name we want to search for.
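For example, a regular expression, a list, or True can all serve as the name (a quick sketch against the same document):

import re

soup.find_all(re.compile('^b'))   # every tag whose name starts with "b": body, b
soup.find_all(['a', 'b'])         # every <a> tag and every <b> tag
soup.find_all(True)               # every tag in the document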

Keyword arguments

If our HTML contains several div tags but we only want the div whose class attribute is top, how do we get it?

soup.find_all('div', class_='top')

Note: class is a reserved word in Python, so we must append an underscore to the CSS attribute name (class_); otherwise an error is raised.
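If you would rather avoid the trailing underscore, passing an attrs dictionary is equivalent:

soup.find_all('div', attrs={'class': 'top'})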

Using the earlier example code again:

soup.find_all('a', id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

This way we get only the a tag whose id is link2.

The limit parameter

find_all() returns every matching result, so if the document tree is large, the search can be slow. If we do not need all results, the limit parameter caps how many are returned. It behaves like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and the results are returned.

For example, three a tags match, but we only want two of them:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The other parameters are rarely needed; consult the official documentation if you want the details.

find()

find_all() returns a list of all matching elements; find() returns a single element.

find(name, attrs, recursive, string, **kwargs)

find_all() returns every tag in the document that meets the conditions, but sometimes we only want a single result. If the document contains just one matching tag, searching with find_all() is a poor fit; rather than calling find_all() with limit=1, use find() directly. The following two lines are equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, while find() returns the result directly. When nothing matches, find_all() returns an empty list, whereas find() returns None.
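Because find() returns None on a miss, guard against it before chaining attribute access, or you will hit an AttributeError. A minimal sketch:

tag = soup.find('h1')   # there is no <h1> in this document
if tag is not None:
    print(tag.string)
else:
    print('tag not found')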

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax. As in CSS, prefix a class name with "." and an id with "#".

soup.select("title")
# [<title>The Dormouse's story</title>]

Search layer by layer through tags:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select("html head title")
# [<title>The Dormouse&#39;s story</title>]

Find the direct children of a tag:

soup.select("head > title")
# [<title>The Dormouse&#39;s story</title>]
 
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
 
soup.select("body > a")
# []

Search by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Search by a tag's id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
 
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

To query with several CSS selectors at once, separate them with commas:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
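Relatedly, when you only want the first match of a CSS selector, select_one() returns a single tag (or None) instead of a list:

soup.select_one("#link1")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>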

Extracting tag content

Suppose we have obtained a list of tags (built here so the snippet runs on its own):

html = '''
<a href="http://www.baidu.com/">Baidu</a>
<a href="http://www.163.com/">NetEase</a>
<a href="http://www.sina.com/">Sina</a>
'''
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a')  # a list of the three <a> tags

How do we extract the content inside them? We touched on this at the beginning.

for link in links:
    print(link.get_text())   # get_text() returns the text inside the tag
    print(link.get('href'))  # get('attr') returns an attribute's value
    print(link['href'])      # the subscript shorthand gives the same result

Output:

Baidu
http://www.baidu.com/
http://www.baidu.com/
NetEase
http://www.163.com/
http://www.163.com/
Sina
http://www.sina.com/
http://www.sina.com/


Summary

For BeautifulSoup's parser, lxml is recommended; if garbled characters appear, try html.parser. BeautifulSoup's attribute-style tag selection is limited but fast. The find_all() and find() methods are the recommended way to search for tags; if you are comfortable with CSS selectors, the .select() method is a good choice too. Use get_text() to obtain a tag's text content, and get('attr') or subscript access to obtain an attribute's value.


