How to parse HTML pages with Python crawler

WBOY
Release: 2023-05-30 21:41:42

Parsing HTML pages with Python

We usually need to parse crawled web pages to get the data we need. By analyzing the structure of HTML tags, we can extract the useful information contained in a web page. In Python, there are three common ways to parse HTML: regular expression parsing, XPath parsing, and CSS selector parsing.

The structure of HTML page

Understanding the basic structure of an HTML page is a prerequisite for discussing HTML parsing methods. When we open a website in a browser and choose "View page source" from the right-click menu, we can see the HTML code of the page. HTML code usually consists of tags, attributes, and text: tags carry the content displayed on the page, attributes supplement tag information, and text is the content a tag displays. The following is a simple example of an HTML page's code structure:

<!DOCTYPE html>
<html>
    <head>
        <!-- Content inside the head tag is not displayed in the browser window -->
        <title>This is the page title</title>
    </head>
    <body>
        <!-- Content inside the body tag is displayed in the browser window -->
        <h2>This is a heading</h2>
        <p>This is a paragraph of text</p>
    </body>
</html>

In this example, <!DOCTYPE html> is the document type declaration, the <html> tag is the root tag of the entire page, and <head> and <body> are child tags of <html>. The content inside the <body> tag is displayed in the browser window and forms the main body of the web page; the content inside the <head> tag is not displayed in the browser window, but it contains important meta-information about the page and is usually called the page header. The general code structure of an HTML page is as follows:

<!DOCTYPE html>
<html>
    <head>
        <!-- Page meta-information: character encoding, title, keywords, media queries, etc. -->
    </head>
    <body>
        <!-- Page body: the content displayed in the browser window -->
    </body>
</html>

HTML tags, Cascading Style Sheets (CSS), and JavaScript are the three basic components of an HTML page. Tags carry the content to be displayed on the page, CSS is responsible for rendering the page, and JavaScript controls the interactive behavior of the page. To parse an HTML page, you can use XPath syntax, which was originally a query syntax for XML: it extracts the content or attributes of tags based on the hierarchical structure of HTML tags. You can also locate page elements with CSS selectors, the same mechanism CSS uses to style page elements.

XPath parsing

XPath is a syntax for finding information in XML (eXtensible Markup Language) documents. XML, like HTML, is a markup language that uses tags to carry data. The difference is that XML tags are extensible and customizable, and XML has stricter syntax requirements. XPath uses path expressions to select nodes or node sets in XML documents. Nodes here include elements, attributes, text, namespaces, processing instructions, comments, and the root node.

XPath path expressions resemble file path syntax and use "/" and "//" to select nodes. A single slash "/" selects from the root node; a double slash "//" selects matching nodes at any position. For example, "/bookstore/book" selects all book child nodes under the root node bookstore, and "//title" selects title nodes at any position in the document.

XPath can also use predicates to filter nodes. Expressions nested in square brackets act as predicates; they can be numbers, comparisons, or function calls. For example, "/bookstore/book[1]" selects the first book child node of bookstore, and "//book[@lang]" selects all book nodes that have a lang attribute.

XPath functions include string, mathematical, logical, node, sequence and other functions. These functions can be used to select nodes, calculate values, convert data types and other operations. For example, the "string-length(string)" function can return the length of the string, and the "count(node-set)" function can return the number of nodes in the node set.
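As a sketch of how these functions behave, the following example uses the third-party lxml library (an assumption; install with pip install lxml) to evaluate count() and string-length() against a small XML document:

```python
from lxml import etree  # third-party: pip install lxml

xml = b"""<bookstore>
    <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
    <book><title lang="zh">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(xml)
# count() returns the number of nodes in the node set, as a float
num_books = root.xpath("count(//book)")
# string-length() returns the length of the node's string value
title_len = root.xpath("string-length(/bookstore/book[1]/title)")
print(num_books)  # 2.0
print(title_len)  # 12.0
```

Note that lxml returns XPath numeric results as Python floats, so count() yields 2.0 rather than 2.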

Below we use an example to illustrate how to use XPath to parse the page. Suppose we have the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>

For this XML file, we can use the XPath syntax as shown below to get the nodes in the document.

Path expression      Result
/bookstore           Selects the root element bookstore. Note: if a path starts with a forward slash (/), it always represents an absolute path to an element.
//book               Selects all book elements, regardless of their position in the document.
//@lang              Selects all attributes named lang.
/bookstore/book[1]   Selects the first book child element of bookstore.
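These path expressions can be evaluated from Python. Below is a minimal sketch using the third-party lxml library (an assumption; install with pip install lxml) to apply them to the bookstore document:

```python
from lxml import etree  # third-party: pip install lxml

xml = b"""<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>"""

root = etree.fromstring(xml)
books = root.xpath("//book")                           # all book elements
langs = root.xpath("//@lang")                          # all lang attribute values
first = root.xpath("/bookstore/book[1]/title/text()")  # title text of the first book
print(len(books))  # 2
print(langs[0])    # eng
print(first[0])    # Harry Potter
```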

CSS selector parsing

Locating elements through the attributes and relationships of HTML tags is called CSS selector parsing. The position of an element can be determined from the tag hierarchy and from attributes such as class names and ids. In Python, we can use the BeautifulSoup library for CSS selector parsing.

Next, we use an example to show how to parse a page with CSS selectors. Suppose we have the following HTML code:

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2>This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>

We can use the CSS selector syntax shown below to select page elements.

Selector         Result
div.content      Selects div elements whose class is content.
h2               Selects all h2 elements.
div.footer p     Selects all p elements inside div elements whose class is footer.
[href]           Selects all elements that have an href attribute.
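As a sketch, these selectors can be applied with BeautifulSoup's select method (assuming the third-party beautifulsoup4 package is installed, e.g. pip install beautifulsoup4):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<div class="content">
    <h2>This is a heading</h2>
    <p>This is a paragraph of text</p>
</div>
<div class="footer">
    <p>Copyright © 2021</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
content_divs = soup.select("div.content")  # divs with class "content"
headings = soup.select("h2")               # every h2 element
footer_ps = soup.select("div.footer p")    # p elements inside the footer div
print(len(content_divs))            # 1
print(headings[0].get_text())       # This is a heading
print(footer_ps[0].get_text())      # Copyright © 2021
```

select always returns a list of matching elements; use select_one when you expect a single match.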

Regular expression parsing

Regular expressions can also be used to parse HTML pages and to match, search, and replace text. In Python, regular expression parsing is done with the built-in re module.

Below we use an example to show how to parse a page with regular expressions. Suppose we have the following HTML code:

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2>This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>

We can use the regular expression shown below to extract page elements.

import re

html = '''
<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2>This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>
'''
# Non-greedy groups capture the page title and the first paragraph text
pattern = re.compile(r'<title>(.*?)</title>.*?<p>(.*?)</p>', re.S)
match = re.search(pattern, html)
if match:
    title = match.group(1)
    text = match.group(2)
    print(title)
    print(text)

In the code above, we use the compile method of the re module to compile the regular expression, then use the search method to match it against the HTML code. In the regular expression, ".*?" performs a non-greedy match, stopping at the first tag that satisfies the condition, and the re.S flag lets "." match any character, including newlines. Finally, we use the group method to retrieve the matched results.

