Home Backend Development Python Tutorial Detailed explanation of the basic use of xpath in python crawler

Detailed explanation of the basic use of xpath in python crawler

Apr 27, 2018 am 11:01 AM
python xpath use

This article mainly introduces the basic use of xpath in python crawler. Now I will share it with you and give you a reference. Let’s take a look together

1. Introduction

XPath is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in XML documents. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

2. Installation

##

pip3 install lxml
Copy after login

##3 , use 1, import

from lxml import etree
Copy after login

2, basically use

from lxml import etree
wb_data = """
    <p>
      <ul>
         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>

         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>

         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>

         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>

         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>
       </ul>
     </p>

    """
html = etree.HTML(wb_data)
print(html)
result = etree.tostring(html)
print(result.decode("utf-8"))
Copy after login

Judging from the results below, our printer html is actually a python object, and etree.tostring(html) is the basic writing method of incomplete html, which completes the label that is missing arms and legs.


 <Element html at 0x39e58f0>
<html><body><p>
      <ul>
         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>

         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>

         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>

         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>

         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>

       </li></ul>
     </p>
    </body></html>
Copy after login

3. Get the content of a certain tag (basic use). Note that to get all the content of the a tag, there is no need to add a after it. Forward slash, otherwise an error will be reported.

Writing method one


html = etree.HTML(wb_data)

html_data = html.xpath(&#39;/html/body/p/ul/li/a&#39;)

print(html)

for i in html_data:

  print(i.text)

<Element html at 0x12fe4b8>

first item

second item

third item

fourth item

fifth item
Copy after login

Writing method two (just add a /text() directly after the tag where you need to find the content)

html = etree.HTML(wb_data)

html_data = html.xpath(&#39;/html/body/p/ul/li/a/text()&#39;)

print(html)

for i in html_data:

  print(i) 

<Element html at 0x138e4b8>

first item

second item

third item

fourth item

fifth item
Copy after login

4. Open and read the html file


#使用parse打开html的文件

html = etree.parse(&#39;test.html&#39;)

html_data = html.xpath(&#39;//*&#39;)<br>#打印是一个列表,需要遍历

print(html_data)

for i in html_data:

  print(i.text)
Copy after login

##

html = etree.parse(&#39;test.html&#39;)

html_data = etree.tostring(html,pretty_print=True)

res = html_data.decode(&#39;utf-8&#39;)

print(res)

 

打印:

<p>

   <ul>

     <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>

     <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>

     <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>

     <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>

     <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a></li>

   </ul>

</p>
Copy after login

5. Print the attributes of a tag under the specified path (you can get the value of an attribute by traversing and find the content of the tag)


html = etree.HTML(wb_data)

html_data = html.xpath(&#39;/html/body/p/ul/li/a/@href&#39;)

for i in html_data:

  print(i)
Copy after login

Print:

link1.html

link2.html

link3.html

link4.html

link5.html

6. We know that we use xpath to get ElementTree objects one by one, so if we need to find the content, we need to traverse to get the data. list.

It is found that the a tag attribute under the absolute path is equal to link2.html.

html = etree.HTML(wb_data)

html_data = html.xpath(&#39;/html/body/p/ul/li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]/text()&#39;)

print(html_data)

for i in html_data:

  print(i)
Copy after login

Print:

['second item']

second item

7. Above we find all absolute paths (each one is searched from the root), below we find relative paths, for example, find the a tag content under all li tags.

html = etree.HTML(wb_data)

html_data = html.xpath(&#39;//li/a/text()&#39;)

print(html_data)

for i in html_data:

  print(i)
Copy after login

Print:

['first item', 'second item', 'third item', 'fourth item' , 'fifth item']

first item

second item

third item

fourth item

fifth item

8. Above we used the absolute path to find all the attributes of the a tag that are equal to the href attribute value. We used /---absolute path. Next we use the relative path to find the li tag under the l relative path. The value of the href attribute under the a tag. Note that double // is required after the a tag.

html = etree.HTML(wb_data)

html_data = html.xpath(&#39;//li/a//@href&#39;)

print(html_data)

for i in html_data:

  print(i)
Copy after login

Print:

['link1.html', 'link2.html', 'link3.html', ' link4.html', 'link5.html']

link1.html

link2.html

link3.html

link4.html

link5.html

9. The methods of checking specific attributes under relative paths are similar to those under absolute paths, or they can be said to be the same.

html = etree.HTML(wb_data)

html_data = html.xpath(&#39;//li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]&#39;)

print(html_data)

for i in html_data:

  print(i.text)
Copy after login

Print:

[]

second item

10. Find the href attribute of the a tag in the last li tag

html = etree.HTML(wb_data)

html_data = html.xpath(&#39;//li[last()]/a/text()&#39;)

print(html_data)

for i in html_data:

  print(i)
Copy after login

Print:

['fifth item']

fifth item

11. Find the href attribute of the a tag in the penultimate li tag

html = etree.HTML(wb_data)

html_data = html.xpath(&#39;//li[last()-1]/a/text()&#39;)

print(html_data)

for i in html_data:

  print(i)
Copy after login

Print:

['fourth item']

fourth item

12. If you are extracting a page If the xpath path of a certain tag is as follows:

//*[@id="kw"]
Copy after login

Explanation: Use relative paths to find all tags whose attribute id is equal to kw.

Commonly used


#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
  <head lang="en">
    <meta charset="UTF-8">
    <title></title>
  </head>
  <body>
    <ul>
      <li class="item-"><a id=&#39;i1&#39; href="link.html" rel="external nofollow" rel="external nofollow" >first item</a></li>
      <li class="item-0"><a id=&#39;i2&#39; href="llink.html" rel="external nofollow" >first item</a></li>
      <li class="item-1"><a href="llink2.html" rel="external nofollow" rel="external nofollow" >second item<span>vv</span></a></li>
    </ul>
    <p><a href="llink2.html" rel="external nofollow" rel="external nofollow" >second item</a></p>
  </body>
</html>
"""
response = HtmlResponse(url=&#39;http://example.com&#39;, body=html,encoding=&#39;utf-8&#39;)
# hxs = HtmlXPathSelector(response)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[2]&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[@id]&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[@id="i1"]&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[@href="link.html" rel="external nofollow" rel="external nofollow" ][@id="i1"]&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[contains(@href, "link")]&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[starts-with(@href, "link")]&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[re:test(@id, "i\d+")]&#39;)
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[re:test(@id, "i\d+")]/text()&#39;).extract()
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//a[re:test(@id, "i\d+")]/@href&#39;).extract()
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;/html/body/ul/li/a/@href&#39;).extract()
# print(hxs)
# hxs = Selector(response=response).xpath(&#39;//body/ul/li/a/@href&#39;).extract_first()
# print(hxs)
 
# ul_list = Selector(response=response).xpath(&#39;//body/ul/li&#39;)
# for item in ul_list:
#   v = item.xpath(&#39;./a/span&#39;)
#   # 或
#   # v = item.xpath(&#39;a/span&#39;)
#   # 或
#   # v = item.xpath(&#39;*/a/span&#39;)
#   print(v)
Copy after login

Related recommendations:


Summary of two methods for python crawlers to use real browsers to open web pages


##

The above is the detailed content of Detailed explanation of the basic use of xpath in python crawler. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Do mysql need to pay Do mysql need to pay Apr 08, 2025 pm 05:36 PM

MySQL has a free community version and a paid enterprise version. The community version can be used and modified for free, but the support is limited and is suitable for applications with low stability requirements and strong technical capabilities. The Enterprise Edition provides comprehensive commercial support for applications that require a stable, reliable, high-performance database and willing to pay for support. Factors considered when choosing a version include application criticality, budgeting, and technical skills. There is no perfect option, only the most suitable option, and you need to choose carefully according to the specific situation.

How to use mysql after installation How to use mysql after installation Apr 08, 2025 am 11:48 AM

The article introduces the operation of MySQL database. First, you need to install a MySQL client, such as MySQLWorkbench or command line client. 1. Use the mysql-uroot-p command to connect to the server and log in with the root account password; 2. Use CREATEDATABASE to create a database, and USE select a database; 3. Use CREATETABLE to create a table, define fields and data types; 4. Use INSERTINTO to insert data, query data, update data by UPDATE, and delete data by DELETE. Only by mastering these steps, learning to deal with common problems and optimizing database performance can you use MySQL efficiently.

How to optimize MySQL performance for high-load applications? How to optimize MySQL performance for high-load applications? Apr 08, 2025 pm 06:03 PM

MySQL database performance optimization guide In resource-intensive applications, MySQL database plays a crucial role and is responsible for managing massive transactions. However, as the scale of application expands, database performance bottlenecks often become a constraint. This article will explore a series of effective MySQL performance optimization strategies to ensure that your application remains efficient and responsive under high loads. We will combine actual cases to explain in-depth key technologies such as indexing, query optimization, database design and caching. 1. Database architecture design and optimized database architecture is the cornerstone of MySQL performance optimization. Here are some core principles: Selecting the right data type and selecting the smallest data type that meets the needs can not only save storage space, but also improve data processing speed.

HadiDB: A lightweight, horizontally scalable database in Python HadiDB: A lightweight, horizontally scalable database in Python Apr 08, 2025 pm 06:12 PM

HadiDB: A lightweight, high-level scalable Python database HadiDB (hadidb) is a lightweight database written in Python, with a high level of scalability. Install HadiDB using pip installation: pipinstallhadidb User Management Create user: createuser() method to create a new user. The authentication() method authenticates the user's identity. fromhadidb.operationimportuseruser_obj=user("admin","admin")user_obj.

Navicat's method to view MongoDB database password Navicat's method to view MongoDB database password Apr 08, 2025 pm 09:39 PM

It is impossible to view MongoDB password directly through Navicat because it is stored as hash values. How to retrieve lost passwords: 1. Reset passwords; 2. Check configuration files (may contain hash values); 3. Check codes (may hardcode passwords).

Does mysql need the internet Does mysql need the internet Apr 08, 2025 pm 02:18 PM

MySQL can run without network connections for basic data storage and management. However, network connection is required for interaction with other systems, remote access, or using advanced features such as replication and clustering. Additionally, security measures (such as firewalls), performance optimization (choose the right network connection), and data backup are critical to connecting to the Internet.

Can mysql workbench connect to mariadb Can mysql workbench connect to mariadb Apr 08, 2025 pm 02:33 PM

MySQL Workbench can connect to MariaDB, provided that the configuration is correct. First select "MariaDB" as the connector type. In the connection configuration, set HOST, PORT, USER, PASSWORD, and DATABASE correctly. When testing the connection, check that the MariaDB service is started, whether the username and password are correct, whether the port number is correct, whether the firewall allows connections, and whether the database exists. In advanced usage, use connection pooling technology to optimize performance. Common errors include insufficient permissions, network connection problems, etc. When debugging errors, carefully analyze error information and use debugging tools. Optimizing network configuration can improve performance

Does mysql need a server Does mysql need a server Apr 08, 2025 pm 02:12 PM

For production environments, a server is usually required to run MySQL, for reasons including performance, reliability, security, and scalability. Servers usually have more powerful hardware, redundant configurations and stricter security measures. For small, low-load applications, MySQL can be run on local machines, but resource consumption, security risks and maintenance costs need to be carefully considered. For greater reliability and security, MySQL should be deployed on cloud or other servers. Choosing the appropriate server configuration requires evaluation based on application load and data volume.

See all articles