This article shows how to use lxml's etree module together with XPath in Python crawlers, with worked examples.
lxml: Python's HTML/XML parser
Official website documentation: https://lxml.de/
Before using it, you need to install the lxml package.
What it does:
1. Parse HTML: etree.HTML(text) parses an HTML fragment given as a string into a complete HTML document tree
2. Read XML files
3. Use etree and XPath together
Installation of lxml
[PyCharm]>[file]>[settings]>[Project Interpreter]>[ ] >[lxml]>[install]
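Outside PyCharm, the package can also be installed from the command line with `pip install lxml`. Either way, a short check such as the sketch below can confirm that the interpreter actually sees the package. The variable name `version` is only illustrative, and the printed numbers depend on your installation.

```python
# Quick check that lxml is importable in the active interpreter.
from lxml import etree

# LXML_VERSION is a tuple of integers; the exact values depend on the install.
version = ".".join(str(part) for part in etree.LXML_VERSION)
print("lxml version:", version)
```

If the import fails with ModuleNotFoundError, the package was probably installed into a different interpreter than the one the project uses.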
##lxml-etree usage

    # Install lxml first
    # Use lxml to parse HTML code
    from lxml import etree

    text = '''
    <p>
        <ul>
            <li class="item-0"><a href="0.html">item 0 </a></li>
            <li class="item-1"><a href="1.html">item 1 </a></li>
            <li class="item-2"><a href="2.html">item 2 </a></li>
            <li class="item-3"><a href="3.html">item 3 </a></li>
            <li class="item-4"><a href="4.html">item 4 </a></li>
            <li class="item-5"><a href="5.html">item 5 </a></li>
        </ul>
    </p>
    '''

    # Use etree.HTML to parse the string into an HTML document
    html = etree.HTML(text)
    # tostring returns bytes, so decode before printing
    s = etree.tostring(html).decode()
    print(s)
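The element returned by etree.HTML can be queried with XPath right away, which is the usual next step in a crawler. The following is a minimal sketch using a shortened version of the list above; the variable names `hrefs`, `labels` and `first` are illustrative only.

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0"><a href="0.html">item 0</a></li>
    <li class="item-1"><a href="1.html">item 1</a></li>
    <li class="item-2"><a href="2.html">item 2</a></li>
</ul>
'''

# etree.HTML returns the root <html> element that lxml wraps around the fragment.
html = etree.HTML(text)
print(html.tag)  # html

# @href selects attribute values, text() selects text nodes.
hrefs = html.xpath('//li/a/@href')
labels = html.xpath('//li/a/text()')
print(hrefs)   # ['0.html', '1.html', '2.html']
print(labels)  # ['item 0', 'item 1', 'item 2']

# A predicate narrows the match to one <li>; xpath() still returns a list.
first = html.xpath('//li[@class="item-0"]/a')[0]
print(first.get('href'), first.text)
```

Because lxml adds the missing html and body elements itself, the same XPath expressions work whether the input is a fragment or a full page.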
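##lxml-etree: reading a file

    # Read a file with lxml-etree
    from lxml import etree

    xml = etree.parse("./py24.xml")
    # tostring returns bytes; decode so the pretty-printed output is readable
    sxml = etree.tostring(xml, pretty_print=True).decode()
    print(sxml)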
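The contents of py24.xml are not shown in the article. To try the file-reading step without it, the sketch below writes a small stand-in file to a temporary directory and parses that instead. The bookstore data, the file name `sample.xml` and all variable names are assumptions made for illustration, not the original file.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for py24.xml; the real file is not shown in the article.
SAMPLE_XML = """<?xml version="1.0" encoding="utf-8"?>
<bookstore>
    <book category="sport">
        <title>Running Basics</title>
        <year>2018</year>
    </book>
    <book category="cooking">
        <title>Simple Soups</title>
        <year>2015</year>
    </book>
</bookstore>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE_XML)

    tree = etree.parse(path)  # returns an _ElementTree for the whole document
    root = tree.getroot()     # the top-level <bookstore> element

print(type(tree).__name__)  # _ElementTree
print(root.tag)             # bookstore
print(len(root))            # 2 (number of direct child elements)
print(etree.tostring(tree, pretty_print=True).decode())
```

Note that etree.parse returns a tree object rather than an element; call getroot() when you need the top-level element itself.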
Use etree and XPath together
    # Read a file with lxml-etree
    from lxml import etree

    xml = etree.parse("./py24.xml")
    print(type(xml))

    # Find all book nodes
    rst = xml.xpath('//book')
    print(type(rst))
    print(rst)

    # Find book elements whose category attribute is "sport"
    rst2 = xml.xpath('//book[@category="sport"]')
    print(type(rst2))
    print(rst2)

    # Find the year elements under book elements whose category is "sport"
    rst3 = xml.xpath('//book[@category="sport"]/year')
    # xpath() returns a list; take the first match
    rst3 = rst3[0]
    print('-------------\n', type(rst3))
    print(rst3.tag)
    print(rst3.text)
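Besides elements, XPath can return attribute values, text nodes and numbers directly. The sketch below shows a few such queries on made-up data parsed from a string. The bookstore content is an assumption standing in for py24.xml, and the variable names are illustrative.

```python
from lxml import etree

# Hypothetical sample data standing in for py24.xml.
xml_text = """<bookstore>
    <book category="sport"><title>Running Basics</title><year>2018</year></book>
    <book category="cooking"><title>Simple Soups</title><year>2015</year></book>
    <book category="sport"><title>Trail Cycling</title><year>2021</year></book>
</bookstore>"""

root = etree.fromstring(xml_text)

# Attribute values of every book
categories = root.xpath('//book/@category')
print(categories)  # ['sport', 'cooking', 'sport']

# Text of the titles of sport books
sport_titles = root.xpath('//book[@category="sport"]/title/text()')
print(sport_titles)  # ['Running Basics', 'Trail Cycling']

# XPath functions such as count() return a float, not a list
book_count = root.xpath('count(//book)')
print(book_count)  # 3.0

# Numeric comparison inside a predicate
recent_titles = root.xpath('//book[year > 2016]/title/text()')
print(recent_titles)  # ['Running Basics', 'Trail Cycling']

# No match gives an empty list, so check before indexing with [0]
missing = root.xpath('//book[@category="history"]')
first_missing = missing[0] if missing else None
print(first_missing)  # None
```

The last query matters in crawlers: indexing an empty result with [0] raises IndexError, so a guard like the one above is a common precaution when a page may lack the expected node.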