Starting from scratch: learn the selectors supported by lxml
Selectors are among the most important tools for parsing web pages and extracting data. lxml is a powerful Python library whose selectors make it easier to locate and extract content in a page. This article introduces the common selectors supported by lxml and demonstrates each with a simple example.
lxml is a high-performance HTML and XML processing library built on the C libraries libxml2 and libxslt. It is generally faster and more memory-efficient than pure-Python parsers. lxml supports two commonly used selector syntaxes: XPath, which is built in, and CSS selectors, which are available through the cssselect package. Their usage is introduced in turn below.
XPath is a query language that locates nodes in an XML or HTML document through path expressions. Using XPath in lxml is straightforward: call the xpath() method on an element or a parsed tree. The following example extracts a heading and a list of items:
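from lxml import etree

html = """
<html>
  <body>
    <div class="content">
      <h1>Title</h1>
      <ul>
        <li>List 1</li>
        <li>List 2</li>
        <li>List 3</li>
      </ul>
    </div>
  </body>
</html>
"""

# Create the parser object
parser = etree.HTMLParser()

# Parse the HTML string. etree.parse() expects a file name or file object,
# so etree.fromstring() is used for markup held in memory.
tree = etree.fromstring(html, parser)

# Use an XPath selector to get the heading text
title = tree.xpath("//h1/text()")[0]
print(title)  # Output: Title

# Get all list items
items = tree.xpath("//li")
for item in items:
    print(item.text)  # Output: List 1, List 2, List 3 (one per line)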
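XPath expressions can also filter by attribute, position, or text content. The following is a minimal sketch of those predicates, not an exhaustive reference. It assumes a sample document modeled on the one above; the id="main" and class="first" attributes are hypothetical additions made only for illustration.

```python
from lxml import etree

# Hypothetical sample document, modeled on the one used in this article.
# The id="main" and class="first" attributes are added for illustration only.
html = """
<html>
  <body>
    <div class="content" id="main">
      <h1>Title</h1>
      <ul>
        <li class="first">List 1</li>
        <li>List 2</li>
        <li>List 3</li>
      </ul>
    </div>
  </body>
</html>
"""

# etree.HTML() parses an HTML string and returns the root element
root = etree.HTML(html)

# Attribute predicate: match the <div> whose class attribute equals "content"
divs = root.xpath('//div[@class="content"]')

# Attribute value: "@id" returns attribute strings instead of elements
div_ids = root.xpath('//div[@class="content"]/@id')

# Positional predicate: XPath positions are 1-based, so li[2] is the second item
second = root.xpath('//ul/li[2]/text()')

# Function predicate: match list items whose text contains "3"
with_three = root.xpath('//li[contains(text(), "3")]/text()')

print(len(divs), div_ids, second, with_three)
```

Note that xpath() always returns a list for node and attribute queries, even when only one match exists, so indexing with [0] is safe only after checking that the list is not empty.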
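CSS selectors are a widely used selector syntax, originally designed for stylesheets, that match elements by tag name, class, ID, attribute, or position in the tree. lxml exposes them through the lxml.cssselect module, which requires the separately installed cssselect package. The following example performs the same extraction with CSS selectors: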
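from lxml import etree
from lxml.cssselect import CSSSelector

html = """
<html>
  <body>
    <div class="content">
      <h1>Title</h1>
      <ul>
        <li>List 1</li>
        <li>List 2</li>
        <li>List 3</li>
      </ul>
    </div>
  </body>
</html>
"""

# Create the parser object
parser = etree.HTMLParser()

# Parse the HTML string. etree.parse() expects a file name or file object,
# so etree.fromstring() is used for markup held in memory.
tree = etree.fromstring(html, parser)

# Use a CSS selector to get the heading text
selector = CSSSelector("h1")
title = selector(tree)[0].text
print(title)  # Output: Title

# Get all list items
selector = CSSSelector("li")
items = selector(tree)
for item in items:
    print(item.text)  # Output: List 1, List 2, List 3 (one per line)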
As the examples above show, both selector styles are flexible and need only a few lines of code. Beyond this basic usage, lxml also supports more complex operations, such as combining selectors in a single expression and nesting them by running a query relative to a previously selected element.
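Combination and nesting can be sketched with XPath alone. The example below is one possible illustration under stated assumptions: the second div with class "sidebar" and the name find_by_class are hypothetical and do not appear in the earlier sample.

```python
from lxml import etree

# Hypothetical document with two containers, used to show scoped queries.
html = """
<html>
  <body>
    <div class="content">
      <h1>Title</h1>
      <ul><li>List 1</li><li>List 2</li></ul>
    </div>
    <div class="sidebar">
      <ul><li>Other 1</li></ul>
    </div>
  </body>
</html>
"""

root = etree.HTML(html)

# Nesting: select a container first, then run a relative query on it.
# The leading "." limits the search to descendants of that element.
content = root.xpath('//div[@class="content"]')[0]
content_items = content.xpath('.//li/text()')

# Combination: the union operator "|" merges two node sets,
# returned in document order
tags = [el.tag for el in content.xpath('.//h1 | .//li')]

# A precompiled expression with an XPath variable ($name) can be reused;
# find_by_class is a hypothetical helper name
find_by_class = etree.XPath('//div[@class=$name]//li/text()')
sidebar_items = find_by_class(root, name="sidebar")

print(content_items, tags, sidebar_items)
```

Scoping a query to a selected element keeps expressions short and avoids accidentally matching similar nodes elsewhere in the page, such as the sidebar list here.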
To summarize, lxml is a powerful HTML and XML parsing library that supports two commonly used selector syntaxes: XPath and CSS selectors. With these selectors, content in a web page can be located and extracted quickly and accurately, which simplifies subsequent data processing and analysis. I hope this article helps readers understand lxml's selector features and apply them in real projects.