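I want to use an XPath selector on HTML content.
The steps are:
In Chrome, right-click the element to copy its XPath selector
Use that selector in lxml
However:
The selector should match, but lxml returns an empty list.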
python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
pip show lxml
---
Metadata-Version: 1.1
Name: lxml
Version: 3.5.0
Summary: Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API.
Home-page: http://lxml.de/
Author: lxml dev team
Author-email: lxml-dev@lxml.de
License: UNKNOWN
Location: /usr/lib/python2.7/dist-packages
Requires:
Classifiers:
Development Status :: 5 - Production/Stable
Intended Audience :: Developers
Intended Audience :: Information Technology
License :: OSI Approved :: BSD License
Programming Language :: Cython
Programming Language :: Python :: 2
Programming Language :: Python :: 2.6
Programming Language :: Python :: 2.7
Programming Language :: Python :: 3
Programming Language :: Python :: 3.2
Programming Language :: Python :: 3.3
Programming Language :: Python :: 3.4
Programming Language :: Python :: 3.5
Programming Language :: C
Operating System :: OS Independent
Topic :: Text Processing :: Markup :: HTML
Topic :: Text Processing :: Markup :: XML
Topic :: Software Development :: Libraries :: Python Modules
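Copy the code below and run it.
Note the URL in the code. You can check it in Chrome, which really does produce this selector; the XPath that Firefox gives is somewhat different.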
from __future__ import absolute_import, unicode_literals

from lxml.etree import HTML
import requests


def get_text(url):
    # Download the page and return its body as text
    return requests.get(url).text


page = HTML(get_text('http://v2ex.com/?tab=hot'))
# Nothing is selected here, although the selector should match
print(page.xpath('//*[@id="Main"]/p[2]/p[10]/table/tbody/tr/td[3]/span[1]/a'))
My guess is that lxml follows somewhat different rules? (With a CSS selector, though, there is no problem.)
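The answer may be this: https://www.zhihu.com/question/41221020
What a trap. Without searching a bit more, who knows how much more time would have been wasted...
The layout of lxml's official site is a pile of crap.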
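For anyone hitting the same symptom, here is a minimal sketch of one well-known cause, which may or may not be exactly what the linked answer describes. Browsers insert a <tbody> element into every table in their DOM even when the raw HTML has none, so an XPath copied from Chrome contains a tbody step. lxml parses the raw source and does not add that element, so the copied path matches nothing. The markup below is made up for illustration and is not the real v2ex page.

```python
from lxml.etree import HTML

# Hypothetical markup standing in for the raw page source.
# Note that the table has no <tbody>, as is common in hand-written HTML.
RAW_HTML = """
<div id="Main">
  <table>
    <tr><td><span><a href="/t/1">first topic</a></span></td></tr>
  </table>
</div>
"""

page = HTML(RAW_HTML)

# Path in the style Chrome copies: it includes the <tbody> that the
# browser added to its own DOM.
chrome_xpath = '//*[@id="Main"]/table/tbody/tr/td/span/a'
# The same path with the tbody step removed, matching the raw source.
source_xpath = '//*[@id="Main"]/table/tr/td/span/a'

from_chrome = page.xpath(chrome_xpath)
from_source = page.xpath(source_xpath)

print(len(from_chrome))               # number of matches for the Chrome-style path
print([a.text for a in from_source])  # link texts found with the corrected path
```

If that is the cause here, dropping /tbody from the copied selector (or writing table//tr so the path works with or without it) should be enough. It would also explain why the CSS selector worked, assuming it did not mention tbody.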