python - Why doesn't my XPath scrape any content?
阿神 2017-04-18 10:30:15

# -*- coding:utf-8 -*-

import lxml,requests,sys
from bs4 import BeautifulSoup
from lxml import etree

reload(sys)
sys.setdefaultencoding("utf-8")

def main():

    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0'

    req = requests.get(url).content

    # soup = BeautifulSoup(req.content,'lxml')
    # imgs = soup.find_all('img')

    content = etree.HTML(req)
    paths = content.xpath('//*[@id="imgid"]/ul/li[1]/a/img/text()')
    # for img in imgs:
    #
    #     print img

    # for img in imgs :

    print paths

main()


All replies (1)
Peter_Zhu

When writing a crawler, before relying on an XPath expression you must first confirm that the data is actually present in the page's source code. If it is not, the content is loaded asynchronously.

1. Open this link in the browser to view the page source, then press Ctrl+F and search for imgid

view-source:https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0

2. Observation

The image list under imgid is not in the source, so we can conclude that the images are loaded by JavaScript
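The Ctrl+F check in steps 1 and 2 can also be done in code. The following is only a minimal sketch: the helper name `find_marker` and the `sample_source` string are made up for illustration and stand in for the real page source. It is written for Python 3, whereas the question's code is Python 2.

```python
import re


def find_marker(html, marker, width=30):
    """Return a short context snippet for every occurrence of marker in html."""
    snippets = []
    for match in re.finditer(re.escape(marker), html):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(html))
        snippets.append(html[start:end])
    return snippets


# Hypothetical raw source: the container exists, but it holds no <img> tags,
# which is the situation described in step 2.
sample_source = '<div id="imgid"></div><script>var data = {};</script>'

print(find_marker(sample_source, 'imgid'))  # context around the container
print(find_marker(sample_source, '<img'))   # empty list: no image tags
```

To run the same check against the real page, pass the text of the response from `requests.get(url)` instead of `sample_source`. If the marker your XPath depends on never shows up in the raw source, the XPath cannot match either.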

3. Locate the data

In the Network tab of the F12 developer tools (requests only show up after a refresh), I found no asynchronous request that carries the image information. So my guess is that the data is in the HTML after all, embedded in JavaScript and processed when the images are loaded.

View the source the same way as above, search for the parameter objURL, and you will find the real URLs

// There are many, concentrated in the lower half of the HTML
http://img3.duitang.com/uploads/item/201608/06/20160806110540_MAcru.jpeg

Solution

The rest is up to you~ Find a way to parse out the real URLs!
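One possible way to do that parsing is a regular expression over the raw HTML. This is a sketch under an assumption the answer does not confirm: that each address appears in the embedded JavaScript in the form `"objURL":"<address>"`. The names `OBJ_URL_RE` and `extract_obj_urls` and the `sample_source` fragment are hypothetical. Check the actual source and adjust the pattern if the format differs, for example if the slashes are escaped.

```python
import re

# Assumed format inside the page's JavaScript: "objURL":"http://..."
OBJ_URL_RE = re.compile(r'"objURL"\s*:\s*"(.*?)"')


def extract_obj_urls(html):
    """Return every value of the objURL parameter found in html, in order."""
    return OBJ_URL_RE.findall(html)


# Hypothetical fragment shaped like the assumed format, using the URL
# quoted in the answer above.
sample_source = (
    '{"objURL":"http://img3.duitang.com/uploads/item/201608/06/'
    '20160806110540_MAcru.jpeg","other":"value"}'
)

for image_url in extract_obj_urls(sample_source):
    print(image_url)
```

A regular expression is used here rather than XPath because the URLs sit inside script text, not in element attributes, so there is no `img` node for an XPath expression to select.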
