import requests
from bs4 import BeautifulSoup  # only needed for the commented-out attempt below
from lxml import etree


def main():
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0'
    html = requests.get(url).content
    # Earlier attempt with BeautifulSoup:
    # soup = BeautifulSoup(html, 'lxml')
    # imgs = soup.find_all('img')
    # for img in imgs:
    #     print(img)
    content = etree.HTML(html)
    # An <img> element has no text node, so select its src attribute instead
    paths = content.xpath('//*[@id="imgid"]/ul/li[1]/a/img/@src')
    print(paths)


main()
1. Open this link in a browser, view the page source, and press Ctrl+F to search for imgid
2. Observe
The picture list is not there under imgid, so we can conclude that the pictures are loaded by JavaScript
3. Find
In the Network tab of the F12 developer tools (requests only show up after you refresh the page), I found no asynchronous request that loads the image information. So I guessed the data is already in the HTML, embedded in JavaScript and processed when the images are loaded
View the page source the same way as above and search for the parameter objURL to find the real URLs
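The manual check in steps 1 to 3 can also be scripted. The following is only a minimal sketch using the standard library: the names ImgUnderId and diagnose are made up for illustration, and it assumes the markup inside the imgid element is reasonably well formed (every non-void tag is closed). It counts the img tags that are statically present under imgid and how often "objURL" appears in the raw source.

```python
from html.parser import HTMLParser


class ImgUnderId(HTMLParser):
    """Count <img> tags nested inside the element with a given id (hypothetical helper)."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID = {"img", "br", "hr", "input", "meta", "link", "area",
            "base", "col", "embed", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # greater than 0 while inside the target element
        self.img_count = 0

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "img":
                self.img_count += 1
            if tag not in self.VOID:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1


def diagnose(html, target_id="imgid"):
    """Report static <img> tags under target_id and raw "objURL" occurrences."""
    parser = ImgUnderId(target_id)
    parser.feed(html)
    parser.close()
    return {
        "static_imgs": parser.img_count,
        "objurl_hits": html.count('"objURL"'),
    }
```

If static_imgs is 0 while objurl_hits is positive for the downloaded page (for example diagnose(requests.get(url).text)), that matches the guess above: the data sits in a script, not in the img tags.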
Solution
The rest is up to you~ Find a way to parse the real URLs out of the page source!
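One possible way to do that, as a rough sketch and not the only approach: pull the values out with a regular expression. This assumes the source contains fragments shaped like "objURL":"http://..." (the exact shape is an assumption, so check it against what you see in the page source); the name extract_obj_urls is made up for illustration. If the values turn out to be escaped or encoded, an extra decoding step would be needed.

```python
import re

# Assumed shape of the embedded data: "objURL":"<url>"
OBJ_URL_RE = re.compile(r'"objURL"\s*:\s*"(.*?)"')


def extract_obj_urls(html):
    """Return each objURL value once, in the order it first appears."""
    seen = set()
    urls = []
    for url in OBJ_URL_RE.findall(html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

With the page text from the question (for example requests.get(url).text), extract_obj_urls(text) would give the list of candidate image URLs, which can then be downloaded one by one.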