python - How to check what kind of anti-crawler is used on a URL
ringa_lee
ringa_lee 2017-06-12 09:27:51
0
4
1322

Website: https://www.nvshens.com/g/22377/. Open the website directly in the browser and right-click on the image to download it. Then the image directly requested by my crawler has been blocked. Then I changed the headers and set up the IP proxy, but it still didn't work. But looking at the packet capture, it’s not dynamically loaded data! ! ! Please answer = =

ringa_lee
ringa_lee

ringa_lee

reply all(4)
过去多啦不再A梦

The girl is quite pretty.
It can indeed be opened by right-clicking, but after refreshing, it becomes a hotlinked picture. Generally, to prevent hotlinking, the server will check the Referer field in the request header, which is why it is not the original image after refreshing (the Referer has changed after refreshing).

img_url = "https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg"
r = requests.get(img_url, headers={'Referer':"https://www.nvshens.com/g/22377/"}).content
with open("00.jpg",'wb') as f:
    f.write(r)
学霸

Did you miss any parameters by capturing the packet when getting the picture?

我想大声告诉你

I was just looking at the content of the website and almost forgot about it being official.
You can follow all the information you requested

Then try it

女神的闺蜜爱上我

Referer According to the design of this website, each page should be more in line with the behavior of pretending to be a human being, instead of using a single Referer
The following is the complete code that can be run to capture all the pictures on page 18

# Putting all together
def url_guess_src_large (u):
    return ("https://www.nvshens.com/img.html?img=" +  '/'.join(u.split('/s/')))
# 下载函数
def get_img_using_requests(url, fn ):
    import shutil
    headers ['Referer'] = url_guess_src_large(url) #"https://www.nvshens.com/g/22377/" 
    print (headers)
    response = requests.get(url, headers = headers, stream=True)
    with open(fn, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response

import requests
# 用xpath擷取內容
from lxml import etree
url_ = 'https://www.nvshens.com/g/22377/{p}.html'  
headers = {
    "Connection" : "close",  # one way to cover tracks
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36}"
}

for i in range(1,18+1):
    url = url_.format(p=i)
    r = requests.get(url, headers=headers)
    html = requests.get(url,headers=headers).content.decode('utf-8')
    selector = etree.HTML(html)
    xpaths = '//*[@id="hgallery"]/img/@src'
    content = [x for x in selector.xpath(item)]
    urls_2get = [url_guess_src_large(x) for x in content]
    filenames = [os.path.split(x)[0].split('/gallery/')[1].replace("/","_") + "_" + os.path.split(x)[1] for x in urls_2get]
    for i, x in enumerate(content):
        get_img_using_requests (content[i], filenames[i])
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template