
Detailed description of the 403 Forbidden error in a Python crawler

高洛峰
Release: 2017-03-13 09:44:22

This article explains how to deal with the 403 Forbidden error that a Python crawler may run into, with example code for anyone who needs it.

Solving the 403 Forbidden error in a Python crawler

When writing a crawler in Python, you may find that html.getcode() reports a 403 Forbidden error: the website is blocking automated crawlers. To work around this, use the urllib2 module.

urllib2 is the higher-level module for fetching URLs and offers many methods. For example, requesting url=http://blog.csdn.NET/qysh123 may produce a 403 Forbidden error for that connection.
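Before applying the fix, the failure typically looks like this. (A minimal sketch, assuming Python 2 and a site that rejects urllib2's default User-Agent; the URL is the one from the article.)

import urllib2

url = "http://blog.csdn.net/qysh123"
try:
  html = urllib2.urlopen(url)
  print html.getcode()            # 200 if the request is accepted
except urllib2.HTTPError as e:
  print e.code                    # 403 when the site blocks the default crawler User-Agent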

To solve this problem, the following steps are required:


req = urllib2.Request(url)
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")
req.add_header("GET",url)
req.add_header("Host","blog.csdn.net")
req.add_header("Referer","http://blog.csdn.net/")

Among these headers, User-Agent identifies the browser; you can find your own browser's value by inspecting a request in its developer tools.

Then run:


html = urllib2.urlopen(req)
print html.read()

to download the full page source without hitting the 403 Forbidden error.
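Note that urllib2 exists only in Python 2; in Python 3 it was folded into urllib.request. For reference, a rough Python 3 equivalent of the snippet above (an illustrative sketch, not part of the original article) would be:

import urllib.request

url = "http://blog.csdn.net/qysh123"
req = urllib.request.Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")
req.add_header("Host", "blog.csdn.net")
req.add_header("Referer", "http://blog.csdn.net/")

html = urllib.request.urlopen(req)
print(html.read())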

The steps above can be wrapped in a function for convenient reuse. The code is:



#-*-coding:utf-8-*- 
 
import urllib2 
import random 
 
url="http://blog.csdn.net/qysh123/article/details/44564943" 
 
my_headers=["Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36", 
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36", 
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0" 
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14", 
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)" 
   
] 
def get_content(url,headers): 
  &#39;&#39;&#39;&#39;&#39; 
  @获取403禁止访问的网页 
  &#39;&#39;&#39; 
  randdom_header=random.choice(headers) 
 
  req=urllib2.Request(url) 
  req.add_header("User-Agent",randdom_header) 
  req.add_header("Host","blog.csdn.net") 
  req.add_header("Referer","http://blog.csdn.net/") 
  req.add_header("GET",url) 
 
  content=urllib2.urlopen(req).read() 
  return content 
 
print get_content(url,my_headers)
Here, random.choice automatically picks one of the predefined browser User-Agent strings. In the custom function you supply the Host, Referer, GET and other header information yourself; once those are set, the request goes through and the 403 error no longer appears.
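If a request still comes back as 403 even with the spoofed headers, urllib2 raises an HTTPError. A small wrapper around get_content that retries with a freshly chosen User-Agent might look like this (the retry count of 3 is an arbitrary choice for illustration):

def get_content_with_retry(url, headers, attempts=3):
  # Retry a few times; get_content picks a new random User-Agent on each call.
  for _ in range(attempts):
    try:
      return get_content(url, headers)
    except urllib2.HTTPError as e:
      if e.code != 403:
        raise          # a different HTTP error: re-raise it immediately
  return None          # still blocked after all attempts

print get_content_with_retry(url, my_headers)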

Of course, if you request pages too quickly, some websites will still filter you out. To get around that you need to use proxy IPs; that part is left for you to work out.
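As a rough sketch of that proxy approach (the address below is a placeholder, not a real proxy; url, my_headers and random are reused from the code above), urllib2 can route requests through a proxy with ProxyHandler:

# Placeholder proxy address, for illustration only.
proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = urllib2.build_opener(proxy)
opener.addheaders = [("User-Agent", random.choice(my_headers))]
urllib2.install_opener(opener)      # all later urlopen calls go through the proxy

print urllib2.urlopen(url).read()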


Thank you for reading. I hope this helps, and thank you for supporting this site!
