Solution to 403 error in python crawler

Y2J
Release: 2017-05-15 10:58:08
Original
4144 people have browsed it

This article mainly introduces the relevant information of python crawler to solve 403 forbidden access error. Friends who need it can refer to it

python crawler solves 403 forbidden access error

When writing a crawler in Python, html.getcode() will encounter the problem of 403 forbidden access. This is a ban on automated crawlers on the website. To solve this problem, you need to use the python module urllib2 module

The urllib2 module is an advanced crawler module. There are many methods. For example, if you connect url=http://blog.csdn.NET/qysh123, there may be a 403 access forbidden problem for this connection.

To solve this problem, the following steps are required:

<span style="font-size:18px;">req = urllib2.Request(url) 
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36") 
req.add_header("GET",url) 
req.add_header("Host","blog.csdn.net") 
req.add_header("Referer","http://blog.csdn.net/")</span>
Copy after login

User-Agent is a browser-specific attribute, which can be seen by viewing the source code through the browser

Then

html=urllib2.urlopen(req)
print html.read()
Copy after login

you can download all the web page code without the problem of 403 forbidden access.

For the above problems, it can be encapsulated into a function for easy use in the future. The specific code:

#-*-coding:utf-8-*- 
 
import urllib2 
import random 
 
url="http://blog.csdn.net/qysh123/article/details/44564943" 
 
my_headers=["Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36", 
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36", 
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0" 
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14", 
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)" 
   
] 
def get_content(url,headers): 
  &#39;&#39;&#39;&#39;&#39; 
  @获取403禁止访问的网页 
  &#39;&#39;&#39; 
  randdom_header=random.choice(headers) 
 
  req=urllib2.Request(url) 
  req.add_header("User-Agent",randdom_header) 
  req.add_header("Host","blog.csdn.net") 
  req.add_header("Referer","http://blog.csdn.net/") 
  req.add_header("GET",url) 
 
  content=urllib2.urlopen(req).read() 
  return content 
 
print get_content(url,my_headers)
Copy after login

The random function is used to automatically obtain the already written For browser-type User-Agent information, you need to write your own Host, Referer, GET information, etc. in Custom Function. After solving these problems, you can access smoothly and no more 403 access will occur. Information.

Of course, if the access frequency is too fast, some websites will still be filtered. To solve this problem, you need to use a proxy IP method. . . Solve it yourself specifically

[Related recommendations]

1. Special recommendation:"php programmer toolbox" V0.1 version download

2. Python free video tutorial

3. Python application in data science video tutorial

The above is the detailed content of Solution to 403 error in python crawler. For more information, please follow other related articles on the PHP Chinese website!

Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template