How can a Python crawler batch-scrape jokes from Qiushibaike?
伊谢尔伦 2017-04-18 10:20:18

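I have just started learning Python and don't know the scrapy framework yet. I only want to write a simple crawler that grabs the jokes from the first 10 pages (the first N pages). Is there some fairly simple code that can do this without scrapy? I tried adding a for loop around page, but it still only fetched a single page, and I don't know how to fix it.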

import urllib
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
user_agent = 'Mozilla/5.0 ( Windows NT 6.1)'
headers = { 'User-Agent' : user_agent }

try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<p.*?class="content">.*?<span>(.*?)</span>.*?</p>.*?',re.S)
    items = re.findall(pattern,content)
    for item in items:
        print item

except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

All replies (1)
Peter_Zhu

I ran your code and found that it got through the first two pages but returned an error code after that. I think this is because you did nothing to deal with anti-crawling measures: your results came back within one second, and 10 consecutive requests within one second is clearly not something a human would do.

Many websites can tell when code is being used to hit their pages. Some sites dislike this and apply anti-crawling measures, for example blocking your IP address outright so that you can no longer access them. They do this because too many requests in a short period can bring a site down.

My suggestion is to wait 1 second after crawling each page. Modify your code like this:

import urllib2
import re
import time

# These values are the same for every page, so build them once.
user_agent = 'Mozilla/5.0 ( Windows NT 6.1)'
headers = {'User-Agent': user_agent}
pattern = re.compile('<p.*?class="content">.*?<span>(.*?)</span>.*?</p>.*?', re.S)

for page in range(1, 11):
    print('at page %s' % page)
    url = 'http://www.qiushibaike.com/8hr/page/' + str(page)

    try:
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')
        for item in pattern.findall(content):
            print(item)

    except urllib2.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

    # Pause between pages so the requests do not all arrive at once.
    time.sleep(1)

I can get results this way, but I would also like to recommend a third-party library called requests. Since you already know urllib, it will not be hard to pick up, and it is more user-friendly. It works well with the BeautifulSoup library (used to parse and process HTML text), which is very convenient. You can search online to find out more.
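To give a rough idea of what that combination looks like, here is a minimal Python 3 sketch. It assumes the requests and beautifulsoup4 packages are installed, and it assumes each joke sits in a span inside an element with class="content", mirroring the regular expression above. The function names extract_jokes and crawl are made up for this illustration, and the selector may need adjusting to the site's actual markup.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://www.qiushibaike.com/8hr/page/'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1)'}


def extract_jokes(html):
    """Return the text of every <span> nested inside a class="content" element."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumed page structure: <... class="content"><span>joke text</span></...>
    return [span.get_text(strip=True) for span in soup.select('.content span')]


def crawl(pages=10, delay=1.0):
    """Yield (page, joke) pairs for the first `pages` pages, pausing between requests."""
    for page in range(1, pages + 1):
        response = requests.get(BASE_URL + str(page), headers=HEADERS, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
        for joke in extract_jokes(response.text):
            yield page, joke
        time.sleep(delay)  # same 1-second pause as in the urllib2 version
```

Looping over crawl(10) and printing each pair would correspond to the urllib2 script above. Keeping the parsing in its own function makes it easy to try on a saved HTML file without hitting the site.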

Also, when you write crawlers in the future, always take the site's anti-crawling measures into account!
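One common way to act on that advice is to retry a failed request after a growing, slightly randomized wait instead of hammering the site again immediately. The following is only a sketch in Python 3 using the standard library. The names fetch and fetch_with_backoff and the specific delays are assumptions chosen for illustration, not something the site documents.

```python
import random
import time
import urllib.error
import urllib.request

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1)'


def fetch(url, timeout=10):
    """Download one page and return its text (UTF-8 is assumed)."""
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8')


def fetch_with_backoff(url, fetcher=fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Try a URL up to `retries` times (at least 1), doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fetcher(url)
        except urllib.error.URLError as error:  # HTTPError is a subclass of URLError
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            # Wait 1s, 2s, 4s, ... plus a little random jitter.
            wait = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            print('request failed (%s), retrying in %.1fs' % (error, wait))
            sleep(wait)
```

Calling fetch_with_backoff('http://www.qiushibaike.com/8hr/page/1') in place of a bare urlopen keeps the 1-second pause between pages useful while also slowing down further whenever the site starts returning errors.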
