Douban movie image crawling example-Python Tutorial-php.cn

PHP中文网

Release： 2017-06-20 15:26:40

1. First, a look at the result

2. Install and use Scrapy

Official website:.

Installation command: pip install Scrapy

Once installation completes, create a new project from the default template with the command: scrapy startproject xx

The picture above illustrates how Scrapy operates. The meaning and function of each component can be looked up on Baidu, so I won't go into detail here. In general, the work comes down to the following steps.

　1) Configure settings. For other options, consult the documentation and set them according to your own requirements.

DEFAULT_REQUEST_HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.10 Safari/537.36'}
DOWNLOAD_TIMEOUT = 30
IMAGES_STORE = 'Images'
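
Beyond the three settings above, a few other commonly adjusted options live in the same settings.py. The fragment below is only a sketch: the option names are standard Scrapy settings, but every value is an assumption chosen for illustration, not something the original project is known to have used.

```python
# settings.py (illustrative values only; tune them for your own crawl)

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 1

# Upper bound on the number of requests Scrapy performs concurrently.
CONCURRENT_REQUESTS = 8

# How many times a failed request (for example a timeout) is retried.
RETRY_TIMES = 2

# ImagesPipeline drops images smaller than these dimensions, in pixels.
IMAGES_MIN_WIDTH = 50
IMAGES_MIN_HEIGHT = 50
```

A modest delay and concurrency limit are a reasonable starting point when crawling a single site through one proxy.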

　2) Define the item class, which plays the role of a model class. For example:

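import scrapy


class CnblogImageItem(scrapy.Item):
    image = scrapy.Field()      # URL of the image to download
    imagePath = scrapy.Field()  # path(s) of the downloaded file, filled in by the pipeline
    name = scrapy.Field()       # name used to rename the downloaded file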

　3) Configure the downloader middleware. Downloader middleware customizes how a request is sent. Typical examples are middleware that routes requests through a proxy and middleware that renders pages with PhantomJS. Here, we only use the proxy middleware.

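from scrapy.http import HtmlResponse
from selenium import webdriver


class GaoxiaoSpiderMiddleware(object):
    def process_request(self, request, spider):
        # Image requests carry the 'img' flag; let Scrapy download them normally
        if len(request.flags) > 0 and request.flags[0] == 'img':
            return None
        # Render the page in PhantomJS (supported only by older Selenium releases)
        driver = webdriver.PhantomJS()
        # Maximize the browser window
        driver.maximize_window()
        driver.get(request.url)
        content = driver.page_source
        driver.quit()
        # Returning a response here short-circuits the normal download
        return HtmlResponse(request.url, encoding='utf-8', body=content)


class ProxyMiddleWare(object):
    def process_request(self, request, spider):
        # Send every request through this proxy
        request.meta['proxy'] = 'http://175.155.24.103:808'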
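
A single hard-coded proxy stops working as soon as that address goes offline. As a hedged sketch of one alternative, the hypothetical RandomProxyMiddleware below picks a proxy per request from a list. The class name, the PROXY_LIST setting name, and the local addresses are all assumptions made for illustration; only the `request.meta['proxy']` key and the `from_crawler` hook come from Scrapy itself. A stand-in request object is used so the logic can be tried without starting a crawl.

```python
import random
from types import SimpleNamespace


class RandomProxyMiddleware(object):
    """Hypothetical variant of ProxyMiddleWare that chooses a proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting, e.g. a list of proxy URLs.
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'].
        request.meta['proxy'] = random.choice(self.proxies)
        # Returning None lets Scrapy continue handling the request as usual.
        return None


# Stand-in for a scrapy.Request: only the .meta dict matters for this check.
fake_request = SimpleNamespace(meta={})
middleware = RandomProxyMiddleware(['http://127.0.0.1:8080', 'http://127.0.0.1:8081'])
result = middleware.process_request(fake_request, spider=None)
```

Such a class would be registered under DOWNLOADER_MIDDLEWARES in the same way as ProxyMiddleWare.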

　4) Write a pipeline. A pipeline processes the items passed from the spider: saving them to Excel or a database, downloading images, and so on. Here is my code for downloading images, which builds on Scrapy's official ImagesPipeline.

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class CnblogImagesPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        image_url = item['image']
        if image_url != '':
            # Flag the request so page-rendering middleware can skip it
            yield scrapy.Request(str(image_url), flags=['img'])

    def item_completed(self, result, item, info):
        # Collect the storage paths of the successful downloads
        image_path = [x["path"] for ok, x in result if ok]
        if image_path:
            # Rename the downloaded file after the item's name
            if item['name'] is not None and item['name'] != '':
                ext = os.path.splitext(image_path[0])[1]
                os.rename(self.IMAGES_STORE + '/' + image_path[0], self.IMAGES_STORE + '/' + item['name'] + ext)
            item["imagePath"] = image_path
        else:
            item['imagePath'] = ''
        return item
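
The rename step exists because ImagesPipeline, by default, stores each file under a hash-based name (a `full/` subdirectory plus the SHA-1 of the image URL). Building the new path by string concatenation can fail when a name contains a character such as `/`. The sketch below shows one more defensive way to build the target; `safe_name` and `rename_target` are hypothetical helpers, not part of Scrapy or of the original project, and the hashed file name is made up.

```python
import os
import re
import tempfile


def safe_name(name):
    # Replace path separators and other characters that are invalid
    # in file names on common operating systems.
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


def rename_target(store, downloaded_path, name):
    # Keep the extension of the downloaded file, e.g. '.jpg'.
    ext = os.path.splitext(downloaded_path)[1]
    return os.path.join(store, safe_name(name) + ext)


# Demonstration in a temporary directory standing in for IMAGES_STORE.
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, 'full'))
downloaded = os.path.join('full', '0a1b2c.jpg')  # made-up hashed file name
with open(os.path.join(store, downloaded), 'wb'):
    pass  # create an empty placeholder file

target = rename_target(store, downloaded, 'Face/Off')
os.rename(os.path.join(store, downloaded), target)
```

In the pipeline, `store` would be IMAGES_STORE and `downloaded_path` the first entry of `image_path`.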

　5) Write your own Spider class. The spider holds some configuration, issues the URL requests, and processes the response data. The downloader middleware and pipeline configuration can be placed in the settings file. Here I put them in each spider instead, because the project contains multiple spiders that use different downloader middleware, so they are configured separately.

# coding=utf-8
import sys
import scrapy
import gaoxiao.items
import json

# Python 2 only: force UTF-8 as the default string encoding
reload(sys)
sys.setdefaultencoding('utf-8')


class doubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    # JSON list endpoint (left blank here); the paging offset is appended to it
    baseUrl = ''
    start = 0
    start_urls = [baseUrl + str(start)]
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'gaoxiao.middlewares.ProxyMiddleWare': 1,
            # 'gaoxiao.middlewares.GaoxiaoSpiderMiddleware': 544
        },
        'ITEM_PIPELINES': {
            'gaoxiao.pipelines.CnblogImagesPipeline': 1,
        }
    }

    def parse(self, response):
        data = json.loads(response.text)['subjects']
        for i in data:
            item = gaoxiao.items.CnblogImageItem()
            if i['cover'] != '':
                item['image'] = i['cover']
                item['name'] = i['title']
            else:
                item['image'] = ''
            yield item
        # Request the next page of 20 entries until the offset reaches 400
        if self.start < 400:
            self.start += 20
            yield scrapy.Request(self.baseUrl + str(self.start), callback=self.parse)
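
The parsing and paging logic in parse can be tried without running a crawl. The sketch below mirrors it with plain dictionaries. The sample payload is made up and only assumes the shape the spider relies on (a `subjects` list whose entries carry `cover` and `title`); `extract_items` and `page_offsets` are hypothetical helper names.

```python
import json


def extract_items(text):
    """Turn one JSON response body into item-like dicts, as parse() does."""
    items = []
    for subject in json.loads(text)['subjects']:
        if subject['cover'] != '':
            items.append({'image': subject['cover'], 'name': subject['title']})
        else:
            items.append({'image': ''})
    return items


def page_offsets(step=20, limit=400):
    """Yield the offsets the spider requests: 0 first, then +step while below limit."""
    start = 0
    yield start
    while start < limit:
        start += step
        yield start


# Made-up payload with one entry that has a cover and one that does not.
sample = json.dumps({'subjects': [
    {'title': 'Movie A', 'cover': 'http://example.com/a.jpg'},
    {'title': 'Movie B', 'cover': ''},
]})
items = extract_items(sample)
offsets = list(page_offsets())
```

With the limits used in the spider, the offsets run from 0 to 400 in steps of 20, so 21 pages are requested in total.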
