


First introduction to Scrapy: a hands-on image crawler for Moko.cc
I've spent the past two days studying the Scrapy crawler framework and planned to write a crawler for practice. What I usually do most is browse pictures, yes, that's right, artistic photos. I proudly believe that looking at more beautiful photos will definitely improve your aesthetic sense and make you a more elegant programmer. O(∩_∩)O~ Just kidding; without further ado, let's get to the point and write an image crawler.
Design idea: the crawl target is the model photos on Moko.cc. We use CrawlSpider to extract the URL of each photo, write the extracted image URLs into a static HTML file for storage, and then simply open that file to view the images. My environment is Windows 8.1, Python 2.7, Scrapy 0.24.4. I won't explain how to set up the environment; you can search Baidu for that yourself.
Referring to the official documentation, I summarized building a crawler into four steps: define the Item (items.py), write the Spider, write the Pipeline that stores the results, and configure settings.py.
The rest is simple: just follow the steps one by one. First, create a project in the terminal. Let's name the project moko and enter the command scrapy startproject moko. Scrapy will create a moko directory under the current directory with some initial files in it. If you're curious about what each file is for, check the documentation; here I'll only introduce the files we use this time.
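For orientation, the project layout generated by Scrapy 0.24 looks roughly like this (a sketch; the step numbers refer to the four steps above):

moko/
    scrapy.cfg           # project configuration
    moko/
        __init__.py
        items.py         # step 1: Item definition
        pipelines.py     # step 3: storage pipeline
        settings.py      # step 4: project settings
        spiders/         # step 2: spider code goes in here
            __init__.py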
Define the Item. In items.py, declare the data we want to capture:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MokoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
Write the Spider. The Spider is a Python class inheriting from scrapy.contrib.spiders.CrawlSpider, with three members that must be defined:
name: the identifier of this spider; it must be unique, so different crawlers get different names.
start_urls: a list of URLs from which the spider starts crawling.
parse(): the parsing method; when called it receives the Response object returned for each URL as its only argument, and it is responsible for parsing out the matched data (into items) and following further URLs.
# -*- coding: utf-8 -*-
# File name   : spyders/mokospider.py
# Author      : Jhonny Zhang
# Mail        : veinyy@163.com
# Create Time : 2014-11-29
#############################################################################

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from moko.items import MokoItem
import re
from scrapy.http import Request
from scrapy.selector import Selector


class MokoSpider(CrawlSpider):
    name = "moko"
    allowed_domains = ["moko.cc"]
    start_urls = ["http://www.moko.cc/post/aaronsky/list.html"]
    rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')),
                  callback='parse_img', follow=True),)

    def parse_img(self, response):
        urlItem = MokoItem()
        sel = Selector(response)
        for divs in sel.xpath('//div[@class="pic dBd"]'):
            img_url = divs.xpath('.//img/@src2').extract()[0]
            urlItem['url'] = img_url
            yield urlItem
Our project is named moko. allowed_domains restricts the crawler to moko.cc; it is the crawler's allowed area and means only pages under that domain are crawled. The crawl starts from http://www.moko.cc/post/aaronsky/list.html. Then we set the crawl Rule, which is what distinguishes CrawlSpider from a basic spider: starting from page A, which contains many hyperlinks, the crawler follows only the links that match the configured rules and repeats the process on each page it reaches. callback is the function invoked for every matched page. I did not use the default name parse because the official documentation warns that parse may be called by the crawler framework itself, causing conflicts.
On the target page http://www.moko.cc/post/aaronsky/list.html there are many links to pictures, and those links follow a pattern. Open any of them, for example http://www.moko.cc/post/1052776.html: the http://www.moko.cc/post/ prefix is always the same, and only the number at the end differs. So we express the rule with a regular expression: rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')), callback='parse_img', follow=True),), meaning that from the current page, every link matching /post/\d*\.html is crawled and handed to parse_img for processing.
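To make the rule concrete, here is a minimal sketch (plain re, outside Scrapy; the example URLs are just the pages mentioned above) of which addresses the allow pattern accepts:

# -*- coding: utf-8 -*-
# Illustration of the allow pattern used in the Rule above.
# The link extractor effectively searches candidate URLs for this regex.
import re

pattern = re.compile(r'/post/\d*\.html')

candidates = [
    "http://www.moko.cc/post/1052776.html",        # photo post page -> matched
    "http://www.moko.cc/post/aaronsky/list.html",  # list page -> not matched
]

for url in candidates:
    print url, "->", bool(pattern.search(url))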
Next, define the parsing function parse_img; this is the key part. The argument it receives is the response object returned after the crawler opens a URL. The content of the response object is essentially one large string, and our job is to filter out the parts we need. How? There is a very handy Selector class: its xpath() method takes a path expression and parses out the content. Before writing the expression you need to inspect the page carefully; the tool I used here is Firebug. In the captured page source, each photo sits inside a <div class="pic dBd"> block.
What we need is the src2 part: inside that div, the <img> tag carries the actual image URL in its src2 attribute, which is why parse_img selects //div[@class="pic dBd"] and then extracts .//img/@src2.
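As a standalone illustration, the same XPath can be tried against a string with Scrapy's Selector. The HTML fragment and image URL below are made-up stand-ins that only mimic the structure the spider expects:

# -*- coding: utf-8 -*-
# Hypothetical fragment mimicking the structure parse_img relies on;
# the image URL is an invented placeholder, not a real Moko address.
from scrapy.selector import Selector

html = '''
<div class="pic dBd">
    <img src="blank.gif" src2="http://img.moko.cc/example/photo_001.jpg" />
</div>
'''

sel = Selector(text=html)
for div in sel.xpath('//div[@class="pic dBd"]'):
    print div.xpath('.//img/@src2').extract()[0]
# prints: http://img.moko.cc/example/photo_001.jpg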
Next, define the pipeline; this part handles how our content is stored.
from moko.items import MokoItem


class MokoPipeline(object):

    def __init__(self):
        self.mfile = open('test.html', 'w')

    def process_item(self, item, spider):
        text = '<img src="' + item['url'] + '" alt = "" />'
        self.mfile.writelines(text)

    def close_spider(self, spider):
        self.mfile.close()
A test.html file is created to store the results. Note that process_item wraps each URL in an <img> tag, so the images show up directly when the HTML file is opened in a browser. At the end, a method that closes the file is defined; it is called when the crawler finishes.
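As a quick sanity check, the pipeline can be exercised by hand, outside Scrapy (run from the project root so the moko package is importable; the URL is a made-up placeholder):

# -*- coding: utf-8 -*-
# Feed the pipeline one hand-built item and close it again.
from moko.items import MokoItem
from moko.pipelines import MokoPipeline

pipeline = MokoPipeline()

item = MokoItem()
item['url'] = 'http://img.moko.cc/example/photo_001.jpg'

pipeline.process_item(item, spider=None)   # spider is not used by this pipeline
pipeline.close_spider(spider=None)

# test.html now contains one <img> tag pointing at the placeholder URL.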
Finally, configure settings.py:
BOT_NAME = 'moko'

SPIDER_MODULES = ['moko.spiders']
NEWSPIDER_MODULE = 'moko.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'moko (+http://www.yourdomain.com)'

ITEM_PIPELINES = {
    'moko.pipelines.MokoPipeline': 1,
}
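With everything in place, start the spider from the project root with scrapy crawl moko; once it finishes, open test.html in a browser to view the collected photos.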
Finally, here is a screenshot of the result. Have fun, everyone! ^_^
