


First introduction to Scrapy: a hands-on image crawler for Moko.cc
I've spent the past two days studying the Scrapy crawler framework and planned to write a crawler for practice. What I usually do most is browse pictures, yes, that's right, artistic photos. I proudly believe that looking at more beautiful photos will definitely improve your aesthetic sense and make you a more elegant programmer. O(∩_∩)O~ Just kidding; without further ado, let's get to the point and write an image crawler.
Design idea: the crawl target is the model photos on Moko.cc. We use CrawlSpider to extract the URL of each photo, write the extracted image URLs into a static HTML file for storage, and then simply open that file to view the images. My environment is Windows 8.1, Python 2.7, Scrapy 0.24.4. I won't explain how to set up the environment; you can search Baidu for that yourself.
Referring to the official documentation, I summarized building a crawler into four steps: define the Item (items.py), write the Spider, write the Pipeline that stores the results, and configure settings.py.
The rest is simple: just follow the steps one by one. First, create a project in the terminal. Let's name the project moko and enter the command scrapy startproject moko. Scrapy will create a moko directory under the current directory with some initial files in it. If you're curious about what each file is for, check the documentation; here I'll only introduce the files we use this time.
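For orientation, the project layout generated by Scrapy 0.24 looks roughly like this (a sketch; the step numbers refer to the four steps above):

moko/
    scrapy.cfg           # project configuration
    moko/
        __init__.py
        items.py         # step 1: Item definition
        pipelines.py     # step 3: storage pipeline
        settings.py      # step 4: project settings
        spiders/         # step 2: spider code goes in here
            __init__.py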
Define the Item. In items.py, declare the data we want to capture:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MokoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
Write the Spider. The Spider is a Python class inheriting from scrapy.contrib.spiders.CrawlSpider, with three members that must be defined:
name: the identifier of this spider; it must be unique, so different crawlers get different names.
start_urls: a list of URLs from which the spider starts crawling.
parse(): the parsing method; when called it receives the Response object returned for each URL as its only argument, and it is responsible for parsing out the matched data (into items) and following further URLs.
# -*- coding: utf-8 -*-
# File name   : spyders/mokospider.py
# Author      : Jhonny Zhang
# Mail        : veinyy@163.com
# Create Time : 2014-11-29
#############################################################################

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from moko.items import MokoItem
import re
from scrapy.http import Request
from scrapy.selector import Selector


class MokoSpider(CrawlSpider):
    name = "moko"
    allowed_domains = ["moko.cc"]
    start_urls = ["http://www.moko.cc/post/aaronsky/list.html"]
    rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')),
                  callback='parse_img', follow=True),)

    def parse_img(self, response):
        urlItem = MokoItem()
        sel = Selector(response)
        for divs in sel.xpath('//div[@class="pic dBd"]'):
            img_url = divs.xpath('.//img/@src2').extract()[0]
            urlItem['url'] = img_url
            yield urlItem
Our project is named moko. allowed_domains restricts the crawler to moko.cc; it is the crawler's allowed area and means only pages under that domain are crawled. The crawl starts from http://www.moko.cc/post/aaronsky/list.html. Then we set the crawl Rule, which is what distinguishes CrawlSpider from a basic spider: starting from page A, which contains many hyperlinks, the crawler follows only the links that match the configured rules and repeats the process on each page it reaches. callback is the function invoked for every matched page. I did not use the default name parse because the official documentation warns that parse may be called by the crawler framework itself, causing conflicts.
On the target page http://www.moko.cc/post/aaronsky/list.html there are many links to pictures, and those links follow a pattern. Open any of them, for example http://www.moko.cc/post/1052776.html: the http://www.moko.cc/post/ prefix is always the same, and only the number at the end differs. So we express the rule with a regular expression: rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')), callback='parse_img', follow=True),), meaning that from the current page, every link matching /post/\d*\.html is crawled and handed to parse_img for processing.
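To make the rule concrete, here is a minimal sketch (plain re, outside Scrapy; the example URLs are just the pages mentioned above) of which addresses the allow pattern accepts:

# -*- coding: utf-8 -*-
# Illustration of the allow pattern used in the Rule above.
# The link extractor effectively searches candidate URLs for this regex.
import re

pattern = re.compile(r'/post/\d*\.html')

candidates = [
    "http://www.moko.cc/post/1052776.html",        # photo post page -> matched
    "http://www.moko.cc/post/aaronsky/list.html",  # list page -> not matched
]

for url in candidates:
    print url, "->", bool(pattern.search(url))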
Next, define the parsing function parse_img; this is the key part. The argument it receives is the response object returned after the crawler opens a URL. The content of the response object is essentially one large string, and our job is to filter out the parts we need. How? There is a very handy Selector class: its xpath() method takes a path expression and parses out the content. Before writing the expression you need to inspect the page carefully; the tool I used here is Firebug. In the captured page source, each photo sits inside a <div class="pic dBd"> block.
What we need is the src2 part: inside that div, the <img> tag carries the actual image URL in its src2 attribute, which is why parse_img selects //div[@class="pic dBd"] and then extracts .//img/@src2.
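As a standalone illustration, the same XPath can be tried against a string with Scrapy's Selector. The HTML fragment and image URL below are made-up stand-ins that only mimic the structure the spider expects:

# -*- coding: utf-8 -*-
# Hypothetical fragment mimicking the structure parse_img relies on;
# the image URL is an invented placeholder, not a real Moko address.
from scrapy.selector import Selector

html = '''
<div class="pic dBd">
    <img src="blank.gif" src2="http://img.moko.cc/example/photo_001.jpg" />
</div>
'''

sel = Selector(text=html)
for div in sel.xpath('//div[@class="pic dBd"]'):
    print div.xpath('.//img/@src2').extract()[0]
# prints: http://img.moko.cc/example/photo_001.jpg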
Next, define the pipeline; this part handles how our content is stored.
from moko.items import MokoItem


class MokoPipeline(object):

    def __init__(self):
        self.mfile = open('test.html', 'w')

    def process_item(self, item, spider):
        text = '<img src="' + item['url'] + '" alt = "" />'
        self.mfile.writelines(text)

    def close_spider(self, spider):
        self.mfile.close()
A test.html file is created to store the results. Note that process_item wraps each URL in an <img> tag, so the images show up directly when the HTML file is opened in a browser. At the end, a method that closes the file is defined; it is called when the crawler finishes.
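As a quick sanity check, the pipeline can be exercised by hand, outside Scrapy (run from the project root so the moko package is importable; the URL is a made-up placeholder):

# -*- coding: utf-8 -*-
# Feed the pipeline one hand-built item and close it again.
from moko.items import MokoItem
from moko.pipelines import MokoPipeline

pipeline = MokoPipeline()

item = MokoItem()
item['url'] = 'http://img.moko.cc/example/photo_001.jpg'

pipeline.process_item(item, spider=None)   # spider is not used by this pipeline
pipeline.close_spider(spider=None)

# test.html now contains one <img> tag pointing at the placeholder URL.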
Finally, configure settings.py:
BOT_NAME = 'moko'

SPIDER_MODULES = ['moko.spiders']
NEWSPIDER_MODULE = 'moko.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'moko (+http://www.yourdomain.com)'

ITEM_PIPELINES = {
    'moko.pipelines.MokoPipeline': 1,
}
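With everything in place, start the spider from the project root with scrapy crawl moko; once it finishes, open test.html in a browser to view the collected photos.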
Finally, here is a screenshot of the result. Have fun, everyone! ^_^
