Detailed explanation of scrapy examples of python crawler framework

高洛峰

Oct 18, 2016 am 10:25 AM

Generate Project

Scrapy provides a tool to generate projects. Some files are preset in the generated project, and users need to add their own code to these files.

Open the command line and execute: scrapy startproject tutorial. The generated project has a structure similar to the following:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/

The name attribute is important: different spiders cannot use the same name.

start_urls is the list of URLs where the spider starts crawling; it can contain multiple URLs.

parse is the callback that Scrapy calls by default after the spider downloads a page. Avoid using this name for your own unrelated methods.

When the spider has fetched the content of a URL, it calls parse and passes it a response argument. The response contains the content of the downloaded page, and inside parse you can extract data from it. The first DmozSpider listing in this article simply saves the page content to a file.
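As a small illustration of the filename logic in the spider's parse method, the sketch below applies the same split("/")[-2] expression to one of the start URLs, outside Scrapy. The helper name filename_for is an assumption made for this example; it is not part of Scrapy.

```python
def filename_for(url):
    """Return the second-to-last piece of the URL when split on "/"."""
    return url.split("/")[-2]


books_url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print(filename_for(books_url))
```

Because the start URLs end with a slash, the last piece after splitting is empty, so index -2 picks the final directory name. For a URL without a trailing slash the same expression would pick the parent segment instead.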

Start crawling

You can open the command line, enter the generated project root directory tutorial/, and execute scrapy crawl dmoz, where dmoz is the name of the spider.

Parse web page content

Scrapy provides a convenient way to parse data from web pages: the HtmlXPathSelector class.

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Use the last directory name of the URL ("Books", "Resources") as the filename.
        filename = response.url.split("/")[-2]
        # Write the raw page body; the with block closes the file afterwards.
        with open(filename, 'wb') as f:
            f.write(response.body)


HtmlXPathSelector uses XPath to parse data:

//ul/li selects every li element that is a direct child of a ul element

a/@href selects the href attribute of every a element

a/text() selects the text of the a element

a[@href="abc"] selects every a element whose href attribute is abc
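To try these expressions without a running crawl, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for HtmlXPathSelector. ElementTree supports only a subset of XPath: it has no @href or text() steps, so the attribute and the text are read with .get() and .text instead. The sample markup and the function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
SAMPLE = """<html><body>
<ul>
  <li><a href="abc">First</a> first description</li>
  <li><a href="def">Second</a> second description</li>
</ul>
</body></html>"""


def extract_entries(markup):
    """Return (title, link, desc) tuples for every li directly under a ul."""
    root = ET.fromstring(markup)
    entries = []
    for li in root.findall(".//ul/li"):    # counterpart of //ul/li
        a = li.find("a")                   # first a child of this li
        title = a.text                     # counterpart of a/text()
        link = a.get("href")               # counterpart of a/@href
        desc = (a.tail or "").strip()      # text that follows the a element
        entries.append((title, link, desc))
    return entries


def titles_with_href(markup, href):
    """Return the text of every a element whose href equals the given value."""
    root = ET.fromstring(markup)
    return [a.text for a in root.findall(".//a[@href='%s']" % href)]


print(extract_entries(SAMPLE))
print(titles_with_href(SAMPLE, "abc"))
```

Real web pages are often not well-formed XML, which is one reason to use Scrapy's own selectors for actual crawling.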

We can store the parsed data in item objects that Scrapy understands. Scrapy can then export these objects for us, so we do not have to write the data to a file ourselves. To do this, we add classes to items.py that describe the data we want to save.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each li under a ul is one listed site.
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            # Python 2 print statement, as in the original listing.
            print title, link, desc


When running scrapy on the command line, we can add two options so that Scrapy writes the items returned by the parse method to a JSON file:

scrapy crawl dmoz -o items.json -t json

items.json will be placed in the root directory of the project
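The exported file can then be read back with the standard json module. The sketch below assumes the file holds a JSON array with one object per item, and that every field is a list of strings because extract() returns a list. The sample data is made up for illustration and is not real crawl output; the helper name flatten is also an assumption.

```python
import json

# Hypothetical content in the shape of items.json; not real crawl output.
SAMPLE_JSON = """[
  {"title": ["Book A"], "link": ["http://example.com/a"], "desc": ["\\n - First book \\n"]},
  {"title": [], "link": [], "desc": ["\\n"]}
]"""


def flatten(item):
    """Keep the first value of each field, stripped; empty lists become ''."""
    return {key: (values[0].strip() if values else "")
            for key, values in item.items()}


items = [flatten(item) for item in json.loads(SAMPLE_JSON)]
print(items[0]["title"], items[0]["link"])
```

To process a real export, the string could be replaced by the result of reading items.json from the project root directory.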

Let scrapy automatically crawl all links on the webpage

In the examples above, Scrapy only crawls the two URLs listed in start_urls. Usually, though, we want Scrapy to discover the links on a page automatically and then crawl the pages they point to. To achieve this, we can extract the links we need in the parse method, then construct Request objects and return them; Scrapy will crawl these links automatically. The MySpider listing in this article shows the general shape of such code.
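Before Request objects are constructed, the extracted href values usually have to be turned into absolute URLs and limited to the domain being crawled. The following is a minimal sketch of that step using only the Python 3 standard library; it does not create Scrapy requests. The function name absolute_links and the sample href values are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(page_url, hrefs, allowed_domain):
    """Resolve hrefs against page_url and keep those on allowed_domain."""
    links = []
    for href in hrefs:
        url = urljoin(page_url, href)
        host = urlparse(url).hostname or ""
        # Accept the domain itself and any of its subdomains.
        if host == allowed_domain or host.endswith("." + allowed_domain):
            links.append(url)
    return links


page = "http://www.dmoz.org/Computers/Programming/Languages/Python/"
hrefs = ["Books/", "/Computers/", "http://example.com/elsewhere"]
print(absolute_links(page, hrefs, "dmoz.org"))
```

Each URL that survives this filter would then become the url argument of a Request in the parse method.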

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Then, in the spider's parse method, we store the parsed data in DmozItem objects.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            # One DmozItem per listed site.
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items


In the MySpider listing, parse is the default callback and yields Request objects. Scrapy crawls the pages these requests point to. Each time one of those pages is downloaded, parse_item is called; parse_item in turn yields another Request, Scrapy downloads that page as well, and parse_details is called once it has been fetched.

To make this kind of work easier, Scrapy provides another spider base class, CrawlSpider, with which automatic link following is easy to implement.

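from scrapy.http import Request
from scrapy.item import Item
from scrapy.spider import BaseSpider

class MyItem(Item):
    """Placeholder item for this sketch; declare your own fields here."""

class MySpider(BaseSpider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        item_urls = []  # placeholder: collect `item_urls` from the response
        for item_url in item_urls:
            yield Request(url=item_url, callback=self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        item_details_url = response.url  # placeholder: build the real details URL here
        yield Request(url=item_details_url, meta={'item': item},
                      callback=self.parse_details)

    def parse_details(self, response):
        # The item travels between callbacks in the request's meta dict.
        item = response.meta['item']
        # populate more `item` fields
        return item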


Compared with BaseSpider, CrawlSpider has an additional rules attribute. This attribute is a list that can contain multiple Rule objects. Each Rule describes which links should be followed and which should not. The documentation for the Rule class is at http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule

A Rule may or may not have a callback. When a Rule has no callback, Scrapy simply follows all the links it matches.
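The allow patterns used in the rules of the CrawlSpider listing in this article are regular expressions. As a rough sketch of which URLs a pattern such as /tor/\d+ would select, the code below tests patterns against URLs with re.search, assuming the pattern may match anywhere in the URL. It only mimics the matching step and is not the link extractor itself; the sample URLs and the name matches_any are assumptions made for this example.

```python
import re


def matches_any(url, patterns):
    """Return True if any of the regular expressions is found in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


allow = [r"/tor/\d+"]
urls = [
    "http://www.mininova.org/tor/2657665",
    "http://www.mininova.org/today",
    "http://www.mininova.org/tor/abc",
]
followed = [url for url in urls if matches_any(url, allow)]
print(followed)
```

Writing the patterns as raw strings (the r prefix) keeps the backslash in \d from being treated as a string escape.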

Usage of pipelines.py

In pipelines.py we can add some classes to filter out the items we don’t want and save the items to the database.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# TorrentItem is assumed to be defined in the project's items.py.

class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # First rule: follow matching links, no callback.
    # Second rule: pass matching pages to parse_torrent.
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+'])),
             Rule(SgmlLinkExtractor(allow=[r'/abc/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent


If an item does not meet the requirements, the pipeline raises an exception (DropItem in Scrapy) and the item is not written to the JSON file.
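A pipeline of this kind could look roughly like the sketch below. It reuses the FilterWordsPipeline name from the settings line in this article, but the word list, the use of the desc field, and the assumption that desc holds a list of strings are choices made only for this example. A stand-in DropItem class is defined when Scrapy is not installed, so the sketch can also run on its own.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class FilterWordsPipeline(object):
    """Drop items whose description contains a filtered word."""

    # Hypothetical word list, chosen only for this example.
    words_to_filter = ["politics", "religion"]

    def process_item(self, item, spider):
        # extract() returns a list of strings, so join before searching.
        text = " ".join(item.get("desc", [])).lower()
        for word in self.words_to_filter:
            if word in text:
                raise DropItem("Contains forbidden word: %s" % word)
        return item
```

Scrapy calls process_item once for every item a spider returns. Returning the item passes it on to the next stage, while raising DropItem discards it.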

To use pipelines, we also need to modify settings.py by adding this line:

ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']

Now run scrapy crawl dmoz -o items.json -t json again; the items that do not meet the requirements are filtered out.
