
Scrapy in action: crawling Douban movie data and rating popularity rankings

Jun 22, 2023, 01:49 PM
Tags: Douban, crawling, Scrapy

Scrapy is an open-source Python framework for crawling data quickly and efficiently. In this article, we will use Scrapy to crawl Douban movie data and build a rating popularity ranking.

  1. Preparation

First, we need to install Scrapy. You can install it by entering the following command at the command line:

pip install scrapy

Next, we will create a Scrapy project. At the command line, enter the following command:

scrapy startproject doubanmovie

This will create a Scrapy project named doubanmovie. We will then go into the project directory and generate a spider named douban (stored in douban.py). At the command line, enter the following commands:

cd doubanmovie
scrapy genspider douban douban.com
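
After these commands, the project directory typically looks like this (the exact files can vary slightly between Scrapy versions):

doubanmovie/
├── scrapy.cfg            # deployment configuration
└── doubanmovie/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        ├── __init__.py
        └── douban.py     # the spider generated by genspider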

Now, we have a Spider ready to use. Next, we will define the spider's behavior to get the required data.

  2. Crawling movie data

We will use the spider to crawl Douban movie data from the Top 250 list. Specifically, we will get the following information for each movie:

  • Movie name
  • Director
  • Actors
  • Genre
  • Country
  • Language
  • Release date
  • Duration
  • Rating
  • Number of reviewers

Open the douban.py file and add the following code:

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath('//div[@class="item"]')
        for movie in movie_list:
            # Note: the positional text() selectors below assume that each field
            # occupies its own text node inside div.bd > p; the Top 250 page may
            # pack several fields into the same text node, in which case some
            # values come back empty and need extra string parsing.
            yield {
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                # The last <span> in the star block holds the review count
                # (e.g. "1234567人评价").
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

In this code, we use XPath expressions to select the information we need from each movie entry, and the yield statement hands each item back to Scrapy as a dictionary.
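
If you want to check these XPath expressions before running a full crawl, Scrapy's interactive shell is a convenient place to experiment. A minimal session might look like this:

# Open an interactive shell against the Top 250 page.
scrapy shell 'https://movie.douban.com/top250'

# Inside the shell, the downloaded page is available as `response`:
>>> movie_list = response.xpath('//div[@class="item"]')
>>> len(movie_list)        # 25 items per page on the Top 250 list
>>> movie_list[0].xpath('.//span[@class="title"]/text()').get()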

If we run our Spider now (with the command scrapy crawl douban), it will crawl the movies on the first page of the list (the spider does not follow the pagination yet) and print the scraped items to the command line.
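
In practice, you will usually want the items written to a file rather than printed to the console. Scrapy's built-in feed export handles this with the -o option:

# Export the scraped items to a JSON file (use a .csv extension for CSV output).
scrapy crawl douban -o movies.json

Note that Douban may reject requests carrying Scrapy's default user agent. If the spider receives 403 responses, setting a browser-like USER_AGENT and a small DOWNLOAD_DELAY in settings.py usually helps (the values below are only an example):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
DOWNLOAD_DELAY = 1   # seconds between requests, to stay polite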

  3. Getting the rating popularity ranking

So far, we have successfully obtained the movie data from the first page of the list. Next, we will extend the spider to cover the full Top 250 and record each movie's ranking.

To do this, we update our spider so that it follows the "Next" link through every page of the Douban Top 250 list. The position of each movie in this list gives us its ranking.

In the douban.py file, we update the spider as follows:

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath('//div[@class="item"]')
        for movie in movie_list:
            yield {
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

        # Follow the "Next" link until there are no more pages.
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].get())
            yield scrapy.Request(url, callback=self.parse)

In this code, the next_page selector checks whether a "Next" link exists on the current page. If it does, we follow it and keep crawling until we reach the last page.
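
For reference, the "Next" link on the list page is a relative URL (for example, something like ?start=25&filter=), and response.urljoin resolves it against the URL of the current page. The standard library performs the same resolution (the exact href format here is an assumption):

from urllib.parse import urljoin

# Hypothetical relative href taken from the "Next" button.
next_href = '?start=25&filter='
page_url = 'https://movie.douban.com/top250'

# response.urljoin does the same resolution internally.
print(urljoin(page_url, next_href))
# https://movie.douban.com/top250?start=25&filter=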

Next, we need to update the parse method to record each movie's ranking. We will use Python's enumerate function to number the movies on each page, and carry the current page's offset in the request meta so the ranking stays global across all 250 entries.

In the douban.py file, we will replace the original parse method:

    def parse(self, response):
        # Each page lists only 25 movies, so enumerate() alone would restart the
        # ranking at 1 on every page. We carry the current page's offset in the
        # request meta to keep the ranking global across the whole Top 250.
        offset = response.meta.get('offset', 0)
        movie_list = response.xpath('//div[@class="item"]')
        for i, movie in enumerate(movie_list):
            yield {
                'rank': offset + i + 1,
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].get())
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'offset': offset + len(movie_list)})

Now, if we run our Spider again, it will crawl the data for all 250 movies and print them to the command line, and this time each item includes the movie's ranking.
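
With the items exported (for example via scrapy crawl douban -o movies.json), producing a popularity ranking by number of reviewers is a short post-processing step. A minimal sketch, assuming the export file is movies.json and that num_reviews holds text such as "1234567人评价":

import json
import re

# Load the items exported by the spider.
with open('movies.json', encoding='utf-8') as f:
    movies = json.load(f)

def review_count(movie):
    # Extract the digits from strings such as "1234567人评价"; default to 0.
    digits = re.findall(r'\d+', movie.get('num_reviews') or '')
    return int(digits[0]) if digits else 0

# Sort by number of reviewers, most-reviewed first, and print the popularity ranking.
for rank, movie in enumerate(sorted(movies, key=review_count, reverse=True), start=1):
    print(rank, movie['name'], review_count(movie))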

  4. Conclusion

Scrapy is a very powerful and flexible tool for crawling data quickly and efficiently. In this article, we have successfully used Scrapy to crawl Douban movie data and rating popularity rankings.

We used XPath to selectively extract information from the web page and the yield statement to hand each item back to Scrapy. Throughout the process, Scrapy provided a simple and effective way to manage and crawl large amounts of data, allowing us to move quickly on to analysis and processing.
