
Scrapy crawler in action: crawling Maoyan movie ranking data

Jun 22, 2023, 08:58 AM
Tags: crawler, Scrapy, Maoyan


As the Internet has grown, data crawling has become an essential part of the big data era: crawlers automatically collect the data we need so it can be processed and analyzed. Python has become one of the most popular programming languages in recent years, and Scrapy, a powerful Python-based crawler framework, is widely used and has drawn particular attention in the field of data crawling.

This article uses the Scrapy framework to crawl Maoyan's movie ranking data. The process is divided into four parts: analyzing the page structure, writing the crawler, parsing the page, and storing the data.

1. Analyze the page structure

First, we need to analyze the structure of the Maoyan movie ranking page. For convenience, we use the Google Chrome browser to inspect the page and XPath to extract the required information.

As you can see, the Maoyan movie ranking page lists multiple movies, and each movie is rendered in a similar HTML code block.

Our goal is to extract five fields from each HTML block: the movie's name, its stars, the release date, the poster link, and the rating. To do this, press F12 in Chrome to open the developer tools, select the "Elements" tab, hover over the target element, then right-click and choose "Copy -> Copy XPath".

The copied XPath path is as follows:

/html/body/div[3]/div/div[2]/dl/dd[1]/div/div/div[1]/p[1]/a/text()

where "/html/body/div[3]/div/div[2]/dl/dd" represents the parent node of the entire movie list, in order Scroll down to find the elements we need to extract.

2. Write a crawler framework

Next, we need to create a Scrapy project; for details, refer to Scrapy's official documentation (https://docs.scrapy.org/en/latest/intro/tutorial.html). After creating the project, create a new file named maoyan.py in the spiders directory.
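As a quick sketch (assuming the project is named maoyan, which matches the import in the code below), the project is created from the command line:

scrapy startproject maoyan
cd maoyan
# create maoyan/spiders/maoyan.py with the spider code below, then run:
scrapy crawl maoyan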

The following is our crawler framework code:

import scrapy
from maoyan.items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/board/4']

    def parse(self, response):
        # Each <dd> under the board list corresponds to one movie.
        movies = response.xpath('//dl[@class="board-wrapper"]/dd')
        for movie in movies:
            item = MaoyanItem()
            item['title'] = movie.xpath('.//p[@class="name"]/a/@title').extract_first()
            item['actors'] = movie.xpath('.//p[@class="star"]/text()').extract_first().strip()
            item['release_date'] = movie.xpath('.//p[@class="releasetime"]/text()').extract_first().strip()
            item['image_url'] = movie.xpath('.//img/@data-src').extract_first()
            # The rating is split into an integer part and a fraction part.
            item['score'] = (movie.xpath('.//p[@class="score"]/i[@class="integer"]/text()').extract_first() +
                             movie.xpath('.//p[@class="score"]/i[@class="fraction"]/text()').extract_first())
            yield item

In this code, we first define the spider's name, allowed_domains, and start_urls. "allowed_domains" restricts the crawler to URLs under the given domain, while "start_urls" lists the first URL(s) the crawler will request.

The spider's parse method receives the response, extracts the five fields of each movie (name, stars, release date, poster link, and rating) via XPath, and saves them to a MaoyanItem.

Finally, we yield each Item object. Note: the Item class we use is defined in a file named items.py and needs to be imported.
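The article does not list items.py itself; a minimal definition consistent with the five fields used above would look like this (a sketch, assuming the default Scrapy project layout):

# maoyan/items.py
import scrapy


class MaoyanItem(scrapy.Item):
    title = scrapy.Field()          # movie name
    actors = scrapy.Field()         # starring actors
    release_date = scrapy.Field()   # release date
    image_url = scrapy.Field()      # poster link
    score = scrapy.Field()          # rating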

3. Parse the page

Once the crawler has fetched the page we need, we can parse the HTML document and extract the information we want. In Scrapy, this mainly means running XPath queries and regular expressions against the response object.

In this example, we use XPath paths to extract the five fields of each movie on the Maoyan movie ranking page.
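Scrapy selectors can also apply a regular expression directly via re_first(). For example, to pull just the four-digit year out of the release-time text (a sketch, assuming that text contains a year):

# Inside parse(), for each movie selector:
year = movie.xpath('.//p[@class="releasetime"]/text()').re_first(r'(\d{4})')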

4. Store data

After the data is parsed, it needs to be stored. Typically, we either write it to a file or save it to a database.
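As an aside, Scrapy's built-in feed exports can write items to a file without any custom pipeline:

scrapy crawl maoyan -o maoyan_top100_movies.csv

Here we implement a pipeline anyway, to show explicitly how storage works.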

In this example, we choose to save the data to a .csv file:

import csv


class MaoyanPipeline(object):
    def __init__(self):
        # Open the output file once when the pipeline is created.
        self.file = open('maoyan_top100_movies.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Write one CSV row per scraped movie.
        row = [item['title'], item['actors'], item['release_date'], item['image_url'], item['score']]
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.file.close()

In the code above, we use Python's built-in csv module to write the data to a file named maoyan_top100_movies.csv; when the spider closes, close_spider closes the file as well.
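For the pipeline to actually run, it must be enabled in the project's settings.py (a minimal sketch, assuming the pipeline class lives in maoyan/pipelines.py; the number sets its order among pipelines):

# maoyan/settings.py
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}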

Summary

Through this article, we learned how to use the Scrapy framework to crawl Maoyan movie ranking data: we analyzed the page structure, wrote the spider, parsed the page, and stored the data. In practice, this also teaches us to balance legality, reliability, and efficiency when crawling data.
