Scrapy crawler in action: crawling Maoyan movie ranking data

WBOY
Release: 2023-06-22 08:58:55

With the growth of the Internet, data crawling has become an important part of working with big data. A crawler can automatically fetch the data you need and hand it off for processing and analysis. Python is one of the most popular languages for this work, and Scrapy, a powerful crawler framework written in Python, is widely used for data crawling.

This article uses the Scrapy framework to crawl Maoyan movie ranking data. The process has four parts: analyzing the page structure, writing the crawler, parsing the page, and storing the data.

1. Analyze the page structure

First, we need to analyze the structure of the Maoyan movie ranking page. For convenience, we use the Google Chrome browser to inspect the page and XPath to extract the required information.

The Maoyan movie ranking page lists multiple movies, and each movie sits in its own HTML code block with the same structure.

Our goal is to get five pieces of data from each HTML code block: the movie's name, starring actors, release time, poster link, and rating. Press F12 to open the developer tools in Chrome, select the "Elements" tab, hover over the target element, then right-click it and select "Copy -> Copy XPath".

The copied XPath path is as follows:

/html/body/div[3]/div/div[2]/dl/dd[1]/div/div/div[1]/p[1]/a/text()

Here "/html/body/div[3]/div/div[2]/dl" is the node that holds the entire movie list, and each "dd" under it is one movie. The index in "dd[1]" selects only the first movie; the elements we need to extract are found by working down from each "dd" in turn.
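The difference between a positional path like the one copied above and a relative, class-based path can be sketched without a network request. This is a minimal illustration, not the article's code: it uses the standard library's xml.etree.ElementTree (which supports a small XPath subset) as a stand-in for Scrapy's selectors, and the SAMPLE markup is a hypothetical, simplified version of the ranking list.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the real page is larger and may differ.
SAMPLE = """
<dl class="board-wrapper">
  <dd>
    <p class="name"><a title="Movie A">Movie A</a></p>
    <p class="star">Starring: Actor One</p>
  </dd>
  <dd>
    <p class="name"><a title="Movie B">Movie B</a></p>
    <p class="star">Starring: Actor Two</p>
  </dd>
</dl>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the <dl> element

# Positional lookup, like the copied path: only the first <dd> is matched.
first_title = root.find("dd[1]/p[@class='name']/a").text

# Relative, class-based lookup: visit every <dd> and query inside it.
titles = [dd.find(".//p[@class='name']/a").get("title") for dd in root.findall("dd")]

print(first_title)
print(titles)
```

A positional path breaks as soon as the page layout shifts by one element, which is why class-based relative paths are usually the more robust choice for a spider.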

2. Write the crawler

Next, we need to create a Scrapy project; see Scrapy's official tutorial (https://docs.scrapy.org/en/latest/intro/tutorial.html). The import in the code below assumes the project is named maoyan. After creating the project, create a new file named maoyan.py in the spiders directory.

The following is our spider code:

import scrapy
from maoyan.items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/board/4']

    def parse(self, response):
        # Each <dd> under the board wrapper holds one movie entry.
        movies = response.xpath('//dl[@class="board-wrapper"]/dd')
        for movie in movies:
            item = MaoyanItem()
            item['title'] = movie.xpath('.//p[@class="name"]/a/@title').extract_first()
            # default='' keeps .strip() from failing when a node is missing.
            item['actors'] = movie.xpath('.//p[@class="star"]/text()').extract_first(default='').strip()
            item['release_date'] = movie.xpath('.//p[@class="releasetime"]/text()').extract_first(default='').strip()
            item['image_url'] = movie.xpath('.//img/@data-src').extract_first()
            # The score is split across two <i> elements: integer part and fraction part.
            integer = movie.xpath('.//p[@class="score"]/i[@class="integer"]/text()').extract_first(default='')
            fraction = movie.xpath('.//p[@class="score"]/i[@class="fraction"]/text()').extract_first(default='')
            item['score'] = integer + fraction
            yield item

In the code, we first define the spider's name, allowed_domains and start_urls. "allowed_domains" restricts the crawler to URLs under the listed domain, and "start_urls" lists the URLs the crawler requests first.
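With a single entry in start_urls, only the first ranking page is requested. If the full ranking is wanted, one option is to generate one start URL per page. The sketch below rests on an assumption that is not confirmed by this article: that each page lists 10 movies and that later pages are selected with an "offset" query parameter. The helper name build_start_urls is hypothetical; check the site's actual pagination before relying on it.

```python
# Assumption: each board page lists 10 movies and later pages are
# selected with an "offset" query parameter (0, 10, ..., 90).
BASE_URL = 'http://maoyan.com/board/4'


def build_start_urls(total=100, page_size=10):
    """Return one URL per ranking page (hypothetical helper)."""
    return [f'{BASE_URL}?offset={offset}' for offset in range(0, total, page_size)]


start_urls = build_start_urls()
print(len(start_urls))
print(start_urls[1])
```

The resulting list could be assigned to the spider's start_urls attribute in place of the single URL.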

The spider's parse method receives the response, uses XPath to extract five fields for each movie (name, starring actors, release time, poster link, and rating), and stores them in a MaoyanItem.

Finally, we return each item with the yield keyword. Note: the MaoyanItem class is defined in the project's items.py file and must be imported.

3. Parse the page

Once the crawler has fetched the page we want, we can parse the HTML document and extract the information we need. In Scrapy this mainly means running XPath queries, and where needed regular expressions, against the response object.

In this example, XPath is enough to extract the five fields for each movie on the Maoyan movie ranking page.
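Where the extracted text carries a label or extra notes, a regular expression can tidy it after the XPath step. The sketch below uses the standard re module on hypothetical sample strings; the helper names extract_date and strip_label are illustrative and not part of Scrapy. Scrapy selectors also offer .re() and .re_first() for applying a pattern directly to a selection.

```python
import re

# Matches an ISO-style date, or just a year when no full date is given.
DATE_RE = re.compile(r'\d{4}(?:-\d{2}-\d{2})?')


def extract_date(text):
    """Return the first date-like substring, or None if there is none."""
    match = DATE_RE.search(text or '')
    return match.group(0) if match else None


def strip_label(text):
    """Drop a leading 'label:' prefix (ASCII or full-width colon)."""
    return re.sub(r'^[^:：]*[:：]\s*', '', (text or '').strip())


# Hypothetical sample values; the real page text may differ.
print(extract_date('Release: 1993-01-01 (Hong Kong)'))
print(strip_label('  Starring: Actor One, Actor Two  '))
```

Both helpers accept None, so they can be applied directly to the result of extract_first() without an extra check.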

4. Store data

After parsing the data, we need to store it, usually in a file or in a database.
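For comparison, the database option can be sketched with the standard library's sqlite3 module. This is only an illustration under stated assumptions: the table name movies, the helper save_movies, and the sample row are all hypothetical, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3


def save_movies(conn, movies):
    """Create the table if needed and insert one row per movie dict."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS movies ('
        'title TEXT, actors TEXT, release_date TEXT, image_url TEXT, score TEXT)'
    )
    conn.executemany(
        'INSERT INTO movies VALUES (:title, :actors, :release_date, :image_url, :score)',
        movies,
    )
    conn.commit()


# Hypothetical sample row; real values would come from the spider.
sample = [{
    'title': 'Movie A', 'actors': 'Actor One', 'release_date': '1993-01-01',
    'image_url': 'https://example.com/a.jpg', 'score': '9.5',
}]

conn = sqlite3.connect(':memory:')  # in-memory database for the demo
save_movies(conn, sample)
rows = conn.execute('SELECT title, score FROM movies').fetchall()
print(rows)
```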

In this example, we choose to save the data to a .csv file:

import csv


class MaoyanPipeline(object):
    def __init__(self):
        # newline='' lets the csv module control line endings itself.
        self.file = open('maoyan_top100_movies.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Write one CSV row per scraped movie.
        row = [item['title'], item['actors'], item['release_date'], item['image_url'], item['score']]
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.file.close()

In the code above, we use Python's built-in csv module to write the data to a file named maoyan_top100_movies.csv. The file is closed when the spider closes.
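Scrapy only runs a pipeline that is enabled in the project's settings. A minimal settings.py fragment might look like the following, assuming the project is named maoyan (as the spider's import suggests) and the pipeline class is saved in maoyan/pipelines.py:

```python
# settings.py (fragment)
# Assumption: the pipeline class lives in maoyan/pipelines.py.
# The number (conventionally 0-1000) sets the run order; lower values run first.
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
```

With the pipeline enabled, running "scrapy crawl maoyan" from the project directory starts the spider and writes the CSV file.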

Summary

In this article we used the Scrapy framework to crawl Maoyan movie ranking data: we analyzed the page structure, wrote a Scrapy spider, parsed the page, and stored the data. In real projects, a crawler also has to balance legality, usability, and efficiency when capturing data.

source:php.cn