Scrapy is an open-source Python framework for crawling websites and extracting data quickly and efficiently. In this article, we will use Scrapy to crawl Douban movie data and the movies' rating rankings.
First, we need to install Scrapy. You can install Scrapy by entering the following command at the command line:
pip install scrapy
Next, we will create a Scrapy project. At the command line, enter the following command:
scrapy startproject doubanmovie
This will create a Scrapy project named doubanmovie. We will then go into the project directory and create a spider file called douban.py. At the command line, enter the following commands:
cd doubanmovie
scrapy genspider douban douban.com
Now, we have a Spider ready to use. Next, we will define the spider's behavior to get the required data.
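We will use the spider to crawl Douban movie data. Specifically, for each movie we will get the following information: name, director, actors, genre, country, language, release date, duration, rating, and number of reviews.
Open the douban.py file and add the following code: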
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each movie on the list page sits in its own <div class="item">.
        movie_list = response.xpath('//div[@class="item"]')
        for movie in movie_list:
            # The text()[n] selectors are positional and depend on the page
            # layout; .get() returns None when a node is missing.
            yield {
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                # Assumes the review count is the last <span> in the star block;
                # the rating_num span would only repeat the rating.
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }
In this code, we use XPath to select the information we need. Each yield hands one dictionary (one movie) back to Scrapy, which collects it as a scraped item.
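The relative-selection pattern used in parse (find every item, then look up fields inside each item with a path starting with .//) can be tried without a network connection. The following is a minimal sketch that swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors; ElementTree supports only a subset of XPath, and the SAMPLE markup and the extract function are made up for illustration, not taken from the Douban page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the shape of a list page.
SAMPLE = """
<ol>
  <li><div class="item">
    <span class="title">Movie A</span>
    <div class="star"><span class="rating_num">9.7</span><span>100 ratings</span></div>
  </div></li>
  <li><div class="item">
    <span class="title">Movie B</span>
    <div class="star"><span class="rating_num">9.6</span><span>80 ratings</span></div>
  </div></li>
</ol>
"""


def extract(markup):
    """Return one dict per item, looking up each field relative to the item."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(".//div[@class='item']"):
        rows.append({
            # The leading "." keeps the search inside the current item.
            "name": item.findtext(".//span[@class='title']"),
            "rating": item.findtext(".//span[@class='rating_num']"),
        })
    return rows


rows = extract(SAMPLE)
for row in rows:
    print(row)
```

The point of the sketch is the leading dot: without it, a path such as //span[@class="title"] in Scrapy would search the whole page from every item instead of only the current one.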
If we run our spider now (with the command scrapy crawl douban), it will crawl only the first page of the list, which holds 25 of the 250 movies, and print the scraped items to the command line.
Now we have the data for the first page of the Top 250 list. Next, we will crawl the remaining pages and get each movie's ranking.
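Text nodes selected with text() usually keep the page's indentation, line breaks, and non-breaking spaces, so the printed items can look ragged. The following is a minimal sketch of a tidy-up step, assuming the fields arrive as plain strings or None; the names clean and clean_item and the sample dictionary are hypothetical and not part of Scrapy.

```python
def clean(value):
    """Collapse runs of whitespace (including non-breaking spaces) to one space."""
    if value is None:
        return None
    return " ".join(value.replace("\xa0", " ").split())


def clean_item(item):
    """Return a copy of a scraped item with every string field tidied."""
    return {
        key: clean(val) if isinstance(val, str) else val
        for key, val in item.items()
    }


# Made-up example of what a raw item might look like.
raw = {
    "name": "Movie A",
    "director": "\n      Director: Someone\xa0\xa0\xa0Cast: Another\n   ",
    "genre": None,
}
tidy = clean_item(raw)
print(tidy)
```

In a spider, one option would be to wrap the dictionary in clean_item before yielding it; missing fields stay None so they remain easy to spot.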
We first need to extend the spider so that it follows the pagination links and crawls the whole Top 250 list. We will use this list to get the ranking of the movies.
In the douban.py file, replace the existing code with the following:
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath('//div[@class="item"]')
        for movie in movie_list:
            yield {
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                # Assumes the review count is the last <span> in the star block.
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

        # After all items on this page: follow the "next" link if there is one.
        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            # The href is relative, so resolve it against the current page URL.
            url = response.urljoin(next_page)
            yield scrapy.Request(url, callback=self.parse)
In this code, we use a variable called next_page to check whether we have reached the last page. If the page still has a next-page link, we request that page and parse it with the same parse method.
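The URL handling in that pagination step can be checked on its own. The following is a minimal sketch using urllib.parse.urljoin from the standard library, which resolves a relative link against a page URL in much the same way response.urljoin does for the current response. The helper name next_page_url and the sample href "?start=25&filter=" are assumptions made for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://movie.douban.com/top250"


def next_page_url(page_url, href):
    """Return the absolute URL of the next page, or None on the last page."""
    if not href:
        # No "next" link was found, so the crawl stops here.
        return None
    return urljoin(page_url, href)


first_hop = next_page_url(PAGE_URL, "?start=25&filter=")  # assumed href format
last_hop = next_page_url(PAGE_URL, None)
print(first_hop)
print(last_hop)
```

A query-only href keeps the path of the current page and replaces only the query string, which is why the spider can pass the raw attribute value straight to urljoin.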
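Next, we need to update the parse method to get the ranking of each movie. We will use Python's enumerate function to associate a ranking with each movie. Because parse runs once per page, the counter has to start from the number of movies already seen on earlier pages; otherwise the ranks would restart at 1 on every page.
In the douban.py file, replace the original parse method with the following: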
    def parse(self, response, offset=0):
        movie_list = response.xpath('//div[@class="item"]')
        # offset is the number of movies on earlier pages, so ranks keep
        # counting up instead of restarting at 1 on every page.
        for rank, movie in enumerate(movie_list, start=offset + 1):
            yield {
                'rank': rank,
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                # Assumes the review count is the last <span> in the star block.
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            # Pass the running total to the next call through cb_kwargs.
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse,
                cb_kwargs={'offset': offset + len(movie_list)},
            )
Now, if we run our spider again, it will get the data for all 250 movies and print it to the command line. At this point, we will see the ranking of every movie.
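One detail worth checking when numbering movies with enumerate is that the parse method runs once per page, so a counter that starts at 1 on each call repeats itself. The following is a minimal sketch of the difference, using plain lists in place of scraped pages; the function rank_pages and the sample titles are hypothetical.

```python
def rank_pages(pages):
    """Yield (rank, name) pairs whose ranks continue across pages."""
    offset = 0
    for page in pages:
        # Start counting where the previous page stopped.
        for rank, name in enumerate(page, start=offset + 1):
            yield rank, name
        offset += len(page)


# Two made-up pages of two movies each.
pages = [["Movie A", "Movie B"], ["Movie C", "Movie D"]]

# Counting from 1 on every page repeats the ranks.
naive = [(i + 1, name) for page in pages for i, name in enumerate(page)]

# Carrying the offset forward gives one continuous ranking.
ranked = list(rank_pages(pages))

print(naive)
print(ranked)
```

In a spider the running offset cannot live in a local loop variable, because each page is handled by a separate callback; it has to travel with the request or be derived from the page itself.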
Scrapy is a powerful and flexible tool for crawling data quickly and efficiently. In this article, we have used Scrapy to crawl Douban movie data and the movies' rating rankings.
We used Python code and XPath to select the information we wanted from the web page, and the yield statement to hand each item back to Scrapy. Throughout the process, Scrapy provides a simple and effective way to manage and crawl large amounts of data, allowing us to move quickly on to data analysis and processing.