


Scrapy implements crawling and analysis of WeChat public account articles
WeChat has become a very popular social media application in recent years, and the public accounts operated on it play an important role. WeChat public accounts are an ocean of information and knowledge, because each account can publish articles, image-and-text messages and other content. This information can be used in many fields, such as media reporting and academic research.
This article introduces how to use the Scrapy framework to crawl and analyze WeChat public account articles. Scrapy is a Python web crawling framework for extracting structured data from websites; it is highly customizable and efficient.
- Install Scrapy and create a project
To use the Scrapy framework for crawling, you first need to install Scrapy and the other dependencies, which can be done with pip:
pip install scrapy
pip install pymongo
pip install mysql-connector-python
After installing Scrapy, we need to use the Scrapy command line tool to create the project. The command is as follows:
scrapy startproject wechat
After executing this command, Scrapy creates a project named "wechat" and generates a set of files and directories inside it.
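For reference, a freshly generated project typically has a layout like the following (the exact set of files can vary slightly between Scrapy versions):

wechat/
    scrapy.cfg            # deployment configuration
    wechat/               # the project's Python package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py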
- Implement crawling of WeChat public account articles
Before we start crawling, we need to first understand the URL format of the WeChat public account article page. The URL of a typical WeChat public account article page looks like this:
https://mp.weixin.qq.com/s?__biz=XXX&mid=XXX&idx=1&sn=XXX&chksm=XXX#wechat_redirect
Here, __biz is the ID of the WeChat public account, mid is the ID of the article, idx is the article's serial number within a push, sn is the article's signature, and chksm is a content checksum. Therefore, if we want to crawl all articles of a particular official account, we need its __biz value (referred to below as biz_id), which uniquely identifies the account, and use it to build the URLs.
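As a quick illustration (not part of the crawler itself), these query parameters can be inspected with Python's standard library; the URL below uses the placeholder values from above rather than real tokens:

from urllib.parse import urlparse, parse_qs

# Placeholder values; a real article URL carries long Base64-like tokens here.
url = "https://mp.weixin.qq.com/s?__biz=XXX&mid=XXX&idx=1&sn=XXX&chksm=XXX#wechat_redirect"

params = parse_qs(urlparse(url).query)
print(params["__biz"][0])   # the account ID (biz_id)
print(params["idx"][0])     # serial number of the article within the push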
First, we need to prepare a list of official account IDs for the accounts whose articles we want to crawl. These IDs can be collected in various ways; here we simply use a list of test IDs as an example:
biz_ids = ['MzU5MjcwMzA4MA==', 'MzI4MzMwNDgwMQ==', 'MzAxMTcyMzg2MA==']
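If you prefer to drive the crawl from a script rather than the command line, one possible approach is a small runner built on Scrapy's CrawlerProcess. This is only a sketch: it assumes the WeChatSpider defined in the next step lives in wechat/spiders/wechat.py and that the script is run from inside the project directory so that the project settings can be found.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from wechat.spiders.wechat import WeChatSpider  # assumed module path

biz_ids = ['MzU5MjcwMzA4MA==', 'MzI4MzMwNDgwMQ==', 'MzAxMTcyMzg2MA==']

process = CrawlerProcess(get_project_settings())
for biz_id in biz_ids:
    # Schedule one crawl per account; they all run in the same reactor.
    process.crawl(WeChatSpider, biz_id=biz_id)
process.start()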
Next, we need to write a Spider to crawl all articles of a given public account. Here, we pass the official account's name and ID into the Spider so that the same Spider can handle different accounts.
import scrapy


class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        # biz_id already ends with '==' padding, so it is inserted into the URL as-is.
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(biz_id)
        ]

    def parse(self, response):
        # Collect the article links on the current profile page.
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            yield scrapy.Request(url, callback=self.parse_article)
        # Follow the "next page" link, if present, and parse it recursively.
        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()),
                                  callback=self.parse)

    def parse_article(self, response):
        url = response.url
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        yield {'url': url, 'title': (title.extract_first() or '').strip()}
The Spider uses the given official account ID to open the account's profile page, then follows the pagination links recursively to collect the URLs of all articles. The parse_article method extracts each article's URL and title for subsequent processing. Overall, the spider is not complex, but crawling page by page like this is relatively slow.
Finally, we enter the following command in the terminal to start the Spider:
scrapy crawl wechat -a biz_id=XXXXXXXX
Similarly, we can crawl several official accounts at once by passing all of their IDs on the command line, separated by commas:
scrapy crawl wechat -a biz_id=ID1,ID2,ID3
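Note that the Spider shown above builds only a single start URL, so to accept a comma-separated list like this, its __init__ needs to split the argument. A minimal sketch of that change (parse and parse_article stay the same as before):

class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        # Accept either a single ID or a comma-separated list of IDs.
        ids = [b.strip() for b in (biz_id or '').split(',') if b.strip()]
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(b)
            for b in ids
        ]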
- Storing article data
After crawling the articles, we need to save their titles and URLs to a database (such as MongoDB or MySQL). Here, we use the pymongo library to store the crawled data.
import pymongo


class MongoPipeline(object):
    collection_name = 'wechat'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert each crawled item as a document in the collection.
        self.db[self.collection_name].insert_one(dict(item))
        return item
In this Pipeline, we use MongoDB as the backend to store data. This class can be modified as needed to use other database systems.
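For example, since mysql-connector-python was installed earlier, a MySQL-backed pipeline could look roughly like the sketch below. The MYSQL_* setting names and the wechat table are assumptions for illustration, and the table would have to be created beforehand:

import mysql.connector


class MySQLPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # MYSQL_* are hypothetical settings you would add to settings.py.
        pipeline = cls()
        pipeline.conn_kwargs = dict(
            host=crawler.settings.get('MYSQL_HOST', 'localhost'),
            user=crawler.settings.get('MYSQL_USER', 'root'),
            password=crawler.settings.get('MYSQL_PASSWORD', ''),
            database=crawler.settings.get('MYSQL_DATABASE', 'wechat'),
        )
        return pipeline

    def open_spider(self, spider):
        self.conn = mysql.connector.connect(**self.conn_kwargs)
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Assumes a table such as: CREATE TABLE wechat (url VARCHAR(512), title VARCHAR(255));
        self.cursor.execute(
            "INSERT INTO wechat (url, title) VALUES (%s, %s)",
            (item['url'], item['title'])
        )
        self.conn.commit()
        return item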
Next, we need to configure database-related parameters in the settings.py file:
MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'wechat'

# The project created earlier is named "wechat", so the pipeline path uses that package.
ITEM_PIPELINES = {'wechat.pipelines.MongoPipeline': 300}
Finally, because the pipeline is registered in ITEM_PIPELINES, every item the Spider yields is passed through MongoPipeline automatically and stored in MongoDB; the Spider itself does not need to reference the pipeline at all:
class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(biz_id)
        ]

    def parse(self, response):
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            yield scrapy.Request(url, callback=self.parse_article)
        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()),
                                  callback=self.parse)

    def parse_article(self, response):
        url = response.url
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        # This dict is the item; Scrapy hands it to MongoPipeline.process_item().
        yield {'url': url, 'title': (title.extract_first() or '').strip()}
In other words, no extra request metadata or flags are needed: Scrapy routes every yielded item through the pipelines listed in ITEM_PIPELINES, in order of their priority values.
- Data Analysis
Finally, we will use pymongo and pandas to implement data analysis and visualization.
Here we will extract the data we crawled from MongoDB and save it to a CSV file. Subsequently, we can use pandas to process and visualize the CSV file.
The following is the implementation process:
import pandas as pd
import matplotlib.pyplot as plt
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['wechat']
articles = db['wechat']

# Load every stored document into a DataFrame and export it to CSV.
cursor = articles.find()
doc = list(cursor)
df = pd.DataFrame(doc)
df.to_csv('wechat.csv', encoding='utf-8')

# The stored items only contain url and title, so derive the account ID
# from the __biz parameter in each URL (assuming the URLs retain it).
df['biz_id'] = df['url'].str.extract(r'__biz=([^&#]+)', expand=False)
df.groupby('biz_id')['title'].count().plot(kind='bar')
plt.show()
In the above code, we use pymongo and pandas to export the crawled data from MongoDB to a CSV file. We then use pandas' data analysis features to plot the number of articles published by each public account.
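As a further, purely illustrative example, the same DataFrame can be used for simple text statistics on the titles, such as average title length per account and the most common words. This sketch assumes df and the biz_id column from the code above; for Chinese titles a tokenizer such as jieba would be needed instead of whitespace splitting.

from collections import Counter

# Average title length per account.
print(df.assign(title_len=df['title'].str.len())
        .groupby('biz_id')['title_len'].mean())

# Rough word frequency over all titles (whitespace-based, so only approximate).
words = Counter(w for t in df['title'].dropna() for w in t.split())
print(words.most_common(10))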
