
Scrapy implements crawling and analysis of WeChat public account articles

Jun 22, 2023 am 09:41 AM
Tags: WeChat public account, analysis, Scrapy


WeChat has been a very popular social media application in recent years, and the public accounts operated on it play a very important role. WeChat public accounts are an ocean of information and knowledge, because each account can publish articles, image-and-text messages, and other content. This information can be used in many fields, such as media reporting and academic research.

This article introduces how to use the Scrapy framework to crawl and analyze WeChat public account articles. Scrapy is a Python web crawler framework designed for extracting structured data from websites, and it is highly customizable and efficient.

  1. Install Scrapy and create a project

To use the Scrapy framework for crawling, you first need to install Scrapy and the other dependencies. They can be installed with pip:

pip install scrapy
pip install pymongo
pip install mysql-connector-python

After installing Scrapy, we need to use the Scrapy command line tool to create the project. The command is as follows:

scrapy startproject wechat

After executing this command, Scrapy creates a project named "wechat" with a number of files and directories inside it.
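The generated layout typically looks roughly like this (it may differ slightly between Scrapy versions):

wechat/
    scrapy.cfg            # deployment configuration
    wechat/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines (MongoPipeline will go here later)
        settings.py       # project settings
        spiders/          # spider modules
            __init__.py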

  2. Implement crawling of WeChat public account articles

Before we start crawling, we need to first understand the URL format of the WeChat public account article page. The URL of a typical WeChat public account article page looks like this:

https://mp.weixin.qq.com/s?__biz=XXX&mid=XXX&idx=1&sn=XXX&chksm=XXX#wechat_redirect

Here, __biz is the ID of the WeChat public account (this is the biz_id used below), mid is the message ID of the article, idx is the index of the article within that message, sn is the article's signature, and chksm is a content checksum. Therefore, to crawl all articles of a given official account, we need to find its __biz value and use it to build the article URLs.
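As a quick illustration, here is a minimal sketch of pulling the __biz value out of an article URL with the standard library (the helper name extract_biz_id is our own, not part of Scrapy):

from urllib.parse import urlparse, parse_qs

def extract_biz_id(article_url):
    # Return the __biz parameter of a WeChat article URL, or None if absent
    query = parse_qs(urlparse(article_url).query)
    return query.get('__biz', [None])[0]

print(extract_biz_id(
    'https://mp.weixin.qq.com/s?__biz=MzU5MjcwMzA4MA==&mid=1&idx=1&sn=abc&chksm=xyz#wechat_redirect'
))
# prints: MzU5MjcwMzA4MA==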

First, we need to prepare a list of the official account IDs whose articles we want to crawl. The IDs can be collected in various ways; here we use a list of a few test IDs as an example:

biz_ids = ['MzU5MjcwMzA4MA==', 'MzI4MzMwNDgwMQ==', 'MzAxMTcyMzg2MA==']

Next, we write a Spider to crawl all articles of a given official account. The account ID is passed to the Spider as an argument so that the same Spider can handle different accounts.

import scrapy


class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, biz_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the profile page URL of the official account from its __biz value
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(biz_id)
        ]

    def parse(self, response):
        # Extract the URLs of the articles listed on the current page
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            yield scrapy.Request(url, callback=self.parse_article)

        # Follow the "next page" link, if present, and parse it recursively
        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()), callback=self.parse)

    def parse_article(self, response):
        # Extract the URL and title of a single article
        url = response.url
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        yield {'url': url, 'title': title.extract_first().strip()}

The Spider uses the given official account ID to visit the account's profile page, then recursively traverses the list pages and extracts the URLs of all articles. The parse_article method extracts each article's URL and title for subsequent processing. Overall the Spider is not complex, but crawling is relatively slow because the list pages are fetched one after another.
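It is also worth throttling the crawl in settings.py so that requests are spaced out and the risk of being blocked is reduced; the following is a minimal sketch using standard Scrapy settings (the concrete values are only examples):

# settings.py (excerpt): slow the crawl down to reduce the risk of being blocked
DOWNLOAD_DELAY = 2                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # fetch one page at a time per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay automatically
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10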

Finally, we enter the following command in the terminal to start the Spider:

scrapy crawl wechat -a biz_id=XXXXXXXX

Similarly, we can crawl multiple official accounts by passing all of their IDs in one command:

scrapy crawl wechat -a biz_id=ID1,ID2,ID3
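Note that the Spider above builds only a single start URL, so a comma-separated list will not work as-is; here is a minimal sketch of an adapted __init__ (our own modification, not part of the original code) that accepts either a single ID or a comma-separated list:

    def __init__(self, biz_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Accept a single ID or a comma-separated list of IDs
        ids = (biz_id or '').split(',')
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(i.strip())
            for i in ids if i.strip()
        ]
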
  3. Storing article data

After crawling the articles, we need to save their titles and URLs to a database (such as MongoDB or MySQL). Here, we use the pymongo library to save the crawled data.

import pymongo


class MongoPipeline(object):
    collection_name = 'wechat'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Open the MongoDB connection when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Close the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # Insert each crawled item into the "wechat" collection
        self.db[self.collection_name].insert_one(dict(item))
        return item

In this Pipeline, we use MongoDB as the backend to store data. This class can be modified as needed to use other database systems.
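For example, since mysql-connector-python was installed earlier, a MySQL variant could look roughly like the sketch below. This is an assumption-laden illustration: the connection parameters are placeholders, and it assumes a wechat table with url and title columns already exists; like MongoPipeline, it would only take effect once registered in ITEM_PIPELINES.

import mysql.connector


class MySQLPipeline(object):
    def open_spider(self, spider):
        # Connection parameters are placeholders; adjust them to your environment
        self.conn = mysql.connector.connect(
            host='localhost', user='root', password='password', database='wechat'
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Assumes a table like: CREATE TABLE wechat (url VARCHAR(512), title VARCHAR(255))
        self.cursor.execute(
            'INSERT INTO wechat (url, title) VALUES (%s, %s)',
            (item['url'], item['title'])
        )
        self.conn.commit()
        return item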

Next, we need to configure database-related parameters in the settings.py file:

MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'wechat'
ITEM_PIPELINES = {'wechat.pipelines.MongoPipeline': 300}

Finally, with MongoPipeline enabled in ITEM_PIPELINES, the Spider only needs to yield its items; Scrapy passes each item through the pipeline, which stores it in MongoDB:

class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, biz_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(biz_id)
        ]

    def parse(self, response):
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            yield scrapy.Request(url, callback=self.parse_article)

        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()), callback=self.parse)

    def parse_article(self, response):
        # Each yielded item is handed to MongoPipeline via ITEM_PIPELINES
        url = response.url
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        yield {'url': url, 'title': title.extract_first().strip()}

In the code above, no extra wiring is needed inside the Spider: because MongoPipeline is registered in ITEM_PIPELINES, Scrapy automatically routes every item yielded by parse_article through its process_item method, and the data ends up in MongoDB.
  4. Data analysis

Finally, we will use pymongo and pandas to implement data analysis and visualization.

Here we will extract the data we crawled from MongoDB and save it to a CSV file. Subsequently, we can use pandas to process and visualize the CSV file.

The following is the implementation process:

import matplotlib.pyplot as plt
import pandas as pd
from pymongo import MongoClient

# Load all crawled articles from MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['wechat']
articles = db['wechat']

cursor = articles.find()
docs = list(cursor)

# Export the raw data to CSV
df = pd.DataFrame(docs)
df.to_csv('wechat.csv', encoding='utf-8', index=False)

# The items only store url and title, so derive the account ID (__biz)
# from each article URL before counting articles per account
df['biz_id'] = df['url'].str.extract(r'__biz=([^&#]+)', expand=False)
df.groupby('biz_id')['title'].count().plot(kind='bar')
plt.show()

In the code above, we use pymongo and pandas to export the crawled data to a CSV file, and then use pandas to visualize the number of articles published by each public account as a bar chart.
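If the aggregated counts are also needed as data rather than only as a chart, a small follow-up sketch (the output file name is just an example):

counts = df.groupby('biz_id')['title'].count()
counts.to_csv('article_counts.csv', encoding='utf-8')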

The above is the detailed content of Scrapy implements crawling and analysis of WeChat public account articles. For more information, please follow other related articles on the PHP Chinese website!
