How to use Python to crawl popular comments on NetEase Cloud Music

零到壹度

Apr 11, 2018 pm 05:33 PM

This article shows how to use Python to crawl the popular comments on NetEase Cloud Music. I hope it is a useful reference for anyone who needs it.

Preface

Recently I have been studying text mining. As the saying goes, even the cleverest cook cannot make a meal without rice: to do text analysis, you first need text. There are many ways to obtain it, such as downloading ready-made documents from the Internet or fetching data through APIs provided by third parties. But sometimes the data we want has no download channel or API at all. What should we do then? A good option is a web crawler, that is, a program that pretends to be a user and fetches the data we want. Thanks to the speed of computers, we can collect data easily and quickly this way.

About crawlers

So how do you write a crawler? Many languages can be used, such as Java, PHP, and Python. I personally prefer Python, because it not only has powerful built-in network libraries but also many excellent third-party libraries. Others have already built the wheel, and we can simply use it, which makes writing crawlers very convenient. It is no exaggeration to say that you can write a small crawler in fewer than 10 lines of Python, while other languages may require a lot more code. Being simple and easy to understand is a huge advantage of Python.

Okay, without further ado, let's get to today's topic. NetEase Cloud Music has become very popular in recent years. I have been a user for several years; before that I used QQ Music and Kugou. From my own experience, the best features of NetEase Cloud Music are its accurate song recommendations and its distinctive user comments (to be clear: this is not an advertorial or an advertisement, only my personal opinion!). A song often has a few comments underneath that have received a great many likes. And since NetEase Cloud Music recently displayed selected user comments in the subway, its comments have become popular all over again. So I wanted to analyze the comments and look for patterns, especially the common characteristics of hot comments. With that goal in mind, I started crawling NetEase Cloud Music comments.

Network library

Python 2 has two built-in network libraries, urllib and urllib2 (in Python 3 they were merged into the urllib package), but they are not particularly convenient to use, so here we use a well-regarded third-party library, requests. With requests, more complex crawler tasks such as setting up proxies and simulating logins take only a few lines of code. If pip is already installed, just run pip install requests to install it.

The Chinese documentation is here: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

If you have any questions, refer to the official documentation, which gives a very detailed introduction. The urllib and urllib2 libraries are also quite useful, and I will introduce them if I get the chance.

Working principle

Before introducing the crawler itself, let's go over the basic working principle of a crawler. When we open a browser and visit a URL, we are essentially sending a request to the server. The server receives the request and returns data accordingly, and the browser then parses that data and presents it to us.

If we use code instead, we skip the browser step: we send the request directly to the server, retrieve the data the server returns, and extract the information we want.

The problem is that the server sometimes validates the requests it receives. If it decides a request is not legitimate, it returns no data, or returns incorrect data. To avoid this, we sometimes need to disguise the program as a normal user in order to get a proper response from the server.

How to disguise it?

This depends on the difference between a user accessing a web page through a browser and us accessing a web page through a program.

Generally speaking, when we access a web page through a browser, the browser sends more than just the URL. It also sends additional information to the server, such as the request headers, which act as the identity certificate of the request. When the server sees this data, it knows that we are visiting through a normal browser and returns the data without complaint.

So our program has to behave like a browser and send this identifying information along with each request in order to get the data smoothly. Sometimes we must also be logged in to get certain data, in which case we have to simulate a login.
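
To make the idea of sending browser-like headers concrete, here is a minimal sketch using only the Python 3 standard library. The header values are assumptions for illustration; in practice you would copy the ones your own browser sends. Nothing is sent over the network: the code only builds the request object.

```python
from urllib.request import Request

# Hypothetical header values; copy the real ones from your browser's
# developer tools instead of relying on these.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Referer": "http://music.163.com/",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a Request carrying browser-like headers (nothing is sent yet)."""
    return Request(url, headers=dict(headers))


req = build_request("http://music.163.com/")
# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in req.header_items():
    print(name, "=", value)
```

Passing the resulting object to urllib.request.urlopen would actually send it; the requests library used later in this article accepts the same kind of dictionary through its headers argument.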

Essentially, logging in through the browser means posting some form data to the server (user name, password, and so on). Once the server has verified it, we are logged in. The same applies to a program: we simply send exactly the data the browser posts.
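
As a rough illustration of "posting the same form data the browser posts", the sketch below builds a login request with the standard library. The URL and the field names username and password are hypothetical; a real site defines its own, which you can read from the developer tools. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(login_url, username, password):
    """Build (but do not send) a form POST like the one a browser submits."""
    # Hypothetical field names; use the ones the real login form posts.
    form = {"username": username, "password": password}
    body = urlencode(form).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    # Supplying a body makes urllib issue a POST instead of a GET.
    return Request(login_url, data=body, headers=headers)


login_req = build_login_request("https://example.com/login", "alice", "s3cret!")
print(login_req.get_method())
print(login_req.data)
```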

I will cover simulated login in detail later. Of course, things do not always go so smoothly, because some websites have anti-crawling measures. For example, if you access a site too quickly, your IP address may be blocked (Douban is a typical case). In that case we have to use a proxy server, that is, change our IP address: if one IP is blocked, switch to another. How to do this will also be discussed later.
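
The "if one IP is blocked, switch to another" idea can be sketched as a small retry loop. This is only an illustration under assumptions: fetch is a hypothetical callable you supply (for example a thin wrapper around requests.get with its proxies argument), and the proxy addresses are made up.

```python
from itertools import cycle


def fetch_with_proxy_rotation(url, proxies, fetch, max_attempts=3):
    """Call fetch(url, proxy) with successive proxies until one succeeds."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)  # endlessly iterate over the proxy list
    last_error = None
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            return fetch(url, proxy)
        except IOError as err:  # network-level failure: try the next proxy
            last_error = err
    raise RuntimeError("all %d attempts failed: %s" % (max_attempts, last_error))


def fake_fetch(url, proxy):
    """Stand-in for a real download: the first proxy is 'blocked'."""
    if proxy == "http://10.0.0.1:8080":
        raise IOError("blocked")
    return "fetched %s via %s" % (url, proxy)


result = fetch_with_proxy_rotation(
    "http://example.com/",
    ["http://10.0.0.1:8080", "http://10.0.0.2:8080"],
    fake_fetch,
)
print(result)
```

Injecting the fetch function keeps the rotation logic independent of any particular HTTP library and makes it easy to try out without touching the network.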

Tips

Finally, let me introduce a little trick that I find very useful when writing a crawler. If you use Firefox or Chrome, you may have noticed the developer tools (Chrome) or the web console (Firefox). This tool is very useful because it lets us see clearly what the browser sends and what the server returns when we visit a website. That information is the key to writing a crawler. Below you will see how useful it can be.

How to crawl comments

First open the web version of NetEase Cloud Music and open the page of a song. Here I take Jay Chou's "Sunny Day" as an example. As shown below:

[Figure 1: the "Sunny Day" song page on the NetEase Cloud Music website]

Next, open the web console (in Chrome, open the developer tools; other browsers should be similar), as shown below:

[Figure 2: the web console]

Then click the Network tab, clear all existing entries, and click Reload (equivalent to refreshing the page). This way we can see directly what the browser sends and what the server responds with. As shown below:

[Figure 3: the Network tab]

The data obtained after refreshing is as follows:

[Figure 4: the requests listed after refreshing]

You can see that the browser sends a lot of requests, so which one do we want? We can make a preliminary judgment from the status code, which indicates how the server handled the request. A status code of 200 means the request succeeded and the response carries data, while 304 (Not Modified) means the browser's cached copy is still valid, so the server sends no new content. (There are many status codes; you can look them up if you want to know more.) So we generally only need to look at requests with status code 200. We can also get a rough idea of what the server returned through the Preview pane in the right-hand column (or by viewing the Response). As shown below:

[Figure 5: previewing a response in the right-hand column]
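
For reference, Python's standard library knows the official names of the status codes, which is handy when filtering requests in a script. This is a small sketch; the helper names describe_status and carries_data are my own.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return "%d %s" % (status.value, status.phrase)


def carries_data(code):
    """True only for 200 OK, the responses worth inspecting here."""
    return code == HTTPStatus.OK


for code in (200, 304, 404):
    print(describe_status(code), "| inspect:", carries_data(code))
```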

By combining these two methods, we can quickly find the request we want to analyze. Note that the Request URL field in Figure 5 is the URL we need to request. There are two request methods, GET and POST. Another thing to focus on is the request headers, which contain User-Agent (client information), Referer (the page the request came from), and other information. Generally we send the header information with both GET and POST requests. The header information is as follows:

[Figure 6: the request headers]

In addition, note that in GET requests the parameters are generally appended directly to the URL in the form ?parameter1=value1&parameter2=value2, so no additional request parameters need to be sent. POST requests generally carry their parameters separately instead of placing them in the URL, so sometimes we also need to pay attention to the Params column. After a careful search, we finally found that the comment data comes from the request http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token= as shown below:

[Figure 7: the comment request in the request list]
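
The difference between GET and POST parameters can be seen with the standard library's URL helpers. This is a small sketch; example.com and the parameter names are placeholders, while the last URL is the comment request found above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {"parameter1": "value1", "parameter2": "value2"}

# GET: the parameters travel inside the URL, after the "?".
get_url = "http://example.com/search?" + urlencode(params)

# POST: the URL stays clean and the same encoding goes into the request body.
post_url = "http://example.com/search"
post_body = urlencode(params).encode("ascii")

# Going the other way: read the query string back out of a URL.
parsed = parse_qs(urlsplit(get_url).query)

comment_request = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token="
# keep_blank_values=True keeps csrf_token even though its value is empty.
comment_query = parse_qs(urlsplit(comment_request).query, keep_blank_values=True)

print(get_url)
print(post_body)
print(parsed)
print(comment_query)
```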

Clicking on this request, we find that it is a POST request with two parameters, params and encSecKey. The values of both parameters are very long and look encrypted. As shown below:

[Figure 8: the params and encSecKey parameters of the POST request]

The comment data returned by the server is in JSON format and contains a wealth of information (the commenter, the comment date, the number of likes, the comment content, and so on), as shown in Figure 9 below. (hotComments is the array of hot comments, and comments is the array of regular comments.)

[Figure 9: the JSON data returned by the server]
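
To show how such a response can be unpacked, here is a minimal sketch that parses a made-up sample. Only the key names (total, hotComments, comments, content, likedCount, time, user, userId, nickname, avatarUrl) follow the script at the end of this article; all values are invented for illustration.

```python
import json

# Hypothetical sample response; the values are invented.
SAMPLE_RESPONSE = """
{
  "total": 2,
  "hotComments": [
    {"content": "Great song", "likedCount": 120, "time": 1490677541180,
     "user": {"userId": 1001, "nickname": "alice",
              "avatarUrl": "http://example.com/a.jpg"}}
  ],
  "comments": [
    {"content": "First!", "likedCount": 3, "time": 1490677541999,
     "user": {"userId": 1002, "nickname": "bob",
              "avatarUrl": "http://example.com/b.jpg"}},
    {"content": "Great song", "likedCount": 120, "time": 1490677541180,
     "user": {"userId": 1001, "nickname": "alice",
              "avatarUrl": "http://example.com/a.jpg"}}
  ]
}
"""


def comment_row(item):
    """Flatten one comment object into a simple tuple."""
    user = item["user"]
    return (user["userId"], user["nickname"], item["likedCount"], item["content"])


def parse_comments(json_text):
    """Return (total, hot comment rows, regular comment rows)."""
    data = json.loads(json_text)
    hot = [comment_row(c) for c in data.get("hotComments", [])]
    regular = [comment_row(c) for c in data.get("comments", [])]
    return data.get("total", 0), hot, regular


total, hot, regular = parse_comments(SAMPLE_RESPONSE)
print(total, hot, regular)
```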

At this point the direction is clear: we only need to work out the values of the two parameters params and encSecKey. This problem troubled me for a whole afternoon. I worked on it for a long time and still could not figure out how the two parameters are encrypted, but I did discover a pattern. In http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token= the number after R_SO_4_ is the id of the song. As for the params and encSecKey values, they are interchangeable between songs for the same page number. That is, if the two parameter values for the first page of song A are sent with the request for any other song B, we get the first page of comments for song B. The same goes for the second page, the third page, and so on.

Unfortunately, the parameters differ from one page number to the next, so this method can only capture a limited number of pages (which is, of course, enough to get the total number of comments and the hot comments). To capture all the data, you have to understand how these two parameter values are encrypted.
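
The two observations, that the song id sits in the URL and that every page needs its own parameters, can be written down as a few helpers. This is only a sketch of the unencrypted side: the page size of 20, the offset rule (page - 1) * 20, and the plain-text parameter string are taken from the script at the end of this article, while the helper names are my own. The encryption step is not shown here.

```python
COMMENT_URL = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_%s?csrf_token="
PAGE_SIZE = 20  # comments per page, as used by the script in this article


def comment_url(song_id):
    """Build the comment request URL for a song id."""
    return COMMENT_URL % song_id


def page_offset(page):
    """Offset of the first comment on a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * PAGE_SIZE


def page_count(total_comments):
    """Number of pages needed for total_comments, rounding up."""
    return (total_comments + PAGE_SIZE - 1) // PAGE_SIZE


def plain_params(page):
    """Plain-text parameter string for a page, before any encryption."""
    total_flag = "true" if page == 1 else "false"
    return '{rid:"", offset:"%d", total:"%s", limit:"%d", csrf_token:""}' % (
        page_offset(page), total_flag, PAGE_SIZE)


print(comment_url(186016))
print(plain_params(1))
print(plain_params(3))
print(page_count(41))
```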

I thought I would never figure it out. Last night I searched for this question on Zhihu and actually found the answer: a Zhihu user, @平胸小仙女, explained in detail how to reproduce the encryption of these two parameters. I studied it and found it still a bit complicated, but after adapting the method that user described, I successfully fetched all the comments. My thanks to @平胸小仙女 on Zhihu.

That completes the explanation of how to capture all of the comment data from NetEase Cloud Music. As usual, here is the code at the end; it worked in my own tests:

#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
# @Time   : 2017/3/28 8:46
# @Author : Lyrichu
# @Email  : 919987476@qq.com
# @File   : NetCloud_spider3.py
'''
@Description:
NetEase Cloud Music comment crawler; it can crawl all the comments of a song.
Partly based on an article by @平胸小仙女.
Source: Zhihu
'''
from Crypto.Cipher import AES
import base64
import requests
import json
import codecs
import time

# Request headers
headers = {
    'Host': "music.163.com",
    'Accept-Language': "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    'Accept-Encoding': "gzip, deflate",
    'Content-Type': "application/x-www-form-urlencoded",
    'Cookie': "_ntes_nnid=754361b04b121e078dee797cdb30e0fd,1486026808627; _ntes_nuid=754361b04b121e078dee797cdb30e0fd; JSESSIONID-WYYY=yfqt9ofhY%5CIYNkXW71TqY5OtSZyjE%2FoswGgtl4dMv3Oa7%5CQ50T%2FVaee%2FMSsCifHE0TGtRMYhSPpr20i%5CRO%2BO%2B9pbbJnrUvGzkibhNqw3Tlgn%5Coil%2FrW7zFZZWSA3K9gD77MPSVH6fnv5hIT8ms70MNB3CxK5r3ecj3tFMlWFbFOZmGw%5C%3A1490677541180; _iuqxldmzr_=32; vjuids=c8ca7976.15a029d006a.0.51373751e63af8; vjlast=1486102528.1490172479.21; __gads=ID=a9eed5e3cae4d252:T=1486102537:S=ALNI_Mb5XX2vlkjsiU5cIy91-ToUDoFxIw; vinfo_n_f_l_n3=411a2def7f75a62e.1.1.1486349441669.1486349607905.1490173828142; P_INFO=m15527594439@163.com|1489375076|1|study|00&99|null&null&null#hub&420100#10#0#0|155439&1|study_client|15527594439@163.com; NTES_CMT_USER_INFO=84794134%7Cm155****4439%7Chttps%3A%2F%2Fsimg.ws.126.net%2Fe%2Fimg5.cache.netease.com%2Ftie%2Fimages%2Fyun%2Fphoto_default_62.png.39x39.100.jpg%7Cfalse%7CbTE1NTI3NTk0NDM5QDE2My5jb20%3D; usertrack=c+5+hljHgU0T1FDmA66MAg==; Province=027; City=027; _ga=GA1.2.1549851014.1489469781; __utma=94650624.1549851014.1489469781.1490664577.1490672820.8; __utmc=94650624; __utmz=94650624.1490661822.6.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; playerid=81568911; __utmb=94650624.23.10.1490672820",
    'Connection': "keep-alive",
    'Referer': 'http://music.163.com/'
}
# Proxy servers. requests expects the keys 'http' and 'https' (no trailing colon).
# These addresses are examples and have probably expired: replace them with
# working proxies, or drop the proxies argument in get_json.
proxies = {
    'http': 'http://121.232.146.184',
    'https': 'https://144.255.48.197'
}
# offset = (page number - 1) * 20; total is "true" for the first page and "false" for the others
# first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}' # first parameter
second_param = "010001" # second parameter
# third parameter
third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
# fourth parameter
forth_param = "0CoJUm6Qyw8W8jud"
# Build the encrypted "params" value
def get_params(page): # page: page number, starting at 1
    iv = "0102030405060708"
    first_key = forth_param
    second_key = 16 * 'F'
    if page == 1: # first page
        first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
    else:
        offset = str((page-1)*20)
        first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset, 'false')
    # Encrypt twice: first with first_key, then with second_key
    h_encText = AES_encrypt(first_param, first_key, iv)
    h_encText = AES_encrypt(h_encText, second_key, iv)
    return h_encText
# Return encSecKey (the same fixed value is sent with every request)
def get_encSecKey():   
    encSecKey = "257348aecb5e556c066de214e531faadd1c55d814f9be95fd06d6bff9f4c7a41f831f6394d5a3fd2e3881736d94a02ca919d952872e7d0a50ebfa1769a7a62d512f5f1ca21aec60bc3819a9c3ffca5eca9a0dba6d6f7249b06f5965ecfff3695b54e1c28f3f624750ed39e7de08fc8493242e26dbc4484a01c76f739e135637c"   
    return encSecKey   
# AES encryption (CBC mode): pad the text to a multiple of 16 bytes, encrypt, then Base64-encode
def AES_encrypt(text, key, iv):
    pad = 16 - len(text) % 16
    text = text + pad * chr(pad)
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    encrypt_text = encryptor.encrypt(text)
    encrypt_text = base64.b64encode(encrypt_text)
    return encrypt_text
# Fetch the comment data (JSON text) for one page
def get_json(url, params, encSecKey):
    data = {
        "params": params,
        "encSecKey": encSecKey
    }
    response = requests.post(url, headers=headers, data=data, proxies=proxies)
    return response.content
# Crawl the hot comments and return them as a list of lines
def get_hot_comments(url):
    hot_comments_list = []
    # Header row: user ID, nickname, avatar URL, comment time, like count, comment content
    hot_comments_list.append(u"用户ID 用户昵称 用户头像地址 评论时间 点赞总数 评论内容\n")
    params = get_params(1) # first page
    encSecKey = get_encSecKey()
    json_text = get_json(url,params,encSecKey)
    json_dict = json.loads(json_text)
    hot_comments = json_dict['hotComments'] # hot comments
    print("共有%d条热门评论!" % len(hot_comments)) # number of hot comments
    for item in hot_comments:
        comment = item['content'] # comment content
        likedCount = item['likedCount'] # like count
        comment_time = item['time'] # comment time (timestamp)
        userID = item['user']['userId'] # commenter id
        nickname = item['user']['nickname'] # nickname
        avatarUrl = item['user']['avatarUrl'] # avatar URL
        # Numeric fields must be converted before concatenation; one record per line
        comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
        hot_comments_list.append(comment_info)
    return hot_comments_list
# Crawl all the comments of one song
def get_all_comments(url):
    all_comments_list = [] # holds every comment line
    # Header row: user ID, nickname, avatar URL, comment time, like count, comment content
    all_comments_list.append(u"用户ID 用户昵称 用户头像地址 评论时间 点赞总数 评论内容\n")
    params = get_params(1)
    encSecKey = get_encSecKey()
    json_text = get_json(url,params,encSecKey)
    json_dict = json.loads(json_text)
    comments_num = int(json_dict['total'])
    page = (comments_num + 19) // 20 # number of pages, rounded up
    print("共有%d页评论!" % page) # number of pages
    for i in range(page): # fetch page by page
        params = get_params(i+1)
        encSecKey = get_encSecKey()
        json_text = get_json(url,params,encSecKey)
        json_dict = json.loads(json_text)
        if i == 0:
            print("共有%d条评论!" % comments_num) # total number of comments
        for item in json_dict['comments']:
            comment = item['content'] # comment content
            likedCount = item['likedCount'] # like count
            comment_time = item['time'] # comment time (timestamp)
            userID = item['user']['userId'] # commenter id
            nickname = item['user']['nickname'] # nickname
            avatarUrl = item['user']['avatarUrl'] # avatar URL
            comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
            all_comments_list.append(comment_info)
        print("第%d页抓取完毕!" % (i+1)) # page i+1 done
    return all_comments_list
# Write the comments to a text file
def save_to_file(lines, filename):
    with codecs.open(filename, 'a', encoding='utf-8') as f:
        f.writelines(lines)
    print("写入文件成功!") # file written successfully

if __name__ == "__main__":
    start_time = time.time() # start time
    url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016/?csrf_token="
    filename = u"晴天.txt"
    all_comments_list = get_all_comments(url)
    save_to_file(all_comments_list,filename)
    end_time = time.time() # end time
    print("程序耗时%f秒." % (end_time - start_time)) # elapsed time in seconds

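
The padding step inside AES_encrypt is easy to overlook, so here it is on its own as a small sketch in plain Python 3, with no crypto library needed. It assumes ASCII text, where one character is one byte, as is the case for the parameter string in the script; the unpad helper is my own addition to show that the padding is reversible.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pad(text, block_size=BLOCK_SIZE):
    """Append n copies of chr(n) so the length becomes a multiple of block_size."""
    n = block_size - len(text) % block_size
    return text + chr(n) * n


def unpad(padded):
    """Reverse pad(): the last character says how many characters to drop."""
    n = ord(padded[-1])
    return padded[:-n]


short = pad("abc")            # 3 characters -> 13 padding characters
exact = pad("x" * BLOCK_SIZE)  # already aligned -> a full extra block is added
print(len(short), repr(short[-1]))
print(len(exact))
print(unpad(short))
```

Note that a text whose length is already a multiple of 16 still gets a whole block of padding, which is what lets the receiver remove the padding unambiguously.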

I ran the code above on two of Jay Chou's popular songs, "Sunny Day" (more than 1.3 million comments) and "Confession Balloon" (more than 200,000 comments). The former ran for about 20 minutes, and the latter for more than 6,600 seconds (that is, nearly 2 hours). The screenshots are as follows:

[Figure: screenshots of the crawl results]

Note that the fields are separated by spaces. Each line contains the user ID, user nickname, user avatar URL, comment time, total number of likes, and comment content. If you run the code yourself, be careful not to open too many threads and put too much pressure on the NetEase Cloud servers. (For a while the server returned data very slowly; I don't know whether my access was being throttled, but it got better later.) I may do my own visual analysis of the comment data later, so stay tuned!
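
One simple way to go easy on the server is to pause between page requests. The sketch below is an assumption-laden illustration, not part of the original script: crawl_pages, fetch_page and the one-second default delay are all hypothetical, and the demo swaps in a fake sleep so it runs instantly.

```python
import time


def crawl_pages(pages, fetch_page, delay_seconds=1.0, sleep=time.sleep):
    """Fetch pages one by one, pausing delay_seconds between requests."""
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results.append(fetch_page(page))
    return results


# Demo with stand-ins: record the pauses instead of really sleeping.
pauses = []
fetched = crawl_pages(
    [1, 2, 3],
    fetch_page=lambda page: "page %d" % page,  # stand-in for a real download
    delay_seconds=0.5,
    sleep=pauses.append,
)
print(fetched)
print(pauses)
```

In a real crawl, fetch_page could wrap the get_params and get_json calls from the script, with the default time.sleep left in place.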

Appendix: Those heart-warming comments

[Figure: a selection of heart-warming comments]
