python - Crawling People's Daily commentator articles, ran into a problem, asking for help.
PHPz 2017-04-17 17:58:53

I'm a beginner learning Python web scraping, running Python 3.4. I want to crawl the commentator articles from People's Daily; so far I have only crawled a single page. The code is as follows:

import requests
from bs4 import BeautifulSoup
import re

myUrl = "http://cpc.people.com.cn/pinglun/n1/201/0613/c78779-28428425.html"
response = requests.get(myUrl)
soup = BeautifulSoup(response.text, "lxml", from_encoding="gbk")
print(soup.title.string.encode('ISO-8859-1').decode('gbk'))

for a in soup.find_all(style="text-indent: 2em;"):
    print(a.string.encode('ISO-8859-1').decode('gbk'))

The source code on the page that triggers the error is:
<span style="text-indent: 2em; display: block;" id="paper_num">《 人民日报 》( 2016年06月13日 01 版)</span>
The error message I get is:
Traceback (most recent call last):
  File "pa_chong_lx.py", line 21, in <module>
    print(a.string.encode('ISO-8859-1').decode('gbk'))
AttributeError: 'NoneType' object has no attribute 'encode'
My analysis:
The keyword I search for is style="text-indent: 2em;. The snippet <span style="text-indent: 2em; display: block;" id="paper_num">《 人民日报 》( 2016年06月13日 01 版)</span> is formatted differently from the article-body markup before it, so the code errors out. How should I change it?

As a newbie I've been stuck on encoding problems for a long time; it feels like one pit after another! Python is simple, but precisely because it is simple, I don't know where I went wrong, or I know what the error is but not how to fix it.


All replies (3)
Peter_Zhu

The link in the original code is no longer valid, so I took the article at http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html as an example.

Working code:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# @Last Modified time: 2016-06-30 12:32:52

import requests
from bs4 import BeautifulSoup


myUrl = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
response = requests.get(myUrl)

# Re-decode the body with the encoding requests detected from the content
response.encoding = response.apparent_encoding

soup = BeautifulSoup(response.text, "lxml")
print(soup.title.string)

for a in soup.find_all(style="text-indent: 2em;"):
    if a.string:  # skip tags whose .string is None
        print(a.string)

Run result: (screenshot omitted; the page title and each indented paragraph are printed as readable Chinese)

The encoding problem you ran into here is very common. Simply put, requests guessed the wrong encoding for the page.

After requests receives a response, it decodes the body according to the encoding declared in the response headers. If the headers don't specify an encoding, the default is ISO-8859-1 (the encoding attribute). Fortunately, requests can also guess the encoding from the content itself; the guess is stored in the apparent_encoding attribute, which for this People's Daily page is GB2312. So you only need to set encoding = apparent_encoding before reading .text, and the text will be decoded correctly. (Note that apparent_encoding is not guaranteed to be 100% correct.)
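A quick way to see this (assuming the example article above is still online) is to print both attributes and compare them before reassigning the encoding:

import requests

url = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
resp = requests.get(url)

# Encoding taken from the HTTP headers; with no charset in the headers,
# requests falls back to ISO-8859-1 for text/* responses
print(resp.encoding)

# Encoding guessed from the body itself; for this page it should be GB2312
print(resp.apparent_encoding)

# Re-decode with the guessed encoding, then .text is readable Chinese
resp.encoding = resp.apparent_encoding
print(resp.text[:200])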

For the requests side, see the Response Content section of the requests documentation.
For background on character encodings, see "Human-Computer Interaction: Character Encoding" and "Five Minutes to Defeat Python Character Encoding".
For details on how requests detects encoding, see "Python + Requests encoding issues".

Encoding is indeed a pitfall, but once you figure it out, it's easy to avoid.

大家讲道理

Find a common element, then use a regular expression to filter out the data.
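A minimal sketch of that idea, reusing the example URL from the answer above (the exact pattern is an assumption; adjust it to whatever common element you find): a compiled regex as the style value lets find_all() match both the plain paragraphs and the paper_num span, and get_text() never returns None.

import re

import requests
from bs4 import BeautifulSoup

url = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
resp = requests.get(url)
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, "lxml")

# The regex matches both "text-indent: 2em;" and
# "text-indent: 2em; display: block;" style attributes
for tag in soup.find_all(style=re.compile(r"text-indent:\s*2em")):
    text = tag.get_text(strip=True)  # unlike .string, this is never None
    if text:
        print(text)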

伊谢尔伦

The reason for the error is that a NoneType object has no encode attribute, which means the parameters in the brackets of soup.find_all() did not match what you expected. You can try matching the tag first and then the style; that may reveal the cause, and it can't be done with regular expressions alone.
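A minimal sketch of that suggestion, again against the example URL from the first answer (the span tag name and the substring test are assumptions about the page structure): match the tag first, inspect its style attribute yourself, and skip elements whose .string is None.

import requests
from bs4 import BeautifulSoup

url = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
resp = requests.get(url)
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, "lxml")

# Match the tag name first, then check the style attribute by hand
for span in soup.find_all("span"):
    style = span.get("style", "")
    if "text-indent: 2em" in style and span.string is not None:
        print(span.string)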
