I'm a beginner learning Python web scraping (environment: Python 3.4). I want to scrape People's Daily commentator articles, and so far I have only scraped a single page. The code is as follows:
import requests
from bs4 import BeautifulSoup
import re
myUrl = "http://cpc.people.com.cn/pinglun/n1/201/0613/c78779-28428425.html"
response = requests.get(myUrl)
soup = BeautifulSoup(response.text, "lxml", from_encoding="gbk")
print(soup.title.string.encode('ISO-8859-1').decode('gbk'))
for a in soup.find_all(style="text-indent: 2em;"):
    print(a.string.encode('ISO-8859-1').decode('gbk'))
The HTML source on the page that triggers the error is:
<span style="text-indent: 2em; display: block;" id="paper_num">《 人民日报 》( 2016年06月13日 01 版)</span>
The error message I get is:
Traceback (most recent call last):
  File "pa_chong_lx.py", line 21, in <module>
    print(a.string.encode('ISO-8859-1').decode('gbk'))
AttributeError: 'NoneType' object has no attribute 'encode'
Cause analysis:
The keyword I search for is style="text-indent: 2em;. The element <span style="text-indent: 2em; display: block;" id="paper_num">《 人民日报 》( 2016年06月13日 01 版)</span> has a different format from the main article paragraphs before it, so the code fails. How should I change it?
I'm new to this and have been stuck on encoding problems for a long time; it feels like one pitfall after another. Python is simple, but precisely because it is simple, I don't know where I went wrong, or I know what the error is but not how to fix it.
The link in the original code is no longer valid, so I took the article at http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html as an example.
Working code:
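(A minimal sketch, reconstructed from the explanation below; it keeps the selector from the question, switches to apparent_encoding for decoding, and uses get_text() instead of .string so nested tags do not raise the NoneType error.)

import requests
from bs4 import BeautifulSoup

myUrl = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
response = requests.get(myUrl)
# The response headers carry no charset, so requests falls back to ISO-8859-1;
# override it with the encoding guessed from the content (GB2312 for this page).
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")

print(soup.title.string)
for a in soup.find_all(style="text-indent: 2em;"):
    # get_text() never returns None, unlike .string on tags that have nested children
    print(a.get_text())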
Run result:
The encoding problem you hit here is very common. Simply put, requests guessed the wrong encoding for the web page.
After requests obtains a response, it decodes the body according to the encoding given in the response headers. If the headers do not specify an encoding, it defaults to ISO-8859-1 (exposed as the encoding attribute). Fortunately, requests can also guess the encoding from the page content, and stores that guess in the apparent_encoding attribute; for this People's Daily commentary page it is GB2312. So you only need to set encoding = apparent_encoding before reading text to get a correctly decoded result. (Note that apparent_encoding is not guaranteed to be 100% correct.)
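A quick way to see the mismatch is to print both attributes (a sketch; the exact values depend on what the server returns):

import requests

myUrl = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
response = requests.get(myUrl)
print(response.encoding)            # e.g. ISO-8859-1 when the Content-Type header has no charset
print(response.apparent_encoding)   # e.g. GB2312, guessed from the page content
response.encoding = response.apparent_encoding  # apply the guess before reading response.text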
For the relevant part of the Requests documentation, see Response Content.
For background on encodings, you can refer to: Human-Computer Interaction: Character Encoding and Five Minutes to Defeat Python Character Encoding.
For details on how requests determines the encoding, see Python + Requests encoding issues.
Encoding is indeed a pitfall, but once you figure it out, it is easy to avoid.
Find a common element, then use a regular expression to filter the data, as in the sketch below.
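One way to read this suggestion (a sketch, not the answerer's code): let BeautifulSoup match the style attribute against a regular expression so both style variants are caught, then filter out the paper_num span:

import re
import requests
from bs4 import BeautifulSoup

response = requests.get("http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html")
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")

# The regex matches both "text-indent: 2em;" and "text-indent: 2em; display: block;"
for tag in soup.find_all(style=re.compile(r"text-indent:\s*2em")):
    if tag.get("id") == "paper_num":   # skip the 《 人民日报 》 edition line
        continue
    print(tag.get_text())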
The reason for the error is that NoneType has no encode attribute, which means the arguments inside soup.find_all() did not match what you intended. You can try matching the tag first and then the style; that may reveal the cause. If that does not work, fall back to a regular expression.
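A sketch of "match the tag first, and then the style" (the <p> tag is an assumption about how the article body is marked up):

import requests
from bs4 import BeautifulSoup

response = requests.get("http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html")
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")

# Match the tag first, then check its style attribute.
for p in soup.find_all("p"):
    if p.get("style", "").startswith("text-indent: 2em"):
        # .string is None when a tag contains nested tags; get_text() works either way.
        print(p.get_text())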