python - Crawling People's Daily commentator articles, ran into a problem, asking for help.
PHPz 2017-04-17 17:58:53

I'm a beginner learning Python web scraping, running Python 3.4. I want to crawl the commentator articles from People's Daily; so far I have only crawled a single page. The code is as follows:

import requests
from bs4 import BeautifulSoup
import re

myUrl = "http://cpc.people.com.cn/pinglun/n1/201/0613/c78779-28428425.html"
response = requests.get(myUrl)
soup = BeautifulSoup(response.text, "lxml", from_encoding="gbk")
print(soup.title.string.encode('ISO-8859-1').decode('gbk'))

for a in soup.find_all(style="text-indent: 2em;"):
    print(a.string.encode('ISO-8859-1').decode('gbk'))

The source code on the page that triggers the error is:
<span style="text-indent: 2em; display: block;" id="paper_num">《 人民日报 》( 2016年06月13日 01 版)</span>
The error message I get is:
Traceback (most recent call last):
  File "pa_chong_lx.py", line 21, in <module>
    print(a.string.encode('ISO-8859-1').decode('gbk'))
AttributeError: 'NoneType' object has no attribute 'encode'
My analysis:
The keyword I search for is style="text-indent: 2em;. The snippet <span style="text-indent: 2em; display: block;" id="paper_num">《 人民日报 》( 2016年06月13日 01 版)</span> is formatted differently from the article-body markup before it, so the code errors out. How should I change it?

As a newbie I've been stuck on encoding problems for a long time; it feels like one pit after another! Python is simple, but precisely because it is simple, I don't know where I went wrong, or I know what the error is but not how to fix it.


All replies (3)
Peter_Zhu

The link in the original code is no longer valid, so I took the article at http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html as an example.

Working code:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# @Last Modified time: 2016-06-30 12:32:52

import requests
from bs4 import BeautifulSoup


myUrl = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
response = requests.get(myUrl)

# Re-decode the body with the encoding requests detected from the content
response.encoding = response.apparent_encoding

soup = BeautifulSoup(response.text, "lxml")
print(soup.title.string)

for a in soup.find_all(style="text-indent: 2em;"):
    if a.string:  # skip tags whose .string is None
        print(a.string)

Run result: (screenshot omitted; the page title and each indented paragraph are printed as readable Chinese)

The encoding problem you ran into here is very common. Simply put, requests guessed the wrong encoding for the page.

After requests receives a response, it decodes the body according to the encoding declared in the response headers. If the headers don't specify an encoding, the default is ISO-8859-1 (the encoding attribute). Fortunately, requests can also guess the encoding from the content itself; the guess is stored in the apparent_encoding attribute, which for this People's Daily page is GB2312. So you only need to set encoding = apparent_encoding before reading .text, and the text will be decoded correctly. (Note that apparent_encoding is not guaranteed to be 100% correct.)
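A quick way to see this (assuming the example article above is still online) is to print both attributes and compare them before reassigning the encoding:

import requests

url = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
resp = requests.get(url)

# Encoding taken from the HTTP headers; with no charset in the headers,
# requests falls back to ISO-8859-1 for text/* responses
print(resp.encoding)

# Encoding guessed from the body itself; for this page it should be GB2312
print(resp.apparent_encoding)

# Re-decode with the guessed encoding, then .text is readable Chinese
resp.encoding = resp.apparent_encoding
print(resp.text[:200])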

For the requests side, see the Response Content section of the requests documentation.
For background on character encodings, see "Human-Computer Interaction: Character Encoding" and "Five Minutes to Defeat Python Character Encoding".
For details on how requests detects encoding, see "Python + Requests encoding issues".

Encoding is indeed a pitfall, but once you figure it out, it's easy to avoid.

大家讲道理

Find a common element, then use a regular expression to filter out the data.
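A minimal sketch of that idea, reusing the example URL from the answer above (the exact pattern is an assumption; adjust it to whatever common element you find): a compiled regex as the style value lets find_all() match both the plain paragraphs and the paper_num span, and get_text() never returns None.

import re

import requests
from bs4 import BeautifulSoup

url = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
resp = requests.get(url)
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, "lxml")

# The regex matches both "text-indent: 2em;" and
# "text-indent: 2em; display: block;" style attributes
for tag in soup.find_all(style=re.compile(r"text-indent:\s*2em")):
    text = tag.get_text(strip=True)  # unlike .string, this is never None
    if text:
        print(text)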

伊谢尔伦

The reason for the error is that a NoneType object has no encode attribute, which means the parameters in the brackets of soup.find_all() did not match what you expected. You can try matching the tag first and then the style; that may reveal the cause, and it can't be done with regular expressions alone.
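A minimal sketch of that suggestion, again against the example URL from the first answer (the span tag name and the substring test are assumptions about the page structure): match the tag first, inspect its style attribute yourself, and skip elements whose .string is None.

import requests
from bs4 import BeautifulSoup

url = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
resp = requests.get(url)
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, "lxml")

# Match the tag name first, then check the style attribute by hand
for span in soup.find_all("span"):
    style = span.get("style", "")
    if "text-indent: 2em" in style and span.string is not None:
        print(span.string)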
