python - Scraping People's Daily commentator articles: I ran into a problem and need help.

Question

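I'm a beginner learning to write web crawlers in Python (environment: Python 3.4). I want to scrape the commentator articles from People's Daily, and so far I have only scraped a single page. The code is as follows: {code...} The part of the page source where the error occurs is: <span style="text-indent: 2em; display: block;" id="paper_nu...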

天蓬老师 · Answer

The link in the original code no longer works, so I use the article at http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html as the example.

Working code:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# @Last Modified time: 2016-06-30 12:32:52

import requests
from bs4 import BeautifulSoup


myUrl = "http://cpc.people.com.cn/n1/2016/0628/c404684-28502214.html"
response = requests.get(myUrl)

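# requests guessed the wrong encoding; use the one detected from the content
response.encoding = response.apparent_encoding

# name the parser explicitly instead of letting BeautifulSoup pick one
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)

# .string is None when a tag has several children; skip those tags
for a in soup.find_all(style="text-indent: 2em;"):
    if a.string:
        print(a.string)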


The encoding problem here is very common. Put simply, requests guessed the page's encoding incorrectly.

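After requests receives a response, it decodes the body using the encoding given in the headers. If the Content-Type header is a text type but does not specify a charset, the encoding attribute defaults to ISO-8859-1. Fortunately, requests can also guess the encoding from the content itself, and the guess is stored in the apparent_encoding attribute; for the People's Daily commentary page it is GB2312. So you only need to set encoding = apparent_encoding and then read text to get a correctly decoded result. (Note that apparent_encoding is not guaranteed to be 100% correct.)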
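To make the mechanism concrete, here is a minimal sketch that reproduces the symptom without any network access. It uses only the standard library; the sample string is an assumption chosen for illustration, not text taken from the page.

```python
# Sample text (an assumption for illustration) encoded the way the
# People's Daily page is served: as GB2312 bytes.
original = "人民日报评论员文章"
raw = original.encode("gb2312")

# Decoding with the ISO-8859-1 fallback never raises, because every byte
# maps to some character, but the result is unreadable mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding that matches the content recovers the text.
right = raw.decode("gb2312")

# Text that was already decoded with the wrong codec can be repaired by
# reversing that step and decoding again with the right one.
repaired = wrong.encode("iso-8859-1").decode("gb2312")

print(repr(wrong))
print(right)
print(right == original, repaired == original)
```

Setting response.encoding before reading response.text has the same effect as choosing the right codec in the second decode call above.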

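For the relevant requests documentation, see Response Content.
To understand character encodings, see: Character Encoding in Human-Computer Interaction and Conquer Python Character Encoding in Five Minutes.
For details on how requests resolves encodings, see Python + Requests Encoding Issues.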

Encoding really is a pitfall, but once you understand it, it is easy to avoid.

大家讲道理 · Answer

Find an element that the paragraphs have in common, then filter the data with a regular expression.
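A rough sketch of that idea, assuming the article paragraphs share a style attribute containing text-indent: 2em. The sample HTML and the names PARAGRAPH_RE and TAG_RE are made up for illustration, and a regular expression like this is fragile on real-world markup.

```python
import re

# Hypothetical sample markup; the real page will differ.
sample_html = (
    '<p style="text-indent: 2em;">First paragraph.</p>'
    '<p style="text-indent: 2em; display: block;">Second <b>bold</b> paragraph.</p>'
    '<p class="footer">Not part of the article.</p>'
)

# Match <p> tags whose style attribute mentions "text-indent: 2em"
# and capture everything up to the closing </p>.
PARAGRAPH_RE = re.compile(
    r'<p\b[^>]*style="[^"]*text-indent:\s*2em[^"]*"[^>]*>(.*?)</p>',
    re.S,
)
# Strip any inline tags left inside a captured paragraph.
TAG_RE = re.compile(r"<[^>]+>")

paragraphs = [
    TAG_RE.sub("", body).strip()
    for body in PARAGRAPH_RE.findall(sample_html)
]
print(paragraphs)
```

Matching on a substring of the style value, rather than the whole value, keeps the pattern working when extra declarations such as display: block are present.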

伊谢尔伦 · Answer

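The error says that NoneType has no attribute encode, which suggests that soup.find_all() did not match the arguments in the parentheses. (Strictly speaking, find_all() returns an empty list when nothing matches; a None value more typically comes from find() or from .string on a tag with several children.) Try matching the tag first and then the style, which may reveal the cause. If that really doesn't work, fall back to a regular expression.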
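The "match the tag first, then the style" suggestion could look roughly like the sketch below. To keep it free of third-party dependencies it uses the standard library's html.parser rather than BeautifulSoup, so it illustrates the idea and is not the code from the question. The class name, the sample markup, and the id value "demo" are hypothetical.

```python
from html.parser import HTMLParser


class IndentedTextCollector(HTMLParser):
    """Collect the text of tags whose style contains a given fragment."""

    def __init__(self, tag_names=("span", "p"), style_fragment="text-indent: 2em"):
        super().__init__()
        self.tag_names = tag_names
        self.style_fragment = style_fragment
        self.open_tag = None   # name of the matching tag we are inside, if any
        self.nested = 0        # same-named tags opened inside the match
        self.current = []      # text pieces of the current match
        self.blocks = []       # finished text blocks

    def handle_starttag(self, tag, attrs):
        if self.open_tag is None:
            # Step 1: check the tag name. Step 2: check the style substring.
            style = dict(attrs).get("style") or ""
            if tag in self.tag_names and self.style_fragment in style:
                self.open_tag = tag
        elif tag == self.open_tag:
            self.nested += 1

    def handle_endtag(self, tag):
        if tag != self.open_tag:
            return
        if self.nested:
            self.nested -= 1
            return
        # Closing the matched tag: join all text, including nested tags' text.
        self.blocks.append("".join(self.current).strip())
        self.current = []
        self.open_tag = None

    def handle_data(self, data):
        if self.open_tag is not None:
            self.current.append(data)


# Hypothetical sample markup for illustration only.
sample = (
    '<span style="text-indent: 2em; display: block;" id="demo">Plain text</span>'
    '<span style="text-indent: 2em;">Text with <b>nested</b> tag</span>'
    '<span>no style</span>'
)
collector = IndentedTextCollector()
collector.feed(sample)
collector.close()
print(collector.blocks)
```

Because the text of nested tags is joined rather than read through a single-child shortcut, a paragraph containing inline markup still yields a string instead of None.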