Python Crawler (Beta): Scraping a Single Zhihu Page

高洛峰

Dec 02, 2016 pm 04:51 PM

python

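I previously wrote a Python crawler to help the operations team scrape product brands and categories from JD.com. This time I am again using Python, for a simple scraper that handles a single Zhihu question page. More will be added later.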

# -*- coding: utf-8 -*-
# Written for Python 3 (the original post used Python 2 print statements).
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# ------ Collect Zhihu answers ------

# Fetch a page and return the contents of its <body>
def get_content(url):
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',  # 'sdch' dropped: requests cannot decode it
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }

    req = requests.get(url, headers=header)
    req.encoding = 'utf-8'
    bs = BeautifulSoup(req.text, "html.parser")  # build the BeautifulSoup object
    body = bs.body  # keep only the <body> part
    return body

# Get the question title
def get_title(html_text):
    data = html_text.find('span', {'class': 'zm-editable-content'})
    return data.string

# Print the question body
def get_question_content(html_text):
    data = html_text.find('div', {'class': 'zm-editable-content'})
    if data.string is None:
        # .string is None when the tag has several children, so join every text node
        out = ''
        for datastring in data.strings:
            out = out + datastring
        print('Content:\n' + out)
    else:
        print('Content:\n' + data.string)

# Print the upvote count
def get_answer_agree(body):
    agree = body.find('span', {'class': 'count'})
    print('Upvotes: ' + agree.string + '\n')

# Print every answer by following the link to each answer's own page
def get_response(html_text):
    response = html_text.find_all('div', {'class': 'zh-summary summary clearfix'})
    for summary in response:
        # The "expand" link that points to the full answer
        answerhref = summary.find('a', {'class': 'toggle-expand'})
        if not answerhref['href'].startswith('javascript'):
            # urljoin avoids a doubled slash when href already starts with '/'
            url = urljoin('http://www.zhihu.com/', answerhref['href'])
            print(url)
            body = get_content(url)
            get_answer_agree(body)
            answer = body.find('div', {'class': 'zm-editable-content clearfix'})
            if answer.string is None:
                out = ''
                for datastring in answer.strings:
                    out = out + '\n' + datastring
                print(out)
            else:
                print(answer.string)


html_text = get_content('https://www.zhihu.com/question/43879769')
title = get_title(html_text)
print("Title:\n" + title + '\n')
get_question_content(html_text)
print('\n')
get_response(html_text)

Output:

[Screenshot of the script's output; image not preserved]
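
What the script prints depends on Zhihu's live markup, which may no longer match the class names used in the script. The text-extraction step can still be tried offline. The sketch below runs on a hypothetical HTML snippet (only the class name zm-editable-content comes from the script) and shows why the script checks whether .string is None before falling back to .strings. The helper tag_text is an illustrative name, not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; only the class name is taken from the script.
html = """
<div class="zm-editable-content">plain text only</div>
<div class="zm-editable-content">first part<br/>second part</div>
"""

soup = BeautifulSoup(html, "html.parser")
single, mixed = soup.find_all("div", {"class": "zm-editable-content"})


def tag_text(tag):
    """Return tag.string when it exists, otherwise join every text node."""
    if tag.string is not None:
        return tag.string
    return "".join(tag.strings)


print(repr(single.string))  # the only child is a text node, so .string is set
print(repr(mixed.string))   # several children, so .string is None
print(tag_text(mixed))      # falls back to joining .strings
```

A tag whose only child is a text node exposes that text through .string. Once the tag contains other elements, such as a line break, .string is None and the text has to be collected from .strings instead.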
