
The Simplest Python Web Crawler Tutorial

黄舟

Published: 2017-08-13 10:41:39

Original


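While browsing the web we often come across attractive pictures that we would like to save, whether to use as desktop wallpaper or as design material. This article shows how to implement the simplest possible web crawler in Python. Readers who need it can use it as a reference, so let's take a look.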

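Preface

A web crawler (also called a web spider or web robot, and more often a web chaser within the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. I have recently become very interested in Python crawlers, so I am sharing my learning path here. Suggestions are welcome; let's learn from each other and improve together. Without further ado, here are the details: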

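1. Development tools

The tool I use is Sublime Text 3. It is small and punchy (perhaps not a phrase men like to hear), and that has me hooked. I recommend it, although if your machine has decent specs, PyCharm may suit you better.

For setting up a Python development environment in Sublime Text 3, I recommend this article:

[Setting up a Python development environment in Sublime]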

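As the name suggests, a crawler crawls like a bug across the great web that is the Internet, and that is how we collect the things we want.

Since we are crawling the Internet, we need to understand the URL, whose formal name is "Uniform Resource Locator" and whose nickname is "link". It consists mainly of three parts:

(1) Protocol: for example the HTTP protocol commonly seen in web addresses.

(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.

(3) Path: a directory, a file, and so on.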
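
The three parts listed above can be pulled out of a URL with the standard library's urllib.parse module. The following is a minimal sketch; the helper name url_parts and the /img/logo.png path are made up for illustration and do not come from the original article.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into the three parts described above (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,  # e.g. "http"
        "host": parts.hostname,    # domain name or IP address, without the port
        "path": parts.path,        # directory or file on the server
    }


# The path below is an assumed example, not a real resource.
print(url_parts("http://www.baidu.com/img/logo.png"))
```

The same call also works when the host is an IP address with a port, in which case hostname holds only the address.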

3. Building the simplest crawler with urllib

(1) Introduction to urllib

Module              Description
urllib.error        Exception classes raised by urllib.request.
urllib.parse        Parse URLs into or assemble them from components.
urllib.request      Extensible library for opening URLs.
urllib.response     Response classes used by urllib.
urllib.robotparser  Load a robots.txt file and answer questions about fetchability of other URLs.
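
Of the modules in the table, urllib.robotparser is the one a crawler can consult before fetching anything. The sketch below feeds it a hand-written robots.txt so that it runs without network access; the rules, the example.com URLs, and the helper name allowed are all assumptions made for illustration.

```python
from urllib import robotparser

# Hypothetical robots.txt content, supplied inline so no network access is needed.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]


def allowed(url, user_agent="*"):
    """Return True if the sample rules above permit fetching url."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_LINES)  # parse() accepts an iterable of lines
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

For a real site, the parser would normally be pointed at the site's own robots.txt with set_url() and read() instead of inline lines.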

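(2) Writing the simplest crawler

The Baidu home page is clean and simple, which makes it a good target for our crawler.

The crawler code is as follows: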

from urllib import request

def visit_baidu():
    URL = "http://www.baidu.com"
    # open the URL
    req = request.urlopen(URL)
    # read the response body as bytes
    html = req.read()
    # decode the bytes as UTF-8 text
    html = html.decode("utf-8")
    print(html)

if __name__ == '__main__':
    visit_baidu()


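Running it prints the HTML source of the Baidu home page.

To compare this with the live page, right-click an empty area of the Baidu home page and choose Inspect Element.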
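
The first crawler hard-codes UTF-8 when decoding. Not every page uses that encoding, so one possible refinement is to read the charset from the Content-Type response header and fall back to UTF-8 only when none is declared. This is a sketch under assumptions: pick_charset is a hypothetical helper, and the headers are built by hand here to stand in for the headers attribute of the object returned by urlopen, which offers the same get_content_charset() method.

```python
from email.message import Message


def pick_charset(headers, default="utf-8"):
    """Return the charset declared in Content-Type, or a default (hypothetical helper)."""
    return headers.get_content_charset() or default


# Hand-built headers for illustration; a real crawler would pass req.headers.
headers = Message()
headers["Content-Type"] = "text/html; charset=GBK"
print(pick_charset(headers))

# With no Content-Type header at all, the default is used.
print(pick_charset(Message()))
```

In the crawler this would be used as html.decode(pick_charset(req.headers)) in place of the fixed encoding name.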

Of course, the request module can also create a Request object, which can then be opened with the urlopen method.

The code is as follows:

from urllib import request

def visit_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the Request object
    response = request.urlopen(req)
    # read the response body and decode it as UTF-8 text
    html = response.read()
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visit_baidu()


The output is the same as before.
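
One reason to build a Request object first is that it can carry extra information, such as request headers, before anything is sent. The sketch below only constructs the object and inspects it, so it runs offline; the helper name build_request and the user agent string my-crawler/0.1 are assumptions for illustration.

```python
from urllib import request


def build_request(url, user_agent="my-crawler/0.1"):
    """Create a Request carrying a custom User-Agent header; nothing is sent yet."""
    return request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.baidu.com")
print(req.full_url)
print(req.get_method())
# Request stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))
```

Passing this object to request.urlopen() would then send the request with the header attached.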

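(3) Error handling

Error handling relies on the urllib.error module. The main exceptions are URLError and HTTPError. HTTPError is a subclass of URLError, so an HTTPError can also be caught as a URLError.

An HTTPError carries the HTTP status code in its code attribute.

The code for handling HTTPError is as follows: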

from urllib import request
from urllib import error

def Err():
    # a page that does not exist on the server
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # print the HTTP status code
        print(e.code)

if __name__ == '__main__':
    Err()


Running it prints 404, which is the error code. For details on what it means, you can look up HTTP status codes yourself.
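
The relationship between the two exception classes can be checked without touching the network by constructing an HTTPError by hand. This is only an illustration: in a crawler the exception is raised by urlopen, and the example.invalid URL is a placeholder.

```python
from urllib import error

# Built by hand for illustration; urlopen normally raises this for you.
err = error.HTTPError("https://example.invalid/zzz", 404, "Not Found", None, None)

print(issubclass(error.HTTPError, error.URLError))  # the subclass relationship
print(err.code)    # numeric HTTP status code
print(err.reason)  # short text that accompanies the status code
```

Because HTTPError inherits from URLError, it has a reason attribute as well as a code attribute.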

A URLError describes what went wrong in its reason attribute. The code for handling URLError is as follows:

from urllib import request
from urllib import error

def Err():
    # a host name that is expected to fail
    url = "https://segmentf.com/"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        # print why the request failed
        print(e.reason)

if __name__ == '__main__':
    Err()


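Running it prints the reason for the failure.

Since the point is to handle errors, it is best to write both handlers into the code; the more specific the handling, the clearer the output. Note that HTTPError is a subclass of URLError, so the HTTPError handler must come before the URLError handler. Otherwise the URLError handler catches everything, and a 404, for example, is printed as Not Found.

The code is as follows: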

from urllib import request
from urllib import error

# Approach 1: handle both HTTPError and URLError
def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # the subclass handler comes first
        print(e.code)
    except error.URLError as e:
        # the parent class handler comes second
        print(e.reason)

if __name__ == '__main__':
    Err()
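
The effect of handler order can be seen offline by raising hand-built exceptions through both orderings. The function names describe and describe_wrong_order, the example.invalid URL, and the message strings are assumptions made for this sketch.

```python
from urllib import error


def describe(exc):
    """Correct order: the HTTPError handler comes before the URLError handler."""
    try:
        raise exc
    except error.HTTPError as e:
        return "HTTPError {}".format(e.code)
    except error.URLError as e:
        return "URLError {}".format(e.reason)


def describe_wrong_order(exc):
    """Wrong order: the URLError handler also catches every HTTPError."""
    try:
        raise exc
    except error.URLError as e:
        return "URLError {}".format(e.reason)
    except error.HTTPError as e:  # never reached
        return "HTTPError {}".format(e.code)


def make_404():
    # Hand-built exception standing in for what urlopen would raise.
    return error.HTTPError("https://example.invalid/zzz", 404, "Not Found", None, None)


print(describe(make_404()))
print(describe_wrong_order(make_404()))
print(describe(error.URLError("host not found")))
```

With the wrong order, the 404 is reported through reason as Not Found and the status code is never printed.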


You can change the url to see how different errors are printed.

Summary

That covers the details of the simplest Python web crawler tutorial.