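Environment: Python 2.7, with the libraries
<code>bs4 requests</code>
Install both with pip:
sudo pip install bs4
sudo pip install requests
The code in this post parses with the lxml parser, so lxml has to be installed as well: sudo pip install lxml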
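A brief note on using bs4. Since the goal is scraping web pages, I will only cover find and find_all.
The two differ in what they return: find returns the first matching tag together with its contents, while find_all returns a list of every matching tag.
For example, write a test.html to see the difference between find and find_all. Its contents are: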
<html>
<head>
</head>
<body>
<div id="one"><a></a></div>
<div id="two"><a href="#">abc</a></div>
<div id="three"><a href="#">three a</a><a href="#">three a</a><a href="#">three a</a></div>
<div id="four"><a href="#">four<p>four p</p><p>four p</p><p>four p</p> a</a></div>
</body>
</html>
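from bs4 import BeautifulSoup

if __name__ == '__main__':
    # The 'lxml' parser needs the lxml package installed; bs4 loads it by name.
    with open('test.html') as f:
        s = BeautifulSoup(f, 'lxml')
    # print(...) with a single argument works on Python 2.7 and Python 3 alike.
    print(s.prettify())
    print("------------------------------")
    # No filter: find gives the first <div>, find_all gives a list of all four.
    print(s.find('div'))
    print(s.find_all('div'))
    print("------------------------------")
    # With an id filter: find gives one tag, find_all gives a one-element list.
    print(s.find('div', id='one'))
    print(s.find_all('div', id='one'))
    print("------------------------------")
    print(s.find('div', id="two"))
    print(s.find_all('div', id="two"))
    print("------------------------------")
    print(s.find('div', id="three"))
    print(s.find_all('div', id="three"))
    print("------------------------------")
    print(s.find('div', id="four"))
    print(s.find_all('div', id="four"))
    print("------------------------------")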
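So when using them, be clear about which result you need: find returns a single tag (or None when nothing matches), whereas find_all returns a list, and treating one as the other raises an error.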
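The kind of error meant here can be sketched as follows. This is a minimal illustration, not the author's code: the HTML string and the helper `link_texts` are made up for the example, and it uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet in the same shape as test.html.
HTML = (
    '<div id="two"><a href="#">abc</a></div>'
    '<div id="three"><a href="#">x</a><a href="#">y</a></div>'
)

soup = BeautifulSoup(HTML, "html.parser")

first_div = soup.find("div")                   # a single Tag: the first match
all_divs = soup.find_all("div")                # a list-like ResultSet of Tags
missing = soup.find("div", id="nope")          # None when nothing matches
missing_all = soup.find_all("div", id="nope")  # empty list when nothing matches


def link_texts(parent):
    """Return the text of every <a> under parent, or [] if parent is None.

    Calling parent.find_all(...) directly on a failed find() would raise
    AttributeError, because None has no find_all method.
    """
    if parent is None:
        return []
    return [a.get_text() for a in parent.find_all("a")]


print(link_texts(soup.find("div", id="three")))
print(link_texts(missing))
```

Checking for None after find, and indexing or looping after find_all, avoids most of these errors.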
Next comes fetching the pages with requests. I don't really understand why other people add headers and other extras.
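On the headers question: some sites respond differently to clients that do not identify themselves like a browser, which is a common reason scrapers set a User-Agent. The sketch below only prepares a request so the effect can be inspected without touching the network. `BASE_URL`, the header value and `build_request` are assumptions for illustration, since the post does not give the real URL.

```python
import requests

# Hypothetical placeholder; the real site URL is not given in the post.
BASE_URL = "https://example.com/"
# Assumed header value, chosen only for the example.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper)"}


def build_request(category, page):
    """Prepare (but do not send) a GET request for one listing page."""
    req = requests.Request(
        "GET",
        BASE_URL + category + "/",
        params={"p": page},   # becomes the ?p=... query string
        headers=HEADERS,
    )
    return req.prepare()


prepared = build_request("sanwen", 2)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

To actually send it, pass `headers=HEADERS` and a `timeout` to `requests.get`, or call `requests.Session().send(prepared, timeout=10)`.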
import requests


def get_html():
    url = ""  # base URL of the site; left empty in the original
    two_html = ['sanwen', 'shige', 'zawen', 'suibi', 'rizhi', 'novel']
    for doc in two_html:
        if doc == 'sanwen':
            print("running sanwen -----------------------------")
        elif doc == 'shige':
            print("running shige ------------------------------")
        elif doc == 'zawen':
            print('running zawen -------------------------------')
        elif doc == 'suibi':
            print('running suibi -------------------------------')
        elif doc == 'rizhi':
            print('running rizhi -------------------------------')
        elif doc == 'novel':  # the original compared against 'nove', which never matched
            print('running xiaoxiaoshuo -------------------------')
        # Pages 1 through 9. The original while loop advanced with `i += i`,
        # which doubles i and therefore requested only pages 1, 2, 4 and 8.
        for i in range(1, 10):
            par = {'p': i}
            res = requests.get(url + doc + '/', params=par)
            if res.status_code == 200:
                soup(res.text)
import io
import re

import requests
from bs4 import BeautifulSoup


def soup(html_text):
    s = BeautifulSoup(html_text, 'lxml')
    link = s.find('div', class_='categorylist').find_all('li')
    page_li = s.find('li', class_='page')  # the pagination entry, not an article
    for i in link:
        if i != page_li:
            title = i.find_all('a')[1]
            author = i.find_all('a')[2].text
            url = title.attrs['href']
            # Slashes cannot appear in a file name, so replace them with 'a'.
            sign = re.compile(r'(//)|/')
            file_name = sign.sub('a', title.text)
            print('running w txt' + file_name + '.txt')
            # io.open with an explicit encoding writes Chinese text as UTF-8
            # on Python 2.7 as well as Python 3.
            with io.open(file_name + '.txt', 'w', encoding='utf-8') as f:
                f.write(title.text + '\n')
                f.write(author + '\n')
                f.write(get_content(url))


def get_content(url):
    res = requests.get('' + url)  # site prefix left empty in the original
    content = u''  # returned unchanged if the page could not be fetched
    if res.status_code == 200:
        page = BeautifulSoup(res.text, 'lxml')
        contents = page.find('div', class_='content').find_all('p')
        for i in contents:
            content += i.text + '\n'
    return content
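Because the article title doubles as the file name, two articles with the same title end up in the same file and the later one overwrites the earlier. A minimal sketch of one way around that, assuming a hypothetical helper `safe_file_name` (not part of the original code) that replaces unsafe characters with an underscore and numbers repeated names:

```python
import re

# Characters that are not allowed in file names on common platforms.
_UNSAFE = re.compile(r'[\\/:*?"<>|]+')


def safe_file_name(title, used=None):
    """Return a file-system-safe name derived from title.

    If `used` (a set of names already handed out) is given, a numeric
    suffix is appended until the name is unique, and the result is
    recorded in `used`.
    """
    name = _UNSAFE.sub("_", title).strip() or "untitled"
    if used is not None:
        base, n = name, 1
        while name in used:
            n += 1
            name = "%s_%d" % (base, n)
        used.add(name)
    return name


seen = set()
print(safe_file_name("rain/night", seen))
print(safe_file_name("rain/night", seen))
```

Keeping one `seen` set for the whole run means every article gets its own file even when titles repeat.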
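These three functions fetch the articles from the prose site 散文網, but there is a problem: for some reason some articles go missing. I could only get a little over 400 articles, far fewer than the site actually has, even though I fetched them page by page. I hope someone more experienced can take a look. I probably ought to handle pages that cannot be reached; of course, I also suspect the terrible network in my dormitory has something to do with it.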
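Handling pages that cannot be reached could look roughly like this. It is a sketch under assumptions: `fetch_with_retry` and its parameters are hypothetical, and the fetching function is passed in as an argument so the logic can be tried without a network connection.

```python
import time


def fetch_with_retry(get, url, retries=3, delay=1.0, **kwargs):
    """Call get(url, **kwargs) until it returns status 200 or retries run out.

    `get` is any callable shaped like requests.get. Returns the response,
    or None if every attempt failed, so the caller can record the miss.
    """
    for attempt in range(1, retries + 1):
        try:
            res = get(url, **kwargs)
            if res.status_code == 200:
                return res
            print("attempt %d: status %s for %s" % (attempt, res.status_code, url))
        except Exception as exc:  # network errors surface as exceptions
            print("attempt %d: %s for %s" % (attempt, exc, url))
        if attempt < retries:
            time.sleep(delay)  # brief pause before trying again
    return None
```

In the scraper this would be called as `fetch_with_retry(requests.get, url + doc + '/', params=par, timeout=10)`. Logging every page that comes back as None shows whether the missing articles are really caused by the network.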
I almost forgot the screenshot of the result. # The code may be messy, but I have never stopped moving forward.