這篇文章主要跟大家介紹了利用python爬取散文網文章的相關資料,文中介紹的非常詳細,對大家具有一定的參考學習價值,需要的朋友們下面來一起看看吧。
本文主要介紹給大家的是python爬取散文網文章的相關內容,分享出來供大家參考學習,下面一起來看看詳細的介紹:
效果圖如下:
#設定python 2.7
# #
bs4 requests
sudo pip install bs4
sudo pip install requests
例如我們寫一個test.html 用來測試find跟find_all的差別。
內容是:
<html> <head> </head> <body> <p id="one"><a></a></p> <p id="two"><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >abc</a></p> <p id="three"><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >three a</a><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >three a</a><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >three a</a></p> <p id="four"><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >four<p>four p</p><p>four p</p><p>four p</p> a</a></p> </body> </html>
from bs4 import BeautifulSoup import lxml if __name__=='__main__': s = BeautifulSoup(open('test.html'),'lxml') print s.prettify() print "------------------------------" print s.find('p') print s.find_all('p') print "------------------------------" print s.find('p',id='one') print s.find_all('p',id='one') print "------------------------------" print s.find('p',id="two") print s.find_all('p',id="two") print "------------------------------" print s.find('p',id="three") print s.find_all('p',id="three") print "------------------------------" print s.find('p',id="four") print s.find_all('p',id="four") print "------------------------------"
def get_html(): url = "https://www.sanwen.net/" two_html = ['sanwen','shige','zawen','suibi','rizhi','novel'] for doc in two_html: i=1 if doc=='sanwen': print "running sanwen -----------------------------" if doc=='shige': print "running shige ------------------------------" if doc=='zawen': print 'running zawen -------------------------------' if doc=='suibi': print 'running suibi -------------------------------' if doc=='rizhi': print 'running ruzhi -------------------------------' if doc=='nove': print 'running xiaoxiaoshuo -------------------------' while(i<10): par = {'p':i} res = requests.get(url+doc+'/',params=par) if res.status_code==200: soup(res.text) i+=i
res.status_code不是200的進行處理,導致的問題是會不顯示錯誤,爬取的內容會有丟失。然後分析散文網的網頁,發現是www.sanwen.net/rizhi/&p=1
#
def soup(html_text): s = BeautifulSoup(html_text,'lxml') link = s.find('p',class_='categorylist').find_all('li') for i in link: if i!=s.find('li',class_='page'): title = i.find_all('a')[1] author = i.find_all('a')[2].text url = title.attrs['href'] sign = re.compile(r'(//)|/') match = sign.search(title.text) file_name = title.text if match: file_name = sign.sub('a',str(title.text))
def get_content(url): res = requests.get('https://www.sanwen.net'+url) if res.status_code==200: soup = BeautifulSoup(res.text,'lxml') contents = soup.find('p',class_='content').find_all('p') content = '' for i in contents: content+=i.text+'\n' return content
#
f = open(file_name+'.txt','w') print 'running w txt'+file_name+'.txt' f.write(title.text+'\n') f.write(author+'\n') content=get_content(url) f.write(content) f.close()
f = open(file_name+'.txt','w') print 'running w txt'+file_name+'.txt' f.write(title.text+'\n') f.write(author+'\n') content=get_content(url) f.write(content) f.close()
以上是python爬取文章實例教程的詳細內容。更多資訊請關注PHP中文網其他相關文章!