Python Data Scraping: Using BeautifulSoup

巴扎黑
Published: 2017-07-17 15:53:31

Python Web Scraping 1: Using BeautifulSoup

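Source: Ryan Mitchell (US), Web Scraping with Python (《Python网络数据采集》). The examples are copied straight from the book, but typing them out still seemed worthwhile, so they are recorded here.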

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(res.text, 'lxml')
# Print the first <h1> tag in the document
print(soup.h1)
<h1>An Interesting Title</h1>

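Fetching a page with urllib looks like this. read() returns bytes, which have to be decoded into UTF-8 text, for example a.read().decode('utf-8'). When parsing with bs4, however, the response object returned by urllib can be passed in directly.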

import urllib.request
from bs4 import BeautifulSoup

a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html')
# The response object is passed to BeautifulSoup as-is, without decoding
soup = BeautifulSoup(a, 'lxml')
print(soup.h1)
<h1>An Interesting Title</h1>
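
The difference between decoding the bytes yourself and passing the response object can be tried offline. This is only a minimal sketch: an in-memory `io.BytesIO` stands in for the object returned by `urlopen`, the HTML snippet is made up for illustration, and `html.parser` is used instead of `lxml` so that no extra parser is required.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical page content; stands in for the bytes a server would send.
raw = b"<html><body><h1>An Interesting Title</h1></body></html>"

# Stand-in for the response: a file-like object whose read() returns bytes.
fake_response = io.BytesIO(raw)

# Option 1: decode the bytes into text first, then parse the text.
soup_from_text = BeautifulSoup(raw.decode("utf-8"), "html.parser")

# Option 2: hand the file-like object straight to BeautifulSoup.
soup_from_file = BeautifulSoup(fake_response, "html.parser")

print(soup_from_text.h1)
print(soup_from_file.h1)
```

Both calls should yield the same `<h1>` tag, which is why the explicit decode step can be skipped when the response goes directly into the parser.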

Grab every span tag whose CSS class attribute is green. These are the names of people.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html')

soup = BeautifulSoup(res.text, 'lxml')
# class_ (with a trailing underscore) filters on the CSS class attribute
green_names = soup.find_all('span', class_='green')
for name in green_names:
    print(name.string)

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
...
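
Some names in the output above span two lines (for example "Anna" followed by "Pavlovna Scherer"), presumably because the span text itself contains a line break. A possible way to normalise that is sketched below on a made-up snippet; the HTML and the choice of `html.parser` are assumptions for illustration, not taken from the book.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the page: one name contains a line break.
html = """
<p><span class="green">Anna
Pavlovna Scherer</span> spoke to <span class="green">Prince Vasili</span>
and <span class="red">nobody else</span>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for span in soup.find_all("span", class_="green"):
    # get_text() returns the tag's text; split() and join() collapse
    # any run of whitespace (including newlines) into a single space.
    names.append(" ".join(span.get_text().split()))

print(names)
```

Only the spans with class green are collected, and each name comes out on a single line.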

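Children and descendants are not the same thing. A child tag is the direct next generation of its parent tag, whereas descendants include every tag nested under the parent at any depth. In other words, the descendants include the children.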

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# .children yields only the direct children of the table
gifts = soup.find('table', id='giftList').children
for name in gifts:
    print(name)

<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
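
To make the child/descendant distinction concrete, here is a small offline sketch. The table markup is invented for the example, and `html.parser` is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table, small enough to check by eye.
html = "<table id='t'><tr><td>A</td><td>B</td></tr><tr><td>C</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="t")

# Direct children only: the two <tr> rows.
child_tags = [node.name for node in table.children if isinstance(node, Tag)]

# Every nested tag at any depth: the rows and the cells inside them.
descendant_tags = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_tags)       # ['tr', 'tr']
print(descendant_tags)  # ['tr', 'td', 'td', 'tr', 'td']
```

The `isinstance(node, Tag)` check drops the text nodes, which both `.children` and `.descendants` also yield.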

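After finding the table, take its first tr as the current node and iterate over the sibling nodes that follow it. Because the first tr is the table header, this extracts all of the body rows and skips the header.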

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# .tr is the first row (the header); next_siblings yields what follows it
gifts = soup.find('table', id='giftList').tr.next_siblings
for name in gifts:
    print(name)

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
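
The blank lines between the rows in the output above are most likely whitespace text nodes, since `next_siblings` yields strings as well as tags. A minimal sketch of keeping only the row tags follows; the table is a cut-down, made-up version of the gift list, and `html.parser` is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened version of the gift table.
html = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr id="gift1"><td>Vegetable Basket</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
header_row = soup.find("table", id="giftList").tr

# All following siblings, including the newline strings between rows.
siblings = list(header_row.next_siblings)

# Keep only real tags, dropping the whitespace-only text nodes.
rows = [node for node in siblings if isinstance(node, Tag)]

print(len(siblings), len(rows))
print([row["id"] for row in rows])
```

Filtering with `isinstance(node, Tag)` leaves just the body rows, which is usually what later processing expects.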

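To find a product's price, locate the product image, go up to its parent <td> tag, and take that tag's previous sibling, which holds the price.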

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# img -> parent <td> -> previous sibling <td> (the price cell)
price = soup.find('img', src='../img/gifts/img1.jpg').parent.previous_sibling.string
print(price)

$15.00
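
The `parent.previous_sibling` chain works on this page because the table cells sit directly next to each other in the markup. If there were whitespace between the cells, `previous_sibling` would return that whitespace instead. The sketch below shows this on an invented snippet and uses `find_previous_sibling('td')` as one possible, more tolerant alternative; the HTML and the use of `html.parser` are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical row with newlines between the cells.
html = """<table><tr>
<td>Vegetable Basket</td>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"/></td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")
img_cell = soup.find("img", src="../img/gifts/img1.jpg").parent

# Here the immediate previous sibling is the newline between the cells.
print(repr(img_cell.previous_sibling))

# find_previous_sibling('td') skips text nodes and returns the nearest
# preceding <td>; get_text(strip=True) trims surrounding whitespace.
price = img_cell.find_previous_sibling("td").get_text(strip=True)
print(price)
```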

Collect all of the product images. To keep other images from slipping in, use a regular expression for a precise search.

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# Escape the dots so that they match literal "." characters only
imgs = soup.find_all('img', src=re.compile(r'\.\./img/gifts/img.*\.jpg'))
for img in imgs:
    print(img['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
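
In a regular expression an unescaped `.` matches any character, so whether the dots in the path are escaped affects how precise the search really is. The standard-library sketch below compares the two spellings with `search()`, which is the method the Beautiful Soup documentation says it uses for compiled patterns. The second candidate path is made up to show the difference.

```python
import re

# Dots left unescaped: each "." matches any single character.
loose = re.compile(r"../img/gifts/img.*.jpg")
# Dots escaped: only literal "." characters match.
strict = re.compile(r"\.\./img/gifts/img.*\.jpg")

candidates = [
    "../img/gifts/img1.jpg",   # a real product image path from the page
    "xx/img/gifts/img1_jpg",   # hypothetical path that is not a .jpg file
]

results = {}
for path in candidates:
    results[path] = (bool(loose.search(path)), bool(strict.search(path)))
    print(path, results[path])
```

The loose pattern accepts both strings, while the escaped pattern accepts only the genuine image path.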

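find_all() also accepts a function. The function takes a tag as its argument and must return a boolean: tags for which it returns True are kept, and tags for which it returns False are dropped.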

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# lambda tag: tag.name == 'img'
tags = soup.find_all(lambda tag: tag.has_attr('src'))
for tag in tags:
    print(tag)

<img src="../img/gifts/logo.jpg" style="float:left;"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/gifts/img3.jpg"/>
<img src="../img/gifts/img4.jpg"/>
<img src="../img/gifts/img6.jpg"/>

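tag is a Tag object (bs4.element.Tag). has_attr checks whether the tag has the given attribute, and tag.name returns the tag name. On the page above, the following two filters return the same result.
lambda tag: tag.has_attr('src')
lambda tag: tag.name == 'img'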
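
The two filters agree on that page only because every tag with a src attribute happens to be an img and every img has a src. On other markup they can differ, as the sketch below suggests. The snippet is invented for illustration and `html.parser` is an assumption made to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a script with src, and an img without src.
html = """
<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<script src="app.js"></script>
<img alt="no source"/>
"""
soup = BeautifulSoup(html, "html.parser")

# Any tag that carries a src attribute (this includes the script).
with_src = soup.find_all(lambda tag: tag.has_attr("src"))
# Any img tag (this includes the one without src).
img_tags = soup.find_all(lambda tag: tag.name == "img")
# Combining both conditions keeps only img tags that have a src.
both = soup.find_all(lambda tag: tag.name == "img" and tag.has_attr("src"))

print([t.name for t in with_src])
print(len(img_tags))
print([t["src"] for t in both])
```

Combining the two conditions in one function is a simple way to avoid picking up unrelated tags when the page is less uniform.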

