python利用beautifulSoup实现爬虫

PHP中文网
Release: 2017-06-01 10:20:14
Original
1711 people have browsed it

以前讲过利用phantomjs做爬虫抓网页 www.jb51.net/article/55789.htm 是配合选择器做的

利用 beautifulSoup(文档 :www.crummy.com/software/BeautifulSoup/bs4/doc/)这个python模块,可以很轻松的抓取网页内容


# coding=utf-8
import urllib
from bs4 import BeautifulSoup

url ='http://www.baidu.com/s'
values ={'wd':'网球'}
encoded_param = urllib.urlencode(values)
full_url = url +'?'+ encoded_param
response = urllib.urlopen(full_url)
soup =BeautifulSoup(response)
alinks = soup.find_all('a')
Copy after login

上面可以抓取百度搜出来结果是网球的记录。

beautifulSoup内置了很多非常有用的方法。

几个比较好用的特性:

构造一个node元素

代码如下:

soup = BeautifulSoup('
Extremely bold
')
tag = soup.b
type(tag)
#
Copy after login

属性可以使用attr拿到,结果是字典

代码如下:

tag.attrs
# {u'class': u'boldest'}
Copy after login

或者直接tag.class取属性也可。

也可以自由操作属性


tag['class'] = 'verybold'
tag['id'] = 1
tag
#Extremely bolddel tag['class']
del tag['id']
tag
#Extremely boldtag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
Copy after login

还可以随便操作,查找dom元素,比如下面的例子

1.构建一份文档


html_doc = """The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;
and they lived at the bottom of a well...."""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
Copy after login

2.各种搞


soup.head
#The Dormouse's storysoup.title
#The Dormouse's storysoup.body.b
# The Dormouse's storysoup.a
# Elsiesoup.find_all('a')
# [Elsie,
# Lacie,
# Tillie]
head_tag = soup.head
head_tag
#The Dormouse's storyhead_tag.contents
[The Dormouse's story]

title_tag = head_tag.contents[0]
title_tag
#The Dormouse's storytitle_tag.contents
# [u'The Dormouse's story']
len(soup.contents)
# 1
soup.contents[0].name
# u'html'
text = title_tag.contents[0]
text.contents

for child in title_tag.children:
  print(child)
head_tag.contents
# [The Dormouse's story]
for child in head_tag.descendants:
  print(child)
#The Dormouse's story# The Dormouse's story

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
title_tag.string
# u'The Dormouse's story'
Copy after login
Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template