Implementing a Crawler in Python with BeautifulSoup

PHP中文网

Published: 2017-06-01 10:20:14


An earlier article (www.jb51.net/article/55789.htm) covered crawling web pages with phantomjs, where the work was done with selectors. This time we use BeautifulSoup (documentation: www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python module that makes it easy to capture web page content.

# coding=utf-8
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': '网球'}  # search query ("tennis")
encoded_param = urllib.parse.urlencode(values)  # percent-encodes the query as UTF-8
full_url = url + '?' + encoded_param
response = urllib.request.urlopen(full_url)
soup = BeautifulSoup(response, 'html.parser')
alinks = soup.find_all('a')  # every <a> tag on the page


This captures the Baidu search results page for the query 网球 (tennis).
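The list returned by find_all('a') holds Tag objects, so the link text and target still have to be read from each one. The following is a minimal sketch of that step, not part of the original article: sample_html is a hypothetical stand-in for a downloaded page (it is not real Baidu markup), and extract_links is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded results page (not real Baidu markup).
sample_html = """
<div>
  <a href="http://example.com/one">First result</a>
  <a href="http://example.com/two">Second result</a>
  <a name="anchor-without-href">No link here</a>
</div>
"""

def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):
        href = a.get("href")  # None when the attribute is missing
        if href:
            links.append((a.get_text(strip=True), href))
    return links

links = extract_links(sample_html)
for text, href in links:
    print(text, href)
```

Using a.get("href") rather than a["href"] keeps the loop from raising KeyError on anchors that carry no href attribute.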

BeautifulSoup has many useful methods built in.

Some of the more useful features:

Constructing a node element

The code is as follows:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
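Besides its type, a Tag exposes its element name, its text, and its serialized markup. A short sketch of those three accessors, assuming the same one-element document as above and the standard-library html.parser backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

name = tag.name      # the element name as a string
text = tag.string    # the single text child of the tag
markup = str(tag)    # the tag serialized back to HTML

print(name, text, markup)
```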


Attributes can be read through attrs, which returns a dictionary. The code is as follows:

tag.attrs
# {'class': ['boldest']}


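Alternatively, read a single attribute directly with tag['class'] (tag.class is not valid Python, because class is a reserved word).

Attributes can also be modified freely: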

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
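One detail worth knowing when reading attributes: bs4 treats class as a multi-valued attribute and returns it as a list, while attributes such as id come back as plain strings. The sketch below illustrates this along with the default argument of get(); the extra class name, the id value, and the "untitled" fallback are made up for the example.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the second class name and the id are assumptions.
soup = BeautifulSoup('<b class="boldest extra" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

classes = tag["class"]                   # multi-valued attribute: a list of strings
tag_id = tag["id"]                       # single-valued attribute: a plain string
missing = tag.get("title")               # None instead of a KeyError
fallback = tag.get("title", "untitled")  # caller-supplied default

print(classes, tag_id, missing, fallback)
```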


You can also navigate the tree freely to find DOM elements, as in the following example.

1. Build a document

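# Sample document from the BeautifulSoup documentation
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')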
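When building the document, it can help to name the parser explicitly: if the second argument is omitted, bs4 picks whichever parser is installed and emits a warning, so results may differ between machines. A minimal sketch, assuming the standard-library html.parser and a hypothetical shortened document (small_doc) used only here:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document used only for this sketch.
small_doc = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='story'>Hello</p></body></html>"
)

# Naming the parser keeps behavior the same wherever the script runs.
soup = BeautifulSoup(small_doc, "html.parser")

title_text = soup.title.string       # text inside <title>
paragraph_classes = soup.p["class"]  # class is returned as a list

# prettify() returns the parsed tree as indented markup, handy for a quick check.
print(soup.prettify())
```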


2. Work with it in various ways