An example from Web Scraping with Python (Python网络数据采集):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html)
    return bsObj.find("p", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
Question 1: return bsObj.find("p", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
In this line, why can findAll be called right after find, i.e. written in the form XXX.find().findAll()?
Question 2: newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
What allows a construct like links[...].attrs[...] in this line? What is the principle that makes it work?
I'm new to this, so any guidance would be appreciated. Thanks!
I'd recommend the Shenjianshou cloud crawler (http://www.shenjianshou.cn). Crawlers are written and run entirely in the cloud, so there is no development environment to set up, and you can build and deploy them quickly.
A complex crawler can be implemented in just a few lines of JavaScript, and the platform also provides many utility features: anti-crawler countermeasures, JS rendering, data publishing, chart analysis, hotlink protection, and so on. The problems commonly encountered while developing crawlers are all handled for you by Shenjianshou.
The find function returns a Tag object representing the matched element (itself a fragment of the HTML document), and a Tag in turn supports find and find_all, which is why the calls can be chained.
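For instance, here is a minimal sketch of the chaining (the HTML snippet and the id value "content" are made up for illustration; they are not from the Wikipedia page):

from bs4 import BeautifulSoup

html = "<div id='content'><a href='/wiki/A'>A</a><a href='/wiki/B'>B</a></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "content"})  # find() returns a bs4.element.Tag
print(type(div))                           # <class 'bs4.element.Tag'>
links = div.find_all("a")                  # a Tag can itself be searched again
print([a["href"] for a in links])          # ['/wiki/A', '/wiki/B']

Because each find() gives back a Tag rather than a plain string, you can keep narrowing the search step by step, which is exactly what bsObj.find(...).findAll(...) does.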
find_all returns a list-like collection (a ResultSet) of Tag objects; indexing into it gives you a single Tag, which can then be used directly as an element, including reading its attrs dictionary. For example:
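Here is a minimal sketch (the HTML and the hrefs are invented for illustration):

from bs4 import BeautifulSoup

html = "<p><a href='/wiki/Kevin_Bacon'>Kevin Bacon</a><a href='/wiki/Footloose'>Footloose</a></p>"
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")   # ResultSet: behaves like a list of Tag objects
first = links[0]             # ordinary list indexing -> one Tag
print(first.attrs)           # {'href': '/wiki/Kevin_Bacon'}
print(first.attrs["href"])   # '/wiki/Kevin_Bacon'
print(links[1]["href"])      # shorthand: indexing a Tag reads one attribute

So links[random.randint(0, len(links)-1)] simply picks one Tag out of the list, and .attrs["href"] then reads that tag's href attribute from its attribute dictionary.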