Python crawler package BeautifulSoup: recursive crawling explained with examples
Summary:
The main job of a crawler is to walk the web and collect the content you need. At its core this is a recursive process: fetch a page, parse its content to find another URL, fetch that page, and repeat.
Let’s take Wikipedia as an example.
We want to extract all links pointing to other entries from the Kevin Bacon entry on Wikipedia.
# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-25 10:35:00
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-25 10:52:26
from urllib.request import urlopen  # urllib2.urlopen in Python 2
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")

# Print the href attribute of every <a> tag on the page
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
The above code extracts all hyperlinks on the page:
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
This output has two problems. First, the extracted URLs contain duplicates. Second, it includes URLs we don't need, such as sidebar, header, footer, and directory-bar links.
By inspecting the page, we can see that all links pointing to entry pages share three characteristics (the sketch after this list checks them with a single regular expression):
They all sit inside the div tag whose id is bodyContent
The URLs contain no colons
They are all relative paths starting with /wiki/ (a filter is needed, since complete absolute paths starting with http would otherwise also be picked up)
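These three rules can be expressed as one regular expression. Here is a minimal sketch (the sample hrefs are taken from the output above) showing which links the pattern used in the next script accepts:

import re

# Entry links: relative paths that start with /wiki/ and contain no colon
pattern = re.compile(r"^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kevin_Bacon",                          # entry link: accepted
    "/wiki/Kevin_Bacon_(disambiguation)",         # entry link: accepted
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",       # colon (File: namespace): rejected
    "#mw-head",                                   # in-page anchor: rejected
    "http://en.wikipedia.org/wiki/Philadelphia",  # absolute URL: rejected
]
for href in samples:
    print(href, "->", bool(pattern.match(href)))

The negative lookahead in ((?!:).)* rejects any href containing a colon, which filters out File:, Wikipedia:, and similar namespace pages.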
Applying these rules, the crawler becomes:

from urllib.request import urlopen  # urllib2 in Python 2
from bs4 import BeautifulSoup
import random
import re

pages = set()
random.seed()  # seed from system time; seeding with a datetime object raises TypeError on Python 3.11+

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Only links inside bodyContent that start with /wiki/ and contain no colon
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Choose only among unvisited links, so the loop cannot spin forever
    # when every link on the current page has already been seen
    candidates = [link.attrs["href"] for link in links
                  if link.attrs["href"] not in pages]
    if not candidates:
        break
    newArticle = random.choice(candidates)
    print(newArticle)
    pages.add(newArticle)
    links = getLinks(newArticle)
getLinks takes an entry URL of the form /wiki/<entry name>, prepends the Wikipedia domain, and returns the list of entry links found on that page. The main loop then repeatedly calls getLinks, each time randomly following a not-yet-visited URL, until no unvisited entries remain or the program is stopped manually.
The following code can, in principle, crawl all of Wikipedia, starting from its main page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Note that Python's default recursion limit is 1000, so the recursive version above will stop with a RecursionError after roughly 1000 nested calls. You either need to raise the limit manually or switch to a non-recursive approach so the crawl can continue past 1000 pages.
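A quick sketch of both options; the queue-based rewrite below is my own variant of the recursive script above, not code from the original article:

import sys

# Option 1: raise the recursion limit. Each nested getLinks call still consumes
# stack space, and setting the limit too high can crash the interpreter.
sys.setrecursionlimit(10000)

# Option 2: replace recursion with an explicit queue, so depth no longer matters.
from collections import deque
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
queue = deque([""])  # "" is the site root, matching getLinks("") above

while queue:
    pageUrl = queue.popleft()
    if pageUrl in pages:
        continue
    pages.add(pageUrl)
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        href = link.attrs.get("href")
        if href and href not in pages:
            queue.append(href)

The queue version visits pages breadth-first instead of depth-first, but it reaches the same set of pages and can run indefinitely without hitting the recursion limit.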
Thanks for reading; I hope this article helps you.