Python crawler package BeautifulSoup: recursive crawling explained with examples
Summary:
The main job of a crawler is to walk the web and collect the content you need. At its core this is a recursive process: fetch a page, parse its content to find another URL, fetch that page, and repeat.
Let’s take Wikipedia as an example.
We want to extract all links pointing to other entries from the Kevin Bacon entry on Wikipedia.
# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-25 10:35:00
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-25 10:52:26
from urllib.request import urlopen  # urllib2.urlopen in Python 2
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")

# Print the href attribute of every <a> tag on the page
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
The above code extracts all hyperlinks on the page:
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
This output has two problems. First, the extracted URLs contain duplicates. Second, it includes URLs we don't need, such as sidebar, header, footer, and directory-bar links.
By inspecting the page, we can see that all links pointing to entry pages share three characteristics (the sketch after this list checks them with a single regular expression):
They all sit inside the div tag whose id is bodyContent
The URLs contain no colons
They are all relative paths starting with /wiki/ (a filter is needed, since complete absolute paths starting with http would otherwise also be picked up)
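These three rules can be expressed as one regular expression. Here is a minimal sketch (the sample hrefs are taken from the output above) showing which links the pattern used in the next script accepts:

import re

# Entry links: relative paths that start with /wiki/ and contain no colon
pattern = re.compile(r"^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kevin_Bacon",                          # entry link: accepted
    "/wiki/Kevin_Bacon_(disambiguation)",         # entry link: accepted
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",       # colon (File: namespace): rejected
    "#mw-head",                                   # in-page anchor: rejected
    "http://en.wikipedia.org/wiki/Philadelphia",  # absolute URL: rejected
]
for href in samples:
    print(href, "->", bool(pattern.match(href)))

The negative lookahead in ((?!:).)* rejects any href containing a colon, which filters out File:, Wikipedia:, and similar namespace pages.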
Applying these rules, the crawler becomes:

from urllib.request import urlopen  # urllib2 in Python 2
from bs4 import BeautifulSoup
import random
import re

pages = set()
random.seed()  # seed from system time; seeding with a datetime object raises TypeError on Python 3.11+

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Only links inside bodyContent that start with /wiki/ and contain no colon
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Choose only among unvisited links, so the loop cannot spin forever
    # when every link on the current page has already been seen
    candidates = [link.attrs["href"] for link in links
                  if link.attrs["href"] not in pages]
    if not candidates:
        break
    newArticle = random.choice(candidates)
    print(newArticle)
    pages.add(newArticle)
    links = getLinks(newArticle)
getLinks takes an entry URL of the form /wiki/<entry name>, prepends the Wikipedia domain, and returns the list of entry links found on that page. The main loop then repeatedly calls getLinks, each time randomly following a not-yet-visited URL, until no unvisited entries remain or the program is stopped manually.
The following code can, in principle, crawl all of Wikipedia, starting from its main page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Note that Python's default recursion limit is 1000, so the recursive version above will stop with a RecursionError after roughly 1000 nested calls. You either need to raise the limit manually or switch to a non-recursive approach so the crawl can continue past 1000 pages.
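A quick sketch of both options; the queue-based rewrite below is my own variant of the recursive script above, not code from the original article:

import sys

# Option 1: raise the recursion limit. Each nested getLinks call still consumes
# stack space, and setting the limit too high can crash the interpreter.
sys.setrecursionlimit(10000)

# Option 2: replace recursion with an explicit queue, so depth no longer matters.
from collections import deque
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
queue = deque([""])  # "" is the site root, matching getLinks("") above

while queue:
    pageUrl = queue.popleft()
    if pageUrl in pages:
        continue
    pages.add(pageUrl)
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        href = link.attrs.get("href")
        if href and href not in pages:
            queue.append(href)

The queue version visits pages breadth-first instead of depth-first, but it reaches the same set of pages and can run indefinitely without hitting the recursion limit.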
Thanks for reading; I hope this article helps you.