This was my first attempt at web-crawler technology. I came across a post on Zhihu about scraping jokes from Qiushibaike (the "Encyclopedia of Embarrassing Things"), so I decided to build a crawler of my own.
Goals:
1. Scrape the jokes from the Encyclopedia of Embarrassing Things.
2. Fetch one page of jokes at a time, loading the next page each time Enter is pressed.
Technical implementation: written in Python, using the requests library, the re library, and BeautifulSoup from the bs4 library.
Main content: First we need a clear plan for the crawler, so we can build the main framework. Step one: write a function that fetches the web page with the requests library. Step two: parse the fetched page with bs4's BeautifulSoup and use regular expressions to match the relevant joke information. Step three: print the extracted information. A main function drives all of these steps.
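As a roadmap, the skeleton we are about to fill in looks like this (the function names are the ones used below):

# Skeleton of the crawler; each stub is filled in in the following sections.
def getHTMLText(url): ...                    # step 1: download the page
def fillUnivlist(lis, li, html, count): ...  # step 2: parse jokes and publishers
def printUnivlist(lis, li, count): ...       # step 3: print the results
def main(): ...                              # drives the steps, one page per Enter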
First, import the relevant libraries:
import requests
from bs4 import BeautifulSoup
import re
Second, fetch the web page content:
def getHTMLText(url):
    try:
        # Spoof a browser user agent so the site serves the normal page.
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""
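A quick sanity check of the fetcher, as a sketch only: the URL is the site's first hot page, and whether it still responds depends on the site.

# Minimal sanity check, assuming the site is reachable from your network;
# an empty string means the request failed.
page = getHTMLText('http://www.qiushibaike.com/8hr/page/1/?s=4966318')
print(len(page))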
Third, take the returned HTML and parse it:
soup = BeautifulSoup(html, "html.parser")
What we need are the joke text and its publisher. By inspecting the page source, we find that the joke text sits in

'p', attrs={'class': 'content'}

tags (the text itself inside a <span>), and the publisher's name in

'p', attrs={'class': 'author clearfix'}

tags (the name inside an <h2>), so we use the bs4 library's find_all method to extract these two kinds of tags:
def fillUnivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        # Joke bodies and publisher blocks, respectively.
        a = soup.find_all('p', attrs={'class': 'content'})
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})
Then extract the text from those tags with regular expressions:
        for sp in a:
            pattern = re.compile(r'<span>(.*?)</span>', re.S)
            Info = re.findall(pattern, str(sp))
            lis.append(Info)
            count = count + 1
        for mc in ll:
            namePattern = re.compile(r'<h2>(.*?)</h2>', re.S)
            d = re.findall(namePattern, str(mc))
            li.append(d)
Note that find_all and re's findall both return lists. The regular expressions here only extract roughly; they do not strip the line breaks inside the tags.
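To see the extraction end to end, here is a small self-contained check using the complete fillUnivlist from the full listing below. The sample markup is a hypothetical simplification of the real page, not a capture of it:

# Hypothetical, simplified markup mirroring the structure the regexes expect.
sample_html = ('<p class="author clearfix"><h2>some_user</h2></p>'
               '<p class="content"><span>a short joke</span></p>')
jokes, names = [], []
n = fillUnivlist(jokes, names, sample_html, 0)
print(n)            # 1 joke matched
print(names[0][0])  # some_user
print(jokes[0][0])  # a short joke

If you wanted to avoid the raw-regex step entirely, calling get_text() on each matched tag would return the text with less fuss, though you would still need to strip whitespace yourself.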
Next, we just need to pair up the contents of the two lists and print them:
def printUnivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print("%s:%s" % (a, b))
Then I write an input-control function: entering Q returns False and exits; pressing Enter returns True and loads the next page of jokes.
def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True
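The same check fits in one line if you prefer; an equivalent sketch:

def input_enter():
    # False on 'Q' (quit); True on anything else, including a bare Enter.
    return input() != 'Q'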
The main function wires in the input control: if the control function returns False, nothing more is printed; if it returns True, output continues and the for loop loads the next page.
def main():
    passage = 0
    for i in range(20):
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = getHTMLText(url)
            number = fillUnivlist(lit, li, a, count)
            printUnivlist(lit, li, number)
        else:
            break
Note that each pass through the loop re-creates lit and li, so each page's jokes are printed correctly instead of piling up from earlier pages.
Here is the complete source code:
import requests
from bs4 import BeautifulSoup
import re

def getHTMLText(url):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        a = soup.find_all('p', attrs={'class': 'content'})
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})
        for sp in a:
            pattern = re.compile(r'<span>(.*?)</span>', re.S)
            Info = re.findall(pattern, str(sp))
            lis.append(Info)
            count = count + 1
        for mc in ll:
            namePattern = re.compile(r'<h2>(.*?)</h2>', re.S)
            d = re.findall(namePattern, str(mc))
            li.append(d)
    except Exception:
        return count
    return count

def printUnivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print("%s:%s" % (a, b))

def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True

def main():
    passage = 0
    for i in range(20):
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = getHTMLText(url)
            number = fillUnivlist(lit, li, a, count)
            printUnivlist(lit, li, number)
        else:
            break

main()
This is my first crawler, so there are surely many places that could be optimized. I hope everyone will point them out.