


Detailed explanation of how to crawl the Encyclopedia of Embarrassing Things with Python crawler techniques
This was my first time learning web crawling. I came across a post on Zhihu about crawling the jokes on the Encyclopedia of Embarrassing Things, so I decided to write a crawler myself.
Goals: 1. Crawl the jokes from the Encyclopedia of Embarrassing Things
2. Crawl one page of jokes at a time, and fetch the next page each time Enter is pressed
Technical implementation: written in Python, using the Requests library, the re library, and BeautifulSoup from the bs4 library
Main content: First, we need a clear plan for the crawler so that we can build the main framework. Step one: write a method that fetches the web page with the Requests library. Step two: parse the fetched page with BeautifulSoup from the bs4 library and match the relevant joke information with regular expressions. Step three: print out the extracted information. All of these methods are driven from a main function.
First, import the relevant libraries
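A minimal sketch of the imports this walkthrough relies on (requests for fetching pages, re for regular expressions, and BeautifulSoup from bs4 for parsing):

import requests                  # fetch web pages
import re                        # regular-expression matching
from bs4 import BeautifulSoup    # parse the HTML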
Second, obtain the web page information
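A sketch of a page-fetching method built on the Requests library; the function name get_html, the browser-style User-Agent header, and the 30-second timeout are assumptions here:

def get_html(url):
    """Fetch a page and return its text, or None on failure."""
    headers = {'User-Agent': 'Mozilla/5.0'}   # browser-like header (assumed)
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()                  # error out on bad status codes
        r.encoding = r.apparent_encoding      # guess the page encoding
        return r.text
    except requests.RequestException:
        return None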
Third, put the fetched information into r and then parse it
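With the page text stored in r, a single line hands it to BeautifulSoup for parsing; html.parser is the built-in parser assumed here:

soup = BeautifulSoup(r, 'html.parser')   # parse the page text held in r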
What we need are the content and the publisher of each joke. By viewing the page source, we can see which tag wraps the publisher's name and which tag wraps the joke text, so we use bs4 methods to extract the contents of those two tags.
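A sketch of that extraction step; the class names 'author clearfix' and 'content' are assumptions about the page's markup and should be checked against the actual source:

# The class names below are assumptions; verify them in the page source.
authors = soup.find_all('div', class_='author clearfix')    # publisher blocks
contents = soup.find_all('div', class_='content')           # joke-text blocks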
Then obtain the information through specific regular expressions
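A sketch of the regex step, feeding the stringified tags to re.findall; the <h2> and <span> patterns, and the pairing of lis with publishers and li with joke texts, are assumptions:

lis = []   # publisher names
li = []    # joke texts
# re.S lets '.' match newlines, so line breaks inside the tags are kept.
for a in authors:
    lis.extend(re.findall(r'<h2>(.*?)</h2>', str(a), re.S))
for c in contents:
    li.extend(re.findall(r'<span>(.*?)</span>', str(c), re.S))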
Note that find_all and re's findall both return lists. When using the regular expressions we only do a rough extraction and do not strip the line breaks inside the tags.
Next, we only need to combine the contents of the two lists and output them
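One way to combine them, sketched here with zip, pairs each publisher with the matching joke; the strip() calls just trim surrounding whitespace for display:

# Pair each publisher with the matching joke and print them together.
for name, joke in zip(lis, li):
    print('Publisher:', name.strip())
    print(joke.strip())
    print('-' * 40)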
Then I wrote an input-control function: entering Q makes it return an error value and exit, while pressing Enter makes it return a correct value and load the next page of jokes.
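A sketch of such an input-control function; modelling "error" as False for Q and "correct" as True for Enter is an assumption:

def control():
    """Return False if the user types Q (quit), True if they just press Enter."""
    key = input('Press Enter for the next page, or Q to quit: ')
    if key.strip().upper() == 'Q':
        return False
    return True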
We wire the input control into the main function: if the control function returns an error value, no further output is produced; if it returns a correct value, output continues and the next page is loaded through a for loop.
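A sketch of a main function along those lines; the URL pattern, the page range of 1-13, and asking for input before each page after the first are all assumptions:

def main():
    base_url = 'https://www.qiushibaike.com/text/page/{}/'   # assumed URL pattern
    for page in range(1, 14):                 # assumed page range
        if page > 1 and not control():        # Q stops before the next page prints
            break
        r = get_html(base_url.format(page))
        if r is None:
            print('Failed to fetch page', page)
            continue
        soup = BeautifulSoup(r, 'html.parser')
        lis, li = [], []                      # rebuilt for every page
        for a in soup.find_all('div', class_='author clearfix'):
            lis.extend(re.findall(r'<h2>(.*?)</h2>', str(a), re.S))
        for c in soup.find_all('div', class_='content'):
            li.extend(re.findall(r'<span>(.*?)</span>', str(c), re.S))
        for name, joke in zip(lis, li):
            print('Publisher:', name.strip())
            print(joke.strip())
            print('-' * 40)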
Note that each pass through the for loop rebuilds lis[] and li[], so every page's jokes are output correctly.
Here is the full source code:
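Putting the pieces above together, a complete sketch under the same assumptions (the URL pattern, page range, tag class names, and regex patterns may all need adjusting against the real page):

import requests                  # fetch web pages
import re                        # regular-expression matching
from bs4 import BeautifulSoup    # parse the HTML


def get_html(url):
    """Fetch a page and return its text, or None on failure."""
    headers = {'User-Agent': 'Mozilla/5.0'}   # browser-like header (assumed)
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None


def control():
    """Return False if the user types Q (quit), True if they just press Enter."""
    key = input('Press Enter for the next page, or Q to quit: ')
    return key.strip().upper() != 'Q'


def main():
    base_url = 'https://www.qiushibaike.com/text/page/{}/'   # assumed URL pattern
    for page in range(1, 14):                 # assumed page range
        if page > 1 and not control():        # Q quits, Enter loads the next page
            break
        r = get_html(base_url.format(page))
        if r is None:
            print('Failed to fetch page', page)
            continue
        soup = BeautifulSoup(r, 'html.parser')
        lis, li = [], []                      # rebuilt for every page
        # Class names and regex patterns are assumptions about the markup.
        for a in soup.find_all('div', class_='author clearfix'):
            lis.extend(re.findall(r'<h2>(.*?)</h2>', str(a), re.S))
        for c in soup.find_all('div', class_='content'):
            li.extend(re.findall(r'<span>(.*?)</span>', str(c), re.S))
        for name, joke in zip(lis, li):
            print('Publisher:', name.strip())
            print(joke.strip())
            print('-' * 40)


if __name__ == '__main__':
    main()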
This was my first attempt, and there are still many places that could be optimized. I hope everyone will point them out.