This article walks through a practical example of crawling page content with Python 3 and the requests module. It should serve as a useful reference if you are interested in the topic.
1. Install pip
My personal desktop runs Linux Mint, which does not ship with pip installed. Since pip will be needed later to install the requests module, installing pip is the first step. (Note: on Debian-based systems such as Linux Mint, the python-pip package installs pip for Python 2; since the demo later runs on Python 3, python3-pip is the corresponding Python 3 package.)
$ sudo apt install python-pip
Once the installation succeeds, check the pip version:
$ pip -V
2. Install the requests module
Here I installed it through pip:
$ pip install requests
To verify the installation, run import requests; if no error is raised, the module has been installed successfully.
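If you want an explicit check, you can also print the installed version from the interpreter (a quick extra check, not part of the original steps):
>>> import requests
>>> requests.__version__  # prints the installed version string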
3. Install beautifulsoup4
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with your favorite parser, it provides idiomatic ways to navigate, search, and modify the parse tree. Beautiful Soup can save you hours or even days of work.
$ sudo apt-get install python3-bs4
Note: I am using the Python 3 installation method here. If you are using Python 2, you can install it with the following command instead.
$ sudo pip install beautifulsoup4
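As with requests, a quick import in the interpreter confirms that beautifulsoup4 is available (an optional check, not from the original article):
>>> import bs4
>>> bs4.__version__  # prints the installed Beautiful Soup version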
4. A brief introduction to the requests module
1) Send a request
First of all, you must of course import the requests module:
>>> import requests
Then fetch the target web page. Here I use the following page as an example:
>>> r = requests.get('http://www.jb51.net/article/124421.htm')
This returns a Response object named r, from which we can get all the information we want. The get here corresponds to the HTTP GET method, so by analogy you can also use put, delete, post, and head.
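For example, the other HTTP methods follow the same pattern (a minimal sketch using the public httpbin.org test service, which is not part of the original article):
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')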
2) Pass URL parameters
Sometimes we want to pass data in the URL's query string. If you build the URL by hand, the data is placed after a question mark as key/value pairs, for example cnblogs.com/get?key=val. Requests lets you provide these parameters as a dictionary of strings via the params keyword argument.
For example, when we search Google for the keyword "python爬虫" ("python crawler"), parameters such as newwindow (open in a new window), q, and oq (the search keywords) could be assembled into the URL by hand, but with requests you can use the following code instead:
>>> payload = {'newwindow': '1', 'q': 'python爬虫', 'oq': 'python爬虫'}
>>> r = requests.get("https://www.google.com/search", params=payload)
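To see the URL that requests actually built from the dictionary, you can inspect r.url (an extra check, not shown in the original):
>>> r.url  # e.g. 'https://www.google.com/search?newwindow=1&q=...&oq=...'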
3) Response content
Get the page response content through r.text or r.content.
>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
Requests automatically decodes content from the server, and most Unicode character sets are decoded seamlessly. A small note on the difference between r.text and r.content. To put it simply:
r.text returns Unicode text;
r.content returns data of type bytes, i.e. raw binary data.
So if you want text, use r.text; if you want images or other files, use r.content.
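As a small illustration of the difference, here is a sketch of saving an image with r.content (the image URL is just an assumed example, not from the original article):
>>> r = requests.get('https://www.python.org/static/img/python-logo.png')
>>> with open('logo.png', 'wb') as f:
...     f.write(r.content)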
4) Get the web page encoding
>>> r = requests.get('http://www.cnblogs.com/')
>>> r.encoding
'utf-8'
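If the encoding requests guesses is wrong (one common cause of garbled output like that mentioned at the end of this article), you can override it before reading r.text; this is standard requests behavior, though the encoding value here is just an assumed example:
>>> r.encoding = 'gbk'  # force the encoding used to decode r.text
>>> r.text  # now decoded with the encoding set above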
5) Get the response status code
We can check the response status code:
>>> r = requests.get('http://www.cnblogs.com/')
>>> r.status_code
200
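requests also provides a built-in status lookup and a helper that raises an exception for error responses, which can be handier than comparing numbers by hand (a small addition beyond the original text):
>>> r.status_code == requests.codes.ok
True
>>> r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses, does nothing for 200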
5. Case Demonstration
My company recently introduced an OA system. Here I take its official documentation page as an example and capture only the useful information on the page, such as the article title and content.
Demo environment
Operating system: Linux Mint
Python version: Python 3.5.2
Modules used: requests, beautifulsoup4
The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'GavinHsueh'

import requests
import bs4

# URL of the target page to crawl
url = 'http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html'

# Fetch the page and get the response object
response = requests.get(url)

# Check the response status code
status_code = response.status_code

# Parse the page with BeautifulSoup and locate the target element
content = bs4.BeautifulSoup(response.content.decode("utf-8"), "lxml")
element = content.find_all(id='book')

print(status_code)
print(element)
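To go one step further and pull out just the readable text from the matched element, Beautiful Soup's get_text() can be used. This is a sketch that assumes the page really contains an element with id='book', as in the script above:
if element:
    # element is a list of matches; print the plain text of the first one
    print(element[0].get_text(strip=True))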
Running the program prints the response status code and the captured element content, confirming that the crawl succeeded.
On the problem of garbled crawl results
In fact, I started out with the Python 2 that ships with the system, but I struggled for a long time with garbled encoding in the crawled content, and none of the solutions I found through Google worked. After being "driven crazy" by Python 2, I gave in and switched to Python 3. As for the garbled-page problem under Python 2, experienced readers are welcome to share their solutions so that others like me can avoid the same detour.
Postscript
Python has many crawler-related modules besides requests, such as urllib, pycurl, and tornado. By comparison, I personally find the requests module the simplest and easiest to use. Through this article, you can quickly learn to crawl page content with Python's requests module. My ability is limited, so if there are any mistakes in the article, please let me know. And if you have questions about crawling page content with Python, you are welcome to discuss them here.