HTTP protocol:
HTTP (Hypertext Transfer Protocol) is a stateless application-layer protocol based on a request-and-response model. A URL is the Internet path used to access a resource over HTTP; each URL corresponds to one data resource.
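To make the link between a URL and a resource concrete, here is a minimal sketch that uses Python's standard urllib.parse module. The URL and the helper name describe_url are made up for illustration and do not come from the article.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the parts that locate a resource over HTTP."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # protocol, e.g. http
        "host": parts.hostname,   # server that holds the resource
        "port": parts.port,       # None when no port is written in the URL
        "path": parts.path,       # location of the resource on that server
    }

# Hypothetical URL, used only as an illustration
info = describe_url("http://example.com:8080/docs/index.html")
print(info)
```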
HTTP protocol operation on resources:
The Requests library provides all the basic HTTP request methods. Official introduction:
The 6 main methods of the Requests library:
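The list itself is missing from this copy of the article. Assuming it refers to the six convenience functions named after HTTP verbs (requests.get, requests.head, requests.post, requests.put, requests.patch, requests.delete), the sketch below pairs each function with its verb. Nothing is sent over the network; the URL and the helper name verb_and_helper are illustrative.

```python
import requests

# Assumed to be the six methods the heading refers to
VERBS = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]

def verb_and_helper(verb, url="http://example.com/resource"):
    """Return the HTTP method of a prepared request and the matching helper name."""
    prepared = requests.Request(verb, url).prepare()  # built locally, not sent
    helper = getattr(requests, verb.lower())          # e.g. requests.get
    return prepared.method, helper.__name__

for verb in VERBS:
    print(verb_and_helper(verb))
```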
Exceptions in the Requests library:
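The exception list is also missing here. As a hedged sketch, the snippet below uses exception classes that do exist in requests.exceptions and shows one way to tell them apart. The helper name classify and the labels it returns are assumptions for illustration.

```python
import requests

def classify(exc):
    """Map a Requests exception to a short label (labels are illustrative)."""
    # Check Timeout first: ConnectTimeout inherits from both
    # ConnectionError and Timeout.
    if isinstance(exc, requests.exceptions.Timeout):
        return "timeout"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.exceptions.HTTPError):
        return "bad HTTP status"
    if isinstance(exc, requests.exceptions.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, requests.exceptions.RequestException):
        return "other request error"
    return "not a Requests exception"

print(classify(requests.exceptions.ConnectTimeout()))
print(classify(requests.exceptions.HTTPError()))
```

Because every class above derives from requests.exceptions.RequestException, catching that base class is enough when the caller does not need to distinguish the cases.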
There are two important objects in the Requests library: Request and Response. The Request object supports multiple request methods; the Response object contains everything the server returned, as well as information about the Request that produced it.
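As a rough illustration of the Request side, the sketch below builds a Request and prepares it without sending anything. The URL, the query parameter, and the helper name build_request are hypothetical.

```python
import requests

def build_request(url, params):
    """Describe a GET request and prepare it without sending it."""
    req = requests.Request("GET", url, params=params)
    return req.prepare()  # PreparedRequest: the form that goes on the wire

# Hypothetical URL and query parameters
prepared = build_request("http://example.com/search", {"q": "python"})
print(prepared.method, prepared.url)

# After an actual call such as r = requests.get(...), the Response exposes
# the request that produced it as r.request.
```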
Attributes of the Response object:
Among them, r.encoding is the encoding guessed from the HTTP headers: if the Content-Type header contains no charset, text content is assumed to be ISO-8859-1.
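The header rule can be checked offline with requests.utils.get_encoding_from_headers, the helper Requests uses to derive an encoding from response headers. This is a minimal sketch; the wrapper name header_encoding and the sample header values are illustrative.

```python
from requests.utils import get_encoding_from_headers

def header_encoding(content_type):
    """Return the encoding Requests derives from a Content-Type value."""
    # The helper looks up the lowercase "content-type" key.
    return get_encoding_from_headers({"content-type": content_type})

print(header_encoding("text/html"))                 # no charset given
print(header_encoding("text/html; charset=utf-8"))  # charset given
```

When the header gives no charset, r.apparent_encoding, which is guessed from the body itself, is often the better choice for decoding.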
r.raise_for_status() raises an HTTPError when r.status_code indicates a failure (4xx or 5xx), so the caller does not need to compare r.status_code with 200 by hand.
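A small offline check of that behaviour: raise_for_status() raises for 4xx and 5xx codes and returns quietly otherwise, including for a 3xx code. The Response object here is built by hand purely for illustration, and the helper name raises_http_error is an assumption.

```python
import requests

def raises_http_error(status_code):
    """Return True if raise_for_status() raises for the given status code."""
    r = requests.Response()      # empty Response, no network involved
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 301, 404, 503):
    print(code, raises_http_error(code))
```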
Comparison between HTTP protocol and Requests library:
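General code framework for crawling web pages:

    import requests

    def getHtmlText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            # Raises an HTTPError if the status signals an error (4xx or 5xx)
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return 'An exception occurred'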
For example, to obtain information on the PMCAFF homepage:
    import requests

    def getHtmlText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return 'An exception occurred'

    if __name__ == '__main__':
        url = ''  # the URL is left blank in the source
        print(getHtmlText(url))
Operating environment: Mac, Python 3.6, PyCharm 2016.2
Reference: Chinese University MOOC course "Python Web Crawler and Information Extraction"
----- End -----
Author: Du Wangdan, WeChat public account: Du Wangdan, Internet product manager.