Use python package: requests.
Specific method: (Recommended learning: Python video tutorial)
First, define your own headers. Note that the User-Agent field in the headers can be used to design a list according to your own needs for random replacement.
Web page features of ajax data: There are some ajax requests in the XHR network flow in NetWork, among which their request_url must be an ajax request interface, and the referer in the headers is the url before the jump. When constructing itself The referer field needs to be set in the headers.
Take the search for "java" on Lagou.com's homepage as an example:
The ajax data crawler and the ordinary web crawler have one more url, one is the url of the referer. Placed in headers. The other one is ajax_url, which is also the main access url.
The most important point is that ajax generally returns json data, so the parsing form of the captured data is different. Simply convert the result set into a json result set, and the access method is ordinary dictionary or list access.
The other one is the access parameter. If the request contains param, it will be constructed. If not, ignore it. There are parameters in this example, pay attention to the parameter dictionary construction method
A simple complete code is shown below
from urllib.request import quote,unquote import random import requests keyword = quote('java').strip() print(keyword, type(keyword)) city = quote('郑州').strip() print(unquote(city)) refer_url = 'https://www.lagou.com/jobs/list_%s?city=%s&cl=false&fromSearch=true&labelWords=&suginput=' % (keyword, city) ajax_url = 'https://www.lagou.com/jobs/positionAjax.json?city=%s&needAddtionalResult=false' %city user_agents=[ 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299', ] data ={ 'first': 'true', 'pn': '1', 'kd': keyword, } headers={ 'Accept': 'application/json, text/javascript, */*; q=0.01', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.9', 'Connection': 'keep-alive', 'Content-Length': '46', 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8', 'Host': 'www.lagou.com', 'Origin': 'https://www.lagou.com', 'Referer': refer_url, 'User-Agent': user_agents[random.randrange(0,3)], 'X-Anit-Forge-Code': '0', 'X-Anit-Forge-Token': 'None', 'X-Requested-With': 'XMLHttpRequest', } resp = requests.post(ajax_url,data=data, headers=headers) result = resp.json() print(result) # print(result) #result 就是最终获得的json格式数据 item = result['content']['positionResult']['result'][0] print(item) #item就是单个招聘条目信息 print("程序结束")
For more Python related technical articles, please visitPython tutorial Column for learning!
The above is the detailed content of How to crawl ajax in python. For more information, please follow other related articles on the PHP Chinese website!