When crawling with Python, you may have found that the HTML returned by a direct request to some pages does not contain the data you can see in the browser.
This is because that data is loaded through Ajax and rendered by JavaScript after the page loads. In that case we need to analyze the requests the page makes.
The previous article explained what cookies are and how to simulate a login. Today I will show you how to analyze a page's Ajax requests.
What is Ajax
AJAX stands for "Asynchronous JavaScript and XML" (XML being a subset of SGML, the Standard Generalized Markup Language). It is a web development technique for building interactive web applications.
AJAX is a technology for creating fast, dynamic web pages.
AJAX is a technology that can update parts of a web page without reloading the entire web page.
Put simply, if a page loads new content while the URL in the browser's address bar stays the same, that content was most likely loaded asynchronously by JavaScript, that is, through Ajax. Ajax requests are generally sent through the XMLHttpRequest object, usually abbreviated as XHR.
Analyzing the Guokr website
Our target website for this analysis is Guokr (guokr.com).
This page has no pagination buttons; as we keep scrolling down, it automatically loads more content. Yet the page URL does not change as more content loads, and if we request that URL directly we obviously get only the HTML of the first page.
# So how do we get the data for every page?
Open the developer tools (F12) in Chrome, click Network, and select the XHR tab. Then refresh the page and scroll down. In the XHR tab, a new request appears each time the page loads more content.
When we click on the first request, we can see its parameters:

    retrieve_type: by_subject
    limit: 20
    offset: 18
    -: 1500265766286

When we click on the second request, the parameters are as follows:

    retrieve_type: by_subject
    limit: 20
    offset: 38
    -: 1500265766287

The limit parameter is the number of articles loaded per page, and offset identifies the page by the position of its first article. Looking further down the list of requests, we find that offset increases by 20 with each request.
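The offset pattern described above can be written down directly. The following is a minimal sketch; the helper name `page_offsets` and its default arguments are assumptions for illustration, taken from the two requests observed above (first offset 18, step 20).

```python
def page_offsets(pages, first=18, step=20):
    """Return the offset values for the first `pages` Ajax requests.

    Assumes the pattern seen in the Network panel: the first request uses
    offset=18 and each later request adds 20 (the value of `limit`).
    """
    return [first + step * i for i in range(pages)]


if __name__ == "__main__":
    print(page_offsets(3))  # [18, 38, 58]
```

If the site changes its page size, only `step` should need adjusting, though that is worth confirming in the Network panel.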
Next we look at each request's response, which is data in JSON format. Expanding result in the response preview shows the data for 20 articles, so we have found where the information we need lives. The request headers show the URL that serves this JSON data:
http://www.guokr.com/apis/minisite/article.json?retrieve_type=by_subject&limit=20&offset=18
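To double-check which parameters a captured URL carries, and to derive the URL for another page, the standard library's `urllib.parse` is enough. This is a small sketch; the helper names `query_params` and `with_offset` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

CAPTURED_URL = ("http://www.guokr.com/apis/minisite/article.json"
                "?retrieve_type=by_subject&limit=20&offset=18")


def query_params(url):
    """Return the query-string parameters of `url` as a dict of strings."""
    return dict(parse_qsl(urlsplit(url).query))


def with_offset(url, offset):
    """Return `url` with its offset parameter replaced by `offset`."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params["offset"] = str(offset)
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    print(query_params(CAPTURED_URL))
    print(with_offset(CAPTURED_URL, 38))
```

Working from the captured URL rather than retyping it keeps the parameter names exactly as the site sent them.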
## Crawling process
1. Analyze the Ajax requests to obtain the article URLs on each page.
2. Parse each article to obtain the required data.
3. Save the obtained data to the database.
4. Start multiple processes to crawl at scale.

## Start
As before, our tools are requests and BeautifulSoup. First we need to analyze the Ajax request to obtain the data for every page. From the analysis above, we know the URL of the JSON data loaded via Ajax: http://www.guokr.com/apis/minisite/article.json?retrieve_type=by_subject&limit=20&offset=18
We need to construct this URL.

    # Import the modules we may need
    import requests
    from urllib.parse import urlencode
    from requests.exceptions import ConnectionError

    # Fetch the JSON text of one index page
    def get_index(offset):
        base_url = 'http://www.guokr.com/apis/minisite/article.json?'
        data = {
            'retrieve_type': "by_subject",
            'limit': "20",
            'offset': offset
        }
        params = urlencode(data)
        url = base_url + params
        try:
            resp = requests.get(url)
            if resp.status_code == 200:
                return resp.text
            return None
        except ConnectionError:
            print('Error.')
            return None

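    import json

    # Parse the JSON and yield each article's URL
    def parse_json(text):
        try:
            result = json.loads(text)
        except (TypeError, ValueError):
            # text was None (failed request) or was not valid JSON
            return
        if result:
            for item in result.get('result') or []:
                # print(item.get('url'))
                yield item.get('url')
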
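To check the parsing logic without hitting the network, one option is to feed it a hand-written payload. In the sketch below, only the `result` list and the `url` key come from the code above; the sample values and the helper name `extract_urls` are made up for illustration.

```python
import json

# Hand-written stand-in for a response body. Only the "result" list and the
# "url" key come from the article's code; every value here is made up.
SAMPLE = json.dumps({
    "result": [
        {"url": "http://example.com/article/1/"},
        {"url": "http://example.com/article/2/"},
        {"title": "entry without a url"},
    ]
})


def extract_urls(text):
    """Return the article URLs found in a JSON response body, skipping gaps."""
    try:
        payload = json.loads(text)
    except (TypeError, ValueError):
        # text was None (failed request) or was not valid JSON
        return []
    if not isinstance(payload, dict):
        return []
    items = payload.get("result") or []
    return [item["url"] for item in items
            if isinstance(item, dict) and item.get("url")]


if __name__ == "__main__":
    print(extract_urls(SAMPLE))
    print(extract_urls(None))
```

Returning a list rather than a generator makes the result easy to inspect, at the cost of holding one page of URLs in memory.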
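Now that we have the article URLs, getting each article's data is straightforward, so I won't go into detail here. Our goal is to extract each article's title, author, and content.
Since some articles contain images, we simply filter the images out of the article content.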

    from bs4 import BeautifulSoup

    # Parse an article page into a dict with title, author, and text
    def parse_page(text):
        try:
            soup = BeautifulSoup(text, 'lxml')
            content = soup.find('div', class_="content")
            title = content.find('h1', id="articleTitle").get_text()
            author = content.find('div', class_="content-th-info").find('a').get_text()
            article_content = content.find('div', class_="document").find_all('p')
            # Keep only paragraphs that contain no image and no link
            all_p = [i.get_text() for i in article_content if not i.find('img') and not i.find('a')]
            article = '\n'.join(all_p)
            # print(title,'\n',author,'\n',article)
            data = {
                'title': title,
                'author': author,
                'article': article
            }
            return data
        except Exception:
            # Skip pages that fail to download or do not match the expected layout
            return None

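The main function later in this article calls `get_page(url)` to download each article, but its definition is not shown. Below is a minimal sketch, assuming it mirrors `get_index` and returns the page HTML or `None`; the `timeout` value is my own assumption.

```python
import requests
from requests.exceptions import RequestException


def get_page(url, timeout=10):
    """Return the HTML of `url`, or None if the request fails.

    Assumed behaviour, modelled on get_index: only a 200 response counts
    as success; network errors and timeouts are reported as None.
    """
    try:
        resp = requests.get(url, timeout=timeout)
    except RequestException:
        print('Error fetching', url)
        return None
    if resp.status_code == 200:
        return resp.text
    return None
```

Returning `None` on failure fits `parse_page`, which skips anything it cannot parse.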
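When crawling with multiple processes, BeautifulSoup also occasionally raises an error, which we again simply skip. We store the extracted data as a dictionary, which makes it easy to save to the database.
Next comes saving to the database. Here we use MongoDB to store the data.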
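
    import pymongo
    from config import MONGO_URL, MONGO_DB, MONGO_TABLE

    client = pymongo.MongoClient(MONGO_URL, 27017)
    db = client[MONGO_DB]

    # Save one article dict to MongoDB
    def save_database(data):
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        if db[MONGO_TABLE].insert_one(data).acknowledged:
            print('Save to Database successful', data)
            return True
        return False
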
We keep the database name and the collection name in a config file and import those settings into the script, which makes the code easier to manage.
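A config file for this could look like the sketch below. The names `MONGO_URL`, `MONGO_DB`, and `MONGO_TABLE` are the ones used by the code above; the values are placeholders I made up, so substitute your own.

```python
# config.py
# Placeholder values for illustration only; the article does not give them.
MONGO_URL = 'localhost'       # host of the MongoDB server (hypothetical)
MONGO_DB = 'guokr'            # database name (hypothetical)
MONGO_TABLE = 'articles'      # collection name (hypothetical)
```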
Finally, since Guokr has quite a lot of data, we can use multiple processes if we want to crawl at scale.
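
    from multiprocessing import Pool

    # Main function: crawl and save every article on one index page
    def main(offset):
        text = get_index(offset)
        all_url = parse_json(text)
        for url in all_url:
            # get_page is not defined in this article; it is assumed to
            # return the article page's HTML, or None on failure
            resp = get_page(url)
            data = parse_page(resp)
            if data:
                save_database(data)

    if __name__ == '__main__':
        pool = Pool()
        # Offset 0 first, then 18, 38, 58, ... for 500 pages
        offsets = ([0] + [i*20+18 for i in range(500)])
        pool.map(main, offsets)
        pool.close()
        pool.join()
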
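The function's offset parameter stands for the page. From what I observed, the offset of Guokr's last page is 12758, which makes 637 pages. Here we crawl only 500 of them. The process pool's map method is used much like Python's built-in map.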
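To see how `Pool.map` parallels the built-in `map`, here is a small self-contained sketch. The `describe` function is a hypothetical stand-in for `main`, so nothing is downloaded, and `build_offsets` is a name I introduced for the offset list used above.

```python
from multiprocessing import Pool


def describe(offset):
    """Stand-in for main(): report which request an offset corresponds to."""
    return 'offset=%d' % offset


def build_offsets(pages):
    """Offsets used in the article: 0 first, then 18, 38, 58, ..."""
    return [0] + [i * 20 + 18 for i in range(pages)]


if __name__ == '__main__':
    offsets = build_offsets(3)
    # Built-in map: runs describe() one call at a time in this process.
    serial = list(map(describe, offsets))
    # Pool.map: same call shape, but the work is spread over worker
    # processes; results still come back in input order.
    with Pool(2) as pool:
        parallel = pool.map(describe, offsets)
    print(serial == parallel)
```

With this numbering, offset 12758 corresponds to i = (12758 - 18) / 20 = 637, which agrees with the page count mentioned above.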
That's it: this is how we can crawl pages that load their content via Ajax.