python - How to scrape a website whose URL does not change between pages
伊谢尔伦 2017-04-18 10:13:25
<a href="javascript:__doPostBack('AspNetPager1','3')" class="Pager" title="转到第3页" style="margin-right:5px;">[3]</a>
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}

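How can a crawler scrape a site that paginates this way? The URL does not change after turning the page. I previously used bs4 and selenium to simulate the page turns and then scrape each page, but the volume of data is too large and this approach is too slow: 80% of the time is wasted on turning pages.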


All replies (2)
小葫芦

This has to be analyzed for the specific website, because different sites handle pagination differently.
For the more common case, the following approach usually works:

  1. Open the browser's developer tools (debugging mode) and switch to the network panel

  2. Click "next page" and inspect the response of the corresponding network request. That request is usually the one that loads the next page

  3. Look at the request's headers and parameters, and work out the pattern that changes from page to page

  4. Use Python to simulate those HTTP requests and fetch the pages in batches

  5. Extract the information from the responses; lxml is recommended for HTML parsing

As for how to simulate HTTP requests, see the article "python to simulate HTTP requests".
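For the `__doPostBack` pager shown in the question, the steps above might look roughly like the sketch below. It assumes the site is a typical ASP.NET Web Forms page, where `__doPostBack` submits the page's own form by POST together with its hidden fields (such as `__VIEWSTATE`), and that the server accepts the same form when it is replayed without a browser. Neither assumption has been checked against the actual site. `HiddenFieldParser`, `build_postback_payload`, and `fetch_page` are hypothetical helper names, and the sketch uses only the standard library rather than lxml to keep it self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_payload(html, event_target, event_argument):
    """Build the form data that __doPostBack(target, argument) would submit."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)  # e.g. __VIEWSTATE, if the page has it
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = str(event_argument)
    return payload


def fetch_page(url, current_html, page_number, target="AspNetPager1"):
    """POST the postback form to the same URL and return the next page's HTML.

    Assumption: the pager control is named AspNetPager1, as in the question.
    """
    payload = build_postback_payload(current_html, target, page_number)
    data = urlencode(payload).encode("utf-8")
    # Passing data makes urllib send a POST request.
    request = Request(url, data=data, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In a crawl loop, the HTML returned for one page would be passed as `current_html` when requesting the next one, since the hidden field values generally change with every response. If the site also depends on cookies, a cookie-aware opener or a session object from a third-party HTTP library would probably be needed as well.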

Peter_Zhu

There may be an AJAX request behind the pager; if so, just capture that request and call it directly.
