Web crawler - how can Python scrape JS-generated data?
高洛峰 2017-04-17 17:56:25

I want to scrape the "New Albums" chart (新碟榜) and "Recent Popular Playlists" (近期热门歌单) from Douban Music (music.douban.com). Looking at the page source, both seem to be generated by JS. Can anyone suggest a way to scrape this data? Thanks!


All replies (9)
小葫芦

I write crawlers with Jsoup, and I often run into pages whose HTML response contains no content even though the browser displays it. In every such case, the fix is to analyze the page's HTTP request log and its JS code. The usual situations are:

1. Some page elements are hidden -> adjust the selector to solve the problem
2. Some data is embedded in JS/JSON objects -> extract the relevant strings, then parse them
3. Data is loaded through an API -> forge the requests yourself to get the data

And there is one more, ultimate method:
4. Use a headless browser such as PhantomJS or CasperJS
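For case 2, when the data sits inside an inline script as a JS object literal, you can often pull it out with a regex and hand it to a JSON parser. A minimal Python sketch; the HTML snippet and the `chartData` variable name are made up for illustration:

```python
import json
import re

# Hypothetical page source: the data is embedded in an inline <script>
# as a JS object literal, which is the situation in case 2 above.
html = """
<script>
var chartData = {"albums": [{"name": "Wild"}, {"name": "Dangerous Woman"}]};
</script>
"""

# Cut out the JSON assigned to the JS variable and parse it.
match = re.search(r'var chartData = (\{.*?\});', html, re.S)
data = json.loads(match.group(1))
names = [album["name"] for album in data["albums"]]
print(names)  # ['Wild', 'Dangerous Woman']
```

This only works when the embedded literal is valid JSON; if the site emits real JS (unquoted keys, functions), a headless browser is the safer route.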

黄舟

Some of the answers already mention that you can analyze the interface and crawl it directly. A big advantage of hitting the interface directly is that you don't have to parse HTML yourself, because most interfaces return JSON. I feel happy just thinking about it~

There are other methods too, such as PhantomJS, which is simple and easy to use. Python is not omnipotent; it is more valuable when combined with other tools, and several of my small projects are exactly such combinations.

Here is the official example code; with a little modification it can do what you need.

console.log('Loading a web page');
var page = require('webpage').create();
var url = 'http://phantomjs.org/';
page.open(url, function (status) {
  //Page is loaded!
  phantom.exit();
});

Modified:

var page = require('webpage').create();
var url = 'http://phantomjs.org/';
page.open(url, function (status) {
  page.evaluate(function() {
    // Once the page has finished executing, content generated by plain JS
    // is usually available, but Ajax-generated content is not guaranteed yet
    document.getElementById('xxx'); // You can operate on the DOM here and try to grab the content you want
    // ...
  });
  phantom.exit();
});

In practice, though, you often need to wait for the Ajax calls to finish before you start parsing the page. For that there is another official sample, the `waitFor` function: it lets you wait until all requests for the page have completed before continuing, so you get the fully loaded page and can then do whatever you need.
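The same wait-then-parse idea can be sketched in plain Python, independent of any browser tool. A minimal polling helper in the spirit of PhantomJS's `waitFor`; the function name, timings, and toy condition are illustrative, not from the original answer:

```python
import time

def wait_for(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Mirrors the idea of the PhantomJS waitFor sample: instead of parsing
    immediately, keep checking until the page (or any resource) is ready.
    `condition` is any zero-argument callable; returns True on success,
    False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Toy usage: the condition flips to True after a short delay, standing in
# for an Ajax request that completes some time after page load.
start = time.monotonic()
print(wait_for(lambda: time.monotonic() - start > 0.3))  # True
```

Selenium ships the same pattern built in (`WebDriverWait` with expected conditions), which is usually preferable to hand-rolling the loop.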

洪涛

Find the data interface yourself

左手右手慢动作

The data should all come from an API interface

小葫芦

Example of scraping the new-albums chart with Selenium:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://music.douban.com/')
for i in driver.find_elements_by_css_selector('.new-albums .album-title'):
    print(i.text)

Results:
Open today
Jay Chou's Bedside Story
H.A.M.
3집 EX'ACT
Wild
Dangerous Woman
In the dark
Last Year Was Complicated

阿神

In Chrome, press F12, click around the page, and watch the requests; it's easy to find the URL and parameters. Just construct the request yourself, then parse the returned content.

Ty80

In index.html, this JS file is referenced:

<script type="text/javascript" src="https://img3.doubanio.com/misc/mixed_static/37fa28b9fa94889c.js"></script><script type="text/javascript">

Open this JS file and you can see:

  React.render(React.createElement(component, {"moreUrl":"\/chart","sections":[{"albums":[{"name":"今日營業中","performers":"林宥.................
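The JSON passed to `React.render` can be cut out of that JS file and parsed directly. A minimal Python sketch; the snippet below uses a shortened, made-up payload of the same shape, since the real file is much longer:

```python
import json
import re

# Shortened, made-up stand-in for the React.render(...) call found in the
# Douban JS file; the real props payload has the same shape but is longer.
js = ('React.render(React.createElement(component, '
      '{"moreUrl":"\\/chart","sections":[{"albums":[{"name":"Wild"}]}]}), root);')

# Grab the object literal passed as React.render's props and parse it.
match = re.search(r'React\.createElement\(component, (\{.*\})\)', js)
props = json.loads(match.group(1))
print(props["sections"][0]["albums"][0]["name"])  # Wild
```

This works because the props argument happens to be valid JSON; `\/` is a legal JSON escape for `/`, so no preprocessing is needed.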

Ty80

Open Chrome, inspect the element, and look for JS files in the Network tab. Usually a JS file with an unusual name is the one you're looking for. For example, this one,

黄舟

The most direct way is to use Selenium
