Scraping Pages with Dynamic Content Using Node.js
Dynamic content is a common obstacle for web scrapers. A typical case is a page whose elements are created by JavaScript after the initial page load. In such scenarios, the standard approach of downloading the HTML and parsing it will not find those elements.
Consider this problem when using request and cheerio in Node.js. The following code fetches a page and tries to read the links in a list, but the dynamic elements are not present in the HTML that cheerio loads:
    var request = require('request');
    var cheerio = require('cheerio');

    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

    request(url, function (err, res, html) {
        var $ = cheerio.load(html);
        $('.listMain > li').each(function () {
            console.log($(this).find('a').attr('href'));
        });
    });
This code often prints nothing. cheerio only parses the markup it is given and does not execute scripts, so list items that the page's JavaScript adds after loading never appear in the HTML that request downloads. So, how can we retrieve these elements using Node.js?
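Before reaching for a heavier tool, it can help to confirm that the list really is empty in the raw HTML. The sketch below uses only the Node.js standard library. The helper name countStaticListItems and both sample pages are made up for illustration, and the regular expression is a rough check for this one pattern (a ul with a simple class name), not a general HTML parser.

```javascript
// Rough diagnostic: count <li> tags inside the first <ul> whose class
// attribute contains `className`, looking only at the raw HTML string.
// Hypothetical helper for illustration; not part of cheerio or request.
// Assumes `className` contains no regular-expression metacharacters.
function countStaticListItems(html, className) {
  // Capture everything between the opening <ul class="..."> and the next </ul>.
  const listPattern = new RegExp(
    '<ul[^>]*class="[^"]*\\b' + className + '\\b[^"]*"[^>]*>([\\s\\S]*?)</ul>',
    'i'
  );
  const match = html.match(listPattern);
  if (!match) return 0;
  const items = match[1].match(/<li[\s>]/gi);
  return items ? items.length : 0;
}

// Made-up sample: the list is empty in the markup and filled in later by a script.
const dynamicPage =
  '<ul class="listMain"></ul>' +
  '<script>/* adds list items after load */</script>';

// Made-up sample: the same list rendered on the server.
const staticPage =
  '<ul class="listMain"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>';

console.log(countStaticListItems(dynamicPage, 'listMain'));
console.log(countStaticListItems(staticPage, 'listMain'));
```

If the count is zero for the HTML that request returns while the list is visible in a browser, the items are most likely being added by client-side JavaScript.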
Solution: Utilizing PhantomJS
To handle dynamic content, we can employ PhantomJS, a headless web browser that can execute JavaScript. PhantomJS allows us to simulate a browser interacting with the page and retrieve elements as they become available. Here's an example using PhantomJS:
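    var phantom = require('phantom');

    // Callback-style API from older releases of the phantom package.
    phantom.create(function (ph) {
        ph.createPage(function (page) {
            var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
            page.open(url, function () {
                // Inject jQuery so that $ is available inside the page.
                page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
                    page.evaluate(function () {
                        // Runs inside the page, after its scripts have executed.
                        var hrefs = [];
                        $('.listMain > li').each(function () {
                            hrefs.push($(this).find('a').attr('href'));
                        });
                        return hrefs;
                    }, function (hrefs) {
                        // Back in Node.js: print the collected links, then shut PhantomJS down.
                        (hrefs || []).forEach(function (href) {
                            console.log(href);
                        });
                        ph.exit();
                    });
                });
            });
        });
    });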
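In this code, we first inject jQuery into the page with includeJs so that the $ selector is available there. The function passed to page.evaluate then runs inside the page itself, after the page's own scripts have executed, so it can read the href attribute of each dynamically created link. Once the evaluation finishes, ph.exit() shuts PhantomJS down.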
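Opening the page does not guarantee that the dynamic elements already exist at the moment the evaluation runs, since a script may add them slightly later. A common workaround is to poll until a condition holds. The following is a minimal, library-independent sketch of that idea. The names waitFor, timeoutMs, intervalMs, and simulatedCheck are hypothetical and chosen for illustration; in a real scraper the check would ask the page whether '.listMain > li' matches anything, which is assumed here rather than shown.

```javascript
// Poll `check` until it yields a truthy value or the timeout elapses.
// `check` may return a value or a promise. Hypothetical helper for
// illustration; not part of PhantomJS or the phantom package.
function waitFor(check, options) {
  const timeoutMs = (options && options.timeoutMs) || 5000;
  const intervalMs = (options && options.intervalMs) || 100;
  const deadline = Date.now() + timeoutMs;

  return new Promise(function (resolve, reject) {
    function attempt() {
      Promise.resolve()
        .then(check)
        .then(function (result) {
          if (result) {
            resolve(result);
          } else if (Date.now() >= deadline) {
            reject(new Error('Timed out waiting for condition'));
          } else {
            setTimeout(attempt, intervalMs);
          }
        })
        .catch(reject); // an error thrown by `check` aborts the wait
    }
    attempt();
  });
}

// Simulated page state (made up): the links "appear" on the third poll.
let polls = 0;
function simulatedCheck() {
  polls += 1;
  return polls >= 3 ? ['/shop/1', '/shop/2'] : null;
}

waitFor(simulatedCheck, { timeoutMs: 1000, intervalMs: 10 })
  .then(function (hrefs) {
    console.log('Found links:', hrefs);
  })
  .catch(function (err) {
    console.error(err.message);
  });
```

Because the check may return a promise, the same helper could wrap an asynchronous call into the page, with the timeout guarding against a list that never appears.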