Have you ever encountered a web page requiring actions like “clicking a button” to reveal more content? Such pages are called "dynamic webpages," as they load more content based on user interaction. In contrast, static webpages display all their content at once without requiring user actions.
Scraping content from dynamic pages can be daunting as it requires simulating user interactions, such as clicking a button to access additional hidden content. In this tutorial, you'll learn how to scrape data from a webpage with infinite scrolling via a "Load more" button.
To follow along with this tutorial, you need Node.js and npm installed on your machine, along with a code editor such as Visual Studio Code.
In addition, you’ll need to have a basic understanding of HTML, CSS, and JavaScript. You’ll also need a web browser like Chrome.
Create a new folder, then open it in your code editor. Open a new terminal from the editor's Terminal tab (in Visual Studio Code, you'll find it in the top menu bar).
Next, run the following command in the terminal to install the needed packages for this build.
$ npm install cheerio puppeteer
Create a new file inside your project folder in the code editor and name it dynamicScraper.js.
Excellent work, buddy!
Puppeteer is a powerful Node.js library that allows you to control headless Chrome browsers, making it ideal for interacting with webpages. With Puppeteer, you can target a webpage using the URL, access the contents, and easily extract data from that page.
In this section, you’ll learn how to open a page using a headless browser, access the content, and retrieve the HTML content of that page. The target website for this tutorial is https://www.scrapingcourse.com/button-click.
Note: Write all the code for this tutorial inside dynamicScraper.js.
Start by importing Puppeteer using the require() Node.js built-in function, which helps you load up modules: core modules, third-party libraries (like Puppeteer), or custom modules (like your local JS files).
const puppeteer = require('puppeteer');
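For reference, the same require() call handles all three module types — the ./helpers path below is just a hypothetical local file, not part of this project:

const fs = require('fs');               // core module
const cheerio = require('cheerio');     // third-party library
const helpers = require('./helpers');   // custom local module (hypothetical)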
Next, define a variable to store your target URL. This isn’t mandatory, but it keeps your code cleaner, since you can reference the variable from anywhere in the script.
const url = 'https://www.scrapingcourse.com/button-click';
The next step is to create the function that’ll launch the headless browser and retrieve the HTML contents of the target page. An Immediately Invoked Function Expression (IIFE) is a convenient choice here, as it lets you use async/await at the top level of the script.
Define an asynchronous IIFE with a try-and-catch block:
(async () => {
  try {
    // Code goes here
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
Note: All remaining code for this section goes inside the try block.
Right inside the IIFE, create a new instance of Puppeteer and open a new page for the interaction.
Launch a browser instance with the launch method and pass it the headless option, which can be set to either true or false. Setting headless to true runs the browser invisibly in the background, while setting it to false opens a visible browser window so you can watch the interactions.
After you’ve launched Puppeteer, you also want to call the newPage method, which triggers the opening of a new tab in the headless browser.
// Launch Puppeteer with a visible browser window
const browser = await puppeteer.launch({ headless: false });

// Open a new page (tab)
const page = await browser.newPage();
Now, use the page.goto method to open the target URL in the new tab. You also want Puppeteer to consider the page ready for interaction and data extraction only once it has loaded its essential resources (like images and JavaScript).
To ensure the page is ready, Puppeteer provides an option called waitUntil, which can take in various values that define different conditions for loading the page:
load: This waits for the load event to fire, which occurs after the HTML document and its resources (e.g., images, CSS, JS) have been loaded. However, this may not account for additional JavaScript-rendered content that loads after the load event.
domcontentloaded: This waits for the DOMContentLoaded event, which is triggered once the initial HTML is parsed. But this loads before external resources (like images or additional JS) load.
networkidle2: This waits until there are no more than two active network requests (ongoing HTTP requests for images, scripts, or other resources) for at least 500 milliseconds. This value is preferred for pages that make small, continuous requests that don't affect the main content.
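For example, if you only needed the initial HTML and didn't care about images or late-loading scripts, you could pass domcontentloaded instead (shown here purely for illustration):

// Hypothetical alternative: resolve as soon as the initial HTML is parsed
await page.goto(url, { waitUntil: 'domcontentloaded' });

This tutorial sticks with networkidle2, as shown in the snippet below.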
// Navigate to the target URL
await page.goto(url, {
  waitUntil: 'networkidle2', // Ensure the page is fully loaded
});
Finally, retrieve the full HTML content of the current page using the page.content() method. Just as importantly, close the browser instance with browser.close() at the end of your script to avoid unnecessary memory usage that can slow your system down.
// Get the full HTML content of the page
const html = await page.content();

// Log the entire HTML content
console.log(html);

// Close the browser
await browser.close();
With the code as it is, the browser opens and closes almost immediately, so you might not even get to view the page. You can delay closing it for a few seconds using the page.waitForTimeout method, placed just before the browser.close call.
// Delay for 10 seconds to allow you to see the browser
await page.waitForTimeout(10000);
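Note: recent Puppeteer releases removed page.waitForTimeout(). If your installed version no longer has it (an assumption about your setup, not a requirement of this tutorial), a plain Promise-based delay achieves the same pause:

// Equivalent 10-second delay without page.waitForTimeout()
await new Promise((resolve) => setTimeout(resolve, 10000));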
Here’s the entire code for this section:
const puppeteer = require('puppeteer');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Get and log the entire HTML content of the page
    const html = await page.content();
    console.log(html);

    // Delay for 10 seconds to allow you to see the browser
    await page.waitForTimeout(10000);

    // Close the browser
    await browser.close();
  } catch (error) {
    console.error('Error fetching the page:', error.message);
  }
})();
Save your file and run the script inside of your terminal using the command below:
$ node dynamicScraper.js
The script will open a browser window like the one below:
The browser loads, Puppeteer fetches the page's entire HTML content, and the console logs the content to the terminal.
Here’s the output you should get in your terminal:
<title>Load More Button Challenge - ScrapingCourse.com</title>
<header>
    <!-- Navigation Bar -->
    <nav>
        <a href="/">
            <img src="logo.svg" alt="...">
            <span>Scraping Course</span>
        </a>
    </nav>
</header>
<main>
    <!-- Product Grid -->
    <div>
        <!-- ... product items ... -->
    </div>
</main>

Note that the code structure above is what your output should look like.

Wow! You should be proud of yourself for getting this far. You’ve just completed your first attempt at scraping the contents of a webpage.

Simulate the Load More Products Process

Here, you want to access more products. To do that, you need to click the “Load more” button multiple times until you’ve either exhausted the list of products or reached the number of products you want to access.

To access this button and click it, you must first locate the element using a CSS selector (the element’s class, id, attribute, or tag name).

This tutorial aims to get at least 48 products from the target website, which means clicking the “Load more” button at least three times.

Start by locating the “Load more” button: go to the target website, find the button, right-click it, and select the inspect option. This opens the developer tools, where you’ll see that the “Load more” button element has an id attribute with the value "load-more-btn". You can use this id selector to locate the button during the simulation and click it multiple times.

Back in the code, still inside the try block and right after the line that logs the previous HTML content for the default 12 products, define the number of times you want to click the button. Recall that each click loads an additional 12 products, so getting to 48 products requires three clicks to load the remaining 36.

// Number of times to click "Load More"
const clicks = 3;
Next, write a loop to simulate the clicks. The simulation uses a for loop that runs clicks times, once for each press of the button.
// Click the "Load more" button the specified number of times
for (let i = 0; i < clicks; i++) {
  // Code for each click goes here
}
Note: The remaining code for this section goes inside the for loop (which itself sits inside the try block).
To help with debugging and tracking the output, log out the current click attempt.
// Log the current click attempt
console.log(`Clicking the 'Load More' button - Attempt ${i + 1}`);
Next, you want to be able to locate the “Load more” button and click it at least three times. But before simulating the click, you should ensure the “Load more” button is available.
Puppeteer provides the waitForSelector() method to check the visibility of an element before using it.
For the “Load more” button, you’ll have to first locate it using the value of the id selector on it and then check for the visibility status like this:
// Wait for the "Load more" button to be visible before interacting with it
await page.waitForSelector('#load-more-btn', { visible: true });
Now that you know the “Load more” button is available, you can click it using the Puppeteer click() method.
// Click the "Load more" button
await page.click('#load-more-btn');
Once you simulate a click on the “Load more” button, you should wait for the content to load before simulating another click, since the new data depends on a server request. You can introduce a delay between the requests using setTimeout(), wrapped in a Promise so it can be awaited.
The code below notifies the script to wait at least two seconds before simulating another click on the “Load more” button.
// Wait for two seconds so the newly requested products can load
await new Promise((resolve) => setTimeout(resolve, 2000));
To wrap things up for this section, you want to fetch the current HTML content after each click using the content() method and then log out the output to the terminal.
// Fetch the updated page content after the click
// (declare html with let before the loop so it can be reassigned here)
html = await page.content();

// Log the updated HTML content
console.log(html);
Your complete code up until now:
const puppeteer = require('puppeteer');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Get and log the initial HTML content (the default 12 products)
    // Use let so the variable can be updated after each click
    let html = await page.content();
    console.log(html);

    // Number of times to click "Load More"
    const clicks = 3;

    for (let i = 0; i < clicks; i++) {
      // Log the current click attempt
      console.log(`Clicking the 'Load More' button - Attempt ${i + 1}`);

      // Wait for the "Load more" button to be visible, then click it
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');

      // Wait for two seconds so the newly requested products can load
      await new Promise((resolve) => setTimeout(resolve, 2000));

      // Fetch and log the updated HTML content after the click
      html = await page.content();
      console.log(html);
    }

    // Close the browser
    await browser.close();
  } catch (error) {
    console.error('Error:', error.message);
  }
})();

Here’s the output of simulating the button click three times to get 48 products:
Now, you only care about the final output containing all 48 products, so you need to clean up the previous code by removing the content fetching and logging inside the loop.
You’ll also move the html variable assignment to after the for loop block, so you get a single output with all 48 products.
Your cleaned-up code should be identical to this code snippet:
const puppeteer = require('puppeteer');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Number of times to click "Load More"
    const clicks = 3;

    for (let i = 0; i < clicks; i++) {
      // Wait for the "Load more" button and click it
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');

      // Wait for two seconds so the new products can load
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    // Get the full HTML content after all clicks
    const html = await page.content();
    console.log(html);

    // Close the browser
    await browser.close();
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
Now, let’s get into the HTML parsing using Cheerio.
First of all, Cheerio needs to have access to the HTML content it wants to parse, and for that, it provides a load() method that takes in that HTML content, making it accessible using jQuery-like syntax.
Create an instance of the Cheerio library with the HTML content:
// Add this require at the top of your file, alongside puppeteer
const cheerio = require('cheerio');

// Load the scraped HTML so it can be queried with a jQuery-like syntax
const $ = cheerio.load(html);
You can now use $ to query and manipulate elements in the loaded HTML.
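As a quick, optional sanity check, you can query the page title from the loaded HTML; based on the output captured earlier, it should print the challenge page's title:

// Optional: confirm Cheerio loaded the page correctly
console.log($('title').text()); // "Load More Button Challenge - ScrapingCourse.com"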
Next, initialize an array to store the product information. This array will hold the extracted data, and each product will be stored as an object with its name, price, image, and link.
// Array to store the parsed product information
const products = [];
Recall that each product has a class .product-item. You’ll use this with the variable instance of Cheerio ($) to get each product and then perform some manipulations.
The .each() method is used to iterate through each matched element with the .product-item class selector.
// Iterate over every element matching the .product-item selector
$('.product-item').each((_, element) => {
  // Extraction code goes here
});
Let’s retrieve the product details from each product using the class selector of that particular detail. For instance, to get the product name, find the child element inside each product that holds the name (on the target page this is the element with the .product-name class; you can confirm the exact class names by inspecting a product card), retrieve its text content, and trim any whitespace.
// Get the product name and trim surrounding whitespace
// (.product-name is assumed from the target page — verify it in the developer tools)
const name = $(element).find('.product-name').text().trim();
Leveraging this concept, let’s get the price, image URL, and link using their class attribute.
// Get the price, image URL, and product link
// (class names assumed from the target page — verify them in the developer tools;
// adjust the link selector if the product card itself is the anchor)
const price = $(element).find('.product-price').text().trim();
const image = $(element).find('.product-image').attr('src');
const link = $(element).find('a').attr('href');
Now that you have all the expected information, the next step is to push each parsed product's details as an individual object to the products array.
// Add the product details to the products array
products.push({ name, price, image, link });
Finally, log out the products array to get the expected output in the terminal.
// Log the parsed product information
console.log(products);
Your entire code should look like this code snippet:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Click "Load More" three times to load all 48 products
    const clicks = 3;
    for (let i = 0; i < clicks; i++) {
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    // Get the full HTML content after all clicks
    const html = await page.content();

    // Close the browser
    await browser.close();

    // Load the HTML into Cheerio
    const $ = cheerio.load(html);

    // Array to store the parsed product information
    const products = [];

    // Extract the details of each product
    // (child class names assumed from the target page — verify them in the developer tools)
    $('.product-item').each((_, element) => {
      const name = $(element).find('.product-name').text().trim();
      const price = $(element).find('.product-price').text().trim();
      const image = $(element).find('.product-image').attr('src');
      const link = $(element).find('a').attr('href');

      products.push({ name, price, image, link });
    });

    // Log the parsed product information
    console.log(products);
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
Save your file and run the script again with the command below. The terminal should print an array of product objects, each containing a name, price, image, and link.
$ node dynamicScraper.js
The next step is to export the parsed product information, which currently lives in JavaScript objects (JSON-like data), into the Comma-Separated Values (CSV) format. We’ll use the json2csv library to convert the parsed data into its corresponding CSV format.
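If you haven't installed json2csv yet (the earlier install command only covered cheerio and puppeteer), add it to the project first:

$ npm install json2csv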
Start by importing the required modules.
Node.js provides the file system (fs) module for file handling, such as writing data to a file. After importing the fs module, you should destructure the parse() method from the json2csv library.
// Import the Node.js file system module and the parse method from json2csv
const fs = require('fs');
const { parse } = require('json2csv');
CSV files usually require column headers; write them in the same order as your parsed information. Here, the parsed data is the products array, where each element is an object with four keys (name, price, image, and link). Use these object keys as your column headers for proper mapping.
Define the fields (Column headers) for your CSV file:
// Define the fields (column headers) for the CSV file
const fields = ['name', 'price', 'image', 'link'];
Now that you’ve defined your fields, the next step is to convert the parsed information to CSV. The parse() method follows this pattern: parse(dataToConvert, { fields: yourColumnHeaders }).
// Convert the products array to CSV format
const csv = parse(products, { fields });
You now have to save this CSV information into a new file with the .csv file extension. When using Node.js, you can handle file creation using the writeFileSync() method on the fs module. This method takes two parameters: the file name and the data.
// Write the CSV data to a new file named products.csv
fs.writeFileSync('products.csv', csv);
Your complete code for this section should look like this:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const fs = require('fs');
const { parse } = require('json2csv');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Click "Load More" three times to load all 48 products
    const clicks = 3;
    for (let i = 0; i < clicks; i++) {
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    // Get the full HTML content after all clicks
    const html = await page.content();

    // Close the browser
    await browser.close();

    // Load the HTML into Cheerio and parse the products
    const $ = cheerio.load(html);
    const products = [];

    // Child class names assumed from the target page — verify them in the developer tools
    $('.product-item').each((_, element) => {
      const name = $(element).find('.product-name').text().trim();
      const price = $(element).find('.product-price').text().trim();
      const image = $(element).find('.product-image').attr('src');
      const link = $(element).find('a').attr('href');
      products.push({ name, price, image, link });
    });

    // Convert the parsed data to CSV and write it to a file
    const fields = ['name', 'price', 'image', 'link'];
    const csv = parse(products, { fields });
    fs.writeFileSync('products.csv', csv);

    console.log(`Saved ${products.length} products to products.csv`);
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
You should see an automatic addition of a file named products.csv to your file structure once you save and run the script.
The output - products.csv:
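The exact product rows depend on the live site, so they aren't reproduced here, but the first line of products.csv will be the header row built from the fields you defined (json2csv quotes values by default):

"name","price","image","link"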
This tutorial delved into the intricacies of scraping data from a page that requires simulated interaction to access its hidden contents. You learned how to scrape dynamic pages using Node.js and a few additional libraries, parse the scraped data into a more organized format, and export it to a CSV file.