Have you ever encountered a web page requiring actions like “clicking a button” to reveal more content? Such pages are called "dynamic webpages," as they load more content based on user interaction. In contrast, static webpages display all their content at once without requiring user actions.
Scraping content from dynamic pages can be daunting as it requires simulating user interactions, such as clicking a button to access additional hidden content. In this tutorial, you'll learn how to scrape data from a webpage with infinite scrolling via a "Load more" button.
To follow along with this tutorial, you need Node.js and npm installed on your machine, along with a code editor such as Visual Studio Code.
In addition, you’ll need to have a basic understanding of HTML, CSS, and JavaScript. You’ll also need a web browser like Chrome.
Create a new folder, then open it in your code editor. Open a new terminal from the editor's Terminal tab (in Visual Studio Code, you'll find it in the top menu bar).
Next, run the following command in the terminal to install the needed packages for this build.
$ npm install cheerio puppeteer
Create a new file inside your project folder in the code editor and name it dynamicScraper.js.
Excellent work, buddy!
Puppeteer is a powerful Node.js library that allows you to control headless Chrome browsers, making it ideal for interacting with webpages. With Puppeteer, you can target a webpage using the URL, access the contents, and easily extract data from that page.
In this section, you’ll learn how to open a page using a headless browser, access the content, and retrieve the HTML content of that page. The target website for this tutorial is https://www.scrapingcourse.com/button-click.
Note: Write all the code for this tutorial inside dynamicScraper.js.
Start by importing Puppeteer using the require() Node.js built-in function, which helps you load up modules: core modules, third-party libraries (like Puppeteer), or custom modules (like your local JS files).
const puppeteer = require('puppeteer');
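For reference, the same require() call handles all three module types — the ./helpers path below is just a hypothetical local file, not part of this project:

const fs = require('fs');               // core module
const cheerio = require('cheerio');     // third-party library
const helpers = require('./helpers');   // custom local module (hypothetical)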
Next, define a variable to store your target URL. This isn’t mandatory, but it keeps your code cleaner, since you can reference the variable from anywhere in the script.
const url = 'https://www.scrapingcourse.com/button-click';
The next step is to create the function that’ll launch the headless browser and retrieve the HTML contents of the target page. An Immediately Invoked Function Expression (IIFE) is a convenient choice here, as it lets you use async/await at the top level of the script.
Define an asynchronous IIFE with a try-and-catch block:
(async () => {
  try {
    // Code goes here
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
Note: All remaining code for this section goes inside the try block.
Right inside the IIFE, create a new instance of Puppeteer and open a new page for the interaction.
Launch a browser instance with the launch method and pass it the headless option, which can be set to either true or false. Setting headless to true runs the browser invisibly in the background, while setting it to false opens a visible browser window so you can watch the interactions.
After you’ve launched Puppeteer, you also want to call the newPage method, which triggers the opening of a new tab in the headless browser.
// Launch Puppeteer with a visible browser window
const browser = await puppeteer.launch({ headless: false });

// Open a new page (tab)
const page = await browser.newPage();
Now, use the page.goto method to open the target URL in the new tab. You also want Puppeteer to consider the page ready for interaction and data extraction only once it has loaded its essential resources (like images and JavaScript).
To ensure the page is ready, Puppeteer provides an option called waitUntil, which can take in various values that define different conditions for loading the page:
load: This waits for the load event to fire, which occurs after the HTML document and its resources (e.g., images, CSS, JS) have been loaded. However, this may not account for additional JavaScript-rendered content that loads after the load event.
domcontentloaded: This waits for the DOMContentLoaded event, which is triggered once the initial HTML is parsed. But this loads before external resources (like images or additional JS) load.
networkidle2: This waits until there are no more than two active network requests (ongoing HTTP requests for images, scripts, or other resources) for at least 500 milliseconds. This value is preferred for pages that make small, continuous requests that don't affect the main content.
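For example, if you only needed the initial HTML and didn't care about images or late-loading scripts, you could pass domcontentloaded instead (shown here purely for illustration):

// Hypothetical alternative: resolve as soon as the initial HTML is parsed
await page.goto(url, { waitUntil: 'domcontentloaded' });

This tutorial sticks with networkidle2, as shown in the snippet below.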
// Navigate to the target URL
await page.goto(url, {
  waitUntil: 'networkidle2', // Ensure the page is fully loaded
});
Finally, retrieve the full HTML content of the current page using the page.content() method. Just as importantly, close the browser instance with browser.close() at the end of your script to avoid unnecessary memory usage that can slow your system down.
// Get the full HTML content of the page
const html = await page.content();

// Log the entire HTML content
console.log(html);

// Close the browser
await browser.close();
With the code as it is, the browser opens and closes almost immediately, so you might not even get to view the page. You can delay closing it for a few seconds using the page.waitForTimeout method, placed just before the browser.close call.
// Delay for 10 seconds to allow you to see the browser
await page.waitForTimeout(10000);
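Note: recent Puppeteer releases removed page.waitForTimeout(). If your installed version no longer has it (an assumption about your setup, not a requirement of this tutorial), a plain Promise-based delay achieves the same pause:

// Equivalent 10-second delay without page.waitForTimeout()
await new Promise((resolve) => setTimeout(resolve, 10000));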
Here’s the entire code for this section:
const puppeteer = require('puppeteer');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Get and log the entire HTML content of the page
    const html = await page.content();
    console.log(html);

    // Delay for 10 seconds to allow you to see the browser
    await page.waitForTimeout(10000);

    // Close the browser
    await browser.close();
  } catch (error) {
    console.error('Error fetching the page:', error.message);
  }
})();
Save your file and run the script inside of your terminal using the command below:
$ node dynamicScraper.js
The script will open a browser window like the one below:
The browser loads, Puppeteer fetches the page's entire HTML content, and the console logs the content to the terminal.
Here’s the output you should get in your terminal:
<title>Load More Button Challenge - ScrapingCourse.com</title>
<header>
    <!-- Navigation Bar -->
    <nav>
        <a href="/">
            <img src="logo.svg" alt="...">
            <span>Scraping Course</span>
        </a>
    </nav>
</header>
<main>
    <!-- Product Grid -->
    <div>
        <!-- ... product items ... -->
    </div>
</main>

Note that the code structure above is what your output should look like.

Wow! You should be proud of yourself for getting this far. You’ve just completed your first attempt at scraping the contents of a webpage.

Simulate the Load More Products Process

Here, you want to access more products. To do that, you need to click the “Load more” button multiple times until you’ve either exhausted the list of products or reached the number of products you want to access.

To access this button and click it, you must first locate the element using a CSS selector (the element’s class, id, attribute, or tag name).

This tutorial aims to get at least 48 products from the target website, which means clicking the “Load more” button at least three times.

Start by locating the “Load more” button: go to the target website, find the button, right-click it, and select the inspect option. This opens the developer tools, where you’ll see that the “Load more” button element has an id attribute with the value "load-more-btn". You can use this id selector to locate the button during the simulation and click it multiple times.

Back in the code, still inside the try block and right after the line that logs the previous HTML content for the default 12 products, define the number of times you want to click the button. Recall that each click loads an additional 12 products, so getting to 48 products requires three clicks to load the remaining 36.

// Number of times to click "Load More"
const clicks = 3;
Next, write a loop to simulate the clicks. The simulation uses a for loop that runs clicks times, once for each press of the button.
// Click the "Load more" button the specified number of times
for (let i = 0; i < clicks; i++) {
  // Code for each click goes here
}
Note: The remaining code for this section goes inside the for loop (which itself sits inside the try block).
To help with debugging and tracking the output, log out the current click attempt.
// Log the current click attempt
console.log(`Clicking the 'Load More' button - Attempt ${i + 1}`);
Next, you want to be able to locate the “Load more” button and click it at least three times. But before simulating the click, you should ensure the “Load more” button is available.
Puppeteer provides the waitForSelector() method to check the visibility of an element before using it.
For the “Load more” button, you’ll have to first locate it using the value of the id selector on it and then check for the visibility status like this:
// Wait for the "Load more" button to be visible before interacting with it
await page.waitForSelector('#load-more-btn', { visible: true });
Now that you know the “Load more” button is available, you can click it using the Puppeteer click() method.
// Click the "Load more" button
await page.click('#load-more-btn');
Once you simulate a click on the “Load more” button, you should wait for the content to load before simulating another click, since the new data depends on a server request. You can introduce a delay between the requests using setTimeout(), wrapped in a Promise so it can be awaited.
The code below notifies the script to wait at least two seconds before simulating another click on the “Load more” button.
// Wait for two seconds so the newly requested products can load
await new Promise((resolve) => setTimeout(resolve, 2000));
To wrap things up for this section, you want to fetch the current HTML content after each click using the content() method and then log out the output to the terminal.
// Fetch the updated page content after the click
// (declare html with let before the loop so it can be reassigned here)
html = await page.content();

// Log the updated HTML content
console.log(html);
Your complete code up until now:
const puppeteer = require('puppeteer');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Get and log the initial HTML content (the default 12 products)
    // Use let so the variable can be updated after each click
    let html = await page.content();
    console.log(html);

    // Number of times to click "Load More"
    const clicks = 3;

    for (let i = 0; i < clicks; i++) {
      // Log the current click attempt
      console.log(`Clicking the 'Load More' button - Attempt ${i + 1}`);

      // Wait for the "Load more" button to be visible, then click it
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');

      // Wait for two seconds so the newly requested products can load
      await new Promise((resolve) => setTimeout(resolve, 2000));

      // Fetch and log the updated HTML content after the click
      html = await page.content();
      console.log(html);
    }

    // Close the browser
    await browser.close();
  } catch (error) {
    console.error('Error:', error.message);
  }
})();

Here’s the output of simulating the button click three times to get 48 products:
Now, you only care about the final output containing all 48 products, so you need to clean up the previous code by removing the content fetching and logging inside the loop.
You’ll also move the html variable assignment to after the for loop block, so you get a single output with all 48 products.
Your cleaned-up code should be identical to this code snippet:
const puppeteer = require('puppeteer');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Number of times to click "Load More"
    const clicks = 3;

    for (let i = 0; i < clicks; i++) {
      // Wait for the "Load more" button and click it
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');

      // Wait for two seconds so the new products can load
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    // Get the full HTML content after all clicks
    const html = await page.content();
    console.log(html);

    // Close the browser
    await browser.close();
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
Now, let’s get into the HTML parsing using Cheerio.
First of all, Cheerio needs to have access to the HTML content it wants to parse, and for that, it provides a load() method that takes in that HTML content, making it accessible using jQuery-like syntax.
Create an instance of the Cheerio library with the HTML content:
// Add this require at the top of your file, alongside puppeteer
const cheerio = require('cheerio');

// Load the scraped HTML so it can be queried with a jQuery-like syntax
const $ = cheerio.load(html);
You can now use $ to query and manipulate elements in the loaded HTML.
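As a quick, optional sanity check, you can query the page title from the loaded HTML; based on the output captured earlier, it should print the challenge page's title:

// Optional: confirm Cheerio loaded the page correctly
console.log($('title').text()); // "Load More Button Challenge - ScrapingCourse.com"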
Next, initialize an array to store the product information. This array will hold the extracted data, and each product will be stored as an object with its name, price, image, and link.
// Array to store the parsed product information
const products = [];
Recall that each product has a class .product-item. You’ll use this with the variable instance of Cheerio ($) to get each product and then perform some manipulations.
The .each() method is used to iterate through each matched element with the .product-item class selector.
// Iterate over every element matching the .product-item selector
$('.product-item').each((_, element) => {
  // Extraction code goes here
});
Let’s retrieve the product details from each product using the class selector of that particular detail. For instance, to get the product name, find the child element inside each product that holds the name (on the target page this is the element with the .product-name class; you can confirm the exact class names by inspecting a product card), retrieve its text content, and trim any whitespace.
// Get the product name and trim surrounding whitespace
// (.product-name is assumed from the target page — verify it in the developer tools)
const name = $(element).find('.product-name').text().trim();
Leveraging this concept, let’s get the price, image URL, and link using their class attribute.
// Get the price, image URL, and product link
// (class names assumed from the target page — verify them in the developer tools;
// adjust the link selector if the product card itself is the anchor)
const price = $(element).find('.product-price').text().trim();
const image = $(element).find('.product-image').attr('src');
const link = $(element).find('a').attr('href');
Now that you have all the expected information, the next step is to push each parsed product's details as an individual object to the products array.
// Add the product details to the products array
products.push({ name, price, image, link });
Finally, log out the products array to get the expected output in the terminal.
// Log the parsed product information
console.log(products);
Your entire code should look like this code snippet:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Click "Load More" three times to load all 48 products
    const clicks = 3;
    for (let i = 0; i < clicks; i++) {
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    // Get the full HTML content after all clicks
    const html = await page.content();

    // Close the browser
    await browser.close();

    // Load the HTML into Cheerio
    const $ = cheerio.load(html);

    // Array to store the parsed product information
    const products = [];

    // Extract the details of each product
    // (child class names assumed from the target page — verify them in the developer tools)
    $('.product-item').each((_, element) => {
      const name = $(element).find('.product-name').text().trim();
      const price = $(element).find('.product-price').text().trim();
      const image = $(element).find('.product-image').attr('src');
      const link = $(element).find('a').attr('href');

      products.push({ name, price, image, link });
    });

    // Log the parsed product information
    console.log(products);
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
Save your file and run the script again with the command below. The terminal should print an array of product objects, each containing a name, price, image, and link.
$ node dynamicScraper.js
The next step is to export the parsed product information, which currently lives in JavaScript objects (JSON-like data), into the Comma-Separated Values (CSV) format. We’ll use the json2csv library to convert the parsed data into its corresponding CSV format.
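If you haven't installed json2csv yet (the earlier install command only covered cheerio and puppeteer), add it to the project first:

$ npm install json2csv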
Start by importing the required modules.
Node.js provides the file system (fs) module for file handling, such as writing data to a file. After importing the fs module, you should destructure the parse() method from the json2csv library.
// Import the Node.js file system module and the parse method from json2csv
const fs = require('fs');
const { parse } = require('json2csv');
CSV files usually require column headers; write them in the same order as your parsed information. Here, the parsed data is the products array, where each element is an object with four keys (name, price, image, and link). Use these object keys as your column headers for proper mapping.
Define the fields (Column headers) for your CSV file:
// Define the fields (column headers) for the CSV file
const fields = ['name', 'price', 'image', 'link'];
Now that you’ve defined your fields, the next step is to convert the parsed information to CSV. The parse() method follows this pattern: parse(dataToConvert, { fields: yourColumnHeaders }).
// Convert the products array to CSV format
const csv = parse(products, { fields });
You now have to save this CSV information into a new file with the .csv file extension. When using Node.js, you can handle file creation using the writeFileSync() method on the fs module. This method takes two parameters: the file name and the data.
// Write the CSV data to a new file named products.csv
fs.writeFileSync('products.csv', csv);
Your complete code for this section should look like this:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const fs = require('fs');
const { parse } = require('json2csv');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Click "Load More" three times to load all 48 products
    const clicks = 3;
    for (let i = 0; i < clicks; i++) {
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    // Get the full HTML content after all clicks
    const html = await page.content();

    // Close the browser
    await browser.close();

    // Load the HTML into Cheerio and parse the products
    const $ = cheerio.load(html);
    const products = [];

    // Child class names assumed from the target page — verify them in the developer tools
    $('.product-item').each((_, element) => {
      const name = $(element).find('.product-name').text().trim();
      const price = $(element).find('.product-price').text().trim();
      const image = $(element).find('.product-image').attr('src');
      const link = $(element).find('a').attr('href');
      products.push({ name, price, image, link });
    });

    // Convert the parsed data to CSV and write it to a file
    const fields = ['name', 'price', 'image', 'link'];
    const csv = parse(products, { fields });
    fs.writeFileSync('products.csv', csv);

    console.log(`Saved ${products.length} products to products.csv`);
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
You should see an automatic addition of a file named products.csv to your file structure once you save and run the script.
The output - products.csv:
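The exact product rows depend on the live site, so they aren't reproduced here, but the first line of products.csv will be the header row built from the fields you defined (json2csv quotes values by default):

"name","price","image","link"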
This tutorial delved into the intricacies of scraping data from a page that requires simulated interaction to access its hidden contents. You learned how to scrape dynamic pages using Node.js and a few additional libraries, parse the scraped data into a more organized format, and export it to a CSV file.