Scraping Data with DevTools and HAR Files

Linda Hamilton
Release: 2024-12-31 08:16:11

Data scraping is useful for anyone who needs to extract structured information from websites. With Chrome DevTools and HAR files, you can uncover undocumented APIs and capture the JSON data a site loads behind the scenes. In this post, I’ll share how I used these tools to scrape product data from Blinkit, a grocery delivery platform, and show you how you can do it too.

Why I Chose Data Scraping for My Grocery App

While building a grocery delivery app, I faced a major challenge: a lack of real data. Creating my own dataset from scratch would have been extremely time-consuming and offered no real advantage to the project. I needed a quicker, more practical solution, which led me to the idea of scraping data. By extracting product details from Blinkit, I could get accurate, real-world data to test and refine my app without spending days on data entry.

Common methods to scrape data on the web

  1. Manual Copy-Pasting

    • Simple but tedious. Suitable for extracting small amounts of data.
  2. Web Scraping Tools

    • Tools like Scrapy, BeautifulSoup, or Puppeteer automate the process of extracting data from websites.
    • Best for structured data extraction on a larger scale.
  3. API Integration

    • Some websites offer public APIs for accessing their data directly and legally.
    • Requires knowledge of API endpoints and authentication processes.
  4. Browser DevTools

    • Inspect network requests, capture HAR files, or analyze page elements directly in the browser.
    • Great for identifying hidden APIs or JSON data.
  5. Headless Browsers

    • Use headless browser libraries like Puppeteer or Selenium to automate navigation and scraping.
    • Ideal for sites requiring JavaScript rendering or interaction.
  6. Parsing HAR Files

    • HAR files capture all network activity for a webpage. They can be parsed to extract APIs, JSON responses, or other data.
    • Useful for sites with dynamic content or hidden data.
  7. HTML Parsing

    • Extract data by parsing HTML content using libraries like BeautifulSoup (Python) or Cheerio (Node.js).
    • Effective for simple, static websites.
  8. Data Extraction from PDFs or Images

    • Tools like PyPDF2, Tesseract (OCR), or Adobe APIs help extract text from files when data isn’t available online.
  9. Automated Scripts

    • Custom scripts written in Python, Node.js, or similar languages to scrape, parse, and store data.
    • Offers complete control over the scraping process.
  10. Third-Party Services

    • Use services like DataMiner, Octoparse, or Scrapy Cloud to handle scraping tasks for you.
    • Saves time but may have limitations based on service plans.

I Chose HAR File Parsing

What is a HAR File?

A HAR (HTTP Archive) file is a JSON-formatted archive file that records the network activity of a web page. It contains detailed information about every HTTP request and response, including headers, query parameters, payloads, and timings. HAR files are often used for debugging, performance analysis, and, in this case, data scraping.

Structure of a HAR File

A HAR file consists of several sections, with the primary ones being:

  1. Log

    • The root object of a HAR file, containing metadata about the recorded session and the captured entries.
  2. Entries

    • An array of objects where each entry represents an individual HTTP request and its corresponding response.

Key properties include:

  • request: Details about the request, such as URL, headers, method, and query parameters.
  • response: Information about the response, including status code, headers, and content.
  • timings: The breakdown of the time spent during the request-response cycle (e.g., DNS, connect, wait, receive).
  3. Pages

    • Contains data about the web pages loaded during the session, such as the page title, load time, and the timestamp of when the page was opened.
  4. Creator

    • Metadata about the tool or browser used to generate the HAR file, including its name and version.
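
To make this structure concrete, here is a minimal Python sketch that walks a HAR document and returns one row per entry. The `SAMPLE_HAR` dictionary is a hand-written stand-in rather than real captured data, and the names `load_har` and `summarize_entries` are illustrative. The keys it reads (`log`, `entries`, `request`, `response`, `content`, `mimeType`) follow the HAR 1.2 format.

```python
import json

# Hand-written stand-in for a real capture; every value here is a placeholder.
SAMPLE_HAR = {
    "log": {
        "version": "1.2",
        "creator": {"name": "example-tool", "version": "1.0"},
        "pages": [{"id": "page_1", "title": "https://example.com/"}],
        "entries": [
            {
                "pageref": "page_1",
                "request": {
                    "method": "GET",
                    "url": "https://example.com/api/items?page=1",
                    "headers": [],
                },
                "response": {
                    "status": 200,
                    "headers": [],
                    "content": {
                        "mimeType": "application/json",
                        "text": "{\"items\": []}",
                    },
                },
                "timings": {"dns": 1, "connect": 5, "wait": 40, "receive": 3},
            }
        ],
    }
}


def load_har(path):
    """Read a HAR file from disk; a HAR file is plain JSON."""
    # utf-8-sig also accepts files that start with a byte-order mark.
    with open(path, encoding="utf-8-sig") as f:
        return json.load(f)


def summarize_entries(har):
    """Return (method, status, mime_type, url) for every recorded entry."""
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry["response"]
        mime_type = response.get("content", {}).get("mimeType", "")
        rows.append((request["method"], response["status"], mime_type, request["url"]))
    return rows


if __name__ == "__main__":
    for method, status, mime_type, url in summarize_entries(SAMPLE_HAR):
        print(method, status, mime_type, url)
```

Scanning a summary like this for `application/json` responses is usually the quickest way to spot which requests carry the data you want.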

Why I Chose HAR File Parsing

HAR files provide a comprehensive snapshot of the network activity recorded while a page is open in DevTools. This makes them well suited to identifying hidden APIs, capturing JSON payloads, and extracting the exact data required for scraping. Because a HAR file is structured JSON, it is also easy to parse with Python or JavaScript.

The Plan: Scraping Data Using HAR File Parsing

To extract product data from Blinkit efficiently, I followed a structured plan:

  1. Browsing and Capturing Network Activity
    • Opened Blinkit’s site and launched Chrome DevTools.
    • Browsed various product pages to capture all necessary API calls in the Network tab.

  2. Exporting the HAR File

    • Saved the recorded network activity as a HAR file for offline analysis.
  3. Parsing the HAR File

    • Used Python to parse the HAR file and extract relevant data.
    • Created three key functions to streamline the process:
  • Function 1: Filter Relevant Responses
    • Extracted all responses matching the endpoint /listing?catId=* to get product-related data.
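
A minimal sketch of what a filter like Function 1 could look like in Python is shown here. It is not the original code: the names `is_listing_request` and `listing_payloads` are illustrative, and the matching rule (a URL path ending in `/listing` with a non-empty `catId` query parameter) is one reading of the `/listing?catId=*` pattern. The `content.text` and `content.encoding` keys come from the HAR format, where binary or non-UTF-8 bodies may be stored as base64.

```python
import base64
import json
from urllib.parse import parse_qs, urlsplit


def is_listing_request(url):
    """True when the URL looks like .../listing?catId=<something>."""
    parts = urlsplit(url)
    return parts.path.endswith("/listing") and "catId" in parse_qs(parts.query)


def listing_payloads(har):
    """Return the decoded JSON body of every matching response in the HAR."""
    payloads = []
    for entry in har["log"]["entries"]:
        if not is_listing_request(entry["request"]["url"]):
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            continue  # the body was not recorded for this entry
        try:
            if content.get("encoding") == "base64":
                text = base64.b64decode(text).decode("utf-8")
            payloads.append(json.loads(text))
        except ValueError:
            continue  # not valid base64, UTF-8, or JSON; skip it
    return payloads


# Usage (the file name is a placeholder):
# with open("capture.har", encoding="utf-8-sig") as f:
#     payloads = listing_payloads(json.load(f))
```

Parsing the URL instead of doing a plain substring check avoids false matches on unrelated requests that merely mention `listing` somewhere in a parameter value.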

  • Function 2: Clean and Extract Data
    • Processed the filtered responses to extract key fields like id, name, category, and more.
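
The cleaning step in Function 2 could be sketched as follows. The real response schema is not shown in this post, so the top-level `products` key is purely an assumption for illustration; adjust it to whatever the captured JSON actually contains. The function name `extract_products`, the whitespace trimming, and the de-duplication by `id` are also illustrative choices, not the original implementation.

```python
def clean(value):
    """Trim surrounding whitespace from strings; leave other values as-is."""
    return value.strip() if isinstance(value, str) else value


def extract_products(payloads, fields=("id", "name", "category")):
    """Flatten decoded listing payloads into one list of product records.

    Assumes each payload is a dict holding a list of product dicts under
    the hypothetical key "products".
    """
    seen_ids = set()
    products = []
    for payload in payloads:
        for item in payload.get("products", []):
            product_id = item.get("id")
            if product_id is None or product_id in seen_ids:
                continue  # skip records without an id and repeated products
            seen_ids.add(product_id)
            products.append({field: clean(item.get(field)) for field in fields})
    return products
```

De-duplicating matters here because browsing several category pages tends to capture the same product more than once.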

  • Function 3: Save Images Locally
    • Identified all product image URLs in the data and downloaded them to local files for reference.
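
For Function 3, one possible approach using only the Python standard library is sketched here. It is an assumption-laden illustration rather than the original code: image URLs are detected by file extension anywhere in the nested JSON, the names `find_image_urls`, `image_filename`, and `save_images` are hypothetical, and the `fetch` parameter exists only so the download step can be swapped out.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlopen

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def find_image_urls(data):
    """Recursively yield every string in nested JSON that looks like an image URL."""
    if isinstance(data, dict):
        for value in data.values():
            yield from find_image_urls(value)
    elif isinstance(data, list):
        for value in data:
            yield from find_image_urls(value)
    elif isinstance(data, str) and data.startswith(("http://", "https://")):
        if urlsplit(data).path.lower().endswith(IMAGE_EXTENSIONS):
            yield data


def image_filename(url):
    """Derive a local file name from the last segment of the URL path."""
    return os.path.basename(urlsplit(url).path) or "image"


def fetch_bytes(url, timeout=10):
    """Download a URL and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def save_images(urls, out_dir="images", fetch=fetch_bytes):
    """Download each distinct URL into out_dir and return the saved paths.

    Note: two different URLs that end in the same file name will overwrite
    each other in this simple version.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in dict.fromkeys(urls):  # drop duplicates, keep first-seen order
        try:
            data = fetch(url)
        except OSError:
            continue  # urllib's URLError and HTTPError are OSError subclasses
        path = os.path.join(out_dir, image_filename(url))
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
    return saved
```

Calling `save_images(find_image_urls(payload))` on a decoded response would then write the images into an `images` folder next to the script.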

  4. Execution and Results
    • The entire process, including some trial and error, took around 30–40 minutes.
    • Successfully scraped data for approximately 600 products, including names, categories, and images.

This approach allowed me to gather the necessary data for my grocery delivery app quickly and efficiently.

Conclusion

Data scraping, when done efficiently, can save a lot of time and effort, especially when you need real-world data to test or build an application. By using Chrome DevTools and HAR files, I was able to extract product data from Blinkit without manually creating a dataset. The process required some trial and error but was straightforward, and it solved a problem many developers run into. With this method, I gathered details for approximately 600 products in under an hour, which let me move forward with my grocery delivery app project.

Data scraping, however, should always be approached ethically and responsibly. Always ensure you comply with a website’s terms of service and legal guidelines before scraping. If done right, scraping can be a powerful tool for collecting data and improving your projects.

source:dev.to