
Web Scraping with Selenium

Susan Sarandon
Release: 2025-01-23 18:11:17


Automating IBGE Inflation Data Collection with Selenium and Python

This tutorial demonstrates how to automate the collection of inflation data from IBGE (Brazilian Institute of Geography and Statistics) using the Selenium library in Python. The objective is to extract the percentage variation of the IPCA (Broad National Consumer Price Index) from the SIDRA website (Sistema IBGE de Recuperação Automática, IBGE's automatic data retrieval system).


Steps for Data Collection

Before you start, make sure you have Python installed on your system, along with the package manager pip.


1. Environment Preparation

1.1 Create the Project:

Create a new folder for your project. Inside it, create a Jupyter Notebook file (.ipynb) or a Python file (.py). Jupyter Notebook makes it easy to view and run code step by step.

1.2 Installation of Libraries:

Open your terminal or command prompt, navigate to your project folder and run the following command to install the necessary libraries:

<code class="language-bash">pip install notebook selenium webdriver-manager pandas</code>

Creating and activating a virtual environment before installing is recommended, because it keeps this project's dependencies isolated from the system Python:

<code class="language-bash">python -m venv venv  # Create the virtual environment
venv\Scripts\activate  # Activate the virtual environment (Windows)
source venv/bin/activate # Activate the virtual environment (Linux/macOS)</code>
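
To confirm from inside Python that the virtual environment is really active, a small check like the following can help. It is a minimal sketch that relies on the standard library only; the function name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment folder while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```

If this prints `False` after activation, the terminal is probably still using the system interpreter.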

After activating the virtual environment, run the pip install command above again so that the libraries are installed inside it. To save the dependencies to a requirements.txt file, use:

<code class="language-bash">pip freeze > requirements.txt</code>

This allows you to easily reproduce the environment on another computer.
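
After recreating the environment on another machine, it can be useful to check that the four libraries installed earlier are actually present. The sketch below uses `importlib.metadata` from the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# The packages installed with pip earlier in this tutorial.
REQUIRED = ("notebook", "selenium", "webdriver-manager", "pandas")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```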

1.3 ChromeDriver Download:

Download the ChromeDriver release that matches your installed version of Google Chrome (open chrome://settings/help in Chrome to check it). The download links are on the official ChromeDriver website. After downloading, unzip the file and note the folder where you saved it.
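
A quick way to compare the two versions is to look at the major version number. The sketch below assumes that "compatible" means "same major version", which is the usual rule of thumb rather than a guarantee, and the version strings in the example are invented for illustration.

```python
import re


def major_version(version_text: str):
    """Extract the major version from text such as 'ChromeDriver 131.0.6778.85'."""
    match = re.search(r"(\d+)\.\d+", version_text)
    return int(match.group(1)) if match else None


def versions_match(chrome_text: str, driver_text: str) -> bool:
    """True when both strings carry the same major version number."""
    chrome = major_version(chrome_text)
    driver = major_version(driver_text)
    return chrome is not None and chrome == driver


# Hypothetical strings, copied by hand from chrome://settings/help
# and from the ChromeDriver download page.
print(versions_match("Google Chrome 131.0.6778.86", "ChromeDriver 131.0.6778.85"))
```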


2. ChromeDriver Configuration

2.1 Add to PATH (Windows):

To make ChromeDriver easier to use, add the folder that contains it to the PATH environment variable. Follow these steps:

  1. Search for "environment variables" in the start menu.
  2. Click on "Edit system environment variables".
  3. In the "System variables" section, select "Path" and click "Edit".
  4. Click "New" and add the full path of the folder where ChromeDriver is located (e.g. C:\caminho\para\chromedriver).
  5. Save the changes and restart the terminal or command prompt.

2.2 Verification:

To check if ChromeDriver is configured correctly, open your terminal and type:

<code class="language-bash">chromedriver --version</code>

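The ChromeDriver version should be displayed. If the command is not recognized, the folder was not added to PATH correctly or the terminal was not restarted.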
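
The same verification can be done from Python, which is handy inside a notebook. This is a minimal sketch using only the standard library; `tool_version` is a hypothetical helper name.

```python
import shutil
import subprocess


def tool_version(executable: str = "chromedriver"):
    """Return the '--version' output of an executable found on PATH, or None."""
    path = shutil.which(executable)  # None when the executable is not on PATH
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


print(tool_version() or "chromedriver was not found on PATH")
```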


3. Python Script for Automation

The Python code below uses Selenium to access the SIDRA page, select the data and extract the IPCA percentage variation information. Remember to replace 'C:\caminho\para\chromedriver.exe' with the correct path for your ChromeDriver.
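
The following is a minimal sketch of such a script, not the author's original listing. Several details are assumptions: the SIDRA table address, the XPath used to locate the values, and the function names are placeholders to adapt, and the step that selects all data is left as a comment because it depends on page controls that are not described here.

```python
from pathlib import Path

# Assumptions to adapt: the table address and the XPath are placeholders.
SIDRA_URL = "https://sidra.ibge.gov.br/tabela/1737"
VALUES_XPATH = "//table//td"
CHROMEDRIVER_PATH = r"C:\caminho\para\chromedriver.exe"


def save_html(html: str, path: str = "pagina_carregada.html") -> Path:
    """Write the page source to disk so it can be inspected while debugging."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


def scrape_ipca(driver_path=CHROMEDRIVER_PATH, url=SIDRA_URL,
                xpath=VALUES_XPATH, timeout=30):
    """Open the page, wait for the cells to appear and return their texts."""
    # Imported here so the helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    try:
        driver.get(url)
        # Placeholder: click the page controls that select all data here.
        cells = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, xpath))
        )
        texts = [cell.text.strip() for cell in cells if cell.text.strip()]
        save_html(driver.page_source)
        return texts
    finally:
        driver.quit()  # always close the browser, even after an error
```

Calling `scrape_ipca()` and printing each returned string reproduces the flow described in this section. The explicit wait is used because the table is assumed to be rendered by JavaScript after the initial page load.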


4. Execution and Results

Run the Python script. If everything is configured correctly, the script will:

  1. Access the SIDRA page.
  2. Select all data.
  3. Extract percentage change values.
  4. Print the values to the console.
  5. Save the page's HTML in a file pagina_carregada.html (useful for debugging).
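
The values extracted in step 3 arrive as text. The sketch below converts them to numbers, assuming the page shows them in Brazilian number format (comma as decimal separator, dot as thousands separator) and that cells without data hold placeholders such as "..." or "-". The function name and the sample strings are hypothetical.

```python
def parse_percent(text: str):
    """Convert a string such as '0,52' to a float; return None if not numeric."""
    cleaned = text.strip().replace("%", "").replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Invented sample cell texts, not real IPCA figures.
raw_cells = ["0,52", "-0,08", "...", "1.234,56"]
values = [v for v in map(parse_percent, raw_cells) if v is not None]
print(values)
```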

The extracted data can be processed further, for example to create graphs or reports.
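
As a small illustration of further processing, the sketch below computes a summary of a list of already numeric values using only the standard library. The sample numbers are invented and are not real IPCA figures; pandas, installed earlier, could be used for richer reports.

```python
from statistics import mean


def summarize(values):
    """Return count, mean, minimum and maximum of a list of numbers."""
    if not values:
        raise ValueError("no values to summarize")
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical monthly variations in percent.
sample = [0.5, 0.25, 0.75]
for key, value in summarize(sample).items():
    print(f"{key}: {value}")
```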


Final Considerations

This tutorial provides a basis for automating IBGE data collection. Remember that the structure of the site may change, which would require adjusting the XPath expressions used by the script. Monitor the site for changes and update your script as needed. Furthermore, respect the terms of use of the IBGE website when collecting data.
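
One way to soften the impact of layout changes is to keep a short list of candidate XPath expressions and use the first one that matches. The sketch below is independent of Selenium: it takes any function that returns the elements found for a locator. The helper name and the XPath strings are hypothetical.

```python
def first_matching(find_all, locators):
    """Return (locator, elements) for the first locator that finds anything."""
    for locator in locators:
        elements = find_all(locator)
        if elements:
            return locator, elements
    raise LookupError(f"none of the locators matched: {list(locators)}")


# Stand-in for a real page: only the second expression matches.
fake_page = {"//table//td": ["0,52", "0,31"]}
locator, cells = first_matching(
    lambda xp: fake_page.get(xp, []),
    ["//td[@class='valor']", "//table//td"],
)
print(locator, cells)
```

With Selenium, `find_all` could be `lambda xp: driver.find_elements(By.XPATH, xp)`, since `find_elements` returns an empty list when nothing matches. The `LookupError` then signals that the page has changed enough to need a manual update.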
