Magic Mushrooms: exploring and treating null data with Mage

王林
Release: 2024-08-18 06:02:02
Original
694 people have browsed it

Mage is a powerful tool for ETL tasks, with features that enable data exploration and mining, quick visualizations through graph templates and several other features that transform your work with data into something magical.

When processing data, during an ETL process it is common to find missing data that can generate problems in the future, depending on the activity we are going to carry out with the dataset, null data can be quite disruptive.

To identify the absence of data in our dataset, we can use Python and the pandas library to check the data that presents null values, in addition we can create graphs that show even more clearly the impact of these null values ​​in our dataset.

Our pipeline consists of 4 steps: starting with loading the data, two processing steps and exporting the data.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

Data Loader

In this article we will use the dataset: Binary Prediction of Poisonous Mushrooms which is available on Kaggle as part of a competition. Let's use the training dataset available on the website.

Let's create a Data Loader step using python to be able to load the data we are going to use. Before this step, I created a table in the Postgres database, which I have locally on my machine, to be able to load the data. As the data is in Postgres, we will use the already defined Postgres load template within Mage.

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from os import path

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@data_loader
def load_data_from_postgres(*args, **kwargs):
    """
    Template for loading data from a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    query = 'SELECT * FROM mushroom'  # Specify your SQL query here
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        return loader.load(query)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """

    assert output is not None, 'The output is undefined'
Copy after login

Within the function load_data_from_postgres() we will define the query that we will use to load the table in the database. In my case, I configured the bank information in the file io_config.yaml where it is defined as the default configuration, so we only need to pass the default name to the variable config_profile.

After executing the block, we will use the Add chart feature, which will provide information about our data through already defined templates. Just click on the icon next to the play button, marked in the image with a yellow line.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

We will select two options to explore our dataset further, the summay_overview and feature_profiles options. Through summary_overview, we obtain information about the number of columns and rows in the dataset. We can also view the total number of columns by type, for example the total number of categorical, numeric and Boolean columns. Feature_profiles, on the other hand, presents more descriptive information about the data, such as: type, minimum value, maximum value, among other information, and we can even visualize the missing values, which are the focus of our treatment.

To be able to focus more on missing data, let's use the template: % of missing values, a bar graph with the percentage of data that is missing, in each of the columns.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

The graph presents 4 columns where missing values ​​correspond to more than 80% of its content, and other columns that present missing values ​​but in smaller quantities, this information now allows us to seek different strategies to deal with this null data.

Transformer Drop Columns

For columns that have more than 80% of null values, the strategy we will follow will be to perform a drop columns in the dataframe, selecting the columns that we are going to exclude from the dataframe. Using the TRANSFORMER Block in the Python language, we will select the option Colum removal .

from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.REMOVE
    Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.REMOVE,
        arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'],        
        axis=Axis.COLUMN,
    )
    return BaseAction(action).execute(df)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.

    """
    assert output is not None, 'The output is undefined'
Copy after login

Within the function execute_transformer_action() we will insert a list with the name of the columns that we want to exclude from the dataset, in the arguments variable, after this step, just execute the block.

Transformer Fill in Missing Values

Now for the columns that have less than 80% of null values, we will use the strategy Fill in Missing Values, as in some cases despite having missing data, replacing these with values ​​such as average, or fashion, it may be able to meet the need for data without causing many changes to the dataset, depending on your final objective.

Existem algumas tarefas, como a de classificação, onde a substituição dos dados faltantes por um valor que seja relevante (moda, média, mediana) para o dataset, possa contribuir com o algoritmo de classificação, que poderia chegar a outras conclusões caso o dados fossem apagados como na outra estratégia de utilizamos.

Para tomar uma decisão com relação a qual medida vamos utilizar, vamos recorrer novamente a funcionalidade Add chart do Mage. Usando o template Most frequent values podemos visualizar a moda e a frequência desse valor em cada uma das colunas.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

Seguindos passos semelhantes aos anteriores, vamos usar o tranformer Fill in missing values, para realizar a tarefa de subtiruir os dados faltantes usando a moda de cada uma das colunas: steam_surface, gill_spacing, cap_surface, gill_attachment, ring_type.

from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:

    """
    Execute Transformer Action: ActionType.IMPUTE
    Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values

    """
    action = build_transformer_action(
        df,
        action_type=ActionType.IMPUTE,
        arguments=df.columns,  # Specify columns to impute
        axis=Axis.COLUMN,
        options={'strategy': ImputationStrategy.MODE},  # Specify imputation strategy
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
Copy after login

Na função execute_transformer_action() , definimos a estratégia para a substituição dos dados num dicionário do Python. Para mais opções de substituição, basta acessar a documentação do transformer: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.

Data Exporter

Ao realizar todas as transformações, vamos salvar nosso dataset agora tratado, na mesma base do Postgres mas agora com um nome diferente para podermos diferenciar. Usando o bloco Data Exporter e selecionando o Postgres, vamos definir o shema e a tabela onde queremos salvar, lembrando que as configurações do banco são salvas previamente no arquivo io_config.yaml.

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from pandas import DataFrame
from os import path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:

    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql

    """

    schema_name = 'public'  # Specify the name of the schema to export data to
    table_name = 'mushroom_clean'  # Specify the name of the table to export data to
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        loader.export(
            df,
            schema_name,
            table_name,
            index=False,  # Specifies whether to include index in exported table
            if_exists='replace', #Specify resolution policy if table name already exists
        )

Copy after login

Obrigado e até a próxima ?

repo -> https://github.com/DeadPunnk/Mushrooms/tree/main

The above is the detailed content of Magic Mushrooms: exploring and treating null data with Mage. For more information, please follow other related articles on the PHP Chinese website!

source:dev.to
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template