Mage is a powerful tool for ETL tasks, with features for data exploration and mining, quick visualizations through chart templates, and much more that turns working with data into something magical.
When processing data during an ETL process, it is common to find missing data that can cause problems later on; depending on what we plan to do with the dataset, null values can be quite disruptive.
To identify missing data in our dataset, we can use Python and the pandas library to check which columns contain null values, and we can also create charts that show even more clearly the impact of these null values on the dataset.
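Before moving into Mage, a minimal pandas check is a useful sanity test. A sketch, assuming the training file has been downloaded as train.csv (the file name here is illustrative):

import pandas as pd

# Load the dataset (path is illustrative)
df = pd.read_csv('train.csv')

# Count of null values per column
print(df.isnull().sum())

# Percentage of null values per column, sorted from worst to best
print((df.isnull().mean() * 100).sort_values(ascending=False))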
Our pipeline consists of four steps: loading the data, two processing steps, and exporting the data.
In this article we will use the Binary Prediction of Poisonous Mushrooms dataset, which is available on Kaggle as part of a competition. We will work with the training dataset available on the site.
Let's create a Data Loader step using Python to load the data we are going to work with. Before this step, I created a table in the Postgres database that I run locally on my machine, so the data could be loaded into it. Since the data is in Postgres, we will use Mage's predefined Postgres load template.
from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from os import path

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_postgres(*args, **kwargs):
    """
    Template for loading data from a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.

    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    query = 'SELECT * FROM mushroom'  # Specify your SQL query here
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        return loader.load(query)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
Within the load_data_from_postgres() function, we define the query that loads the table from the database. In my case, I configured the database credentials in the io_config.yaml file, where they are stored under the default profile, so we only need to pass the name default to the config_profile variable.
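For reference, the Postgres settings live under the default profile in io_config.yaml. A sketch of that section using Mage's standard key names, with placeholder values rather than real credentials:

default:
  POSTGRES_DBNAME: mushrooms      # placeholder database name
  POSTGRES_SCHEMA: public
  POSTGRES_USER: postgres         # placeholder user
  POSTGRES_PASSWORD: your_password
  POSTGRES_HOST: localhost
  POSTGRES_PORT: 5432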
After executing the block, we will use the Add chart feature, which provides information about our data through predefined templates. Just click the icon next to the play button, marked in the image with a yellow line.
We will select two options to explore our dataset further: summary_overview and feature_profiles. Through summary_overview we obtain the number of columns and rows in the dataset, as well as the total number of columns of each type, for example how many columns are categorical, numeric, or Boolean. feature_profiles, in turn, presents more descriptive information about the data, such as type, minimum value, and maximum value, and even lets us see the missing values, which are the focus of our treatment.
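If you want to reproduce part of this overview outside of Mage, a rough pandas equivalent might look like this (this is not what the templates run internally, just an approximation):

# Rough pandas equivalent of the summary/profile charts
print(df.shape)                    # rows and columns
print(df.dtypes.value_counts())    # column count per type
print(df.describe(include='all'))  # min, max, and other descriptive stats
print(df.isnull().sum())           # missing values per column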
To focus on the missing data, let's use the % of missing values template, a bar chart showing the percentage of missing data in each column.
The chart shows 4 columns in which missing values account for more than 80% of the content, plus other columns with smaller amounts of missing values. With this information, we can now adopt different strategies to deal with the null data.
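A similar chart can be sketched by hand with pandas and matplotlib, which also gives us the list of columns above the 80% threshold:

import matplotlib.pyplot as plt

# Percentage of missing values per column
missing_pct = df.isnull().mean().mul(100).sort_values(ascending=False)

# Bar chart similar to the '% of missing values' template
missing_pct.plot(kind='bar', figsize=(10, 4))
plt.axhline(80, color='red', linestyle='--', label='80% threshold')
plt.ylabel('% of missing values')
plt.legend()
plt.tight_layout()
plt.show()

# Columns above the 80% threshold (candidates for removal)
print(missing_pct[missing_pct > 80].index.tolist())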
For the columns with more than 80% null values, our strategy will be to drop them from the dataframe, selecting the columns we want to exclude. Using a TRANSFORMER block in Python, we will select the Column removal option.
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.REMOVE

    Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.REMOVE,
        arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'],
        axis=Axis.COLUMN,
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
Within the execute_transformer_action() function, we pass a list with the names of the columns we want to exclude from the dataset to the arguments variable; after this step, just execute the block.
Now, for the columns with less than 80% null values, we will use the Fill in missing values strategy: in some cases, despite the missing data, replacing it with a value such as the mean or the mode can meet the need for data without causing major changes to the dataset, depending on your final objective.
There are some tasks, such as classification, where replacing the missing data with a value that is relevant to the dataset (mode, mean, median) can help the classification algorithm, which might reach different conclusions if the data were simply deleted, as in the other strategy we used.
To decide which measure to use, we will once again turn to Mage's Add chart feature. Using the Most frequent values template, we can view the mode and the frequency of that value in each of the columns.
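The same check can be done in pandas. A small sketch for the columns we are about to impute (column names assumed to match the table created earlier):

# Mode and its frequency for each column we plan to impute
cols = ['stem_surface', 'gill_spacing', 'cap_surface', 'gill_attachment', 'ring_type']
for col in cols:
    mode_value = df[col].mode()[0]
    freq = (df[col] == mode_value).sum()
    print(f'{col}: mode={mode_value!r}, frequency={freq}')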
Following steps similar to the previous ones, we will use the Fill in missing values transformer to replace the missing data with the mode of each of these columns: stem_surface, gill_spacing, cap_surface, gill_attachment, ring_type.
from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.IMPUTE

    Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.IMPUTE,
        arguments=df.columns,  # Specify columns to impute
        axis=Axis.COLUMN,
        options={'strategy': ImputationStrategy.MODE},  # Specify imputation strategy
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
In the execute_transformer_action() function, we define the replacement strategy in a Python dictionary. For more replacement options, see the transformer documentation: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.
After performing all the transformations, we will save the now-treated dataset to the same Postgres database, but under a different name so we can tell the two apart. Using the Data Exporter block and selecting Postgres, we define the schema and the table where we want to save it, remembering that the database settings were saved previously in the io_config.yaml file.
from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from pandas import DataFrame
from os import path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:
    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.

    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    schema_name = 'public'  # Specify the name of the schema to export data to
    table_name = 'mushroom_clean'  # Specify the name of the table to export data to
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        loader.export(
            df,
            schema_name,
            table_name,
            index=False,  # Specifies whether to include index in exported table
            if_exists='replace',  # Specify resolution policy if table name already exists
        )
Repo: https://github.com/DeadPunnk/Mushrooms/tree/main