Magic Mushrooms：使用 Mage 探索和處理空數據-Python教學-PHP中文網

Mage 是一款用於 ETL 任務的強大工具，具有支援資料探索和挖掘的功能、透過圖形模板進行快速視覺化以及其他一些功能，可將您的資料工作轉變為神奇的東西。

處理數據時，在 ETL 過程中，通常會發現丟失的數據，這些數據可能會在將來產生問題，具體取決於我們要對數據集執行的活動，空數據可能會造成相當大的破壞。

為了識別資料集中是否缺少數據，我們可以使用 Python 和 pandas 函式庫來檢查出現空值的數據，此外我們還可以建立圖表來更清楚地顯示這些空值的影響我們的資料集。

我們的管道由 4 個步驟組成：從資料載入開始，兩個處理步驟和資料匯出。

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

資料載入器

在本文中，我們將使用資料集：有毒蘑菇的二進位預測，該資料集可在 Kaggle 上作為競賽的一部分獲得。讓我們使用網站上提供的訓練資料集。

讓我們使用 python 建立一個資料載入器步驟，以便能夠載入我們將要使用的資料。在此步驟之前，我在本機電腦上的 Postgres 資料庫中建立了一個表，以便能夠載入資料。由於資料位於 Postgres 中，我們將使用 Mage 中已定義的 Postgres 載入範本。

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from os import path

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@data_loader
def load_data_from_postgres(*args, **kwargs):
    """
    Template for loading data from a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    query = 'SELECT * FROM mushroom'  # Specify your SQL query here
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        return loader.load(query)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """

    assert output is not None, 'The output is undefined'

登入後複製

在函數 load_data_from_postgres() 中，我們將定義用於載入資料庫中的表的查詢。就我而言，我在檔案 io_config.yaml 中配置了銀行訊息，它被定義為預設配置，因此我們只需將預設名稱傳遞給變數 config_profile 即可。

執行該區塊後，我們將使用新增圖表功能，該功能將透過已定義的範本提供有關我們的資料的資訊。只需點擊播放按鈕旁邊的圖示（在圖像中用黃線標記）即可。

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

我們將選擇兩個選項來進一步探索我們的資料集：summay_overview 和 feature_profiles 選項。透過summary_overview，我們可以獲得資料集中的列數和行數信息，我們還可以按類型查看列的總數，例如分類列、數字列和布林列的總數。另一方面，Feature_profiles 呈現了更多關於數據的描述性信息，例如：類型、最小值、最大值等信息，我們甚至可以將缺失值可視化，這是我們處理的重點。

為了能夠更專注於缺失數據，讓我們使用範本：缺失值百分比，這是一個長條圖，每列中包含缺失資料的百分比。

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

該圖顯示了4 列，其中缺失值對應於其內容的80% 以上，而其他列則呈現缺失值，但數量較少，這些資訊現在允許我們尋求不同的策略來處理此問題空數據。

變壓器跌落柱

對於空值超過 80% 的列，我們將遵循的策略是在資料框中執行刪除列，選擇我們要從資料框中排除的列。使用 Python 語言中的

TRANSFORMER 區塊，我們將選擇選項列刪除 .

from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.REMOVE
    Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.REMOVE,
        arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'],        
        axis=Axis.COLUMN,
    )
    return BaseAction(action).execute(df)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.

    """
    assert output is not None, 'The output is undefined'

登入後複製

在函數

execute_transformer_action() 中，我們將在參數變數中插入一個列表，其中包含要從資料集中排除的列的名稱，在此步驟之後，只需執行該區塊即可。

變壓器填充缺失值

現在，對於空值少於80% 的列，我們將使用策略

填充缺失值，在某些情況下，儘管存在缺失數據，但仍將其替換為諸如平均，或時尚，它可能能夠滿足數據需求，而不會對數據集造成太多更改，這取決於您的最終目標。

Existem algumas tarefas, como a de classificação, onde a substituição dos dados faltantes por um valor que seja relevante (moda, média, mediana) para o dataset, possa contribuir com o algoritmo de classificação, que poderia chegar a outras conclusões caso o dados fossem apagados como na outra estratégia de utilizamos.

Para tomar uma decisão com relação a qual medida vamos utilizar, vamos recorrer novamente a funcionalidade Add chart do Mage. Usando o template Most frequent values podemos visualizar a moda e a frequência desse valor em cada uma das colunas.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

Seguindos passos semelhantes aos anteriores, vamos usar o tranformer Fill in missing values, para realizar a tarefa de subtiruir os dados faltantes usando a moda de cada uma das colunas: steam_surface, gill_spacing, cap_surface, gill_attachment, ring_type.

from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:

    """
    Execute Transformer Action: ActionType.IMPUTE
    Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values

    """
    action = build_transformer_action(
        df,
        action_type=ActionType.IMPUTE,
        arguments=df.columns,  # Specify columns to impute
        axis=Axis.COLUMN,
        options={'strategy': ImputationStrategy.MODE},  # Specify imputation strategy
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'

登入後複製

Na função execute_transformer_action() , definimos a estratégia para a substituição dos dados num dicionário do Python. Para mais opções de substituição, basta acessar a documentação do transformer: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.

Data Exporter

Ao realizar todas as transformações, vamos salvar nosso dataset agora tratado, na mesma base do Postgres mas agora com um nome diferente para podermos diferenciar. Usando o bloco Data Exporter e selecionando o Postgres, vamos definir o shema e a tabela onde queremos salvar, lembrando que as configurações do banco são salvas previamente no arquivo io_config.yaml.

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from pandas import DataFrame
from os import path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:

    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql

    """

    schema_name = 'public'  # Specify the name of the schema to export data to
    table_name = 'mushroom_clean'  # Specify the name of the table to export data to
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        loader.export(
            df,
            schema_name,
            table_name,
            index=False,  # Specifies whether to include index in exported table
            if_exists='replace', #Specify resolution policy if table name already exists
        )

登入後複製