Magic Mushrooms: Mage を使用した null データの探索と処理-Python チュートリアル-php.cn

Mage は ETL タスク用の強力なツールであり、データの探索とマイニングを可能にする機能、グラフテンプレートによる迅速な視覚化、およびデータを使った作業を魔法のようなものに変えるその他のいくつかの機能を備えています。

データを処理するとき、ETL プロセス中に、将来問題を引き起こす可能性のある欠落データが見つかるのが一般的です。データセットで実行しようとしているアクティビティによっては、null データが非常に破壊的になる可能性があります。

データセット内にデータが存在しないことを特定するには、Python と pandas ライブラリを使用して、null 値を示すデータをチェックできます。さらに、これらの null 値の影響をより明確に示すグラフを作成できます。私たちのデータセット。

私たちのパイプラインは 4 つのステップで構成されています: データのロードから始まり、2 つの処理ステップ、データのエクスポートです。

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

データローダー

この記事では、コンテストの一環として Kaggle で利用できるデータセット「毒キノコのバイナリ予測」を使用します。ウェブサイトで入手可能なトレーニングデータセットを使用してみましょう。

使用するデータをロードできるように、Python を使用してデータローダーステップを作成しましょう。この手順の前に、データをロードできるように、マシン上にローカルにある Postgres データベースにテーブルを作成しました。データは Postgres にあるため、Mage 内ですでに定義されている Postgres ロードテンプレートを使用します。

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from os import path

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@data_loader
def load_data_from_postgres(*args, **kwargs):
    """
    Template for loading data from a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    query = 'SELECT * FROM mushroom'  # Specify your SQL query here
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        return loader.load(query)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """

    assert output is not None, 'The output is undefined'

ログイン後にコピー

関数 load_data_from_postgres() 内で、データベースにテーブルをロードするために使用するクエリを定義します。私の場合、デフォルト設定として定義されているファイル io_config.yaml で銀行情報を設定したため、デフォルト名を変数 config_profile に渡すだけで済みます。

ブロックを実行した後、チャートの追加機能を使用します。これにより、すでに定義されたテンプレートを通じてデータに関する情報が提供されます。画像内で黄色の線でマークされている、再生ボタンの横にあるアイコンをクリックするだけです。

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

データセットをさらに調査するために、summay_overview オプションと feature_profiles オプションの 2 つのオプションを選択します。 summary_overview を通じて、データセット内の列と行の数に関する情報を取得します。また、カテゴリ列、数値列、ブール列の合計数など、タイプごとの列の合計数を表示することもできます。一方、Feature_profiles は、タイプ、最小値、最大値などのデータに関するより記述的な情報を表示し、処理の焦点である欠損値を視覚化することもできます。

欠損データにさらに焦点を当てることができるように、欠損値の %、欠損データのパーセンテージを各列に示す棒グラフのテンプレートを使用しましょう。

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

グラフには、欠損値がその内容の 80% 以上に相当する 4 つの列が表示されます。また、欠損値がより少ない量で表示される他の列も表示されます。この情報により、これに対処するためのさまざまな戦略を模索できるようになります。ヌルデータ。

変圧器ドロップカラム

80% を超える null 値が含まれる列の場合、データフレーム内で列の削除を実行し、データフレームから除外する列を選択する戦略がとられます。 Python 言語の

TRANSFORMER ブロックを使用して、オプション列の削除 を選択します。

from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.REMOVE
    Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.REMOVE,
        arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'],        
        axis=Axis.COLUMN,
    )
    return BaseAction(action).execute(df)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.

    """
    assert output is not None, 'The output is undefined'

ログイン後にコピー

関数

execute_transformer_action() 内で、データセットから除外する列の名前を含むリストを引数変数に挿入します。このステップの後は、ブロックを実行するだけです。

トランスフォーマーの欠損値の入力

NULL 値が 80% 未満の列については、

欠損値を埋める戦略を使用します。場合によっては、欠損データがあるにもかかわらず、これらを次のような値で置き換えます。最終的な目的に応じて、データセットに多くの変更を加えることなく、データのニーズを満たすことができる場合があります。

Existem algumas tarefas, como a de classificação, onde a substituição dos dados faltantes por um valor que seja relevante (moda, média, mediana) para o dataset, possa contribuir com o algoritmo de classificação, que poderia chegar a outras conclusões caso o dados fossem apagados como na outra estratégia de utilizamos.

Para tomar uma decisão com relação a qual medida vamos utilizar, vamos recorrer novamente a funcionalidade Add chart do Mage. Usando o template Most frequent values podemos visualizar a moda e a frequência desse valor em cada uma das colunas.

Cogumelos Mágicos: explorando e tratando dados nulos com Mage

Seguindos passos semelhantes aos anteriores, vamos usar o tranformer Fill in missing values, para realizar a tarefa de subtiruir os dados faltantes usando a moda de cada uma das colunas: steam_surface, gill_spacing, cap_surface, gill_attachment, ring_type.

from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:

    """
    Execute Transformer Action: ActionType.IMPUTE
    Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values

    """
    action = build_transformer_action(
        df,
        action_type=ActionType.IMPUTE,
        arguments=df.columns,  # Specify columns to impute
        axis=Axis.COLUMN,
        options={'strategy': ImputationStrategy.MODE},  # Specify imputation strategy
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'

ログイン後にコピー

Na função execute_transformer_action() , definimos a estratégia para a substituição dos dados num dicionário do Python. Para mais opções de substituição, basta acessar a documentação do transformer: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.

Data Exporter

Ao realizar todas as transformações, vamos salvar nosso dataset agora tratado, na mesma base do Postgres mas agora com um nome diferente para podermos diferenciar. Usando o bloco Data Exporter e selecionando o Postgres, vamos definir o shema e a tabela onde queremos salvar, lembrando que as configurações do banco são salvas previamente no arquivo io_config.yaml.

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from pandas import DataFrame
from os import path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:

    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql

    """

    schema_name = 'public'  # Specify the name of the schema to export data to
    table_name = 'mushroom_clean'  # Specify the name of the table to export data to
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        loader.export(
            df,
            schema_name,
            table_name,
            index=False,  # Specifies whether to include index in exported table
            if_exists='replace', #Specify resolution policy if table name already exists
        )

ログイン後にコピー