Mage 是一款用於 ETL 任務的強大工具,具有支援資料探索和挖掘的功能、透過圖形模板進行快速視覺化以及其他一些功能,可將您的資料工作轉變為神奇的東西。
處理數據時,在 ETL 過程中,通常會發現丟失的數據,這些數據可能會在將來產生問題,具體取決於我們要對數據集執行的活動,空數據可能會造成相當大的破壞。
為了識別資料集中是否缺少數據,我們可以使用 Python 和 pandas 函式庫來檢查出現空值的數據,此外我們還可以建立圖表來更清楚地顯示這些空值的影響我們的資料集。
我們的管道由 4 個步驟組成:從資料載入開始,兩個處理步驟和資料匯出。
在本文中,我們將使用資料集:有毒蘑菇的二進位預測,該資料集可在 Kaggle 上作為競賽的一部分獲得。讓我們使用網站上提供的訓練資料集。
讓我們使用 python 建立一個資料載入器步驟,以便能夠載入我們將要使用的資料。在此步驟之前,我在本機電腦上的 Postgres 資料庫中建立了一個表,以便能夠載入資料。由於資料位於 Postgres 中,我們將使用 Mage 中已定義的 Postgres 載入範本。
from mage_ai.settings.repo import get_repo_path from mage_ai.io.config import ConfigFileLoader from mage_ai.io.postgres import Postgres from os import path if 'data_loader' not in globals(): from mage_ai.data_preparation.decorators import data_loader if 'test' not in globals(): from mage_ai.data_preparation.decorators import test @data_loader def load_data_from_postgres(*args, **kwargs): """ Template for loading data from a PostgreSQL database. Specify your configuration settings in 'io_config.yaml'. Docs: https://docs.mage.ai/design/data-loading#postgresql """ query = 'SELECT * FROM mushroom' # Specify your SQL query here config_path = path.join(get_repo_path(), 'io_config.yaml') config_profile = 'default' with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader: return loader.load(query) @test def test_output(output, *args) -> None: """ Template code for testing the output of the block. """ assert output is not None, 'The output is undefined'
在函數 load_data_from_postgres() 中,我們將定義用於載入資料庫中的表的查詢。就我而言,我在檔案 io_config.yaml 中配置了銀行訊息,它被定義為預設配置,因此我們只需將預設名稱傳遞給變數 config_profile 即可。
執行該區塊後,我們將使用新增圖表功能,該功能將透過已定義的範本提供有關我們的資料的資訊。只需點擊播放按鈕旁邊的圖示(在圖像中用黃線標記)即可。
我們將選擇兩個選項來進一步探索我們的資料集:summay_overview 和 feature_profiles 選項。透過summary_overview,我們可以獲得資料集中的列數和行數信息,我們還可以按類型查看列的總數,例如分類列、數字列和布林列的總數。另一方面,Feature_profiles 呈現了更多關於數據的描述性信息,例如:類型、最小值、最大值等信息,我們甚至可以將缺失值可視化,這是我們處理的重點。
為了能夠更專注於缺失數據,讓我們使用範本:缺失值百分比,這是一個長條圖,每列中包含缺失資料的百分比。
該圖顯示了4 列,其中缺失值對應於其內容的80% 以上,而其他列則呈現缺失值,但數量較少,這些資訊現在允許我們尋求不同的策略來處理此問題空數據。
變壓器跌落柱TRANSFORMER 區塊,我們將選擇選項 列刪除 .
from mage_ai.data_cleaner.transformer_actions.base import BaseAction from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action from pandas import DataFrame if 'transformer' not in globals(): from mage_ai.data_preparation.decorators import transformer if 'test' not in globals(): from mage_ai.data_preparation.decorators import test @transformer def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame: """ Execute Transformer Action: ActionType.REMOVE Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns """ action = build_transformer_action( df, action_type=ActionType.REMOVE, arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'], axis=Axis.COLUMN, ) return BaseAction(action).execute(df) @test def test_output(output, *args) -> None: """ Template code for testing the output of the block. """ assert output is not None, 'The output is undefined'
execute_transformer_action() 中,我們將在參數變數中插入一個列表,其中包含要從資料集中排除的列的名稱,在此步驟之後,只需執行該區塊即可。
變壓器填充缺失值填充缺失值,在某些情況下,儘管存在缺失數據,但仍將其替換為諸如平均,或時尚,它可能能夠滿足數據需求,而不會對數據集造成太多更改,這取決於您的最終目標。
Existem algumas tarefas, como a de classificação, onde a substituição dos dados faltantes por um valor que seja relevante (moda, média, mediana) para o dataset, possa contribuir com o algoritmo de classificação, que poderia chegar a outras conclusões caso o dados fossem apagados como na outra estratégia de utilizamos.
Para tomar uma decisão com relação a qual medida vamos utilizar, vamos recorrer novamente a funcionalidade Add chart do Mage. Usando o template Most frequent values podemos visualizar a moda e a frequência desse valor em cada uma das colunas.
Seguindos passos semelhantes aos anteriores, vamos usar o tranformer Fill in missing values, para realizar a tarefa de subtiruir os dados faltantes usando a moda de cada uma das colunas: steam_surface, gill_spacing, cap_surface, gill_attachment, ring_type.
from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy from mage_ai.data_cleaner.transformer_actions.base import BaseAction from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action from pandas import DataFrame if 'transformer' not in globals(): from mage_ai.data_preparation.decorators import transformer if 'test' not in globals(): from mage_ai.data_preparation.decorators import test @transformer def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame: """ Execute Transformer Action: ActionType.IMPUTE Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values """ action = build_transformer_action( df, action_type=ActionType.IMPUTE, arguments=df.columns, # Specify columns to impute axis=Axis.COLUMN, options={'strategy': ImputationStrategy.MODE}, # Specify imputation strategy ) return BaseAction(action).execute(df) @test def test_output(output, *args) -> None: """ Template code for testing the output of the block. """ assert output is not None, 'The output is undefined'
Na função execute_transformer_action() , definimos a estratégia para a substituição dos dados num dicionário do Python. Para mais opções de substituição, basta acessar a documentação do transformer: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.
Ao realizar todas as transformações, vamos salvar nosso dataset agora tratado, na mesma base do Postgres mas agora com um nome diferente para podermos diferenciar. Usando o bloco Data Exporter e selecionando o Postgres, vamos definir o shema e a tabela onde queremos salvar, lembrando que as configurações do banco são salvas previamente no arquivo io_config.yaml.
from mage_ai.settings.repo import get_repo_path from mage_ai.io.config import ConfigFileLoader from mage_ai.io.postgres import Postgres from pandas import DataFrame from os import path if 'data_exporter' not in globals(): from mage_ai.data_preparation.decorators import data_exporter @data_exporter def export_data_to_postgres(df: DataFrame, **kwargs) -> None: """ Template for exporting data to a PostgreSQL database. Specify your configuration settings in 'io_config.yaml'. Docs: https://docs.mage.ai/design/data-loading#postgresql """ schema_name = 'public' # Specify the name of the schema to export data to table_name = 'mushroom_clean' # Specify the name of the table to export data to config_path = path.join(get_repo_path(), 'io_config.yaml') config_profile = 'default' with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader: loader.export( df, schema_name, table_name, index=False, # Specifies whether to include index in exported table if_exists='replace', #Specify resolution policy if table name already exists )
repo -> https://github.com/DeadPunnk/Mushrooms/tree/main
以上是Magic Mushrooms:使用 Mage 探索和處理空數據的詳細內容。更多資訊請關注PHP中文網其他相關文章!