When choosing a binary classification model for tabular data, I decided to quickly try a fast, non-deep-learning approach: gradient-boosted decision trees (GBDT). This article walks through building a Jupyter Notebook script that uses BigQuery as the data source and the XGBoost algorithm for modeling.
For those who prefer to jump straight into the script without explanation, here it is. Please adjust `project_name`, `dataset_name`, and `table_name` to fit your project.
```python
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, log_loss
from google.cloud import bigquery


# Function to load data from BigQuery
def load_data_from_bigquery(query):
    client = bigquery.Client()
    query_job = client.query(query)
    df = query_job.to_dataframe()
    return df


def compute_metrics(labels, predictions, prediction_probs):
    precision = precision_score(labels, predictions, average='macro')
    recall = recall_score(labels, predictions, average='macro')
    f1 = f1_score(labels, predictions, average='macro')
    loss = log_loss(labels, prediction_probs)
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'loss': loss
    }


# Query in BigQuery
query = """
SELECT *
FROM `<project_name>.<dataset_name>.<table_name>`
"""

# Loading data
df = load_data_from_bigquery(query)

# Target data
y = df["reaction"]

# Input data
X = df.drop(columns=["reaction"], axis=1)

# Splitting data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

# Training the XGBoost model
model = xgb.XGBClassifier(eval_metric='logloss')

# Setting the parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0]
}

# Initializing GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=3, scoring='accuracy', verbose=1, n_jobs=-1)

# Executing the grid search
grid_search.fit(X_train, y_train)

# Displaying the best parameters
print("Best parameters:", grid_search.best_params_)

# Model with the best parameters
best_model = grid_search.best_estimator_

# Predictions on validation data
val_predictions = best_model.predict(X_val)
val_prediction_probs = best_model.predict_proba(X_val)

# Predictions on training data
train_predictions = best_model.predict(X_train)
train_prediction_probs = best_model.predict_proba(X_train)

# Evaluating the model (validation data)
val_metrics = compute_metrics(y_val, val_predictions, val_prediction_probs)
print("Optimized Validation Metrics:", val_metrics)

# Evaluating the model (training data)
train_metrics = compute_metrics(y_train, train_predictions, train_prediction_probs)
print("Optimized Training Metrics:", train_metrics)
```
Previously, the data was stored as CSV files in Cloud Storage, but slow data loading was hurting the efficiency of our training workflow, which prompted the move to BigQuery for faster data handling.
```python
from google.cloud import bigquery

client = bigquery.Client()
```
This code initializes the BigQuery client using Google Cloud credentials, which can be configured via environment variables or the Google Cloud SDK.
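If the notebook is not running on a GCP runtime with ambient credentials, two common options are running `gcloud auth application-default login` via the Google Cloud SDK, or pointing the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable at a service-account key file. A minimal sketch of the latter (the key path is a hypothetical placeholder):

```python
import os

# Hypothetical path to a service-account key file; adjust for your project.
# This must be set before bigquery.Client() is constructed.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```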
```python
def load_data_from_bigquery(query):
    query_job = client.query(query)
    df = query_job.to_dataframe()
    return df
```
This function executes a SQL query and returns the result as a Pandas DataFrame, enabling efficient data handling.
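After loading, it can be worth a quick sanity check of the DataFrame before modeling; a small sketch reusing the `query` string from the full script (the column names depend on your table):

```python
# Load the table and inspect it before handing it to the model.
df = load_data_from_bigquery(query)
print(df.shape)    # (rows, columns)
print(df.dtypes)   # feature types; string columns would still need encoding
```

XGBoost expects numeric feature columns, so this is a convenient place to spot any that still need preprocessing.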
XGBoost is a high-performance machine learning algorithm based on gradient boosting, widely used for both classification and regression problems.
Paper: https://arxiv.org/pdf/1603.02754
```python
import xgboost as xgb

model = xgb.XGBClassifier(eval_metric='logloss')
```
Here, the XGBClassifier class is instantiated, with log loss set as the evaluation metric.
```python
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
```
This function splits the data into a training set and a validation set, which is essential for assessing model performance and guarding against overfitting.
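If the binary classes are imbalanced, it may also be worth stratifying the split so both sets preserve the class ratio. A minimal variation using scikit-learn's standard `stratify` argument (whether you need it depends on your data):

```python
# Same 80/20 split, but preserving the class distribution of y in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
```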
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0]
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)
```
GridSearchCV performs cross-validation to find the best combination of parameters for the model.
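Once the search has run, the fitted `GridSearchCV` object exposes the results directly; a short sketch using its standard attributes:

```python
import pandas as pd

# Mean cross-validated accuracy of the best parameter combination.
print("Best CV accuracy:", grid_search.best_score_)

# Per-combination results, ranked, for a closer look at the search.
results = pd.DataFrame(grid_search.cv_results_)
print(results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head())
```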
Model performance is evaluated using precision, recall, F1 score, and log loss on the validation dataset.
```python
from sklearn.metrics import precision_score, recall_score, f1_score, log_loss


def compute_metrics(labels, predictions, prediction_probs):
    return {
        'precision': precision_score(labels, predictions, average='macro'),
        'recall': recall_score(labels, predictions, average='macro'),
        'f1': f1_score(labels, predictions, average='macro'),
        'loss': log_loss(labels, prediction_probs)
    }


val_metrics = compute_metrics(y_val, val_predictions, val_prediction_probs)
print("Optimized Validation Metrics:", val_metrics)
```
Running the notebook produces output like the following, showing the best parameters and the model evaluation metrics.
```
Best parameters: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.9}
Optimized Validation Metrics: {'precision': 0.8919952583956949, 'recall': 0.753797304483842, 'f1': 0.8078981867164722, 'loss': 0.014006406471894417}
Optimized Training Metrics: {'precision': 0.8969556573175115, 'recall': 0.7681976753444204, 'f1': 0.8199353049298048, 'loss': 0.012475375680566196}
```
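Here the training metrics are only slightly better than the validation metrics, but if that gap widens, early stopping is a common guard against overfitting. A sketch assuming a recent xgboost version (in 1.6+, `early_stopping_rounds` is a constructor argument; older releases took it in `fit()`):

```python
# Stop adding trees once validation log loss hasn't improved for 10 rounds.
model = xgb.XGBClassifier(
    eval_metric='logloss',
    early_stopping_rounds=10,
    n_estimators=300,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Best iteration:", model.best_iteration)
```

Note that reusing the same validation split for both early stopping and final evaluation is a shortcut; a separate hold-out set is cleaner.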
In some cases, loading data from Google Cloud Storage may be more appropriate than loading it from BigQuery. The following function reads a CSV file from Cloud Storage and returns it as a Pandas DataFrame; it can be used interchangeably with the load_data_from_bigquery function.
```python
import io

import pandas as pd
from google.cloud import storage


def load_data_from_gcs(bucket_name, file_path):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(file_path)
    data = blob.download_as_text()
    df = pd.read_csv(io.StringIO(data), encoding='utf-8')
    return df
```
Example usage:
```python
bucket_name = '<bucket-name>'
file_path = '<file-path>'
df = load_data_from_gcs(bucket_name, file_path)
```
If you want to use LightGBM instead of XGBoost, simply replace XGBClassifier with LGBMClassifier in the same setup.
```python
import lightgbm as lgb

model = lgb.LGBMClassifier()
```
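Because `LGBMClassifier` follows the scikit-learn interface, the surrounding pipeline stays unchanged. A sketch of the swap (these parameter names exist in LightGBM, though well-tuned ranges may differ from XGBoost's, and LightGBM only applies `subsample` when `subsample_freq` is positive):

```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

model = lgb.LGBMClassifier()

param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0],
    'subsample_freq': [1],  # required for subsample to take effect in LightGBM
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)
```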
A future article will cover training with BigQuery ML (BQML).