
LightGBM in Practice with Random Search Tuning: 96.67% Accuracy


Hello everyone, I am Peter~

LightGBM is a classic machine learning algorithm whose background, principles, and characteristics are well worth studying. Its design gives it qualities such as high efficiency, scalability, and high accuracy. This article briefly introduces the characteristics and principles of LightGBM, then walks through a case study based on LightGBM and random search optimization.

LightGBM algorithm

In the field of machine learning, Gradient Boosting Machines (GBMs) are a class of powerful ensemble learning algorithms that build a strong model by gradually adding weak learners (usually decision trees) to minimize the prediction error, typically by minimizing a residual or loss function. The approach is widely used because strong models can be built from simple weak learners such as decision trees.

In the era of big data, the size of data sets has grown dramatically, and traditional GBMs are difficult to scale effectively due to their high computing and storage costs.

  • For example, the level-wise (depth-wise) tree growth strategy produces balanced trees but often weakens the model's discriminative power, while the leaf-wise growth strategy improves accuracy but overfits easily.
  • In addition, most GBM implementations need to traverse the entire data set to calculate gradients in each iteration, which is inefficient when the amount of data is huge. Therefore, an algorithm that can efficiently process large-scale data while maintaining model accuracy is needed.

To solve these problems, Microsoft released LightGBM (Light Gradient Boosting Machine) in 2017: a faster gradient boosting framework with lower memory consumption and higher performance.

Official learning address: https://lightgbm.readthedocs.io/en/stable/

Principles of LightGBM

1. Decision tree algorithm based on histogram:

  • Principle: LightGBM uses histogram-based optimization, discretizing continuous feature values into a fixed number of bins (the buckets of the histogram), which reduces the amount of data that must be scanned when splitting a node.
  • Advantages: This method speeds up computation while reducing memory usage.
  • Implementation details: For each feature, the algorithm maintains a histogram recording the feature's statistics in each bucket. When splitting a node, these histograms can be used directly without traversing all the data. A minimal sketch of the idea follows this list.
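To make the idea concrete, here is a minimal NumPy sketch of histogram-based split finding. It is illustrative only, not LightGBM's internal implementation; the bin count and quantile-based bin edges are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)    # one continuous feature
gradients = rng.normal(size=1000)  # per-sample gradients from the loss

n_bins = 16
# Discretize the feature into bins (built once per feature)
bin_edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.digitize(feature, bin_edges)

# The histogram stores aggregate statistics per bin; evaluating a split
# now only requires scanning n_bins entries instead of 1000 samples.
grad_hist = np.zeros(n_bins)
count_hist = np.zeros(n_bins)
np.add.at(grad_hist, bin_idx, gradients)
np.add.at(count_hist, bin_idx, 1)

Split gain at any bin boundary can then be computed from cumulative sums over these histograms.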

2. Leaf-wise tree growth strategy with depth restriction:

  • Principle: Unlike traditional level-wise splitting, the leaf-wise growth strategy always selects, among all current leaf nodes, the one with the largest split gain and splits it.
  • Advantages: This strategy lets the tree concentrate on the regions of the data where the loss can be reduced the most, which usually leads to better accuracy.
  • Disadvantages: It can easily lead to overfitting, especially when there is noise in the data.
  • Improvement measures: LightGBM prevents overfitting by setting a maximum depth limit, as shown in the snippet after this list.
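In the scikit-learn API these two knobs are exposed directly; the values below are illustrative, not tuned choices.

import lightgbm as lgb

# num_leaves bounds the best-first (leaf-wise) growth;
# max_depth adds a hard depth cap to curb overfitting (-1 means no limit)
model = lgb.LGBMClassifier(
    num_leaves=31,
    max_depth=6,
)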

3. Gradient-based One-Side Sampling (GOSS):

  • Principle: GOSS keeps the samples with large gradients and randomly subsamples those with small gradients, reducing the amount of computation while ensuring that not too much information is lost.
  • Advantages: This method speeds up training without a significant loss of accuracy.
  • Application scenarios: Especially suitable for data sets with serious skew. A sketch of enabling GOSS follows this list.
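A minimal sketch of turning GOSS on, assuming the scikit-learn API. How GOSS is selected depends on the LightGBM version, and the retention fractions below are the documented defaults rather than tuned values.

import lightgbm as lgb

# Older releases select GOSS via boosting_type='goss'; LightGBM 4.x
# moved this to data_sample_strategy='goss' with gbdt boosting.
goss_model = lgb.LGBMClassifier(
    boosting_type='goss',
    top_rate=0.2,    # fraction of samples kept by gradient magnitude
    other_rate=0.1,  # fraction randomly sampled from the remainder
)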

4. Exclusive Feature Bundling (EFB):

  • Principle: EFB is a technique that reduces the number of features and improves computational efficiency. It bundles mutually exclusive features (i.e. features that are never non-zero at the same time) to reduce feature dimensionality.
  • Advantages: Improved memory usage efficiency and training speed.
  • Implementation details: By exploiting mutual exclusivity, the algorithm merges several sparse features into a single bundle, reducing the number of feature columns that actually have to be processed. A conceptual illustration follows this list.
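LightGBM applies EFB automatically, so the sketch below is purely conceptual: it shows how two features that are never non-zero on the same row can share one column by offsetting the second feature's value range.

import numpy as np

f1 = np.array([1.0, 0.0, 2.0, 0.0])
f2 = np.array([0.0, 3.0, 0.0, 4.0])
assert not np.any((f1 != 0) & (f2 != 0))  # mutually exclusive

offset = f1.max() + 1  # shift f2 into a disjoint value range
bundled = np.where(f1 != 0, f1, np.where(f2 != 0, f2 + offset, 0.0))
print(bundled)  # one column now encodes both features: [1. 6. 2. 7.]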

5. Support parallel and distributed learning:

  • Principle: LightGBM supports multi-threaded learning and can use multiple CPU cores for parallel training (see the snippet after this list).
  • Advantages: Significantly improves the training speed on multi-core processors.
  • Scalability: It also supports distributed learning and can use multiple machines to jointly train models.
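In the scikit-learn API the thread count is set through n_jobs, which maps to LightGBM's num_threads parameter; this one-liner is only a usage sketch.

import lightgbm as lgb

# n_jobs=-1 uses all available cores (maps to num_threads internally)
model = lgb.LGBMClassifier(n_jobs=-1)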

6. Cache optimization:

  • Principle: Data access patterns are optimized so that CPU caches are used more effectively, speeding up data exchange.
  • Advantages: Especially on large data sets, cache optimization can significantly improve performance.

7. Supports multiple loss functions:

  • Features: In addition to the commonly used regression and classification loss functions, LightGBM also supports custom loss functions to meet different business needs. A sketch of a custom objective follows.
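A hedged sketch of a custom objective with the scikit-learn API: the callable must return the gradient and hessian of the loss with respect to the raw predictions. This example re-implements plain squared error purely for illustration.

import numpy as np
import lightgbm as lgb

def squared_error(y_true, y_pred):
    # gradient and hessian of 0.5 * (y_pred - y_true)^2 w.r.t. y_pred
    grad = y_pred - y_true
    hess = np.ones_like(y_pred)
    return grad, hess

reg = lgb.LGBMRegressor(objective=squared_error)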

8. Regularization and pruning:

  • Principle: L1 and L2 regularization terms are provided to control model complexity and avoid overfitting (see the snippet after this list).
  • Implementation: A backward pruning strategy is also implemented to further prevent overfitting.
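In the scikit-learn API the L1/L2 terms are exposed as reg_alpha and reg_lambda (aliases of lambda_l1 / lambda_l2); the values below are illustrative, not tuned.

import lightgbm as lgb

model = lgb.LGBMClassifier(
    reg_alpha=0.1,       # L1 term on leaf weights
    reg_lambda=1.0,      # L2 term on leaf weights
    min_split_gain=0.0,  # minimum gain required to make a split
)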

9. Model interpretability:

  • Features: Because it is based on decision trees, LightGBM has good model interpretability: the model's decision logic can be understood through feature importance and similar methods, as in the example below.
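For instance, a fitted model exposes the standard scikit-learn feature_importances_ attribute; the iris data here is just a stand-in to make the snippet runnable.

import lightgbm as lgb
from sklearn.datasets import load_iris

data = load_iris()
clf = lgb.LGBMClassifier(verbose=-1).fit(data.data, data.target)
for name, score in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {score}")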

Features of LightGBM

Efficiency

  • Speed advantage: Through histogram optimization and the leaf-wise growth strategy, LightGBM greatly improves training speed while maintaining accuracy.
  • Memory usage: LightGBM requires less memory than other GBM implementations, which allows it to handle larger data sets.

Accuracy

  • Best-first growth strategy: The leaf-wise growth strategy adopted by LightGBM fits the data more closely and can usually achieve better accuracy than level-wise splitting.
  • Methods to avoid overfitting: By setting a maximum depth limit and backward pruning, LightGBM can avoid overfitting while improving model accuracy.

Scalability

  • Parallel and distributed learning: LightGBM is designed to support multi-threading and distributed computing, which allows it to fully utilize the computing power of modern hardware.
  • Multi-platform support: LightGBM can run on multiple operating systems such as Windows, macOS, and Linux, and supports multiple programming languages such as Python, R, and Java.

Ease of use

  • Parameter tuning: LightGBM provides a rich set of parameters that users can adjust for their specific problem.
  • Pre-trained model: Users can continue training from an existing model to speed up their modeling process (a continued-training sketch follows this list).
  • Model interpretation tools: LightGBM provides feature importance evaluation tools to help users understand the decision-making process of the model.
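As a hedged sketch of continued training: the scikit-learn API's fit() accepts an init_model (a Booster, an LGBMModel, or a saved model file), so new trees are stacked on top of an existing model. The iris split here is only to make the snippet self-contained.

import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

base = lgb.LGBMClassifier(n_estimators=50, verbose=-1).fit(X_train, y_train)
extended = lgb.LGBMClassifier(n_estimators=50, verbose=-1)
extended.fit(X_train, y_train, init_model=base.booster_)  # adds 50 more trees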

Import libraries

In [1]:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import warnings

warnings.filterwarnings("ignore")

Load data

Load the public iris data set:

In [2]:

# Load the dataset
data = load_iris()
X, y = data.data, data.target
y = [int(i) for i in y]  # convert labels to integers

In [3]:

X[:3]

Out[3]:

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2]])

In [4]:

y[:10]

Out[4]:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Split the data

In [5]:

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Then create the LightGBM dataset:

In [6]:

lgb_train = lgb.Dataset(X_train, label=y_train)

Parameter settings

In [7]:

# Parameter search space
param_dist = {
    'boosting_type': ['gbdt', 'dart'],      # boosting type: gradient boosting decision tree (gbdt) or Dropouts meet Multiple Additive Regression Trees (dart)
    'objective': ['binary', 'multiclass'],  # objective: binary or multiclass classification
    'num_leaves': range(20, 150),           # number of leaf nodes
    'learning_rate': [0.01, 0.05, 0.1],     # learning rate
    'feature_fraction': [0.6, 0.8, 1.0],    # feature sampling fraction
    'bagging_fraction': [0.6, 0.8, 1.0],    # data sampling fraction
    'bagging_freq': range(0, 80),           # data sampling frequency
    'verbose': [-1],                        # -1 suppresses detailed training output
}

Hyperparameter tuning with random search

In [8]:

# Initialize the model
model = lgb.LGBMClassifier()

# Tune hyperparameters with random search
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,  # parameter search space
    n_iter=100,
    cv=5,          # 5-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1,
)

# Train
random_search.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits

Output the best parameter combination:

In [9]:

# Print the best parameters
print("Best parameters found: ", random_search.best_params_)

Best parameters found: {'verbose': -1, 'objective': 'multiclass', 'num_leaves': 87, 'learning_rate': 0.05, 'feature_fraction': 0.6, 'boosting_type': 'gbdt', 'bagging_freq': 22, 'bagging_fraction': 0.6}

Modeling with the best parameters

In [10]:

# Train the model with the best parameters
best_model = random_search.best_estimator_
best_model.fit(X_train, y_train)

# Predict
y_pred = best_model.predict(X_test)
y_pred = [round(i) for i in y_pred]  # ensure integer class labels (predict already returns classes)

# Evaluate the model
print('Accuracy: %.4f' % accuracy_score(y_test, y_pred))

Accuracy: 0.9667

