
Practical analysis of machine learning random forest algorithm

Release: 2023-04-16 22:28:01

Translator|Zhu Xianzhong

Reviewer|Sun Shujuan

In classic machine learning, the random forest algorithm is often described as a "silver bullet" model.

This model is great for several reasons:

  • Requires less data preprocessing than many other algorithms, which makes it easy to set up
  • Can be used as a classification or regression model
  • Less prone to overfitting than a single decision tree
  • Can easily calculate the importance of features

In this article, I would like to take a closer look at the various components that make up the random forest algorithm. To do so, I will break the algorithm down into its most basic components and explain what each one computes. By the end of the article, we will have a deeper understanding of how random forests work and how to use them more intuitively. Note that the examples in this article focus on classification, but many of the principles apply equally to regression.

Running the Random Forest Algorithm

Let’s start by calling a classic random forest model. This is the highest level of abstraction, and it is what many people do when training random forests in Python.


Simulated Data

If I want to run a random forest algorithm to predict my target column, I just do the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=0)

# Train the random forest and compute its accuracy
simple_rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
simple_rf_model.fit(X_train, y_train)
print(f"accuracy: {simple_rf_model.score(X_test, y_test)}")

# accuracy: 0.93

Running a random forest classifier is very simple. As the code above shows, I only set the n_estimators parameter and fixed random_state to 0. I can tell you from personal experience that many people would look at the 0.93 accuracy, feel satisfied, and rush to deploy the model. But we won't do that today.

First, let’s revisit the following “innocuous” line of code:

simple_rf_model = RandomForestClassifier(n_estimators=100, random_state=0)

Setting a random state is a feature of most data science models; it ensures that others can reproduce your work. We therefore won't worry too much about the random_state parameter.
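
As a small illustration of what random_state buys you, the sketch below fits the same forest twice and checks that the predictions match. The synthetic data from make_classification and all variable names are assumptions made for this example; they are not the article's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; the article's df is not reproduced here)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Two forests with the same seed draw the same bootstrap samples and splits
rf_a = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

same_predictions = np.array_equal(rf_a.predict(X), rf_b.predict(X))
print(same_predictions)
```

Leaving random_state unset would let the two forests draw different samples, so their predictions could differ from run to run.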

Let’s dig into the n_estimators parameter instead. The scikit-learn documentation gives the following concise definition:

"The number of trees in the forest."

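To make that definition concrete, here is a minimal sketch that fits a forest and then counts the trees it contains. The synthetic dataset is an assumption for illustration; estimators_ is the scikit-learn attribute that holds the fitted trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for this example)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# estimators_ holds the fitted decision trees, one per unit of n_estimators
n_trees = len(forest.estimators_)
first_tree_type = type(forest.estimators_[0]).__name__
print(n_trees, first_tree_type)
```
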
Investigating the Number of Trees

Now, let us define a random forest more specifically. A random forest is an ensemble model that is the consensus of many decision trees. This definition is incomplete, but we will come back to it later.


Many trees communicating with each other and reaching consensus

This might lead you to think that breaking a random forest down looks something like the following:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

tree1 = DecisionTreeClassifier().fit(X_train, y_train)
tree2 = DecisionTreeClassifier().fit(X_train, y_train)
tree3 = DecisionTreeClassifier().fit(X_train, y_train)

# Predict on X_test with each decision tree
predictions_1 = tree1.predict(X_test)
predictions_2 = tree2.predict(X_test)
predictions_3 = tree3.predict(X_test)
print(predictions_1, predictions_2, predictions_3)

# Take the majority vote (rounding the mean works because the labels are 0/1)
final_prediction = np.round((predictions_1 + predictions_2 + predictions_3) / 3)

In the example above, we trained 3 decision trees on X_train, which corresponds to n_estimators=3. We then had each tree predict on the same test set and took the majority vote, that is, the label chosen by at least 2 of the 3 trees.
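
Rounding the mean only works for binary 0/1 labels. As a hedged sketch of the same voting idea for any number of classes, the hypothetical helper majority_vote below counts the votes per sample; the name and the toy vote matrix are assumptions made for illustration.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most frequent label in each column of an (n_trees, n_samples) array.

    Labels are assumed to be non-negative integers; ties go to the smallest label.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes per class for every sample (one column per sample)
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

votes = np.array([
    [0, 1, 2, 1],   # tree 1
    [0, 1, 2, 0],   # tree 2
    [1, 1, 0, 0],   # tree 3
])
final = majority_vote(votes)
print(final)
```

Note that scikit-learn's RandomForestClassifier averages the class probabilities predicted by its trees rather than counting hard votes, so this is a simplification.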

That makes sense, but it doesn't seem entirely correct. If all decision trees were trained on the same data, wouldn't they all come to the same conclusion, thereby negating the overall advantage?
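
The concern can be checked directly. In this minimal sketch, which uses synthetic data as an assumption, three trees are trained on exactly the same rows with the same seed and their predictions are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this example)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train = X[:240], X[240:], y[:240]

# Same rows, same columns, same seed: the trees are copies of one another
trees = [DecisionTreeClassifier(random_state=0).fit(X_train, y_train) for _ in range(3)]
preds = [tree.predict(X_test) for tree in trees]

all_identical = all(np.array_equal(preds[0], p) for p in preds[1:])
print(all_identical)
```

Without a fixed random_state, scikit-learn trees trained on identical data can still differ slightly, because ties between equally good splits are broken at random. That is far too little variation for an ensemble to benefit from.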

Sampling with Replacement Explained

Let us extend the previous definition: "A random forest is an ensemble model that is the consensus of many uncorrelated decision trees."

Decision trees can become uncorrelated in two ways:

1. Your dataset is large enough that you can feed a distinct portion of the data to each decision tree. This approach is not popular because it usually requires large amounts of data.

2. You can use a technique called sampling with replacement. Sampling with replacement is when a sample drawn from the population is returned to the sample population before the next sample is drawn.

To explain sampling with replacement, let's say I have 5 marbles in 3 colors, so the population looks like this:

blue, blue, red, green, red

If I want to sample some marbles, I would normally pull a few out of the pile and might end up with:

blue, red

This is because once I picked up the red, I didn't put it back into the original pile of marbles.

However, if I sample with replacement, I can actually pick up any marble twice. Since the red is back in my pile, I still have a chance to pick it up again.

red, red
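
The marble example can be sketched with Python's standard random module. The seed and the number of draws are arbitrary choices made for this illustration.

```python
import random

marbles = ["blue", "blue", "red", "green", "red"]
rng = random.Random(0)  # seed chosen arbitrarily for the example

# Without replacement: each marble (position in the pile) is drawn at most once
without_replacement = rng.sample(marbles, k=2)

# With replacement: every draw sees the full pile, so a marble can repeat
with_replacement = rng.choices(marbles, k=10)

print(without_replacement)
print(with_replacement)
```

Drawing 10 marbles from a pile of 5 is only possible because each marble goes back into the pile before the next draw.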

In a random forest, each tree is trained on such a bootstrap sample, which typically covers about 2/3 of the distinct rows in the original training set. If my original training data has 1000 rows, a given tree sees roughly 630 to 670 distinct rows. (In scikit-learn, the sample size is set by the max_samples parameter; by default the sample has as many rows as the training set, so the shortfall comes from rows being drawn more than once.) The sampling rate is therefore a good parameter to experiment with when building a random forest.
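
One way to see where a figure of roughly two thirds can come from is to draw a bootstrap sample that is as large as the data and count how many distinct rows it contains. The sketch below does this with NumPy; the row count and seed are arbitrary, and treating this as the source of the article's figure is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# A bootstrap sample: n_rows row indices drawn with replacement
sample_idx = rng.integers(0, n_rows, size=n_rows)

unique_fraction = len(np.unique(sample_idx)) / n_rows
expected = 1 - (1 - 1 / n_rows) ** n_rows  # approaches 1 - 1/e, about 0.632

print(round(unique_fraction, 3), round(expected, 3))
```

The rows that a tree never sees are the ones later used for the out-of-bag estimate.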

Unlike the previous code, the following code is closer to a random forest with n_estimators=3.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Draw 3 samples with replacement, one per tree, from the training rows only
# (sampling from the full df would leak test rows into training)
train_df = X_train.assign(target=y_train)
df_sample1 = train_df.sample(frac=.67, replace=True)
df_sample2 = train_df.sample(frac=.67, replace=True)
df_sample3 = train_df.sample(frac=.67, replace=True)

X_train_sample1, y_train_sample1 = df_sample1.drop('target', axis=1), df_sample1['target']
X_train_sample2, y_train_sample2 = df_sample2.drop('target', axis=1), df_sample2['target']
X_train_sample3, y_train_sample3 = df_sample3.drop('target', axis=1), df_sample3['target']

tree1 = DecisionTreeClassifier().fit(X_train_sample1, y_train_sample1)
tree2 = DecisionTreeClassifier().fit(X_train_sample2, y_train_sample2)
tree3 = DecisionTreeClassifier().fit(X_train_sample3, y_train_sample3)

# Predict on X_test with each decision tree
predictions_1 = tree1.predict(X_test)
predictions_2 = tree2.predict(X_test)
predictions_3 = tree3.predict(X_test)

# Take the majority vote (assumes binary 0/1 labels)
final_prediction = np.round((predictions_1 + predictions_2 + predictions_3) / 3)
preds = pd.DataFrame({"tree1": predictions_1, "tree2": predictions_2, "tree3": predictions_3,
                      "final": final_prediction, "label": y_test.to_numpy()}).head(20)

Bagging Classifier

We will now introduce a new algorithm, a supervised learning algorithm called Bootstrap Aggregation (also known as "Bagging"). Rest assured, this ties back to the random forest algorithm. We introduce this new concept because, as the figure later in the article shows, everything we have done so far is actually what a bagging classifier does!


import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Number of trees used in the ensemble
n_estimators = 3

# Initialize the bagging classifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=n_estimators, bootstrap=True)

# Fit the bagging classifier on the training data
bag_clf.fit(X_train, y_train)

# Predict on the test data and compare with the true labels
y_pred = bag_clf.predict(X_test)
pd.DataFrame({"prediction": y_pred, "label": y_test.to_numpy()})

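BaggingClassifier is great because you can use it with estimators other than decision trees! You can plug in many algorithms, and bagging will turn them into an ensemble solution. The random forest algorithm actually extends the bagging algorithm (when bootstrap=True), because it relies in part on bagging to form uncorrelated decision trees.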
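
To illustrate the point about other estimators, here is a minimal sketch that bags logistic regression models instead of decision trees. The synthetic data, the choice of LogisticRegression, and the parameter values are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this example)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The base estimator is passed positionally; here it is not a decision tree
bag_lr = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=5,
                           bootstrap=True, random_state=0)
bag_lr.fit(X_train, y_train)

n_models = len(bag_lr.estimators_)
accuracy = bag_lr.score(X_test, y_test)
print(n_models, accuracy)
```
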



Feature sampling means sampling not only the rows but also the columns. Unlike the rows, the columns of a random forest are sampled without replacement, which means we will not have duplicate columns when training a single tree.



import math
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Draw 3 row samples with replacement from the training data
train_df = X_train.assign(target=y_train)
df_sample1 = train_df.sample(frac=.67, replace=True)
df_sample2 = train_df.sample(frac=.67, replace=True)
df_sample3 = train_df.sample(frac=.67, replace=True)

# Split each sample into features and labels
X_train_sample1, y_train_sample1 = df_sample1.drop('target', axis=1), df_sample1['target']
X_train_sample2, y_train_sample2 = df_sample2.drop('target', axis=1), df_sample2['target']
X_train_sample3, y_train_sample3 = df_sample3.drop('target', axis=1), df_sample3['target']

# Sample sqrt(num_features) columns for each tree; note that replace=False here
num_features = len(X_train.columns)
n_cols = int(math.sqrt(num_features))
X_train_sample1 = X_train_sample1.sample(n=n_cols, replace=False, axis=1)
X_train_sample2 = X_train_sample2.sample(n=n_cols, replace=False, axis=1)
X_train_sample3 = X_train_sample3.sample(n=n_cols, replace=False, axis=1)

# Build the decision trees, this time on the sampled columns
tree1 = DecisionTreeClassifier().fit(X_train_sample1, y_train_sample1)
tree2 = DecisionTreeClassifier().fit(X_train_sample2, y_train_sample2)
tree3 = DecisionTreeClassifier().fit(X_train_sample3, y_train_sample3)

# Predict on X_test with each tree, using only the columns that tree was trained on
predictions_1 = tree1.predict(X_test[X_train_sample1.columns])
predictions_2 = tree2.predict(X_test[X_train_sample2.columns])
predictions_3 = tree3.predict(X_test[X_train_sample3.columns])

# Take the majority vote (assumes binary 0/1 labels)
final_prediction = np.round((predictions_1 + predictions_2 + predictions_3) / 3)
preds = pd.DataFrame({"tree1": predictions_1, "tree2": predictions_2, "tree3": predictions_3,
                      "final": final_prediction, "label": y_test.to_numpy()}).head(20)
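
For comparison, the sketch below asks scikit-learn for roughly the same setup: 3 trees, bootstrapped rows, and the square root of the number of features. One difference worth flagging is that RandomForestClassifier draws a fresh set of candidate columns at every split, whereas the hand-rolled code above samples columns once per tree. The synthetic data with 16 features is an assumption made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 16 features (an assumption for this example)
X, y = make_classification(n_samples=400, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(
    n_estimators=3,
    bootstrap=True,        # rows are sampled with replacement
    max_features="sqrt",   # sqrt(16) = 4 candidate columns considered at each split
    random_state=0,
).fit(X_train, y_train)

n_trees = len(rf.estimators_)
features_per_split = rf.estimators_[0].max_features_
accuracy = rf.score(X_test, y_test)
print(n_trees, features_per_split, accuracy)
```
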







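At this point, we need to discuss a new term called entropy. At a high level, entropy is one way to measure the level of impurity or randomness in a node. Incidentally, there is another popular way to measure the impurity of a node, called Gini impurity, but we won't dissect it in this article because it overlaps with many of the concepts behind entropy, even though the calculation is slightly different. The general idea is that the higher the entropy or Gini impurity, the more variance (mixing of labels) there is in the node, and our goal is to reduce this uncertainty.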
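
To give a feel for how the two measures behave, here is a small sketch that evaluates both on a pure node, a mostly pure node, and an evenly mixed node. The function names and the three example distributions are assumptions made for illustration.

```python
from math import log2

def entropy(probs):
    # Shannon entropy in bits; classes with probability 0 contribute nothing
    return sum(-p * log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini impurity: chance that two random draws from the node disagree
    return 1 - sum(p * p for p in probs)

for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, round(entropy(probs), 3), round(gini(probs), 3))
```

Both measures are 0 for a pure node and reach their maximum for an even split, which is why they usually lead to similar trees.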





from collections import Counter
from math import log2

labels = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
# Number of labels
num_labels = len(labels)

def calculate_entropy(labels, num_labels):
    # Count the occurrences of each class
    counts = Counter(labels)
    # Turn the counts into fractions; for this example the result is [.5, .5]
    probs = [count / num_labels for count in counts.values()]
    # The actual entropy calculation
    return -sum(p * log2(p) for p in probs)

calculate_entropy(labels, num_labels)



Information gain is the reduction in entropy achieved by a split:

information_gain = entropy(parent) - [weighted_average_of_entropy(children)]





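  • Compute the entropy of the parent node
  • Split the parent node into child nodes
  • Create a weight for each child node, measured as number_of_samples_in_child_node/number_of_samples_in_parent_node
  • Compute the entropy of each child node
  • Create [weighted_average_of_entropy(children)] by computing weight_of_child1*entropy_of_child1 + weight_of_child2*entropy_of_child2
  • Subtract this weighted entropy from the entropy of the parent node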


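from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels (helper used by information_gain)
    return -sum((c / len(labels)) * log2(c / len(labels)) for c in Counter(labels).values())

def information_gain(left_labels, right_labels, parent_entropy):
    proportion_left_node = float(len(left_labels)) / (len(left_labels) + len(right_labels))
    proportion_right_node = 1 - proportion_left_node
    # Weighted average of the child-node entropies
    weighted_average_of_child_nodes = ((proportion_left_node * entropy(left_labels)) + (proportion_right_node * entropy(right_labels)))
    return parent_entropy - weighted_average_of_child_nodes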
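
As a worked example of the calculation, the sketch below compares a weak split with a perfect split of the same parent node. The label lists are made up for illustration, and the two helper functions are restated so the block runs on its own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy (in bits) of a list of class labels
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(left_labels, right_labels, parent_entropy):
    # Parent entropy minus the size-weighted average of the child entropies
    n = len(left_labels) + len(right_labels)
    weighted = ((len(left_labels) / n) * entropy(left_labels)
                + (len(right_labels) / n) * entropy(right_labels))
    return parent_entropy - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]
parent_entropy = entropy(parent)  # 4 labels of each class

# A weak split leaves both children mixed; a perfect split makes both pure
weak_gain = information_gain([0, 0, 0, 1], [0, 1, 1, 1], parent_entropy)
perfect_gain = information_gain([0, 0, 0, 0], [1, 1, 1, 1], parent_entropy)
print(round(weak_gain, 3), round(perfect_gain, 3))
```

The perfect split recovers the entire parent entropy, while the weak split recovers only a small part of it, so a tree would prefer the feature that produces the perfect split.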





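  • Start with a dataset that has a target variable to predict
  • Compute the entropy (or Gini impurity) of the original dataset (the root node)
  • Look at each feature and compute its information gain
  • Select the feature with the best information gain, which is the feature that reduces entropy the most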


import pandas as pd
import numpy as np
from math import log2

def entropy(data, target_col):
    # Calculate the entropy of the target column for the given rows
    _, counts = np.unique(data[target_col], return_counts=True)
    return np.sum([-count/len(data) * log2(count/len(data)) for count in counts])

def compute_information_gain(data, feature, target_col):
    parent_entropy = entropy(data, target_col)
    # Compute the information gain of splitting on the given feature
    split_values = np.unique(data[feature])
    # Initialize at 0
    weighted_child_entropy = 0
    # Compute the weighted entropy; the weight is the share of points in each new node
    for value in split_values:
        sub_data = data[data[feature] == value]
        node_weight = len(sub_data)/len(data)
        weighted_child_entropy += node_weight * entropy(sub_data, target_col)
    return parent_entropy - weighted_child_entropy

def grow_tree(data, features, target_col, depth=0, max_depth=3):
    # max_depth=3 "pre-prunes" the tree, i.e. limits its complexity
    if depth >= max_depth or len(np.unique(data[target_col])) == 1:
        # Stop growing at the maximum depth or when all labels are the same
        # (identical labels mean an entropy of 0); the leaf is the most common label
        return data[target_col].mode()[0]
    # Pick the best feature (the best "question") by information gain
    node = {}
    gains = [compute_information_gain(data, feature, target_col) for feature in features]
    best_feature = features[np.argmax(gains)]

    for value in np.unique(data[best_feature]):
        sub_data = data[data[best_feature] == value]
        node[value] = grow_tree(sub_data, features, target_col, depth+1, max_depth)

    return node

# Simulate some data and build a DataFrame; note how the target is set up
data = {
    'A': [1, 2, 1, 2, 1, 2, 1, 2],
    'B': [3, 3, 4, 4, 3, 3, 4, 4],
    'C': [5, 5, 5, 5, 6, 6, 6, 6],
    'target': [0, 0, 0, 1, 1, 1, 1, 0]
}
df = pd.DataFrame(data)

# Define our features and label
features = ["A", "B", "C"]
target_col = "target"

# Grow the tree
tree = grow_tree(df, features, target_col, max_depth=3)
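
As a rough cross-check, the same toy data can be handed to scikit-learn's DecisionTreeClassifier with criterion="entropy", which also selects splits by information gain. This is only a sketch for comparison: scikit-learn makes binary threshold splits instead of one branch per value, which happens to be equivalent here because every feature takes just two values.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    'A': [1, 2, 1, 2, 1, 2, 1, 2],
    'B': [3, 3, 4, 4, 3, 3, 4, 4],
    'C': [5, 5, 5, 5, 6, 6, 6, 6],
    'target': [0, 0, 0, 1, 1, 1, 1, 0],
})
features = ["A", "B", "C"]

# criterion="entropy" makes scikit-learn choose splits by information gain
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(df[features], df["target"])

depth = clf.get_depth()
root_feature = features[clf.tree_.feature[0]]  # feature used at the root split
print(root_feature, depth)
print(export_text(clf, feature_names=features))
```

At the root, only feature C yields a positive information gain on this data, so both implementations start by splitting on C.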











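  • A random forest is actually a collection of uncorrelated decision trees that make predictions and reach a consensus. This consensus is the average score for regression problems and the majority vote for classification problems.
  • Random forests reduce correlation by using bagging and feature sampling. With these two techniques, individual decision trees look at specific dimensions of our data and make predictions based on different factors.
  • Decision trees are grown by splitting the data on the feature that yields the highest information gain. Information gain is measured as the largest reduction in impurity. Impurity is usually measured by entropy or Gini impurity.
  • Random forests offer a limited degree of interpretability through feature importance, which is a measure of the average information gain of a feature.
  • Random forests can also perform a form of cross-validation at training time, a unique technique known as the OOB (out-of-bag) error. This is possible because of the way the algorithm samples the data upstream.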
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=0)

# Train and score the random forest
simple_rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
simple_rf_model.fit(X_train, y_train)
print(f"accuracy: {simple_rf_model.score(X_test, y_test)}")

# accuracy: 0.93
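
The points about feature importance and the OOB error can be sketched with two scikit-learn attributes, oob_score_ and feature_importances_. The synthetic data and parameter values below are assumptions made for this example, not results from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for this example)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

# oob_score=True scores each row using only the trees that did not see it
rf = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
importances = rf.feature_importances_  # impurity-based, normalized to sum to 1
print(round(oob_accuracy, 3))
print(importances.round(3))
```

Because the out-of-bag rows act as a built-in validation set, this estimate needs no separate train/test split, although a held-out test set is still a sensible final check.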






Original title: Demystifying the Random Forest, author: Siddarth Ramesh
