This article introduces four methods for implementing feature selection for machine learning in Python. I hope you find it a useful reference.
In this article, we will introduce different methods of selecting features from a dataset, discuss the main types of feature selection algorithms, and show how to implement them in Python using the scikit-learn (sklearn) library.
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
The scikit-learn library provides the SelectKBest class, which can be combined with a range of statistical tests to select a specific number of features.
The following example uses the chi-squared (χ²) statistical test for non-negative features to select the four best features from the Pima Indians Diabetes dataset:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
# Import the required packages
# Import pandas to read csv
import pandas
# Import numpy for array related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
# Import chi2 for performing chi square test
from sklearn.feature_selection import chi2

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
# Create array from data values
array = dataframe.values
# Split the data into input and target
X = array[:, 0:8]
Y = array[:, 8]

# We will select the features using chi square
test = SelectKBest(score_func=chi2, k=4)
# Fit the function for ranking the features by score
fit = test.fit(X, Y)
# Summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
# Apply the transformation on to dataset
features = fit.transform(X)
# Summarize selected features
print(features[0:5, :])
Below are the scores for each attribute, followed by the four selected attributes (the ones with the highest scores): plas, test, mass and age.
Score per feature:
[111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]
Features:
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
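If you want to see which column names were kept, rather than reading them off the transformed array, the get_support() mask of SelectKBest can be mapped back to the names list. This is a minimal sketch, assuming the test object and names list from the code block above are still in scope:

# Map the boolean support mask back to the attribute names
# (sketch; assumes `test` and `names` from the chi-squared example above)
selected = [name for name, keep in zip(names[0:8], test.get_support()) if keep]
print(selected)  # expected to print the four highest-scoring attributes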
Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on the attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute most to predicting the target attribute. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much, as long as it is skillful and consistent:
# Feature Extraction with Recursive Feature Elimination (RFE)
# Import the required packages
# Import pandas to read csv
import pandas
# Import numpy for array related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
# Import LogisticRegression to use as the estimator for RFE
from sklearn.linear_model import LogisticRegression

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
# Create array from data values
array = dataframe.values
# Split the data into input and target
X = array[:, 0:8]
Y = array[:, 8]

# Feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
After execution, we will obtain:
Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
You can see that RFE selects the top three features as preg, mass and pedi. These are marked True in the support_ array and given a rank of 1 in the ranking_ array.
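As with SelectKBest, the ranking can be mapped back to the attribute names. A minimal sketch, assuming the fitted rfe object (fit) and the names list from the code above:

# Print the names of the attributes RFE kept
# (sketch; assumes `fit` and `names` from the RFE example above)
ranked = sorted(zip(fit.ranking_, names[0:8]))
print([name for rank, name in ranked if rank == 1])  # rank 1 means selected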
PCA uses linear algebra to transform the dataset into a compressed form. It is generally considered a data reduction technique. A useful property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.
In the following example, we use PCA and select three principal components:
# Feature Extraction with Principal Component Analysis (PCA)
# Import the required packages
# Import pandas to read csv
import pandas
# Import numpy for array related operations
import numpy
# Import sklearn's PCA algorithm
from sklearn.decomposition import PCA

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
# Create array from data values
array = dataframe.values
# Split the data into input and target
X = array[:, 0:8]
Y = array[:, 8]

# Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# Summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
You can see that the transformed dataset (three principal components) bears little resemblance to the source data:
Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
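A common question is how many components to keep. One option is to inspect the cumulative explained variance ratio and keep enough components to cover, say, 95% of the variance. A minimal sketch, assuming the same X matrix as above; the 0.95 threshold is an illustrative choice, not part of the original example:

import numpy
from sklearn.decomposition import PCA

# Fit PCA with all components, then see how the explained variance accumulates
pca_full = PCA().fit(X)
cumulative = numpy.cumsum(pca_full.explained_variance_ratio_)
# Number of components needed to explain at least 95% of the variance
n_components = int(numpy.searchsorted(cumulative, 0.95) + 1)
print(cumulative)
print("Components needed for 95%% variance: %d" % n_components)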
Feature importance is a technique for selecting features using a trained supervised model. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; that same measure can be reused as a feature selector. Let us look at it in detail.
Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease in impurity and mean decrease in accuracy.
A random forest consists of many decision trees. Each node in a decision tree is a condition on a single feature, designed to split the dataset into two parts so that similar response values end up in the same set. The metric used to select the (locally) optimal condition is called impurity. For classification it is usually the Gini impurity or information gain/entropy, and for regression trees it is the variance. When training a tree, we can therefore compute how much each feature reduces the weighted impurity in the tree. For a forest, the impurity reduction from each feature can be averaged across trees, and the features ranked according to this measure.
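To make mean decrease in impurity concrete, here is a minimal sketch on a small synthetic dataset; the make_classification parameters are illustrative assumptions, not values from the Otto example that follows:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Build a small synthetic problem where only a few features are informative
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# feature_importances_ holds the mean decrease in impurity per feature,
# averaged over all trees in the forest and normalized to sum to 1
for idx, importance in enumerate(forest.feature_importances_):
    print("feature %d: %.4f" % (idx, importance))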
Let us see how to use the Random Forest classifier for feature selection and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset.
This dataset describes 93 obfuscated details for more than 61,000 products grouped into 10 product categories (e.g., fashion, electronics, etc.). The input attributes are counts of different events of some kind.
The goal is to obtain predictions for new products as an array of probabilities for each of the 10 categories, and to evaluate the model using multi-class logarithmic loss (also known as cross-entropy).
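For reference, scikit-learn provides this metric as log_loss in sklearn.metrics. A minimal sketch with made-up labels and probabilities, not values from this article:

from sklearn.metrics import log_loss

# Hypothetical true labels and predicted class probabilities for 3 classes
y_true = [0, 1, 2]
y_pred = [[0.8, 0.1, 0.1],
          [0.2, 0.6, 0.2],
          [0.1, 0.2, 0.7]]
# Multi-class logarithmic loss (cross-entropy); lower is better
print(log_loss(y_true, y_pred))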
We will start by importing all the libraries:
# Import the supporting libraries
# Import pandas to load the dataset from csv file
from pandas import read_csv
# Import numpy for array based operations and calculations
import numpy as np
# Import Random Forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
# Import the SelectFromModel feature selector from sklearn
from sklearn.feature_selection import SelectFromModel

np.random.seed(1)
Let us define a function to split the dataset into training and test data; the training part will be used to train the model, and the test part to evaluate the trained model:
# Function to create Train and Test set from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    trainlength = np.uint16(np.floor(split * shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing
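As a side note, scikit-learn's built-in train_test_split offers the same functionality in one call. This sketch is an alternative, not what the article's code uses, and the dummy array below just stands in for the dataset loaded later:

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy array standing in for the dataset loaded later in the article
data_demo = np.arange(100).reshape(50, 2)
# Equivalent 70/30 split with sklearn's helper, shuffled with a fixed seed
train_alt, test_alt = train_test_split(data_demo, train_size=0.7,
                                       shuffle=True, random_state=0)
print(train_alt.shape, test_alt.shape)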
We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percent accuracy:
# Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count) / len(ytest)
    return acc
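For reference, sklearn.metrics.accuracy_score computes the same quantity. A minimal sketch with made-up labels:

from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_actual = [1, 0, 1, 1, 0]
y_predicted = [1, 0, 0, 1, 0]
# Fraction of predictions that match the true labels (same as getAccuracy above)
print(accuracy_score(y_actual, y_predicted))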
Now it is time to load the dataset. We will load the train.csv file, which contains over 61,000 training instances. We will use 50,000 instances in our example: 35,000 to train the classifier and 15,000 to test its performance:
# Load dataset as pandas data frame
data = read_csv('train.csv')
# Extract attribute names from the data frame
feat = data.keys()
feat_labels = feat.values
# Extract data values from the data frame
dataset = data.values
# Shuffle the dataset
np.random.shuffle(dataset)
# We will select 50000 instances to train the classifier
inst = 50000
# Extract 50000 instances from the dataset
dataset = dataset[0:inst, :]
# Create Training and Testing data for performance evaluation
train, test = getTrainTestData(dataset, 0.7)
# Split data into input and output variable with selected features
Xtrain = train[:, 0:94]
ytrain = train[:, 94]
shape = np.shape(Xtrain)
print("Shape of the dataset ", shape)
# Print the size of Data in MBs
print("Size of Data set before feature selection: %.2f MB" % (Xtrain.nbytes / 1e6))
Pay attention to the data size here: our dataset contains about 35,000 training instances with 94 attributes, so it is fairly large. Let's take a look:
Shape of the dataset  (35000, 94)
Size of Data set before feature selection: 26.32 MB
As you can see, our dataset has 35000 rows and 94 columns, which is over 26 MB of data.
In the next code block, we will configure the random forest classifier: we will use 250 trees with a maximum depth of 30, and 7 random features per split. The other hyperparameters keep sklearn's default values:
# Lets select the test data for model evaluation purpose
Xtest = test[:, 0:94]
ytest = test[:, 94]
# Create a random forest classifier with the following parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)
# Train the classifier and calculate the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
# Lets note down the model training time
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

# Evaluate the model performance for the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f" % (100 * acc))
The accuracy of our model is:

Accuracy of model before feature selection is 98.82
As you can see, we are getting very good accuracy, classifying almost 99% of the test data into the correct classes. This means we classified 14,823 out of 15,000 instances correctly.
So now the question is: should we try to improve further? Well, why not? If we can, we should certainly look for more improvement; here, we will use feature importance to select features. As you know, when building a tree we use an impurity measure to select the nodes: the attribute with the lowest impurity is chosen as a node in the tree. We can use a similar criterion for feature selection by giving more weight to features with lower impurity, which can be done with the feature_importances_ attribute of sklearn's random forest. Let's find out the importance of each feature:
# Once we have trained the model, we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('id', 0.33346650420175183)
('feat_1', 0.0036186958628801214)
('feat_2', 0.0037243050888530957)
('feat_3', 0.011579217472062748)
('feat_4', 0.010297382675187445)
('feat_5', 0.0010359139416194116)
('feat_6', 0.00038171336038056165)
('feat_7', 0.0024867672489765021)
('feat_8', 0.0096689721610546085)
('feat_9', 0.007906150362995093)
('feat_10', 0.0022342480802130366)
As you can see here, each feature has a different importance based on its contribution to the final prediction.
We will use these importance scores to rank our features; in the following section, we will select only those features with an importance greater than 0.01 for model training:
# Select features which have higher contribution in the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)
Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset and then check the size and shape of the new dataset:
# Transform input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
# Let's see the size and shape of new dataset
print("Size of Data set before feature selection: %.2f MB" % (Xtrain_1.nbytes / 1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)

Size of Data set before feature selection: 5.60 MB
Shape of the dataset  (35000, 20)
Did you notice the shape of the dataset? After the feature selection process, we are left with only 20 features, which reduces the size of the dataset from 26 MB to 5.60 MB, a reduction of about 80% compared to the original dataset.
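If you want to know which of the 94 columns survived the 0.01 threshold, SelectFromModel exposes a get_support() mask that can be mapped back to the column names. A minimal sketch, assuming the sfm selector and feat_labels array from the code above:

# Names of the features kept by SelectFromModel
# (sketch; assumes `sfm` and `feat_labels` from the code above;
# the last label is the target column and is excluded)
kept = [name for name, keep in zip(feat_labels[0:94], sfm.get_support()) if keep]
print(len(kept), kept)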
In the next code block, we will train a new random forest classifier with the same hyperparameters as before and test it on the test dataset. Let's see what accuracy we obtain after modifying the training set:
# Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
# Let's evaluate the model on test data
pre = clf.predict(Xtest_1)
count = 0
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f" % (100 * acc2))

Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97
Can you see it? We achieved 99.97% accuracy with the modified dataset, which means we classified 14,996 instances into the correct classes, whereas before we classified only 14,823 instances correctly.
This is a big improvement achieved through the feature selection process; we can summarize all the results in the following table:
Evaluation criterion | Before feature selection | After feature selection
---|---|---
Number of features | 94 | 20
Size of dataset | 26.32 MB | 5.60 MB
Training time | 2.91 s | 1.71 s
Accuracy | 98.82% | 99.97%
The table above shows the practical benefits of feature selection. You can see that we significantly reduced the number of features, which lowers the model complexity and the dimensionality of the dataset. After the dimensionality reduction, training time is shorter, and finally we reduced overfitting and obtained higher accuracy than before.