PCA (Principal Component Analysis) Dimensionality Reduction Techniques in Python
Principal component analysis (PCA) is a very commonly used dimensionality reduction technique. It projects the data onto the directions of greatest variance, which helps reveal the inherent structure of the data and provides a more compact and effective representation for subsequent analysis and modeling.
Below we introduce some techniques for using PCA in Python.
Before performing PCA, you first need to standardize the data. PCA finds the principal components by maximizing variance, so a feature measured on a larger scale has a larger variance and would dominate the components regardless of how informative it is. Standardizing puts every feature on the same scale so that each one contributes comparably.
Python offers several ways to standardize data. The most basic is the StandardScaler class of the sklearn library, which rescales each feature to a mean of 0 and a variance of 1 (it does not change the shape of the distribution). The code is as follows:
from sklearn.preprocessing import StandardScaler

# data is assumed to be a 2-D array of shape (n_samples, n_features)
scaler = StandardScaler()
data_std = scaler.fit_transform(data)
This gives us the standardized data set data_std.
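To see why this step matters, here is a minimal sketch on synthetic data. The two features, their scales, and the random seed are assumptions chosen for illustration and do not come from the article's data set. It compares the explained variance ratios of PCA with and without standardization:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two independent
# features measured on very different scales.
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.normal(loc=0.0, scale=1000.0, size=500),  # large-scale feature
    rng.normal(loc=0.0, scale=1.0, size=500),     # small-scale feature
])

# Without scaling, the large-scale feature dominates the first component.
ratio_raw = PCA().fit(data).explained_variance_ratio_

# After standardization each feature has mean 0 and variance 1,
# so both contribute comparably.
data_std = StandardScaler().fit_transform(data)
ratio_std = PCA().fit(data_std).explained_variance_ratio_

print("raw:         ", np.round(ratio_raw, 4))
print("standardized:", np.round(ratio_std, 4))
```

On the raw data, nearly all of the variance is attributed to the first component simply because of the units. After standardization the two components share it roughly evenly.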
Reducing the dimensionality of data with PCA takes very little code. The sklearn library provides a PCA class, and we only need to set the number of principal components to retain when creating it. For example, the following code reduces the data to 2 principal components:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_std)
Here, data_pca holds the new data after PCA dimensionality reduction, with one row per sample and one column per retained principal component.
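For intuition about what the projection computes, the following is a rough NumPy-only sketch of PCA via an eigendecomposition of the covariance matrix. The helper name pca_numpy and the random demo data are hypothetical, and this is not how sklearn implements PCA internally (sklearn uses SVD-based solvers), so signs and numerical details may differ from sklearn's results:

```python
import numpy as np


def pca_numpy(data_std, n_components):
    """Project data onto the top eigenvectors of its covariance matrix.

    Hypothetical helper for illustration; returns the projected data and
    the explained variance ratio of every component.
    """
    centered = data_std - data_std.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]      # directions of greatest variance
    return centered @ components, eigvals / eigvals.sum()


# Demo data (an assumption): 200 samples, 4 features, two of them correlated.
rng = np.random.default_rng(42)
demo = rng.normal(size=(200, 4))
demo[:, 1] += 2.0 * demo[:, 0]
demo_std = (demo - demo.mean(axis=0)) / demo.std(axis=0)

projected, ratios = pca_numpy(demo_std, n_components=2)
print(projected.shape)       # (200, 2)
print(np.round(ratios, 3))   # explained variance ratios, largest first
```

The projected columns are uncorrelated with each other, and the ratios play the same role as explained_variance_ratio_ in sklearn.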
When actually using PCA for dimensionality reduction, we need to choose an appropriate number of principal components to get the best result. Usually, we can judge this by plotting the cumulative variance contribution rate.
The cumulative variance contribution rate is the share of the total variance accounted for by the first n principal components. It can be computed as follows:
import numpy as np

# Fit PCA with all components kept, then accumulate the per-component ratios
pca = PCA()
pca.fit(data_std)
cum_var_exp = np.cumsum(pca.explained_variance_ratio_)
By plotting the cumulative variance contribution rate, we can observe how it changes as the number of principal components increases from 1, and use that trend to estimate an appropriate number of components. The code is as follows:
import matplotlib.pyplot as plt

# One x position per component, so the plot works for any number of features
components = range(1, len(cum_var_exp) + 1)
plt.bar(components, pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.step(components, cum_var_exp, where='mid')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.show()
The stepped line in the figure shows the cumulative variance contribution rate, and the bars show the contribution of each individual component. The x-axis is the principal component index and the y-axis is the proportion of variance explained. In this example, the cumulative contribution of the first two principal components is close to 1, so selecting two principal components can meet the needs of most analysis tasks.
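Reading the number off a plot can also be done programmatically. The sketch below assumes a target of 95% cumulative variance (a hypothetical threshold, not one stated in this article) and uses synthetic stand-in data. It picks the smallest number of components that reaches the target, and compares that with passing a float to n_components, which asks sklearn's PCA to make the same kind of selection:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 6 observed features driven by
# 2 latent factors plus a small amount of noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + 0.05 * rng.normal(size=(300, 6))
data_std = StandardScaler().fit_transform(data)

threshold = 0.95  # hypothetical target for the cumulative variance ratio

# Manual selection: first position where the cumulative ratio reaches the target.
pca_full = PCA(svd_solver="full").fit(data_std)
cum_var_exp = np.cumsum(pca_full.explained_variance_ratio_)
n_manual = int(np.argmax(cum_var_exp >= threshold)) + 1

# Equivalent shortcut: a float in (0, 1) lets PCA choose the count itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(data_std)

print("manual selection:", n_manual)
print("PCA(n_components=0.95):", pca_auto.n_components_)
```

The threshold itself is a judgment call. A higher value keeps more information at the cost of more dimensions.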
Finally, we can use the scatter function of the matplotlib library to visualize the data after PCA dimensionality reduction. For example, the following code reduces the data from the original 4 dimensions to 2 dimensions through PCA, and then displays it visually:
import numpy as np
import matplotlib.pyplot as plt

x = data_pca[:, 0]
y = data_pca[:, 1]
# labels is assumed to be a NumPy array holding one class label per sample
# (for example the digits 0 to 9); a plain Python list would break the masks below
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'brown', 'orange']
for i, label in enumerate(np.unique(labels)):
    mask = labels == label  # boolean mask selecting the samples of this class
    plt.scatter(x[mask], y[mask], c=colors[i], label=str(label), alpha=0.7)
plt.legend()
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
The colors and legend entries in the figure correspond to the class labels in the original data. By visualizing the dimensionally reduced data, we can better understand its structure and characteristics.
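As a self-contained illustration of the whole workflow, here is a sketch that uses the Iris data set bundled with sklearn. Iris is only a stand-in chosen because it has 4 features; the article does not name its data set. It standardizes the data, reduces it to 2 components, and draws one scatter series per class:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data set (an assumption): 150 samples, 4 features, 3 classes.
iris = load_iris()
data, labels = iris.data, iris.target  # labels: one integer class per sample

# Standardize, then project onto the first two principal components.
data_std = StandardScaler().fit_transform(data)
data_pca = PCA(n_components=2).fit_transform(data_std)

# One scatter series per class so that each gets its own color and legend entry.
fig, ax = plt.subplots()
for label in np.unique(labels):
    mask = labels == label
    ax.scatter(data_pca[mask, 0], data_pca[mask, 1],
               label=iris.target_names[label], alpha=0.7)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.legend()
# plt.show()  # uncomment to display the figure in an interactive session
```

Using a boolean mask built from a per-sample label array is what lets each class be selected and colored separately.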
In short, PCA can help us reduce the dimensionality of the data and thereby better understand its structure and characteristics. With Python's sklearn and matplotlib libraries, we can apply PCA and visualize the results very conveniently.