Exploratory factor analysis (EFA) is a classic multivariate statistical method that is often used to uncover latent factors in a data set. For example, we can use exploratory factor analysis to identify the factors that influence brand awareness or consumer behavior in a particular market. In Python, several libraries can be used to implement exploratory factor analysis. This article walks through how to do it step by step.
To implement exploratory factor analysis in Python, we first need to install a few libraries: NumPy for numerical computation; Pandas to load and process the data; scikit-learn to standardize the data; and factor_analyzer to run the exploratory factor analysis. Reading the .xls file with Pandas also requires the xlrd package.
You can use Python's package manager (such as pip) to install these libraries. Run the following command in the terminal:
pip install numpy pandas scikit-learn factor_analyzer xlrd
To demonstrate factor analysis, this article uses the "default of credit card clients" data set from the UCI Machine Learning Repository. The data set contains each customer's credit card and other financial data, such as account balances and credit limits. You can download the dataset from the following URL: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
After downloading, we need to use the Pandas library to load the dataset into Python. In this article, we will use the following code to load the data:
import pandas as pd

# Load the data; skiprows=1 skips the extra first row above the column names
data = pd.read_excel('default of credit card clients.xls', skiprows=1)

# Drop the first column (ID)
data = data.drop(columns=['ID'])
Note that we use skiprows=1 to skip the first line in the file, because that line is not part of the real data. We then use the drop method to remove the first column of the dataset, since it only contains IDs and is not useful for our analysis.
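The effect of skiprows=1 and drop can be checked on a small in-memory table. The sketch below is only an illustration: it uses pd.read_csv on a made-up CSV string instead of the real .xls file, and all values in it are invented.

```python
import io
import pandas as pd

# Hypothetical miniature of the file layout: an extra label row,
# then the real header row, then the data rows. Values are made up.
raw = io.StringIO(
    ",X1,X2\n"
    "ID,LIMIT_BAL,SEX\n"
    "1,20000,2\n"
    "2,120000,2\n"
)

# skiprows=1 drops the extra first row so the second row becomes the header
demo = pd.read_csv(raw, skiprows=1)

# Remove the identifier column, which carries no information for the analysis
demo = demo.drop(columns=["ID"])

print(demo.columns.tolist())
print(demo.shape)
```

Without skiprows=1, the extra label row would be used as the header and the real column names would end up as the first data row.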
Before performing exploratory factor analysis, we need to preprocess the data. In this example, we want to run the factor analysis on the customers' credit history, so we separate the credit history columns from the other financial data and treat credit history as the set of variables under study.
from sklearn.preprocessing import StandardScaler

# Select the credit history columns
credit_data = data.iloc[:, 5:11]

# Standardize the data (mean 0, standard deviation 1)
scaler = StandardScaler()
credit_data = pd.DataFrame(scaler.fit_transform(credit_data), columns=credit_data.columns)
We use the iloc indexer to select the credit history columns from the dataset. Then we use StandardScaler to standardize them so that each column has mean 0 and standard deviation 1. Standardization puts all variables on a common scale before the factor analysis is run.
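Standardization itself is a simple transformation. As a rough sketch on made-up numbers, the following reproduces by hand what StandardScaler does with its default settings: subtract each column's mean and divide by its population standard deviation (ddof=0). The column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for two repayment-status columns
toy = pd.DataFrame({
    "PAY_0": [2.0, -1.0, 0.0, 1.0],
    "PAY_2": [0.0, 0.0, 2.0, -2.0],
})

# z-score each column: subtract the mean, divide by the population std (ddof=0)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# After the transformation every column has mean 0 and standard deviation 1
print(np.allclose(standardized.mean(), 0.0))
print(np.allclose(standardized.std(ddof=0), 1.0))
```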
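After the data processing is complete, we can use the factor_analyzer library to run exploratory factor analysis. Its FactorAnalyzer class fits the model with the minimum residual method unless maximum likelihood estimation is requested with method='ml', and the number of factors to extract is set through the n_factors parameter.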
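The number of factors has to be chosen by the analyst. One common heuristic, which is not part of the original walkthrough and is shown here only as a sketch, is the Kaiser criterion: keep as many factors as the correlation matrix has eigenvalues greater than 1. The data below are synthetic, generated from two hidden factors, so the sizes and loadings are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 6 observed variables driven by 2 hidden factors plus noise
n_samples = 2000
hidden = rng.normal(size=(n_samples, 2))
true_loadings = np.array([
    [0.8, 0.0], [0.8, 0.0], [0.8, 0.0],   # variables 1-3 depend on factor 1
    [0.0, 0.8], [0.0, 0.8], [0.0, 0.8],   # variables 4-6 depend on factor 2
])
noise = rng.normal(scale=0.6, size=(n_samples, 6))
observed = hidden @ true_loadings.T + noise

# Eigenvalues of the correlation matrix, sorted from largest to smallest
corr = np.corrcoef(observed, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: count the eigenvalues greater than 1
n_factors_kaiser = int((eigenvalues > 1).sum())

print(eigenvalues.round(2))
print(n_factors_kaiser)
```

A count obtained this way could be passed as n_factors when constructing FactorAnalyzer. It is a rule of thumb rather than a definitive answer, so it is worth comparing against what makes sense for the data.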
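from factor_analyzer import FactorAnalyzer

# Run exploratory factor analysis.
# Extract one factor per column so the labels match the loadings matrix
# (FactorAnalyzer defaults to n_factors=3 and method='minres';
# pass method='ml' for maximum likelihood estimation).
n_factors = len(credit_data.columns)
factor_names = ['Factor {}'.format(i) for i in range(1, n_factors + 1)]

# Define the model
fa = FactorAnalyzer(n_factors=n_factors)

# Fit the model
fa.fit(credit_data)

# Get the factor loadings
loadings = pd.DataFrame(fa.loadings_, index=credit_data.columns, columns=factor_names)

# Get the variance explained by each factor; get_factor_variance() returns
# (variance, proportional variance, cumulative variance)
factor_variance, proportional_variance, cumulative_variance = fa.get_factor_variance()
variance = pd.DataFrame({'Variance': factor_variance}, index=factor_names)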
In the code above, we first instantiate a FactorAnalyzer object and then call its fit method to fit the data. The loadings_ attribute holds the factor loadings, which measure the strength of the association between each variable and each factor. The get_factor_variance method returns the variance explained by each factor, together with its proportional and cumulative values, which measure how much of the overall variance each factor accounts for. Finally, we use pd.DataFrame to convert the results to Pandas data frames.
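To make the link between the two outputs concrete, the variance attributed to a factor is commonly computed as the sum of the squared loadings in its column, and the proportional variance as that sum divided by the number of variables. The sketch below applies this to a small made-up loading matrix; the numbers are invented and it does not call factor_analyzer.

```python
import numpy as np

# Made-up loading matrix: 4 variables (rows) by 2 factors (columns)
toy_loadings = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.6],
])

# Variance per factor: column sums of the squared loadings
toy_variance = (toy_loadings ** 2).sum(axis=0)

# Share of the total variance (each standardized variable contributes 1)
toy_proportion = toy_variance / toy_loadings.shape[0]

# Running total of the shares across factors
toy_cumulative = np.cumsum(toy_proportion)

print(toy_variance)
print(toy_proportion)
print(toy_cumulative)
```

Here the first factor accounts for a variance of 1.5 out of 4, and the two factors together account for 60% of the total.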
The analysis yields two indicators, the factor loadings and the variance contribution of each factor, which we can use to identify the underlying factors.
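The following is the factor loading and variance contribution output. Its row labels (LIMIT_BAL through PAY_0) are the first six columns of the dataset after ID is dropped, which corresponds to data.iloc[:, 0:6]; the 5:11 slice shown earlier would instead label the rows PAY_0 through PAY_6.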
           Factor 1  Factor 2  Factor 3  Factor 4  Factor 5  Factor 6
LIMIT_BAL  0.847680 -0.161836 -0.013786  0.010617 -0.037635  0.032740
SEX       -0.040857  0.215850  0.160855  0.162515 -0.175099  0.075676
EDUCATION  0.208120 -0.674727  0.274869 -0.293581 -0.086391 -0.161201
MARRIAGE  -0.050921 -0.028212  0.637997  0.270484 -0.032020  0.040089
AGE       -0.026009  0.028125 -0.273592  0.871728  0.030701  0.020664
PAY_0      0.710712  0.003285 -0.030082 -0.036452 -0.037875  0.040604

          Variance
Factor 1  1.835932
Factor 2  1.738685
Factor 3  1.045175
Factor 4  0.965759
Factor 5  0.935610
Factor 6  0.104597
In the loading matrix, LIMIT_BAL (0.85) and PAY_0 (0.71) have the highest loadings on factor 1, which indicates that these variables are strongly associated with that factor. In terms of variance contribution, factor 1 has the largest value (1.84), which means it explains more of the overall variance than any other factor.
Therefore, we can regard factor 1 as the main factor affecting customer credit records.
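One simple way to read a loading matrix, sketched below, is to assign each variable to the factor on which its absolute loading is largest. The values are copied from the loading output shown earlier, rounded to three decimals; the "dominant factor" rule is an illustrative convention, not something the analysis itself prescribes.

```python
import pandas as pd

factor_labels = ["Factor {}".format(i) for i in range(1, 7)]

# Loadings from the output table, rounded to three decimals
loadings_table = pd.DataFrame(
    [
        [0.848, -0.162, -0.014, 0.011, -0.038, 0.033],
        [-0.041, 0.216, 0.161, 0.163, -0.175, 0.076],
        [0.208, -0.675, 0.275, -0.294, -0.086, -0.161],
        [-0.051, -0.028, 0.638, 0.270, -0.032, 0.040],
        [-0.026, 0.028, -0.274, 0.872, 0.031, 0.021],
        [0.711, 0.003, -0.030, -0.036, -0.038, 0.041],
    ],
    index=["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE", "PAY_0"],
    columns=factor_labels,
)

# For each variable, pick the factor with the largest absolute loading
dominant = loadings_table.abs().idxmax(axis=1)
strength = loadings_table.abs().max(axis=1)

summary = pd.DataFrame({"dominant_factor": dominant, "abs_loading": strength})
print(summary)
```

Variables whose largest absolute loading is small, such as SEX here, are not well described by any single factor and are usually interpreted with caution.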
In this article, we introduced how to implement exploratory factor analysis in Python. We first prepared the data, then ran exploratory factor analysis using the factor_analyzer library, and finally examined indicators such as the factor loadings and variance contributions. The method can be used in many data analysis applications, such as market research and human resource management. If you are working with data like this, factor analysis is worth a try.