The chi-square test is a statistical method for assessing whether categorical variables are associated, based on how far observed counts deviate from the counts expected if the variables were independent. It is commonly used in data analysis and machine learning. Python, with libraries such as SciPy, makes it efficient and flexible to prepare data and run chi-square tests. This article introduces the chi-square test in Python to help readers understand and apply this important statistical method.
1. The basic concept of the chi-square test
The chi-square test of independence checks whether two or more categorical variables are independent or associated. It uses the chi-square statistic to measure the difference between the observed counts and the counts expected under independence. The formula of the chi-square statistic is as follows:
X^2 = Σ [ (Oi - Ei)^2 / Ei ]
where Oi is the observed count in cell i, Ei is the expected count in that cell, and Σ denotes the sum over all cells of the table. To interpret the statistic you need two more things: the degrees of freedom, which determine the chi-square distribution the statistic is compared against, and the significance level. For a contingency table, the degrees of freedom are:
df = (r - 1) x (c - 1)
where r is the number of rows and c is the number of columns. The significance level is the probability of rejecting the null hypothesis when it is actually true; it is usually set to 0.05 or 0.01.
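To make the two formulas above concrete, the following is a minimal pure-Python sketch that computes the expected counts under independence, the chi-square statistic, and the degrees of freedom for a small table. The function name chi_square_statistic and the 2x2 counts are hypothetical, chosen only for illustration.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence: row total * column total / N
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e

    # df = (r - 1) x (c - 1)
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical 2x2 table of counts, used only for illustration
table = [[10, 20], [30, 40]]
stat, dof = chi_square_statistic(table)
print(f"chi-square = {stat:.4f}, df = {dof}")
```

For this table the expected counts are 12, 18, 28, and 42, which gives a statistic of about 0.794 with 1 degree of freedom.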
2. Chi-square test function in Python
In Python, you can use the stats.chi2_contingency function in the SciPy library to perform the chi-square test. This function runs a chi-square test of independence on a contingency table of categorical variables and returns a result that unpacks into four values: the chi-square statistic, the p-value, the degrees of freedom, and the array of expected counts.
The following is the syntax of this function:
chi2, pval, dof, expctd = stats.chi2_contingency(observed)
where observed is a matrix of observed counts. The rows of the matrix represent the categories of one variable and the columns represent the categories of the other.
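As a minimal sketch of the call, the snippet below passes a hypothetical 2x2 table (the counts are made up for illustration) and unpacks the four return values. One detail worth knowing: for a 2x2 table, which has one degree of freedom, chi2_contingency applies Yates' continuity correction by default, and passing correction=False gives the uncorrected statistic from the formula in section 1.

```python
from scipy import stats

# Hypothetical 2x2 table of observed counts (illustration only)
observed = [[10, 20], [30, 40]]

# Default call: Yates' continuity correction is applied because dof == 1
chi2_corr, pval_corr, dof, expctd = stats.chi2_contingency(observed)

# Uncorrected Pearson chi-square statistic
chi2_raw, pval_raw, _, _ = stats.chi2_contingency(observed, correction=False)

print('Degrees of freedom:', dof)
print('Expected values:')
print(expctd)
print(f'With correction:    chi2 = {chi2_corr:.4f}, p = {pval_corr:.4f}')
print(f'Without correction: chi2 = {chi2_raw:.4f}, p = {pval_raw:.4f}')
```

The correction only affects tables with one degree of freedom, so larger tables give the same result with either setting.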
3. Using Python to perform the chi-square test
Now, let’s look at a practical example. Suppose we have a data set with two categorical variables and we want to determine whether they are independent of each other. In this example, we will use a dummy dataset containing gender and preferences. The format of the data is as follows:
data = [[45, 21, 16], [34, 32, 26]]
Each row is a gender group and each column is a preference category. In the male group (first row), 45 people chose the first category, 21 like bananas, and 16 like apples; in the female group (second row), 34 people chose the first category, 32 like bananas, and 26 like apples.
We can use the stats.chi2_contingency function to calculate the results of the chi-square test:
from scipy import stats

# Observed counts: rows are gender groups, columns are preference categories
data = [[45, 21, 16], [34, 32, 26]]
chi2, pval, dof, expctd = stats.chi2_contingency(data)
print(f'Chi-square statistic: {chi2:.2f}')
print(f'p-value: {pval:.4f}')
print('Degrees of freedom:', dof)
print('Expected values:')
print(expctd.round(2))

The running result is:

Chi-square statistic: 5.64
p-value: 0.0596
Degrees of freedom: 2
Expected values:
[[37.23 24.98 19.79]
 [41.77 28.02 22.21]]

The p-value (about 0.06) is slightly above 0.05, so at the 0.05 significance level we cannot reject the null hypothesis that gender and preferences are independent. The data hint at an association between gender and preferences, but the evidence is not strong enough to be statistically significant at that level.
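The same decision can also be framed through a critical value: the null hypothesis is rejected when the statistic exceeds the (1 - alpha) quantile of the chi-square distribution with the table's degrees of freedom. The following is a minimal sketch using stats.chi2.ppf; the variable names alpha, critical, and reject are illustrative choices, not part of the original example.

```python
from scipy import stats

data = [[45, 21, 16], [34, 32, 26]]  # same table as in the example above
alpha = 0.05  # chosen significance level

chi2, pval, dof, expctd = stats.chi2_contingency(data)

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical = stats.chi2.ppf(1 - alpha, dof)

# Reject the null hypothesis only if the statistic exceeds the critical value
reject = bool(chi2 > critical)

print(f'statistic = {chi2:.2f}, critical value = {critical:.2f}')
print('Reject the null hypothesis:', reject)
```

Comparing the p-value with alpha and comparing the statistic with the critical value always lead to the same decision, because the p-value is the upper-tail probability of the statistic under that same distribution.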
4. Summary
In Python, running a chi-square test is straightforward: pass a matrix of observed counts to the stats.chi2_contingency function in the SciPy library and read off the statistic, p-value, degrees of freedom, and expected values. The degrees of freedom follow from the dimensions of the table, so the main choice left to the analyst is the significance level, which should be fixed before looking at the results. The chi-square test is a common and useful data analysis method that is widely used in machine learning and statistics, and mastering it in Python is very helpful for research and for solving practical problems.