This article covers data processing and visualization in Python: a first look at NumPy, the use of the Matplotlib package, and the visual display of data statistics. I hope it will be helpful to everyone.
Tables are a common way to present data, but a machine cannot work with a table directly, so we need to convert it into another form.
The representation commonly used in machine learning is a data matrix.
Looking at the table, we find two types of attributes: numeric and Boolean. We now build a matrix to describe it:
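# Convert the table into a matrix
import numpy as np
# Note: np.mat was removed in NumPy 2.0; np.asmatrix is the replacement there.
data = np.mat([[1,200,105,3,False],[2,165,80,2,False],[3,184.5,120,2,False],
               [4,116,70.8,1,False],[5,270,150,4,True]])
row = 0
for line in data:
    row += 1
print(row)        # number of rows
print(data.size)  # total number of elements
print(data)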
The first line of code imports NumPy under the alias np. The second uses NumPy's mat() function to create the data matrix, and row is a counter variable used to count the rows.
The size printed here is 25, the total number of elements in the 5*5 table. Printing the matrix directly shows the data:
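The row count and the element count can also be read from the matrix itself instead of being counted in a loop. The following is a minimal sketch on the same five-row table; it uses `np.array` rather than `np.mat`, and the names `n_rows` and `n_cols` are illustrative only.

```python
import numpy as np

# Same table as above. The array is numeric, so False/True are stored as 0.0/1.0.
data = np.array([[1, 200, 105, 3, False],
                 [2, 165, 80, 2, False],
                 [3, 184.5, 120, 2, False],
                 [4, 116, 70.8, 1, False],
                 [5, 270, 150, 4, True]])

n_rows, n_cols = data.shape   # shape is (rows, columns)
print(n_rows, n_cols)
print(data.size)              # total number of elements = rows * columns
```

Reading `shape` avoids the explicit counter and also gives the number of columns.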
Let's return to the table at the top. The second column holds housing prices, and the differences between them are hard to judge from the numbers alone, so we plot them (a common way to study numerical differences and anomalies is to plot the distribution of the data):
import numpy as np
import scipy.stats as stats
import pylab
data = np.mat([[1,200,105,3,False],[2,165,80,2,False],[3,184.5,120,2,False],
               [4,116,70.8,1,False],[5,270,150,4,True]])
coll = []
for row in data:
    coll.append(row[0,1])
stats.probplot(coll, plot=pylab)
pylab.show()
Running this code produces a graph:
From it we can clearly see the differences.
A coordinate chart shows the specific values of the data across different rows and columns.
Of course, we can also display the coordinate diagram:
There are many ways to calculate similarity. We choose the two most commonly used ones: Euclidean similarity and cosine similarity.
Euclidean distance represents the true distance between two points in three-dimensional space. Most of us know the formula, though we rarely hear the name.
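For reference, the standard formula for two points $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ is given first, followed by the usual generalization to $n$ components, which is the form that applies to rating vectors. The symbols are chosen here for illustration.

$$
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}
$$

$$
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$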
So let’s take a look at its practical application:
This table shows the ratings of items by three users:
d12 denotes the distance between user 1 and user 2, which serves as the measure of their similarity:
Similarly, d13:
It can be seen that user 2 is more similar to user 1 than user 3 is (the smaller the distance, the greater the similarity).
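The same comparison can be sketched in a few lines of Python. The ratings below are hypothetical and are not the values from the table in this article; the function name `euclidean_distance` is likewise only illustrative.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical ratings of four items by three users (NOT the article's table).
user1 = [1, 1, 3, 5]
user2 = [2, 1, 4, 5]
user3 = [5, 4, 1, 1]

d12 = euclidean_distance(user1, user2)
d13 = euclidean_distance(user1, user3)
print(d12, d13)

# The smaller distance marks the more similar pair.
print("user 2 is closer to user 1" if d12 < d13 else "user 3 is closer to user 1")
```

With these made-up ratings, d12 is smaller than d13, so user 2 comes out as the closer match to user 1.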
Cosine similarity starts from the angle between the two vectors rather than the distance between them.
It can be seen that, compared to user 3, user 2 is more similar to user 1 (the more similar the two targets are, the smaller the angle formed by their line segments).
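A minimal sketch of the cosine calculation follows, using the standard definition: the dot product of the two vectors divided by the product of their lengths. The ratings are hypothetical, not taken from the article's table, and `cosine_similarity` is an illustrative name.

```python
import math

def cosine_similarity(a, b):
    """Return cos(theta) for the angle theta between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical ratings of four items by three users (NOT the article's table).
user1 = [1, 1, 3, 5]
user2 = [2, 1, 4, 5]
user3 = [5, 4, 1, 1]

cos12 = cosine_similarity(user1, user2)
cos13 = cosine_similarity(user1, user3)
print(cos12, cos13)
```

A cosine closer to 1 means a smaller angle, so here the larger of the two values identifies the user whose ratings point in nearly the same direction as user 1's. The function assumes neither vector is all zeros, since that would make the denominator zero.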
Quartiles are a kind of quantile in statistics: the data is sorted from smallest to largest and divided into four equal parts, and the values at the three dividing points are the quartiles.
First quartile (Q1), also called the lower quartile;
Second quartile (Q2), also called the median;
Third quartile (Q3), also called the upper quartile.
The difference between the third quartile and the first quartile is called the interquartile range (IQR).
If n is the number of items, then:
Position of Q1 = (n+1)*0.25
Position of Q2 = (n+1)*0.50
Position of Q3 = (n+1)*0.75
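The position rule above can be sketched in plain Python. The sample values are hypothetical, and resolving a fractional position by linear interpolation between the two neighbouring items is an assumption made here; other conventions exist, and library functions such as `numpy.percentile` use a different position rule by default, so their results may differ slightly.

```python
def value_at_position(sorted_values, fraction):
    """Return the value at 1-based position (n + 1) * fraction.

    Fractional positions are resolved by linear interpolation between the
    two neighbouring items (an assumption; other conventions exist).
    """
    n = len(sorted_values)
    position = (n + 1) * fraction
    position = min(max(position, 1), n)   # keep the position inside 1..n
    lower = int(position)                 # item at or just below the position
    upper = min(lower + 1, n)
    weight = position - lower             # fractional part of the position
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[upper - 1]
    return low_value + weight * (high_value - low_value)

# Hypothetical sample, already sorted from small to large (n = 11).
sample = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

q1 = value_at_position(sample, 0.25)   # position (11 + 1) * 0.25 = 3
q2 = value_at_position(sample, 0.50)   # position (11 + 1) * 0.50 = 6
q3 = value_at_position(sample, 0.75)   # position (11 + 1) * 0.75 = 9
iqr = q3 - q1                          # interquartile range
print(q1, q2, q3, iqr)
```

With n = 11 the three positions are whole numbers, so the quartiles are simply the 3rd, 6th and 9th items of the sorted sample.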
Quartile example:
About rain.csv: if you need the file, message me privately. I used the monthly precipitation data of Bozhou City from 2010 to 2019.
from pylab import *
import pandas as pd
import matplotlib.pyplot as plot
filepath = ("C:\\Users\\AWAITXM\\Desktop\\rain.csv")
# "C:\Users\AWAITXM\Desktop\rain.csv"
dataFile = pd.read_csv(filepath)
# describe() reports count, mean, std, min, the three quartiles and max per column
summary = dataFile.describe()
print(summary)
# Draw one box per column of the file
array = dataFile.iloc[:,:].values
boxplot(array)
plot.xlabel("year")
plot.ylabel("rain")
show()
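The plot output is as follows:
And this is the pandas output:
The fluctuation range of the data is clearly visible here.
Precipitation differs greatly between months: August has the most, while January to April and October to December have the least.
So how do the month-to-month increases and decreases in precipitation compare?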
from pylab import *
import pandas as pd
import matplotlib.pyplot as plot
filepath = ("C:\\Users\\AWAITXM\\Desktop\\rain.csv")
# "C:\Users\AWAITXM\Desktop\rain.csv"
dataFile = pd.read_csv(filepath)
summary = dataFile.describe()
# Bounds used to scale the value in column 12 into a color
minRings = -1
maxRings = 99
nrows = 11
for i in range(nrows):
    # Plot columns 1 to 12 of row i as one line
    dataRow = dataFile.iloc[i,1:13]
    labelColor = ((dataFile.iloc[i,12] - minRings) / (maxRings - minRings))
    dataRow.plot(color=plot.cm.RdYlBu(labelColor), alpha=0.5)
plot.xlabel("Attribute")
plot.ylabel("Score")
show()
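The result is shown in the figure:
It can be seen that monthly precipitation does not rise or fall in a regular pattern.
So is the precipitation of different months correlated?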
from pylab import *
import pandas as pd
import matplotlib.pyplot as plot
filepath = ("C:\\Users\\AWAITXM\\Desktop\\rain.csv")
# "C:\Users\AWAITXM\Desktop\rain.csv"
dataFile = pd.read_csv(filepath)
summary = dataFile.describe()
# Pairwise correlation between the selected columns, drawn as a color grid
corMat = pd.DataFrame(dataFile.iloc[1:20,1:20].corr())
plot.pcolor(corMat)
plot.show()
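The result is shown in the figure:
The colors are distributed quite evenly, which indicates little correlation, so the precipitation of each month can be regarded as independent.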
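To read such a color grid it helps to recall what the correlation matrix contains: one coefficient between -1 and 1 for every pair of columns, with 1 on the diagonal. The following is a minimal sketch on hypothetical values (not taken from rain.csv); it uses `np.corrcoef` in place of `DataFrame.corr()`, both of which compute the Pearson coefficient by default.

```python
import numpy as np

# Hypothetical precipitation values: rows are years, columns are three months.
rain = np.array([[10.0, 80.0, 30.0],
                 [20.0, 60.0, 35.0],
                 [30.0, 70.0, 20.0],
                 [40.0, 50.0, 40.0]])

# rowvar=False treats each column (month) as one variable.
cor = np.corrcoef(rain, rowvar=False)
print(np.round(cor, 2))
```

Cell (i, j) of the result compares month i with month j, so the matrix is symmetric. Values near 0 suggest little linear relationship, while values near 1 or -1 suggest a strong one.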
That's all for today's notes. See you next time! I hope this article is helpful to you too.