本文轉載自微信公眾號「尤而小屋」,作者尤而小屋 。轉載本文請聯絡尤而小屋公眾號。
大家好,我是Peter~
最近取得了一份IC電子產品電商數據,後面會進行3個主題的數據分析與挖掘:
本文是第一個階段,主要內容包含:
import pandas as pd import numpy as np import time import os from datetime import datetime import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline #设置中文编码和负号的正常显示 plt.rcParams['font.sans-serif']=['SimHei'] plt.rcParams['axes.unicode_minus']=False import plotly_express as px import plotly.graph_objects as go import missingno as ms from sklearn.cluster import KMeans from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv( "ic_sale.csv", encoding="utf-8",# 指定编码 cnotallow={"order_id":str,"product_id":str,"category_id":str,"user_id":str} # 指定字段类型 ) df.head()
# 1、数据shape df.shape
(564169, 11)
# 2、数据字段类型 df.dtypes
event_timeobject order_idobject product_idobject category_id object category_code object brand object pricefloat64 user_id object ageint64 sex object local object dtype: object
# 3、数据描述统计信息 df.describe()
price | #age | |
count | 564169.000000 | #564169.000000 |
mean | 208.269324 | #33.184388 |
std### | 304.559875 | #10.122088 |
0.000000 | ||
25% | # 23.130000 | |
##50% | #87.940000 | |
#75% | 277.750000 |
In [6]:
# 4、总共多少个不同客户 df["user_id"].nunique()
Out[6]:
6908
In [7]:
# 5、总共多少个不同品牌 df["brand"].nunique()
Out[7]:
868
In [8]:
# 6、总共多少个订单 df["order_id"].nunique()
Out[8]:
234232
In [9]:
# 7、总共多少个产品 df["product_id"].nunique()
Out[9]:
3756
从描述统计信息中发现price字段的最小值是0,应该是没有成交的数据;我们选择price大于0的信息:
In [10]:
df = df[df["price"] > 0]
In [11]:
df.isnull().sum()
Out[11]:
event_time0 order_id0 product_id0 category_id 0 category_code129344 brand 27215 price 0 user_id 0 age 0 sex 0 local 0 dtype: int64
可以看到缺失值体现在字段:
In [12]:
ms.bar(df,color="blue")# 缺失值可视化 plt.show()
In [13]:
df.fillna("missing",inplace=True)
In [14]:
df.isnull().sum()# 填充之后无缺失值
Out[14]:
event_time 0 order_id 0 product_id 0 category_id0 category_code0 brand0 price0 user_id0 age0 sex0 local0 dtype: int64
读进来的数据中时间字段是object类型,需要将其转成时间格式的类型
In [15]:
df["event_time"][:5] # 处理前
Out[15]:
02020-04-24 11:50:39 UTC 12020-04-24 11:50:39 UTC 22020-04-24 14:37:43 UTC 32020-04-24 14:37:43 UTC 42020-04-24 19:16:21 UTC Name: event_time, dtype: object
In [16]:
# 去掉最后的UTC df["event_time"] = df["event_time"].apply(lambda x: x[:19])
In [17]:
# 时间数据类型转化:字符类型---->指定时间格式 df['event_time'] = pd.to_datetime(df['event_time'], format="%Y-%m-%d %H:%M:%S")
In [18]:
# 提取多个时间相关字段 df['month']=df['event_time'].dt.month df['day'] = df['event_time'].dt.day df['dayofweek']=df['event_time'].dt.dayofweek df['hour']=df['event_time'].dt.hour
In [19]:
df["event_time"][:5] # 处理后
Out[19]:
0 2020-04-24 11:50:39 1 2020-04-24 11:50:39 2 2020-04-24 14:37:43 3 2020-04-24 14:37:43 4 2020-04-24 19:16:21 Name: event_time, dtype: datetime64[ns]
可以看到字段类型已经发生了变化
In [20]:
amount_by_month = df.groupby("month")["price"].sum().reset_index() amount_by_month
Out[20]:
month | price | |
0 | 1 | 1953358.17 |
1 | 2 | 2267809.88 |
2 | 3 | 2897486.26 |
3 | 4 | 1704422.41 |
#4 | #5 | #7768637.79 |
#5 | 6 | #7691244.33 |
6 | ||
7 | 16354029.27############7#### ########8############27982605.44##################8########## # ###9### | 17152310.57 |
#9 | ##10 | #19765680.76 |
#10 | 11 | 11961511.52 |
In [21]:
fig = px.scatter(amount_by_month,x="month",y="price",size="price",color="price") fig.update_layout(height=500, width=1000, title_text="每月成交金额") fig.show()
可以看到:
In [22]:
order_by_month = df.groupby("month")["order_id"].nunique().reset_index() order_by_month
Out[22]:
month | order_id | |||
0 | 1 | 10353 | ||
1 | 2 | 11461 | ||
2 | 3 | 12080 | ||
3 | 4 | 9001 | ||
4 | 5 | #30460 | ||
5 | 6 | #28978 | ||
7 | #57659 | |||
## 7 | ###8############73897###################8## ## ########9############345####################9##### ##### ###10### | ##14 | ||
10 | 11 | 6 |
month | nunique | count | |
0 | 1 | 1388 | 15575 |
1 | 2 | 1508 | 17990 |
2 | 3 | 1597 | ##18687 |
#3 | 4 | 1525 | 11867 |
#4 | 5 | 3168 | 40332 |
5 | #6 | 3966 | #41355 |
7 | 5159 | 76415 | |
##7 | 8 | #6213 | 100006 |
8 | #9 | ##549770496 | |
10 | 10 |
| 4597|
104075 | 10 | 11 | 3134 |
In [25]:
fig = px.line(people_by_month,x="month",y="nunique") fig.update_layout(height=500, width=1000, title_text="每月成交人数") fig.show()
fig = px.line(people_by_month,x="month",y="count") fig.update_layout(height=500, width=1000, title_text="每月成交人次") fig.show()
In [27]:
amount_by_month# 每月成交金额
Out[27]:
month | price | |
0 | 1 | 1953358.17 |
1 | 2 | 2267809.88 |
2 | 3 | 2897486.26 |
3 | 4 | 1704422.41 |
#4 | #5 | #7768637.79 |
#5 | 6 | #7691244.33 |
6 | ||
7 | 16354029.27############7#### ########8############27982605.44##################8########## # ###9### | 17152310.57 |
#9 | ##10 | #19765680.76 |
#10 | 11 | 11961511.52 |
In [28]:
order_by_month# 每月订单数
Out[28]:
month | order_id | |||
0 | 1 | 10353 | ||
1 | 2 | 11461 | ||
2 | 3 | 12080 | ||
3 | 4 | 9001 | ||
4 | 5 | #30460 | ||
5 | 6 | #28978 | ||
7 | #57659 | |||
## 7 | ###8############73897###################8## ## ########9############345####################9##### ##### ###10### | ##14 | ||
10 | 11 | 6 |
month | price | order_id | ||
0 | 1 | 1953358.17 | 10353 | |
1 | 2 | 2267809.88 | 11461 | |
2 | 3 | 2897486.26 | 12080 | |
3 | ##4 | 1704422.41 | 9001 | |
4 | #5 | 7768637.79 | ##30460
| |
6 | #7691244.33 | # #28978 | ||
#7 | ##16354029.27 | ##57659 | #7 | |
##8 | ##27982605.44 | #73897 | ||
8 | #9 | 17152310.57 | #345 | |
#9 | 10 | 19765680.76 | 14 | |
#10 | 11 | #11961511.52 | #6
|
order_id | product_id | |
234208 | 2388440981134640000 | 15021 |
234210 | 2388440981134660000 | 14891 |
234211 | 2388440981134670000 | 14845 |
234212 | ##2388440981134680000 | #14765 |
#234202 | #2388440981134580000 | #14587 |
234205 | ##234205 | |
## 2388440981134610000 | #14571 | |
234207 | ##238844098113463000 | 14443 |
#234204 | 2388440981134600000 | ################# ## ##14416##################234206############2388440981134620000############14414 ## # |
234203 | #2388440981134590000 | ##14194 |
In [33]:
fig = px.bar(product_by_order[:20], x="order_id", y="product_id", text="product_id" ) fig.show()
一个订单下包含的产品数量是不同;上万的订单可能是小型的ic元器件产品。
不同省份下的订单量、用户量和成交金额对比
In [34]:
local = df.groupby("local").agg({"order_id":"nunique","user_id":"nunique","price":sum}).reset_index() local.head()
Out[34]:
local | order_id | user_id | price | |
0 | 上海 | 39354 | 5680 | 19837942.20 |
1 | 北京 | 38118 | 5702 | 19137748.75 |
2 | 四川 | ##13396 | ##35896770891.28 | |
天津 | 13058 | 3497 | 6433736.85 | |
##4 | 廣東 | #51471 | 6085 | 26013770.86 |
In [35]:
df1 = local.sort_values("order_id",ascending=True)# 订单量升序 df1
Out[35]:
local | order_id | user_id | price | |
6 | 浙江 | 12790 | 3485 | 6522657.59 |
8 | 湖北 | 12810 | 3488 | 5993820.57 |
3 | 天津 | 13058 | 3497 | 6433736.85 |
##10 | 重慶 | #13058 | 3496 | #6479488.14 |
#7 | 海南 | |||
# #13076 | 3587 | #6968674.41 | ##########2### #########四川#############13396#############3589### | 6770891.28 |
#5 | Pf | |||
#13575 | 3598 | #6357286.87 | #9 | |
13879 | 3481 | #6983078.88 | 1 | |
###38118### #########5702############19137748.75##################0### | 上海 | 39354 | 5680 | 19837942.20 |
4 | 廣東 | 51471 | ##6085 | 26013770.86 |
dayofweek | order_id | |
0 | 0 | 35690 |
1 | 1 | 34256 |
2 | 2 | #31249 |
3 | #3 | #31555 |
4 | 4 | 33010 |
5 | 5 | #34772 |
6 | 6 | 33922 |
hour | order_id | |
0 | 0 | 2865 |
1 | 1 | 2711 |
2 | 2 | 3981 |
3 | 3 | 6968 |
4 | ##4 | 12176 |
5 | 5 | 16411 |
#6 | #618667 | |
7 |
| #20034|
8 | 20261 | |
#9 | ##20507 |