在房地產領域,確定房地產價格涉及許多因素,從位置和規模到便利設施和市場趨勢。簡單線性迴歸是機器學習的基礎技術,它提供了一種根據房間數量或平方英尺等關鍵特徵來預測房價的實用方法。
在本文中,我深入研究了將簡單線性回歸應用於住房資料集的過程,從資料預處理和特徵選擇到建立可以提供有價值的價格洞察的模型。無論您是資料科學新手還是尋求加深理解,該專案都可以讓您親身探索資料驅動的預測如何塑造更明智的房地產決策。
首先,您首先要匯入庫:
import pandas as pd import seaborn as sns import numpy as np import matplotlib.pyplot as plt
#Read from the directory where you stored the data data = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
data
#Test to see if there arent any null values data.info()
#Trying to draw the same number of null values data.dropna(inplace = True)
data.info()
#From our data, we are going to train and test our data from sklearn.model_selection import train_test_split X = data.drop(['median_house_value'], axis = 1) y = data['median_house_value']
y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
#Examining correlation between x and y training data train_data = X_train.join(y_train)
train_data
#Visualizing the above train_data.hist(figsize=(15, 8))
#Encoding non-numeric columns to see if they are useful and categorical for analysis train_data_encoded = pd.get_dummies(train_data, drop_first=True) correlation_matrix = train_data_encoded.corr() print(correlation_matrix)
train_data_encoded.corr()
plt.figure(figsize=(15,8)) sns.heatmap(train_data_encoded.corr(), annot=True, cmap = "inferno")
import pandas as pd import seaborn as sns import numpy as np import matplotlib.pyplot as plt
#Read from the directory where you stored the data data = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
data
ocean_proximity
內陸 5183
近海 2108
近灣 1783
第五島
名稱:計數,資料型態:int64
#Test to see if there arent any null values data.info()
#Trying to draw the same number of null values data.dropna(inplace = True)
data.info()
#From our data, we are going to train and test our data from sklearn.model_selection import train_test_split X = data.drop(['median_house_value'], axis = 1) y = data['median_house_value']
y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
#Examining correlation between x and y training data train_data = X_train.join(y_train)
train_data
#Visualizing the above train_data.hist(figsize=(15, 8))
#Encoding non-numeric columns to see if they are useful and categorical for analysis train_data_encoded = pd.get_dummies(train_data, drop_first=True) correlation_matrix = train_data_encoded.corr() print(correlation_matrix)
train_data_encoded.corr()
plt.figure(figsize=(15,8)) sns.heatmap(train_data_encoded.corr(), annot=True, cmap = "inferno")
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1) train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] +1) train_data['population'] = np.log(train_data['population'] + 1) train_data['households'] = np.log(train_data['households'] + 1)
train_data.hist(figsize=(15, 8))
0.5092972905670141
#convert ocean_proximity factors into binary's using one_hot_encoding train_data.ocean_proximity.value_counts()
#For each feature of the above we will then create its binary(0 or 1) pd.get_dummies(train_data.ocean_proximity)
0.4447616558596853
#Dropping afterwards the proximity train_data = train_data.join(pd.get_dummies(train_data.ocean_proximity)).drop(['ocean_proximity'], axis=1)
train_data
#recheck for correlation plt.figure(figsize=(18, 8)) sns.heatmap(train_data.corr(), annot=True, cmap ='twilight')
0.5384474921332503
我真的想說,訓練機器並不是最簡單的過程,但為了不斷改進上面的結果,您可以在param_grid 下添加更多功能,例如min_feature,這樣您的最佳估計器分數就可以不斷改進。
如果您到目前為止,請在下面按讚並分享您的評論,您的意見非常重要。謝謝! ??❤️
以上是房屋價格預測的詳細內容。更多資訊請關注PHP中文網其他相關文章!