Python Machine Learning / Data Mining in Practice: Boston House Price Prediction with Regression Analysis
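This dataset comes from an American economics journal and was used to study house prices in Boston (Boston House Price). In this project you will train and test a model on housing data from the suburbs of Boston, Massachusetts, and evaluate its performance and predictive power. A model trained well on this data can be used to make specific predictions about a home, in particular its value. Such predictive models have proven very valuable in the day-to-day work of people such as real estate agents.
Dataset description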
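Each row of the dataset describes housing conditions in one suburb or town around Boston. The variables are as follows.
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of the population with lower status
MEDV: median value of owner-occupied homes, in $1000's (a median, not a mean)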
print(boston_data['DESCR'])
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive
    :Median Value (attribute 14) is usually the target
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
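Import libraries
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older scikit-learn release.
from sklearn.datasets import load_boston
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt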
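Load the dataset
boston_data = load_boston()
x_data = boston_data.data            # feature matrix
y_data = boston_data.target          # MEDV, the value to predict
names = boston_data.feature_names
FeaturesNums = 13                    # number of features
DataNums = len(x_data)               # number of samples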
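Visual analysis
Visualize each feature of the dataset and analyze its correlation with the target, then process the data. After processing, visualize again and feed what the plots show back into the processing step. Once the data looks satisfactory, try modeling. The plots below were produced from the feature data after selection.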
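The correlation check described above can also be done numerically. The following is a minimal sketch, assuming Pearson correlation is an acceptable measure; the helper `feature_target_correlations` and the synthetic `x_demo`/`y_demo` arrays are illustrative assumptions, not part of the original project.

```python
import numpy as np

def feature_target_correlations(x, y, feature_names):
    """Return {feature name: Pearson correlation with y} for each column of x."""
    correlations = {}
    for column, name in enumerate(feature_names):
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
        correlations[name] = float(np.corrcoef(x[:, column], y)[0, 1])
    return correlations

# Synthetic stand-in data (an assumption, not the Boston dataset).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * x_demo[:, 0] - 2.0 * x_demo[:, 1] + rng.normal(scale=0.1, size=200)

demo_corr = feature_target_correlations(x_demo, y_demo, ["f0", "f1", "f2"])
# Print features from strongest to weakest absolute correlation.
for name, r in sorted(demo_corr.items(), key=lambda item: -abs(item[1])):
    print(f"{name}: r = {r:+.3f}")
```

With the real data, one would pass `x_data`, `y_data`, and `names` instead. A low absolute correlation only hints that a feature contributes little linearly; it does not rule out a nonlinear relationship.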
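Relationship between features and label
Look at how each feature relates to the label to judge how much the feature contributes to it.
# 2-D scatter plot of each feature against the target.
# x_train and y_train come from the data split step later in this article;
# the 2x3 grid fits the 5 features kept after feature removal.
plt.figure(figsize=(20, 12))
for i in range(FeaturesNums):
    plt.subplot(2, 3, i + 1)
    plt.scatter(x_train[:, i], y_train, s=20, color='blueviolet')
    plt.title(names[i])
plt.show()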
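Feature distributions
The distribution of a feature helps estimate how useful the data is and can also reveal abnormal values.
# Histogram of each feature (same 2x3 grid as above)
plt.figure(figsize=(20, 10))
for i in range(FeaturesNums):
    plt.subplot(2, 3, i + 1)
    plt.hist(x_data[:, i], color='lightseagreen', width=2)
    plt.xlabel(names[i])
    plt.title(names[i])
plt.show()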
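Besides reading a histogram by eye, abnormal values can be flagged with a simple rule. The sketch below uses the common 1.5 x IQR rule, which is an assumption for illustration and not the rule this article applies later; the helper name `iqr_outlier_mask` is hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up values with one obvious outlier.
demo_values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])
demo_mask = iqr_outlier_mask(demo_values)
print(demo_values[demo_mask])  # only the value 50.0 is flagged
```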
Data processing
Import the preprocessing module from sklearn, which provides several processing methods.
from sklearn import preprocessing
Remove outliers
# Collect the indices of rows whose target is at the extremes (>= 49 or <= 1)
DelList0 = []
for i in range(DataNums):
    if y_data[i] >= 49 or y_data[i] <= 1:
        DelList0.append(i)
DataNums -= len(DelList0)
x_data = np.delete(x_data, DelList0, axis=0)
y_data = np.delete(y_data, DelList0, axis=0)
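The same filtering can be written without an explicit loop. This is a minimal sketch using a NumPy boolean mask; the helper `drop_extreme_targets` and the small demo arrays are assumptions made for illustration, with the thresholds 1 and 49 taken from the loop above.

```python
import numpy as np

def drop_extreme_targets(x, y, low=1.0, high=49.0):
    """Keep only rows whose target lies strictly between low and high."""
    keep = (y > low) & (y < high)
    return x[keep], y[keep]

# Made-up data: 5 samples with 2 features each.
x_demo = np.arange(10, dtype=float).reshape(5, 2)
y_demo = np.array([0.5, 20.0, 50.0, 33.3, 49.0])

x_kept, y_kept = drop_extreme_targets(x_demo, y_demo)
print(x_kept.shape, y_kept)  # rows with targets 0.5, 50.0 and 49.0 are dropped
```

The mask keeps features and targets aligned in one step, so there is no separate index list to maintain.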
Remove unhelpful features
# Collect the column indices of the features to drop
DelList1 = []
for i in range(FeaturesNums):
    if names[i] in ('ZN', 'INDUS', 'RAD', 'TAX', 'CHAS', 'NOX', 'B', 'PTRATIO'):
        DelList1.append(i)
x_data = np.delete(x_data, DelList1, axis=1)
names = np.delete(names, DelList1)
FeaturesNums -= len(DelList1)
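Column selection by name can also be expressed with `np.isin`. The following is a sketch on made-up data; the helper `drop_features` is a hypothetical name, not something the original code defines.

```python
import numpy as np

def drop_features(x, feature_names, to_drop):
    """Return x without the named columns, plus the remaining names."""
    feature_names = np.asarray(feature_names)
    keep = ~np.isin(feature_names, to_drop)  # True for columns to keep
    return x[:, keep], feature_names[keep]

# Made-up matrix: 2 samples, 5 named columns.
demo_names = np.array(["CRIM", "ZN", "RM", "TAX", "LSTAT"])
x_demo = np.arange(10).reshape(2, 5)

x_small, kept_names = drop_features(x_demo, demo_names, ["ZN", "TAX"])
print(kept_names)
print(x_small)
```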
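Data split
from sklearn.model_selection import train_test_split
# Split first, so that x_train and x_test exist before they are normalized
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3)
Normalization
from sklearn.preprocessing import MinMaxScaler
# Fit each scaler on the training set only, then apply it to the test set,
# so that both sets share the same scale
x_scaler = MinMaxScaler()
x_train = x_scaler.fit_transform(x_train)
x_test = x_scaler.transform(x_test)
y_scaler = MinMaxScaler()
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test = y_scaler.transform(y_test.reshape(-1, 1))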
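Because the target is min-max scaled, any error computed on it is in scaled units rather than in thousands of dollars. The sketch below, on made-up numbers rather than the Boston data, shows how `MinMaxScaler.inverse_transform` could map predictions back to the original units; all `_demo` names and the predicted values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up targets in original units.
y_train_demo = np.array([10.0, 20.0, 30.0, 50.0]).reshape(-1, 1)
y_test_demo = np.array([15.0, 40.0]).reshape(-1, 1)

y_scaler_demo = MinMaxScaler()
y_train_scaled = y_scaler_demo.fit_transform(y_train_demo)  # learns min=10, max=50
y_test_scaled = y_scaler_demo.transform(y_test_demo)        # reuses the training range

# Pretend a model produced these scaled predictions (invented values).
y_pred_scaled = np.array([[0.2], [0.7]])
y_pred_original = y_scaler_demo.inverse_transform(y_pred_scaled)

mse_scaled = float(np.mean((y_test_scaled - y_pred_scaled) ** 2))
mse_original = float(np.mean((y_test_demo - y_pred_original) ** 2))
print(mse_scaled, mse_original)
```

In this toy case the two errors differ by a factor of (max - min)^2 = 40^2, which is why a small MSE on a scaled target says little on its own.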
Train the models
Try several models and pick the best one.
Linear regression (LinearRegression)
Train a linear regression model and check its MSE and R2 score on the test set.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("MSE =", mean_squared_error(y_test, y_pred), end='\n\n')
print("R2 =", r2_score(y_test, y_pred), end='\n\n')
MSE = 0.013304697805737791
R2 = 0.44625845284900767
Visualize the results
# Plot predictions against true values; points on the diagonal are perfect predictions
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, c="blue", edgecolors="aqua", s=13)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], lw=2, color='navy')
ax.set_xlabel('Reality')
ax.set_ylabel('Prediction')
plt.show()
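To make the two metrics concrete, here is a minimal sketch that computes MSE and R2 directly from their definitions with NumPy. The functions `mse` and `r2` and the demo values are illustrative assumptions; the article itself relies on `mean_squared_error` and `r2_score` from sklearn.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 8.0]

print("MSE =", mse(y_true_demo, y_pred_demo))
print("R2 =", r2(y_true_demo, y_pred_demo))
```

R2 compares the model against always predicting the mean of the true values, so it does not depend on the scale of the target, whereas MSE does.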
SVR model with a linear kernel
Train an SVR model with a linear kernel and check its cross-validated score.
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score
linear_svr = SVR(kernel='linear')
# linear_svr.fit(x_train, y_train)
# linear_pred = linear_svr.predict(x_test)
# SVR expects a 1-D target, so flatten the (n, 1) array produced by the scaler
linear_svr_pred = cross_val_predict(linear_svr, x_train, y_train.ravel(), cv=5)
linear_svr_score = cross_val_score(linear_svr, x_train, y_train.ravel(), cv=5)
linear_svr_meanscore = linear_svr_score.mean()
print("Linear_SVR_Score =", linear_svr_meanscore, end='\n')
Linear_SVR_Score = 0.6497361775614359
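The score above is a mean over five folds. For a regressor, `cross_val_score` with `cv=5` and no `scoring` argument uses unshuffled 5-fold splitting and the estimator's own `score` method, which for SVR is R2. The sketch below reproduces that by hand on synthetic data (an assumption, not the Boston data) and compares the result with the library call.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic, roughly linear data for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.uniform(size=(100, 3))
y_demo = x_demo @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.02, size=100)

model = SVR(kernel="linear")

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(x_demo):
    # Train a fresh copy on four folds, score it (R2) on the held-out fold.
    fold_model = clone(model).fit(x_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(x_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, x_demo, y_demo, cv=5)
print(manual_scores.mean(), library_scores.mean())
```

Note that this cross-validated R2 is computed on the training set only, so it is not directly comparable to the test-set R2 reported for linear regression.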
SVR model with a poly kernel
Train an SVR model with a polynomial kernel and check its cross-validated score.
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score
poly_svr = SVR(kernel='poly')
# SVR expects a 1-D target, so flatten the (n, 1) array produced by the scaler
poly_svr.fit(x_train, y_train.ravel())
poly_pred = poly_svr.predict(x_test)
poly_svr_pred = cross_val_predict(poly_svr, x_train, y_train.ravel(), cv=5)
poly_svr_score = cross_val_score(poly_svr, x_train, y_train.ravel(), cv=5)
poly_svr_meanscore = poly_svr_score.mean()
print('\n', "Poly_SVR_Score =", poly_svr_meanscore, end='\n')
Poly_SVR_Score = 0.5383303049258509
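To choose among the models on equal terms, each candidate can be scored with the same cross-validation procedure. The following is a sketch under stated assumptions: it uses synthetic data instead of the Boston data, and it wraps each model in a pipeline with `MinMaxScaler` so that scaling is refitted inside every fold. The names `candidates`, `mean_scores`, and `best_name` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic, roughly linear data for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.uniform(size=(150, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=150)

# Each pipeline scales the features, then fits the model.
candidates = {
    "LinearRegression": make_pipeline(MinMaxScaler(), LinearRegression()),
    "SVR linear": make_pipeline(MinMaxScaler(), SVR(kernel="linear")),
    "SVR poly": make_pipeline(MinMaxScaler(), SVR(kernel="poly")),
}

# Mean 5-fold R2 for every candidate, computed the same way.
mean_scores = {
    name: float(cross_val_score(pipe, x_demo, y_demo, cv=5).mean())
    for name, pipe in candidates.items()
}
best_name = max(mean_scores, key=mean_scores.get)

for name, score in mean_scores.items():
    print(f"{name}: mean R2 = {score:.4f}")
print("best:", best_name)
```

Which model ranks first on this toy data says nothing about the Boston data; the point is only that all candidates are evaluated with one procedure.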