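1. Get the data. sklearn ships with several commonly used datasets, for example the Boston housing dataset (Boston) for regression analysis and the iris dataset (iris) for classification. Here we use the Boston dataset. We can first call shape() and similar functions to inspect its basic properties: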
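from sklearn import datasets
from numpy import shape

# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
loaded_data = datasets.load_boston()
data_X = loaded_data.data    # feature matrix
data_y = loaded_data.target  # house prices (labels)

print(shape(data_X))
print(shape(data_y))
print(data_X[:2, :])  # first two samples
print(data_y[:2])     # their labels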
Output:
(506, 13)
(506,)
[[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 0.00000000e+00 5.38000000e-01 6.57500000e+00 6.52000000e+01 4.09000000e+00 1.00000000e+00 2.96000000e+02 1.53000000e+01 3.96900000e+02 4.98000000e+00]
 [ 2.73100000e-02 0.00000000e+00 7.07000000e+00 0.00000000e+00 4.69000000e-01 6.42100000e+00 7.89000000e+01 4.96710000e+00 2.00000000e+00 2.42000000e+02 1.78000000e+01 3.96900000e+02 9.14000000e+00]]
[ 24. 21.6]
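This shows that the dataset contains 506 samples, each with 13 feature values, and that the label is the house price. The first two samples are printed as well.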
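2. Split into training and test sets. We assign 20% of the samples to the test set and 80% to the training set, i.e. test_size=0.2. Again, we can call shape() to check the result of the split: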
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2)
print(shape(X_train))
print(shape(X_test))
Output:
(404, 13)
(102, 13)
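The sizes 404 and 102 follow from how the split is rounded. As far as we know, train_test_split rounds a fractional test_size up to a whole number of samples and gives the rest to the training set. The sketch below reproduces that arithmetic in plain Python, with a seeded shuffle standing in for the library's own random permutation. The helper name split_indices and the seed value are made up for illustration.

```python
import math
import random


def split_indices(n_samples, test_size, seed=0):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper: it mimics the rounding of a fractional test_size,
    not scikit-learn's actual shuffling.
    """
    n_test = math.ceil(test_size * n_samples)  # fractional test size is rounded up
    n_train = n_samples - n_test               # the remainder is used for training
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(506, 0.2)
print(len(train_idx), len(test_idx))  # 404 102
```

Because the split is random, the coefficients and MSE values in this post change from run to run. Passing an integer random_state to train_test_split fixes the split.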
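3. Fit the linear model. We use sklearn's least-squares linear regression model and fit it on the training set, which gives the weight vector w and the intercept b of the fitted linear function y = w^T x + b: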
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_)       # weights w, one per feature
print(model.intercept_)  # intercept b
Output:
[ -1.04864717e-01 3.97233700e-02 1.98757774e-02 2.30896040e+00 -1.76253192e+01 3.74803039e+00 1.28555952e-04 -1.56689014e+00 2.97635772e-01 -1.18908274e-02 -9.15199442e-01 1.04446613e-02 -5.55228840e-01]
36.3527413723
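To make "least squares" concrete, here is a minimal sketch for the single-feature case, where w and b have a closed form: w is the covariance of x and y divided by the variance of x, and b = mean(y) - w * mean(x). The function name fit_line and the toy data are invented for illustration. LinearRegression solves the same kind of problem for all 13 features at once.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b for one feature (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Toy data lying exactly on y = 2x + 1, so the fit recovers w = 2 and b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)  # 2.0 1.0
```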
4. Test the model. We predict on the test set and evaluate the predictions with the mean squared error (MSE):
from sklearn import metrics

y_pred = model.predict(X_test)
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
Output:
MSE: 19.1283413297
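mean_squared_error averages the squared differences between the true and predicted values. Taking its square root gives the root mean squared error (RMSE), which is in the same units as the target. A plain-Python sketch with made-up numbers (the function name mse is ours):

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of (true - predicted) ** 2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

error = mse(y_true, y_hat)        # (0.25 + 0.25 + 0 + 1) / 4
print("MSE:", error)              # MSE: 0.375
print("RMSE:", math.sqrt(error))
```

For the MSE of about 19.13 above, the RMSE is about 4.37, in the same units as the house-price labels.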
5. Cross-validation. We use 10-fold cross-validation, i.e. cv=10, and compute the MSE of the cross-validated predictions:
from sklearn.model_selection import cross_val_predict

# Each sample is predicted by a model trained on the other folds.
predicted = cross_val_predict(model, data_X, data_y, cv=10)
print("MSE:", metrics.mean_squared_error(data_y, predicted))
Output:
MSE: 34.5970425577
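With an integer cv and a regressor, cross_val_predict should, to our understanding, use plain unshuffled K-fold: the samples are cut into 10 consecutive blocks, and each block is predicted by a model trained on the other nine. The sketch below only shows how 506 indices fall into 10 folds. kfold_slices is a made-up helper, and the rule that the first folds receive the extra samples is our reading of scikit-learn's KFold.

```python
def kfold_slices(n_samples, n_splits):
    """Yield (start, stop) index ranges of consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one more sample than the rest.
        size = base + 1 if fold < extra else base
        yield start, start + size
        start += size


folds = list(kfold_slices(506, 10))
for start, stop in folds:
    print(start, stop, stop - start)  # six folds of 51 samples, then four of 50
```

Every sample lands in exactly one test fold, so predicted holds one out-of-sample prediction per row of data_X. Because the folds are consecutive blocks rather than a random sample, the cross-validated MSE need not match the single hold-out MSE from step 4.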
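6. Plot. We compare the actual house prices with the predicted ones. Points close to the green diagonal in the middle indicate accurate predictions: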
import matplotlib.pyplot as plt

plt.scatter(data_y, predicted, color='y', marker='o')  # actual vs. predicted price
plt.scatter(data_y, data_y, color='g', marker='+')     # reference diagonal (perfect prediction)
plt.show()
Output figure:
7. The complete code:
# -*- encoding:utf-8 -*-
from sklearn import datasets
from sklearn import metrics
# cross_validation, used in the original article, is deprecated; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from numpy import shape

# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
loaded_data = datasets.load_boston()
data_X = loaded_data.data
data_y = loaded_data.target
# print(shape(data_X))
# print(shape(data_y))
# print(data_X[:2, :])
# print(data_y[:2])

X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2)
# print(shape(X_train))
# print(shape(X_test))

model = LinearRegression()
model.fit(X_train, y_train)
# print(model.coef_)
# print(model.intercept_)

# MSE on the held-out test set
y_pred = model.predict(X_test)
print("MSE:", metrics.mean_squared_error(y_test, y_pred))

# MSE of the 10-fold cross-validated predictions
predicted = cross_val_predict(model, data_X, data_y, cv=10)
print("MSE:", metrics.mean_squared_error(data_y, predicted))

plt.scatter(data_y, predicted, color='y', marker='o')
plt.scatter(data_y, data_y, color='g', marker='+')
plt.show()