import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
%matplotlib inline

# Font used for plot labels (macOS path; adjust for your system)
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')
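In "Code: Ordinary Linear Regression" we noted that, of all the features, LSTAT has the highest correlation with the label MEDV, but the relationship between the two is not linear. This time we therefore try polynomial regression to fit that relationship.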
# Columns are whitespace-separated; the first row holds the column names
df = pd.read_csv('housing-data.txt', sep=r'\s+', header=0)
X = df[['LSTAT']].values
y = df['MEDV'].values
# Degree-2 (quadratic) polynomial features
quadratic = PolynomialFeatures(degree=2)
# Degree-3 (cubic) polynomial features
cubic = PolynomialFeatures(degree=3)
# Expand X into its quadratic and cubic feature matrices
X_quad = quadratic.fit_transform(X)
X_cubic = cubic.fit_transform(X)

# Evenly spaced x-axis points used to draw the fitted curves
X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]

lr = LinearRegression()

# Linear regression
lr.fit(X, y)
lr_predict = lr.predict(X_fit)
# R^2 of the linear fit
lr_r2 = r2_score(y, lr.predict(X))

# Quadratic regression
lr = lr.fit(X_quad, y)
quad_predict = lr.predict(quadratic.transform(X_fit))
# R^2 of the quadratic fit
quadratic_r2 = r2_score(y, lr.predict(X_quad))

# Cubic regression
lr = lr.fit(X_cubic, y)
cubic_predict = lr.predict(cubic.transform(X_fit))
# R^2 of the cubic fit
cubic_r2 = r2_score(y, lr.predict(X_cubic))

# lr.score also returns R^2, so both lines print the same value
print(lr.score(X_cubic, y))
print(cubic_r2)
0.6578476405895719
0.6578476405895719
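To make the feature expansion concrete, the following minimal sketch shows what `PolynomialFeatures(degree=2)` produces for a single input column. The names `X_demo`, `quad_demo`, and `X_demo_quad` and the toy values are assumptions made for illustration; they are not part of the housing data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-feature input (hypothetical values, not from the housing data)
X_demo = np.array([[2.0], [3.0]])

# degree=2 expands each x into [1, x, x^2]; the leading 1 is the bias column
quad_demo = PolynomialFeatures(degree=2)
X_demo_quad = quad_demo.fit_transform(X_demo)

print(X_demo_quad)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

Fitting `LinearRegression` on these expanded columns is still a linear model in its coefficients; only the inputs are nonlinear in x.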
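r2_score reports the coefficient of determination \((R^2)\), which can be understood as a standardized version of the MSE. The formula for \(R^2\) is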
\[ R^2 = 1-\frac{\frac{1}{n}\sum_{i=1}^n\left(y^{(i)}-\hat{y}^{(i)}\right)^2}{\frac{1}{n}\sum_{i=1}^n\left(y^{(i)}-\mu_{(y)}\right)^2} \]
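where \(\mu_{(y)}\) is the mean of \(y\). The denominator \({{\frac{1}{n}}\sum_{i=1}^n(y^{(i)}-\mu_{(y)})^2}\) is therefore the variance of \(y\) and the numerator is the MSE, so the formula can be written as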
\[ R^2 = 1-{\frac{MSE}{Var(y)}} \]
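For a least-squares model with an intercept, evaluated on its own training data, \(R^2\) lies between \(0\) and \(1\) (on unseen data it can be negative). If \(R^2=1\), the mean squared error is \(MSE=0\), i.e., the model fits the data perfectly.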
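The identity \(R^2 = 1-\frac{MSE}{Var(y)}\) can be checked numerically. The sketch below uses synthetic data (an assumption, so that it runs without the housing file) and hypothetical names such as `X_demo`, `y_demo`, and `r2_manual`; it compares the hand-computed value against `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumption): a noisy straight line
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

mse = mean_squared_error(y_demo, y_pred)
# np.var divides by n by default, matching the 1/n in the formula
var_y = np.var(y_demo)

r2_manual = 1 - mse / var_y
r2_sklearn = r2_score(y_demo, y_pred)

print(r2_manual, r2_sklearn)
```

The two printed values should agree up to floating-point rounding.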
# Scatter plot of the training data
plt.scatter(X, y, c='gray', edgecolor='white', marker='s', label='Training data')
# Fitted curves for degree 1, 2 and 3, each labelled with its R^2
plt.plot(X_fit, lr_predict, c='r',
         label='Linear (d=1), $R^2={:.2f}$'.format(lr_r2), linestyle='--', lw=3)
plt.plot(X_fit, quad_predict, c='g',
         label='Quadratic (d=2), $R^2={:.2f}$'.format(quadratic_r2), linestyle='-', lw=3)
plt.plot(X_fit, cubic_predict, c='b',
         label='Cubic (d=3), $R^2={:.2f}$'.format(cubic_r2), linestyle=':', lw=3)
plt.xlabel('% lower status of the population [LSTAT]', fontproperties=font)
# The target plotted on the y-axis is MEDV, not RM
plt.ylabel('House price in $1000s [MEDV]', fontproperties=font)
plt.title('Boston house price prediction', fontproperties=font, fontsize=20)
plt.legend(prop=font)
plt.show()
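The plot above shows that the cubic fit is better than both the quadratic and the linear fit. However, whenever model complexity is increased, we also need to keep checking whether the model is starting to overfit.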
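One common way to watch for overfitting is to compare \(R^2\) on the training data with \(R^2\) on held-out data as the degree grows. The following is only a sketch under stated assumptions: it uses synthetic data in place of `housing-data.txt`, and the names `X_demo`, `y_demo`, and `scores` as well as the 70/30 split are illustrative choices, not the original author's procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data (assumption): a decaying curve plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 40, size=(200, 1))
y_demo = 50 * np.exp(-0.05 * X_demo[:, 0]) + rng.normal(0, 3, size=200)

# Hold out 30% of the samples to estimate generalization
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

scores = {}
for degree in (1, 2, 3):
    # Polynomial expansion followed by ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )

for degree, (train_r2, test_r2) in scores.items():
    print(f'd={degree}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}')
```

Training \(R^2\) cannot decrease as the degree increases, because each higher-degree model contains the lower-degree one. A test \(R^2\) that stalls or drops while training \(R^2\) keeps rising is a sign of overfitting.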