多元線性迴歸模型檢驗和預測

時間 2020-06-04

標籤多元線性迴歸模型檢驗預測欄目應用數學简体版

原文原文鏈接

1、概述

（F檢驗）顯著性檢驗：檢測自變量是否真正影響到因變量的波動。python

（t檢驗）迴歸係數檢驗：單個自變量在模型中是否有效。api

2、迴歸模型檢驗

檢驗迴歸模型的好壞經常使用的是F檢驗和t檢驗。F檢驗驗證的是偏回歸係數是否不全爲0（或全爲0），t檢驗驗證的是單個自變量是否對因變量的影響是顯著的（或不顯著）。dom

F檢驗和t檢驗步驟：函數

提出問題的原假設和備擇假設
在原假設的條件下，構造統計量
根據樣本信息，計算統計量的值
對比統計量的值和理論F分佈的值，計算統計量的值超過理論值，則拒絕原假設，不然接受原假設

待檢驗數據集先構形成向量的模式：spa

可表示成以下形式：3d

其中β 爲n×1的一維向量。excel

1) 假設code

F檢驗假設：orm

t檢驗假設：blog

H0爲原假設，H1爲備擇假設。F檢驗拒絕原假設的條件爲計算的F檢驗的值大於查到的理論F值。t檢驗能夠經過P值和擬合優度判斷變量的顯著性及模型組合的優劣。

2）計算過程

F檢驗計算過程：

上圖爲假設其中一個點所在的平面，由以上點計算出ESS(偏差平方和)，RSS（迴歸離差平方和）,TSS（總的離差平方和）。

計算公式爲：

其中ESS和RSS都會隨着模型的變化而發生變化（估計值變更）。而TSS衡量的是因變量和均值之間的離差平方和，不會隨着模型的變化而變化。

由以上公式構造F統計量：

（n爲數據集向量的行數，p爲列數（自變量的個數），n爲離差平方和RSS的自由度，n-p-1爲偏差平方和ESS的自由度），模型擬合越好，ESS越小，RSS越大，F值越大。

3）python中計算F值得方法：能夠直接經過model.fvalue獲得f值或者經過計算獲得。

from sklearn import model_selection
import statsmodels.api as sm
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from sklearn import preprocessing

Profit = pd.read_excel(r'Predict to Profit.xlsx')
Profit.head()

dummies = pd.get_dummies(Profit.State,prefix = 'State')
Profit_New = pd.concat([Profit,dummies],axis=1)
Profit_New.drop(labels = ['State','State_New York'],axis =1,inplace = True)

train , test = model_selection.train_test_split(Profit_New,test_size = 0.2,random_state=1234)
model = sm.formula.ols('Profit~RD_Spend+Administration+Marketing_Spend+State_California+State_Florida',data = train).fit()

ybar = train.Profit.mean()
p = model.df_model  #自變量個數
n= train.shape[0]    #行數，觀測個數
RSS = np.sum((model.fittedvalues - ybar)**2)  #計算離差平方和 估計值model.fittedvalues
ESS = np.sum(model.resid**2)  #計算偏差平方和，偏差表示model.resid
F = RSS/p/(ESS/(n-p-1))
print( F, model.fvalue)

　　結果以下：

求理論F值，使用Scipy庫子模塊stats中f.ppf

from scipy.stats import f
F_Theroy = f.ppf(q=0.95,dfn = p,dfd = n-p-1)
F_Theroy

因爲計算的F值遠遠大於理論F值，因此拒絕原假設，證實多元線性迴歸方程是顯著的，偏回歸係數不全爲0，即全部自變量聯合起來的組合確實對利潤有顯著性影響。

4) t檢驗手工計算比較複雜，套用python的驗計算方式，model.summary()

參數：R-squared爲擬合優度（反映模型對樣本的擬合程度）,Adj.R-squared爲修正的可決係數。

經過t檢驗得出，只有RD_Spend的P值小於0.05，其他變量均大於0.05，說明其他變量不是影響利潤Profit的重要因素。

3、迴歸模型診斷

實際應用中，因變量若爲數值型變量，能夠考慮線性迴歸模型。模型需知足以下假設：

　　因變量（殘差）服從正態分佈，自變量之間不存在多重共線性，自變量和因變量之間存在線性關係，用於建模的數據不存在異常值，殘差項知足方差別性和獨立性。

正態性檢驗（直方圖，pp圖或qq圖，shapiro檢驗和K-S檢驗）
多重共線性檢驗（VIF>10，存在多重共線性，>100則變量間存在嚴重的多重共線性）
線性相關性（數據框的corrwith方法或者seaborn模塊的pairplot函數做圖觀察）|p|~>=0.8 高度相關，0.5《 |p|《0.8 中度相關，0.3《 |p|《0.5 弱相關，|p|《0.3 幾乎不相關
異常值檢驗（帽子矩陣、DFFITS準則、學生化殘差、Cook距離）
殘差獨立性檢驗（summary()方法觀察Durbin-Watson值），DW值在2左右，則殘差項之間不相關，偏離2較遠，則不知足殘差獨立性假設
方差齊性檢驗（圖形法和BP檢驗）

具體方法過程：

1）正態性檢測（模型的前提假設是殘差服從（0，1）正態分佈，實質上由方程y=Xβ+ε，因變量也應該服從正態分佈），檢測的是因變量是否符合正態分佈

直方圖 scipy.stats seaborn.distplot

import seaborn as sns
import scipy.stats as stats
from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

sns.distplot(a = Profit_New.Profit,
            bins =10,
            fit = stats.norm,
            norm_hist = True,
            hist_kws = {'color':'green','edgecolor':'black'},
            kde_kws = {'color':'black','linestyle':'--','label':'核密度曲線'},
            fit_kws = {'color':'red','linestyle':':','label':'正態密度曲線'}
           )
plt.legend()
plt.show()

經過觀察覈密度曲線和理論正態密度曲線的擬合程度，2條曲線近似吻合，可認爲因變量Profit服從正態分佈。

PP圖和QQ圖 ppplot qqplot

pq_plot = sm.ProbPlot(Profit_New.Profit)
pq_plot.ppplot(line='45')
plt.title('PP圖')
pq_plot.qqplot(line='q')
plt.title('QQ圖')
plt.show()

觀察到散點均均勻的分佈在直線上，說明因變量近似服從正態分佈。

Shapiro檢驗和K-S檢驗（因變量數據量>5000 use K-S檢驗法，<5000 use Shapiro檢驗）

stats.shapiro(Profit_New.Profit)

　　輸出爲

(0.9793398380279541, 0.537902295589447)，第一個值爲shapiro檢驗的統計量值，第二個值爲對應機率p值,p值大於0.05置信水平，利潤變量服從正態分佈假設。

K-S檢驗 KSValue = stats.kstest(rvs = datay,args = (datay.mean(),datay.std(),cdf = 'norm'))
須要注意的是KS檢驗，必須指定args參數，來傳遞檢驗變量的均值和標準差。

2）多重共線性檢驗（自變量之間的檢測）

檢測模型中自變量之間存在較高的線性相關關係，若存在會給模型帶來嚴重的後果。其檢驗能夠經過方差膨脹因子VIF來鑑定。（VIF>10，存在多重共線性，>100則變量間存在嚴重的多重共線性）

消除多重共線性，能夠經過不一樣自變量之間的組合觀察vif值，肯定最佳組合。R語言中消除多重共線性，肯定最優組合使用模型的step方法（逐步迴歸）。

計算方式：

X1爲第一個自變量。構造x1與其他自變量的線性迴歸模型，獲得迴歸模型的判決係數R2，獲得第一個自變量的膨脹因子，依次類推計算剩餘變量的VIF。

Python 的statsmodels模塊可直接計算VIF值

from statsmodels.stats.outliers_influence import variance_inflation_factor
X = sm.add_constant(Profit_New.ix[:,['RD_Spend','Marketing_Spend']])
vif = pd.DataFrame()
vif['features'] = X.columns
vif["VIF Factor"] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]

獲得vif的值：

計算全部的變量:

X = sm.add_constant(Profit_New.drop('Profit',axis =1).ix[:])
vif = pd.DataFrame()
vif['features'] = X.columns
vif["VIF Factor"] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif

vif值:

第二種模型獲得的const的值大於10，說明構建模型的變量之間存在多重共線性，須要刪除變量從新選擇模型。第一種模式選擇2個自變量（'RD_Spend','Marketing_Spend'）是符合自變量之間無線性相關要求的。

3）線性相關性檢驗（各變量之間線性相關性檢測）

圖形法 seaborn pariplot函數

sns.pairplot(Profit_New.ix[:,['RD_Spend','Administration','Marketing_Spend','Profit']]) #去除啞變量
plt.show()

　　散點圖矩陣圖示：

觀察散點圖分佈得出，RD_Spend，Marketing_Spend兩個變量和因變量Profit之間存在線性關係，Administration與Profit之間散點圖呈水平趨勢，說明2者之間不相關。

數據框的corrwith方法(該方式優勢是能夠計算任意指定變量之間的相關係數)

判斷條件：

# Profit 和各變量之間的相關係數
Profit_New.drop(['Profit','State_California','State_Florida'],axis =1).corrwith(Profit_New.Profit)

結果看出：自變量RD_Spend和Marketing_Spend與因變量之間存在較高的線性相關性，而其餘變量與利潤之間線性相關性比較弱。可忽略其線性相關關係。

須要注意的是經過corrwith檢測變量之間不存在線性相關關係，結合seaborn.pariplot觀察，可能存在其餘關係如2次方，3次方，倒數關係等等（在變量經過t檢驗被確認爲對因變量影響是顯著的狀況下，還須要對模型再次檢查）。

經過以下方式從新肯定模型爲： Profit = 51902.112471 + 0.785116RD_Spend + 0.019402Marketing_Spend

model3 = sm.formula.ols('Profit~RD_Spend + Marketing_Spend ',data = train).fit()
model3.params

4）異常值檢測（帽子矩陣、DFFITS準則、學生化殘差、Cook距離）

確認方式：|學生化殘差| > 2，則對應的數據點爲異常值。

對異常值得處理：異常值檢測 model.get_influence() ，學生化殘差值 model.get_influence().resid_studentized_external

確認異常樣本的比例，比例<5%，考慮能夠直接刪除異常值，>5%，能夠生成啞變量，對於異常點，設置啞變量爲1，不然爲0。

# 異常值檢測
outliers = model3.get_influence()   

# 帽子矩陣
leverage = outliers.hat_matrix_diag
# dffits值
dffits= outliers.dffits[0]
# 學生化殘差
resid_stu = outliers.resid_studentized_external
# cook距離
cook = outliers.cooks_distance[0]


# 合併各類異常值檢驗的統計量值
contatl = pd.concat([pd.Series(leverage, name = 'leverage'),
                     pd.Series(dffits, name = 'dffits'),
                     pd.Series(resid_stu, name = 'resid_stu'),
                     pd.Series(cook, name = 'cook')
                    ],axis =1 )

train.index = range(train.shape[0])
profit_outliers = pd.concat([train,contatl ],axis =1)
profit_outliers.head()

使用非異常值的數據點進行建模：

# 計算異常值比例
outliers_ratio = np.sum(np.where((np.abs(profit_outliers.resid_stu)>2),1,0))/profit_outliers.shape[0]
outliers_ratio  # 0.02564

none_outliers = profit_outliers.ix[np.abs(profit_outliers.resid_stu)<=2,]
model4 = sm.formula.ols('Profit~RD_Spend+Marketing_Spend',data = none_outliers).fit()
model4.params

再次更新模型爲：Profit = 51827.416821 + 0.797038RD_Spend + 0.01774Marketing_Spend

5）殘差獨立性檢驗

經過mdoel.summary函數觀察Durbin-Watson的值，在2左右說明殘差之間是不相關的，偏離2較遠，說明殘差之間是不獨立的。

model4.summary()，DW的值爲2.065，說明模型殘差知足獨立性假設。

6）殘差方差齊性檢驗（圖形法和BP法）

目的：殘差項的方差不隨自變量的變更而呈現某種趨勢，即殘差項的方差不受自變量變化而影響

圖形法：繪製殘差和自變量之間的散點圖

ax1 = plt.subplot2grid(shape = (2,1) , loc = (0,0)) # 設置第一張子圖位置
# 散點圖繪製
# ax1.scatter(none_outliers.RD_Spend, none_outliers.resid_stu )  #學生化殘差與自變量散點圖
ax1.scatter(none_outliers.RD_Spend, (model4.resid - model4.resid.mean())/model4.resid.std()) #標準化殘差和自變量散點圖
# 添加水平參考線
ax1.hlines(y = 0 ,
          xmin = none_outliers.RD_Spend.min(),
           xmax = none_outliers.RD_Spend.max(),
           color = 'red',
           linestyle = '--'
          )
ax1.set_xlabel('RD_Spend')
ax1.set_ylabel('Std_Residual')
ax2 = plt.subplot2grid(shape = (2,1) , loc = (1,0))
# ax1.scatter(none_outliers.Marketing_Spend, none_outliers.resid_stu )
ax2.scatter(none_outliers.Marketing_Spend, (model4.resid - model4.resid.mean())/model4.resid.std())
ax2.hlines(y = 0 ,
          xmin = none_outliers.Marketing_Spend.min(),
           xmax = none_outliers.Marketing_Spend.max(),
           color = 'magenta',
           linestyle = '--'
          )
ax2.set_xlabel('Marketing_Spend')
ax2.set_ylabel('Std_Residual')

# 調整2子圖之間距離
plt.subplots_adjust(hspace = 0.6,wspace = 0.3)
plt.show()

　注意的是採用學生化殘差或標準化殘差繪製和自變量的散點圖都可以，2者圖形相似。

緣由是學生化殘差計算公式爲：

圖形結果以下：

根據結果可知標準化殘差並無隨着自變量變化而呈現某種趨勢，圖形中全部點均勻的分佈在y=0兩側，能夠認爲殘差項知足方差齊性假設。

方差齊性檢測完整代碼：

from sklearn import model_selection
import statsmodels.api as sm
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

Profit = pd.read_excel(r'Predict to Profit.xlsx')

dummies = pd.get_dummies(Profit.State,prefix = 'State')
Profit_New = pd.concat([Profit,dummies],axis=1)
Profit_New.drop(labels = ['State','State_New York'],axis =1,inplace = True)
train , test = model_selection.train_test_split(Profit_New,test_size = 0.2,random_state=1234)
model3 = sm.formula.ols('Profit~RD_Spend + Marketing_Spend ',data = train).fit()
# 異常值檢測
outliers = model3.get_influence()   

# 帽子矩陣
leverage = outliers.hat_matrix_diag
# dffits值
dffits= outliers.dffits[0]
# 學生化殘差
resid_stu = outliers.resid_studentized_external
# cook距離
cook = outliers.cooks_distance[0]
contatl = pd.concat([pd.Series(leverage, name = 'leverage'),
                     pd.Series(dffits, name = 'dffits'),
                     pd.Series(resid_stu, name = 'resid_stu'),
                     pd.Series(cook, name = 'cook')
                    ],axis =1 )

train.index = range(train.shape[0])
profit_outliers = pd.concat([train,contatl ],axis =1)

outliers_ratio = np.sum(np.where((np.abs(profit_outliers.resid_stu)>2),1,0))/profit_outliers.shape[0]

none_outliers = profit_outliers.ix[np.abs(profit_outliers.resid_stu)<=2,]
model4 = sm.formula.ols('Profit~RD_Spend+Marketing_Spend',data = none_outliers).fit()

ax1 = plt.subplot2grid(shape = (2,1) , loc = (0,0)) # 設置第一張子圖位置
# 散點圖繪製
# ax1.scatter(none_outliers.RD_Spend, none_outliers.resid_stu )
ax1.scatter(none_outliers.RD_Spend, (model4.resid - model4.resid.mean())/model4.resid.std())
# 添加水平參考線
ax1.hlines(y = 0 ,
          xmin = none_outliers.RD_Spend.min(),
           xmax = none_outliers.RD_Spend.max(),
           color = 'red',
           linestyle = '--'
          )
ax1.set_xlabel('RD_Spend')
ax1.set_ylabel('Std_Residual')
ax2 = plt.subplot2grid(shape = (2,1) , loc = (1,0))
# ax1.scatter(none_outliers.Marketing_Spend, none_outliers.resid_stu )
ax2.scatter(none_outliers.Marketing_Spend, (model4.resid - model4.resid.mean())/model4.resid.std())
ax2.hlines(y = 0 ,
          xmin = none_outliers.Marketing_Spend.min(),
           xmax = none_outliers.Marketing_Spend.max(),
           color = 'magenta',
           linestyle = '--'
          )
ax2.set_xlabel('Marketing_Spend')
ax2.set_ylabel('Std_Residual')

# 調整2子圖之間距離
plt.subplots_adjust(hspace = 0.6,wspace = 0.3)
plt.show()

BP法（拉格朗日乘子檢驗）：statsmodels模塊het_breuschpagan函數 https://programtalk.com/python-examples/statsmodels.stats/

sm.stats.diagnostic.het_breuschpagan(model4.resid, exog_het = model4.model.exog)

結果獲得4個值

(1.4675103668308342, 0.48010272699006384, 0.7029751237162462, 0.5019659740962872)

第一個值爲LM統計量，第二個爲p值，第三個爲F統計量，第四個爲F統計量的p值，經過觀察p值均大於0.05，證實接受殘差方差齊性原假設，即殘差不受自變量的影響而變化。

若果殘差不知足方差齊性假設，觀察殘差和自變量的關係，經過模型變換或加權最小二乘法對模型加以處理。

4、迴歸模型預測

pred4 = model4.predict(exog = test.ix[:,['RD_Spend','Marketing_Spend']])
plt.scatter(x = test.Profit ,y = pred4)
plt.plot([ test.Profit.min(),test.Profit.max()],
         [test.Profit.min(),test.Profit.max()],
        color = 'red',
         linestyle = '--'
)
plt.xlabel('實際值')
plt.ylabel('預測值')
plt.show()

　真實值和預測值之間差別：

print(pd.DataFrame({"prediction":pred4,'real':test.Profit}))

圖形展現效果：

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。