The sky is clear and the breeze is gentle; idling at home is the perfect time to read. The write-ups I found left me puzzled, and the open-source code has no annotations, so if nobody else will do it, I'll step up myself. Enough preamble; straight to the topic. o( ̄︶ ̄)o
Suppose we want to probe the relationship between an independent variable $X$ and a dependent variable $Y$. The simplest approach is linear regression, i.e., we assume $Y = X\beta + \epsilon$.
We estimate $\beta$ by ordinary least squares (OLS). To say anything about the reliability of $\hat\beta$, we also place assumptions on the residuals $\epsilon$: they are independent random variables with mean 0 and constant variance. $X\beta$ is then the conditional mean of the dependent variable $Y$ given the independent variable $X$.
If the residuals $\epsilon$ do not satisfy our assumptions, or, more importantly, if we want to know not just the conditional mean of $Y$ given $X$ but its conditional median (more generally, its conditional quantiles), then linear regression under OLS no longer meets our needs. Quantile regression [2] solves these problems. Below I first give a practical example of quantile regression, then outline its theory, and finally analyze the source code of its Python implementation.
This example comes from statsmodels: Quantile Regression [3]. We want to explore the relationship between household income and food expenditure, with data from working class Belgian households in 1857 (the Engel data). We use the Python package statsmodels to run the quantile regression.
1.1 Preprocessing
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

data = sm.datasets.engel.load_pandas().data
data.head()
       income     foodexp
0  420.157651  255.839425
1  541.411707  310.958667
2  901.157457  485.680014
3  639.080229  402.997356
4  750.875606  495.560775
1.2 Median regression (a special case of quantile regression, q=0.5)
mod = smf.quantreg('foodexp ~ income', data)
res = mod.fit(q=.5)
print(res.summary())
                         QuantReg Regression Results
==============================================================================
Dep. Variable:                foodexp   Pseudo R-squared:               0.6206
Model:                       QuantReg   Bandwidth:                       64.51
Method:                 Least Squares   Sparsity:                        209.3
Date:                Mon, 21 Oct 2019   No. Observations:                  235
Time:                        17:46:59   Df Residuals:                      233
                                        Df Model:                            1
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     81.4823     14.634      5.568      0.000      52.649     110.315
income         0.5602      0.013     42.516      0.000       0.534       0.586
==============================================================================

The condition number is large, 2.38e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
From the results we can read off the fitted median regression: foodexp = 81.48 + 0.56 × income. But how are these coefficient estimates obtained? And what are the std err, t, and Pseudo R-squared in the output? I will explain shortly.
1.3 Data visualization
We first fit 10 quantile regressions, with the quantile q ranging from 0.05 to 0.95.
quantiles = np.arange(.05, .96, .1)

def fit_model(q):
    res = mod.fit(q=q)
    return [q, res.params['Intercept'], res.params['income']] + \
           res.conf_int().loc['income'].tolist()

models = [fit_model(x) for x in quantiles]
models = pd.DataFrame(models, columns=['q', 'a', 'b', 'lb', 'ub'])

ols = smf.ols('foodexp ~ income', data).fit()
ols_ci = ols.conf_int().loc['income'].tolist()
ols = dict(a=ols.params['Intercept'],
           b=ols.params['income'],
           lb=ols_ci[0],
           ub=ols_ci[1])

print(models)
print(ols)
      q           a         b        lb        ub
0  0.05  124.880096  0.343361  0.268632  0.418090
1  0.15  111.693660  0.423708  0.382780  0.464636
2  0.25   95.483539  0.474103  0.439900  0.508306
3  0.35  105.841294  0.488901  0.457759  0.520043
4  0.45   81.083647  0.552428  0.525021  0.579835
5  0.55   89.661370  0.565601  0.540955  0.590247
6  0.65   74.033435  0.604576  0.582169  0.626982
7  0.75   62.396584  0.644014  0.622411  0.665617
8  0.85   52.272216  0.677603  0.657383  0.697823
9  0.95   64.103964  0.709069  0.687831  0.730306
{'a': 147.47538852370562, 'b': 0.48517842367692354, 'lb': 0.4568738130184233,
Here we fit 10 regressions: q is the corresponding quantile, a is the intercept, and b is the coefficient on income (the slope). lb and ub are the lower and upper bounds of the 95% confidence interval for b.
Now let's plot these 10 regression lines:
x = np.arange(data.income.min(), data.income.max(), 50)
get_y = lambda a, b: a + b * x

fig, ax = plt.subplots(figsize=(8, 6))
for i in range(models.shape[0]):
    y = get_y(models.a[i], models.b[i])
    ax.plot(x, y, linestyle='dotted', color='grey')

y = get_y(ols['a'], ols['b'])
ax.plot(x, y, color='red', label='OLS')

ax.scatter(data.income, data.foodexp, alpha=.2)
ax.set_xlim((240, 3000))
ax.set_ylim((240, 2000))
legend = ax.legend()
ax.set_xlabel('Income', fontsize=16)
ax.set_ylabel('Food expenditure', fontsize=16);
In the figure above, the dotted lines are the quantile regression lines and the red line is the OLS regression line. By inspection we can observe three phenomena: (1) food expenditure increases with income; (2) the dispersion of food expenditure also increases with income; (3) the least squares line fits the low-income observations rather poorly.
This part is mathematical statistics; readers who only care about the implementation can skip it. We need to answer the questions raised above: how a quantile is defined and estimated, how the regression coefficients are estimated, and where the std err, t, and Pseudo R-squared in the output come from.
2.1 Definition of a quantile
Let $Y$ be a random variable with cumulative distribution function $F_Y(y) = P(Y \le y)$. The $q$-quantile of $Y$ is

$Q_Y(q) = \inf\{\, y : F_Y(y) \ge q \,\}, \quad 0 < q < 1.$
For example, suppose there are 100 people and 95% of them are shorter than 1.9 m; then 1.9 m is the 95% quantile of height.
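As a quick sanity check of the definition, here is a minimal sketch (the heights array is illustrative data I generated, not from any dataset in this post):

import numpy as np

rng = np.random.default_rng(0)
# illustrative sample: 100 heights in metres, roughly normal around 1.75 m
heights = rng.normal(loc=1.75, scale=0.09, size=100)

# the 95% quantile: the smallest value y with F(y) >= 0.95
q95 = np.quantile(heights, 0.95)
print(q95)                      # around 1.9 for this sample
print(np.mean(heights <= q95))  # at least 0.95, by definition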
2.2 Estimating a quantile
Given observations $y_1, \dots, y_n$ of $Y$, we try different values of $u$ and minimize

$\sum_{i:\, y_i \ge u} q\,|y_i - u| \;+\; \sum_{i:\, y_i < u} (1-q)\,|y_i - u|;$

the minimizing value of $u$ is the estimate $\hat{Q}_Y(q)$ of the $q$-quantile of $Y$.
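We can verify numerically that the minimizer of this loss is indeed the sample quantile; a minimal sketch (the sample and the search grid are illustrative):

import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=1000)
q = 0.75

def check_loss(u, y, q):
    # loss is q*|e| for e >= 0 and (1-q)*|e| for e < 0, summed over the sample
    e = y - u
    return np.sum(np.where(e >= 0, q * e, (q - 1) * e))

grid = np.linspace(y.min(), y.max(), 2001)
u_star = grid[np.argmin([check_loss(u, y, q) for u in grid])]
print(u_star, np.quantile(y, q))  # both close to the 0.75 sample quantile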
2.3 Quantile regression
For OLS, we have:

$\hat\beta = \arg\min_\beta \sum_{i=1}^n (y_i - x_i\beta)^2,$

i.e., $\hat\beta$ is the $\beta$ that minimizes the sum of squared residuals. Analogously, for $q$-quantile regression, we have:

$\hat\beta_q = \arg\min_\beta \sum_{i=1}^n \rho_q(y_i - x_i\beta),$

i.e., the $\beta$ that minimizes

$\sum_{i:\, y_i \ge x_i\beta} q\,|y_i - x_i\beta| \;+\; \sum_{i:\, y_i < x_i\beta} (1-q)\,|y_i - x_i\beta|.$
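Before turning to IRLS, note that this objective can be minimized directly with a generic optimizer. A minimal sketch on the Engel data (reusing `data` from section 1; `scipy.optimize.minimize` is my stand-in here, not what statsmodels actually uses):

import numpy as np
from scipy.optimize import minimize

q = 0.5
X = np.column_stack([np.ones(len(data)), data.income.values])
y = data.foodexp.values

def objective(beta):
    e = y - X @ beta
    return np.sum(np.where(e >= 0, q * e, (q - 1) * e))

# Nelder-Mead copes with the kink at e = 0 where the loss is not differentiable
res_direct = minimize(objective, x0=np.array([0.0, 0.5]), method='Nelder-Mead')
print(res_direct.x)  # should be close to (81.48, 0.5602) from smf.quantreg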
2.4 Coefficient estimation
Since the objective $\sum_i \rho_q(y_i - x_i\beta)$ cannot be differentiated directly with respect to $\beta$, we can only use an iterative method to approach the minimizing $\beta$. statsmodels uses Iteratively Reweighted Least Squares (IRLS).
Suppose we want the $\beta$ that minimizes an $L_1$-norm objective of the form

$\lVert y - X\beta \rVert_1 = \sum_{i=1}^n |y_i - x_i\beta|.$

Then the value of $\beta$ at iteration $t+1$ is

$\beta^{(t+1)} = \left(X^T W^{(t)} X\right)^{-1} X^T W^{(t)} y,$

where $W^{(t)}$ is a diagonal matrix with initial value $W^{(0)} = I$ and, at the $t$-th iteration,

$W^{(t)} = \operatorname{diag}\!\left(\frac{1}{|y_i - x_i\beta^{(t)}|}\right).$
Taking median regression as an example (q=0.5), we seek

$\hat\beta = \arg\min_\beta \sum_{i=1}^n 0.5\,|y_i - x_i\beta|,$

i.e., we minimize an $L_1$ norm of the form above. For a general quantile $q$, the same scheme applies with asymmetric weights: the $i$-th diagonal entry of $W^{(t)}$ becomes $1/(q\,|e_i|)$ when $e_i < 0$ and $1/((1-q)\,|e_i|)$ when $e_i \ge 0$, where $e_i = y_i - x_i\beta^{(t)}$ (this is exactly the weighting used in the statsmodels source in section 3.1).
To avoid a zero denominator, we replace $|e_i|$ with $\max(|e_i|, \delta)$, where $\delta$ is a very small number, e.g., 0.0001.
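Putting 2.4 together, here is a minimal IRLS sketch for a general quantile q, a simplified version of what statsmodels does (the epsilon clipping and stopping rule follow the description above):

import numpy as np

def quantreg_irls(X, y, q=0.5, max_iter=1000, p_tol=1e-6, eps=1e-4):
    """Estimate beta for the q-quantile by iteratively reweighted least squares."""
    beta = np.linalg.pinv(X) @ y            # start from the OLS solution
    for _ in range(max_iter):
        beta0 = beta
        e = y - X @ beta
        # denominators of the IRLS weights: q|e| for e < 0, (1-q)|e| for e >= 0,
        # clipped away from zero to avoid division by zero
        w = np.maximum(np.abs(np.where(e < 0, q * e, (1 - q) * e)), eps)
        Xstar = X / w[:, np.newaxis]        # row i of X scaled by 1/w_i
        beta = np.linalg.pinv(Xstar.T @ X) @ (Xstar.T @ y)
        if np.max(np.abs(beta - beta0)) < p_tol:
            break
    return beta

# e.g. quantreg_irls(X, y, q=0.5) with X, y built from the Engel data as above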
2.5 Testing the regression coefficients
Through the iterations of 2.4 we obtain the estimate $\hat\beta_q$. To obtain the t statistic for hypothesis testing, we only need an estimate of the variance of $\hat\beta_q$.
The asymptotic estimate of the covariance matrix of the quantile regression estimator $\hat\beta_q$ is

$\widehat{\operatorname{Cov}}(\hat\beta_q) = (X^T X)^{-1}\, X^T D X\, (X^T X)^{-1},$

where $D$ is a diagonal matrix whose entries are

$d_i = \left(\dfrac{q}{\hat f(0)}\right)^2$ when $e_i > 0$, and $d_i = \left(\dfrac{1-q}{\hat f(0)}\right)^2$ when $e_i \le 0$.

The estimate of $f(0)$, the density of the residuals at zero, is

$\hat f(0) = \dfrac{1}{nh} \sum_{i=1}^n K\!\left(\dfrac{e_i}{h}\right),$

where $K$ is a kernel function (Epanechnikov, Gaussian, etc.) and $h$ is the bandwidth chosen as in Stata 12 [5].
The std err in the regression output is the square root of the corresponding diagonal entry of $\widehat{\operatorname{Cov}}(\hat\beta_q)$, and the t statistic equals $\hat\beta_j / \operatorname{se}(\hat\beta_j)$.
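These columns can be recomputed from the fitted `res` of section 1.2; a quick check using standard statsmodels results attributes:

# std err column: square roots of the diagonal of the covariance matrix
print(res.bse)
# t column: each coefficient estimate divided by its standard error
print(res.params / res.bse)
print(res.tvalues)  # the same ratio, as computed by statsmodels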
2.6 Goodness of fit
For OLS, we use $R^2$ to measure goodness of fit. For $q$-quantile regression, the analogue is

$R^1(q) = 1 - \dfrac{\sum_i \rho_q\!\left(y_i - x_i\hat\beta_q\right)}{\sum_i \rho_q\!\left(y_i - \hat y_q\right)},$

where $\hat y_q$ is the $q$-quantile of all observed values of $y$. $R^1(q)$ is the Pseudo R-squared in the regression output.
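As a check, this formula can be recomputed by hand from the median regression of section 1.2 and compared with the reported value (reusing `data` and `res`):

import numpy as np
from scipy import stats

q = 0.5
y = data.foodexp.values

def total_check_loss(e, q):
    return np.sum(np.abs(np.where(e < 0, (1 - q) * e, q * e)))

e_model = np.asarray(res.resid)                    # residuals of the fitted model
e_null = y - stats.scoreatpercentile(y, q * 100)   # residuals about the sample quantile
print(1 - total_check_loss(e_model, q) / total_check_loss(e_null, q))
print(res.prsquared)  # should agree: 0.6206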
The complete source code of the quantile regression implementation lives in statsmodels and mainly contains two classes: QuantReg and QuantRegResults. QuantReg is the core; it covers the estimation of the regression coefficients and the computation of their covariance matrix. QuantRegResults computes the goodness of fit and organizes the regression output.
3.1 The QuantReg class
# QuantReg is a subclass of RegressionModel in the package.
class QuantReg(RegressionModel):
    # fit() computes the regression coefficients and their covariance matrix.
    # q is the quantile; vcov selects the covariance estimator, defaulting to
    # 'robust' (the method of section 2.5). The kernel defaults to 'epa' and
    # the bandwidth to 'hsheather'. IRLS runs at most 1000 iterations by
    # default and stops once the change in beta falls below 1e-6.
    def fit(self, q=.5, vcov='robust', kernel='epa', bandwidth='hsheather',
            max_iter=1000, p_tol=1e-6, **kwargs):
        """
        Solve by Iterative Weighted Least Squares

        Parameters
        ----------
        q : float
            Quantile must be between 0 and 1
        vcov : str, method used to calculate the variance-covariance matrix
            of the parameters. Default is ``robust``:

            - robust : heteroskedasticity robust standard errors
              (as suggested in Greene 6th edition)
            - iid : iid errors (as in Stata 12)
        kernel : str, kernel to use in the kernel density estimation for the
            asymptotic covariance matrix:

            - epa: Epanechnikov
            - cos: Cosine
            - gau: Gaussian
            - par: Parzene
        bandwidth : str, Bandwidth selection method in kernel density
            estimation for asymptotic covariance estimate (full references
            in QuantReg docstring):

            - hsheather: Hall-Sheather (1988)
            - bofinger: Bofinger (1975)
            - chamberlain: Chamberlain (1994)
        """
        if q < 0 or q > 1:
            raise Exception('p must be between 0 and 1')

        kern_names = ['biw', 'cos', 'epa', 'gau', 'par']
        if kernel not in kern_names:
            raise Exception("kernel must be one of " + ', '.join(kern_names))
        else:
            kernel = kernels[kernel]

        if bandwidth == 'hsheather':
            bandwidth = hall_sheather
        elif bandwidth == 'bofinger':
            bandwidth = bofinger
        elif bandwidth == 'chamberlain':
            bandwidth = chamberlain
        else:
            raise Exception("bandwidth must be in 'hsheather', 'bofinger', 'chamberlain'")

        # endog is the sample's dependent variable, exog its independent variables
        endog = self.endog
        exog = self.exog
        nobs = self.nobs
        exog_rank = np.linalg.matrix_rank(self.exog)
        self.rank = exog_rank
        self.df_model = float(self.rank - self.k_constant)
        self.df_resid = self.nobs - self.rank

        # IRLS initialization
        n_iter = 0
        xstar = exog
        beta = np.ones(exog_rank)
        # TODO: better start, initial beta is used only for convergence check

        # Note the following does not work yet,
        # the iteration loop always starts with OLS as initial beta
        # if start_params is not None:
        #     if len(start_params) != rank:
        #         raise ValueError('start_params has wrong length')
        #     beta = start_params
        # else:
        #     # start with OLS
        #     beta = np.dot(np.linalg.pinv(exog), endog)

        diff = 10
        cycle = False
        history = dict(params=[], mse=[])

        # IRLS iterations
        while n_iter < max_iter and diff > p_tol and not cycle:
            n_iter += 1
            beta0 = beta
            xtx = np.dot(xstar.T, exog)
            xty = np.dot(xstar.T, endog)
            beta = np.dot(pinv(xtx), xty)
            resid = endog - np.dot(exog, beta)

            mask = np.abs(resid) < .000001
            resid[mask] = ((resid[mask] >= 0) * 2 - 1) * .000001
            resid = np.where(resid < 0, q * resid, (1-q) * resid)
            resid = np.abs(resid)
            # 1 / resid[:, np.newaxis] is the updated weight matrix W
            xstar = exog / resid[:, np.newaxis]

            diff = np.max(np.abs(beta - beta0))
            history['params'].append(beta)
            history['mse'].append(np.mean(resid*resid))

            # check for a convergence cycle; stop the iteration early if found
            if (n_iter >= 300) and (n_iter % 100 == 0):
                # check for convergence circle, should not happen
                for ii in range(2, 10):
                    if np.all(beta == history['params'][-ii]):
                        cycle = True
                        warnings.warn("Convergence cycle detected", ConvergenceWarning)
                        break

        if n_iter == max_iter:
            warnings.warn("Maximum number of iterations (" + str(max_iter) +
                          ") reached.", IterationLimitWarning)

        # compute the covariance matrix
        e = endog - np.dot(exog, beta)
        # Greene (2008, p.407) writes that Stata 6 uses this bandwidth:
        # h = 0.9 * np.std(e) / (nobs**0.2)
        # Instead, we calculate bandwidth as in Stata 12
        iqre = stats.scoreatpercentile(e, 75) - stats.scoreatpercentile(e, 25)
        h = bandwidth(nobs, q)
        h = min(np.std(endog),
                iqre / 1.34) * (norm.ppf(q + h) - norm.ppf(q - h))

        fhat0 = 1. / (nobs * h) * np.sum(kernel(e / h))

        if vcov == 'robust':
            d = np.where(e > 0, (q/fhat0)**2, ((1-q)/fhat0)**2)
            xtxi = pinv(np.dot(exog.T, exog))
            xtdx = np.dot(exog.T * d[np.newaxis, :], exog)
            vcov = chain_dot(xtxi, xtdx, xtxi)
        elif vcov == 'iid':
            vcov = (1. / fhat0)**2 * q * (1 - q) * pinv(np.dot(exog.T, exog))
        else:
            raise Exception("vcov must be 'robust' or 'iid'")

        # build a QuantRegResults object from the coefficient estimates and
        # covariance matrix, and return the results
        lfit = QuantRegResults(self, beta, normalized_cov_params=vcov)
        lfit.q = q
        lfit.iterations = n_iter
        lfit.sparsity = 1. / fhat0
        lfit.bandwidth = h
        lfit.history = history

        return RegressionResultsWrapper(lfit)


# kernel function definitions
def _parzen(u):
    z = np.where(np.abs(u) <= .5, 4./3 - 8. * u**2 + 8. * np.abs(u)**3,
                 8. * (1 - np.abs(u))**3 / 3.)
    z[np.abs(u) > 1] = 0
    return z

kernels = {}
kernels['biw'] = lambda u: 15. / 16 * (1 - u**2)**2 * np.where(np.abs(u) <= 1, 1, 0)
kernels['cos'] = lambda u: np.where(np.abs(u) <= .5, 1 + np.cos(2 * np.pi * u), 0)
kernels['epa'] = lambda u: 3. / 4 * (1-u**2) * np.where(np.abs(u) <= 1, 1, 0)
kernels['gau'] = lambda u: norm.pdf(u)
kernels['par'] = _parzen
# kernels['bet'] = lambda u: np.where(np.abs(u) <= 1, .75 * (1 - u) * (1 + u), 0)
# kernels['log'] = lambda u: logistic.pdf(u) * (1 - logistic.pdf(u))
# kernels['tri'] = lambda u: np.where(np.abs(u) <= 1, 1 - np.abs(u), 0)
# kernels['trw'] = lambda u: 35. / 32 * (1 - u**2)**3 * np.where(np.abs(u) <= 1, 1, 0)
# kernels['uni'] = lambda u: 1. / 2 * np.where(np.abs(u) <= 1, 1, 0)


# bandwidth calculations
def hall_sheather(n, q, alpha=.05):
    z = norm.ppf(q)
    num = 1.5 * norm.pdf(z)**2.
    den = 2. * z**2. + 1.
    h = n**(-1. / 3) * norm.ppf(1. - alpha / 2.)**(2./3) * (num / den)**(1./3)
    return h

def bofinger(n, q):
    num = 9. / 2 * norm.pdf(2 * norm.ppf(q))**4
    den = (2 * norm.ppf(q)**2 + 1)**2
    h = n**(-1. / 5) * (num / den)**(1. / 5)
    return h

def chamberlain(n, q, alpha=.05):
    return norm.ppf(1 - alpha / 2) * np.sqrt(q*(1 - q) / n)
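To see the options of `fit` in action, a small usage sketch reusing `mod` from section 1.2:

# iid errors (as in Stata 12) instead of the default robust sandwich estimator
res_iid = mod.fit(q=.5, vcov='iid')

# Gaussian kernel and Bofinger bandwidth for the density estimate at zero
res_gau = mod.fit(q=.5, kernel='gau', bandwidth='bofinger')

print(res_iid.bse)  # the standard errors shift slightly across these choices
print(res_gau.bse)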
3.2 The QuantRegResults class
Here I only show the code that computes the goodness of fit.
class QuantRegResults(RegressionResults):
    '''Results instance for the QuantReg model'''

    @cache_readonly
    def prsquared(self):
        q = self.q
        endog = self.model.endog
        # e is the vector of residuals
        e = self.resid
        e = np.where(e < 0, (1 - q) * e, q * e)
        e = np.abs(e)
        ered = endog - stats.scoreatpercentile(endog, q * 100)
        ered = np.where(ered < 0, (1 - q) * ered, q * ered)
        ered = np.abs(ered)
        return 1 - np.sum(e) / np.sum(ered)
Above, I first gave an applied example of quantile regression, then described its theory, and finally analyzed the source code of its Python implementation.
Compared with OLS regression, quantile regression is more complex, but it has three main advantages: it makes no distributional assumption about the residuals; it is more robust to outliers in the dependent variable; and it describes conditional quantiles rather than just the conditional mean, giving a much fuller picture of the conditional distribution.