Notes on ARIMA

Date: 2019-12-01

Tags: arima


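Partial correlation: a conditional correlation.

It is the correlation between two variables, under some stated assumption, once the values of a set of other variables have been taken into account.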

e.g.
Response variable: $y$; predictor variables: $x_{1}, x_{2}, x_{3}$.
The partial correlation between $y$ and $x_{3}$ is the correlation between the two after accounting for the effects of $x_{1}$ and $x_{2}$ on each of $y$ and $x_{3}$.

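In regression, the partial correlation is the correlation between the residuals of two different regressions:

(1) the regression that predicts $y$ from $x_{1}$ and $x_{2}$;

(2) the regression that predicts $x_{3}$ from $x_{1}$ and $x_{2}$.
Put simply, we correlate the parts of $y$ and $x_{3}$ that are not predicted by $x_{1}$ and $x_{2}$.
The formula:
\[ \frac{\mathrm{Covariance}(y,x_{3}|x_{1}, x_{2})}{\sqrt{\mathrm{Variance}(y|x_{1},x_{2})\,\mathrm{Variance}(x_{3}|x_{1},x_{2})}} \]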
Reference 1 offers a somewhat different explanation of the regression view; see it for details.
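
The residual-based definition above can be sketched with plain NumPy. This is only a minimal illustration on synthetic data: the helper name `partial_corr` and the data-generating coefficients are assumptions made for the example, not something taken from the references.

```python
import numpy as np

def partial_corr(y, x3, controls):
    """Correlation of y and x3 after regressing both on the control variables."""
    # Design matrix: an intercept column plus the controls (x1, x2).
    X = np.column_stack([np.ones(len(y)), controls])
    # Residuals are the parts of y and x3 NOT predicted by the controls.
    res_y = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    res_x3 = x3 - X @ np.linalg.lstsq(X, x3, rcond=None)[0]
    return np.corrcoef(res_y, res_x3)[0, 1]

# Synthetic data (assumed for illustration): y and x3 are linked only via x1, x2.
rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(size=n)
y = 2 * x1 + x2 + rng.normal(size=n)

raw = np.corrcoef(y, x3)[0, 1]
partial = partial_corr(y, x3, np.column_stack([x1, x2]))
print(f"raw correlation: {raw:.3f}, partial correlation: {partial:.3f}")
```

In this setup the raw correlation should be clearly positive while the partial correlation should sit near zero, because everything $y$ and $x_{3}$ share comes from $x_{1}$ and $x_{2}$.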

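Time series. The partial autocorrelation between $x_{t}$ and $x_{t-h}$ is defined as the correlation between $x_{t}$ and $x_{t-h}$ conditional on the observations that lie between time $t-h$ and time $t$, namely $x_{t-h+1}, \ldots, x_{t-1}$.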

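The 1st-order partial autocorrelation equals the 1st-order autocorrelation.
The 2nd-order (lag 2) partial autocorrelation is:
\[ \frac{\mathrm{Covariance}(x_t,x_{t-2}|x_{t-1})}{\sqrt{\mathrm{Variance}(x_{t}|x_{t-1})\,\mathrm{Variance}(x_{t-2}|x_{t-1})}} \]
In a stationary series the two variances in the denominator are equal. This is the correlation between values two time steps apart, given that the value in between is known.
The 3rd-order (lag 3) partial autocorrelation is:
\[ \frac{\mathrm{Covariance}(x_t,x_{t-3}|x_{t-1}, x_{t-2})}{\sqrt{\mathrm{Variance}(x_{t}|x_{t-1},x_{t-2})\,\mathrm{Variance}(x_{t-3}|x_{t-1}, x_{t-2})}} \]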
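
As a rough check of the lag-2 formula, the sketch below simulates an AR(1) series and estimates the lag-2 partial autocorrelation by correlating the residuals of $x_{t}$ and $x_{t-2}$ after regressing each on $x_{t-1}$. The coefficient 0.7, the sample size and the helper `residual` are assumptions chosen for the example.

```python
import numpy as np

# Simulated AR(1) series; phi = 0.7 is an arbitrary illustrative choice.
rng = np.random.default_rng(1)
n, phi = 5000, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# Aligned views of x_t, x_{t-1} and x_{t-2}.
x_t, x_mid, x_lag2 = x[2:], x[1:-1], x[:-2]

def residual(target, given):
    """Part of `target` not explained by a linear regression on `given`."""
    X = np.column_stack([np.ones(len(given)), given])
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    return target - X @ beta

acf2 = np.corrcoef(x_t, x_lag2)[0, 1]
pacf2 = np.corrcoef(residual(x_t, x_mid), residual(x_lag2, x_mid))[0, 1]
print(f"lag-2 ACF: {acf2:.3f}, lag-2 PACF: {pacf2:.3f}")
```

For an AR(1) series the lag-2 autocorrelation stays clearly nonzero, while the lag-2 partial autocorrelation should be close to zero once $x_{t-1}$ is accounted for.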

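The partial autocorrelation is tied to the order p of an AR model: lags at which the PACF is nonzero are candidate values of p.
Lags at which the ACF is nonzero are candidate values of q, the order of the MA part, whose terms are the fitting errors.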

The PACF describes only the direct correlation between an observation and its lagged observation (the value at time t-h).

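An MA process models the series with the residuals between the series and its previous forecasts, that is, it corrects future forecasts using the errors of the most recent ones.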
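
To see why the ACF points at q, here is a small sketch that simulates an MA(1) series and computes its sample autocorrelations by hand. The function `sample_acf` and the value theta = 0.8 are assumptions for illustration only.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - h], x[h:]) / denom
                     for h in range(max_lag + 1)])

# MA(1): x_t = eps_t + theta * eps_{t-1}; theta = 0.8 is an arbitrary choice.
rng = np.random.default_rng(2)
n, theta = 5000, 0.8
eps = rng.normal(size=n + 1)
x = eps[1:] + theta * eps[:-1]

acf = sample_acf(x, 5)
print(np.round(acf, 3))
```

Only lag 1 should stand out from zero here, which is the pattern that suggests q = 1.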

Stationarity

A stationarity test for a univariate stochastic process:

import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

def test_stationary(timeseries):
    plt.figure(figsize=(20,12))

    # Rolling statistics over a 24-observation window. pd.rolling_mean and
    # pd.rolling_std were removed from pandas; use Series.rolling() instead.
    rolmean = timeseries.rolling(window=24).mean()
    rolstd = timeseries.rolling(window=24).std()

    # A stationary series should show a roughly flat rolling mean and std.
    plt.plot(timeseries, color='blue', label='original')
    plt.plot(rolmean, color='red', label='rolling mean')
    plt.plot(rolstd, color='black', label='rolling std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Augmented Dickey-Fuller test, with the lag length chosen by AIC.
    print("Results of Dickey-Fuller Test:")
    dftest = sm.tsa.adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Values ({})'.format(key)] = value
    print(dfoutput)

Here the Dickey-Fuller test is used to check whether the data is stationary.

Consider the following stochastic process:
\[ y_{i} = \phi y_{i-1} + \varepsilon_{i} \]
where $\mid \phi \mid \le 1$ and $\varepsilon_{i}$ is white noise. If $\mid \phi \mid = 1$, the process is said to have a unit root.
In particular, when $\phi=1$ the process is non-stationary (a random walk without drift).
In fact, the process is non-stationary when $\mid \phi \mid = 1$ and stationary when $\mid \phi \mid < 1$. We usually do not consider $\mid \phi \mid > 1$, because in that case the process is explosive: it keeps growing.

The stochastic process above is a first-order autoregressive process, AR(1).
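
The difference between $\mid \phi \mid < 1$ and $\phi = 1$ can be seen in a quick simulation. The sketch below is only illustrative: the function name, the number of paths and the two values of $\phi$ are assumptions. It compares the variance across many simulated paths at the middle and at the end of the sample.

```python
import numpy as np

def simulate_ar1_paths(phi, n_paths, n_steps, rng):
    """Simulate independent paths of y_i = phi * y_{i-1} + eps_i, with y_0 = 0."""
    eps = rng.normal(size=(n_paths, n_steps))
    y = np.zeros((n_paths, n_steps))
    for i in range(1, n_steps):
        y[:, i] = phi * y[:, i - 1] + eps[:, i]
    return y

rng = np.random.default_rng(3)
variances = {}
for phi in (0.5, 1.0):
    paths = simulate_ar1_paths(phi, n_paths=1000, n_steps=400, rng=rng)
    # Variance across paths at the midpoint and at the end of the sample.
    variances[phi] = (paths[:, 200].var(), paths[:, -1].var())
    print(f"phi={phi}: var(mid)={variances[phi][0]:.2f}, "
          f"var(end)={variances[phi][1]:.2f}")
```

With $\phi = 0.5$ the variance should settle at a constant level, while with $\phi = 1$ it keeps growing with time, which is what makes the random walk non-stationary.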

Dickey-Fuller test

It tests whether the process above has a unit root. The procedure is as follows:
First we compute the first difference:
\[ y_{i} - y_{i-1} = \phi y_{i-1} + \varepsilon_{i} - y_{i-1}\]
\[ y_{i} - y_{i-1} = (\phi-1)y_{i-1} + \varepsilon_{i} \]
Define $\Delta y_{i} = y_{i} - y_{i-1}$ and $\beta=\phi-1$; the equation above becomes:
\[\Delta y_{i} =\beta y_{i-1} + \varepsilon_{i}\]
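Since $\beta \le 0$, the test on $\phi$ becomes a test on the slope parameter, $\beta = 0$. We therefore have a one-tailed test (because $\beta$ cannot be positive):
\[ H_{0}: \beta = 0 \ (\text{equivalent to } \phi = 1) \]
\[ H_{1}: \beta < 0 \ (\text{equivalent to } \phi < 1) \]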
Under the alternative hypothesis, if $b$ is the ordinary least squares (OLS) estimate of $\beta$, then $\bar{\phi} = 1+b$ is the OLS estimate of $\phi$, and for sufficiently large $n$:
\[ \sqrt{n}(\bar{\phi} - \phi) \sim N(0, \text{s.e.}), \quad \text{s.e.}=\sqrt{1-\phi^2} \]

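We can fit this with ordinary linear regression, but under the null hypothesis the t statistic of the coefficient does not follow a normal distribution, so the usual Student's t test cannot be used. The statistic instead follows a tau distribution,
so the test checks whether the tau statistic $\tau$ is less than $\tau_{crit}$, a value that can be looked up in the Dickey-Fuller table.

If the computed tau is less than the critical value, the result is significant and we reject the null hypothesis; otherwise we cannot reject the null hypothesis, that is, we treat the time series as non-stationary.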
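
The steps above can be sketched by hand for the simplest case (no constant, no trend). This is a minimal illustration rather than a replacement for a library routine: the function `df_tau` is hypothetical, and the critical value of about -1.95 is an approximate large-sample 5% figure for this version of the test that should be checked against a Dickey-Fuller table.

```python
import numpy as np

def df_tau(y):
    """t-ratio of beta in  dy_i = beta * y_{i-1} + eps_i  (no constant, no trend)."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                     # first difference, y_i - y_{i-1}
    y_lag = y[:-1]                      # lagged level, y_{i-1}
    b = np.dot(y_lag, dy) / np.dot(y_lag, y_lag)    # OLS estimate of beta
    resid = dy - b * y_lag
    s2 = np.dot(resid, resid) / (len(dy) - 1)       # residual variance
    se_b = np.sqrt(s2 / np.dot(y_lag, y_lag))       # standard error of b
    return b / se_b

rng = np.random.default_rng(4)
eps = rng.normal(size=500)
random_walk = np.cumsum(eps)            # phi = 1, has a unit root
stationary = np.zeros(500)              # phi = 0.5, stationary
for i in range(1, 500):
    stationary[i] = 0.5 * stationary[i - 1] + eps[i]

TAU_CRIT_5PCT = -1.95  # approximate; assumed here, verify in a Dickey-Fuller table
tau_rw = df_tau(random_walk)
tau_st = df_tau(stationary)
print(f"random walk tau = {tau_rw:.2f}, stationary tau = {tau_st:.2f}")
print("stationary series rejects H0:", tau_st < TAU_CRIT_5PCT)
```

The statistic is compared with the tau critical value, not with a Student's t quantile, for the reason given above.
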
There are the following three versions of the Dickey-Fuller test:

Type 0	No constant, no trend	Δy_i = β₁y_{i-1} + ε_i
Type 1	Constant, no trend	Δy_i = β₀ + β₁y_{i-1} + ε_i
Type 2	Constant and trend	Δy_i = β₀ + β₁y_{i-1} + β₂i + ε_i
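
The three versions differ only in which regressors sit next to $y_{i-1}$. The sketch below builds the design matrix for each type and returns the t-ratio of $\beta_{1}$; the function `df_tau_version` and the synthetic series (stationary around a mean of 10) are assumptions made for illustration.

```python
import numpy as np

def df_tau_version(y, version):
    """t-ratio of beta_1 for Dickey-Fuller Type 0, 1 or 2 (see the table above)."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    n = len(dy)
    cols = [y[:-1]]                          # y_{i-1}, coefficient beta_1
    if version >= 1:
        cols.append(np.ones(n))              # constant, coefficient beta_0
    if version == 2:
        cols.append(np.arange(1, n + 1))     # linear trend, coefficient beta_2
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, dy, rcond=None)[0]
    resid = dy - X @ coef
    s2 = resid @ resid / (n - X.shape[1])    # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)        # OLS covariance of the coefficients
    return coef[0] / np.sqrt(cov[0, 0])

# Synthetic series: AR(1) fluctuations (phi = 0.5) around a nonzero mean.
rng = np.random.default_rng(5)
u = np.zeros(500)
for i in range(1, 500):
    u[i] = 0.5 * u[i - 1] + rng.normal()
y = 10 + u

taus = {v: df_tau_version(y, v) for v in (0, 1, 2)}
for v, tau in taus.items():
    print(f"Type {v}: tau = {tau:.2f}")
```

Because this series has a nonzero mean, the Type 0 regression has no term to absorb it and its statistic should land much closer to zero than the Type 1 statistic, so the choice of version matters. Each version also has its own table of critical values.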

When the time series is not stationary, it needs to be decomposed.
Decomposition is usually either multiplicative or additive.

# Recent statsmodels releases name the seasonal length `period`; older ones used `freq`.
sm.tsa.seasonal_decompose(time_series, model='multiplicative', period=period)

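The decomposition above is a very simple split of the series into trend, seasonal and residual components.
For an additive model the components are linear: the residual keeps the same magnitude over time, a linear seasonality has constant frequency and amplitude, and a linear trend is a straight line.
A multiplicative model is nonlinear, for example quadratic or exponential, with changes that grow or shrink over time. A nonlinear trend is a curve. A nonlinear seasonality has a frequency or amplitude that increases or decreases over time.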
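
One way to relate the two models: taking the logarithm of a multiplicative series turns it into an additive one. The sketch below builds a synthetic multiplicative series (all component shapes and parameters are assumptions for illustration) and checks that identity, along with the growing seasonal swing described above.

```python
import numpy as np

# Synthetic monthly series, 10 years: y = trend * seasonal * noise.
t = np.arange(120)
trend = 100 * np.exp(0.01 * t)                      # exponential (nonlinear) trend
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)     # 12-step seasonal factor
rng = np.random.default_rng(6)
noise = np.exp(rng.normal(scale=0.02, size=t.size)) # positive multiplicative noise
y = trend * seasonal * noise

# After a log transform the components add instead of multiply.
log_y = np.log(y)
additive_sum = np.log(trend) + np.log(seasonal) + np.log(noise)
print("log(y) is additive:", np.allclose(log_y, additive_sum))

# In the multiplicative series the seasonal swing grows with the level.
swing_first = y[:12].max() - y[:12].min()
swing_last = y[-12:].max() - y[-12:].min()
print(f"swing in first year: {swing_first:.1f}, in last year: {swing_last:.1f}")
```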

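Seasonal ARIMA

Everything computed here appears to use the additive model, possibly because ARIMA by its nature requires a stationary signal; to be expanded.

prophet: Facebook's open-source package for time series forecasting

A supplementary note on some data issues: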

In statistics, engineering, economics, and medical research, censoring is a condition in which the value of a measurement or observation is only partially known. 

MCAR: missing completely at random
MAR: missing at random
NMAR: not missing at random
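
The three mechanisms can be told apart by what the probability of being missing depends on. The sketch below is a toy illustration on synthetic data; the variables, the 30% rate, the logistic form and the threshold are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10000
x = rng.normal(size=n)             # always observed
y = x + rng.normal(size=n)         # the variable that may go missing

# MCAR: missingness is independent of both x and y.
mcar_missing = rng.random(n) < 0.3
# MAR: missingness depends only on the observed variable x.
mar_missing = rng.random(n) < 1 / (1 + np.exp(-2 * x))
# NMAR: missingness depends on the unobserved value of y itself.
nmar_missing = y > 1.0

print(f"full mean of y: {y.mean():.3f}")
for name, missing in [("MCAR", mcar_missing), ("MAR", mar_missing), ("NMAR", nmar_missing)]:
    print(f"{name}: mean of observed y = {y[~missing].mean():.3f}")
```

Under MCAR the mean of the observed values should stay close to the full mean. Under MAR and NMAR the naive mean is biased, although in the MAR case the bias can in principle be corrected using $x$.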

References:
https://newonlinecourses.science.psu.edu/stat510/node/62/
https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
http://www.real-statistics.com/time-series-analysis/stochastic-processes/dickey-fuller-test/
http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
https://facebook.github.io/prophet/docs/quick_start.html#python-api
https://otexts.org/fpp2/seasonal-arima.html
https://github.com/seanabu/seanabu.github.io/blob/master/Seasonal_ARIMA_model_Portland_transit.ipynb
https://zhuanlan.zhihu.com/p/50741970