(圖片來自百度)css
數據git
分析數據第一步仍是套路------畫圖github
數據看上去比較平整,可是因爲數據太對看不出具體狀況,因而將只取前300個數據再此畫圖app
這數據看上去很不錯,感受有隱藏週期的意思函數
代碼測試
#coding:utf-8 import csv import matplotlib.pyplot as plt def read_csv_data(aim_list_1, aim_list_2, file_name): i = 0 csv_file = csv.reader(open(file_name,'r')) for data in csv_file: if (i == 0): i += 1 continue aim_list_1.append(float(data[1])) aim_list_2.append(data[3]) return def plot_picture(x, y): plt.xlabel('x') plt.ylabel('y') plt.plot(x, y) plt.show() return if __name__ == '__main__': temp = [] tim = [] file_name = 'C:/Users/lichaoxing/Desktop/testdata.csv' read_csv_data(temp, tim, file_name) plot_picture(tim[:300], temp[:300])
使用ARIMA模型(ARMA)spa
第一步觀察數據是不是平穩序列,經過上圖能夠看出是平穩的3d
若是不平穩,則須要進行預處理,方法有 對數變換 差分code
對於平穩的時間序列能夠直接使用ARMA(p, q)模型進行擬合blog
ARMA (p, q) : AR(p) + MA(q)
此時參數p和q的肯定能夠經過觀察ACF和PACF圖來肯定
經過觀察PACF圖能夠看出,階數爲9也就是p=9,這裏ACF圖看出自相關呈現震盪降低收斂,可是怎麼決定出q,我沒太明白,這裏姑且拍腦殼才一個吧就q=3
可是這裏我遇到了一個問題,沒有搞懂,就是平穩的序列,若是我進行一階差分後應該仍然是平穩的序列,可是這個時候我又畫了一個ACF與PACF圖,居然是下圖這樣,lag的範圍是-0.04到0.04(不懂)
lag的範圍是-0.04到0.04的問題緣由(修改於再次使用此模型)
緣由:
當時,我使用的是一階差分,也就是讓數據的後一個值減去前一個值獲得新的值,這樣就會致使第一個值變爲缺失值(下面的數據是再此使用此模型時的數據,與原博客數據無關)
就是由於此處的值爲缺失值,致使繪製ACF與PACF時數據有問題而沒法成功顯示
解決辦法,在繪製上述圖形前,將第一個數據去除:
dta= dta.diff(1)
dta = dta.truncate(before= ym[1])#刪除第一個缺失值
其實還有就是使用ADF檢驗,獲得的結果如圖,這個p值很小===》平穩
畫圖代碼
def acf_pacf(temp, tim): x = tim y = temp dta = pd.Series(y, index = pd.to_datetime(x)) fig = plt.figure(figsize=(9,6)) ax1 = fig.add_subplot(211) fig = sm.graphics.tsa.plot_acf(dta,lags=50,ax=ax1) ax2 = fig.add_subplot(212) fig = sm.graphics.tsa.plot_pacf(dta,lags=50,ax=ax2) show()
ADF檢驗代碼
def test_stationarity(timeseries): dftest = adfuller(timeseries, autolag='AIC') return dftest[1]
這裏先使用ARMA(9,3)來實驗測試一下效果,取前300個數據中的前250個做爲train,後面的做爲test
效果
能夠說這個模型是真的強大,預測的仍是十分準確的
代碼
def test_300(temp, tim): x = tim[0:300] y = temp[0:300] dta = pd.Series(y[0:249], index = pd.to_datetime(x[0:249])) fig = plt.figure(figsize=(9,6)) ax1 = fig.add_subplot(211) fig = sm.graphics.tsa.plot_acf(dta,lags=30,ax=ax1) ax2 = fig.add_subplot(212) fig = sm.graphics.tsa.plot_pacf(dta,lags=30,ax=ax2) arma_mod = sm.tsa.ARMA(dta, (9, 3)).fit(disp = 0) predict_sunspots = arma_mod.predict(x[200], x[299], dynamic=True) fig, ax = plt.subplots(figsize=(9, 6)) ax = dta.ix[x[0]:].plot(ax=ax) predict_sunspots.plot(ax=ax) show()
其實,能夠經過代碼來自動的選擇p和q的值,依據BIC準則,目標就是bic越小越好
代碼
def proper_model(timeseries, maxLag): init_bic = 100000000 init_properModel = None for p in np.arange(maxLag): for q in np.arange(maxLag): model = ARMA(timeseries, order=(p, q)) try: results_ARMA = model.fit(disp = 0, method='css') except: continue bic = results_ARMA.bic if bic < init_bic: init_properModel = results_ARMA init_bic = bic return init_properModel
遇到的問題,預測時predict函數沒怎麼使用明白
當寫於某些預測區間的時候,會報 「start」或「end」的相關錯誤,還有一個函數forcast,這個函數使用就是forcast(N):預測後面N個值
返回的是預測值(array型)標準偏差(array型)置信區間(array型)
還有:
對於構造時間序列,時間能夠是時間格式:如 「2018-01-01」 或者就是個時間戳,在用時間戳的時候,其實在序列裏它會自動識別時間戳,並加上起始時間1970-01-01 00:00:01
形式
附錄(代碼)
預測一序列中某一點的值
#coding:utf-8 import csv import time import pandas as pd import numpy as np from statsmodels.tsa.arima_model import ARMA import argparse import warnings warnings.filterwarnings('ignore') def timestamp_datatime(value): value = time.localtime(value) dt = time.strftime('%Y-%m-%d %H:%M',value) return dt def time_timestamp(my_date): my_date_array = time.strptime(my_date,'%Y-%m-%d %H:%M') my_date_stamp = time.mktime(my_date_array) return my_date_stamp def read_csv_data(aim_list_1, aim_list_2, file_name): i = 0 csv_file = csv.reader(open(file_name,'r')) for data in csv_file: if (i == 0): i += 1 continue aim_list_1.append(float(data[1])) #1:溫度 2:溼度 dt = int(data[3]) aim_list_2.append(dt) return def proper_model(timeseries, maxLag): init_bic = 100000000 init_properModel = None for p in np.arange(maxLag): for q in np.arange(maxLag): model = ARMA(timeseries, order=(p, q)) #bug try: results_ARMA = model.fit(disp = 0, method='css') except: continue bic = results_ARMA.bic if bic < init_bic: init_properModel = results_ARMA init_bic = bic return init_properModel def test_300(temp, tim, time_in): x = [] y = [] end_index = len(tim) for i in range(0, len(tim)): if (time_in - (tim[i]) < 300): end_index = i break if (end_index < 100): x = tim[0: end_index] y = temp[0: end_index] else: x = tim[end_index - 100: end_index] y = temp[end_index - 100: end_index] tidx = pd.DatetimeIndex(x, freq='infer') dta = pd.Series(y, index = tidx) print(dta) arma_mod = proper_model(dta, 9) predict_sunspots = arma_mod.forecast(1) return predict_sunspots[0] def predict_temperature(file_name, time_in): temp = [] tim = [] read_csv_data(temp, tim, file_name) result_temp = test_300(temp, tim, time_in) return result_temp if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('-f', action='store', dest='file_name') parser.add_argument('-t', action='store', type = int, dest='time_') args = parser.parse_args() file_name = args.file_name time_in = args.time_ result_temp = predict_temperature(file_name, time_in) print ('the temperature is %f ' % result_temp)
在上面的代碼中,預測某一點的值我採用序列中此點的前100個點做爲訓練集
若是給出待預測的多個點,因爲每次都要計算模型的p和q以及擬合模型,時間會很慢,因而考慮將給定的待預測時間點序列切割成小段,使每一段中最大與最小的時間間隔在某一範圍內
在使用forcast(n)函數一次預測多點,而後在預測值中找到與待預測的時間值相近的值,速度大大提高,思路如圖
代碼
#coding:utf-8 import csv #import time import pandas as pd import numpy as np from statsmodels.tsa.arima_model import ARMA import warnings warnings.filterwarnings('ignore') def proper_model(timeseries, maxLag): init_bic = 1000000000 init_p = 1 init_q = 1 for p in np.arange(maxLag): for q in np.arange(maxLag): model = ARMA(timeseries, order=(p, q)) try: results_ARMA = model.fit(disp = 0, method='css') except: continue bic = results_ARMA.bic if bic < init_bic: init_p = p init_q = q init_bic = bic return init_p, init_q def read_csv_data(file_name, clss = 1): i = 0 aim_list_1 = [] #temperature(1) or humidity(2) aim_list_2 = [] #time csv_file = csv.reader(open(file_name,'r')) for data in csv_file: if (i == 0): i += 1 continue aim_list_1.append(float(data[clss])) dt = int(data[3]) aim_list_2.append(dt) tidx = pd.DatetimeIndex(aim_list_2, freq = None) dta = pd.Series(aim_list_1, index = tidx) init_p, init_q = proper_model(dta[:aim_list_2[100]], 9) return init_p, init_q, aim_list_2, dta def for_kernel(p, q, tim, dta, tmp_time_list, result_dict): interval = 20 end_index = len(tim) - 1 for i in range(0, len(tim)): if (tmp_time_list[0]["time"] - tim[i] < tim[1] - tim[0]): end_index = i break if (end_index < 100): dta = dta.truncate(after = tim[end_index]) else: dta = dta.truncate(before= tim[end_index - 101], after = tim[end_index]) arma_mod = ARMA(dta, order=(p, q)).fit(disp = 0, method='css') #爲將來interval天進行預測, 返回預測結果, 標準偏差, 和置信區間 predict_sunspots = arma_mod.forecast(interval) #################################### for tim_i in tmp_time_list: for tim_ in tim: if tim_i["time"] - tim_ >= 0 and tim_i["time"] - tim_ < tim[1] - tim[0]: result_dict[tim_i["time"]] = predict_sunspots[0][tim.index(tim_) - end_index] return def kernel(p, q, tim, dta, time_in_list): interval = 20 time_first = time_in_list[0] det_time = tim[1] - tim[0] result_dict = {} tmp_time_list = [] for time_ in time_in_list: if time_first["time"] + det_time * interval > time_["time"]: tmp_time_list.append(time_) continue time_first = time_ for_kernel(p, q, tim, dta, tmp_time_list, result_dict) tmp_time_list = [] tmp_time_list.append(time_first) for_kernel(p, q, tim, dta, tmp_time_list, result_dict) return result_dict def predict_temperature(file_name, time_in_list, clss = 1): p, q, tim, dta = read_csv_data(file_name, clss) result_temp_dict = kernel(p, q, tim, dta, time_in_list) return result_temp_dict def predict_humidity(file_name, time_in_list, clss = 2): p, q, tim, dta = read_csv_data(file_name, clss) result_humi_dict = kernel(p, q, tim, dta, time_in_list) return result_humi_dict if __name__ == '__main__': file_name = "testdata.csv" time_in = [{"time":1530419271,"temp":"","humi":""},{"time":1530600187,"temp":"","humi":""},{"time":1530825809,"temp":"","humi":""}] #time_in = [{"time":1530600187,"temp":"","humi":""},] result_temp = predict_temperature(file_name, time_in) print(result_temp)
因爲後續又改動了需求,須要預測溫度以及溼度,完成了項目在github