Energy Consumption (能源消耗）數據集預測（2）: 傳統季節性信號分解PK prophet

時間 2020-05-31

標籤 energy consumption 能源消耗數據預測傳統季節性信號分解 prophet 简体版

原文原文鏈接

前言

上一篇咱們講解了如何在2小時快速完成時間序列的預測。採用的算法是facebook的prophet。文末，咱們留了一個問題：如何知道咱們快速建立的prophet的模型相對是好仍是壞？html

我一直信奉的原則是：機器學習，尤爲是優化問題，必定要先找到一個baseline。以後咱們就能夠找到咱們改善的力度有多大？或者是我咱們的差距有多大。這樣你就能說服客戶爲你的解決方案付款。redis

因爲本文中涉及一些信號處理相關的知識，因此會混用信號和時間序列兩個詞。算法

prophet預測新數據集

上文中咱們採用的數據集是AEP_hourly.csv。很遺憾，我在有限的時間內沒有找到網上有人作過該數據集預測。不少人都預測的是PJME_hourly.csv。bash

爲了對比標杆，咱們也用這個數據集快速作預測。代碼仍是和上一篇代碼一致，只是更改數據集的名稱以及columns的名稱，屬於很基本的操做。這裏就只貼出代碼和結果，不作詳細的解釋。機器學習

from statsmodels.tsa.seasonal import seasonal_decompose
import numpy as np
from scipy.fft import fft
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
from sklearn.metrics import mean_absolute_error


# we only analyze the dataset pjm_hourly_est.csv
file = 'data/PJME_hourly.csv'
raw_df = pd.read_csv(file)
ax = raw_df.plot(kind='line')
plt.show()


# predict new dataset with prophet
dt_format = '%Y-%m-%d %H:%M:%S'
raw_df['Datetime']= pd.to_datetime(raw_df['Datetime'],format=dt_format)
raw_df.rename({'PJME_MW':'y','Datetime':'ds'},axis=1,inplace=True)
split_dt = pd.Timestamp('2015-01-01 00:00:00')
train_df = raw_df[raw_df['ds']< split_dt]
test_df = raw_df[raw_df['ds']>= split_dt]
model = Prophet()
model= model.fit(train_df)
y_test  = model.predict(test_df)
fig  = model.plot(y_test)
print(mean_absolute_error(test_df['y'].values,y_test['yhat']))
model.plot_components(y_test)
複製代碼

MAE 偏差爲5183。看到kaggle上置頂的prophet預測，ta的結果是5181.782050398612。ide

也就是說，咱們兩個小時速成的預測模型，和「大神」預測的差很少。或許prophet的模型基本成績就是這個水平。那麼這個水平在模型界屬於什麼地位呢？函數

Convergence detected: relative gradient magnitude is below tolerance
5183.653998193845
複製代碼

傳統的季節性信號分解

咱們能夠試試傳統的季節性信號分解的模型預測該時間集。prophet的核心思想也是信號分解，將一個時間序列中抽取不一樣週期的季節性信號。工具

時間序列的聖經中提到，時間序列通常會被分解爲trend + seasonal + remiander （趨勢+季節性+殘餘信號）。post

statsmodels模塊中的seasonal_decompose 已經有成熟的分解方法：移動平均值法。學習

下面咱們對該函數進行說明：

statsmodels.tsa.seasonal.seasonal_decompose(x, model='additive', filt=None, period=None, two_sided=True, extrapolate_trend=0)[source]¶
Seasonal decomposition using moving averages.

複製代碼

主要參數：

X：就是要分析的時間序列
model: 模型有兩種，加法（additive）或（multiplicative）。所謂的加法或乘法，都是指的季節性信號的幅值變化，若是是幅值變化不大，那就是加法。若是幅值是變化的，那就是乘法。
period：信號要分解的週期

輸出的結果爲：

observe（原始信號）
seasonal（季節信號）
trend（趨勢信號）
resid（殘餘信號）

基本瞭解了函數的用法，咱們就能夠進行實戰了。

傳統信號分解建模

首先咱們準備好數據，該數據和prophet採用的數據集一致。

咱們先分解出半天的信號。由於白天和晚上的用電量很明顯不一樣，這是常識。因此有了這個先驗知識，咱們嘗試去分解，period 設爲12。

half_day_result = seasonal_decompose(raw_df['PJME_MW'], model='additive', period=12)
half_day_result.plot()
複製代碼

從分解結果中能夠看到，seasonal 信號是幅值恆定的一個週期信號。咱們能夠plot更少的點，讓其顯示更清楚。由於咱們假定的週期是12小時，因此咱們plot 24小時的信號。能夠看到這是幾乎很完美的正（餘）弦信號（由於相位不清楚）。

half_day_result.seasonal.iloc[0:24].plot()
複製代碼

當咱們對比trend信號和原始信號時，發現trend相比來講，趨勢仍不明顯，可是相對原始信號來講有點進步。

回到咱們信號分解的基本思路：若是咱們能預測分解後的每一個信號，那麼合併起來的信號，咱們就能預測。 seasonal信號，是正（餘）弦信號，只須要知道幅值，頻率，相位和偏移量便可。

傅里葉變換求解

關於傅里葉變換這個深奧的信號分析工具，網上有太多的講解。這裏咱們只須要知道的是：傅里葉變換能夠將時域信號變成頻域信號。白話文就是：傅里葉變換能夠幫咱們快速求解幅值，頻率。我寫了一個函數，思路是：

首先對信號進行fft變換
以後找到幅值最大的點對應的頻率和幅值
順便返回時間信號的均值，做爲偏移量

def find_period(x, plot=True):
    yf = fft(x)
    xf = np.linspace(0, len(x) // 2, len(x) // 2)
    yf = 2.0 / len(x) * np.abs(yf[0:len(x) // 2])
    all_df = pd.DataFrame([])
    all_df['time'] = xf
    all_df['freq'] = yf
    period = len(x) / all_df.loc[all_df['freq'].idxmax(), 'time']
    all_df.sort_values(by='freq',inplace=True,ascending=False)
    all_df.reset_index(inplace=True)
    all_df['period']= len(x)/all_df['time']
    all_df.replace([np.inf, -np.inf], 0,inplace=True)
    period = len(x) / all_df.loc[all_df['freq'].idxmax(), 'time']
    if plot:
        plt.plot(all_df['period'], yf)
    return period,all_df['freq'].max(),np.mean(x)
複製代碼

咱們來調用函數對seaonal信號進行變換。

half_day_season_period = find_period(half_day_result_seasonal)
print('*' * 30)
print(f'half_day_season_period: {half_day_season_period}')
複製代碼

週期爲11.99 小時,也就是12小時，幅值爲554，偏移量爲0。咱們能夠用數學表達式表示這個信號了，可是還缺了相位信息。

******************************
half_day_season_period: (11.999669803533104, 554.1702081757178, 5.666447331079827e-05)
複製代碼

seaonal信號咱們(幾乎）能夠表示了，其餘的信號怎麼辦？看不懂？？繼續分解！！直到看懂（信號簡單明瞭，或者單調，或者週期）。

繼續分解trend 以及resid信號。

後面咱們會繼續分解天天的信號，每週的信號，每個月，每半年，每一年，而後對seasonal 信號進行fft變換，求出幅值和頻率。代碼幾乎雷同，因此不一一講解。

# split day compose
day_result = seasonal_decompose(half_day_result.trend.dropna(), model='additive', period=24)
day_result.plot()

# find period of day cyclic
day_result_seasonal = day_result.seasonal.values
day_season_period = find_period(day_result_seasonal)
print('*' * 30)
print(f'day_season_period: {day_season_period}')
# find the period of redis
day_redis_period = find_period(day_result.resid.dropna().values)
print(f'day_redis_period: {day_redis_period}')

# find period of week cyclic
week_result = seasonal_decompose(
    day_result.trend.dropna(),
    model='additive',
    period=7 * 24)
week_result.plot()
week_season_period = find_period(week_result.seasonal.values)
print('*' * 30)
print('week split result')
print(f'week_season_period: {week_season_period}')
week_redis_period = find_period(week_result.resid.dropna().values)
print(f'week_redis_period: {week_redis_period}')


month_result = seasonal_decompose(
    day_result.trend.dropna(),
    model='additive',
    period=30 * 24)
month_result.plot()
month_season_period = find_period(month_result.seasonal.values)
print('*' * 30)
print('month split result')
print(f'month_season_period: {month_season_period}')

month_redis_period = find_period(month_result.resid.dropna().values)
print(f'month_redis_period: {month_redis_period}')



half_year_result = seasonal_decompose(
    month_result.trend.dropna(),
    model='additive',
    period=365 * 12)
half_year_result.plot()
half_year_season_period = find_period(half_year_result.seasonal.values)
print('*' * 30)
print('half year split result')
print(f'half_year_season_period: {half_year_season_period}')

half_year_redis_period = find_period(half_year_result.resid.dropna().values)
print(f'half_year_redis_period: {half_year_redis_period}')

year_result = seasonal_decompose(
    half_year_result.trend.dropna(),
    model='additive',
    period=365 * 24)
year_result.plot()
year_season_period = find_period(year_result.seasonal.values)
print('*' * 30)
print('yearly split result')
print(f'year_season_period: {year_season_period}')
複製代碼

咱們能夠看一下信號分解的結果。

每日份量分解後的信號圖：

每週份量分解的信號圖：

每個月份量分解的信號圖：

半年份量分解的信號圖：

能夠看到，分解到最後，trend幾乎是單調遞減的信號。

fft結果以下：

******************************
day_season_period: (24.00132100396301, 655.682736730418, 0.022879188988473007)
day_redis_period: (23.993396070662044, 2129.021843848859, 0.006176697324649406)
******************************
week split result
week_season_period: (168.0092485549133, 169.52905012773715, 0.010345771248929707)
week_redis_period: (169.1841491841492, 1035.9007684637975, 0.34519449979470296)
******************************
month split result
month_season_period: (719.4455445544555, 148.40567700918697, -0.09128365805888033)
month_redis_period: (169.13216374269007, 1144.5380532729368, -1.1388580520215974)
******************************
half year split result
half_year_season_period: (4382.060606060606, 4156.997946762282, 0.6291791975756035)
half_year_redis_period: (2921.416666666667, 848.789097533505, 22.611620632847078)
******************************
yearly split result
year_season_period: (8764.25, 837.7800370550049, 0.3809718680342369)
複製代碼

構建數學模型

提到數學模型，我會頭大。不過這裏的數學模型很簡單：原始時間信號=咱們分解出來的全部seasonal信號+ trend份量。 trend份量咱們就用簡單的線性擬合，seasonal信號咱們假定都是正弦信號，可是相位信息也很重要，這裏咱們採用優化求解的算法，獲得每一個正弦信號的相位。具體見代碼

from sklearn.linear_model import LinearRegression
print(np.linspace(0,year_result.trend.shape[0], year_result.trend.shape[0]).shape)
y = year_result.trend.dropna()
lnr = LinearRegression().fit(np.linspace(0,y.shape[0], y.shape[0]).reshape(-1,1),y)
print(f'{lnr.coef_},{lnr.intercept_}')


from scipy.optimize import basinhopping
from sklearn.metrics import mean_squared_error


def energy_consumption_math_model(coefs, X, y):  # kwargs without ** to make it as positional para
    y_hat = -0.01029388 * X + 32815.50 \
            + 2*1280.4696176751042 * np.sin(X / 24 * np.pi * 2 + coefs[0]) \
            + 2 * 2129.021843848859 * np.sin(X / 24 * np.pi * 2 + coefs[1]) \
            + 2 * 1035.9007684637975 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[2]) \
            + 2 * 1144.5380532729368 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[3]) \
            + 2*4156.997946762282 * np.sin(X / 24 / 183 * np.pi * 2 + coefs[4]) + 22\
            +2*848.789097533505 * np.sin(X / 2921 * np.pi * 2 + coefs[5]) + \
            +2*837.7800370550049 * np.sin(X / 24 / 365 * np.pi * 2 + coefs[6])     + coefs[7]

    error = mean_squared_error(y_hat,y)
    return  error
coef = np.zeros(8)
minimizer_kwargs = {"method": "BFGS"}
minimizer_kwargs['args'] = (raw_df.index.to_numpy(), raw_df['PJME_MW'])
opt_result = basinhopping(energy_consumption_math_model, coef, minimizer_kwargs=minimizer_kwargs)
print(opt_result)
複製代碼

優化的結果以下：

coefs = np.array([  0.92350449,  -2.21809036,   4.51689507,   7.65849104,
         2.54063322,  -0.67270678,  -0.92850058, -48.15463512])
複製代碼

咱們將求解出來的coef 從新帶入y_hat 公式：

y_hat = -0.01029388 * X + 32815.50 \
        + 2 * 1280.4696176751042 * np.sin(X / 24 * np.pi * 2 + coefs[0]) \
        + 2 * 2129.021843848859 * np.sin(X / 24 * np.pi * 2 + coefs[1]) \
        + 2 * 1035.9007684637975 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[2]) \
        + 2 * 1144.5380532729368 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[3]) \
        + 2 * 4156.997946762282 * np.sin(X / 24 / 183 * np.pi * 2 + coefs[4]) + 22 \
        + 2 * 848.789097533505 * np.sin(X / 2921 * np.pi * 2 + coefs[5]) + \
        +2 * 837.7800370550049 * np.sin(X / 24 / 365 * np.pi * 2 + coefs[6]) + coefs[7]
複製代碼

以上就是咱們創建的數學模型（包含1個線性趨勢+7個正弦信號+常量），只要輸入x的值，就會計算（預測）出對應的y值（用電量）。