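Machine Learning Basics: Random Forest Temperature Prediction with More Sample Data. 1. sns.pairplot (pairwise scatter plots) 2. MAE (mean absolute error) 3. MAPE (accuracy metric)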

Date: 2019-12-20

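In the previous post, we built a baseline random forest model for temperature prediction and examined feature importance.

In this post, we look at how the data affects the predictions from two angles:

First: keep the features unchanged and only add more samples.

Second: add more features as well as more samples.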

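1. sns.pairplot draws a scatter plot for every pair of variables and is used to inspect how linearly correlated they are; sns.xkcd_palette([colors]) builds the color palette. The result is similar to pandas' scatter_matrix.

2. MAE: round(abs(pred - test_y).mean(), 2), the mean absolute difference between the predicted and the true values.

3. MAPE: round((100 - abs(pred - test_y) / test_y * 100).mean(), 2), the mean of (1 - the ratio of the deviation to the true value), expressed as a percentage. Strictly speaking this is 100 minus the mean absolute percentage error, so it is read as an accuracy: higher is better.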
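To make the two metrics concrete, here is a minimal sketch on three made-up temperature values. The helper names `mae`, `mape` and `accuracy` are our own and not part of any library, and the sketch assumes the true values are never zero (otherwise the percentage error is undefined).

```python
import numpy as np


def mae(pred, actual):
    """Mean absolute error: the average of |pred - actual|."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(pred - actual)))


def mape(pred, actual):
    """Mean absolute percentage error, in percent (assumes no zeros in actual)."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(pred - actual) / np.abs(actual)) * 100)


def accuracy(pred, actual):
    """The accuracy-style metric used in this post: 100 - MAPE."""
    return 100.0 - mape(pred, actual)


# Made-up values, for illustration only
actual = np.array([50.0, 60.0, 80.0])
pred = np.array([45.0, 63.0, 80.0])

print(round(mae(pred, actual), 2))       # absolute errors 5, 3, 0 -> 2.67
print(round(mape(pred, actual), 2))      # relative errors 10%, 5%, 0% -> 5.0
print(round(accuracy(pred, actual), 2))  # 100 - 5.0 -> 95.0
```

For positive temperatures, dividing by `np.abs(actual)` gives the same result as dividing by `test_y` directly, as the formula above does.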

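Code:

Step 1: load the data.

Step 2: use datetime.datetime.strptime() to combine the year, month and day columns into date labels.

Step 3: plot the temperature features in the data.

Step 4: plot the newly added features.

Step 5: use sns.pairplot to plot the pairwise relationships between variables, with a color palette generated by sns.xkcd_palette().

Step 6: build a random forest model and study how the additional samples affect prediction accuracy, without the newly added features.

Step 7: build a random forest model and study how the additional samples affect prediction accuracy, with the newly added features included.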
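As a side note on step 2, the date labels can also be built without a Python loop. The sketch below uses a small made-up DataFrame (the values are illustrative and do not come from the dataset) and compares the strptime approach with `pd.to_datetime`, which can assemble dates from columns named `year`, `month` and `day`.

```python
import datetime

import pandas as pd

# Made-up rows standing in for the year/month/day columns of the dataset
df = pd.DataFrame({'year': [2016, 2016, 2017],
                   'month': [1, 12, 7],
                   'day': [5, 31, 14]})

# Loop-based approach: build a 'Y-m-d' string, then parse it
dates_loop = [
    datetime.datetime.strptime(f'{y}-{m}-{d}', '%Y-%m-%d')
    for y, m, d in zip(df['year'], df['month'], df['day'])
]

# Vectorized alternative: pandas assembles the dates from the three columns
dates_vec = pd.to_datetime(df[['year', 'month', 'day']])

# Both approaches should describe the same dates
print(list(dates_vec) == [pd.Timestamp(d) for d in dates_loop])
```

Either form works as the x-axis for the matplotlib plots; the vectorized one is simply shorter on large frames.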

import datetime
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


# Step 1: load the data
features = pd.read_csv('data/temps_extended.csv')
print(features.describe())
print(features.columns)

# Step 2: use datetime.datetime.strptime to convert strings into dates
years = features['year']
months = features['month']
days = features['day']
# First build 'year-month-day' strings
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
# Then parse the strings into datetime objects
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

# Step 3: plot the temperature features

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(ncols=2, nrows=2, figsize=(12, 12))
fig.autofmt_xdate(rotation=60)

ax1.plot(dates, features['temp_2'], linewidth=4)
ax1.set_xlabel(''); ax1.set_ylabel('temperature'); ax1.set_title('pre two max')

ax2.plot(dates, features['temp_1'], linewidth=4)
ax2.set_xlabel(''); ax2.set_ylabel('temperature'); ax2.set_title('pre max')

ax3.plot(dates, features['actual'], linewidth=4)
ax3.set_xlabel(''); ax3.set_ylabel('temperature'); ax3.set_title('today max')

ax4.plot(dates, features['friend'], linewidth=4)
ax4.set_xlabel(''); ax4.set_ylabel('temperature'); ax4.set_title('friend max')

plt.show()

# Step 4: plot the newly added features together with the average temperature

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(ncols=2, nrows=2, figsize=(12, 12))
fig.autofmt_xdate(rotation=60)

ax1.plot(dates, features['average'])
ax1.set_xlabel(''); ax1.set_ylabel('temperature'); ax1.set_title('average')

# The remaining panels are not temperatures, so label each axis by its own quantity
ax2.plot(dates, features['ws_1'], 'r-')
ax2.set_xlabel(''); ax2.set_ylabel('wind speed'); ax2.set_title('WS')

ax3.plot(dates, features['prcp_1'], 'r-')
ax3.set_xlabel(''); ax3.set_ylabel('precipitation'); ax3.set_title('Prcp')

ax4.plot(dates, features['snwd_1'], 'ro')
ax4.set_xlabel(''); ax4.set_ylabel('snow depth'); ax4.set_title('Snwd')

plt.show()

# Step 5: use sns.pairplot to draw pairwise scatter plots
# Add a season label, used only to color the points in the plot
season = []
for month in months:
    if month in [12, 1, 2]:
        season.append('winter')
    elif month in [3, 4, 5]:
        season.append('spring')
    elif month in [6, 7, 8]:
        season.append('summer')
    else:
        season.append('autumn')

# .copy() so that adding a column does not trigger SettingWithCopyWarning
feature_matrix = features[['prcp_1', 'temp_1', 'average', 'actual']].copy()
feature_matrix['season'] = season

import seaborn as sns

sns.set(style='ticks', color_codes=True)

palette = sns.xkcd_palette(['dark blue', 'dark green', 'gold', 'orange'])
# hue selects the column used to group (and color) the points
# fill=True requires seaborn >= 0.11; older versions use shade=True instead
sns.pairplot(feature_matrix, hue='season', palette=palette, plot_kws=dict(alpha=0.7), diag_kind='kde',
             diag_kws=dict(fill=True))
plt.show()

# Step 6: build a random forest on the larger dataset, without the newly added features
# One-hot encode the text weekday column with pd.get_dummies first, so that
# the columns are selected from the final, encoded frame
features = pd.get_dummies(features)
# Keep every column except the newly added ones, selecting by name
new_features = ['ws_1', 'prcp_1', 'snwd_1']
feature_names = [name for name in features.columns if name not in new_features]
print(feature_names)
# Extract the features and the label
X = features[feature_names]
y = np.array(features['actual'])
X = X.drop('actual', axis=1)
X = np.array(X)
# Split into training and test sets with train_test_split
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)
# Build a random forest model and predict
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_x, train_y)
pred_y = rf.predict(test_x)

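# MAE: mean absolute error
MAE = round(abs(pred_y - test_y).mean(), 2)

# "MAPE" as used in this post: 100 minus the mean absolute percentage error,
# i.e. an accuracy in percent (higher is better)
MAPE = round(((1-abs(pred_y-test_y)/test_y)*100).mean(), 2)

print(MAE, MAPE)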

# For comparison: MAE and MAPE on the original, smaller dataset
features = pd.read_csv('data/temps.csv')
# One-hot encode the text weekday column with pd.get_dummies
features = pd.get_dummies(features)
# Extract the features and the label
y = np.array(features['actual'])
X = features.drop('actual', axis=1)
X = np.array(X)
# Split into training and test sets with train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)

# Build a random forest model and predict
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_x, train_y)
pred_y = rf.predict(test_x)

# MAE: mean absolute error
MAE = round(abs(pred_y - test_y).mean(), 2)

# "MAPE" as used in this post: 100 minus the mean absolute percentage error
MAPE = round(((1-abs(pred_y-test_y)/test_y)*100).mean(), 2)

print(MAE, MAPE)

# Step 7: study the effect of also including the newly added features
features = pd.read_csv('data/temps_extended.csv')
# One-hot encode the text weekday column with pd.get_dummies
features = pd.get_dummies(features)
# Extract the features and the label
y = np.array(features['actual'])
X = features.drop('actual', axis=1)
X = np.array(X)
# Split into training and test sets with train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)

# Build a random forest model and predict
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_x, train_y)
pred_y = rf.predict(test_x)

# MAE: mean absolute error
MAE = round(abs(pred_y - test_y).mean(), 2)

# "MAPE" as used in this post: 100 minus the mean absolute percentage error
MAPE = round(((1-abs(pred_y-test_y)/test_y)*100).mean(), 2)

print(MAE, MAPE)
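The three experiments above repeat the same split, fit and score code, so they can be folded into one helper. The sketch below is one possible way to do that; the function name `evaluate` is our own. Because the CSV files are not bundled here, it runs on synthetic stand-in data (the column meanings and coefficients are assumptions made purely for illustration), so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def evaluate(X, y, n_estimators=100, seed=42):
    """Split, fit a random forest, and return (MAE, accuracy in percent)."""
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    rf.fit(train_x, train_y)
    errors = np.abs(rf.predict(test_x) - test_y)
    mae = round(float(errors.mean()), 2)
    # Same accuracy-style metric as in the post: 100 - MAPE
    accuracy = round(float((100 - errors / test_y * 100).mean()), 2)
    return mae, accuracy


# Synthetic stand-in data (assumption: NOT the temps.csv / temps_extended.csv files)
rng = np.random.default_rng(0)
n = 600
temp_1 = rng.uniform(40, 90, n)             # pretend "yesterday's max temperature"
average = temp_1 + rng.normal(0, 3, n)      # pretend "historical average"
extra = rng.normal(0, 1, n)                 # pretend "newly added feature"
actual = 0.6 * temp_1 + 0.4 * average + rng.normal(0, 4, n)
X_all = np.column_stack([temp_1, average, extra])

results = {
    'few samples, base features': evaluate(X_all[:150, :2], actual[:150]),
    'more samples, base features': evaluate(X_all[:, :2], actual),
    'more samples, extra feature': evaluate(X_all, actual),
}
for name, (mae, acc) in results.items():
    print(f'{name}: MAE={mae}, accuracy={acc}%')
```

With the real data, the same helper could be called once per experiment by passing in the `X` and `y` arrays built in steps 6 and 7.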