Public bike sharing is low-carbon, eco-friendly, and healthy, and it solves the "last kilometer" pain point of urban transport, so it is becoming more and more popular in cities across the country. The data for this exercise come from several public bike docking stations on streets in two cities. Given the time, the weather, and other information, we want to predict how many public bikes will be borrowed in that block within one hour.
Task type: regression
train.csv: the training set, 273 KB
test.csv: the test set, 179 KB
sample_submit.csv: the submission example, 97 KB
The training set contains 10,000 samples and the test set contains 7,000 samples.
The evaluation metric is RMSE (Root Mean Squared Error).
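Since every model below is scored the same way, it helps to keep the metric as a small helper; a minimal sketch (the function name rmse is mine, not part of the competition code):

import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # RMSE: the square root of the mean squared error
    return np.sqrt(mean_squared_error(y_true, y_pred))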
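The exploration below assumes the data files have already been read with pandas, along these lines:

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")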
print(train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
city          10000 non-null int64
hour          10000 non-null int64
is_workday    10000 non-null int64
weather       10000 non-null int64
temp_1        10000 non-null float64
temp_2        10000 non-null float64
wind          10000 non-null int64
dtypes: float64(2), int64(5)
memory usage: 547.0 KB
None
We can see that there are 10,000 observations and no missing values.
print(train.describe())

               city          hour  ...        temp_2          wind
count  10000.000000  10000.000000  ...  10000.000000  10000.000000
mean       0.499800     11.527500  ...     15.321230      1.248600
std        0.500025      6.909777  ...     11.308986      1.095773
min        0.000000      0.000000  ...    -15.600000      0.000000
25%        0.000000      6.000000  ...      5.800000      0.000000
50%        0.000000     12.000000  ...     16.000000      1.000000
75%        1.000000     18.000000  ...     24.800000      2.000000
max        1.000000     23.000000  ...     46.800000      7.000000

[8 rows x 7 columns]
A few guesses follow from these statistics: city 0 and city 1 can essentially be ruled out as southern cities (the minimum temperature in the data is -15.6 °C); the records span a fairly long period and may well include a long holiday; and so on.
(For readability, correlations with absolute value below 0.2 are replaced with NaN.)
corr = feature_data.corr()
corr[np.abs(corr) < 0.2] = np.nan
print(corr)

            city  hour  is_workday  weather    temp_1    temp_2  wind
city         1.0   NaN         NaN      NaN       NaN       NaN   NaN
hour         NaN   1.0         NaN      NaN       NaN       NaN   NaN
is_workday   NaN   NaN         1.0      NaN       NaN       NaN   NaN
weather      NaN   NaN         NaN      1.0       NaN       NaN   NaN
temp_1       NaN   NaN         NaN      NaN  1.000000  0.987357   NaN
temp_2       NaN   NaN         NaN      NaN  0.987357  1.000000   NaN
wind         NaN   NaN         NaN      NaN       NaN       NaN   1.0
In terms of correlation, the hour of day and the temperature at the time are the features most strongly related to the rental count y; actual temperature (temp_1) and apparent temperature (temp_2) are strongly positively correlated (collinear), which matches common sense.
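The matrix above only covers the features. To check the relationship with y directly, one can compute correlations before separating the target; a sketch, assuming traindata is the raw training frame with the id and y columns still present:

# correlation of each feature with the rental count y
corr_with_y = traindata.drop(columns='id').corr()['y'].drop('y')
print(corr_with_y.sort_values(ascending=False))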
The first model is a plain linear regression; the RMSE of its predictions is 39.132.
# -*- coding: utf-8 -*-
# Imports
from sklearn.linear_model import LinearRegression
import pandas as pd

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Pop the target y off the training set
y_train = train.pop('y')

# Fit a linear regression model
reg = LinearRegression()
reg.fit(train, y_train)
y_pred = reg.predict(test)

# Clip negative predictions to 0 (a rental count cannot be negative);
# in Python 3, map() returns an iterator, so clip the array instead
y_pred = y_pred.clip(min=0)

# Write the predictions to my_LR_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_LR_prediction.csv', index=False)
The second model is a decision tree regressor (maximum depth 5); the RMSE of its predictions is 28.818.
# -*- coding: utf-8 -*-
# Imports
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Pop the target y off the training set
y_train = train.pop('y')

# Fit a decision tree regressor with maximum depth 5
reg = DecisionTreeRegressor(max_depth=5)
reg.fit(train, y_train)
y_pred = reg.predict(test)

# Write the predictions to my_DT_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_DT_prediction.csv', index=False)
The third model is XGBoost with default parameters; the RMSE of its predictions is 18.947.
# -*- coding: utf-8 -*-
# Imports
from xgboost import XGBRegressor
import pandas as pd

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Pop the target y off the training set
y_train = train.pop('y')

# Fit an XGBoost regressor with default parameters
reg = XGBRegressor()
reg.fit(train, y_train)
y_pred = reg.predict(test)

# Write the predictions to my_XGB_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_XGB_prediction.csv', index=False)
The usual tuning procedure, and the order followed below, is: first fix a learning rate and tune n_estimators; then tune max_depth and min_child_weight; then gamma; then subsample and colsample_bytree; and finally the regularization parameter reg_alpha. For reference, the XGBRegressor signature with its default parameters:
def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
             silent=True, objective="rank:pairwise", booster='gbtree',
             n_jobs=-1, nthread=None, gamma=0, min_child_weight=1,
             max_delta_step=0, subsample=1, colsample_bytree=1,
             colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
             scale_pos_weight=1, base_score=0.5, random_state=0,
             seed=None, missing=None, **kwargs):
def xgboost_parameter_tuning(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 1: hold the other parameters at reasonable starting values and search n_estimators
    param_test1 = {
        'n_estimators': range(100, 1000, 100)
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, max_depth=5, min_child_weight=1,
                                   gamma=0, subsample=0.8, colsample_bytree=0.8,
                                   nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test1, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The result is as follows (so we set the number of trees to 200):

{'n_estimators': 200}
0.9013685759002941

(The score GridSearchCV reports here is the regressor's default R² on the cross-validation folds.)
(max_depth: the maximum depth of a tree, default 3, range [1, ∞). The deeper the tree, the more closely it fits the data, but typical values are 3-10.)
(min_child_weight: the minimum sum of instance weights in a child node. If the weight sum in a leaf falls below min_child_weight, the splitting process stops.)
We tune these two parameters next because they have a large influence on the final result, so I fine-tune them directly over a small range.
def xgboost_parameter_tuning2(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 2: with n_estimators fixed at 200, search max_depth and min_child_weight jointly
    param_test2 = {
        'max_depth': range(3, 10, 1),
        'min_child_weight': range(1, 6, 1),
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200),
        param_grid=param_test2, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The result:

{'max_depth': 5, 'min_child_weight': 5}
0.9030852081699604
Here we searched 35 different combinations (7 values of max_depth by 5 values of min_child_weight) over a fairly wide range; the ideal max_depth is 5 and the ideal min_child_weight is 5.
(gamma makes the algorithm more conservative; its effect depends on the loss function, so it should be tuned rather than left at the default.)
With the other parameters settled, we can now tune gamma. Gamma can take a wide range of values; here I try five candidate values (0 to 0.4 in steps of 0.1), though more precise values could also be tried.
def xgboost_parameter_tuning3(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 3: search gamma over 0.0 to 0.4
    param_test3 = {
        'gamma': [i / 10.0 for i in range(0, 5)]
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200,
                                   max_depth=5, min_child_weight=5),
        param_grid=param_test3, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The result:

{'gamma': 0.0}
0.9024876500236406
(subsample: the fraction of the training samples used to grow each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the sample set to build each tree, which helps prevent overfitting; range (0, 1].)
(colsample_bytree: the fraction of features sampled when constructing each tree, default 1, range (0, 1].)
The next step is to try different values of subsample and colsample_bytree. We do this in two stages, both starting from the candidates 0.6, 0.7, 0.8, and 0.9 (stage one is shown below; a sketch of the finer second stage follows its results).
def xgboost_parameter_tuning4(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 4 (stage one): coarse search over subsample and colsample_bytree
    param_test4 = {
        'subsample': [i / 10.0 for i in range(6, 10)],
        'colsample_bytree': [i / 10.0 for i in range(6, 10)]
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200,
                                   max_depth=5, min_child_weight=5, gamma=0.0),
        param_grid=param_test4, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The result:

{'colsample_bytree': 0.9, 'subsample': 0.8}
0.9039011907271065
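The finer second stage is not shown in the original; it would narrow the grid around the stage-one winners in 0.05 steps, along these lines (a sketch only: the candidate values and the name param_test4b are illustrative, and X_train, y_train come from the same train_test_split as above):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# stage two: refine subsample and colsample_bytree around 0.8 and 0.9
param_test4b = {
    'subsample': [0.75, 0.8, 0.85],
    'colsample_bytree': [0.85, 0.9, 0.95]
}
gsearch = GridSearchCV(
    estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200,
                               max_depth=5, min_child_weight=5, gamma=0.0),
    param_grid=param_test4b, cv=5)
gsearch.fit(X_train, y_train)
print(gsearch.best_params_, gsearch.best_score_)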
The next step is regularization. Because gamma already provides a fairly effective way to reduce overfitting, most people rarely touch this parameter, but we can still try tuning reg_alpha.
def xgboost_parameter_tuning5(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 5: search the L1 regularization term reg_alpha
    param_test5 = {
        'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200,
                                   max_depth=5, min_child_weight=5, gamma=0.0,
                                   colsample_bytree=0.9, subsample=0.8),
        param_grid=param_test5, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The result:

{'reg_alpha': 0.01}
0.899800819611995

(This CV score is slightly lower than the previous step's 0.9039; since each tuning function draws a fresh random train/test split, differences of this size are within run-to-run noise.)
The final training code with all tuned parameters is as follows:
def xgboost_train(feature_data, label_data, test_feature, submitfile):
    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'learning_rate': 0.1,
        'n_estimators': 200,
        'max_depth': 5,
        'min_child_weight': 5,
        'gamma': 0.0,
        'colsample_bytree': 0.9,
        'subsample': 0.8,
        'reg_alpha': 0.01,
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    # Predict on the held-out validation split
    y_pred = model.predict(X_test)
    # Compute the RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_xgboost_prediction1.csv', index=False)
Comparing with the results above, the final RMSE is 15.208, an improvement of 3.92 over using XGBoost directly with default parameters.
The complete XGBoost code is summarized below:
#_*_coding:utf-8_*_
import numpy as np
import pandas as pd


def load_data(trainfile, testfile):
    traindata = pd.read_csv(trainfile)
    testdata = pd.read_csv(testfile)
    print(traindata.shape)   # (10000, 9)
    print(testdata.shape)    # (7000, 8)
    feature_data = traindata.iloc[:, 1:-1]   # all columns between id and y
    label_data = traindata.iloc[:, -1]       # the target y
    test_feature = testdata.iloc[:, 1:]      # everything except id
    return feature_data, label_data, test_feature


def xgboost_train(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'learning_rate': 0.1,
        'n_estimators': 200,
        'max_depth': 5,
        'min_child_weight': 5,
        'gamma': 0.0,
        'colsample_bytree': 0.9,
        'subsample': 0.8,
        'reg_alpha': 0.01,
    }
    model = xgb.XGBRegressor(**params)   # the tuned parameters must actually be passed in
    model.fit(X_train, y_train)
    # Predict on the held-out validation split
    y_pred = model.predict(X_test)
    # Compute the RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_xgboost_prediction.csv', index=False)


def xgboost_parameter_tuning1(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 1: search the number of trees
    param_test1 = {
        'n_estimators': range(100, 1000, 100)
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, max_depth=5, min_child_weight=1,
                                   gamma=0, subsample=0.8, colsample_bytree=0.8,
                                   nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test1, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning2(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 2: search max_depth and min_child_weight jointly
    param_test2 = {
        'max_depth': range(3, 10, 1),
        'min_child_weight': range(1, 6, 1),
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200),
        param_grid=param_test2, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning3(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 3: search gamma
    param_test3 = {
        'gamma': [i / 10.0 for i in range(0, 5)]
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200,
                                   max_depth=5, min_child_weight=5),
        param_grid=param_test3, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning4(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 4: search subsample and colsample_bytree
    param_test4 = {
        'subsample': [i / 10.0 for i in range(6, 10)],
        'colsample_bytree': [i / 10.0 for i in range(6, 10)]
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200,
                                   max_depth=5, min_child_weight=5, gamma=0.0),
        param_grid=param_test4, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning5(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 5: search the L1 regularization term reg_alpha
    param_test5 = {
        'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
    }
    gsearch1 = GridSearchCV(
        estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=200,
                                   max_depth=5, min_child_weight=5, gamma=0.0,
                                   colsample_bytree=0.9, subsample=0.8),
        param_grid=param_test5, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


if __name__ == '__main__':
    trainfile = 'data/train.csv'
    testfile = 'data/test.csv'
    submitfile = 'data/sample_submit.csv'
    feature_data, label_data, test_feature = load_data(trainfile, testfile)
    xgboost_train(feature_data, label_data, test_feature, submitfile)
Next we try a random forest with default parameters; its validation RMSE comes out to 17.144.
#_*_coding:utf-8_*_
import numpy as np
import pandas as pd


def load_data(trainfile, testfile):
    traindata = pd.read_csv(trainfile)
    testdata = pd.read_csv(testfile)
    feature_data = traindata.iloc[:, 1:-1]   # all columns between id and y
    label_data = traindata.iloc[:, -1]       # the target y
    test_feature = testdata.iloc[:, 1:]      # everything except id
    return feature_data, label_data, test_feature


def random_forest_train(feature_data, label_data, test_feature, submitfile):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    # Predict on the held-out validation split
    y_pred = model.predict(X_test)
    # Compute the RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_random_forest_prediction.csv', index=False)


if __name__ == '__main__':
    trainfile = 'data/train.csv'
    testfile = 'data/test.csv'
    submitfile = 'data/sample_submit.csv'
    feature_data, label_data, test_feature = load_data(trainfile, testfile)
    random_forest_train(feature_data, label_data, test_feature, submitfile)
First, let's walk through the random forest tuning process, starting with the number of trees.
def random_forest_parameter_tuning1(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 1: search the number of trees with the other parameters held at conservative values
    param_test1 = {
        'n_estimators': range(10, 71, 10)
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(min_samples_split=100, min_samples_leaf=20,
                                        max_depth=8, max_features='sqrt', random_state=10),
        param_grid=param_test1, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The result:

{'n_estimators': 70}
0.6573670183811001
This gives the optimal number of weak learners: 70. (Since 70 is also the upper end of the searched range, extending the grid might find a better value, but we proceed with 70.)
Having fixed the number of weak learners, we next grid-search the maximum tree depth max_depth together with min_samples_split, the minimum number of samples required to split an internal node.
def random_forest_parameter_tuning2(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 2: search max_depth and min_samples_split jointly
    param_test2 = {
        'max_depth': range(3, 14, 2),
        'min_samples_split': range(50, 201, 20)
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=70, min_samples_leaf=20,
                                        max_features='sqrt', oob_score=True, random_state=10),
        param_grid=param_test2, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The result:

{'max_depth': 13, 'min_samples_split': 50}
0.7107311632187736
We cannot finalize min_samples_split on its own yet, because it interacts with the other tree parameters.
Next we tune min_samples_split together with min_samples_leaf, the minimum number of samples required at a leaf node.
def random_forest_parameter_tuning3(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 3: search min_samples_split and min_samples_leaf jointly
    param_test3 = {
        'min_samples_split': range(10, 90, 20),
        'min_samples_leaf': range(10, 60, 10),
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=70, max_depth=13,
                                        max_features='sqrt', oob_score=True, random_state=10),
        param_grid=param_test3, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The result:

{'min_samples_leaf': 10, 'min_samples_split': 10}
0.7648492269870218
Finally, we tune max_features, the number of features considered when looking for the best split:

def random_forest_parameter_tuning4(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 4: search max_features
    param_test4 = {
        'max_features': range(3, 9, 2),
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=70, max_depth=13, min_samples_split=10,
                                        min_samples_leaf=10, oob_score=True, random_state=10),
        param_grid=param_test4, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The result:

{'max_features': 7}
0.881211719251515
With all parameters settled, the final random forest model:

def random_forest_train(feature_data, label_data, test_feature, submitfile):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'n_estimators': 70,
        'max_depth': 13,
        'min_samples_split': 10,
        'min_samples_leaf': 10,
        'max_features': 7
    }
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    # Predict on the held-out validation split
    y_pred = model.predict(X_test)
    # Compute the RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_random_forest_prediction1.csv', index=False)
The final results are as follows:
After tuning, the RMSE improves from 17.144 to 16.251. The gain is smaller than what tuning achieved for XGBoost, so we ultimately choose XGBoost.
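Summarizing the results reported in this post:

Model                          RMSE
Linear regression              39.132
Decision tree (max_depth=5)    28.818
XGBoost (default parameters)   18.947
XGBoost (tuned)                15.208
Random forest (default)        17.144
Random forest (tuned)          16.251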
#_*_coding:utf-8_*_
import numpy as np
import pandas as pd


def load_data(trainfile, testfile):
    traindata = pd.read_csv(trainfile)
    testdata = pd.read_csv(testfile)
    feature_data = traindata.iloc[:, 1:-1]   # all columns between id and y
    label_data = traindata.iloc[:, -1]       # the target y
    test_feature = testdata.iloc[:, 1:]      # everything except id
    return feature_data, label_data, test_feature


def random_forest_train(feature_data, label_data, test_feature, submitfile):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'n_estimators': 70,
        'max_depth': 13,
        'min_samples_split': 10,
        'min_samples_leaf': 10,
        'max_features': 7
    }
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    # Predict on the held-out validation split
    y_pred = model.predict(X_test)
    # Compute the RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_random_forest_prediction1.csv', index=False)


def random_forest_parameter_tuning1(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 1: search the number of trees
    param_test1 = {
        'n_estimators': range(10, 71, 10)
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(min_samples_split=100, min_samples_leaf=20,
                                        max_depth=8, max_features='sqrt', random_state=10),
        param_grid=param_test1, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


def random_forest_parameter_tuning2(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 2: search max_depth and min_samples_split jointly
    param_test2 = {
        'max_depth': range(3, 14, 2),
        'min_samples_split': range(50, 201, 20)
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=70, min_samples_leaf=20,
                                        max_features='sqrt', oob_score=True, random_state=10),
        param_grid=param_test2, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


def random_forest_parameter_tuning3(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 3: search min_samples_split and min_samples_leaf jointly
    param_test3 = {
        'min_samples_split': range(10, 90, 20),
        'min_samples_leaf': range(10, 60, 10),
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=70, max_depth=13,
                                        max_features='sqrt', oob_score=True, random_state=10),
        param_grid=param_test3, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


def random_forest_parameter_tuning4(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    # Step 4: search max_features
    param_test4 = {
        'max_features': range(3, 9, 2)
    }
    model = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=70, max_depth=13, min_samples_split=10,
                                        min_samples_leaf=10, oob_score=True, random_state=10),
        param_grid=param_test4, cv=5)
    model.fit(X_train, y_train)
    # Evaluate the best estimator on the held-out split
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


if __name__ == '__main__':
    trainfile = 'data/train.csv'
    testfile = 'data/test.csv'
    submitfile = 'data/sample_submit.csv'
    feature_data, label_data, test_feature = load_data(trainfile, testfile)
    random_forest_train(feature_data, label_data, test_feature, submitfile)
Reference: https://www.jianshu.com/p/748b6c35773d