XGBoost in Data Competitions: Parameter Tuning in Practice (Complete Workflow)

This post builds on my previous one, Feature Selection with Scikit-learn, Regression Prediction with XGBoost, and Model Optimization in Practice, and tunes the parameters of that model. Please read that post first before reading this one.

Most of my earlier work was about feature selection; here I want to share some practical tips on tuning XGBoost parameters. I have seen plenty of related material online, most of it translated from a single English blog post, and worse, many articles describe the steps incompletely, which easily leaves newcomers confused. As a newcomer myself I fell into quite a few pitfalls along the way, and I hope this post helps you avoid them. Now, let's get started.


First, fortunately, Scikit-learn provides a function that helps us tune parameters:

sklearn.model_selection.GridSearchCV

Commonly used parameters:

  1. estimator: the model to tune. If you are using XGBoost in a competition, this is the model you built, e.g. model = xgb.XGBRegressor(**other_params)
  2. param_grid: a dict (or a list of dicts) giving the parameter values to search over, e.g. cv_params = {'n_estimators': [550, 575, 600, 650, 675]}
  3. scoring: the evaluation metric. It defaults to None, in which case the estimator's own score method is used. You can also pass a string such as scoring='roc_auc' (which metrics are appropriate depends on the model), or a callable with the signature scorer(estimator, X, y). The available scoring values are shown below:

[Image: table of the built-in scoring parameter values]

For details, see: http://scikit-learn.org/stable/modules/model_evaluation.html
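To make the scorer(estimator, X, y) signature concrete, here is a minimal sketch of a custom callable scorer; the function name and the RMSE metric are my own illustration, not part of the original post:

from sklearn.metrics import mean_squared_error

def neg_rmse_scorer(estimator, X, y):
    # GridSearchCV always maximizes the score, so return negative RMSE
    preds = estimator.predict(X)
    return -mean_squared_error(y, preds) ** 0.5

# Usage: GridSearchCV(estimator=model, param_grid=cv_params, scoring=neg_rmse_scorer, cv=5)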

In this walkthrough I use the r2 scoring function; of course, you can choose whatever fits your actual needs.

When you start tuning, it is common to initialize the parameters with some reasonable values first:

  • learning_rate: 0.1
  • n_estimators: 500
  • max_depth: 5
  • min_child_weight: 1
  • subsample: 0.8
  • colsample_bytree: 0.8
  • gamma: 0
  • reg_alpha: 0
  • reg_lambda: 1

Link: XGBoost common parameters reference

You can set the initial values according to your own situation; the ones above are just rules of thumb.

Tuning generally proceeds in the following order:

1. Optimal number of boosting rounds: n_estimators

if __name__ == '__main__':
    trainFilePath = 'dataset/soccer/train.csv'
    testFilePath = 'dataset/soccer/test.csv'
    data = pd.read_csv(trainFilePath)
    X_train, y_train = featureSet(data)
    X_test = loadTestData(testFilePath)

    cv_params = {'n_estimators': [400, 500, 600, 700, 800]}
    other_params = {'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                    'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

    model = xgb.XGBRegressor(**other_params)
    optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)
    optimized_GBM.fit(X_train, y_train)
    evaluate_result = optimized_GBM.grid_scores_  # note: removed in sklearn >= 0.20; use cv_results_ instead
    print('Results for each parameter setting: {0}'.format(evaluate_result))
    print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
    print('Best model score: {0}'.format(optimized_GBM.best_score_))

**Before going further, let me point out one critical detail in the code above:**

In model = xgb.XGBRegressor(**other_params), the two asterisks must not be omitted! Many people overlook this, and since a lot of online tutorials seem to be copied from one another without the results ever being run, they simply write model = xgb.XGBRegressor(other_params). **Sadly, if you run it that way, you get the following error:**

xgboost.core.XGBoostError: b"Invalid Parameter format for max_depth expect int but value...

If you don't believe me, see this link: xgboost issue

That was a lesson learned the hard way: until you actually run the code yourself, you never know what bugs will appear!
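The cause is ordinary Python keyword-argument unpacking, not anything XGBoost-specific. A minimal sketch (the toy function is my own illustration):

params = {'max_depth': 5, 'learning_rate': 0.1}

def build(max_depth=3, learning_rate=0.3):
    print(max_depth, learning_rate)

build(**params)  # expands to build(max_depth=5, learning_rate=0.1) -- what we want
build(params)    # passes the whole dict as the single positional argument max_depth,
                 # which is how XGBRegressor ends up with a dict where it expects an int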

The output is:

[Parallel(n_jobs=4)]: Done  25 out of  25 | elapsed:  1.5min finished
Results for each parameter setting: [mean: 0.94051, std: 0.01244, params: {'n_estimators': 400}, mean: 0.94057, std: 0.01244, params: {'n_estimators': 500}, mean: 0.94061, std: 0.01230, params: {'n_estimators': 600}, mean: 0.94060, std: 0.01223, params: {'n_estimators': 700}, mean: 0.94058, std: 0.01231, params: {'n_estimators': 800}]
Best parameter values: {'n_estimators': 600}
Best model score: 0.9406056804545407

The output shows that the optimal number of iterations is 600. But we should not take this as the final answer: the grid spacing was fairly coarse, so I tested another set of values at a finer granularity:

cv_params = {'n_estimators': [550, 575, 600, 650, 675]}
other_params = {'learning_rate': 0.1, 'n_estimators': 600, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

The output is:

[Parallel(n_jobs=4)]: Done  25 out of  25 | elapsed:  1.5min finished
Results for each parameter setting: [mean: 0.94065, std: 0.01237, params: {'n_estimators': 550}, mean: 0.94064, std: 0.01234, params: {'n_estimators': 575}, mean: 0.94061, std: 0.01230, params: {'n_estimators': 600}, mean: 0.94060, std: 0.01226, params: {'n_estimators': 650}, mean: 0.94060, std: 0.01224, params: {'n_estimators': 675}]
Best parameter values: {'n_estimators': 550}
Best model score: 0.9406545392685364

Sure enough, the optimal number of iterations dropped to 550. You might ask whether we should keep shrinking the granularity further. That depends on your situation: if you want higher precision, a finer grid gives a more accurate result, and you can keep refining on your own; I will not enumerate every step here.
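If you want to automate that refinement, here is a rough sketch of the idea (the helper name and the narrowing rule are my own, for illustration only): run a coarse search, then re-search a narrower window around the best value.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

def refine_param(base_params, name, coarse_values, X, y, rounds=2):
    # assumes an integer-valued parameter such as n_estimators
    values = list(coarse_values)
    best = None
    for _ in range(rounds):
        gs = GridSearchCV(estimator=xgb.XGBRegressor(**base_params),
                          param_grid={name: values}, scoring='r2', cv=5, n_jobs=4)
        gs.fit(X, y)
        best = gs.best_params_[name]
        # shrink the grid to a narrower window centred on the best value
        step = max(1, (max(values) - min(values)) // (2 * (len(values) - 1)))
        values = sorted({max(1, best + k * step) for k in (-2, -1, 0, 1, 2)})
    return best

# e.g. refine_param(other_params, 'n_estimators', [400, 500, 600, 700, 800], X_train, y_train)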

2. The next parameters to tune are min_child_weight and max_depth

**Note: each time you finish tuning a parameter, update other_params with its optimal value.**

cv_params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'min_child_weight': [1, 2, 3, 4, 5, 6]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

The output is:

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 12.3min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 17.2min finished
Results for each parameter setting: [mean: 0.93967, std: 0.01334, params: {'min_child_weight': 1, 'max_depth': 3}, mean: 0.93826, std: 0.01202, params: {'min_child_weight': 2, 'max_depth': 3}, mean: 0.93739, std: 0.01265, params: {'min_child_weight': 3, 'max_depth': 3}, mean: 0.93827, std: 0.01285, params: {'min_child_weight': 4, 'max_depth': 3}, mean: 0.93680, std: 0.01219, params: {'min_child_weight': 5, 'max_depth': 3}, mean: 0.93640, std: 0.01231, params: {'min_child_weight': 6, 'max_depth': 3}, mean: 0.94277, std: 0.01395, params: {'min_child_weight': 1, 'max_depth': 4}, mean: 0.94261, std: 0.01173, params: {'min_child_weight': 2, 'max_depth': 4}, mean: 0.94276, std: 0.01329...]
Best parameter values: {'min_child_weight': 5, 'max_depth': 4}
Best model score: 0.94369522247392

The output shows the best values: {'min_child_weight': 5, 'max_depth': 4}. (I truncated part of the output because it was too long; the same applies below. As a sanity check, 8 max_depth values × 6 min_child_weight values = 48 combinations, times 5 CV folds, gives the 240 fits reported above.)

3. Next we tune gamma:

cv_params = {'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

The output is:

[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:  1.5min finished
Results for each parameter setting: [mean: 0.94370, std: 0.01010, params: {'gamma': 0.1}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.2}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.3}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.4}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.5}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.6}]
Best parameter values: {'gamma': 0.1}
Best model score: 0.94369522247392

The output shows the best value: {'gamma': 0.1}. (Notice that the mean score is identical for every gamma tested, so gamma has essentially no effect on this dataset; GridSearchCV simply reports the first of the tied candidates.)

4. Next come subsample and colsample_bytree:

cv_params = {'subsample': [0.6, 0.7, 0.8, 0.9], 'colsample_bytree': [0.6, 0.7, 0.8, 0.9]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}

The output shows the best values: {'subsample': 0.7, 'colsample_bytree': 0.7}

5. Right after that, reg_alpha and reg_lambda:

cv_params = {'reg_alpha': [0.05, 0.1, 1, 2, 3], 'reg_lambda': [0.05, 0.1, 1, 2, 3]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}

The output is:

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  2.0min
[Parallel(n_jobs=4)]: Done 125 out of 125 | elapsed:  5.6min finished
Results for each parameter setting: [mean: 0.94169, std: 0.00997, params: {'reg_alpha': 0.01, 'reg_lambda': 0.01}, mean: 0.94112, std: 0.01086, params: {'reg_alpha': 0.01, 'reg_lambda': 0.05}, mean: 0.94153, std: 0.01093, params: {'reg_alpha': 0.01, 'reg_lambda': 0.1}, mean: 0.94400, std: 0.01090, params: {'reg_alpha': 0.01, 'reg_lambda': 1}, mean: 0.93820, std: 0.01177, params: {'reg_alpha': 0.01, 'reg_lambda': 100}, mean: 0.94194, std: 0.00936, params: {'reg_alpha': 0.05, 'reg_lambda': 0.01}, mean: 0.94136, std: 0.01122, params: {'reg_alpha': 0.05, 'reg_lambda': 0.05}, mean: 0.94164, std: 0.01120...]
Best parameter values: {'reg_alpha': 1, 'reg_lambda': 1}
Best model score: 0.9441561344357595

The output shows the best values: {'reg_alpha': 1, 'reg_lambda': 1}

6. Finally, learning_rate; at this stage you generally test smaller learning rates:

cv_params = {'learning_rate': [0.01, 0.05, 0.07, 0.1, 0.2]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 1, 'reg_lambda': 1}

The output is:

[Parallel(n_jobs=4)]: Done  25 out of  25 | elapsed:  1.1min finished
Results for each parameter setting: [mean: 0.93675, std: 0.01080, params: {'learning_rate': 0.01}, mean: 0.94229, std: 0.01138, params: {'learning_rate': 0.05}, mean: 0.94110, std: 0.01066, params: {'learning_rate': 0.07}, mean: 0.94416, std: 0.01037, params: {'learning_rate': 0.1}, mean: 0.93985, std: 0.01109, params: {'learning_rate': 0.2}]
Best parameter values: {'learning_rate': 0.1}
Best model score: 0.9441561344357595

The output shows the best value: {'learning_rate': 0.1}. (Keep in mind that a smaller learning rate generally needs more trees to compensate; with n_estimators fixed at 550, it is not surprising that 0.01 scores worst here.)

We can see clearly that the best model score keeps improving as the parameters are tuned, which confirms that tuning does help to some extent. Note, though, that the best score did not improve by much. One reminder: this score is computed with the scoring function configured earlier, i.e.

optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)

with scoring='r2'. In real-world scenarios you may need a variety of different scoring functions to judge a model's quality.
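For instance, here is a minimal sketch of plugging a different metric into GridSearchCV via make_scorer; the RMSE choice is my own illustration, not part of the original post:

import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error

# greater_is_better=False makes GridSearchCV negate the metric internally,
# since it always tries to maximize the score
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
                          greater_is_better=False)

# optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params,
#                              scoring=rmse_scorer, cv=5, verbose=1, n_jobs=4)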

Finally, we plug the best parameter combination into the model and train it to get our predictions:

def trainandTest(X_train, y_train, X_test):
    # XGBoost training; the parameters below are the best combination we just tuned
    model = xgb.XGBRegressor(learning_rate=0.1, n_estimators=550, max_depth=4, min_child_weight=5, seed=0,
                             subsample=0.7, colsample_bytree=0.7, gamma=0.1, reg_alpha=1, reg_lambda=1)
    model.fit(X_train, y_train)

    # Predict on the test set
    ans = model.predict(X_test)

    ans_len = len(ans)
    id_list = np.arange(10441, 17441)
    data_arr = []
    for row in range(0, ans_len):
        data_arr.append([int(id_list[row]), ans[row]])
    np_data = np.array(data_arr)

    # Write results to file
    pd_data = pd.DataFrame(np_data, columns=['id', 'y'])
    # print(pd_data)
    pd_data.to_csv('submit.csv', index=None)

    # Plot feature importances
    # plot_importance(model)
    # plt.show()
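If you want to enable the commented-out feature-importance plot, a minimal sketch looks like this (assuming matplotlib is installed):

import matplotlib.pyplot as plt
from xgboost import plot_importance

plot_importance(model)  # bar chart of each feature's importance score
plt.show()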

And that essentially wraps up the tuning process. As I mentioned above, tuning does help model accuracy, but only to a limited extent. The most important gains still come from data cleaning, feature selection, feature fusion, model ensembling, and similar techniques!

Here is the complete code (one disclaimer: my code quality is not great, so just treat it as a reference for the overall approach):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @File  : soccer_value.py
# @Author: Huangqinjian
# @Date  : 2018/3/22
# @Desc  :

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing
from sklearn.preprocessing import Imputer  # removed in sklearn >= 0.22; use sklearn.impute.SimpleImputer instead
from sklearn.grid_search import GridSearchCV  # removed in sklearn >= 0.20; use sklearn.model_selection.GridSearchCV

# Load the training data
def featureSet(data):
    imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imputer.fit(data.loc[:, ['rw', 'st', 'lw', 'cf', 'cam', 'cm']])
    x_new = imputer.transform(data.loc[:, ['rw', 'st', 'lw', 'cf', 'cam', 'cm']])

    le = preprocessing.LabelEncoder()
    le.fit(['Low', 'Medium', 'High'])
    att_label = le.transform(data.work_rate_att.values)
    # print(att_label)
    def_label = le.transform(data.work_rate_def.values)
    # print(def_label)

    data_num = len(data)
    XList = []
    for row in range(0, data_num):
        tmp_list = []
        tmp_list.append(data.iloc[row]['club'])
        tmp_list.append(data.iloc[row]['league'])
        tmp_list.append(data.iloc[row]['potential'])
        tmp_list.append(data.iloc[row]['international_reputation'])
        tmp_list.append(data.iloc[row]['pac'])
        tmp_list.append(data.iloc[row]['sho'])
        tmp_list.append(data.iloc[row]['pas'])
        tmp_list.append(data.iloc[row]['dri'])
        tmp_list.append(data.iloc[row]['def'])
        tmp_list.append(data.iloc[row]['phy'])
        tmp_list.append(data.iloc[row]['skill_moves'])
        tmp_list.append(x_new[row][0])
        tmp_list.append(x_new[row][1])
        tmp_list.append(x_new[row][2])
        tmp_list.append(x_new[row][3])
        tmp_list.append(x_new[row][4])
        tmp_list.append(x_new[row][5])
        tmp_list.append(att_label[row])
        tmp_list.append(def_label[row])
        XList.append(tmp_list)
    yList = data.y.values
    return XList, yList

# Load the test data
def loadTestData(filePath):
    data = pd.read_csv(filepath_or_buffer=filePath)
    imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imputer.fit(data.loc[:, ['rw', 'st', 'lw', 'cf', 'cam', 'cm']])
    x_new = imputer.transform(data.loc[:, ['rw', 'st', 'lw', 'cf', 'cam', 'cm']])

    le = preprocessing.LabelEncoder()
    le.fit(['Low', 'Medium', 'High'])
    att_label = le.transform(data.work_rate_att.values)
    # print(att_label)
    def_label = le.transform(data.work_rate_def.values)
    # print(def_label)

    data_num = len(data)
    XList = []
    for row in range(0, data_num):
        tmp_list = []
        tmp_list.append(data.iloc[row]['club'])
        tmp_list.append(data.iloc[row]['league'])
        tmp_list.append(data.iloc[row]['potential'])
        tmp_list.append(data.iloc[row]['international_reputation'])
        tmp_list.append(data.iloc[row]['pac'])
        tmp_list.append(data.iloc[row]['sho'])
        tmp_list.append(data.iloc[row]['pas'])
        tmp_list.append(data.iloc[row]['dri'])
        tmp_list.append(data.iloc[row]['def'])
        tmp_list.append(data.iloc[row]['phy'])
        tmp_list.append(data.iloc[row]['skill_moves'])
        tmp_list.append(x_new[row][0])
        tmp_list.append(x_new[row][1])
        tmp_list.append(x_new[row][2])
        tmp_list.append(x_new[row][3])
        tmp_list.append(x_new[row][4])
        tmp_list.append(x_new[row][5])
        tmp_list.append(att_label[row])
        tmp_list.append(def_label[row])
        XList.append(tmp_list)
    return XList


def trainandTest(X_train, y_train, X_test):
    # XGBoost training
    model = xgb.XGBRegressor(learning_rate=0.1, n_estimators=550, max_depth=4, min_child_weight=5, seed=0,
                             subsample=0.7, colsample_bytree=0.7, gamma=0.1, reg_alpha=1, reg_lambda=1)
    model.fit(X_train, y_train)

    # Predict on the test set
    ans = model.predict(X_test)

    ans_len = len(ans)
    id_list = np.arange(10441, 17441)
    data_arr = []
    for row in range(0, ans_len):
        data_arr.append([int(id_list[row]), ans[row]])
    np_data = np.array(data_arr)

    # Write results to file
    pd_data = pd.DataFrame(np_data, columns=['id', 'y'])
    # print(pd_data)
    pd_data.to_csv('submit.csv', index=None)

    # Plot feature importances
    # plot_importance(model)
    # plt.show()


if __name__ == '__main__':
    trainFilePath = 'dataset/soccer/train.csv'
    testFilePath = 'dataset/soccer/test.csv'
    data = pd.read_csv(trainFilePath)
    X_train, y_train = featureSet(data)
    X_test = loadTestData(testFilePath)
    # Predict the final result
    # trainandTest(X_train, y_train, X_test)

    """
    The code below is for tuning the parameters
    """

    #
    # cv_params = {'n_estimators': [400, 500, 600, 700, 800]}
    # other_params = {'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
    #                 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
    #
    # cv_params = {'n_estimators': [550, 575, 600, 650, 675]}
    # other_params = {'learning_rate': 0.1, 'n_estimators': 600, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
    #                 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
    #
    # cv_params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'min_child_weight': [1, 2, 3, 4, 5, 6]}
    # other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
    #                 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
    #
    # cv_params = {'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
    # other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
    #                 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
    #
    # cv_params = {'subsample': [0.6, 0.7, 0.8, 0.9], 'colsample_bytree': [0.6, 0.7, 0.8, 0.9]}
    # other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
    #                 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}
    #
    # cv_params = {'reg_alpha': [0.05, 0.1, 1, 2, 3], 'reg_lambda': [0.05, 0.1, 1, 2, 3]}
    # other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
    #                 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}
    #
    # cv_params = {'learning_rate': [0.01, 0.05, 0.07, 0.1, 0.2]}
    # other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
    #                 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 1, 'reg_lambda': 1}
    #
    # model = xgb.XGBRegressor(**other_params)
    # optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)
    # optimized_GBM.fit(X_train, y_train)
    # evaluate_result = optimized_GBM.grid_scores_  # removed in sklearn >= 0.20; use cv_results_ instead
    # print('Results for each parameter setting: {0}'.format(evaluate_result))
    # print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
    # print('Best model score: {0}'.format(optimized_GBM.best_score_))

