XGBoost Parameter Reference

Date: 2019-11-24

Tags: xgboost, parameter reference


XGBoost stores its parameters as a key-value dictionary:

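params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 10,               # number of classes; used together with multi:softmax
    'gamma': 0.1,                  # controls post-pruning; larger is more conservative; typically 0.1 or 0.2
    'max_depth': 12,               # tree depth; larger values overfit more easily
    'lambda': 2,                   # L2 regularization on the weights; larger values make overfitting less likely
    'subsample': 0.7,              # random subsampling of the training rows
    'colsample_bytree': 0.7,       # column subsampling when building each tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 suppresses runtime messages; 0 is preferable
    'eta': 0.007,                  # acts like a learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}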

Before running XGBoost, three types of parameters must be set: general parameters, booster parameters, and task parameters:

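General parameters
These control which booster is used during boosting. The common boosters are the tree model and the linear model.
Booster parameters
These depend on which booster is used.
Task parameters
These control the learning scenario. For example, a regression task may use different parameters than a ranking task.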
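
As a rough illustration of the three categories, the sketch below sorts a flat parameter dictionary into general, booster, and task groups. The helper `group_params` and the two key sets are hypothetical and cover only parameters named in this post (placing `seed` and `num_class` in the task group is an assumption). XGBoost itself accepts the flat dictionary and needs no such grouping.

```python
# Hypothetical helper for illustration only; XGBoost does not require this grouping.
GENERAL_KEYS = {"booster", "silent", "nthread", "num_pbuffer", "num_feature"}
TASK_KEYS = {"objective", "num_class", "base_score", "eval_metric", "seed"}


def group_params(params):
    """Split a flat parameter dict into general / booster / task groups."""
    groups = {"general": {}, "booster": {}, "task": {}}
    for key, value in params.items():
        if key in GENERAL_KEYS:
            groups["general"][key] = value
        elif key in TASK_KEYS:
            groups["task"][key] = value
        else:
            # Anything else is treated as booster-specific in this sketch.
            groups["booster"][key] = value
    return groups


example = {
    "booster": "gbtree",
    "objective": "multi:softmax",
    "num_class": 10,
    "max_depth": 12,
    "eta": 0.007,
    "nthread": 4,
}
grouped = group_params(example)
for name, members in grouped.items():
    print(name, members)
```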

## General Parameters

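booster [default=gbtree]
Two models are available: gbtree and gblinear. gbtree uses tree-based models for boosting; gblinear uses linear models. The default is gbtree.
silent [default=0]
0 prints runtime messages; 1 runs in silent mode and prints none. The default is 0.
nthread
Number of threads XGBoost runs with. The default is the maximum number of threads available on the current system.
num_pbuffer
Size of the prediction buffer, normally set to the number of training instances. The buffer stores the prediction results of the last boosting step. XGBoost sets it automatically; there is no need to set it by hand.
num_feature
Feature dimension used in boosting, set to the number of features. XGBoost sets it automatically; there is no need to set it by hand.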

## Parameters for Tree Booster

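eta [default=0.3]
Step-size shrinkage used in each update to prevent overfitting. After each boosting step the weights of the new features are obtained directly, and eta shrinks these weights to make the boosting process more conservative. The default is 0.3.
Range: [0,1]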
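
To make the role of eta concrete, here is a toy sketch of shrinkage. It is not XGBoost's implementation: each "tree" is replaced by a single constant equal to the mean residual under squared error, and the name `boost_constant` is made up for illustration. It shows only that each step's contribution is multiplied by eta before it is added to the running prediction.

```python
def boost_constant(targets, eta, rounds):
    """Toy boosting with squared error: each round fits one constant
    (the mean residual) and adds eta times that constant to the prediction."""
    prediction = 0.0
    history = []
    for _ in range(rounds):
        residuals = [y - prediction for y in targets]
        step = sum(residuals) / len(residuals)  # unshrunk update
        prediction += eta * step                # update shrunk by eta
        history.append(prediction)
    return history


targets = [1.0, 2.0, 3.0]  # the best constant prediction is the mean, 2.0
fast = boost_constant(targets, eta=1.0, rounds=3)
slow = boost_constant(targets, eta=0.3, rounds=3)
print("eta=1.0:", fast)
print("eta=0.3:", slow)
```

With eta=1.0 the toy model jumps to the mean in one round; with eta=0.3 it approaches the mean gradually, which is why a small eta usually needs more boosting rounds.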
gamma [default=0]
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.
Range: [0,∞]
max_depth [default=6]
Maximum depth of a tree. The default is 6.
Range: [1,∞]
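min_child_weight [default=1]
Minimum sum of instance weights in a child node. If a split would produce a leaf whose sum of instance weights is less than min_child_weight, the splitting process stops. In linear regression mode this corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm.
Range: [0,∞]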
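
How gamma and min_child_weight interact can be sketched with the split-gain formula from the XGBoost paper, `Gain = 1/2 * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma`, where G and H are the sums of first- and second-order gradients in each child. The functions `split_gain` and `should_split` below are a hypothetical simplification, not the library's code. The sketch assumes the Hessian sum is the "instance weight" that min_child_weight is compared against.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Gain of a candidate split, following the formula in the XGBoost paper."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def should_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Hypothetical check: accept a split only if both children are heavy
    enough and the gain (already reduced by gamma) is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# With squared-error loss each instance has Hessian 1, so h counts instances.
print(should_split(-4.0, 4.0, 4.0, 4.0))                        # default settings
print(should_split(-4.0, 4.0, 4.0, 4.0, gamma=5.0))             # larger gamma
print(should_split(-4.0, 4.0, 4.0, 4.0, min_child_weight=5.0))  # stricter child weight
```

Raising either parameter rejects a split that the defaults would accept, which is the sense in which both make the algorithm more conservative.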
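max_delta_step [default=0]
Maximum delta step allowed for each tree's weight estimate. A value of 0 means no constraint; a positive value makes the update step more conservative. This parameter is usually not needed, but it may help in logistic regression when the classes are extremely imbalanced. Setting it to a value between 1 and 10 may help control the update.
Range: [0,∞]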
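
One way to picture max_delta_step is as a cap on the magnitude of a leaf weight. The sketch below assumes the optimal leaf weight `w = -G / (H + lambda)` from the XGBoost paper and clips it to plus or minus max_delta_step when the parameter is positive. The name `leaf_weight` is illustrative, and the library's actual update may differ in detail.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, max_delta_step=0.0):
    """Leaf weight -G / (H + lambda), optionally clipped (illustrative only)."""
    weight = -grad_sum / (hess_sum + reg_lambda)
    if max_delta_step > 0.0:
        # Cap the magnitude of the step; 0 means no constraint.
        weight = max(-max_delta_step, min(max_delta_step, weight))
    return weight


# A tiny Hessian sum, as can happen with very imbalanced logistic regression,
# produces a very large unconstrained step.
unclipped = leaf_weight(-5.0, 0.01, reg_lambda=0.0)
clipped = leaf_weight(-5.0, 0.01, reg_lambda=0.0, max_delta_step=1.0)
print(unclipped, clipped)
```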
subsample [default=1]
Fraction of the training set used to train the model. A value of 0.5 means XGBoost randomly draws 50% of the full sample to build each tree, which helps prevent overfitting.
Range: (0,1]
colsample_bytree [default=1]
Fraction of features sampled when building each tree. The default is 1.
Range: (0,1]
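
The effect of subsample and colsample_bytree can be sketched as drawing a random subset of row and column indices before each tree is built. The helper `sample_for_tree` is hypothetical and draws an exact count of rows and columns; the library's own sampling scheme may differ.

```python
import random


def sample_for_tree(n_rows, n_cols, subsample=1.0, colsample_bytree=1.0, seed=0):
    """Pick the row and column indices that one tree would be built on."""
    rng = random.Random(seed)
    n_keep_rows = max(1, round(n_rows * subsample))
    n_keep_cols = max(1, round(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_keep_rows))
    cols = sorted(rng.sample(range(n_cols), n_keep_cols))
    return rows, cols


# 100 training rows and 10 features, using the values from the example dictionary style.
rows, cols = sample_for_tree(100, 10, subsample=0.5, colsample_bytree=0.7)
print(len(rows), "rows and", len(cols), "columns selected for this tree")
```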

## Parameters for Linear Booster

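lambda [default=0]
Penalty coefficient for L2 regularization.
alpha [default=0]
Penalty coefficient for L1 regularization.
lambda_bias
L2 regularization on the bias. The default is 0. (There is no L1 regularization on the bias, because it is not important.)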
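
As a sketch of how the three coefficients combine, the function below computes a penalty in the conventional elastic-net form. The function name `linear_penalty` and the 1/2 factor on the L2 terms are assumptions made for illustration, not a statement of XGBoost's exact objective.

```python
def linear_penalty(weights, bias, reg_lambda=0.0, alpha=0.0, lambda_bias=0.0):
    """Regularization penalty for a linear model (assumed elastic-net form)."""
    l2 = 0.5 * reg_lambda * sum(w * w for w in weights)  # L2 on the weights
    l1 = alpha * sum(abs(w) for w in weights)            # L1 on the weights
    bias_l2 = 0.5 * lambda_bias * bias * bias            # L2 on the bias only
    return l2 + l1 + bias_l2


penalty = linear_penalty([1.0, -2.0], bias=3.0,
                         reg_lambda=2.0, alpha=1.0, lambda_bias=2.0)
print(penalty)
```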

## Task Parameters

objective [ default=reg:linear ]
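Defines the learning task and the corresponding learning objective. The available objectives are:

"reg:linear": linear regression.
"reg:logistic": logistic regression.
"binary:logistic": logistic regression for binary classification; outputs a probability.
"binary:logitraw": logistic regression for binary classification; outputs the raw score w^T x before the logistic transformation.
"count:poisson": Poisson regression for count data; outputs the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
"multi:softmax": makes XGBoost use the softmax objective for multi-class classification; the parameter num_class (number of classes) must also be set.
"multi:softprob": same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the probabilities of that sample belonging to each class.
"rank:pairwise": sets XGBoost to do a ranking task by minimizing the pairwise loss.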
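
The difference between some of these outputs can be sketched in plain Python. The numbers below are made-up values, not real model output, and `sigmoid` and `reshape_softprob` are illustrative helper names. The sketch shows how a raw score relates to a probability, and how a flat "multi:softprob" vector is reshaped and turned into "multi:softmax" style class labels.

```python
import math


def sigmoid(margin):
    """Map a raw score (as in binary:logitraw) to a probability (binary:logistic)."""
    return 1.0 / (1.0 + math.exp(-margin))


def reshape_softprob(flat, nclass):
    """Reshape a flat ndata * nclass vector into ndata rows of nclass columns."""
    return [flat[i:i + nclass] for i in range(0, len(flat), nclass)]


# Made-up flat output for 2 samples and 3 classes.
flat = [0.1, 0.7, 0.2, 0.6, 0.3, 0.1]
probs = reshape_softprob(flat, nclass=3)

# Taking the most probable class per row gives softmax-style labels.
labels = [row.index(max(row)) for row in probs]

print(probs)
print(labels)
print(sigmoid(0.0))
```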

base_score [ default=0.5 ]

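The initial prediction score of all instances; the global bias.
With a sufficient number of iterations, changing this value will not have much effect.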

eval_metric [ default according to objective ]

Evaluation metrics for the validation data. Each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking).
Users can add multiple evaluation metrics. Python users should pass the parameters as a list of pairs rather than a map, so that a later 'eval_metric' entry does not override an earlier one.
The available choices are:

"rmse": root mean square error
"logloss": negative log-likelihood
"error": Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.
"merror": Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
"mlogloss": Multiclass logloss
"auc": Area under the curve for ranking evaluation.
"ndcg": Normalized Discounted Cumulative Gain
"map": Mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
"ndcg-", "map-", "ndcg@n-", "map@n-": In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions.
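
A few of the eval_metric choices can be written out in plain Python to show what they measure. The functions `error_rate`, `rmse`, and `logloss` below are illustrative re-implementations on made-up data, not the library's code. The last lines sketch why a list of pairs is needed to request several metrics: a dictionary can hold the key 'eval_metric' only once.

```python
import math


def error_rate(labels, probs, threshold=0.5):
    """'error': #(wrong cases) / #(all cases), positive if prob > threshold."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != (y == 1))
    return wrong / len(labels)


def rmse(labels, preds):
    """'rmse': root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))


def logloss(labels, probs):
    """'logloss': negative log-likelihood, averaged over instances."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]  # made-up predictions
print(error_rate(labels, probs), rmse(labels, probs), logloss(labels, probs))

# A list of (key, value) pairs can repeat 'eval_metric'; a dict cannot.
params = {"objective": "binary:logistic", "eta": 0.3}
param_pairs = list(params.items()) + [("eval_metric", "error"),
                                      ("eval_metric", "logloss")]
print(param_pairs)
```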