XGBoost in Practice

Date: 2019-11-08

Tags: xgboost, practice

Installing xgboost: see "xgboost: Scalable and Flexible Gradient Boosting".

GitHub: eXtreme Gradient Boosting

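xgboost is written in C++ and provides interfaces for Python, R, Java, Scala, C++ and more. It runs on a single machine as well as on Hadoop, Spark, Flink and DataFlow.

A usage workflow in Python: "Implementing the xgboost algorithm on the Python platform and interpreting its output".

Parameter meanings (a first pass, to be refined in practice). Parameter explanation on the official site: XGBoost Parameters.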

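XGBoost Parameters

Before running XGBoost, three types of parameters must be set: general parameters, booster parameters and task parameters:

General parameters: control which booster is used for boosting. The commonly used boosters are the tree model (tree) and the linear model (linear model).
Booster parameters: these depend on which booster is chosen.
Task parameters: control the learning scenario. For example, regression tasks may use different parameters from ranking tasks.
In addition to the parameters above, there may be other parameters that are used on the command line.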
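As a minimal sketch of how the three groups fit together in Python, the snippet below keeps each group in its own dictionary and merges them into one flat dictionary. The values are illustrative choices, not recommendations. The training call is left as a comment because it assumes the xgboost package is installed and that a DMatrix named dtrain has been prepared, neither of which is shown in this post.

```python
# Illustrative values only; every key is a parameter described in this post.
general_params = {"booster": "gbtree", "silent": 0, "nthread": 4}
booster_params = {"eta": 0.3, "max_depth": 6, "subsample": 1.0}
task_params = {"objective": "binary:logistic", "eval_metric": "error", "seed": 0}

# The three groups are passed together as one flat mapping.
params = {**general_params, **booster_params, **task_params}

for name in sorted(params):
    print(f"{name} = {params[name]}")

# With xgboost installed and a DMatrix called dtrain (both assumed here):
#   import xgboost as xgb
#   bst = xgb.train(params, dtrain, num_boost_round=10)
```

Keeping the groups separate makes it easier to swap the booster parameters when changing from gbtree to gblinear without touching the task settings.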

Parameters in R Package

In the R package, you can use a dot (.) in place of the underscore in parameter names; for example, max.depth can be used for max_depth. The underscore forms are also valid in R.

General Parameters

booster [default=gbtree]
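- Two models are available: gbtree and gblinear. gbtree uses tree-based models for boosting, gblinear uses linear models. The default is gbtree
silent [default=0]
- 0 prints runtime messages; 1 runs in silent mode and prints no runtime messages. The default is 0
nthread [default to maximum number of threads available if not set]
- Number of threads XGBoost uses at run time. The default is the maximum number of threads available on the current system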
num_pbuffer [set automatically by xgboost, no need to be set by user]
- size of prediction buffer, normally set to number of training instances. The buffers are used to save the prediction results of last boosting step.
num_feature [set automatically by xgboost, no need to be set by user]
- Feature dimension used in boosting, set to the number of features. XGBoost sets this automatically; there is no need to set it by hand

Booster Parameters

From xgboost-unity onward, the bst: prefix is no longer needed for booster parameters. A parameter with or without the bst: prefix is equivalent (i.e. both bst:eta and eta are valid parameter settings).

Parameter for Tree Booster

eta [default=0.3]
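- Step size shrinkage used in the update to prevent overfitting. After each boosting step, the weights of the new features can be obtained directly; eta shrinks these feature weights to make the boosting process more conservative. The default is 0.3
- range: [0,1]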
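To make the effect of eta concrete, here is a toy sketch, not XGBoost's actual tree learner: a single target value is boosted with a "perfect" weak learner that fits the current residual exactly, and each round adds only eta times that fit. The function name boost_constant and the starting value 0.5 are assumptions made for this illustration.

```python
def boost_constant(target, base_score=0.5, eta=0.3, num_round=5):
    """Toy boosting on one target value with a learner that fits the residual exactly."""
    prediction = base_score
    history = []
    for _ in range(num_round):
        residual = target - prediction   # what the new learner would fit
        prediction += eta * residual     # eta shrinks the learner's contribution
        history.append(prediction)
    return history

conservative = boost_constant(target=2.0, eta=0.3)
aggressive = boost_constant(target=2.0, eta=1.0)

print([round(p, 4) for p in conservative])
print([round(p, 4) for p in aggressive])
```

With eta = 1.0 the first round already lands on the target, while with eta = 0.3 the remaining error shrinks by a factor of 0.7 per round, which is why a smaller eta usually needs more boosting rounds.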
gamma [default=0]
- Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.
- range: [0,∞]
max_depth [default=6]
- Maximum depth of a tree. The default is 6
- range: [1,∞]
min_child_weight [default=1]
- Minimum sum of instance weights (hessian) needed in a child node. If a leaf node's sum of instance weights is less than min_child_weight, the splitting process stops. In linear regression mode, this parameter corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm
- range: [0,∞]
max_delta_step [default=0]
- Maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update
- range: [0,∞]
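The way gamma, min_child_weight and max_delta_step interact can be sketched with the split-gain and leaf-weight formulas from the XGBoost paper. This is a simplified illustration under stated assumptions, not the library's code: reg_lambda stands for the paper's L2 term on leaf weights (it is not listed among the tree booster parameters above), the function names are made up for this example, and the library's internals may scale or order these checks differently.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, max_delta_step=0.0):
    """Leaf weight -G / (H + lambda), clipped to +/- max_delta_step when it is positive."""
    weight = -grad_sum / (hess_sum + reg_lambda)
    if max_delta_step > 0:
        weight = max(-max_delta_step, min(max_delta_step, weight))
    return weight

def score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of a single node."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def should_split(left, right, gamma=0.0, min_child_weight=1.0, reg_lambda=1.0):
    """left and right are (grad_sum, hess_sum) pairs for the two candidate children."""
    (gl, hl), (gr, hr) = left, right
    # min_child_weight: each child needs enough summed hessian.
    if hl < min_child_weight or hr < min_child_weight:
        return False, 0.0
    gain = 0.5 * (score(gl, hl, reg_lambda) + score(gr, hr, reg_lambda)
                  - score(gl + gr, hl + hr, reg_lambda))
    # gamma: the loss reduction must exceed this threshold.
    return gain > gamma, gain

left, right = (-4.0, 3.0), (5.0, 4.0)
print(should_split(left, right, gamma=0.0))
print(should_split(left, right, gamma=5.0))
print(should_split(left, right, min_child_weight=3.5))
print(leaf_weight(-4.0, 3.0), leaf_weight(-4.0, 3.0, max_delta_step=0.5))
```

For squared-error loss every instance has hessian 1, so the hessian sum is just the instance count, which matches the remark above that min_child_weight acts as a minimum number of instances in linear regression mode.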
subsample [default=1]
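- Ratio of the training instances that are subsampled to train the model. Setting it to 0.5 means that XGBoost randomly draws 50% of the whole sample set to build each tree model, which helps prevent overfitting.
- range: (0,1]
colsample_bytree [default=1]
- Ratio of features sampled when constructing each tree. The default is 1
- range: (0,1]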
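The two sampling ratios can be pictured as picking a subset of row indices and a subset of column indices before each tree is built. The sketch below uses only the standard library and draws fixed-size samples without replacement; that sampling scheme and the helper name sample_rows_and_columns are assumptions for illustration, and the library's own scheme may differ.

```python
import random

def sample_rows_and_columns(n_rows, n_cols, subsample=1.0, colsample_bytree=1.0, seed=0):
    """Return the row and column indices one tree would see under the given ratios."""
    rng = random.Random(seed)
    n_keep_rows = max(1, int(n_rows * subsample))
    n_keep_cols = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_keep_rows))
    cols = sorted(rng.sample(range(n_cols), n_keep_cols))
    return rows, cols

rows, cols = sample_rows_and_columns(100, 10, subsample=0.5, colsample_bytree=0.5, seed=0)
print(len(rows), "of 100 rows;", len(cols), "of 10 columns")
print("columns used by this tree:", cols)
```

Drawing a fresh sample for every tree means different trees see different rows and features, which is the source of the regularising effect.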

Parameter for Linear Booster

lambda [default=0]
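- L2 regularization penalty coefficient
alpha [default=0]
- L1 regularization penalty coefficient
lambda_bias
- L2 regularization on the bias. The default is 0 (there is no L1 regularization on the bias term, because the bias is not important for L1)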
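As a rough sketch of what these three coefficients penalise, the function below computes a regularization term for a linear model with weights and a bias. The factor of 1/2 on the L2 terms is the usual textbook convention and is an assumption here, since the parameter list above does not spell out the exact form. The argument is called reg_lambda only because lambda is a reserved word in Python.

```python
def linear_penalty(weights, bias, reg_lambda=0.0, alpha=0.0, lambda_bias=0.0):
    """Assumed form: 0.5*lambda*sum(w^2) + alpha*sum(|w|) + 0.5*lambda_bias*b^2."""
    l2 = 0.5 * reg_lambda * sum(w * w for w in weights)
    l1 = alpha * sum(abs(w) for w in weights)
    bias_l2 = 0.5 * lambda_bias * bias * bias
    return l2 + l1 + bias_l2

weights, bias = [1.0, -2.0], 3.0
print(linear_penalty(weights, bias))                              # all defaults
print(linear_penalty(weights, bias, reg_lambda=2.0, alpha=0.5, lambda_bias=1.0))
```

With all three coefficients at their default of 0 the penalty vanishes, so the linear booster is unregularized unless at least one of them is raised.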

Task Parameters

objective [default=reg:linear]
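- Defines the learning task and the corresponding learning objective. The available objective functions are:
- "reg:linear" – linear regression.
- "reg:logistic" – logistic regression.
- "binary:logistic" – logistic regression for binary classification; the output is a probability.
- "binary:logitraw" – logistic regression for binary classification; the output is the raw score wTx before the logistic transformation.
- "count:poisson" – Poisson regression for count data; the output is the mean of the Poisson distribution.
- In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
- "multi:softmax" – makes XGBoost use the softmax objective for multiclass classification; num_class (the number of classes) must also be set.
- "multi:softprob" – same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the probabilities of that sample belonging to each class.
- "rank:pairwise" – set XGBoost to do ranking task by minimizing the pairwise loss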
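The difference between the output forms above can be shown on made-up numbers without training anything. In this sketch, sigmoid turns a binary:logitraw-style raw score into the probability binary:logistic would report, and a flat multi:softprob-style vector is reshaped into ndata rows of nclass columns, from which the most probable class (what multi:softmax reports) is read off. The helper names and all values are hypothetical.

```python
import math

def sigmoid(margin):
    """Logistic transformation from a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def reshape_softprob(flat, nclass):
    """Split a flat ndata * nclass vector into ndata rows of nclass probabilities."""
    if len(flat) % nclass != 0:
        raise ValueError("length must be a multiple of nclass")
    return [flat[i:i + nclass] for i in range(0, len(flat), nclass)]

def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)

raw_score = 0.0                      # made-up binary:logitraw output
probability = sigmoid(raw_score)     # corresponding binary:logistic output

flat = [0.7, 0.2, 0.1, 0.1, 0.3, 0.6]          # made-up softprob output, ndata=2, nclass=3
matrix = reshape_softprob(flat, nclass=3)
labels = [argmax(row) for row in matrix]       # what multi:softmax would report

print(probability)
print(matrix)
print(labels)
```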
base_score [default=0.5]
- the initial prediction score of all instances, global bias
eval_metric [default according to objective]
- Evaluation metrics for validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking)
- Users can add multiple evaluation metrics. Python users should pass the parameters as a list of pairs rather than a map, so that a later 'eval_metric' entry does not overwrite an earlier one
- The choices are listed below:
- "rmse": root mean square error
- "logloss": negative log-likelihood
- "error": Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
- "merror": Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
- "mlogloss": Multiclass logloss
- "auc": Area under the curve for ranking evaluation.
- "ndcg": Normalized Discounted Cumulative Gain
- "map": Mean average precision
- "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
- "ndcg-", "map-", "ndcg@n-", "map@n-": In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding "-" to the evaluation metric, XGBoost will evaluate these scores as 0, to be consistent under some conditions.
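Two points from the list above lend themselves to a short sketch: how a Python user can carry several eval_metric entries as a list of pairs, and how rmse, error and logloss follow from their definitions. The metric functions are plain re-implementations for illustration rather than XGBoost's own code, the clipping constant eps is an added safeguard against log(0), and the labels and probabilities are made up.

```python
import math

def rmse(labels, preds):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def error_rate(labels, probs, threshold=0.5):
    """#(wrong cases) / #(all cases); predictions above the threshold count as positive."""
    wrong = sum((p > threshold) != (y == 1) for y, p in zip(labels, probs))
    return wrong / len(labels)

def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

# A dict can hold only one 'eval_metric' key, so extra metrics go in as pairs.
params = {"objective": "binary:logistic", "eta": 0.3}
param_pairs = list(params.items()) + [("eval_metric", "error"), ("eval_metric", "logloss")]

labels = [1, 0, 1, 0]
probs = [0.9, 0.4, 0.3, 0.6]
print(param_pairs)
print(error_rate(labels, probs), logloss(labels, probs), rmse(labels, probs))
```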
seed [default=0]
- Random number seed. The default is 0

Console Parameters

The following parameters are only used in the console version of xgboost
* use_buffer [default=1]
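- Whether to create a binary cache file for the input; the cache file can speed up computation. The default is 1
* num_round
- Number of boosting iterations.
* data
- Path to the input data
* test:data
- Path to the test data
* save_period [default=0]
- Period for saving the model. Setting save_period = 10 means XGBoost saves the model every 10 rounds; setting it to 0 means no model is saved during training.
* task [default=train] options: train, pred, eval, dump
- train: train the model
- pred: make predictions on the test data
- eval: evaluate the statistics specified by eval[name]=filename
- dump: dump the learned model into text format
* model_in [default=NULL]
- Path to the input model, used by the test, eval and dump tasks. If it is specified for training, XGBoost continues training from the input model
* model_out [default=NULL]
- Path where the model is saved after training finishes. If it is not specified, the output is named like 0003.model, where 0003 is the model from the third round of training.
* model_dir [default=models]
- Directory in which the output models are saved.
* fmap
- feature map, used for dump model
* name_dump [default=dump.txt]
- name of model dump file
* name_pred [default=pred.txt]
- Prediction result file
* pred_margin [default=0]
- Output the prediction margin rather than the transformed probability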
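As a hedged sketch of how these console parameters might be collected, the helper below renders a settings dictionary as name = value lines and lists the checkpoint files that a given save_period would produce. The name = value layout, the quoting of path-like values, and the file names train.txt and test.txt are assumptions made for illustration; the four-digit checkpoint naming follows the 0003.model example above.

```python
# Parameters whose values are paths or file names; quoted in the rendered text.
PATH_KEYS = {"data", "test:data", "model_in", "model_out", "model_dir",
             "fmap", "name_dump", "name_pred"}

def render_config(settings):
    """Render settings as 'name = value' lines (assumed config layout)."""
    lines = []
    for name, value in settings.items():
        if name in PATH_KEYS:
            value = f'"{value}"'
        lines.append(f"{name} = {value}")
    return "\n".join(lines)

def checkpoint_names(num_round, save_period):
    """Model files written during training when saving every save_period rounds."""
    if save_period <= 0:
        return []   # 0 means no model is saved during training
    return [f"{r:04d}.model" for r in range(save_period, num_round + 1, save_period)]

settings = {
    "task": "train",
    "booster": "gbtree",
    "objective": "binary:logistic",
    "num_round": 30,
    "save_period": 10,
    "data": "train.txt",        # hypothetical file name
    "test:data": "test.txt",    # hypothetical file name
    "model_dir": "models",
}

config_text = render_config(settings)
print(config_text)
print(checkpoint_names(settings["num_round"], settings["save_period"]))
```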
