Principles of xgboost

Date 2019-12-07

Source: http://blog.csdn.net/a819825294

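1.Preface

It has been nearly ten months since I last edited this post. Thanks to a recommendation from 愛可可老師 (on Weibo), traffic jumped sharply. My graduation thesis is related to xgboost, so I have rewritten this article.

Few online resources explain how xgboost works; most stop at the application level. This article draws on Dr. Tianqi Chen's slides, his paper, and other online material, in the hope of understanding the principles of xgboost in depth. (Links are given in the references at the end.)

2.xgboost vs gbdt

Any discussion of xgboost has to mention gbdt. Both are boosting methods (see Figure 1). For background on gbdt, see my article on it (link).

Figure 1

Setting aside differences in engineering implementation and in the problems they address, the main difference between xgboost and gbdt is how the objective function is defined.

Note: the red arrow points to l, the loss function; the red box is the regularization term, which includes L1 and L2; the red circle is the constant term. xgboost approximates the objective with the first three terms of its Taylor expansion (that is, up to second order). The resulting objective depends on each data point only through the first and second derivatives of the loss function.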
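The objective described in this note can be written out as follows. This is a reconstruction from the description above and the standard formulation in the xgboost paper, not a copy of the original figure; here $\hat{y}_i^{(t-1)}$ is assumed to denote the prediction after $t-1$ rounds and $f_t$ the tree added in round $t$.

```latex
\begin{aligned}
Obj^{(t)} &= \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant} \\
&\approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + \text{constant}, \\
g_i &= \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big), \qquad
h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big).
\end{aligned}
```

Since $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ does not depend on $f_t$, it can be folded into the constant, which leaves only the $g_i$ and $h_i$ terms plus the regularizer.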

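3.Principles

The objective function given above can be simplified further.

（1）Defining tree complexity

We refine the definition of f by splitting a tree into a structure part q and a leaf-weight part w. The figure below shows a concrete example. The structure function q maps an input to a leaf index, and w gives the leaf score for each index.

The complexity is defined from the number of leaves in a tree and the squared L2 norm of the scores output by the leaves. This is of course not the only possible definition, but trees learned under it generally perform well. The figure below also gives an example of computing the complexity.

Note: the boxed quantities control how much each part counts in the final model formula; they correspond to the model parameters lambda and gamma.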
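In symbols, the two definitions above can be sketched as follows. The notation ($T$ for the number of leaves, $d$ for the number of features) follows the xgboost paper and is an assumption here, since the original figures are not available; the three-leaf weights in the last line are illustrative values only.

```latex
\begin{aligned}
f_t(x) &= w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \qquad q : \mathbb{R}^{d} \to \{1, 2, \dots, T\}, \\
\Omega(f_t) &= \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \\
\text{e.g. } T = 3,\ w = (2,\ 0.1,\ -1):\quad \Omega &= 3\gamma + \tfrac{1}{2}\lambda\,(4 + 0.01 + 1).
\end{aligned}
```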

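Under this new definition, the objective function can be rewritten as follows, where I_j is the set of samples on leaf j, g is the first derivative, and h is the second derivative:

This objective consists of T mutually independent single-variable quadratic functions. We can define:

The final formula simplifies to:

Setting the derivative with respect to w_j to zero gives:

Substituting the optimal solution back in gives: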
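The formulas these steps refer to can be reconstructed from the definitions in the text ($I_j$, $g$, $h$, $\lambda$, $\gamma$, $T$). The derivation below is a sketch under those definitions, with constant terms dropped; $G_j$ and $H_j$ are the per-leaf sums of first and second derivatives.

```latex
\begin{aligned}
Obj^{(t)} &\approx \sum_{i=1}^{n} \Big[ g_i\, w_{q(x_i)} + \tfrac{1}{2} h_i\, w_{q(x_i)}^2 \Big] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T} \Big[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \Big] + \gamma T, \\
G_j &= \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i, \\
Obj^{(t)} &= \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2} (H_j + \lambda)\, w_j^2 \Big] + \gamma T, \\
\frac{\partial Obj^{(t)}}{\partial w_j} &= G_j + (H_j + \lambda)\, w_j = 0
\ \Longrightarrow\ w_j^{*} = -\frac{G_j}{H_j + \lambda}, \\
Obj^{*} &= -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.
\end{aligned}
```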

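（2）Example of computing the scoring function

Obj measures the largest reduction in the objective that can be achieved once a tree structure is fixed. We call it the structure score.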
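As a minimal sketch of how the structure score could be computed, the code below sums g and h per leaf and applies w* = -G/(H + lambda) and Obj* = -1/2 * sum(G^2/(H + lambda)) + gamma * T. The function names and the toy data are hypothetical, and the example assumes a squared-error loss with an initial prediction of 0, so that g = -y and h = 1.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)


def structure_score(grads, hesses, leaf_of, num_leaves, lam=1.0, gamma=0.0):
    """Return (Obj*, leaf weights) for a fixed tree structure.

    leaf_of[i] is the leaf index that sample i falls into.
    """
    G = [0.0] * num_leaves
    H = [0.0] * num_leaves
    for g, h, j in zip(grads, hesses, leaf_of):
        G[j] += g
        H[j] += h
    obj = -0.5 * sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H))
    obj += gamma * num_leaves
    weights = [leaf_weight(Gj, Hj, lam) for Gj, Hj in zip(G, H)]
    return obj, weights


# Toy data: squared-error loss 1/2 * (y - yhat)^2 at yhat = 0 gives g = -y, h = 1.
labels = [1.0, 1.0, -1.0, -1.0]
grads = [-y for y in labels]
hesses = [1.0] * len(labels)
leaf_of = [0, 0, 1, 1]  # first two samples in leaf 0, last two in leaf 1

obj, weights = structure_score(grads, hesses, leaf_of, num_leaves=2)
print(obj, weights)
```

A lower (more negative) score means the structure fits the gradients better, after paying gamma for each leaf.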

（3）Splitting nodes

The paper gives two methods for splitting nodes.

（1）Greedy algorithm:

At each step, try adding a split to an existing leaf.
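The gain of such a split follows from the structure score: it is the score of the two new leaves minus the score of the leaf being split, minus the penalty for the extra leaf. The formula below is reconstructed from the xgboost paper, with $G_L, H_L$ and $G_R, H_R$ assumed to denote the derivative sums on the left and right side of the split.

```latex
Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
      - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```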

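For each expansion we still have to enumerate every possible split. How can this be done efficiently? Suppose we want to enumerate all conditions of the form x < a. For a particular split point a, we need the sums of the derivatives to the left and to the right of a.

It turns out that a single left-to-right scan over the sorted values is enough to obtain the gradient sums GL and GR for every candidate a. We then compute the score of each candidate split with the gain formula.

A second point worth noting about this objective is that adding a split does not necessarily improve things, because each new leaf carries a penalty. Optimizing this objective therefore corresponds to pruning the tree: when the gain from a split is below a threshold, the split can be pruned. Once the objective is derived formally, strategies such as score computation and pruning arise naturally instead of being introduced as heuristics.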
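The left-to-right scan can be sketched for a single feature as follows. This is an illustrative pure-Python version, not the xgboost implementation; `best_split` and its arguments are hypothetical names, and a split is only reported when its gain (after subtracting gamma) is positive, which mirrors the pruning argument above.

```python
def _score(G, H, lam):
    """Score term G^2 / (H + lambda) of one leaf."""
    return G * G / (H + lam)


def best_split(values, grads, hesses, lam=1.0, gamma=0.0):
    """Return (gain, threshold) of the best split 'x < threshold'.

    Returns (0.0, None) when no split has positive gain.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hesses)
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    # One pass over the sorted samples accumulates GL/HL; GR/HR follow by subtraction.
    for k in range(len(order) - 1):
        i, nxt = order[k], order[k + 1]
        GL += grads[i]
        HL += hesses[i]
        if values[i] == values[nxt]:
            continue  # no threshold separates two equal feature values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (_score(GL, HL, lam) + _score(GR, HR, lam)
                      - _score(G, H, lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thr = 0.5 * (values[i] + values[nxt])
    return best_gain, best_thr


values = [4.0, 1.0, 3.0, 2.0]
grads = [1.0, -1.0, 1.0, -1.0]
hesses = [1.0, 1.0, 1.0, 1.0]
print(best_split(values, grads, hesses))             # split between 2 and 3
print(best_split(values, grads, hesses, gamma=2.0))  # penalty too high: no split
```

Raising gamma raises the bar a split has to clear, so fewer splits survive.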

The algorithm from the paper is shown below.

（2）Approximate algorithm:

Intended mainly for data that is too large to process directly.
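The idea can be sketched as: pick a small set of candidate thresholds per feature, accumulate g and h into the buckets between them, and evaluate only those candidates. The code below is a simplified illustration with hypothetical names; it takes candidates at evenly spaced ranks, whereas the paper describes a weighted quantile sketch that uses h as the weight.

```python
import bisect


def candidate_thresholds(values, num_buckets):
    """Candidate split points taken at evenly spaced ranks of the sorted values."""
    s = sorted(values)
    n = len(s)
    return sorted({s[(k * n) // num_buckets] for k in range(1, num_buckets)})


def approx_best_split(values, grads, hesses, num_buckets=4, lam=1.0, gamma=0.0):
    """Return (gain, threshold) of the best split 'x < threshold' among the candidates."""
    if not values:
        return 0.0, None
    cands = candidate_thresholds(values, num_buckets)
    # Bucket b collects samples with cands[b-1] <= x < cands[b].
    Gb = [0.0] * (len(cands) + 1)
    Hb = [0.0] * (len(cands) + 1)
    for x, g, h in zip(values, grads, hesses):
        b = bisect.bisect_right(cands, x)
        Gb[b] += g
        Hb[b] += h
    G, H = sum(Gb), sum(Hb)
    parent = G * G / (H + lam)
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for b, thr in enumerate(cands):
        GL += Gb[b]
        HL += Hb[b]
        GR, HR = G - GL, H - HL
        if HL == 0.0 or HR == 0.0:
            continue  # one side would be empty
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hesses = [1.0] * 8
print(approx_best_split(values, grads, hesses, num_buckets=4))
```

Only `num_buckets - 1` thresholds are scored per feature, so the cost of the scoring step no longer grows with the number of distinct feature values.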

4.Custom loss function (specifying grad and hess)

（1）Loss function

（2）Deriving grad and hess
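For the log-likelihood (logistic) loss used in the official example, the derivation can be sketched as follows. The symbols are assumptions for illustration: $x$ is the raw margin predicted by the model, $p = \sigma(x)$ is the probability after the logistic transformation, and $y \in \{0, 1\}$ is the label.

```latex
\begin{aligned}
p &= \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{\partial p}{\partial x} = p\,(1 - p), \\
l(y, x) &= -\big[\, y \ln p + (1 - y) \ln (1 - p) \,\big], \\
\text{grad} = \frac{\partial l}{\partial x}
  &= -\Big[ \frac{y}{p} - \frac{1 - y}{1 - p} \Big]\, p\,(1 - p)
   = -\big[\, y\,(1 - p) - (1 - y)\,p \,\big] = p - y, \\
\text{hess} = \frac{\partial^2 l}{\partial x^2} &= \frac{\partial (p - y)}{\partial x} = p\,(1 - p).
\end{aligned}
```

These match the `grad = preds - labels` and `hess = preds * (1.0-preds)` lines of `logregobj` in the official code.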

（3）Official code

#!/usr/bin/python
import numpy as np
import xgboost as xgb
###
# advanced: customized loss function
#
print ('start running example to used customized objective function')

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

# note: for customized objective function, we leave objective as default
# note: what we are getting is margin value in prediction
# you must know what you are doing
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# user define objective function, given prediction, return gradient and second order gradient
# this is log likelihood loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0-preds)
    return grad, hess

# user defined evaluation function, return a pair metric_name, result
# NOTE: when you do customized loss function, the default prediction value is margin
# this may make builtin evaluation metric not function properly
# for example, we are doing logistic loss, the prediction is score before logistic transformation
# the builtin evaluation error assumes input is after logistic transformation
# Take this in mind when you use the customization, and maybe you need write customized evaluation function
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # return a pair metric_name, result
    # since preds are margin(before logistic transformation, cutoff at 0)
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

# training with customized objective, we can also do step by step training
# simply look at xgboost.py's implementation of train
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)

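5.Tuning xgboost parameters

xgboost has a large number of parameters; here are three approaches.

（1）GridSearch

（2）Hyperopt

（3）A hands-on article written in English that is well worth studying (link).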
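The grid-search approach can be sketched without any library: enumerate every combination of candidate values and keep the one with the lowest validation loss. In the sketch below, `grid_search` and `toy_loss` are hypothetical names; `toy_loss` is only a stand-in for a real evaluation such as cross-validated loss of an xgboost model trained with the given parameters.

```python
import itertools


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score).

    `evaluate` maps a parameter dict to a validation loss (lower is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for combo in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(params):
    """Stand-in for a real validation loss; replace with actual model evaluation."""
    return (params["max_depth"] - 4) ** 2 + (params["eta"] - 0.1) ** 2


grid = {
    "max_depth": [2, 4, 6],
    "eta": [0.05, 0.1, 0.3],
    "min_child_weight": [1, 5],
}
best_params, best_score = grid_search(grid, toy_loss)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (18 here), which is why grid search is usually run over a few parameters at a time.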

6.Engineering optimizations

（1）Column Blocks and Parallelization

（2）Cache Aware Access

A thread pre-fetches data from non-continuous memory into a continuous buffer.
The main thread accumulates gradients statistics in the continuous buffer.

（3）System Tricks

Block pre-fetching.
Utilize multiple disks to parallelize disk operations.
LZ4 compression (popular in recent years for its outstanding performance).
Unrolling loops.
OpenMP

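7.Code walkthrough

Many thanks to 楊軍 for generously sharing this material 【4】.

I read the code with SourceInsight. Some xgboost files use the .cc extension, which SourceInsight does not recognize by default, so they can be renamed with the following command: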

find ./ -name "*.cc" | awk -F "." '{print $2}' | xargs -i -t mv ./{}.cc ./{}.cpp
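The awk step above splits each path on "." and keeps the second field, so it assumes that no directory or file name contains another dot. As an alternative sketch that does not rely on that assumption, the same rename can be done with Python's pathlib; `rename_cc_to_cpp` is a hypothetical helper name.

```python
from pathlib import Path


def rename_cc_to_cpp(root):
    """Rename every *.cc file under root to *.cpp; return the number renamed."""
    count = 0
    # Materialize the list first so renaming does not disturb the directory walk.
    for path in list(Path(root).rglob("*.cc")):
        if path.is_file():
            path.rename(path.with_suffix(".cpp"))
            count += 1
    return count
```

Calling `rename_cc_to_cpp(".")` from the source root would mirror the shell command; it is worth running on a copy of the tree first, since the rename is done in place.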

Walking through the XGBoost source code reveals the following main flow:

cli_main.cc:
main()
     -> CLIRunTask()
          -> CLITrain()
               -> DMatrix::Load()
               -> learner = Learner::Create()
               -> learner->Configure()
               -> learner->InitModel()
               -> for (i = 0; i < param.num_round; ++i)
                    -> learner->UpdateOneIter()
                    -> learner->Save()
learner.cc:
Create()
     -> new LearnerImpl()
Configure()
InitModel()
     -> LazyInitModel()
          -> obj_ = ObjFunction::Create()
               -> objective.cc
                    Create()
                         -> SoftmaxMultiClassObj(multiclass_obj.cc)/
                            LambdaRankObj(rank_obj.cc)/
                            RegLossObj(regression_obj.cc)/
                            PoissonRegression(regression_obj.cc)
          -> gbm_ = GradientBooster::Create()
               -> gbm.cc
                    Create()
                         -> GBTree(gbtree.cc)/
                            GBLinear(gblinear.cc)
          -> obj_->Configure()
          -> gbm_->Configure()
UpdateOneIter()
     -> PredictRaw()
     -> obj_->GetGradient()
     -> gbm_->DoBoost()
gbtree.cc:
Configure()
     -> for (up in updaters)
          -> up->Init()
DoBoost()
     -> BoostNewTrees()
          -> new_tree = new RegTree()
          -> for (up in updaters)
               -> up->Update(new_tree)
tree_updater.cc:
Create()
     -> ColMaker/DistColMaker(updater_colmaker.cc)/
        SketchMaker(updater_skmaker.cc)/
        TreeRefresher(updater_refresh.cc)/
        TreePruner(updater_prune.cc)/
        HistMaker/CQHistMaker/
        GlobalProposalHistMaker/
        QuantileHistMaker(updater_histmaker.cc)/
        TreeSyncher(updater_sync.cc)

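The main flow above shows that the XGBoost implementation breaks the algorithm into modules. The important parts are:

I. ObjFunction: corresponds to the different loss functions and computes the first- and second-order derivatives.
II. GradientBooster: manages the model produced by boosting. Note that the booster model can be either a linear booster model or a tree booster model.
III. Updater: builds trees. There are several updaters, one per tree-building strategy. For performance, XGBoost provides both single-machine multithreaded parallelism and multi-machine distributed training, so it offers several parallel tree-building updater implementations. By parallel strategy, these include:
　　I). inter-feature exact parallelism (exact, parallel across features)
　　II). inter-feature approximate parallelism (approximate, parallel across features; it works on binned feature values, which reduces the cost of enumerating every split point of every feature)
　　III). intra-feature parallelism (parallel within a feature)

In addition, to avoid overfitting there is an updater that prunes trees (TreePruner), and an updater that communicates node model parameters in distributed settings (TreeSyncher). With this design, the main tree-building operations can be chained together as a sequence of updaters, which is consistent and clean; it can be seen as an application of the Decorator design pattern [4].

The most important part of the XGBoost implementation is tree building, and within the tree-building code the most important part is the Updater implementation. We therefore start from the Updater implementation.

Taking ColMaker (the single-machine inter-feature parallelism implementation, which uses the exact tree-building strategy) as an example, tree construction goes roughly as follows: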



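updater_colmaker.cc:
ColMaker::Update()
     -> Builder builder;
     -> builder.Update()
          -> InitData()
          -> InitNewNode() // compute statistics (gain/weight, etc.) for the tree nodes that
                           // can be split (i.e. the leaves; initially there is only one
                           // leaf, the root)
          -> for (depth = 0; depth < max tree depth; ++depth)
               -> FindSplit()
                    -> for (each feature) // inter-feature parallelism via OpenMP
                         -> UpdateSolution()
                              -> EnumerateSplit() // each thread handles one feature and
                                                  // selects the best split point for
                                                  // that feature
                              -> ParallelFindSplit()
                                   // several threads handle one feature together and select
                                   // the best split point of that feature;
                                   // each thread aggregates the statistics (grad/hess) of the
                                   // samples assigned to it;
                                   // the statistics (grad/hess) of all threads are aggregated,
                                   // and the boundary feature value of each thread's sample
                                   // set is evaluated as a split point;
                                   // within each thread, the feature values of its sample set
                                   // are enumerated as split points and the best one is chosen
                    -> SyncBestSolution()
                         // UpdateSolution()/ParallelFindSplit() above finds, for every
                         // leaf to be expanded, the best split point per feature. For
                         // example, for leaf A, OpenMP thread 1 finds the best split
                         // point of feature F1 and OpenMP thread 2 finds the best split
                         // point of feature F2, so a global sync is needed to find the
                         // best split point of leaf A.
                    -> create child nodes for the leaves that need to be split
               -> ResetPosition()
                    // update the sample-to-node mapping according to the split just made;
                    // Missing Value (i.e. default) and non-Missing Value (i.e.
                    // non-default) samples are handled separately
               -> UpdateQueueExpand()
                    // replace qexpand_ with the leaves to be expanded, as the starting
                    // point for the next round of splits
               -> InitNewNode() // compute statistics for the tree nodes that can be split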

8.Basic xgboost usage in Python and R

Task: binary classification with imbalanced classes (scale_pos_weight can mitigate this problem to some extent).
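A common starting value for scale_pos_weight, suggested in the xgboost parameter documentation, is the number of negative samples divided by the number of positive samples. A minimal sketch of that computation follows; the helper name and the toy labels are hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive samples for 0/1 labels."""
    neg = sum(1 for y in labels if y == 0)
    pos = sum(1 for y in labels if y == 1)
    if pos == 0:
        raise ValueError("no positive samples")
    return neg / pos


# Toy label vector: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": suggested_scale_pos_weight(labels),
}
print(params)
```

The resulting dict can be passed as the parameter map to training; the ratio is only a starting point and is usually tuned together with the evaluation metric.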

【Python】

【R】

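9.Important xgboost parameters

（1）objective [ default=reg:linear ] Defines the learning task and the corresponding learning objective. The available objectives are:

"reg:linear": linear regression.
"reg:logistic": logistic regression.
"binary:logistic": logistic regression for binary classification; outputs a probability.
"binary:logitraw": logistic regression for binary classification; outputs the raw score wTx before the logistic transformation.
"count:poisson": Poisson regression for count data; outputs the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
"multi:softmax": makes XGBoost use the softmax objective for multiclass classification; num_class (the number of classes) must also be set.
"multi:softprob": same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row gives the probability of the sample belonging to each class.
"rank:pairwise": set XGBoost to do ranking task by minimizing the pairwise loss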
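The reshape described for "multi:softprob" can be sketched in plain Python. The helper name and the numbers are illustrative; depending on the xgboost version, the Python package may already return a two-dimensional array, in which case no reshape is needed.

```python
def reshape_softprob(flat, nclass):
    """Turn a flat list of ndata * nclass probabilities into ndata rows of nclass columns."""
    if len(flat) % nclass != 0:
        raise ValueError("length is not a multiple of nclass")
    return [flat[i:i + nclass] for i in range(0, len(flat), nclass)]


# Two samples, three classes.
flat = [0.7, 0.2, 0.1,
        0.1, 0.3, 0.6]
probs = reshape_softprob(flat, nclass=3)
# The predicted class of each sample is the column with the largest probability.
predicted = [row.index(max(row)) for row in probs]
print(probs, predicted)
```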

（2）eval_metric: the evaluation metric. The choices are listed below:

"rmse": root mean square error
"logloss": negative log-likelihood
"error": Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
"merror": Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
"mlogloss": Multiclass logloss
"auc": Area under the curve for ranking evaluation.
"ndcg": Normalized Discounted Cumulative Gain
"map": Mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
"ndcg-", "map-", "ndcg@n-", "map@n-": In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding "-" in the evaluation metric XGBoost will evaluate these score as 0 to be consistent under some conditions.
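To make the definitions of "error" and "logloss" concrete, here is a minimal sketch that computes both from predicted probabilities and 0/1 labels. The function names are hypothetical and the clipping constant `eps` is an assumption added to avoid log(0); the built-in implementations may differ in such details.

```python
import math


def error_rate(preds, labels, threshold=0.5):
    """#(wrong cases) / #(all cases), treating preds > threshold as positive."""
    wrong = sum(1 for p, y in zip(preds, labels) if (p > threshold) != (y == 1))
    return wrong / len(labels)


def logloss(preds, labels, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


preds = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
print(error_rate(preds, labels), logloss(preds, labels))
```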

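（3）lambda [default=0] L2 regularization penalty coefficient (the default of 0 applies to the linear booster; the tree booster defaults to 1)

（4）alpha [default=0] L1 regularization penalty coefficient

（5）lambda_bias L2 regularization on the bias. Default: 0 (there is no L1 regularization on the bias, because it is not important)

（6）eta [default=0.3]
Step-size shrinkage applied in each update to prevent overfitting. After each boosting step the algorithm directly obtains the weights of the new features; eta shrinks those weights to make the boosting process more conservative.
Range: [0,1]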
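The effect of eta can be illustrated with a small sketch, assuming each boosting round contributes an additive output that is scaled by eta before being added to the running prediction. The function name and the numbers are hypothetical.

```python
def boosted_predictions(base_score, round_outputs, eta):
    """Running prediction for one sample after each round, with shrinkage eta."""
    pred = base_score
    history = []
    for out in round_outputs:
        pred += eta * out  # only a fraction eta of each round's output is applied
        history.append(pred)
    return history


# Each round proposes a correction of +1.0 for this sample.
print(boosted_predictions(0.5, [1.0, 1.0], eta=0.3))
print(boosted_predictions(0.5, [1.0, 1.0], eta=1.0))
```

A smaller eta moves the prediction more slowly, so more rounds are typically needed to reach the same fit.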

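（7）max_depth [default=6] Maximum depth of a tree. Range: [1,∞]

（8）min_child_weight [default=1]
Minimum sum of instance weights required in a child node. If a split would produce a leaf whose sum of instance weights is less than min_child_weight, the splitting process stops. In linear regression mode this corresponds to the minimum number of instances required in each node. The larger the value, the more conservative the algorithm.
Range: [0,∞]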

10.DART

The core idea is to bring dropout into XGBoost.
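As a rough sketch of what dropout means for a tree ensemble: in each boosting round a random subset of the existing trees is dropped, the new tree is fit against the remaining ones, and the weights are then rescaled. The code below is illustrative only. Dropping each tree independently with probability rate_drop is an assumption made for simplicity, and the scaling factors follow the description of normalize_type 'tree' in the xgboost documentation (new tree weight 1/(k + learning_rate), dropped trees scaled by k/(k + learning_rate), where k is the number of dropped trees).

```python
import random


def select_dropped(num_trees, rate_drop, skip_drop, rng):
    """Return the indices of the trees dropped in this boosting round."""
    if rng.random() < skip_drop:
        return []  # dropout skipped: the round behaves like ordinary boosting
    return [i for i in range(num_trees) if rng.random() < rate_drop]


def tree_normalization(k, learning_rate):
    """Return (new_tree_weight, dropped_tree_factor) for k dropped trees."""
    return 1.0 / (k + learning_rate), k / (k + learning_rate)


rng = random.Random(0)
dropped = select_dropped(num_trees=10, rate_drop=0.1, skip_drop=0.5, rng=rng)
if dropped:
    print(dropped, tree_normalization(len(dropped), learning_rate=0.1))
else:
    print("no trees dropped in this round")
```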

Example code

import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# specify parameters via map
param = {'booster': 'dart',
         'max_depth': 5, 'learning_rate': 0.1,
         'objective': 'binary:logistic', 'silent': True,
         'sample_type': 'uniform',
         'normalize_type': 'tree',
         'rate_drop': 0.1,
         'skip_drop': 0.5}
num_round = 50
bst = xgb.train(param, dtrain, num_round)
# make prediction
# ntree_limit must not be 0
preds = bst.predict(dtest, ntree_limit=num_round)

See reference 5 for more details.

11.Training XGBoost with csr_matrix

When the dataset is large and many of its columns are sparse, XGBoost can be trained on a csr_matrix to save memory.
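To see where the memory saving comes from, the sketch below builds the three arrays of the CSR (compressed sparse row) format by hand: only the nonzero values, their column indices, and one row pointer per row are stored. The function name and the toy matrix are illustrative.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)     # nonzero value
                indices.append(j)  # its column index
        indptr.append(len(data))   # where the next row starts in data/indices
    return data, indices, indptr


dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 2],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
```

scipy's `csr_matrix` accepts the same triple as `csr_matrix((data, indices, indptr), shape=(3, 3))`, and the resulting matrix can be passed to `xgb.DMatrix`.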

Below is open-source code from the TalkingData competition on Kaggle, which is worth studying; see reference 6.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import csr_matrix, hstack
import xgboost as xgb
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

datadir = '../input'
gatrain = pd.read_csv(os.path.join(datadir,'gender_age_train.csv'),
                      index_col='device_id')
gatest = pd.read_csv(os.path.join(datadir,'gender_age_test.csv'),
                     index_col = 'device_id')
phone = pd.read_csv(os.path.join(datadir,'phone_brand_device_model.csv'))
# Get rid of duplicate device ids in phone
phone = phone.drop_duplicates('device_id',keep='first').set_index('device_id')
events = pd.read_csv(os.path.join(datadir,'events.csv'),
                     parse_dates=['timestamp'], index_col='event_id')
appevents = pd.read_csv(os.path.join(datadir,'app_events.csv'),
                        usecols=['event_id','app_id','is_active'],
                        dtype={'is_active':bool})
applabels = pd.read_csv(os.path.join(datadir,'app_labels.csv'))

gatrain['trainrow'] = np.arange(gatrain.shape[0])
gatest['testrow'] = np.arange(gatest.shape[0])

# One-hot phone brand as a sparse matrix
brandencoder = LabelEncoder().fit(phone.phone_brand)
phone['brand'] = brandencoder.transform(phone['phone_brand'])
gatrain['brand'] = phone['brand']
gatest['brand'] = phone['brand']
Xtr_brand = csr_matrix((np.ones(gatrain.shape[0]),
                        (gatrain.trainrow, gatrain.brand)))
Xte_brand = csr_matrix((np.ones(gatest.shape[0]),
                        (gatest.testrow, gatest.brand)))
print('Brand features: train shape {}, test shape {}'.format(Xtr_brand.shape, Xte_brand.shape))

# One-hot device model (brand + model) as a sparse matrix
m = phone.phone_brand.str.cat(phone.device_model)
modelencoder = LabelEncoder().fit(m)
phone['model'] = modelencoder.transform(m)
gatrain['model'] = phone['model']
gatest['model'] = phone['model']
Xtr_model = csr_matrix((np.ones(gatrain.shape[0]),
                        (gatrain.trainrow, gatrain.model)))
Xte_model = csr_matrix((np.ones(gatest.shape[0]),
                        (gatest.testrow, gatest.model)))
print('Model features: train shape {}, test shape {}'.format(Xtr_model.shape, Xte_model.shape))

# Installed-apps features
appencoder = LabelEncoder().fit(appevents.app_id)
appevents['app'] = appencoder.transform(appevents.app_id)
napps = len(appencoder.classes_)
deviceapps = (appevents.merge(events[['device_id']], how='left',left_on='event_id',right_index=True)
                       .groupby(['device_id','app'])['app'].agg(['size'])
                       .merge(gatrain[['trainrow']], how='left', left_index=True, right_index=True)
                       .merge(gatest[['testrow']], how='left', left_index=True, right_index=True)
                       .reset_index())

d = deviceapps.dropna(subset=['trainrow'])
Xtr_app = csr_matrix((np.ones(d.shape[0]), (d.trainrow, d.app)),
                     shape=(gatrain.shape[0],napps))
d = deviceapps.dropna(subset=['testrow'])
Xte_app = csr_matrix((np.ones(d.shape[0]), (d.testrow, d.app)),
                     shape=(gatest.shape[0],napps))
print('Apps data: train shape {}, test shape {}'.format(Xtr_app.shape, Xte_app.shape))

# App-label features
applabels = applabels.loc[applabels.app_id.isin(appevents.app_id.unique())]
applabels['app'] = appencoder.transform(applabels.app_id)
labelencoder = LabelEncoder().fit(applabels.label_id)
applabels['label'] = labelencoder.transform(applabels.label_id)
nlabels = len(labelencoder.classes_)

devicelabels = (deviceapps[['device_id','app']]
                .merge(applabels[['app','label']])
                .groupby(['device_id','label'])['app'].agg(['size'])
                .merge(gatrain[['trainrow']], how='left', left_index=True, right_index=True)
                .merge(gatest[['testrow']], how='left', left_index=True, right_index=True)
                .reset_index())
devicelabels.head()

d = devicelabels.dropna(subset=['trainrow'])
Xtr_label = csr_matrix((np.ones(d.shape[0]), (d.trainrow, d.label)),
                       shape=(gatrain.shape[0],nlabels))
d = devicelabels.dropna(subset=['testrow'])
Xte_label = csr_matrix((np.ones(d.shape[0]), (d.testrow, d.label)),
                       shape=(gatest.shape[0],nlabels))
print('Labels data: train shape {}, test shape {}'.format(Xtr_label.shape, Xte_label.shape))

# Stack all sparse blocks column-wise
Xtrain = hstack((Xtr_brand, Xtr_model, Xtr_app, Xtr_label), format='csr')
Xtest = hstack((Xte_brand, Xte_model, Xte_app, Xte_label), format='csr')
print('All features: train shape {}, test shape {}'.format(Xtrain.shape, Xtest.shape))

targetencoder = LabelEncoder().fit(gatrain.group)
y = targetencoder.transform(gatrain.group)

########## XGBOOST ##########

params = {}
params['booster'] = 'gblinear'
params['objective'] = "multi:softprob"
params['eval_metric'] = 'mlogloss'
params['eta'] = 0.005
params['num_class'] = 12
params['lambda'] = 3
params['alpha'] = 2

# Random 10% for validation (first fold of a stratified 10-fold split)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=4242)
kf = next(skf.split(Xtrain, y))

Xtr, Xte = Xtrain[kf[0], :], Xtrain[kf[1], :]
ytr, yte = y[kf[0]], y[kf[1]]

print('Training set: ' + str(Xtr.shape))
print('Validation set: ' + str(Xte.shape))

d_train = xgb.DMatrix(Xtr, label=ytr)
d_valid = xgb.DMatrix(Xte, label=yte)

watchlist = [(d_train, 'train'), (d_valid, 'eval')]

clf = xgb.train(params, d_train, 1000, watchlist, early_stopping_rounds=25)

pred = clf.predict(xgb.DMatrix(Xtest))

pred = pd.DataFrame(pred, index = gatest.index, columns=targetencoder.classes_)
pred.head()
pred.to_csv('sparse_xgb.csv', index=True)

#params['lambda'] = 1
#for alpha in [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
#    params['alpha'] = alpha
#    clf = xgb.train(params, d_train, 1000, watchlist, early_stopping_rounds=25)
#    print('^' + str(alpha))