Kaggle's Secret Weapon: XGBoost

Date: 2019-11-18


In many Kaggle competitions, a large share of the winners use xgboost and get very good results with it. Today we look at what xgboost actually is and how to apply it.

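What is xgboost?

XGBoost: eXtreme Gradient Boosting
Project: https://github.com/dmlc/xgboost

It is a library originally developed by Tianqi Chen (http://homes.cs.washington.edu/~tqchen/) that implements scalable, portable, distributed gradient boosting (GBDT, GBRT or GBM). It can be installed and used from C++, Python, R, Julia, Java, Scala and Hadoop, and is now developed and maintained by many contributors.

The algorithm XGBoost implements is the gradient boosting decision tree, which can be used for both classification and regression problems.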

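So what is gradient boosting?

Gradient boosting is one kind of boosting.

Boosting is a way of combining weak classifiers f_i(x) into a strong classifier F(x).

Boosting therefore has three elements:

A loss function to be optimized:
for example, cross entropy for classification and mean squared error for regression.
A weak learner to make predictions:
for example, a decision tree.
An additive model:
the weak learners are added together to form a strong learner, so that the target loss function is minimized.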
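
To make the first element concrete, here is a minimal sketch of the two loss functions named above, written in plain Python so it runs without any library. The function names and the sample numbers are illustrative assumptions, not part of xgboost.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference, the usual regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for 0/1 labels (also called log loss)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical targets and predictions, for illustration only.
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
bce = binary_cross_entropy([1, 0], [0.9, 0.2])
print("MSE: %.4f, cross entropy: %.4f" % (mse, bce))
```

Both losses shrink as predictions move toward the targets, which is what the additive model tries to achieve one weak learner at a time.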

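Gradient boosting adds new weak learners that each try to correct the residual errors left by all the previous learners. The learners are then summed to make the final prediction, which is more accurate than any single one of them. It is called "gradient" because a gradient descent procedure is used to minimize the loss when new models are added.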
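
The residual-fitting idea can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: squared-error loss (whose negative gradient is exactly the residual), one-feature decision stumps as the weak learner, and made-up data. XGBoost's real implementation is considerably more elaborate; the names fit_stump and gradient_boost are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares decision stump: one threshold, one constant per side."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additive model: start from the mean, then add stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    history = []  # training MSE after each round
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Hypothetical one-feature regression data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1, 6.8, 8.1]
predict, history = gradient_boost(xs, ys)
print("MSE after round 1: %.4f, after round %d: %.4f"
      % (history[0], len(history), history[-1]))
```

Each round only has to explain what the previous rounds got wrong, so the training error never increases in this setup. The learning_rate shrinks every step, which is why a smaller rate needs more rounds to reach the same fit.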

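Why use xgboost?

As noted above, XGBoost is an implementation of gradient boosting decision trees. Gradient boosting implementations are generally slow, because each tree has to be built and appended to the model sequence one at a time.

XGBoost stands out for fast computation and strong model performance, which are also the two goals of the project.

It is fast because of the following design:

Parallelization:
tree construction can use all CPU cores during training.
Distributed Computing:
very large models can be trained with distributed computing.
Out-of-Core Computing:
very large datasets that do not fit in memory can still be processed.
Cache Optimization of data structures and algorithms:
to make better use of the hardware.

A benchmark comparing XGBoost with other gradient boosting and bagged decision tree implementations shows that it is faster than the baseline configurations in R, Python, Spark and H2O.

The other advantage is very good model performance on prediction problems. Below are several write-ups and post-competition interviews from Kaggle winners that show how XGBoost does in practice.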

Vlad Sandulescu, Mihai Chiru, 1st place of the KDD Cup 2016 competition. Link to the arxiv paper.
Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the Dato Truly Native? competition. Link to the Kaggle interview.
Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition. Link to the Kaggle interview.

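How to use it?

Let's start with a simple binary classification problem in XGBoost, using the dataset below to predict whether a patient will develop diabetes within 5 years. The first 8 columns are the input variables and the last column is the label, 0 or 1.

Dataset description:
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Download the dataset and save it as "pima-indians-diabetes.csv":
https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

1. Basic usage

Import xgboost and the other packages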

from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Separate the input variables from the labels

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

X = dataset[:,0:8]
Y = dataset[:,8]

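Split the data into a training set and a test set; the training set is used to fit the model and the test set to evaluate its predictions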

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
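
To see what a seeded split with test_size = 0.33 amounts to, here is a plain-Python sketch. It assumes the 768 rows of the Pima dataset and rounds the test share up, which as far as we know matches scikit-learn's behaviour. The helper split_indices is hypothetical, and its shuffle will not reproduce the exact rows that train_test_split selects.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same order every run
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(768, 0.33, seed=7)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Fixing the seed is what makes the reported accuracy reproducible from run to run.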

xgboost ships ready-made classifier and regressor wrappers, so the model can be built directly with XGBClassifier.
The XGBClassifier documentation is here:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

model = XGBClassifier()
model.fit(X_train, y_train)

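XGBoost's underlying booster outputs, for each sample, the probability of belonging to the positive class. XGBClassifier.predict already converts these to class labels, so the round below only makes sure the values are plain 0/1 integers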

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

This gives Accuracy: 77.95%

accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
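
If you start from probabilities instead (for example from predict_proba on the scikit-learn wrapper), an explicit threshold is clearer than round, because Python 3 rounds 0.5 to 0. The sketch below uses made-up probabilities and labels; to_labels and accuracy are hypothetical helper names.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical values, for illustration only.
probs = [0.91, 0.12, 0.50, 0.35, 0.77]
y_true = [1, 0, 1, 0, 0]

labels = to_labels(probs)
acc = accuracy(y_true, labels)
print(labels, "Accuracy: %.2f%%" % (acc * 100.0))
```

With a threshold the borderline case 0.50 is assigned to class 1 on purpose, rather than depending on how round treats ties.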

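2. Monitoring model performance

xgboost can evaluate the model on a held-out set during training and print the score at every step. Here the test set serves as that evaluation set.

Simply change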

model = XGBClassifier()
model.fit(X_train, y_train)

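to:

model = XGBClassifier()
eval_set = [(X_test, y_test)]
# Stop when logloss on eval_set has not improved for 10 rounds.
# Note: xgboost 2.0 and later take early_stopping_rounds and eval_metric in the XGBClassifier(...) constructor instead of fit().
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)

It then prints the logloss after each tree is added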

[31]    validation_0-logloss:0.487867
[32]    validation_0-logloss:0.487297
[33]    validation_0-logloss:0.487562

and reports the early stopping point:

Stopping. Best iteration:
[32]    validation_0-logloss:0.487297
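
The stopping rule itself is simple enough to sketch in plain Python: remember the best validation loss so far and stop once a given number of rounds pass without improvement. The function name and the loss sequence below are hypothetical, and this illustrates the idea rather than xgboost's internal code.

```python
def early_stopping(losses, patience):
    """Return (best_iteration, stop_iteration) for a list of validation losses."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, i  # no improvement for `patience` rounds
    return best_iter, len(losses) - 1  # ran out of rounds without stopping


# Hypothetical validation losses, one per boosting round.
losses = [0.60, 0.55, 0.52, 0.50, 0.49, 0.495, 0.493, 0.497, 0.50]
best_iter, stop_iter = early_stopping(losses, patience=3)
print("best iteration:", best_iter, "stopped at:", stop_iter)
```

In the log above the best iteration is 32, so with early_stopping_rounds=10 training would end ten rounds later.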

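3. Feature importance

Another advantage of gradient boosting is that a trained model can report the importance of each feature,
which tells us which variables are worth keeping and which can be dropped.

This needs the following two imports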

from xgboost import plot_importance
from matplotlib import pyplot

Compared with the earlier code, just add two lines after fit to plot the feature importance

model.fit(X, Y)

plot_importance(model)
pyplot.show()
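
Once importance scores are available, deciding which variables to keep is a matter of ranking and thresholding. The sketch below uses made-up scores; in practice they could come from model.feature_importances_ on the fitted classifier. The short column names for the eight Pima variables and the 0.10 cutoff are assumptions for illustration.

```python
# Short names for the eight input columns (assumed labels, not read from the CSV).
feature_names = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
                 "insulin", "bmi", "pedigree", "age"]

# Hypothetical importance scores that sum to 1.
importances = [0.09, 0.27, 0.08, 0.07, 0.10, 0.17, 0.10, 0.12]

# Rank from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

threshold = 0.10  # arbitrary cutoff chosen for this example
kept = [name for name, score in ranked if score >= threshold]
dropped = [name for name, score in ranked if score < threshold]

for name, score in ranked:
    print("%-15s %.2f" % (name, score))
print("keep:", kept)
```

A cutoff like this is a starting point only; retraining on the reduced set and comparing scores shows whether dropping a variable actually costs accuracy.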

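4. Hyperparameter tuning

How should the parameters be tuned? Below are commonly used starting ranges for three hyperparameters. Set them within these ranges first, plot the learning curves, then adjust the parameters to find the best model:

learning_rate = 0.1 or smaller; the smaller it is, the more weak learners are needed;
tree depth (max_depth) = 2 to 8;
subsample = 30% to 80% of the training set;

GridSearchCV makes the tuning more convenient.

Hyperparameter combinations worth tuning include:

The number and size of trees (n_estimators and max_depth).
The learning rate and number of trees (learning_rate and n_estimators).
Row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel).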
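
Tuning two parameters together multiplies the work, which is worth estimating before starting a search. Here is a small sketch that enumerates a grid over the first pair with itertools.product; the candidate values are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical candidate values for the first combination listed above.
param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [2, 4, 6, 8],
}

# Every combination of one value per parameter.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_splits = 10  # folds in the cross-validation
total_fits = len(combos) * n_splits
print(len(combos), "combinations,", total_fits, "model fits")
```

A dictionary in this shape is what GridSearchCV accepts as param_grid, so the count gives a rough idea of the cost before running it.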

Take the learning rate as an example.

First import these two classes

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

Set the candidate values learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3].
Compared with the original code, just add these grid search lines after the model:

model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)

The search finds that the best learning rate is 0.1, and the print statement below outputs:
Best: -0.483013 using {'learning_rate': 0.1}

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

We can also print the score for each learning rate with the code below:

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

which prints:

-0.689650 (0.000242) with: {'learning_rate': 0.0001}
-0.661274 (0.001954) with: {'learning_rate': 0.001}
-0.530747 (0.022961) with: {'learning_rate': 0.01}
-0.483013 (0.060755) with: {'learning_rate': 0.1}
-0.515440 (0.068974) with: {'learning_rate': 0.2}
-0.557315 (0.081738) with: {'learning_rate': 0.3}
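
The scores are negative because scoring="neg_log_loss" reports the log loss with its sign flipped, so that a larger score is always better. The short sketch below picks the winner from the mean scores printed above; the dictionary is typed in by hand rather than read from grid_result.

```python
# Mean neg_log_loss per learning rate, copied from the printed results.
results = {
    0.0001: -0.689650,
    0.001: -0.661274,
    0.01: -0.530747,
    0.1: -0.483013,
    0.2: -0.515440,
    0.3: -0.557315,
}

# Highest score means lowest log loss.
best_lr = max(results, key=results.get)
best_log_loss = -results[best_lr]
print("best learning_rate: %s, log loss: %f" % (best_lr, best_log_loss))
```

Note that the standard deviations grow with the learning rate, so the gap between 0.1 and 0.2 is smaller than the fold-to-fold spread.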

Finally, here is the complete code

# coding=utf-8


from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import plot_importance
from matplotlib import pyplot
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

X = dataset[:, 0:8]
Y = dataset[:, 8]

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)

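# Monitor logloss on the test set and stop after 10 rounds without improvement.
# Note: xgboost 2.0 and later take early_stopping_rounds and eval_metric in the XGBClassifier(...) constructor instead of fit().
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)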

# plot_importance(model)
# pyplot.show()

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Best learning rate
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# Print the score for each learning rate
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Reposted from: https://www.jianshu.com/p/7e0e2d66b3d4
