Advanced Ensemble Learning Techniques

Examined ensemble methods

  • Averaging (or blending)
  • Weighted averaging
  • Conditional averaging
  • Bagging
  • Boosting
  • Stacking
  • StackNet

Averaging ensemble methods

For example, suppose we have a variable called age, something like a person's age, and we are trying to predict it. We have two models:

  • Below 50, model 1 performs better
    model1.png

  • Above 50, model 2 performs better
    model2.png

So what happens if we try to combine them?

Averaging (or blending)

  • (model1 + model2) / 2
    model12.png

$R^2$ rises to 0.95, an improvement over before. The combined model is not better than each single model in the region where that model already does well, but it does perform better on average. Could there be an even better combination? Let's try weighted averaging.

Weighted averaging

  • (model1 x 0.7 + model2 x 0.3)
    model_weight.png

It does not look as good as the previous one.

Conditional averaging

  • Use each model where it does well
    model_best.png

Ideally, this is the kind of result we would like to get.
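
A minimal sketch of the three schemes above, using toy data in place of the age example (all names and numbers are hypothetical, just to make the snippet runnable):

import numpy as np
from sklearn.metrics import r2_score

# toy stand-ins: the target is age itself, model1 is good below 50, model2 above 50
rng = np.random.RandomState(0)
age = rng.uniform(0, 100, size=1000)
y = age
pred1 = np.where(age < 50, age, 50) + rng.normal(0, 2, size=1000)   # model1: good below 50
pred2 = np.where(age >= 50, age, 50) + rng.normal(0, 2, size=1000)  # model2: good above 50

blend = (pred1 + pred2) / 2                      # simple averaging (blending)
weighted = pred1 * 0.7 + pred2 * 0.3             # weighted averaging
conditional = np.where(age < 50, pred1, pred2)   # conditional averaging: each model where it is good

for name, p in [("average", blend), ("weighted", weighted), ("conditional", conditional)]:
    print(name, r2_score(y, p))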

Bagging

Why Bagging

There are two main sources of error in modeling:

  • 1. Error due to bias (underfitting)
  • 2. Error due to variance (overfitting)

By training slightly different versions of the same model and averaging them, we make sure the predictions do not have very high variance. This usually makes the model generalize better.

Parameters that control bagging?

  • Changing the seed
  • Row(Sub) sampling or Bootstrapping
  • Shuffling
  • Column(Sub) sampling
  • Model-specific parameters
  • Number of models (or bags)
  • (Optionally) parallelism

Examples of bagging

bagging_code.png
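
The bagging code is shown only as the image above; the following is a minimal sketch of the same idea with toy data (all names are hypothetical):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# toy data, only to make the sketch runnable
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=100)
bags = 10   # number of models (bags)
seed = 1

bagged_prediction = np.zeros(x_test.shape[0])
for n in range(bags):
    model.set_params(random_state=seed + n)    # change the seed for each bag
    model.fit(x_train, y_train)
    bagged_prediction += model.predict(x_test)
bagged_prediction /= bags                      # average the predictions over all bags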

Boosting

Boosting is a form of weighted averaging of models in which each model is built sequentially, taking into account the performance of the models built before it.

Weight based boosting

weight_based.png

Suppose we have a tabular dataset with four features. Let's call them x0, x1, x2 and x3, and we want to use these features to predict a target variable y.
We call the predictions pred; these predictions have some error. We can compute the absolute errors, |y - pred|, and based on them generate a new column (or vector): here we create a weight column equal to 1 plus the absolute error. There are of course different ways to compute this weight; we just use this one as an example.

All that remains is to fit a new model on the same features, but this time also passing in this weight column. This is how models are added sequentially.
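
A minimal sketch of this weight-based scheme, assuming a base model whose fit method accepts sample_weight (the data and the "1 + absolute error" rule are just the toy example from above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy tabular data with four features x0..x3 and target y
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

weights = np.ones_like(y)             # start with uniform weights
models = []
for _ in range(3):                    # add models sequentially
    m = DecisionTreeRegressor(max_depth=2)
    m.fit(X, y, sample_weight=weights)
    pred = m.predict(X)
    weights = 1 + np.abs(y - pred)    # weight = 1 + absolute error, as in the example
    models.append(m)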

Weight based boosting parameters

  • Learning rate (or shrinkage or eta)
  • Trust each model only a little: predictionN = pred0*eta + pred1*eta + ... + predN*eta
  • Number of estimators
  • If you double the number of estimators, halve eta
  • Input model - can be anything that accepts weights
  • Sub boosting type:
  • AdaBoost - good implementation in sklearn (Python)
  • LogitBoost - good implementation in Weka (Java)
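
For reference, a minimal AdaBoost usage sketch in sklearn; the arguments correspond to the parameters listed above (toy data, hypothetical values):

import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

ada = AdaBoostRegressor(n_estimators=100, learning_rate=0.1)  # number of estimators, eta
ada.fit(X, y)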

Residual based boosting

We use the same dataset and do the same thing. After obtaining the predictions pred:
residual_pred.png

Next we compute the errors:
residual_error.png

We then use the error as the new y and obtain new predictions, new_pred:
residual_new_pred.png

Take Rownum=1 as an example:

Final prediction = 0.75 + 0.20 = 0.95, which is closer to 1.

This approach works well and reduces the error nicely.
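
A minimal hand-rolled sketch of this residual-based scheme, with a learning rate eta as described in the parameter list below (toy data, hypothetical values):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

eta = 0.1                                            # learning rate / shrinkage
base = DecisionTreeRegressor(max_depth=2).fit(X, y)
prediction = base.predict(X)                         # pred0 is used in full
for _ in range(100):                                 # number of estimators
    residual = y - prediction                        # current errors
    m = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # fit the next model on the errors
    prediction = prediction + eta * m.predict(X)     # trust each new model only a little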

Residual based boosting parameters

  • Learning rate (or shrinkage or eta)
  • predictionN = pred0 + pred1*eta + ... + predN*eta
  • In the previous example, if eta is 0.1, then Prediction = 0.75 + 0.2*0.1 = 0.77
  • Number of estimators
  • Row (sub)sampling
  • Column (sub)sampling
  • Input model - better be trees.
  • Sub boosting type:
  • Full gradient based
  • Dart

Residual based favourite implementations

  • Xgboost
  • Lightgbm
  • H2O's GBM
  • Catboost
  • Sklearn's GBM
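
A minimal usage sketch with sklearn's GBM (needs no extra installation); the arguments map to the parameters listed above (toy data, hypothetical values):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 500)

gbm = GradientBoostingRegressor(
    n_estimators=200,    # number of estimators
    learning_rate=0.1,   # eta / shrinkage
    subsample=0.8,       # row (sub)sampling
    max_features=0.8,    # column (sub)sampling
    max_depth=3,         # keep the trees shallow
)
gbm.fit(X, y)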

Stacking

Methodology

  • Wolpert in 1992 introduced stacking. It involves:
    1. Splitting the train set into two disjoint sets.
    2. Training several base learners on the first part.
    3. Making predictions with the base learners on the second (validation) part.
    4. Using those predictions as inputs to train a higher-level (meta) learner.

Concrete steps

Suppose we have three datasets A, B and C, where the target variable y is known for A and B.
stacking_data.png

Then:

  • Algorithm 0 fits A, predicts B and C, and the predictions pred0 are saved into B1 and C1
  • Algorithm 1 fits A, predicts B and C, and the predictions pred1 are saved into B1 and C1
  • Algorithm 2 fits A, predicts B and C, and the predictions pred2 are saved into B1 and C1
    stacking_data2.png

  • Algorithm 3 fits B1 and predicts C1, giving the final predictions preds3

Stacking example

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split
train = ''  # your training set
y = ''      # your target variable
test = ''   # your test set
# split the train data into 2 parts: training and validation
training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)
# specify models
model1 = RandomForestRegressor()
model2 = LinearRegression()
#fit models
model1.fit(training, ytraining)
model2.fit(training, ytraining)
# make predictions for validation
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)
# make predictions for test data
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)
# Form a new dataset for valid and test by stacking the predictions
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))
# specify meta model
meta_model = LinearRegression()
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)

Stacking (past) example

stacking_past.png

As you can see, the result is very similar to what we got with conditional averaging; it is just not quite right around 50. This makes sense: the meta model never sees the target variable, so it cannot identify the cutoff at 50 exactly and has to infer it from the base models' predictions alone.

Things to be mindful of

  • With time sensitive data - respect time
  • If your data has a time component, you need to set up your stacking so that it respects time (see the sketch after this list).
  • Diversity is as important as performance
  • The performance of each single model matters, but the diversity of the models matters just as much. When a model is bad or weak, do not worry too much: stacking can still extract the useful part of each prediction. So the real question is what information a model adds, even if it is weak overall.
  • Diversity may come from:
  • Different algorithms
  • Different input features
  • Performance plateauing after N models
  • Meta model is normally modest
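
For the time-related point above, a minimal sketch of a time-respecting split for stacking (the column names and the 50/50 cut are hypothetical):

import numpy as np
import pandas as pd

# toy frame with a time column
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=100),
                   'feature': np.random.rand(100),
                   'y': np.random.rand(100)})

df = df.sort_values('date')
cut = int(len(df) * 0.5)
base_part = df.iloc[:cut]   # earlier period: train the base learners here
meta_part = df.iloc[cut:]   # later period: base learners predict here, the meta model is trained here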

StackNet

https://github.com/kaz-Anova/StackNet

Ensembling Tips and Tricks

$1^{st}$ level tips

  • Diversity based on algorithms:
  • 2-3 gradient boosted trees (lightgbm, xgboost, H2O, catboost)
  • 2-3 Neural nets (keras, pytorch)
  • 1-2 ExtraTrees/RandomForest (sklearn)
  • 1-2 linear models as in logistic/ridge regression, linear svm (sklearn)
  • 1-2 knn models (sklearn)
  • 1 Factorization machine (libfm)
  • 1 svm with a nonlinear kernel (like RBF) if size/memory allows (sklearn)
  • Diversity based on input data:
  • Categorical features: One hot, label encoding, target encoding, likelihood encoding, frequency or counts
  • Numerical features: outliers, binning, derivatives, percentiles, scaling
  • Interactions: col1*/+-col2, groupby, unsupervised

$2^{nd}$ level tips

  • Simpler (or shallower) Algorithms:
  • gradient boosted trees with small depth (like 2 or 3)
  • Linear models with high regularization
  • Extra Trees (just don't make them too big)
  • Shallow networks (as in 1 hidden layer, with not that many hidden neurons)
  • knn with BrayCurtis Distance
  • Brute forcing a search for the best linear weights based on cv (see the sketch after this list)

  • Feature engineering:
  • pairwise differences between meta features
  • row-wise statistics like averages or stds
  • Standard feature selection techniques
  • For every 7.5 models in the previous level we add 1 in the meta level (a rule of thumb)
  • Be mindful of target leakage
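
For the brute-force weight search mentioned above, a minimal sketch (preds1, preds2 and y_valid stand for out-of-fold predictions and their targets; all values are hypothetical):

import numpy as np
from sklearn.metrics import mean_squared_error

# toy out-of-fold predictions of two models and the matching targets
rng = np.random.RandomState(0)
y_valid = rng.rand(500)
preds1 = y_valid + rng.normal(0, 0.10, 500)
preds2 = y_valid + rng.normal(0, 0.20, 500)

best_w, best_score = None, np.inf
for w in np.linspace(0, 1, 101):                 # grid of candidate weights
    score = mean_squared_error(y_valid, w * preds1 + (1 - w) * preds2)
    if score < best_score:
        best_w, best_score = w, score
print(best_w, best_score)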
