Advanced Ensemble Learning Techniques

Examined ensemble methods

  • Averaging (or blending)
  • Weighted averaging
  • Conditional averaging
  • Bagging
  • Boosting
  • Stacking
  • StackNet

Averaging ensemble methods

For example, suppose we have a variable called age, something like a person's age, and we are trying to predict it. We have two models:

  • Below 50, model 1 performs better
    model1.png

  • Above 50, model 2 performs better
    model2.png

So what happens if we try to combine them?

Averaging (or blending)

  • (model1 + model2) / 2
    model12.png

$R^2$ rises to 0.95, an improvement over before. The combined model is not better than each single model in the region where that model already does well, but it does perform better on average. Could there be an even better combination? Let's try weighted averaging.

Weighted averaging

  • (model1 x 0.7 + model2 x 0.3)
    model_weight.png

It does not look as good as the previous one.

Conditional averaging

  • Use each model where it does well
    model_best.png

Ideally, this is the kind of result we would like to get.
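
A minimal sketch of the three schemes above, using toy data in place of the age example (all names and numbers are hypothetical, just to make the snippet runnable):

import numpy as np
from sklearn.metrics import r2_score

# toy stand-ins: the target is age itself, model1 is good below 50, model2 above 50
rng = np.random.RandomState(0)
age = rng.uniform(0, 100, size=1000)
y = age
pred1 = np.where(age < 50, age, 50) + rng.normal(0, 2, size=1000)   # model1: good below 50
pred2 = np.where(age >= 50, age, 50) + rng.normal(0, 2, size=1000)  # model2: good above 50

blend = (pred1 + pred2) / 2                      # simple averaging (blending)
weighted = pred1 * 0.7 + pred2 * 0.3             # weighted averaging
conditional = np.where(age < 50, pred1, pred2)   # conditional averaging: each model where it is good

for name, p in [("average", blend), ("weighted", weighted), ("conditional", conditional)]:
    print(name, r2_score(y, p))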

Bagging

Why Bagging

There are two main sources of error in modeling:

  • 1. Error due to bias (underfitting)
  • 2. Error due to variance (overfitting)

By training slightly different versions of the same model and averaging them, we make sure the predictions do not have very high variance. This usually makes the model generalize better.

Parameters that control bagging?

  • Changing the seed
  • Row(Sub) sampling or Bootstrapping
  • Shuffling
  • Column(Sub) sampling
  • Model-specific parameters
  • Number of models (or bags)
  • (Optionally) parallelism

Examples of bagging

bagging_code.png
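
The bagging code is shown only as the image above; the following is a minimal sketch of the same idea with toy data (all names are hypothetical):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# toy data, only to make the sketch runnable
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=100)
bags = 10   # number of models (bags)
seed = 1

bagged_prediction = np.zeros(x_test.shape[0])
for n in range(bags):
    model.set_params(random_state=seed + n)    # change the seed for each bag
    model.fit(x_train, y_train)
    bagged_prediction += model.predict(x_test)
bagged_prediction /= bags                      # average the predictions over all bags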

Boosting

Boosting is a form of weighted averaging of models in which each model is built sequentially, taking into account the performance of the models built before it.

Weight based boosting

weight_based.png

Suppose we have a tabular dataset with four features. Let's call them x0, x1, x2 and x3, and we want to use these features to predict a target variable y.
We call the predictions pred; these predictions have some error. We can compute the absolute errors, |y - pred|, and based on them generate a new column (or vector): here we create a weight column equal to 1 plus the absolute error. There are of course different ways to compute this weight; we just use this one as an example.

All that remains is to fit a new model on the same features, but this time also passing in this weight column. This is how models are added sequentially.
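
A minimal sketch of this weight-based scheme, assuming a base model whose fit method accepts sample_weight (the data and the "1 + absolute error" rule are just the toy example from above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy tabular data with four features x0..x3 and target y
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

weights = np.ones_like(y)             # start with uniform weights
models = []
for _ in range(3):                    # add models sequentially
    m = DecisionTreeRegressor(max_depth=2)
    m.fit(X, y, sample_weight=weights)
    pred = m.predict(X)
    weights = 1 + np.abs(y - pred)    # weight = 1 + absolute error, as in the example
    models.append(m)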

Weight based boosting parameters

  • Learning rate (or shrinkage or eta)
  • Trust each model only a little: predictionN = pred0*eta + pred1*eta + ... + predN*eta
  • Number of estimators
  • If you double the number of estimators, halve eta
  • Input model - can be anything that accepts weights
  • Sub boosting type:
  • AdaBoost - good implementation in sklearn (Python)
  • LogitBoost - good implementation in Weka (Java)
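
For reference, a minimal AdaBoost usage sketch in sklearn; the arguments correspond to the parameters listed above (toy data, hypothetical values):

import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

ada = AdaBoostRegressor(n_estimators=100, learning_rate=0.1)  # number of estimators, eta
ada.fit(X, y)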

Residual based boosting

We use the same dataset and do the same thing. After obtaining the predictions pred:
residual_pred.png

Next we compute the errors:
residual_error.png

We then use the error as the new y and obtain new predictions, new_pred:
residual_new_pred.png

Take Rownum=1 as an example:

Final prediction = 0.75 + 0.20 = 0.95, which is closer to 1.

This approach works well and reduces the error nicely.
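
A minimal hand-rolled sketch of this residual-based scheme, with a learning rate eta as described in the parameter list below (toy data, hypothetical values):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

eta = 0.1                                            # learning rate / shrinkage
base = DecisionTreeRegressor(max_depth=2).fit(X, y)
prediction = base.predict(X)                         # pred0 is used in full
for _ in range(100):                                 # number of estimators
    residual = y - prediction                        # current errors
    m = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # fit the next model on the errors
    prediction = prediction + eta * m.predict(X)     # trust each new model only a little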

Residual based boosting parameters

  • Learning rate (or shrinkage or eta)
  • predictionN = pred0 + pred1*eta + ... + predN*eta
  • In the previous example, if eta is 0.1, then Prediction = 0.75 + 0.2*0.1 = 0.77
  • Number of estimators
  • Row (sub)sampling
  • Column (sub)sampling
  • Input model - better be trees.
  • Sub boosting type:
  • Full gradient based
  • Dart

Residual based favourite implementations

  • Xgboost
  • Lightgbm
  • H2O's GBM
  • Catboost
  • Sklearn's GBM
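
A minimal usage sketch with sklearn's GBM (needs no extra installation); the arguments map to the parameters listed above (toy data, hypothetical values):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 500)

gbm = GradientBoostingRegressor(
    n_estimators=200,    # number of estimators
    learning_rate=0.1,   # eta / shrinkage
    subsample=0.8,       # row (sub)sampling
    max_features=0.8,    # column (sub)sampling
    max_depth=3,         # keep the trees shallow
)
gbm.fit(X, y)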

Stacking

Methodology

  • Wolpert in 1992 introduced stacking. It involves:
    1. Splitting the train set into two disjoint sets.
    2. Training several base learners on the first part.
    3. Making predictions with the base learners on the second (validation) part.
    4. Using those predictions as inputs to train a higher-level (meta) learner.

Concrete steps

Suppose we have three datasets A, B and C, where the target variable y is known for A and B.
stacking_data.png

Then:

  • Algorithm 0 fits A, predicts B and C, and the predictions pred0 are saved into B1 and C1
  • Algorithm 1 fits A, predicts B and C, and the predictions pred1 are saved into B1 and C1
  • Algorithm 2 fits A, predicts B and C, and the predictions pred2 are saved into B1 and C1
    stacking_data2.png

  • Algorithm 3 fits B1 and predicts C1, giving the final predictions preds3

Stacking example

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split
train = ''  # your training set
y = ''      # your target variable
test = ''   # your test set
# split the train data into 2 parts: training and validation
training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)
# specify models
model1 = RandomForestRegressor()
model2 = LinearRegression()
#fit models
model1.fit(training, ytraining)
model2.fit(training, ytraining)
# make predictions for validation
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)
# make predictions for test data
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)
# Form a new dataset for valid and test by stacking the predictions
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))
# specify meta model
meta_model = LinearRegression()
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)

Stacking (past) example

stacking_past.png

As you can see, the result is very similar to what we got with conditional averaging; it is just not quite right around 50. This makes sense: the meta model never sees the target variable, so it cannot identify the cutoff at 50 exactly and has to infer it from the base models' predictions alone.

Things to be mindful of

  • With time sensitive data - respect time
  • If your data has a time component, you need to set up your stacking so that it respects time (see the sketch after this list).
  • Diversity is as important as performance
  • The performance of each single model matters, but the diversity of the models matters just as much. When a model is bad or weak, do not worry too much: stacking can still extract the useful part of each prediction. So the real question is what information a model adds, even if it is weak overall.
  • Diversity may come from:
  • Different algorithms
  • Different input features
  • Performance plateauing after N models
  • Meta model is normally modest
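
For the time-related point above, a minimal sketch of a time-respecting split for stacking (the column names and the 50/50 cut are hypothetical):

import numpy as np
import pandas as pd

# toy frame with a time column
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=100),
                   'feature': np.random.rand(100),
                   'y': np.random.rand(100)})

df = df.sort_values('date')
cut = int(len(df) * 0.5)
base_part = df.iloc[:cut]   # earlier period: train the base learners here
meta_part = df.iloc[cut:]   # later period: base learners predict here, the meta model is trained here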

StackNet

https://github.com/kaz-Anova/StackNet

Ensembling Tips and Tricks

$1^{st}$ level tips

  • Diversity based on algorithms:
  • 2-3 gradient boosted trees (lightgbm, xgboost, H2O, catboost)
  • 2-3 Neural nets (keras, pytorch)
  • 1-2 ExtraTrees/RandomForest (sklearn)
  • 1-2 linear models as in logistic/ridge regression, linear svm (sklearn)
  • 1-2 knn models (sklearn)
  • 1 Factorization machine (libfm)
  • 1 svm with a nonlinear kernel (like RBF) if size/memory allows (sklearn)
  • Diversity based on input data:
  • Categorical features: One hot, label encoding, target encoding, likelihood encoding, frequency or counts
  • Numerical features: outliers, binning, derivatives, percentiles, scaling
  • Interactions: col1*/+-col2, groupby, unsupervised

$2^{nd}$ level tips

  • Simpler (or shallower) Algorithms:
  • gradient boosted trees with small depth (like 2 or 3)
  • Linear models with high regularization
  • Extra Trees (just don't make them too big)
  • Shallow networks (as in 1 hidden layer, with not that many hidden neurons)
  • knn with BrayCurtis Distance
  • Brute forcing a search for the best linear weights based on cv (see the sketch after this list)

  • Feature engineering:
  • pairwise differences between meta features
  • row-wise statistics like averages or stds
  • Standard feature selection techniques
  • For every 7.5 models in the previous level we add 1 in the meta level (a rule of thumb)
  • Be mindful of target leakage
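
For the brute-force weight search mentioned above, a minimal sketch (preds1, preds2 and y_valid stand for out-of-fold predictions and their targets; all values are hypothetical):

import numpy as np
from sklearn.metrics import mean_squared_error

# toy out-of-fold predictions of two models and the matching targets
rng = np.random.RandomState(0)
y_valid = rng.rand(500)
preds1 = y_valid + rng.normal(0, 0.10, 500)
preds2 = y_valid + rng.normal(0, 0.20, 500)

best_w, best_score = None, np.inf
for w in np.linspace(0, 1, 101):                 # grid of candidate weights
    score = mean_squared_error(y_valid, w * preds1 + (1 - w) * preds2)
    if score < best_score:
        best_w, best_score = w, score
print(best_w, best_score)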
