[Translation] Ensemble Learning with H2O (Part 1)

Date: 2019-11-16


H2O Ensemble: Stacking in H2O

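Translator's note: if you cannot get this version installed, don't dwell on it; you can move on to the second translated article, though I suggest skimming this one first.
H2O Ensemble is implemented as a standalone R package called h2oEnsemble. The package is an extension of the h2o R package that lets the user train an ensemble on an H2O cluster using any of the supervised learning algorithms available in H2O. As with the h2o R package, all of the actual computation in h2oEnsemble is performed inside the H2O cluster rather than in R memory.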

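The main computational tasks in the Super Learner ensemble algorithm are the training and cross-validation of the base learners and the metalearner. Implementing the ensemble "plumbing" in R (rather than in Java) therefore causes no loss of performance. All training and data processing are performed in the high-performance H2O cluster.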

H2O Ensemble currently supports only regression and binary classification; multiclass support will be added in a future release.

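Translator's note: the code below raises errors with the latest version of the h2o package, so installing an older version is recommended. The command for installing the older version is shown below. It may be a little slow (the h2o package is about 50 MB), and the h2o package needs a Java runtime to run.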

install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/9/R")))
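Because the h2o package needs a Java runtime, it can save time to check the environment before or after installing. The snippet below is a minimal sketch in base R. It assumes the `java` executable is expected on the PATH, which may not hold for every setup (for example, when only JAVA_HOME is set).

```r
# Check whether a `java` executable is visible on the PATH
# (assumption: H2O will be launched with the Java found there).
java_path <- Sys.which("java")
has_java <- nzchar(java_path[["java"]])
if (!has_java) message("No `java` found on PATH; install a Java runtime first.")

# Report the installed h2o version, if the package is installed at all.
has_h2o <- requireNamespace("h2o", quietly = TRUE)
if (has_h2o) {
  message("Installed h2o version: ", as.character(utils::packageVersion("h2o")))
} else {
  message("The h2o package is not installed yet.")
}
```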

Installing H2O Ensemble

To install the h2oEnsemble package, you only need to follow the installation instructions in the README file, which are repeated here for convenience.

H2O R Package

First, install the H2O R package if you have not already done so. R installation instructions are available at: http://h2o.ai/download

H2O Ensemble R Package

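The recommended way to install the h2oEnsemble R package is directly from GitHub using the devtools package. (H2O World tutorial attendees can install the package from the provided USB stick.)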

Installing from GitHub

library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")

Higgs Demo

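This is a binary classification example using the h2o.ensemble function, which is part of the h2oEnsemble package. The demo uses a subset of the HIGGS dataset, which has 28 numeric features and a binary response variable. The machine learning task in this example is to distinguish a signal process that produces Higgs bosons (Y = 1) from a background process that does not (Y = 0). The dataset contains roughly equal numbers of positive and negative examples; in other words, it is a class-balanced dataset.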

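If you run this from plain R, start R in the directory of this script. If you run it from RStudio, make sure to setwd() to the location of this script. h2o.init() starts H2O in R's current working directory. h2o.importFile() is the file import function in h2o.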

Start the H2O cluster

library(h2oEnsemble)  # This will load the `h2o` R package as well
h2o.init(nthreads = -1, enable_assertions = FALSE)  # Start an H2O cluster; nthreads = -1 uses all CPUs on the host
h2o.removeAll() # (Optional) Remove all objects in H2O cluster

Import the data

First, import the training and test sets:

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"

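For binary classification, the response variable should be a factor (called "enum" in Java and "categorical" in Python's pandas). You can specify column types when calling h2o.importFile, or you can convert the column afterwards as follows: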

train[,y] <- as.factor(train[,y])  
test[,y] <- as.factor(test[,y])
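To see what the conversion does, here is a small base-R analogy on an ordinary vector rather than an H2OFrame. H2O generally treats a numeric response as a regression target, whereas a factor carries explicit class levels. The values are made up for illustration.

```r
# A 0/1 response as it might be read from a CSV file: plain numbers.
response_numeric <- c(1, 0, 0, 1, 1)

# After conversion, the same values are class labels with two levels.
response_factor <- as.factor(response_numeric)

is.numeric(response_numeric)  # TRUE
is.factor(response_factor)    # TRUE
levels(response_factor)       # "0" "1"
```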

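Specify the base learners and the metalearner

Here we use the default base learner library for h2o.ensemble, which includes GLM, Random Forest, GBM, and a Deep Neural Net (all with default parameter values). We also use the default metalearner, the H2O GLM.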

learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper", 
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"

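Train an ensemble

Train the ensemble using 5-fold cross-validation to generate the level-one data. Note that using more folds takes more time to train, but may improve performance.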

fit <- h2o.ensemble(x = x, y = y, 
                    training_frame = train, 
                    family = family, 
                    learner = learner, 
                    metalearner = metalearner,
                    cvControl = list(V = 5))
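To make "level-one data" concrete, the following is a conceptual sketch in base R, not the h2oEnsemble internals. It assumes simulated data and two toy logistic-regression base learners (the names `glm_x1` and `glm_x2` are invented for the example). Each base learner's out-of-fold predictions become one column of the level-one matrix, and the metalearner is then trained on those columns.

```r
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - 1.0 * dat$x2))

V <- 5
fold <- sample(rep(seq_len(V), length.out = n))  # random fold id per row

# Two toy base learners: logistic regressions on different predictors.
base_formulas <- list(glm_x1 = y ~ x1, glm_x2 = y ~ x2)

# Level-one matrix: one column of out-of-fold predictions per base learner.
Z <- matrix(NA_real_, nrow = n, ncol = length(base_formulas),
            dimnames = list(NULL, names(base_formulas)))
for (v in seq_len(V)) {
  held_out <- fold == v
  for (j in seq_along(base_formulas)) {
    m <- glm(base_formulas[[j]], data = dat[!held_out, ], family = binomial())
    Z[held_out, j] <- predict(m, newdata = dat[held_out, ], type = "response")
  }
}

# Metalearner: a GLM trained on the level-one data.
level_one <- data.frame(Z, y = dat$y)
meta <- glm(y ~ ., data = level_one, family = binomial())
coef(meta)
```

Every row of `Z` is predicted by models that never saw that row, which is why more folds cost more training time: each base learner is fitted V times.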

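Evaluate model performance

Since the response is binary, we can use the area under the ROC curve (AUC) to evaluate model performance. Compute the test set performance, sorted by AUC (the default metric for binary classification):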

perf <- h2o.ensemble_performance(fit, newdata = test)

Print the performance of each base learner and of the ensemble:

> perf

Base learner performance, sorted by specified metric:
                   learner       AUC
1          h2o.glm.wrapper 0.6824304
4 h2o.deeplearning.wrapper 0.7006335
2 h2o.randomForest.wrapper 0.7570211
3          h2o.gbm.wrapper 0.7780807


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.781580655670451

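We can compare the performance of the ensemble with the performance of the individual learners in it.

The best individual model is the GBM, with a test set AUC of 0.778, while the ensemble scores 0.7816. At first this may not look like much of an improvement, but in many industries, such as medicine or finance, this small advantage can be highly valuable.

There are a few options for improving the performance of the ensemble:

Increase the number of cross-validation folds via the cvControl argument.
Change the base learners and the metalearner.

Note that the ensemble results above are not reproducible, because h2o.deeplearning is not reproducible when run on multiple cores, and we did not set a random seed for h2o.randomForest.wrapper.

If you want to evaluate with a different metric, for example "MSE", you can do so through the print function: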

> print(perf, metric = "MSE")

Base learner performance, sorted by specified metric:
                   learner       MSE
4 h2o.deeplearning.wrapper 0.2305775
1          h2o.glm.wrapper 0.2225176
2 h2o.randomForest.wrapper 0.2014339
3          h2o.gbm.wrapper 0.1916273


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (MSE): 0.1898735479034431
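For a binary response, the MSE can be read as the mean squared difference between the predicted probability of class 1 and the 0/1 label. That reading is my assumption about how the metric is defined here, and the numbers below are made up; the sketch only shows the arithmetic.

```r
# Predicted P(Y == 1) and the true 0/1 labels for four toy observations.
p1    <- c(0.9, 0.2, 0.6, 0.4)
label <- c(1,   0,   1,   1)

# Squared errors are 0.01, 0.04, 0.16, 0.36; their mean is the MSE.
mse <- mean((p1 - label)^2)
mse  # 0.1425
```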

Predict

If you need to generate predictions (rather than only look at model performance), you can use the predict function on a test set.

pred <- predict(fit, newdata = test)

If you need to bring the predictions back into R memory for further processing, you can convert pred to a local R data frame as follows:

predictions <- as.data.frame(pred$pred)[,3]  #third column is P(Y==1)
labels <- as.data.frame(test[,y])[,1]

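The predict method for an h2o.ensemble fit returns a list of two objects. The pred$pred object contains the ensemble predictions, and pred$basepred is a matrix of predictions from each of the base learners. In this example we used four base learners, so the pred$basepred matrix has four columns.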
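Once predictions and labels are in R memory, further processing is ordinary R. As one example, the sketch below computes the AUC with the rank-based (Mann-Whitney) formula in base R. The helper name `auc_rank` and the toy vectors are invented for illustration and are not part of h2o or h2oEnsemble.

```r
# Rank-based AUC: the probability that a random positive case is scored
# higher than a random negative case (ties count one half).
auc_rank <- function(scores, labels) {
  labels <- as.integer(as.character(labels))  # factor "0"/"1" -> 0/1
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_predictions <- c(0.1, 0.4, 0.35, 0.8)
toy_labels <- factor(c(0, 0, 1, 1))
auc_rank(toy_predictions, toy_labels)  # 0.75
```

With the objects created above, the call would be `auc_rank(predictions, labels)`. The value may differ slightly from the AUC reported by H2O, which computes its metrics inside the cluster.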

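Specify new learners

Now let's try again with a larger set of base learners. The h2oEnsemble package ships with four wrapper functions by default, and these can be customized to use non-default parameters.

Here is an example of how to generate custom learner wrappers: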

h2o.glm.1 <- function(..., alpha = 0.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.2 <- function(..., alpha = 0.5) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.3 <- function(..., alpha = 1.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.randomForest.1 <- function(..., ntrees = 200, nbins = 50, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.randomForest.2 <- function(..., ntrees = 200, sample_rate = 0.75, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.3 <- function(..., ntrees = 200, sample_rate = 0.85, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.4 <- function(..., ntrees = 200, nbins = 50, balance_classes = TRUE, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, balance_classes = balance_classes, seed = seed)
h2o.gbm.1 <- function(..., ntrees = 100, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)
h2o.gbm.2 <- function(..., ntrees = 100, nbins = 50, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.gbm.3 <- function(..., ntrees = 100, max_depth = 10, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.gbm.4 <- function(..., ntrees = 100, col_sample_rate = 0.8, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.5 <- function(..., ntrees = 100, col_sample_rate = 0.7, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.6 <- function(..., ntrees = 100, col_sample_rate = 0.6, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.7 <- function(..., ntrees = 100, balance_classes = TRUE, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, balance_classes = balance_classes, seed = seed)
h2o.gbm.8 <- function(..., ntrees = 100, max_depth = 3, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.deeplearning.1 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.2 <- function(..., hidden = c(200,200,200), activation = "Tanh", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.3 <- function(..., hidden = c(500,500), activation = "RectifierWithDropout", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.4 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, balance_classes = TRUE, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, balance_classes = balance_classes, seed = seed)
h2o.deeplearning.5 <- function(..., hidden = c(100,100,100), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.6 <- function(..., hidden = c(50,50), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.7 <- function(..., hidden = c(100,100), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
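The wrappers above all follow one pattern: declare a parameter with a new default, then forward it explicitly to the underlying wrapper. The sketch below reproduces that pattern in base R with a hypothetical stand-in function (`toy_wrapper` is invented here and is not part of h2oEnsemble). It also shows a pitfall worth checking for: a parameter that is declared in the signature but not forwarded has no effect.

```r
# Hypothetical stand-in for a learner wrapper; it just reports its settings.
toy_wrapper <- function(x, alpha = 0.5, epochs = 10) {
  list(x = x, alpha = alpha, epochs = epochs)
}

# Same pattern as the custom learners: new default, forwarded explicitly.
toy.1 <- function(..., alpha = 0.0) toy_wrapper(..., alpha = alpha)

# Pitfall: `epochs` is declared here but never forwarded, so the
# stand-in wrapper falls back to its own default of 10.
toy.2 <- function(..., epochs = 50) toy_wrapper(...)

toy.1(x = 1)$alpha   # 0
toy.2(x = 1)$epochs  # 10
```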

Let's pick a subset of these base learners and retrain the ensemble.

Customize the base learner library

learner <- c("h2o.glm.wrapper",
             "h2o.randomForest.1", "h2o.randomForest.2",
             "h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8",
             "h2o.deeplearning.1", "h2o.deeplearning.6", "h2o.deeplearning.7")

Train with the new base learner library:

fit <- h2o.ensemble(x = x, y = y, 
                    training_frame = train,
                    family = family, 
                    learner = learner, 
                    metalearner = metalearner,
                    cvControl = list(V = 5))

Evaluate the test set performance:

perf <- h2o.ensemble_performance(fit, newdata = test)

The results are as follows:

> perf

Base learner performance, sorted by specified metric:
             learner       AUC
1    h2o.glm.wrapper 0.6824304
7 h2o.deeplearning.1 0.6897187
8 h2o.deeplearning.6 0.6998472
9 h2o.deeplearning.7 0.7048874
2 h2o.randomForest.1 0.7668024
3 h2o.randomForest.2 0.7697849
4          h2o.gbm.1 0.7751240
6          h2o.gbm.8 0.7752852
5          h2o.gbm.6 0.7771115


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.780924502576107

So what happens if we remove some of the weaker learners? Let's drop the GLM and DL models from the learner library and see.

learner <- c("h2o.randomForest.1", "h2o.randomForest.2",
             "h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8")

Retrain the ensemble once more and evaluate its performance:

fit <- h2o.ensemble(x = x, y = y, 
                     training_frame = train,
                     family = family, 
                     learner = learner, 
                     metalearner = metalearner,
                     cvControl = list(V = 5))

perf <- h2o.ensemble_performance(fit, newdata = test)

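In fact, removing the weak learners made the ensemble slightly worse (test AUC 0.7789, down from 0.7809). This illustrates the benefit of stacking with a large and diverse set of base learners.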

> perf

Base learner performance, sorted by specified metric:
             learner       AUC
1 h2o.randomForest.1 0.7668024
2 h2o.randomForest.2 0.7697849
3          h2o.gbm.1 0.7751240
5          h2o.gbm.8 0.7752852
4          h2o.gbm.6 0.7771115


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.778853964308554

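At first thought, you might assume that removing the lower-performing models would improve the ensemble. However, each learner makes its own unique contribution to the ensemble, and the added diversity among learners usually improves performance. The Stacking algorithm learns an optimized way of combining all the learners, one that is generally superior to other combination methods.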
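The effect of keeping a weaker but different learner can be illustrated with a toy simulation in base R. This is a sketch under invented assumptions (two simulated predictors with independent noise, and a linear model standing in for the metalearner); it is not the HIGGS experiment above.

```r
set.seed(42)
n <- 10000
truth  <- rnorm(n)
strong <- truth + rnorm(n, sd = 0.8)  # the better base learner
weak   <- truth + rnorm(n, sd = 1.5)  # a weaker one with independent errors

# A linear metalearner with only the strong learner, then with both.
meta_strong <- lm(truth ~ strong)
meta_both   <- lm(truth ~ strong + weak)

mse <- function(p) mean((p - truth)^2)
c(strong_only = mse(fitted(meta_strong)),
  both        = mse(fitted(meta_both)))
```

Because the weak learner's errors are independent of the strong learner's, the combination that uses both has the lower error, even though the weak learner alone is clearly worse.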

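Stacking an existing set of models

[Figure: schematic of Stacking]

You can also use existing H2O models as a starting point and use the h2o.stack() function to stack them together with a specified metalearner.

The base learners must have been trained on the same dataset with the same response variable, and their cross-validation must have used the same folds (identical fold assignments, not merely the same number of folds).

An example follows. As above, start the H2O cluster and load the training and test data.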

library(h2oEnsemble)
h2o.init(nthreads = -1) # Start H2O cluster using all available CPU threads


# Import a sample binary outcome train/test set into R
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"

#For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])

Cross-validate and train a handful of base learners, then use the h2o.stack() function to create the ensemble:

# The h2o.stack function is an alternative to the h2o.ensemble function, which
# allows the user to specify H2O models individually and then stack them together
# at a later time.  Saved models, re-loaded from disk, can also be stacked.

# The base models must use identical cv folds; this can be achieved in two ways:
# 1. specify the folds explicitly with the `fold_column` argument, or
# 2. use the same value for `nfolds` and set `fold_assignment = "Modulo"`

nfolds <- 5  

glm1 <- h2o.glm(x = x, y = y, family = family, 
                training_frame = train,
                nfolds = nfolds,
                fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE)

gbm1 <- h2o.gbm(x = x, y = y, distribution = "bernoulli",
                training_frame = train,
                seed = 1,
                nfolds = nfolds,
                fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE)

rf1 <- h2o.randomForest(x = x, y = y, # distribution not used for RF
                        training_frame = train,
                        seed = 1,
                        nfolds = nfolds,
                        fold_assignment = "Modulo",
                        keep_cross_validation_predictions = TRUE)

dl1 <- h2o.deeplearning(x = x, y = y, distribution = "bernoulli",
                        training_frame = train,
                        nfolds = nfolds,
                        fold_assignment = "Modulo",
                        keep_cross_validation_predictions = TRUE)

models <- list(glm1, gbm1, rf1, dl1)
metalearner <- "h2o.glm.wrapper"

stack <- h2o.stack(models = models,
                   response_frame = train[,y],
                   metalearner = metalearner, 
                   seed = 1,
                   keep_levelone_data = TRUE)


# Compute test set performance:
perf <- h2o.ensemble_performance(stack, newdata = test)

Print the test set performance of the base learners and of the ensemble:

> print(perf)

Base learner performance, sorted by specified metric:
                                   learner       AUC
1          GLM_model_R_1480128759162_16643 0.6822933
4 DeepLearning_model_R_1480128759162_18909 0.7016809
3          DRF_model_R_1480128759162_17790 0.7546005
2          GBM_model_R_1480128759162_16661 0.7780807


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.781241759877087
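The reason `fold_assignment = "Modulo"` gives every base model identical folds is that the assignment is deterministic. As I understand it, each row's fold is its row index modulo `nfolds`, so it does not depend on a random seed. The base-R sketch below mimics that idea on a made-up row count and is not H2O's internal code.

```r
nfolds <- 5
n_rows <- 12  # hypothetical number of training rows

# 0-based row index modulo nfolds: rows cycle through folds 0, 1, ..., 4.
fold_id <- (seq_len(n_rows) - 1) %% nfolds
fold_id         # 0 1 2 3 4 0 1 2 3 4 0 1
table(fold_id)  # fold sizes differ by at most one row
```

Any two models trained on the same frame with the same `nfolds` would therefore hold out the same rows in each fold, which is what stacking needs.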