Ensemble Learning with Boosting: AdaBoost Implementation

Date: 2019-12-10


Ensemble Learning with Boosting: AdaBoost Theory


The general AdaBoost algorithm

Input: training set \(T = \left \{(x_1,y_1), (x_2,y_2), \cdots (x_N,y_N)\right \}\) with \(y_i\in\left\{-1,+1 \right\}\), base learner \(G_m(x)\), number of training rounds \(M\)

Initialize the weight distribution: \(w_i^{(1)} = \frac{1}{N}\:, \;\;\;\; i=1,2,3, \cdots N\)

for m=1 to M:
(a) Fit the base learner \(G_m(x)\) on the training set under the current weight distribution:
\[G_m(x) = \mathop{\arg\min}\limits_{G(x)}\sum\limits_{i=1}^Nw_i^{(m)}\mathbb{I}(y_i \neq G(x_i))\]
(b) Compute the error rate of \(G_m(x)\) on the training set:
\[\epsilon_m = \frac{\sum\limits_{i=1}^Nw_i^{(m)}\mathbb{I}(y_i \neq G_m(x_i))}{\sum\limits_{i=1}^Nw_i^{(m)}}\]
(c) Compute the coefficient of \(G_m(x)\): \(\alpha_m = \frac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}\)
(d) Update the sample weight distribution: \(w_{i}^{(m+1)} = \frac{w_i^{(m)}e^{-y_i\alpha_mG_m(x_i)}}{Z^{(m)}}\; ,\qquad i=1,2,3\cdots N\)
where \(Z^{(m)}\) is the normalization factor, \(Z^{(m)} = \sum\limits_{i=1}^Nw^{(m)}_ie^{-y_i\alpha_mG_m(x_i)}\), which ensures that the \(w_i^{(m+1)}\) form a distribution.

Output the final model: \(G(x) = \mathrm{sign}\left[\sum\limits_{m=1}^M\alpha_mG_m(x) \right]\)
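To make steps (b) to (d) concrete, here is a small worked round in NumPy. The labels and the weak learner's predictions are made-up toy values (an assumption for illustration, not data from this post); the arithmetic follows the formulas above.

```python
import numpy as np

# Toy setup (illustrative values): 5 samples, labels in {-1, +1}.
y = np.array([1, 1, -1, -1, 1])
# Hypothetical predictions of the weak learner G_m; only the last sample is wrong.
g = np.array([1, 1, -1, -1, -1])

w = np.full(len(y), 1 / len(y))           # initial weights w_i = 1/N

# (b) weighted error rate
eps = np.sum(w * (g != y)) / np.sum(w)
# (c) coefficient of the weak learner
alpha = 0.5 * np.log((1 - eps) / eps)
# (d) re-weight every sample, then divide by the normalization factor Z
unnormalized = w * np.exp(-y * alpha * g)
w_next = unnormalized / unnormalized.sum()

print("eps =", eps, "alpha =", alpha)
print("next weights:", w_next)
```

With one of five equally weighted samples misclassified, the error rate is 0.2 and the coefficient is ln 2. After normalization the misclassified sample carries half of the total weight, which is why the next base learner is pushed to focus on it.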

The implementation also covers Real AdaBoost, early_stopping, weight_trimming and staged prediction (stage_predict, see the full code).
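The stage_predict method itself is not part of the listing in this post, so the following is only a rough sketch of the idea, not the author's code. The function name `staged_predict_from_scores` and the toy score matrix are assumptions: each row is the contribution of one round, either \(\alpha_mG_m(x)\) for Discrete AdaBoost or the half log-odds used for Real AdaBoost, and a running sum gives the ensemble prediction after every round.

```python
import numpy as np

def staged_predict_from_scores(stage_scores):
    """Return the ensemble prediction after each boosting round.

    stage_scores has shape (M, n_samples). Row m is the contribution of
    round m: alpha_m * G_m(x) for Discrete AdaBoost, or
    0.5 * ln(p / (1 - p)) for Real AdaBoost.
    """
    cumulative = np.cumsum(stage_scores, axis=0)   # running sum over rounds
    return np.sign(cumulative)                     # row m = prediction after m + 1 rounds

# Toy example with hypothetical values: 3 rounds, 2 samples.
scores = np.array([[ 0.8, -0.3],
                   [-0.5, -0.4],
                   [-0.6,  0.9]])
staged = staged_predict_from_scores(scores)
print(staged)
```

Comparing each row of the result with the true labels gives the error after every round without refitting anything, which is what plots of error against the number of rounds need.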

import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.base import clone
from sklearn.metrics import zero_one_loss
import time


class AdaBoost(object):
    def __init__(self, M, clf, learning_rate=1.0, method="discrete", tol=None, weight_trimming=None):
        self.M = M
        self.clf = clf
        self.learning_rate = learning_rate
        self.method = method
        self.tol = tol
        self.weight_trimming = weight_trimming

    def fit(self, X, y):
        # tol is the early-stopping threshold; when early stopping is enabled, a validation set is held out from the training data
        if self.tol is not None:      
            X, X_val, y, y_val = train_test_split(X, y, random_state=2)  
            former_loss = 1
            count = 0
            tol_init = self.tol

        w = np.full(len(X), 1 / len(X))   # initialize every sample weight to 1/N
        self.clf_total = []
        self.alpha_total = []

        for m in range(self.M):
            classifier = clone(self.clf)
            if self.method == "discrete":
                if m >= 1 and self.weight_trimming is not None:
                    # weight trimming: sort the weights in descending order, take the cumulative sum, then drop the samples whose weights are too small
                    sort_w = np.sort(w)[::-1]     
                    cum_sum = np.cumsum(sort_w)   
                    percent_w = sort_w[np.where(cum_sum >= self.weight_trimming)][0]   
                    w_fit, X_fit, y_fit = w[w >= percent_w], X[w >= percent_w], y[w >= percent_w]
                    y_pred = classifier.fit(X_fit, y_fit, sample_weight=w_fit).predict(X)

                else:
                    y_pred = classifier.fit(X, y, sample_weight=w).predict(X)
                loss = np.zeros(len(X))
                loss[y_pred != y] = 1
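                err = np.clip(np.sum(w * loss), 1e-15, 1 - 1e-15)    # weighted error rate (w sums to 1), clipped to keep the log finite
                alpha = 0.5 * np.log((1 - err) / err) * self.learning_rate  # coefficient alpha of the base learner
                w = (w * np.exp(-y * alpha * y_pred)) / np.sum(w * np.exp(-y * alpha * y_pred))  # update the weight distribution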

                self.alpha_total.append(alpha)
                self.clf_total.append(classifier)

            elif self.method == "real":
                if m >= 1 and self.weight_trimming is not None:
                    sort_w = np.sort(w)[::-1]
                    cum_sum = np.cumsum(sort_w)
                    percent_w = sort_w[np.where(cum_sum >= self.weight_trimming)][0]
                    w_fit, X_fit, y_fit = w[w >= percent_w], X[w >= percent_w], y[w >= percent_w]
                    y_pred = classifier.fit(X_fit, y_fit, sample_weight=w_fit).predict_proba(X)[:, 1]

                else:
                    y_pred = classifier.fit(X, y, sample_weight=w).predict_proba(X)[:, 1]  
                y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
                clf = 0.5 * np.log(y_pred / (1 - y_pred)) * self.learning_rate
                w = (w * np.exp(-y * clf)) / np.sum(w * np.exp(-y * clf))

                self.clf_total.append(classifier)

            # early stopping: after round 300, check the validation loss every 10 rounds
            if m % 10 == 0 and m > 300 and self.tol is not None:
                if self.method == "discrete":
                    p = np.array([self.alpha_total[m] * self.clf_total[m].predict(X_val) for m in range(m)])
                elif self.method == "real":
                    p = []
                    # note: this inner loop rebinds m, so m equals (round index - 1) once it finishes
                    for m in range(m):
                        ppp = self.clf_total[m].predict_proba(X_val)[:, 1]
                        ppp = np.clip(ppp, 1e-15, 1 - 1e-15)
                        p.append(self.learning_rate * 0.5 * np.log(ppp / (1 - ppp)))
                    p = np.array(p)

                stage_pred = np.sign(p.sum(axis=0))
                later_loss = zero_one_loss(stage_pred, y_val)

                if later_loss > (former_loss + self.tol):
                    count += 1
                    self.tol = self.tol / 2  
                else:
                    count = 0
                    self.tol = tol_init
                if count == 2:
                    self.M = m - 20
                    print("early stopping in round {}, best round is {}, M = {}".format(m, m - 20, self.M))
                    break
                former_loss = later_loss

        return self

    def predict(self, X):
        if self.method == "discrete":
            pred = np.array([self.alpha_total[m] * self.clf_total[m].predict(X) for m in range(self.M)])

        elif self.method == "real":
            pred = []
            for m in range(self.M):
                p = self.clf_total[m].predict_proba(X)[:, 1]
                p = np.clip(p, 1e-15, 1 - 1e-15)
                pred.append(self.learning_rate * 0.5 * np.log(p / (1 - p)))  # same scaling as in fit()

        return np.sign(np.sum(pred, axis=0))


if __name__ == "__main__":
    # measure the accuracy and running time of each model
    X, y = datasets.make_hastie_10_2(n_samples=20000, random_state=1)   # data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    start_time = time.time()
    model_discrete = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1), learning_rate=1.0, 
                              method="discrete", weight_trimming=None)
    model_discrete.fit(X_train, y_train)
    pred_discrete = model_discrete.predict(X_test)
    acc = np.zeros(pred_discrete.shape)
    acc[np.where(pred_discrete == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_discrete)
    print('Discrete Adaboost accuracy: ', accuracy)
    print('Discrete Adaboost time: ', '{:.2f}'.format(time.time() - start_time),'\n')


    start_time = time.time()
    model_real = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1), learning_rate=1.0, 
                          method="real", weight_trimming=None)
    model_real.fit(X_train, y_train)
    pred_real = model_real.predict(X_test)
    acc = np.zeros(pred_real.shape)
    acc[np.where(pred_real == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_real)
    print('Real Adaboost accuracy: ', accuracy)  
    print("Real Adaboost time: ", '{:.2f}'.format(time.time() - start_time),'\n')

    start_time = time.time()
    model_discrete_weight = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1), learning_rate=1.0, 
                                     method="discrete", weight_trimming=0.995)
    model_discrete_weight.fit(X_train, y_train)
    pred_discrete_weight = model_discrete_weight.predict(X_test)
    acc = np.zeros(pred_discrete_weight.shape)
    acc[np.where(pred_discrete_weight == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_discrete_weight)
    print('Discrete Adaboost(weight_trimming 0.995) accuracy: ', accuracy)
    print('Discrete Adaboost(weight_trimming 0.995) time: ', '{:.2f}'.format(time.time() - start_time),'\n')

    start_time = time.time()
    model_real_weight = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1), learning_rate=1.0,
                                 method="real", weight_trimming=0.999)
    model_real_weight.fit(X_train, y_train)
    pred_real_weight = model_real_weight.predict(X_test)
    acc = np.zeros(pred_real_weight.shape)
    acc[np.where(pred_real_weight == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_real_weight)
    print('Real Adaboost(weight_trimming 0.999) accuracy: ', accuracy)
    print('Real Adaboost(weight_trimming 0.999) time: ', '{:.2f}'.format(time.time() - start_time),'\n')
    
    start_time = time.time()
    model_discrete = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1), learning_rate=1.0, 
                              method="discrete", weight_trimming=None, tol=0.0001)
    model_discrete.fit(X_train, y_train)
    pred_discrete = model_discrete.predict(X_test)
    acc = np.zeros(pred_discrete.shape)
    acc[np.where(pred_discrete == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_discrete)
    print('Discrete Adaboost accuracy (early_stopping): ', accuracy)
    print('Discrete Adaboost time (early_stopping): ', '{:.2f}'.format(time.time() - start_time),'\n')

    start_time = time.time()
    model_real = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1), learning_rate=1.0, 
                          method="real", weight_trimming=None, tol=0.0001)
    model_real.fit(X_train, y_train)
    pred_real = model_real.predict(X_test)
    acc = np.zeros(pred_real.shape)
    acc[np.where(pred_real == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_real)
    print('Real Adaboost accuracy (early_stopping): ', accuracy)  
    print('Real Adaboost time (early_stopping): ', '{:.2f}'.format(time.time() - start_time),'\n')

Output:

Discrete Adaboost accuracy:  0.954
Discrete Adaboost time:  43.47

Real Adaboost accuracy:  0.9758
Real Adaboost time:  41.15

Discrete Adaboost(weight_trimming 0.995) accuracy:  0.9528
Discrete Adaboost(weight_trimming 0.995) time:  39.58

Real Adaboost(weight_trimming 0.999) accuracy:  0.9768
Real Adaboost(weight_trimming 0.999) time:  25.39

early stopping in round 750, best round is 730, M = 730
Discrete Adaboost accuracy (early_stopping):  0.9268
Discrete Adaboost time (early_stopping):  14.60

early stopping in round 539, best round is 519, M = 519
Real Adaboost accuracy (early_stopping):  0.974
Real Adaboost time (early_stopping):  11.64

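As the results show, weight_trimming does little to speed up training for Discrete AdaBoost (43.47 s vs. 39.58 s) but helps noticeably for Real AdaBoost (41.15 s vs. 25.39 s). A possible reason is that the weights in Discrete AdaBoost stay fairly spread out in each round, whereas in Real AdaBoost they are concentrated on a small number of samples.
early_stopping was triggered at rounds 750 and 539 respectively, and the final accuracy is still acceptable.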
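The weight-trimming explanation above can be checked on toy data. The sketch below reuses the thresholding logic from fit() in a standalone helper; the helper name `trimming_mask` and both weight vectors are assumptions made up for illustration, not measurements from the experiments.

```python
import numpy as np

def trimming_mask(w, keep_mass):
    """Boolean mask of the samples that weight trimming keeps.

    Same rule as in fit(): sort the weights in descending order, find the
    first weight at which the cumulative sum reaches keep_mass, and keep
    every sample whose weight is at least that value.
    """
    sort_w = np.sort(w)[::-1]
    cum_sum = np.cumsum(sort_w)
    threshold = sort_w[cum_sum >= keep_mass][0]
    return w >= threshold

n = 1000
spread = np.full(n, 1 / n)                    # weights spread evenly over all samples
concentrated = np.full(n, 0.1 / (n - 10))     # 990 samples share 10% of the mass
concentrated[:10] = 0.09                      # 10 samples hold the other 90%

kept_spread = trimming_mask(spread, 0.85).sum()
kept_concentrated = trimming_mask(concentrated, 0.85).sum()
print(kept_spread, kept_concentrated)
```

When the weights are spread out, every sample ties with the threshold and nothing is dropped, so each round still trains on the full set. When the mass sits on a few samples, only those survive and the base learner is fitted on far less data, which would be consistent with the larger speedup seen for Real AdaBoost.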

The two figures below show that accuracy with weight_trimming is almost the same as with plain AdaBoost (except in the 0.95 case).

Discrete AdaBoost vs. Real AdaBoost - Overfitting

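AdaBoost has an attractive property: it is said to "not overfit", or, more precisely, continuing to train after the training error has dropped to zero can still improve generalization. As the figure below shows, when 10000 trees are trained, the training error of Real AdaBoost drops to zero early on while its test error stays almost flat. The figure also shows that Real AdaBoost beats Discrete AdaBoost in both training speed and accuracy.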

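Margin theory offers an explanation for this phenomenon: as the number of training rounds grows, the margins of the predictions on the training samples keep widening even after the training error has reached zero, which amounts to steadily increasing the confidence of the predictions. The theory has nonetheless been debated in the academic community for over a decade; for details, see the paper by one of AdaBoost's inventors [Schapire, Explaining AdaBoost].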
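As a rough illustration of what "margin" means here, the sketch below computes the normalized margin \(y_i\sum_m\alpha_mG_m(x_i) / \sum_m|\alpha_m|\) for each training sample. The helper name `normalized_margins`, the coefficients and the predictions are hypothetical toy values, not output of the models above.

```python
import numpy as np

def normalized_margins(alphas, stage_preds, y):
    """Normalized margin of every sample, a value in [-1, 1].

    stage_preds has shape (M, n_samples) with entries in {-1, +1};
    a positive margin means the weighted vote classifies the sample correctly.
    """
    alphas = np.asarray(alphas, dtype=float)
    votes = alphas @ np.asarray(stage_preds, dtype=float)   # weighted vote per sample
    return y * votes / np.abs(alphas).sum()

y = np.array([1, -1, 1])
stage_preds = np.array([[1, -1, -1],    # G_1
                        [1, -1,  1],    # G_2
                        [1,  1,  1]])   # G_3
alphas = [0.4, 0.4, 0.2]

margins = normalized_margins(alphas, stage_preds, y)
print(margins)
```

All three margins are positive, so the training error of this toy ensemble is already zero, yet the margins range from 0.2 to 1.0. Further rounds can still raise the small ones, which is the effect margin theory points to.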

Learning Curve

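A learning curve is another way to evaluate a model: it shows how the training error and the test error change as the training set grows. In general, if the two curves are close together and both errors are high, the model is underfitting; if the training error is low and the test error is high, leaving a wide gap between the curves, the model is overfitting.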

Here is the learning curve of AdaBoost on the dataset above:

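Only 5000 samples were used here (2500 for training + 2500 for testing), because drawing a learning curve generally requires fitting N models (N being the number of training samples), which is computationally expensive. Judging from the figure above, Discrete AdaBoost is underfitting, while Real AdaBoost looks more like overfitting; with more data, the test error rate of Real AdaBoost might drop further.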
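The plotting code for the learning curve is not shown in this post, so the following is only a minimal sketch of how such a curve could be computed. It makes several assumptions: it uses scikit-learn's built-in AdaBoostClassifier as a stand-in for the AdaBoost class above, a smaller dataset, 50 rounds, and a coarse grid of four training sizes instead of one model per sample.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split

# Smaller than in the post so that the sketch runs in a few seconds.
X, y = make_hastie_10_2(n_samples=2000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

sizes = [100, 250, 500, 1000]        # coarse grid of training-set sizes
train_err, test_err = [], []
for n in sizes:
    # Stand-in model; its default base learner is a depth-1 decision tree.
    model = AdaBoostClassifier(n_estimators=50, random_state=1)
    model.fit(X_train[:n], y_train[:n])
    # Training error on the n samples used for fitting, test error on the fixed test set.
    train_err.append(zero_one_loss(y_train[:n], model.predict(X_train[:n])))
    test_err.append(zero_one_loss(y_test, model.predict(X_test)))

for n, tr, te in zip(sizes, train_err, test_err):
    print(n, round(tr, 4), round(te, 4))
```

Plotting train_err and test_err against sizes gives the two curves discussed above; a persistent gap between them suggests overfitting, while two high, close curves suggest underfitting.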
