Notes: Chapter 7

If you aggregate the predictions of a group of predictors, you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble, and this technique is called Ensemble Learning. This chapter covers the most popular ensemble methods, including bagging, boosting, stacking, and a few others.

 

Voting Classifiers

  1. voting='hard' makes each classifier cast a vote and takes the majority class; voting='soft' averages the predicted class probabilities of all the individual classifiers (a hard-voting variant is sketched after the accuracy comparison below)
  2. By the law of large numbers, the ensemble generally performs better than any of its individual classifiers, and the more independent the classifiers are from one another, the better the ensemble performs
In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  #chapter 5

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # soft voting needs predict_proba(); probability=True makes SVC provide it

voting_clf = VotingClassifier(
        estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
        voting='soft'
    )
voting_clf.fit(X_train, y_train)  # fitting the VotingClassifier fits all the underlying estimators at once

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
 
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912
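
The cell above uses soft voting; as a minimal hard-voting sketch (reusing the data, imports, and accuracy_score from the cells above, with fresh copies of the three classifiers since hard voting does not need predict_proba):

hard_voting_clf = VotingClassifier(
        estimators=[('lr', LogisticRegression(random_state=42)),
                    ('rf', RandomForestClassifier(random_state=42)),
                    ('svc', SVC(random_state=42))],  # no probability=True needed for hard voting
        voting='hard'
    )
hard_voting_clf.fit(X_train, y_train)
print(accuracy_score(y_test, hard_voting_clf.predict(X_test)))  # often slightly below the soft-voting score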
 

Bagging and Pasting

You can get a diverse set of classifiers either by using very different training algorithms, or by training the same algorithm on different random subsets of the training set.

  1. Each predictor is trained on a random sample drawn from the training set; sampling with replacement is called bagging, sampling without replacement is called pasting
  2. Each individual predictor, trained on its resampled subset, has a higher bias than one trained on the original training set, but aggregating them reduces both bias and variance
  3. The predictors can be trained independently, in parallel
 

Bagging and pasting in scikit-learn

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
In [3]:
accuracy_score(y_test, y_pred)
Out[3]:
0.91200000000000003
In [4]:
past_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=False, n_jobs=-1)

past_clf.fit(X_train, y_train)
y_pred = past_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[4]:
0.91200000000000003
 

Bagging ends up with a slightly higher bias than pasting, because bootstrapping introduces a bit more diversity in the subsets each predictor is trained on; the predictors end up less correlated, so the ensemble's variance is reduced.

 

Out-of-Bag Evaluation

  1. With bagging, on average about $\frac{1}{e}\approx 0.368$ of the training instances are never sampled for a given predictor; these are its out-of-bag (oob) instances (see the quick derivation below)
  2. Since the predictor never sees them during training, the oob instances can be used to evaluate it without a separate validation set
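
Quick check of that fraction: each of the $m$ bootstrap draws misses a given instance with probability $1-\frac{1}{m}$, so

$$ \Pr(\text{never sampled}) = \left(1-\frac{1}{m}\right)^{m} \xrightarrow[m\to\infty]{} e^{-1} \approx 0.368 $$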
In [5]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_
Out[5]:
0.92266666666666663
In [6]:
y_pred=bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[6]:
0.91200000000000003
In [7]:
bag_clf.oob_decision_function_[2]
Out[7]:
array([ 0.99744898,  0.00255102])
 

Random Patches and Random Subspaces

  1. When the input has many features, the features themselves can also be sampled, via the max_features and bootstrap_features hyperparameters, which work just like max_samples and bootstrap (a sketch follows this list)
  2. Sampling both training instances and features is called the Random Patches method
  3. Keeping all training instances but sampling features is called the Random Subspaces method
  4. Sampling features trades a bit more bias for a lower variance
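
A minimal Random Subspaces sketch (the 50% feature fraction is just an illustration; with the two-feature moons data it samples a single feature per tree, so this is purely illustrative, and the imports repeat those used above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=1.0, bootstrap=False,           # keep every training instance
    max_features=0.5, bootstrap_features=True,  # sample 50% of the features per predictor
    n_jobs=-1)
subspace_clf.fit(X_train, y_train)
accuracy_score(y_test, subspace_clf.predict(X_test))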
 

Random Forest

  1. A Random Forest is an ensemble of Decision Trees, generally trained via bagging; use the RandomForestClassifier class
In [9]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf=RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
In [10]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter='random', max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)  # roughly equivalent to the RandomForestClassifier above
bag_clf.fit(X_train, y_train)
In [11]:
accuracy_score(y_test, y_pred_rf)
Out[11]:
0.91200000000000003
In [12]:
accuracy_score(y_test, bag_clf.predict(X_test))
Out[12]:
0.91200000000000003
 

Extra-Trees

  1. When growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting.
  2. Trees can be made even more random by also using random thresholds for each feature rather than searching for the best possible thresholds.
  3. Use the ExtraTreesClassifier class to create an Extra-Trees ensemble; it has the same API as RandomForestClassifier (see the sketch after this list).
  4. It is hard to tell in advance which of the two will perform better; for a given problem, compare them with cross-validation.
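
A minimal ExtraTreesClassifier sketch (the hyperparameters simply mirror the Random Forest above; they are assumptions, not tuned values):

from sklearn.ensemble import ExtraTreesClassifier

ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
ext_clf.fit(X_train, y_train)
accuracy_score(y_test, ext_clf.predict(X_test))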
 

Feature Importance

  1. Important features are likely to appear closer to the root of a tree, while unimportant features often appear closer to the leaves (or not at all)
  2. Scikit-Learn exposes this through the feature_importances_ attribute, which is based on the average depth at which each feature appears across all trees in the forest
In [13]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
 
sepal length (cm) 0.102324356672
sepal width (cm) 0.0257240474133
petal length (cm) 0.439143949318
petal width (cm) 0.432807646597
In [14]:
from six.moves import urllib
from sklearn.datasets import fetch_mldata
try:
    mnist = fetch_mldata('MNIST original')
except urllib.error.HTTPError as ex:
    print("Could not download MNIST data from mldata.org, trying alternative...")

    # Alternative method to load MNIST, if mldata.org is down
    from scipy.io import loadmat
    mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
    mnist_path = "./mnist-original.mat"
    response = urllib.request.urlopen(mnist_alternative_url)
    with open(mnist_path, "wb") as f:
        content = response.read()
        f.write(content)
    mnist_raw = loadmat(mnist_path)
    mnist = {
        "data": mnist_raw["data"].T,
        "target": mnist_raw["label"][0],
        "COL_NAMES": ["label", "data"],
        "DESCR": "mldata.org dataset: mnist-original",
    }
    print("Success!")
In [15]:
rnd_clf = RandomForestClassifier(random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])
Out[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)
In [20]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = matplotlib.cm.hot,
               interpolation="nearest")
    plt.axis("off")
In [21]:
plot_digit(rnd_clf.feature_importances_)

cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])

plt.show()
 
 

Boosting

  1. Combine several weak learners into a strong learner
  2. Predictors are trained sequentially, each one trying to correct its predecessor
 

Adaptive Boosting

  1. Each new predictor pays more attention to the training instances that its predecessor underfitted.
  2. For example, a second classifier is trained using the weights updated by the first; it then makes predictions on the training set, the weights are updated again, and so on.
In [24]:
import numpy as np
m = len(X_train)

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap, linewidth=10)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

plt.figure(figsize=(11, 4))
for subplot, learning_rate in ((121, 1), (122, 0.5)):
    sample_weights = np.ones(m)
    for i in range(5):
        plt.subplot(subplot)
        svm_clf = SVC(kernel="rbf", C=0.05)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate = {}".format(learning_rate - 1), fontsize=16)

plt.subplot(121)
plt.text(-0.7, -0.65, "1", fontsize=14)
plt.text(-0.6, -0.10, "2", fontsize=14)
plt.text(-0.5,  0.10, "3", fontsize=14)
plt.text(-0.4,  0.55, "4", fontsize=14)
plt.text(-0.3,  0.90, "5", fontsize=14)
plt.show()
 
 
$$ \text{Weighted error rate of the } j^{\text{th}} \text{ predictor:}\quad r_j=\frac{\sum_{i=1,\;\widehat{y}_{j}^{(i)}\neq y^{(i)}}^{m}w^{(i)}}{\sum_{i=1}^{m}w^{(i)}} ,\quad \text{where } \widehat{y}_{j}^{(i)} \text{ is the } j^{\text{th}} \text{ predictor's prediction for the } i^{\text{th}} \text{ instance} \\ \text{Predictor weight:}\quad \alpha_j=\eta\log\frac{1-r_j}{r_j} \\ \text{Weight update rule, for } i=1,2,\dots,m:\quad w^{(i)}\leftarrow \begin{cases} w^{(i)} & \text{if } \widehat{y}_{j}^{(i)}=y^{(i)}\\ w^{(i)}\,e^{\alpha_j} & \text{if } \widehat{y}_{j}^{(i)}\neq y^{(i)} \end{cases} $$
 
  1. Each instance weight starts at $w^{(i)}=\frac{1}{m}$
  2. A first predictor is trained, and $r_j$ and $\alpha_j$ are computed from its predictions
  3. The weights $w^{(i)}$ are updated with the rule above and then normalized (i.e. divided by $\sum_{i=1}^{m}w^{(i)}$)
  4. A new predictor is trained on the reweighted instances, and the whole process is repeated (a NumPy sketch of these rules follows the prediction formula below)
 
$$ \text{AdaBoost predictions:}\quad \widehat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{j=1,\;\widehat{y}_j(x)=k}^{N} \alpha_j ,\quad \text{where } N \text{ is the number of predictors} $$
  1. SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) is the multiclass version of AdaBoost; SAMME.R additionally relies on class probabilities (predict_proba)
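
Before using Scikit-Learn's implementation, here is a rough NumPy sketch of the update and prediction rules above for the binary moons labels (decision stumps as weak learners, $\eta=1$, five rounds; all of these choices are illustrative, not the library's internals):

w = np.ones(len(X_train)) / len(X_train)       # start with uniform weights 1/m
alphas, stumps = [], []
for j in range(5):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_train, y_train, sample_weight=w)
    y_pred = stump.predict(X_train)
    r = w[y_pred != y_train].sum() / w.sum()   # weighted error rate r_j (assumed > 0 on this noisy data)
    alpha = np.log((1 - r) / r)                # predictor weight alpha_j, with eta = 1
    w[y_pred != y_train] *= np.exp(alpha)      # boost the misclassified instances
    w /= w.sum()                               # normalize
    alphas.append(alpha)
    stumps.append(stump)

# prediction: for each class, sum the alphas of the stumps voting for it, then take the argmax
alphas = np.array(alphas)
stump_preds = np.array([stump.predict(X_test) for stump in stumps])   # shape (n_stumps, n_test)
score_1 = (alphas[:, np.newaxis] * (stump_preds == 1)).sum(axis=0)
score_0 = (alphas[:, np.newaxis] * (stump_preds == 0)).sum(axis=0)
accuracy_score(y_test, (score_1 > score_0).astype(int))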
In [29]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=500, algorithm='SAMME.R', learning_rate=0.5)
ada_clf.fit(X_train, y_train)
Out[29]:
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.5, n_estimators=500, random_state=None)
In [30]:
accuracy_score(y_test, ada_clf.predict(X_test))
Out[30]:
0.88
 

Gradient Boosting

  1. Like AdaBoost, each new predictor tries to correct its predecessor, but instead of tweaking the instance weights it is fit to the residual errors made by the previous predictor
In [39]:
from sklearn.tree import DecisionTreeRegressor
import numpy.random as rnd

rnd.seed(42)
X = rnd.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * rnd.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
print(y_pred)
 
[ 0.75026781]
In [40]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X,y)
Out[40]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=3, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)
In [41]:
def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
    plt.axis(axes)

plt.figure(figsize=(11,11))

plt.subplot(321)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)

plt.subplot(322)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)

plt.subplot(323)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1)$", fontsize=16)

plt.subplot(324)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1)$")
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.subplot(325)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_3(x_1)$", style="g-", data_style="k+")
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)

plt.subplot(326)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.show()
 
In [42]:
# the learning_rate hyperparameter scales the contribution of each tree
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.1, random_state=42)
gbrt.fit(X, y)

gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)

plt.figure(figsize=(11,4))

plt.subplot(121)
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("learning_rate={}, n_estimators={}".format(gbrt.learning_rate, gbrt.n_estimators), fontsize=14)

plt.subplot(122)
plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("learning_rate={}, n_estimators={}".format(gbrt_slow.learning_rate, gbrt_slow.n_estimators), fontsize=14)

plt.show()
 
 

To find the optimal number of trees you can use early stopping. A simple way to implement it is the staged_predict() method, which returns an iterator over the predictions made by the ensemble at each stage of training.

In [44]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors)

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
Out[44]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=67, presort='auto',
             random_state=None, subsample=1.0, verbose=0, warm_start=False)
In [47]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break
 

With warm_start=True, Scikit-Learn keeps the existing trees when fit() is called again, allowing incremental training; the loop above stops as soon as the validation error has failed to improve for five consecutive iterations (early stopping).

  1. The subsample hyperparameter specifies the fraction of training instances used to train each tree (Stochastic Gradient Boosting, sketched below); this trades a higher bias for a lower variance and speeds up training
  2. Gradient Boosting can also be used with another cost function, controlled by the loss hyperparameter
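
A Stochastic Gradient Boosting sketch (the 25% subsample and the other hyperparameters are arbitrary illustrations, not tuned values):

sgbrt = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                  learning_rate=0.1, subsample=0.25, random_state=42)
sgbrt.fit(X_train, y_train)
mean_squared_error(y_val, sgbrt.predict(X_val))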
 

Stacking

  1. Short for stacked generalization
  2. Split the training set into three subsets
  3. The first subset is used to train the predictors of the first layer
  4. The second subset is used to create the training set for the second layer (a blender trained on predictions made by the first layer's predictors on this held-out subset)
  5. The third subset is used to create the training set for the third layer (using predictions made by the predictors of the second layer), yielding a multilayer stacking ensemble; a rough two-layer sketch follows
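
Scikit-Learn (as of these notes) does not implement stacking directly, so here is a rough two-layer sketch of the steps above; the hold-out split, the choice of first-layer models, and the LogisticRegression blender are all assumptions, and the three-subset version simply repeats the hold-out step once more:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# regenerate the moons data (X, y were overwritten in the gradient boosting cells)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, random_state=42)

# subset 1 trains the first layer; subset 2 is held out to build the blender's training set
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(X_train_full, y_train_full, random_state=42)

first_layer = [RandomForestClassifier(random_state=42),
               SVC(random_state=42),
               LogisticRegression(random_state=42)]
for clf in first_layer:
    clf.fit(X_sub1, y_sub1)

# the blender is trained on the first layer's predictions for the held-out subset
X_blend = np.column_stack([clf.predict(X_sub2) for clf in first_layer])
blender = LogisticRegression(random_state=42)
blender.fit(X_blend, y_sub2)

# to predict: run the first layer, then feed its predictions to the blender
X_test_blend = np.column_stack([clf.predict(X_test) for clf in first_layer])
accuracy_score(y_test, blender.predict(X_test_blend))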