[Machine Learning] Regression: Decision Tree Regression

Date: 2019-11-24

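A CART decision tree, also known as a classification and regression tree, works as a regression tree when the dependent variable is continuous, using the mean of the observations in a leaf node as the prediction. When the dependent variable is discrete, it works as a classification tree and handles classification problems well. Note that CART builds a binary tree: each non-leaf node has exactly two branches. As a result, a discrete variable with more than two levels may be used for splitting more than once.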
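To make the "binary split, leaf mean as prediction" idea concrete, here is a minimal sketch of how a single regression split can be chosen on one feature. It assumes the squared-error criterion; the helper name `best_split` and the toy data are made up for illustration and are not scikit-learn's implementation.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the binary split x <= threshold that
    minimizes the summed squared error around the two child means."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # equal values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the threshold halfway between the two neighbouring values
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Toy data (made up): two clearly separated groups of target values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

threshold, sse = best_split(x, y)
left_mean = y[x <= threshold].mean()   # prediction of the left leaf
right_mean = y[x > threshold].mean()   # prediction of the right leaf
print(threshold, left_mean, right_mean)  # 3.5 11.0 31.0
```

A full tree repeats this search recursively inside each child until a stopping rule is reached.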

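In sklearn, the main hyperparameters we can use to improve a decision tree's ability to generalize are:
- max_depth: The maximum depth of the tree. Once the tree reaches max_depth it stops growing, no matter how many features could still be split on.
- min_samples_split: The minimum number of samples a node must contain before it can be split. A node with fewer samples is not split further and becomes a leaf. In a regression tree its prediction is the mean of the targets of its training samples (a classification tree uses the majority class instead).
- min_samples_leaf: The minimum number of samples each leaf must contain. A candidate split is considered only if it leaves at least this many training samples in each child; otherwise that split is rejected.
- min_weight_fraction_leaf: The minimum fraction of the total sample weight that a leaf must hold. When no sample weights are given, all samples have equal weight.
- max_leaf_nodes: The maximum number of leaf nodes. None means no limit; with an integer, the tree is grown best-first until it has that many leaves. Older scikit-learn documentation states that max_depth is ignored when this is set, so check the documentation for your version.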
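The effect of these limits can be checked directly on a fitted tree. The sketch below uses synthetic data and arbitrarily chosen limit values (both are assumptions for illustration) and compares an unconstrained tree with a constrained one using `get_depth()`, `get_n_leaves()` and `apply()`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up for illustration): noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(10 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.3 * rng.randn(200)

# No limits: the tree keeps splitting until its leaves are pure
unconstrained = DecisionTreeRegressor(random_state=0).fit(X, y)

# Illustrative limits; the specific values are arbitrary
constrained = DecisionTreeRegressor(
    max_depth=4,
    min_samples_split=20,
    min_samples_leaf=10,
    max_leaf_nodes=8,
    random_state=0,
).fit(X, y)

print("unconstrained:", unconstrained.get_depth(), unconstrained.get_n_leaves())
print("constrained:  ", constrained.get_depth(), constrained.get_n_leaves())

# apply() returns the leaf index of each sample, so counting the indices
# gives the number of training samples that ended up in each leaf
leaf_ids = constrained.apply(X)
min_leaf_size = np.unique(leaf_ids, return_counts=True)[1].min()
print("smallest leaf:", min_leaf_size)
```

The constrained tree ends up with far fewer leaves, and every leaf holds at least `min_samples_leaf` training samples.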

The data used here is the salary corresponding to each promotion level within a company.

Let's see how this is implemented in Python.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
 
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
# Note: 1:2 selects only the column at index 1, but unlike a plain 1 it returns a 2-D matrix rather than a 1-D vector.
y = dataset.iloc[:, 2].values
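The difference between the slice `1:2` and the plain index `1` is easy to see from the array shapes. The sketch below builds a small stand-in DataFrame instead of reading Position_Salaries.csv; the column names and values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Position_Salaries.csv (names and values are made up)
dataset = pd.DataFrame({
    "Position": ["Analyst", "Manager", "Director"],
    "Level": [1, 2, 3],
    "Salary": [45000, 80000, 150000],
})

X_2d = dataset.iloc[:, 1:2].values  # slice keeps the column axis
X_1d = dataset.iloc[:, 1].values    # single index drops the column axis

print(X_2d.shape)  # (3, 1)
print(X_1d.shape)  # (3,)
```

scikit-learn estimators expect the feature array in the 2-D form `(n_samples, n_features)`, which is why the slice is used for X.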

Now for the main topic, fitting the Decision Tree Regression model:

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
 
y_pred = regressor.predict([[6.5]])  # predict expects a 2-D array of shape (n_samples, n_features)
# Plot the result
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(-1, 1)
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

The code below explores the relationship between a decision tree's maximum depth and overfitting, showing how the maximum depth affects the fit.

As with classification trees, increasing the maximum depth improves the fit on the training set, but it can also reduce the model's ability to generalize.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
 
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(10 * rng.rand(160, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 2 * (0.5 - rng.rand(32)) # add noise to every 5th point
 
# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_3 = DecisionTreeRegressor(max_depth=8)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
 
# Predict
X_test = np.arange(0.0, 10.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)
 
# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=4", linewidth=2)
plt.plot(X_test, y_3, color="r", label="max_depth=8", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

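The test above shows that the tree's fitting capacity keeps rising as the maximum depth increases.
This example has 160 samples. With a maximum depth of 8 (greater than log2(160) ≈ 7.3), the tree fits not only the true signal but also the noise we added, which reduces its ability to generalize.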
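The logarithm argument can be made concrete: a binary tree of depth d has at most 2^d leaves, so once 2^d exceeds the number of samples the tree has enough room to isolate individual noisy points. The sketch below rebuilds the same 160-sample dataset and prints the actual leaf count next to that upper bound; `random_state=0` on the regressors is an added assumption to keep the run reproducible.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same synthetic dataset as in the example above
rng = np.random.RandomState(1)
X = np.sort(10 * rng.rand(160, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 2 * (0.5 - rng.rand(32))

n_samples = len(X)
results = {}
for depth in (2, 4, 8):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    n_leaves = tree.get_n_leaves()
    results[depth] = (n_leaves, 2 ** depth)  # (actual leaves, upper bound)
    print(f"max_depth={depth}: leaves={n_leaves}, bound={2 ** depth}, "
          f"avg samples per leaf={n_samples / n_leaves:.1f}")
```

The fewer samples each leaf averages over, the more a single noisy point can dominate that leaf's prediction.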

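Maximum depth versus training and test scores

Next we plot the training and test scores (the R² value returned by score) of decision trees with different maximum depths.
You can of course run the same kind of test on the other hyperparameters that control how the tree is grown.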

from sklearn import model_selection
def creat_data(n):
    np.random.seed(0)
    X = 5 * np.random.rand(n, 1)
    y = np.sin(X).ravel()
    noise_num = len(y[::5])  # number of samples that receive noise
    y[::5] += 3 * (0.5 - np.random.rand(noise_num)) # add noise to every 5th sample
    return model_selection.train_test_split(X, y,test_size=0.25,random_state=1)
def test_DecisionTreeRegressor_depth(*data,maxdepth):
    X_train,X_test,y_train,y_test=data
    depths=np.arange(1,maxdepth)
    training_scores=[]
    testing_scores=[]
    for depth in depths:
        regr = DecisionTreeRegressor(max_depth=depth)
        regr.fit(X_train, y_train)
        training_scores.append(regr.score(X_train,y_train))
        testing_scores.append(regr.score(X_test,y_test))
 
    ## Plot
    fig=plt.figure()
    ax=fig.add_subplot(1,1,1)
    ax.plot(depths,training_scores,label="training score")
    ax.plot(depths,testing_scores,label="testing score")
    ax.set_xlabel("maxdepth")
    ax.set_ylabel("score")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()
 
X_train,X_test,y_train,y_test=creat_data(200)    
test_DecisionTreeRegressor_depth(X_train,X_test,y_train,y_test,maxdepth=12)

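From the figure above we can see that, when the dataset is split with train_test_split, a maximum depth of 2 is the best value of this hyperparameter for this particular split.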

In the same way you can test other hyperparameters, switch to cross-validation (cv) for the evaluation, or use tools such as hyperopt or auto-sklearn.
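As one possible way to do the cross-validated version, the sketch below searches max_depth with scikit-learn's `GridSearchCV`. The data mirrors the generator used above, but the random generator, the 5-fold setting and the depth grid are assumptions chosen for illustration, so the selected depth may differ from the single-split result.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Data in the same spirit as creat_data(200) above (seeded separately here)
rng = np.random.RandomState(0)
X = 5 * rng.rand(200, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(40))  # add noise to every 5th sample

# 5-fold cross-validation over max_depth = 1..11, scored by R^2
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(1, 12))},
    cv=5,
    scoring="r2",
)
search.fit(X, y)

print("best max_depth:", search.best_params_["max_depth"])
print("mean CV R^2:   ", search.best_score_)
```

Because each depth is scored on several folds rather than one split, the chosen value depends less on how a single train/test split happened to fall.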