Understanding and Implementing CART Decision Trees

Date: 2021-02-18


Introduction to CART Decision Trees

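A decision tree built with the CART (Classification and Regression Tree) algorithm is a binary tree: every split divides a feature's values into two parts, and the tree is grown by applying such splits repeatedly.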

CART Regression Trees

Let X and Y be the input and output variables, where Y is continuous. Given the training dataset

$$D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$$

consider how to grow a regression tree.

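A regression tree corresponds to a partition of the input space (i.e., the feature space) together with an output value on each cell of the partition. Suppose the input space has been partitioned into M cells $R_1,R_2,\ldots,R_M$, and each cell $R_m$ has a fixed output value $c_m$. The regression tree model can then be written as

$$f(x)=\sum_{m=1}^Mc_mI(x\in R_m)\tag{1}$$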
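
As a minimal illustration of Eq. (1), the sketch below evaluates a piecewise-constant model on a single feature. The regions and output values are made up for illustration and do not come from the original post, and the helper name `predict_piecewise` is hypothetical.

```python
# Regions R_m are half-open intervals (low, high]; the output values c_m are made up.
regions = [(float('-inf'), 2.5), (2.5, 6.0), (6.0, float('inf'))]
outputs = [0.4, 1.0, 1.7]


def predict_piecewise(x, regions, outputs):
    """Return f(x) = sum_m c_m * I(x in R_m) for disjoint regions covering the real line."""
    # The comparison yields True/False, which acts as the indicator I(x in R_m).
    return sum(c * (low < x <= high) for (low, high), c in zip(regions, outputs))


print(predict_piecewise(1.0, regions, outputs))
print(predict_piecewise(4.0, regions, outputs))
print(predict_piecewise(9.0, regions, outputs))
```

Because the regions are disjoint, exactly one indicator is non-zero for any x, so the sum simply picks out the output value of the cell that contains x.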

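Once the partition of the input space is fixed, the squared error $\sum_{x_i\in R_m}(y_i-f(x_i))^2$ measures the prediction error of the regression tree on the training data, and the optimal output value on each cell is found by minimizing this squared error. It is easy to see that the optimal value $\hat{c_m}$ of $c_m$ on cell $R_m$ is the mean of the outputs $y_i$ of all input instances $x_i$ that fall in $R_m$, i.e.
$$\hat{c_m}=ave(y_i|x_i\in R_m)\tag{2}$$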
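
The optimality of the mean follows from a short, standard calculus argument. The only symbol introduced here is $N_m$, the number of training points in $R_m$. Setting the derivative of the squared error with respect to $c_m$ to zero gives

$$\frac{\partial}{\partial c_m}\sum_{x_i\in R_m}(y_i-c_m)^2=-2\sum_{x_i\in R_m}(y_i-c_m)=0\;\Longrightarrow\;\hat{c_m}=\frac{1}{N_m}\sum_{x_i\in R_m}y_i$$

The second derivative is $2N_m>0$, so this stationary point is indeed the minimum.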
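Here we choose the j-th variable $x^{j}$ and one of its values s as the splitting variable and splitting point, and define two regions (the left and right child nodes)
$$\begin{cases} R_1(j,s)=\{x|x^{j}\leq s\}\\ R_2(j,s)=\{x|x^{j}> s\} \end{cases} \tag{3}$$
We then look for the optimal splitting variable j and the optimal splitting point s. Specifically, we solve
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right]\tag{4}$$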

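For a fixed input variable j, the optimal splitting point s can be found by trying each candidate s, with the inner minima attained at
$$\begin{cases}\hat{c_1}=ave(y_i|x_i\in R_1(j,s))\\ \hat{c_2}=ave(y_i|x_i\in R_2(j, s))\end{cases} \tag{5}$$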

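Traverse all input variables to find the optimal splitting variable j, which together with its splitting point forms a pair (j,s). This pair splits the input space into two regions. The same splitting procedure is then repeated on each region until a stopping condition is met (for example, a limit on the number of leaf nodes or an error threshold). This yields a regression tree, usually called a least-squares regression tree.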

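The procedure is as follows.
Input: training dataset D (N samples, J features)
Output: regression tree f(x)

1. Select the optimal splitting variable (feature) and splitting point. Traverse all features; for the j-th feature, split the data at each candidate value according to Eq. (3) and compute the corresponding squared error according to Eq. (4). Choose the (feature, splitting point) pair with the smallest error.
2. Use the best feature and splitting point from the previous step to split dataset D into a left subtree (samples that satisfy the condition) and a right subtree (samples that do not).
3. Repeat steps 1 and 2 on each subtree until the stopping condition is met.
4. The result is the regression tree.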
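
The steps above can be sketched in a few lines of plain Python. This is a minimal sketch rather than the author's implementation: the function names (`sse`, `best_split`, `build_tree`, `predict`), the stopping parameters `min_samples` and `min_error`, the dictionary representation of a node, and the toy data are all assumptions made for illustration.

```python
def sse(ys):
    """Sum of squared deviations of ys from their mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, ys):
    """Step 1: scan every feature j and midpoint s; return (error, j, s) with the least error."""
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for a, b in zip(values, values[1:]):
            s = (a + b) / 2
            left = [y for row, y in zip(rows, ys) if row[j] <= s]
            right = [y for row, y in zip(rows, ys) if row[j] > s]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, s)
    return best  # None when no feature has two distinct values


def build_tree(rows, ys, min_samples=2, min_error=1e-6):
    """Steps 2-4: split recursively; a leaf is stored as its mean output value."""
    split = best_split(rows, ys) if len(ys) >= min_samples else None
    if split is None or sse(ys) <= min_error:
        return sum(ys) / len(ys)
    _, j, s = split
    left = [(r, y) for r, y in zip(rows, ys) if r[j] <= s]
    right = [(r, y) for r, y in zip(rows, ys) if r[j] > s]
    return {
        'feature': j,
        'split': s,
        'left': build_tree([r for r, _ in left], [y for _, y in left], min_samples, min_error),
        'right': build_tree([r for r, _ in right], [y for _, y in right], min_samples, min_error),
    }


def predict(tree, row):
    """Follow the splits until a leaf (a plain float) is reached."""
    while isinstance(tree, dict):
        tree = tree['left'] if row[tree['feature']] <= tree['split'] else tree['right']
    return tree


# Toy data, made up for illustration: the output jumps once the first feature exceeds 2.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]]
ys = [1.0, 1.0, 3.0, 3.0]
tree = build_tree(rows, ys)
print(tree)
print(predict(tree, [1.5, 0.0]), predict(tree, [3.5, 0.0]))
```

Splitting only at midpoints between distinct values guarantees that both children are non-empty, so the recursion always makes progress.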

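Here is a simple example showing how a continuous variable is split (the handling is the same as in C4.5).
The table below is a dataset with a single feature x, which is a continuous variable, and y is the target value. We now use this dataset to build a CART regression tree.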

x	1	2	3	4	5	6	7	8	9
y	0.3	0.5	0.7	0.8	0.95	1.3	1.5	1.6	1.9

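First we need to choose the feature and the splitting point.
Feature x has 9 elements and is already sorted, so we directly take the midpoints $\frac{x_i+x_{i+1}}{2},i\in \{1,2,\ldots,8\}$ as candidate splitting points (a commonly used approach).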

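Start from the first splitting point, $\frac{1 + 2}{2}=1.5$. Samples with x below 1.5 go to $R_1$ (left subtree) and samples with x above 1.5 go to $R_2$ (right subtree).
By Eq. (3), $R_1=\{1\}$ and $R_2=\{2,3,4,5,6,7,8,9\}$. By Eq. (5), $c_1=0.3$ and $c_2=\frac{0.5+0.7+0.8+0.95+1.3+1.5+1.6+1.9}{8}\approx 1.156$. By Eq. (4), the squared error of the first splitting point is therefore $0+1.70\approx 1.70$. The error of every other splitting point is computed the same way, and the splitting point with the smallest error is chosen.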
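
The calculation for every candidate splitting point can be checked with a short script. This is a sketch under two assumptions: the x values are distinct, and the error is the sum of squared deviations as in Eq. (4), not their mean. The helper names `sse` and `split_errors` are hypothetical.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0.3, 0.5, 0.7, 0.8, 0.95, 1.3, 1.5, 1.6, 1.9]


def sse(values):
    """Sum of squared deviations from the mean, i.e. one inner minimum in Eq. (4)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_errors(xs, ys):
    """Return (splitting point, squared error) for every midpoint between sorted xs."""
    pairs = sorted(zip(xs, ys))
    results = []
    for k in range(1, len(pairs)):
        s = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [y for _, y in pairs[:k]]    # samples with x below s
        right = [y for _, y in pairs[k:]]   # samples with x above s
        results.append((s, sse(left) + sse(right)))
    return results


errors = split_errors(xs, ys)
for s, err in errors:
    print(f's = {s}: squared error = {err:.4f}')

best_s, best_err = min(errors, key=lambda item: item[1])
print(f'best split: s = {best_s}, squared error = {best_err:.4f}')
```

On this dataset the smallest squared error, about 0.4475, is reached at s = 5.5, so the root of the regression tree splits x at 5.5.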

CART Classification Trees

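A CART classification tree uses the minimum Gini index criterion to select the feature and, at the same time, determine the optimal splitting point.
The Gini index is defined as
$$G(p)=\sum_{k=1}^Kp_k(1-p_k)=1-\sum_{k=1}^Kp_k^2\tag{6}$$
where $p_k$ is the probability of class k and K is the number of classes. For a given dataset D, the Gini index is
$$G(D)=\sum_{k=1}^K\frac{|C_k|}{|D|}\left(1-\frac{|C_k|}{|D|}\right)\tag{7}$$
where $|C_k|$ is the number of samples in class k.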

Suppose a value of feature A splits dataset D into two parts $D_1$ and $D_2$. Conditioned on feature A, the Gini index of D is defined as
$$G(D,A)=\frac{|D_1|}{|D|}G(D_1)+\frac{|D_2|}{|D|}G(D_2)\tag{8}$$
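
A small pure-Python sketch of the Gini index of a dataset and of its conditional form under a one-vs-rest split. The function names `gini` and `conditional_gini` and the toy weather data are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a label collection: 1 - sum_k p_k^2 (0 for an empty collection)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def conditional_gini(feature, labels, value):
    """Weighted Gini index after splitting D into D1 (feature == value) and D2 (the rest)."""
    d1 = [y for a, y in zip(feature, labels) if a == value]
    d2 = [y for a, y in zip(feature, labels) if a != value]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


# Made-up toy data: a weather-like feature and a yes/no label.
feature = ['sunny', 'sunny', 'rain', 'rain', 'cloudy', 'cloudy']
labels = ['no', 'no', 'yes', 'no', 'yes', 'yes']

print(gini(labels))
for value in sorted(set(feature)):
    print(value, round(conditional_gini(feature, labels, value), 4))
```

Here splitting on `sunny` or on `cloudy` gives a conditional Gini index of 0.25, while splitting on `rain` leaves it at 0.5, so either of the first two values would be preferred as the splitting value.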

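G(D) measures the uncertainty of dataset D: the larger the Gini index, the greater the uncertainty. In this respect it behaves much like entropy.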
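
The similarity to entropy is easy to see in the two-class case, where Eq. (6) reduces to $2p(1-p)$. The sketch below tabulates both measures; the function names `gini_binary` and `entropy_binary` are hypothetical, and entropy is measured in bits.

```python
import math


def gini_binary(p):
    """Gini index of a two-class distribution (p, 1 - p)."""
    return 2 * p * (1 - p)


def entropy_binary(p):
    """Entropy in bits of a two-class distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f'p = {p:.1f}  gini = {gini_binary(p):.3f}  entropy = {entropy_binary(p):.3f}')
```

Both measures are zero for a pure node (p = 0 or p = 1) and reach their maximum at p = 0.5, which is why they tend to favor similar splits.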

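Building a CART classification tree differs little from building a decision tree with ID3 or C4.5, covered in the previous section, so the details are omitted here. The code for building a CART classification tree is given directly below.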

Code Implementation

Node

class Node:
    def __init__(self, val, tag=None):
        """
        Params:
            val: feature name (internal node) or class label (leaf node)
            tag: split point (None for a leaf node)

        Attributes:
            left: left subtree
            right: right subtree
        """
        self.val = val
        self.left = None
        self.right = None
        self.tag = tag
        
    def __str__(self):
        return f'val: {self.val}, tag: {self.tag}'

CART Classification Tree

import numpy as np


class CARTClassifier:
    def __init__(self, thresh=1e-3, feat_names=None):
        self.tree = None
        self.feat_names = feat_names
        self.thresh = thresh  # Gini threshold used for pre-pruning

    def fit(self, x_train, y_train):
        """
        Build the decision tree.
        """
        self.tree = self._build(x_train, y_train)
        print('Finish train...')

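    def predict(self, x_test, y_test=None):
        """
        Predict a label for every row of x_test.

        x_test must be an iterable of rows (e.g. a 2-D numpy array) whose columns
        are in the same order as the training features. If y_test is given,
        the accuracy is also stored in self.score.
        """
        if self.tree is None:
            return

        y_pred = []
        for x in x_test:
            y_pred.append(self._search(x))

        y_pred = np.array(y_pred)
        if y_test is not None:
            self._score(y_test, y_pred)

        return y_pred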
         
    
    def _search(self, x):
        """
        Walk down the tree according to the feature values of sample x.
        """
        root = self.tree
        tag = root.tag
        # Only internal nodes carry a split point; a leaf has tag None
        while tag is not None:
            idx = self.feat_names.index(root.val)
            if isinstance(x[idx], str):
                root = root.left if x[idx] == root.tag else root.right
            else:
                root = root.left if x[idx] < root.tag else root.right
            tag = root.tag

        return root.val

    def _score(self, y_test, y_pred):
        """
        Compute the prediction score (accuracy).
        """
        self.score = np.count_nonzero(y_test == y_pred) / len(y_test)
        
    def _build(self, x, y):
        """
        Params:
            x(pandas.DataFrame): features
            y(pandas.DataFrame or numpy.array): labels
        """
        cks, cnts = np.unique(y, return_counts=True)
        if len(cks) == 1:
            return Node(cks[0])
        
        if x.shape[0] == 0:
            return None
        
        self.feat_names = list(x.columns)
        best_gini = float('inf')
        best_split = None
        best_feat = 0
        # Feature selection: keep the feature and split with the lowest conditional Gini index
        for i in range(x.shape[1]):
            if x.iloc[:, i].dtypes != 'object':
                gini, split = self.calc_cond_gini_continuous(x.iloc[:, i], y)
            else:
                gini, split = self.calc_cond_gini(x.iloc[:, i], y)
            if gini < best_gini:
                best_gini = gini
                best_split = split
                best_feat = i
                
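        # Pre-pruning: stop if no split exists or the node itself is already almost pure
        if best_split is None or self.calc_gini(y) < self.thresh:
            return Node(cks[cnts.argmax(0)])

        tree = Node(self.feat_names[best_feat], best_split)
        # Continuous feature: values below the split point go left, the rest go right
        if x.iloc[:, best_feat].dtypes != 'object':
            fmask = x.iloc[:, best_feat] < best_split
            bmask = x.iloc[:, best_feat] >= best_split
        # Discrete feature: values equal to the split value go left, the rest go right
        else:
            fmask = x.iloc[:, best_feat] == best_split
            bmask = x.iloc[:, best_feat] != best_split

        # A split that leaves one side empty makes no progress, so return a majority leaf
        if fmask.sum() == 0 or bmask.sum() == 0:
            return Node(cks[cnts.argmax(0)])

        tree.left = self._build(x[fmask], y[fmask])
        tree.right = self._build(x[bmask], y[bmask])

        return tree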
        
    # Compute the Gini index of a set of labels
    def calc_gini(self, label):
        gini = 0
        
        for (ck, cnt) in zip(*np.unique(label, return_counts=True)):
            prob_ck = cnt / len(label) 
            gini += prob_ck * (1 - prob_ck)  
        return gini
    

    # Discrete feature: find the one-vs-rest split value with the lowest conditional Gini index
    def calc_cond_gini(self, feat, label):
        cks = np.unique(feat)
        best_gini = float('inf')
        best_split = 0
        for ck in cks:
            fmask = feat == ck
            bmask = feat != ck
            cond_gini = sum(fmask) * self.calc_gini(label[fmask])/ len(label) + sum(bmask) * self.calc_gini(label[bmask])/ len(label)

            if cond_gini < best_gini:
                best_gini = cond_gini
                best_split = ck

        return best_gini, best_split

        
        
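    # Continuous feature: find the split point with the lowest conditional Gini index
    def calc_cond_gini_continuous(self, feat, label):
        # np.unique returns the distinct values sorted in ascending order
        sorted_feat = np.unique(feat)
        # Candidate split points: midpoints between adjacent distinct values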
        split_pos = (sorted_feat[:-1] + sorted_feat[1:]) / 2
        best_gini = float('inf')
        best_split = 0
        for pos in split_pos:
            lmask = feat < pos
            rmask = feat > pos
            
            cond_gini = sum(lmask) * self.calc_gini(label[lmask])/ len(label) + sum(rmask) * self.calc_gini(label[rmask])/ len(label)

            if cond_gini < best_gini:
                best_gini = cond_gini
                best_split = pos

        return best_gini, best_split

            
    def pruning(self, tree, x_test, y_test):
        """
        Post-pruning.

        Prune the fully built decision tree using the test set.
        """
        # TODO
        pass
    
    def preorder(self):
        """
        Pre-order traversal of the decision tree.
        """
        print('--- PreOrder ---')
        tree = self.tree
        self._preorder(tree)

    def _preorder(self, tree):
        if tree is None:
            return
        print(tree)
        self._preorder(tree.left)
        self._preorder(tree.right)

Build a CART decision tree and run classification, again using the Iris dataset as the example.

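import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the data
data = load_iris()
x, y = data['data'], data['target']

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=20190320, test_size=0.1)

# _build expects a DataFrame so that features can be looked up by column name
x_train = pd.DataFrame(x_train, columns=data.feature_names, index=None)

# Build the decision tree
tree = CARTClassifier()
tree.fit(x_train, y_train)

# Pre-order traversal of the CART decision tree
tree.preorder()

# Run prediction; the accuracy is stored in tree.score
y_pred = tree.predict(x_test, y_test)
print(tree.score)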

Output of the implementation