[sklearn decision trees] Using the DecisionTreeClassifier API, with a decision tree code example: iris classification

Date: 2020-06-09


Decision tree algorithms

The three main decision tree algorithms are ID3, C4.5, and CART.

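ID3 starts at the root of the tree and always selects the feature with the largest information gain. It creates child nodes by applying a test on that feature and recurses until the information gain becomes very small or no features remain. Information gain: the information gain $g(D, A)$ of a feature A with respect to a training set D is defined as the difference between the entropy $H(D)$ of D and the conditional entropy $H(D \mid A)$ of D given feature A. Entropy is a measure of the uncertainty of a random variable.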

$$ g(D, A) = H(D) - H(D \mid A) $$
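To make the formula concrete, here is a minimal sketch in plain Python that computes $H(D)$, $H(D \mid A)$, and $g(D, A)$ for a single categorical feature. The function names and the toy "outlook"/"play" data are made up for illustration; they do not come from scikit-learn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(D), in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D | A) for one categorical feature A."""
    total = len(labels)
    # Group the labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # H(D | A): entropy of each group, weighted by the group's share of D.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]   # determines the label exactly
coin = ["a", "b", "a", "b"]                    # tells us nothing about the label

gain_perfect = information_gain(outlook, play)
gain_useless = information_gain(coin, play)
print(gain_perfect, gain_useless)
```

A feature that splits the labels perfectly recovers the full entropy of D as gain, while an uninformative feature yields a gain of zero. ID3 would pick the first one.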

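C4.5 selects features by the information gain ratio instead, which corrects the bias of plain information gain toward features with many distinct values. It is regarded as an improvement on ID3.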
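The gain ratio divides the information gain by the entropy of the feature's own value distribution (often called the split information). The sketch below is a simplified illustration of that idea, not the full C4.5 algorithm; the helper names and toy data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain divided by the split information of the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)  # entropy of the feature's own values
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: both features separate the labels perfectly,
# but the second one is an ID-like column with a unique value per row.
labels = ["no", "no", "yes", "yes"]
binary_feature = ["a", "a", "b", "b"]
id_like_feature = ["r1", "r2", "r3", "r4"]

ratio_binary = gain_ratio(binary_feature, labels)
ratio_id_like = gain_ratio(id_like_feature, labels)
print(ratio_binary, ratio_id_like)
```

Both features have the same information gain, but the ID-like feature gets a lower gain ratio because its split information is larger.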

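Both algorithms are nevertheless prone to overfitting, so the resulting trees need to be pruned.

Pruning a decision tree amounts to optimizing a loss function in order to remove unnecessary splits, which reduces the overall complexity of the model.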
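One concrete form of this idea is minimal cost-complexity pruning, where the loss is the tree's training error plus a penalty alpha times the number of leaves. scikit-learn exposes it through the ccp_alpha parameter and the cost_complexity_pruning_path method. The sketch below picks a mid-range alpha purely for illustration; in practice alpha would normally be chosen by validation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow an unpruned tree, then ask for the alphas at which subtrees get pruned.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Arbitrary choice for illustration: an alpha from the middle of the path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning: ", pruned_tree.get_n_leaves())
```

Larger values of alpha penalize leaves more heavily and therefore produce smaller trees.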

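When growing a tree, CART uses Gini index minimization for classification trees and squared-error minimization for regression trees. CART also includes pruning: it cuts subtrees from the bottom of the fully grown tree to make the model simpler.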
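For class proportions $p_k$, the Gini index of a node is $1 - \sum_k p_k^2$, and a candidate split is scored by the size-weighted Gini index of its two children. A small sketch in plain Python, with made-up helper names and labels:

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    """Score of a binary split: child Gini indices weighted by child size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


pure = gini([0, 0, 0, 0])          # a node with a single class
mixed = gini([0, 0, 1, 1])         # a 50/50 node
split_score = weighted_gini([0, 0, 0, 1], [1, 1, 1, 1])
print(pure, mixed, split_score)
```

CART evaluates candidate thresholds in this way and keeps the split with the lowest weighted Gini index.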

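On the implementation side, the DecisionTreeClassifier class in scikit-learn supports multi-class classification.

1. Using the DecisionTreeClassifier API

Like other classifiers, DecisionTreeClassifier takes two arrays as input:
X: training data, a sparse or dense matrix of shape [n_samples, n_features]
Y: class labels, an integer array of shape [n_samples]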

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

After fitting, the model can be used to predict the class of new samples:

clf.predict([[2., 2.]])

array([1])

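The model can also predict the probability of each class, which is the fraction of training samples of that class in the leaf the sample falls into (here 0% and 100%):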

clf.predict_proba([[2., 2.]])

array([[0., 1.]])
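The probabilities above are all-or-nothing because every leaf is pure. The following sketch uses made-up toy data in which three identical samples carry mixed labels, so one leaf cannot be purified and predict_proba returns a genuine fraction.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the three samples with feature value 0 cannot be
# separated from each other, so they end up in the same leaf.
X = [[0], [0], [0], [1]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# 2 of the 3 training samples in this leaf are class 0, 1 of 3 is class 1.
proba_mixed_leaf = clf.predict_proba([[0]])
# The other leaf contains a single class-1 sample.
proba_pure_leaf = clf.predict_proba([[1]])
print(proba_mixed_leaf, proba_pure_leaf)
```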

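The DecisionTreeClassifier() constructor also accepts many parameters. For example:

criterion = "gini" / "entropy": the function used to measure the quality of a split, either the Gini index or entropy.
splitter = "best" / "random": the strategy used to choose the split at each node, either the best split or the best random split.
max_depth = int: the maximum depth of the tree; limiting it helps prevent overfitting.
min_samples_leaf = int: the minimum number of samples required at a leaf node; raising it restricts how far the tree can grow, which acts as a form of pruning.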
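As a rough illustration of how these parameters shape the model, the sketch below fits a default tree and a constrained tree on the iris data (introduced in the next section) and compares their size. The particular values max_depth=3 and min_samples_leaf=5 are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Default settings: the tree grows until every leaf is pure.
default_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained settings (values chosen arbitrarily for illustration).
constrained_tree = DecisionTreeClassifier(
    criterion="entropy",
    splitter="best",
    max_depth=3,
    min_samples_leaf=5,
    random_state=0,
).fit(X, y)

for name, model in [("default", default_tree), ("constrained", constrained_tree)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

The constrained tree is shallower and has fewer leaves, trading some training accuracy for a simpler model.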

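2. Building a decision tree from the iris dataset

The iris dataset: its proper name is the Iris Data Set, and it contains 150 rows. Each row consists of 4 feature values and one target value. The 4 features are sepal length, sepal width, petal length, and petal width. The target is one of three iris species: Iris Setosa, Iris Versicolour, and Iris Virginica.

DecisionTreeClassifier can be used for both binary and multi-class classification.
For the iris dataset, a decision tree can be built as follows: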

from sklearn.datasets import load_iris
from sklearn import tree
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)

2.1 Plotting the tree

Once the model is fitted, the tree can be drawn with the plot_tree() function:

tree.plot_tree(clf)
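plot_tree also accepts feature and class names, which makes the figure easier to read. The sketch below assumes a script run without a display, so it selects the non-interactive "Agg" backend and writes the figure to a file; the file name iris_tree.png is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(12, 8))
annotations = tree.plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),  # plain list of class name strings
    filled=True,                          # color nodes by majority class
    ax=ax,
)
fig.savefig("iris_tree.png", dpi=150)
```

In a Jupyter notebook the backend line and savefig call can be dropped, since the figure is rendered inline.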

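2.2 Exporting the tree with Graphviz

The tree can also be exported in Graphviz format with export_graphviz.
If you use the conda package manager, the first command below installs both the Graphviz binaries and the Python package. Otherwise, install the Graphviz binaries separately and get the Python wrapper from PyPI with the second command: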

conda install python-graphviz
pip install graphviz

The following exports the decision tree trained on the iris dataset above with Graphviz; the result is saved as iris.pdf:

import graphviz
iris = load_iris()
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")

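export_graphviz also accepts options that improve the appearance of the output, such as coloring nodes by class (or by value for regression) and using explicit feature and class names.
Jupyter Notebook renders these plots inline automatically: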

dot_data = tree.export_graphviz(clf, out_file=None,
                      feature_names=iris.feature_names,
                      class_names=iris.target_names,
                      filled=True, rounded=True,
                      special_characters=True)
graph = graphviz.Source(dot_data)
graph

2.3 Exporting the tree as text

The tree can also be exported as plain text with the export_text function. This approach needs no extra packages and is more compact.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text  # the sklearn.tree.export module was removed in later releases
iris = load_iris()
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(iris.data, iris.target)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)

|--- petal width (cm) <= 0.80
|   |--- class: 0
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: 1
|   |--- petal width (cm) >  1.75
|   |   |--- class: 2
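The printed rules can be read as nested if statements. As a sanity check, the sketch below hand-codes the two thresholds shown above into a hypothetical helper, classify_by_rules, and measures how often it agrees with the fitted tree on the training data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(iris.data, iris.target)


def classify_by_rules(petal_width_cm):
    """Hand-written version of the rules printed by export_text above."""
    if petal_width_cm <= 0.80:
        return 0
    if petal_width_cm <= 1.75:
        return 1
    return 2


petal_width = iris.data[:, 3]  # column 3 is "petal width (cm)"
rule_predictions = np.array([classify_by_rules(w) for w in petal_width])
tree_predictions = decision_tree.predict(iris.data)

# Fraction of training samples on which the rules and the tree agree.
agreement = (rule_predictions == tree_predictions).mean()
print(agreement)
```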

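3. Plotting the decision surface

Plot the decision surfaces of trees trained on pairs of features. The decision boundaries are combinations of simple thresholds learned from the training set.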

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Parameters
n_classes = 3
plot_colors = "ryb"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        # c is a single named color, so no colormap is needed here
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    edgecolor='black', s=15)

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.axis("tight")

plt.figure()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)
plot_tree(clf, filled=True)
plt.show()

Automatically created module for IPython interactive environment

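4. Splitting the dataset and evaluating the results

Loading the dataset

from sklearn import datasets # import the datasets module

iris = datasets.load_iris() # load the iris dataset
iris_feature = iris.data # feature data
iris_target = iris.target # class labels

Splitting the dataset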

from sklearn.model_selection import train_test_split

feature_train, feature_test, target_train, target_test = train_test_split(iris_feature, iris_target, test_size=0.33, random_state=42)

Training the model and predicting

from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier() # all parameters left at their defaults
dt_model.fit(feature_train,target_train) # train the model on the training set
predict_results = dt_model.predict(feature_test) # predict on the test set

Evaluating the results

scores = dt_model.score(feature_test, target_test)
scores

1.0
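A score from a single split can be optimistic and depends on how the data happened to be divided. As a complementary check, the sketch below runs 5-fold cross-validation and prints a confusion matrix for the same split as above. Fixing random_state=0 on the classifier is an assumption added here for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: accuracy on five different held-out folds.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", cv_scores)
print("mean accuracy:  ", cv_scores.mean())

# Confusion matrix on the same train/test split used above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)  # rows are true classes, columns are predicted classes
```

Off-diagonal entries in the confusion matrix show which species get confused with each other, which a single accuracy number hides.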

