Preface:
An important task of decision trees is to understand the knowledge contained in data: a decision tree can be applied to an unfamiliar data set and extract a series of rules from it, and the process by which the machine builds those rules from the data set is the machine-learning process.
Advantages of decision trees:
1: Low computational complexity
2: Output is easy to interpret
3: Insensitive to missing intermediate values
4: Can handle irrelevant features
Disadvantage: prone to overfitting (over-matching the training data)
Applicable data types: numerical and nominal
Implementing a Decision Tree step by step in Python, in the following steps:
Load the data set
Compute entropy
Split the data on the best splitting feature
Choose the best splitting feature by maximum information gain
Build the decision tree recursively
Classify samples
1. Load the data set
from numpy import *
#load "iris.data" to workspace
traindata = loadtxt("D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (0,1,2,3),dtype = float)
trainlabel = loadtxt("D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (range(4,5)),dtype = str)
feaname = ["#0","#1","#2","#3"] # feature names of the 4 attributes (features)
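For reference, each row of the UCI iris.data file is a comma-separated record with four numeric measurements (sepal length, sepal width, petal length, petal width) followed by the class name, for example:

5.1,3.5,1.4,0.2,Iris-setosa

loadtxt therefore reads the four numeric columns into traindata and the class-name column into trainlabel.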
2. Compute entropy
Entropy was introduced by Shannon (a giant of information theory); see the Wikipedia definition.
Note that the entropy computed here is H(C|X=xi), not H(C|X); H(C|X) is computed in the next step by weighting each branch's entropy by its probability and summing: H(C|X) = Σ_i p(xi)·H(C|X=xi), where H(C|X=xi) = -Σ_c p(c|xi)·log2 p(c|xi).
Code:
from math import log

def calentropy(label):
    n = label.size   # the number of samples
    #print n
    count = {}       # dictionary: class label -> number of samples
    for curlabel in label:
        if curlabel not in count.keys():
            count[curlabel] = 0
        count[curlabel] += 1
    entropy = 0
    #print count
    for key in count:
        pxi = float(count[key])/n   # convert to float before dividing
        entropy -= pxi*log(pxi,2)
    return entropy
#testcode:
#x = calentropy(trainlabel)
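As a quick sanity check (my addition, not part of the original test code), calentropy should return 1.0 for labels split evenly between two classes and 0.0 for a single class:

#sanity check for calentropy (illustrative)
print calentropy(array(["a","a","b","b"]))   # 1.0: two equally likely classes
print calentropy(array(["a","a","a","a"]))   # 0.0: only one class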
3. Split the data on the best splitting feature
Assume we have already found the best splitting feature; here we perform the split (the best feature index is splitfea_idx).
The second function, idx2data, takes the two index sets returned by splitdata and returns datal (samples whose feature value is less than the pivot), datag (samples greater than the pivot), labell and labelg. Here we use the mean of the selected feature as the pivot.
Code:
#split the dataset according to feature "splitfea_idx"
def splitdata(oridata,splitfea_idx):
    arg = args[splitfea_idx]   # pivot: the mean of this feature over all samples
    idx_less = []      # indices of samples whose feature value is less than the pivot
    idx_greater = []   # indices of samples whose feature value is >= the pivot
    n = len(oridata)
    for idx in range(n):
        d = oridata[idx]
        if d[splitfea_idx] < arg:
            #add this entry to the "less" index set
            idx_less.append(idx)
        else:
            idx_greater.append(idx)
    return idx_less,idx_greater

#testcode:
#idx_less,idx_greater = splitdata(traindata,2)

#return the data and labels according to the split indices
def idx2data(oridata,label,splitidx,fea_idx):
    idxl = splitidx[0]   # indices of the "less" split
    idxg = splitidx[1]   # indices of the "greater" split
    datal = []
    datag = []
    labell = []
    labelg = []
    for i in idxl:   # drop the used feature column fea_idx from each sample
        datal.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
    for i in idxg:
        datag.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
    labell = label[idxl]
    labelg = label[idxg]
    return datal,datag,labell,labelg
Here args holds the splitting thresholds (one per feature: samples with a value greater than the threshold go to the ">" branch, those with a smaller value go to the "<" branch). We can define it as follows:
args = mean(traindata,axis = 0)
Test: split on feature #2; the resulting 'less' and 'greater' index sets are shown below.
That is, the sample set is split at args[2], and the branches with values < and > args[2] contain 57 and 93 samples, respectively.
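Putting the two functions together, a minimal usage sketch (assuming args has been computed as above):

#illustrative usage: split on feature #2, then materialize the two subsets
idx_less,idx_greater = splitdata(traindata,2)
print len(idx_less), len(idx_greater)      # e.g. 57 93 on the full Iris set
datal,datag,labell,labelg = idx2data(traindata,trainlabel,(idx_less,idx_greater),2)
print len(datal), len(datag)               # feature #2 has been removed from datal/datag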
4. Choose the best splitting feature by maximum information gain
The information gain is info_gain in the code; the conditional-entropy formula appears in the comment inside the loop.
Code:
#select the best feature to split on
def choosebest_splitnode(oridata,label):
    n_fea = len(oridata[0])
    n = len(label)
    base_entropy = calentropy(label)
    best_gain = -1
    for fea_i in range(n_fea):   # compute the conditional entropy for each candidate splitting feature
        cur_entropy = 0
        idxset_less,idxset_greater = splitdata(oridata,fea_i)
        prob_less = float(len(idxset_less))/n
        prob_greater = float(len(idxset_greater))/n
        #entropy(C|X) = \sum_i p(xi)*entropy(C|X=xi)
        cur_entropy += prob_less*calentropy(label[idxset_less])
        cur_entropy += prob_greater*calentropy(label[idxset_greater])
        info_gain = base_entropy - cur_entropy   # gain = entropy before the split minus entropy after
        if(info_gain>best_gain):
            best_gain = info_gain
            best_idx = fea_i
    return best_idx
#testcode:
#x = choosebest_splitnode(traindata,trainlabel)
This test runs on the whole data set: which feature is chosen for the first split? (A small sketch below prints the gain of every feature.)
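To see the gains behind that choice, here is a short sketch (the helper name infogain_per_feature is mine, not from the original post) that prints the information gain of every feature on the full training set:

#illustrative: print the information gain of each candidate feature
def infogain_per_feature(oridata,label):
    base_entropy = calentropy(label)
    n = len(label)
    for fea_i in range(len(oridata[0])):
        idx_less,idx_greater = splitdata(oridata,fea_i)
        cond_entropy = float(len(idx_less))/n*calentropy(label[idx_less]) \
                     + float(len(idx_greater))/n*calentropy(label[idx_greater])
        print fea_i, base_entropy - cond_entropy
#infogain_per_feature(traindata,trainlabel)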
5. Build the decision tree recursively
See the code comments for details; buildtree constructs the tree recursively.
Recursion terminates when:
① the branch contains no samples (its subset is empty), or
② all samples in the split belong to the same class, or
③ no features are left: each split consumes one feature, so when the feature list is empty, recursion stops and the majority label of the current sample set is returned.
(buildtree below relies on feanamecopy, a working copy of feaname that shrinks as features are consumed; a setup sketch follows the test code.)
#create the decision tree based on information gain
def buildtree(oridata, label):
    if label.size==0:   # no samples fall into this branch
        return "NULL"
    listlabel = label.tolist()
    #stop when all samples in this subset belong to one class
    if listlabel.count(label[0])==label.size:
        return label[0]
    #return the majority label in this subset if no extra features are available
    if len(feanamecopy)==0:
        cnt = {}
        for cur_l in label:
            if cur_l not in cnt.keys():
                cnt[cur_l] = 0
            cnt[cur_l] += 1
        maxx = -1
        for keys in cnt:
            if maxx < cnt[keys]:
                maxx = cnt[keys]
                maxkey = keys
        return maxkey
    bestsplit_fea = choosebest_splitnode(oridata,label)   # get the best splitting feature
    print bestsplit_fea,len(oridata[0])
    cur_feaname = feanamecopy[bestsplit_fea]   # use the feature name as the dictionary key
    print cur_feaname
    nodedict = {cur_feaname:{}}
    del(feanamecopy[bestsplit_fea])   # delete the used feature from the name list
    split_idx = splitdata(oridata,bestsplit_fea)   # index sets for the "less" and "greater" branches
    data_less,data_greater,label_less,label_greater = idx2data(oridata,label,split_idx,bestsplit_fea)
    #build the tree recursively; the left and right subtrees are the "<" and ">" branches, respectively
    nodedict[cur_feaname]["<"] = buildtree(data_less,label_less)
    nodedict[cur_feaname][">"] = buildtree(data_greater,label_greater)
    return nodedict
#testcode:
#mytree = buildtree(traindata,trainlabel)
#print mytree
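The commented test above assumes two module-level names that the snippets so far do not define together: args (set in step 3) and feanamecopy, the mutable copy of feaname that buildtree consumes. A minimal setup sketch:

#setup needed before calling buildtree
args = mean(traindata,axis = 0)   # per-feature pivots (means)
feanamecopy = feaname[:]          # working copy; buildtree deletes used names from it
mytree = buildtree(traindata,trainlabel)
print mytree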
Result:
mytree is our result: a key such as '#1' means that node splits on feature #1, and its '<' and '>' keys hold the subtrees for the data less than and greater than the pivot, respectively.
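Structurally, each internal node is a dict of the form {feature_name: {"<": subtree, ">": subtree}} and each leaf is a class-label string (or "NULL" for an empty branch); a purely illustrative shape:

#shape of mytree (illustrative only; actual splits depend on the data and pivots)
#{'#2': {'<': 'Iris-setosa',
#        '>': {'#3': {'<': 'Iris-versicolor', '>': 'Iris-virginica'}}}}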
6. Classify samples
Classification walks the constructed mytree recursively, following the branch that matches each test value.
#classify a new sample
def classify(mytree,testdata):
    if type(mytree).__name__ != 'dict':
        return mytree
    fea_name = mytree.keys()[0] #get the name of first feature
    fea_idx = feaname.index(fea_name) #the index of feature 'fea_name'
    val = testdata[fea_idx]
    nextbranch = mytree[fea_name]
    #judge the current value > or < the pivot (average)
    if val>args[fea_idx]:
        nextbranch = nextbranch[">"]
    else:
        nextbranch = nextbranch["<"]
    return classify(nextbranch,testdata)
#testcode
tt = traindata[0]
x = classify(mytree,tt)
print x
Result:
To sanity-check the code, we change the args parameters and set them all to 0 (a very small value):
args = [0,0,0,0]
The tree-building and classification results are as follows:
As expected, no sample falls below the pivot (0), so the value under every '<' key in the dict is the empty branch ("NULL").
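Finally, as a rough end-to-end check (my own sketch, not part of the original post), rebuild the tree with the real pivots and measure accuracy on the training set:

#illustrative end-to-end check on the training set
args = mean(traindata,axis = 0)            # restore the per-feature means as pivots
feanamecopy = feaname[:]                   # fresh working copy, since buildtree consumes it
mytree = buildtree(traindata,trainlabel)
correct = 0
for i in range(len(traindata)):
    if classify(mytree,traindata[i]) == trainlabel[i]:
        correct += 1
print float(correct)/len(traindata)        # fraction of training samples classified correctly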