Implementing and Analyzing a Decision Tree in Python

Date: 2019-11-17


1. Computing the Shannon entropy of a given data set

import math

def calcShannonEnt(dataSet):
    # Calculate the Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)    # number of samples; dataSet is a list of lists
    labelCounts = {}             # empty dict: class label -> number of occurrences
    for featVec in dataSet:      # for each training sample
        currentLabel = featVec[-1]   # the last element of each sample is its class label
        if currentLabel not in labelCounts:   # first time this label is seen
            labelCounts[currentLabel] = 0     # add it to the dict with a count of 0
        labelCounts[currentLabel] += 1        # count how often each label occurs
    shannonEnt = 0.0
    for key in labelCounts:    # accumulate the entropy over all classes
        prob = float(labelCounts[key])/numEntries  # probability P(x) of this label
        shannonEnt -= prob*math.log(prob,2)   # entropy: H = -sum(P(x) * log2(P(x)))
    return shannonEnt
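
As a quick check of the formula, the sketch below recomputes the entropy for five labels, two 'yes' and three 'no', which is the class distribution of the sample data set in section 2. It is a minimal illustration built on collections.Counter; the name shannon_entropy is hypothetical and not part of the original code.

```python
import math
from collections import Counter

def shannon_entropy(class_labels):
    """Entropy in bits of a list of class labels (hypothetical helper)."""
    total = len(class_labels)
    counts = Counter(class_labels)  # label -> number of occurrences
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

labels_only = ['yes', 'yes', 'no', 'no', 'no']
entropy = shannon_entropy(labels_only)
print(round(entropy, 4))  # 0.971
```

The more evenly the classes are mixed, the higher the value; a list that holds a single class has an entropy of zero.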

2. A function that creates the data set

def createDataSet():
    dataSet = [[1,1,'yes'],
               [1,1,'yes'],
               [1,0,'no'],
               [0,1,'no'],
               [0,1,'no']]
    labels = ['no surfacing','flippers']
    return dataSet,labels

3. Splitting the data set on a given feature

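Collect all samples that satisfy X[axis] == value (that is, whose feature axis takes the given value) and return them as one subset. The column for feature axis is left out of the result because it is no longer needed after the split.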

def splitDataSet(dataSet,axis,value):
    retDataSet = []
    for featVec in dataSet: # for each training sample
        if featVec[axis] == value: # does the feature equal the given value?
            reducedFeatVec = featVec[:axis] # new list with all features before the split feature
            reducedFeatVec.extend(featVec[axis+1:]) # append all features after the split feature
            retDataSet.append(reducedFeatVec)
    return retDataSet

Note: featVec[:axis] returns a list holding the elements of featVec at indices 0 through axis - 1; featVec[axis+1:] returns a list holding all elements of featVec from index axis + 1 onward.
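
The two slices can be tried on one sample and then on the whole data set. This is a minimal sketch: split_rows is a hypothetical, condensed equivalent of splitDataSet written as a list comprehension, and data repeats the sample data set from section 2.

```python
def split_rows(rows, axis, value):
    """Keep rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

feat_vec = [1, 0, 'no']
print(feat_vec[:1])       # [1]       elements before index 1
print(feat_vec[1 + 1:])   # ['no']    elements after index 1

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
subset_axis0_is_1 = split_rows(data, 0, 1)
subset_axis0_is_0 = split_rows(data, 0, 0)
print(subset_axis0_is_1)  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(subset_axis0_is_0)  # [[1, 'no'], [1, 'no']]
```

Because slicing creates new lists, the rows of the original data set are left unchanged.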

4. Choosing the best feature to split on by maximum information gain

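Information gain = Info(D) - InfoA(D), where Info(D) is the entropy of the data set D and InfoA(D) is the entropy that remains after D is split on feature A.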

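def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1 # number of features (the last column is the class)
    baseEntropy = calcShannonEnt(dataSet) # entropy of the whole data set
    bestInfoGain = 0.0 # largest information gain found so far
    bestFeature = -1 # index of the best feature; stays -1 if no feature has a positive gain
    for i in range(numFeatures): # for each feature
        featList = [example[i] for example in dataSet] # values of feature i over all samples --> list
        print("featList",featList)  # debug output
        uniqueVals = set(featList) # set() removes duplicates, leaving the distinct values of the feature
        newEntropy = 0.0 # entropy after splitting on feature i
        for value in uniqueVals: # for each value of the feature
            subDataSet = splitDataSet(dataSet,i,value) # samples whose feature i equals value
            prob = len(subDataSet)/float(len(dataSet)) # fraction of samples that take this value
            newEntropy += prob * calcShannonEnt(subDataSet) # weighted entropy of the subset
        infoGain = baseEntropy - newEntropy # information gain of feature i
        if(infoGain > bestInfoGain): # keep the feature with the largest gain
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature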
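
To make the selection concrete, the sketch below works out the gain of both features on the sample data set from section 2. It is a standalone illustration under stated assumptions: entropy and info_gain are hypothetical helper names, and the subsets keep the split column, which does not affect the result because only the class column enters the entropy.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy (bits) of the class column, the last element of each row."""
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

def info_gain(rows, axis):
    """Info(D) - InfoA(D) for the feature in column `axis`."""
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)  # weighted subset entropy
    return entropy(rows) - remainder

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(data, axis) for axis in range(2)]
print([round(g, 3) for g in gains])  # [0.42, 0.171]
best_axis = max(range(2), key=lambda axis: gains[axis])
print(best_axis)  # 0
```

Feature 0 ('no surfacing') separates the classes better than feature 1 ('flippers'), so it is the one chosen for the first split.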

5. Building the tree recursively

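The recursive construction consumes one feature at every split, so the features can run out before the samples in a node all share one class. In that case the class of the node is decided by majority vote.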

import operator

def majorityCnt(classList):
    '''
        Decide the class of a leaf node by majority vote
    '''
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # items() must be called; sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]  # the most frequent class label
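
The same vote can be written more compactly with collections.Counter from the standard library. This is an alternative sketch, not the code used above; majority_label is a hypothetical name.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label in class_list (hypothetical helper)."""
    # most_common(1) returns a one-element list holding a (label, count) pair
    return Counter(class_list).most_common(1)[0][0]

winner = majority_label(['yes', 'no', 'no'])
print(winner)  # no
```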

6. The function that creates the tree

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet] # class labels of all samples
    if classList.count(classList[0]) == len(classList): # all samples share one class: stop and return it
        return classList[0]
    if len(dataSet[0]) == 1: # only the class column is left: return the most frequent class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # index of the feature with the largest information gain
    bestFeatLabel = labels[bestFeat]  # name of that feature
    myTree = {bestFeatLabel:{}} # start the (sub)tree
    del(labels[bestFeat])  # remove the used feature name; note that this modifies the caller's list
    featValues = [example[bestFeat] for example in dataSet] # values of the chosen feature
    uniqueVals = set(featValues)  # its distinct values
    for value in uniqueVals:
        subLabels = labels[:]  # copy the labels for each branch
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels) # recurse until every branch ends in a class
    return myTree
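
One side effect is worth noting: del(labels[bestFeat]) deletes from the very list object the caller passed in, so that list is shorter after createTree returns. The standalone sketch below contrasts in-place deletion with deletion from a copy; both function names are hypothetical and exist only for this illustration.

```python
def remove_in_place(labels, index):
    del labels[index]          # modifies the list object the caller passed in

def remove_from_copy(labels, index):
    remaining = labels[:]      # shallow copy; the caller's list stays intact
    del remaining[index]
    return remaining

names = ['no surfacing', 'flippers']
rest = remove_from_copy(names, 0)
print(names, rest)   # ['no surfacing', 'flippers'] ['flippers']

remove_in_place(names, 0)
print(names)         # ['flippers']
```

Passing labels[:] to createTree, or rebuilding the label list before classification, avoids depending on a list that has already been shortened.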

7. Classifying a test vector testVec with the trained decision tree

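def classify(inputTree, featLabels, testVec):
    # inputTree: the decision tree returned by createTree()
    # featLabels: the feature names; testVec: the feature values of the sample to classify
    firstStr = list(inputTree.keys())[0]  # the first key of the tree: the feature tested at this node
    secondDict = inputTree[firstStr]  # the value under that key: a nested dict of branches
    print("secondDict",secondDict)  # debug output
    featIndex = featLabels.index(firstStr)  # position of this feature in featLabels (tree node -> feature position -> position in testVec)
    classLabel = None  # stays None if no branch matches the test value
    for key in secondDict.keys():  # iterate over the branch values
        if testVec[featIndex] == key:  # the test sample follows this branch
            if isinstance(secondDict[key], dict):  # the branch is a subtree: classification is not finished, recurse
                classLabel = classify(secondDict[key], featLabels, testVec)  # only the first argument changes, descending one level
            else:
                classLabel = secondDict[key]  # the branch is a leaf: its value is the class
    return classLabel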

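myDat, labels = createDataSet()
myTree = createTree(myDat,labels)
print("labels",labels)  # createTree has deleted the used feature name, so this list is shorter now

print("result",classify(myTree,['no surfacing','flippers'],[0,1]))  # pass the full label list again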
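
The lookup can also be written as a loop instead of a recursion. The sketch below is an alternative under stated assumptions: classify_iter is a hypothetical name, and the tree literal is what tracing createTree by hand on the five-sample data set should yield, so treat it as illustrative rather than as recorded output.

```python
def classify_iter(tree, feat_labels, test_vec):
    """Walk a nested-dict tree; return None when a feature value has no branch."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                  # the feature tested at this node
        branches = node[feature]                    # feature value -> subtree or class
        value = test_vec[feat_labels.index(feature)]
        if value not in branches:
            return None
        node = branches[value]                      # a subtree (dict) or a class label
    return node

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feat_labels = ['no surfacing', 'flippers']
print(classify_iter(tree, feat_labels, [0, 1]))  # no
print(classify_iter(tree, feat_labels, [1, 1]))  # yes
print(classify_iter(tree, feat_labels, [1, 0]))  # no
```

Each pass through the loop descends one level, which mirrors how the recursive version changes only its first argument.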