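Machine Learning: The Decision Tree (ID3) Algorithm

I recently worked through the decision tree chapter of Machine Learning in Action (《機器學習實戰》). Below I use the example from the book to review ID3, one of the algorithms for constructing decision trees.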

Marine animal data:

	Can survive without surfacing	Has flippers	Is a fish
1	Yes	Yes	Yes
2	Yes	Yes	Yes
3	Yes	No	No
4	No	Yes	No
5	No	Yes	No

Converted into a data set:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

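1. Background

1.1 The entropy function

I think of entropy simply as a measure of how disordered the data is. The more ordered the data, the lower the entropy; the more mixed or scattered the data, the higher the entropy. So after a data set is split, the more uniform the labels in each part, the lower the entropy, and the more scattered the labels, the higher the entropy.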
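As a quick illustration of this intuition, here is a minimal sketch (written for Python 3) that measures the entropy of three hand-made label lists. The helper name label_entropy and the sample lists are hypothetical, chosen only for illustration; they are not from the book.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

pure = ['no', 'no', 'no', 'no']      # every label identical
mixed = ['yes', 'no', 'no', 'no']    # mostly one label
even = ['yes', 'yes', 'no', 'no']    # evenly split between two labels

for name, labels in [('pure', pure), ('mixed', mixed), ('even', even)]:
    print(name, label_entropy(labels))
```

The pure list has entropy 0, the evenly split list has the largest value of the three, and the mostly-uniform list falls in between.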

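A more theoretical explanation:

Entropy is defined as the expected value of information. What, then, is information? If the items to be classified can fall into several classes, the information of the symbol x_i is defined as:

$$l(x_i) = -\log_2 p(x_i)$$

where p(x_i) is the probability of choosing that class, i.e. the number of samples in the class divided by the total number of samples.

To compute the entropy, we need the expected value of the information over all possible values of all classes:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where n is the number of classes.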
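As a worked example added for illustration: in the marine animal data, 2 of the 5 samples are labeled yes and 3 are labeled no, so the entropy can be evaluated by hand. The intermediate values are rounded.

```latex
H = -\sum_{i=1}^{2} p(x_i)\log_2 p(x_i)
  = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{3}{5}\log_2\tfrac{3}{5}\right)
  \approx 0.529 + 0.442
  \approx 0.971
```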

Computing the Shannon entropy of a given data set:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # Build a dictionary counting the samples for each label
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # Compute the Shannon entropy with the formula above
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

Running the code: the data set myDat1 has only two classes, while myDat2 has three:

>>> myDat1

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

>>> trees.calcShannonEnt(myDat1)

0.9709505944546686

>>> myDat2

[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

>>> trees.calcShannonEnt(myDat2)

1.3709505944546687

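1.2 Information gain

Information gain measures how much more ordered the data (the labels) become after the data set is split.

Information gain = Shannon entropy of the original data set - Shannon entropy after the split, where the latter is the sum of the entropies of the resulting subsets, each weighted by its share of the samples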
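Written as a formula, with the entropy after the split taken as the size-weighted sum over the subsets (this is what the code in section 3 computes). The symbols D, A, D_v and Gain are notation assumed here for illustration:

```latex
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

Here D is the data set, A is the feature being split on, and D_v is the subset of D in which A takes the value v.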

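2. Splitting the data set on a given feature

Three input parameters: the data set to split, the index of the feature to split on, and the value of that feature that a sample must match.

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # Keep every element except the one at the split position
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            # Collect the reduced samples into a subset
            retDataSet.append(reducedFeatVec)
    return retDataSet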

Output:

>>> myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

>>> trees.splitDataSet(myDat,0,1)

[[1, 'yes'], [1, 'yes'], [0, 'no']]

>>> trees.splitDataSet(myDat,0,0)

[[1, 'no'], [1, 'no']]
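The same split can be sketched more compactly with a list comprehension. This is only an alternative formulation under the assumption that each sample is a plain list; the name split_data_set is hypothetical and not from the book.

```python
def split_data_set(data_set, axis, value):
    """Keep rows whose feature at `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(split_data_set(my_dat, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_data_set(my_dat, 0, 0))  # [[1, 'no'], [1, 'no']]
```

Slicing and concatenating builds a new list for every row, so the input data set is left unchanged, just as with the extend-based version.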

3. Choosing the best way to split the data set, i.e. picking the most suitable feature to split on

def chooseBestFeatureToSplit(dataSet):
    # Number of features (the last column is the class label)
    numFeatures = len(dataSet[0]) - 1
    # Shannon entropy of the original data set
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        # Collect the i-th feature of every sample
        featList = [example[i] for example in dataSet]
        # Distinct values of the current feature
        uniqueVals = set(featList)
        newEntropy = 0.0
        # Split on each value and accumulate the weighted Shannon entropy
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        # Keep the feature with the largest information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
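To see why the first feature wins on the marine animal data, the information gain of each feature can be computed separately. The following is a self-contained Python 3 sketch; the helper names entropy and info_gain are hypothetical and the code is a condensed restatement, not the book's implementation.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the size-weighted entropy after it."""
    after = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(data, axis) for axis in range(len(data[0]) - 1)]
print(gains)                    # roughly [0.42, 0.17]
print(gains.index(max(gains)))  # 0, i.e. 'no surfacing'
```

Splitting on 'no surfacing' leaves one pure subset and one with a 2:1 mix, whereas splitting on 'flippers' leaves a subset of four samples that is evenly mixed, so the first split removes more disorder.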

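4. If the data set has used up all of its features but the class labels are still not unique, a majority vote decides the class of that leaf node.

import operator

def majorityCnt(classList):
    # Count how many times each class label occurs
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by count, largest first, and return the most common label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]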
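The majority vote can also be sketched with collections.Counter from the standard library. The function name majority_label is hypothetical; this is an alternative formulation, not the book's code.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label in class_list."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['yes', 'no', 'no']))  # no
```

When two labels are tied, which one is returned depends on the order in which they were first seen, so a tie should not be relied on to break in any particular direction.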

5. Building the decision tree

Next we use the building blocks from the previous sections to build the decision tree.

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # If every sample in this subset has the same class, return that class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # If all features are used up but the labels still differ,
    # take a majority vote with majorityCnt
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # Note: this also removes the label from the caller's list
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

Recursion always makes my head spin, so to make it easier to follow, I traced one complete run of the tree-building process from start to finish.

Original data set:

dataset: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labels: [no surfacing, flippers]

After createTree(dataSet, labels) is called, the data is processed as follows (each numbered block below is one complete createTree call):

Call 1:

dataset: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labels: ['no surfacing', 'flippers']

classList = ['yes', 'yes', 'no', 'no', 'no']

Choose the best feature to split on: bestFeat = 0

bestFeatLabel = 'no surfacing'

Build the tree: myTree = {'no surfacing': {}}

After removing this feature, labels = ['flippers']

Values of this feature (no surfacing): featValues = [1, 1, 1, 0, 0]

Distinct feature values: uniqueVals = [0, 1]

(1) When the feature value is 0:

subLabels = ['flippers']

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [[1, 'no'], [1, 'no']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

Call 1-1:

dataset: [[1, 'no'], [1, 'no']]

labels: ['flippers']

classList = ['no', 'no']

classList contains only one class, so the call returns 'no'

myTree[bestFeatLabel][0] = 'no'

myTree[bestFeatLabel] = {0: 'no'}

That is, myTree = {'no surfacing': {0: 'no'}}

(2) When the feature value is 1:

subLabels = ['flippers']

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [[1, 'yes'], [1, 'yes'], [0, 'no']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

Call 1-2:

dataset: [[1, 'yes'], [1, 'yes'], [0, 'no']]

labels: ['flippers']

classList = ['yes', 'yes', 'no']

Choose the best feature to split on: bestFeat = 0

bestFeatLabel = 'flippers'

Build the tree: myTree = {'flippers': {}}

After removing this feature, labels = []

Values of this feature (flippers): featValues = [1, 1, 0]

Distinct feature values: uniqueVals = [0, 1]

(1) When the feature value is 0:

subLabels = []

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [['no']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

Call 1-2-1:

dataset: [['no']]

labels: []

classList = ['no']

classList contains only one class, so the call returns 'no'

myTree[bestFeatLabel][0] = 'no'

myTree[bestFeatLabel] = {0: 'no'}

That is, myTree = {'flippers': {0: 'no'}}

(2) When the feature value is 1:

subLabels = []

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [['yes'], ['yes']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

Call 1-2-2:

dataset: [['yes'], ['yes']]

labels: []

classList = ['yes', 'yes']

classList contains only one class, so the call returns 'yes'

myTree[bestFeatLabel][1] = 'yes'

myTree[bestFeatLabel] = {0: 'no', 1: 'yes'}

That is, myTree = {'flippers': {0: 'no', 1: 'yes'}}

Call 1-2 returns this subtree to call 1, where:

myTree[bestFeatLabel][1] = {'flippers': {0: 'no', 1: 'yes'}}

myTree[bestFeatLabel] = {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}

That is, myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
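The trace can also be reproduced mechanically. Below is a condensed, self-contained Python 3 sketch that prints each call indented by its recursion depth. All names here (entropy, split, best_feature, build_tree) are hypothetical, and the sketch differs from the code in the previous sections in two ways: it picks the feature with max, so ties and zero-gain cases resolve to the lowest index, and it copies the label list instead of deleting from it.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def split(rows, axis, value):
    """Rows whose feature at `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def best_feature(rows):
    """Index of the feature with the largest information gain."""
    def gain(axis):
        after = 0.0
        for value in set(row[axis] for row in rows):
            subset = split(rows, axis, value)
            after += len(subset) / len(rows) * entropy(subset)
        return entropy(rows) - after
    return max(range(len(rows[0]) - 1), key=gain)

def build_tree(rows, labels, depth=0):
    """Build the nested-dict tree, printing one line per call."""
    indent = '  ' * depth
    class_list = [row[-1] for row in rows]
    print(f'{indent}call: rows={rows} labels={labels}')
    if class_list.count(class_list[0]) == len(class_list):
        print(f'{indent}-> leaf {class_list[0]!r}')
        return class_list[0]
    if len(rows[0]) == 1:
        # No features left: fall back to a majority vote
        return Counter(class_list).most_common(1)[0][0]
    axis = best_feature(rows)
    # Build a new label list so the caller's list is left untouched
    sub_labels = labels[:axis] + labels[axis + 1:]
    tree = {labels[axis]: {}}
    for value in sorted(set(row[axis] for row in rows)):
        tree[labels[axis]][value] = build_tree(split(rows, axis, value), sub_labels, depth + 1)
    return tree

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
tree = build_tree(data, labels)
print(tree)
```

Each indented line corresponds to one of the nested calls in the walkthrough, and the final dictionary matches the tree derived by hand.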

[Figure: visualization of the decision tree from the example]
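For a quick look at the tree without a plotting library, the nested dictionary can be printed as indented text. This is a minimal Python 3 sketch; the function name render and the output layout are hypothetical choices, not part of the book's code.

```python
def render(tree, indent=0):
    """Return the nested-dict tree as a list of indented text lines."""
    pad = '  ' * indent
    if not isinstance(tree, dict):
        return [f'{pad}-> {tree}']      # leaf: a class label
    lines = []
    feature = next(iter(tree))          # the single key is the feature tested here
    for value, subtree in tree[feature].items():
        lines.append(f'{pad}{feature} = {value}')
        lines.extend(render(subtree, indent + 1))
    return lines

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print('\n'.join(render(my_tree)))
```

Each "feature = value" line is a branch, and each "->" line is the class assigned at a leaf.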

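6. Classifying with the decision tree

def classify(inputTree, featLabels, testVec):
    # The single key of the dict is the feature tested at this node
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    # Find the position of that feature in the test vector
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                # Internal node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                # Leaf node: this is the class
                classLabel = secondDict[key]
    return classLabel

Output: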

>>> myTree

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

>>> labels

['no surfacing', 'flippers']

>>> trees.classify(myTree,labels,[1,0])

'no'

>>> trees.classify(myTree,labels,[1,1])

'yes'
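The recursive classify has no branch for a feature value that never appeared in the training data. Here is a sketch of an iterative variant that returns a caller-supplied default in that case. The name classify_safe and the default parameter are hypothetical additions for illustration, written for Python 3.

```python
def classify_safe(tree, feat_labels, test_vec, default=None):
    """Walk the nested-dict tree; return `default` for unseen feature values."""
    while isinstance(tree, dict):
        feature = next(iter(tree))                  # feature tested at this node
        branches = tree[feature]
        value = test_vec[feat_labels.index(feature)]
        if value not in branches:
            return default                          # no branch for this value
        tree = branches[value]                      # descend one level
    return tree                                     # reached a leaf label

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_safe(my_tree, labels, [1, 0]))  # no
print(classify_safe(my_tree, labels, [1, 1]))  # yes
print(classify_safe(my_tree, labels, [2, 1]))  # None
```

Whether an unseen value should map to a default, to the majority class, or to an error is a design choice that depends on the application.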

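7. Storing the decision tree

Building a decision tree is time-consuming, but classifying with a tree that has already been built is fast, so it is worth storing the tree with the pickle module.

import pickle

def storeTree(inputTree, filename):
    # pickle writes bytes, so open the file in binary mode
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)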
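A quick round trip shows that the stored tree comes back unchanged. This Python 3 sketch writes to a temporary directory so that it leaves nothing behind; the file name classifierStorage.txt is a hypothetical choice.

```python
import os
import pickle
import tempfile

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifierStorage.txt')  # hypothetical file name
    with open(path, 'wb') as fw:       # binary mode is required for pickle in Python 3
        pickle.dump(my_tree, fw)
    with open(path, 'rb') as fr:
        restored = pickle.load(fr)

print(restored == my_tree)  # True
```

Pickle keeps the integer keys 0 and 1 as integers, so the restored tree can be passed straight back to the classifier. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.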

References:

[1] Peter Harrington, Machine Learning in Action (《機器學習實戰》)

[2] Notes on Machine Learning in Action: Decision Trees (ID3), https://www.cnblogs.com/DianeSoHungry/p/7059104.html