Machine Learning in Action with Python (Part 2)

Date: 2019-12-19

Original article: http://www.cnblogs.com/fydeblog/p/7159775.html

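Preface

This notebook covers the decision tree algorithm from supervised machine learning: how the tree is constructed, how to draw it with the matplotlib library, and how to use it to predict contact lens types.
OS: Ubuntu 14.04 (Windows is fine too). Environment: Anaconda, Python 2.7, Jupyter notebook. Reference: the book Machine Learning in Action and its source code. Notebook writer: 方陽

Note: the notebook assumes Python 2.7; running it under Python 3.6 will cause problems. My directories are probably different from yours, so remember to change the paths when you run it yourself. The notebook, the code and the data sets are on the Baidu cloud drive linked at the end, so you can download them.

How a decision tree works: the data set is split again and again on its features until every feature available for splitting has been used, or until all instances under each branch belong to the same class. At that point the algorithm stops.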

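Pros, cons and applicable data types of decision trees
Pros: low computational cost, results that are easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Cons: prone to overfitting.
Applicable data types: numeric and nominal values.

Let's start with a small example of what a decision tree does. Put simply, a decision tree is a classifier based on features. Take email: there are many kinds of mail, such as mail that needs a prompt reply, mail to read when you are bored, and spam, and we need to tell them apart. For example, whether a message contains your name or your friends' names is a feature that splits mail into two classes: mail that needs a prompt reply and everything else. The remaining mail can then be split again: if words such as buy or money appear, the message is promotional spam, so the rest divides into mail to read when bored and spam.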

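1. Constructing the decision tree

1.1 Information gain

A data set has several features. Which one should we split on first, and what makes one split better than another?

The guiding principle of splitting a data set is to make disordered data more ordered, so that the classes separate more cleanly. This leads to the concept of information gain, defined as the change in information before and after the data set is split. The larger the change, the better the split, so the feature with the highest information gain is the best choice for splitting.

This brings in another concept, entropy from information theory, which measures the information in a set. The more the entropy changes, the larger the information gain. Information gain is the reduction in entropy, that is, the reduction in the disorder of the data.

In information theory the information of a symbol x is defined as l(x) = -log2(p(x)). All logarithms here are base 2, which will not be repeated below.

The entropy is then H = -Σ p(xi) log2(p(xi)), for i = 1, 2, ..., n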
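As a quick check of the formula, here is a minimal sketch that evaluates H by hand for a label column with two 'yes' and three 'no' entries (the same proportions as the sample data set used later). The helper name entropy_from_probs is made up for this illustration and is not part of the book's code.

```python
from math import log

def entropy_from_probs(probs):
    # H = sum(-p * log2(p)); terms with p == 0 are skipped
    return sum(-p * log(p, 2) for p in probs if p > 0)

two_class = entropy_from_probs([2 / 5.0, 3 / 5.0])
pure = entropy_from_probs([1.0])
print(two_class)   # roughly 0.971
print(pure)        # zero: a set with a single class has no disorder
```

The more evenly the classes are mixed, the closer the two-class value gets to 1 bit.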

Now let's implement the entropy calculation for a given data set.

Reference code:

from math import log         #we use the log function to calculate the entropy
import operator

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #count the unique labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2)     #log base 2
    return shannonEnt

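How it works: first count the total number of instances in the data set. The value is used several times, so we store it in a variable explicitly. Next, create a dictionary labelCounts whose keys are the values in the last column (the class labels). If a key does not exist yet, it is added to the dictionary. Each key's value records how many times that class occurs. Finally, the frequency of each class label gives the probability of that class, and these probabilities are used to compute the Shannon entropy.

Let's test it. First we define a data set ourselves.

The data contains 5 marine animals. The features are whether the animal can survive without surfacing and whether it has flippers. The animals fall into two classes: fish and not fish.

From this data we can define a createDataSet function.

Reference code:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    #change to discrete values
    return dataSet, labels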

Put all of the code in trees.py (everything below runs in Jupyter).

cd /home/fangyang/桌面/machinelearninginaction/Ch03

/home/fangyang/桌面/machinelearninginaction/Ch03

import trees

myDat, labels = trees.createDataSet()

myDat  #old data set

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labels

['no surfacing', 'flippers']

trees.calcShannonEnt(myDat)  #calculate  the  entropy

0.9709505944546686

myDat[0][-1]='maybe'     #change the result ,and look again the entropy

myDat  #new data set

[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

trees.calcShannonEnt(myDat)   # the new entropy

1.3709505944546687

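We can see that the entropy changes when the class labels change: a different label column gives different class probabilities, so by the formula the entropy changes too.

1.2 Splitting the data set

We now have a function for computing entropy, but nothing yet that splits the data on a particular feature. So we also need a function that splits the data set on a given feature.

Reference code:

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet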

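The function takes three arguments: the data set to split (dataSet), the index of the feature to split on (axis), and the feature value to match (value). The output is the resulting subset (retDataSet).

A side note: Python passes lists to functions by reference, so any change made to a list object inside a function stays in effect for the whole lifetime of that list. To avoid this side effect we create a new list object at the start of the function. The function is called many times on the same data set, so to leave the original data unmodified it builds a new list, retDataSet.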
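The effect is easy to see in isolation. The sketch below is only an illustration (the two helper names are invented here): one function edits the caller's rows in place, the other builds new rows the way splitDataSet does.

```python
def strip_first_in_place(rows):
    # Mutates the caller's inner lists: the change is visible outside
    for row in rows:
        del row[0]
    return rows

def strip_first_copy(rows):
    # Builds new lists, so the caller's data is left untouched
    return [row[1:] for row in rows]

data = [[1, 1, 'yes'], [0, 1, 'no']]
copied = strip_first_copy(data)
after_copy = [row[:] for row in data]   # snapshot taken after the copying call

strip_first_in_place(data)
print(copied)      # new rows without the first column
print(after_copy)  # the original rows, unchanged by strip_first_copy
print(data)        # now modified by strip_first_in_place
```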

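The function itself is simple: it splits the data set on the feature that axis points to. With axis=0 the split uses the first feature, so featVec[:axis] is empty. The extend call then adds the values in featVec[axis+1:] to reducedFeatVec, and append stores reducedFeatVec as one list in retDataSet.

Here is what extend and append each do, with an example: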

a=[1,2,3]
b=[4,5,6]
a.append(b)

[1, 2, 3, [4, 5, 6]]

a=[1,2,3]
a.extend(b)

[1, 2, 3, 4, 5, 6]
As you can see, append adds b itself to a as a single element, while extend adds each element of b to a.
Now let's test the function.

myDat, labels = trees.createDataSet()  #initialization

myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

trees.splitDataSet(myDat,0,1)  #split on the first feature, keeping rows where it equals 1

[[1, 'yes'], [1, 'yes'], [0, 'no']]

trees.splitDataSet(myDat,0,0)  #change the value and compare with the previous result

[[1, 'no'], [1, 'no']]

Good, we now know how to split the data set on a given feature, but what we need is the best way to split it. So we combine the two functions above: for each feature, split on it and compute the resulting weighted entropy. We are looking for the largest information gain (the largest drop in entropy), and the feature that produces it is the best one to split on. That gives us the function chooseBestFeatureToSplit.

Reference code:

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet) #calculate the original entropy
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

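This function ties the previous two together. It first works out the number of features. The last column is the label, not a feature, so 1 is subtracted from the length of a row. It then computes the original entropy so that the entropy after each split can be compared against it, and picks the feature with the largest change. There is a double loop. The outer loop runs over the feature indices in ascending order. featList is a list of the values every sample has for that feature, and uniqueVals is the set of all distinct values of the feature. The inner loop splits the data on each value in that set and accumulates the weighted average entropy for the feature. Subtracting it from the original entropy gives the information gain. The final if statement keeps track of the largest information gain and the index of the feature that produced it.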
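To see the numbers behind that choice, here is a compact, self-contained sketch that prints the information gain of each feature on the sample data. It is a rewrite for illustration, not the book's code: the names entropy and info_gain are invented here, and collections.Counter replaces the hand-written counting loop.

```python
from math import log
from collections import Counter

def entropy(rows):
    # Shannon entropy of the class labels in the last column
    counts = Counter(row[-1] for row in rows)
    total = float(len(rows))
    return sum(-(c / total) * log(c / total, 2) for c in counts.values())

def info_gain(rows, axis):
    # Base entropy minus the weighted entropy of each subset
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / float(len(rows)) * entropy(subset)
    return entropy(rows) - remainder

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(data, i) for i in range(2)]
best = gains.index(max(gains))
print(gains)  # about 0.42 for feature 0 and about 0.17 for feature 1
print(best)   # 0, the index of 'no surfacing'
```

Splitting on 'no surfacing' leaves one pure subset and one mixed subset, while splitting on 'flippers' leaves a four-row subset that is still an even mix, which is why the first feature wins.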

Now let's test it.

import trees
myDat, labels = trees.createDataSet()
trees.chooseBestFeatureToSplit(myDat)   #returns the index of the best feature to split on

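1.3 Building the decision tree recursively

So far we know how to split the data set on the best feature. The next step is to build the decision tree itself.

How the tree is built: start from the original data set and split it on the best feature. A feature may have more than two values, so a split may produce more than two branches. After the first split, the data is passed down to the next node on each branch, where it can be split again. This means we can process the data set recursively.

The recursion ends when the program has used up every feature available for splitting, or when all instances under a branch belong to the same class.

First we write a majorityCnt function, which returns the class name that occurs most often. It will be needed later.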

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

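This is the same as a function from Part 1, so just a quick recap. classCount is a dictionary used for counting. A label that has not been seen yet is added with a count of 0, and every occurrence then adds 1, so the dictionary ends up holding the number of times each label appears. Finally, sorted turns the dictionary's items into a list of (label, count) pairs. Its key argument uses itemgetter from the operator module to sort by the second element (the count), and because reverse=True the result is in descending order.

Let's test it.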

import numpy as np
classList = np.array(myDat).T[-1]

classList

array(['yes', 'yes', 'no', 'no', 'no'], 
      dtype='|S21')

majorityCnt(classList)    #'no' appears 3 times and 'yes' twice, so it returns 'no'

'no'

Next comes the function that creates the decision tree.

The code is as follows:

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])              #delete the best feature, so the next best feature can be found
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

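The first two if statements check whether splitting is finished. Splitting stops when all instances have the same class, or when every feature has already been used and only the class column is left. These are the two conditions that end the recursion. At the start neither usually applies, so the function first picks the best feature with chooseBestFeatureToSplit and splits on it. It creates a big dictionary, myTree, which holds the whole structure of the decision tree (more on this in the test below). The data set is then split with splitDataSet, and createTree is called again on each resulting subset until the recursion ends.

Let's test it.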

myTree = trees.createTree(myDat,labels)

myTree

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

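Now back to the big dictionary that was not explained in detail above. The first feature in myTree is 'no surfacing'. Splitting on it gives two branches, 0 and 1. Everything under branch 0 belongs to the same class, so the recursion ends there. Branch 1 does not meet a stopping condition, so it is split again and produces its own dictionary with two branches, both of which do meet a stopping condition. That is why the value returned for branch 1 of 'no surfacing' is a dictionary. This nested dictionary is the result of the decision tree algorithm, and we can use it with Matplotlib to draw the decision tree.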
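One way to get a feel for the nested dictionary is to flatten it into one rule per leaf. The sketch below is an illustration only: tree_paths is a hypothetical helper that is not in trees.py, and it is written for Python 3 (the print call and next(iter(...)) idiom), unlike the notebook's Python 2.7 code.

```python
def tree_paths(tree, prefix=()):
    # Yield (conditions, label) for every root-to-leaf path of a nested-dict tree
    feature = next(iter(tree))              # the single key is the feature name
    for value, child in sorted(tree[feature].items()):
        step = prefix + ((feature, value),)
        if isinstance(child, dict):
            for path in tree_paths(child, step):   # descend into the subtree
                yield path
        else:
            yield step, child               # leaf: child is the class label

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
rules = list(tree_paths(my_tree))
for conditions, label in rules:
    print(conditions, '->', label)
```

Each printed line reads as "if these feature values hold, predict this label", which is the same information the tree plot shows later.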

1.4 Classifying with the decision tree

This step wraps the lookup into a single function, classify.

Reference code:

def classify(inputTree,featLabels,testVec):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat
    return classLabel

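This function uses the decision tree to decide which class a new test vector belongs to. It is also recursive. Take the tree produced above as an example.

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}} is our inputTree. The first statement of the function gets its first bestFeat, 'no surfacing', and assigns it to firstStr. secondDict is the value stored under 'no surfacing', that is {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}. The index call then finds the position of firstStr in featLabels, which should be 0. The test vector's value at that position is assigned to key, and the matching entry is looked up in secondDict. The isinstance function returns True if its first argument is an instance of the type given as the second argument, and False otherwise. If the first element of testVec is 1, valueOfFeat is {'flippers': {0: 'no', 1: 'yes'}}, which is a dict, so classify is called recursively until the value is no longer a dict. That value is the final result. In short, the function just walks down the decision tree until it reaches the matching class label.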
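The same walk can also be written as a loop. The version below is a variation for illustration, not the book's classify: the name classify_iter and the default argument are assumptions added here, so that a feature value missing from the tree returns a fallback instead of raising KeyError.

```python
def classify_iter(tree, feat_labels, test_vec, default=None):
    # Walk the nested dict with a loop instead of recursion
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                  # feature tested at this node
        value = test_vec[feat_labels.index(feature)]
        branches = node[feature]
        if value not in branches:                   # value never seen in training
            return default
        node = branches[value]
    return node                                     # a leaf, i.e. the class label

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_iter(my_tree, labels, [1, 0]))   # 'no'
print(classify_iter(my_tree, labels, [1, 1]))   # 'yes'
print(classify_iter(my_tree, labels, [2, 1]))   # None: there is no branch for value 2
```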

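A small tip: in Jupyter notebook, for functions shown in green you can look up what they do like this, for example

isinstance?     #run it and a pane opens below describing the function

Let's test it.

myDat, labels = trees.createDataSet()   #createTree deleted an entry from labels, so rebuild the list first

trees.classify(myTree,labels,[1,0])

'no'

trees.classify(myTree,labels,[1,1])

'yes'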

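1.5 Storing the decision tree

Building a decision tree is time-consuming. Even for a very small data set such as the sample above it can take a few seconds, and for a large data set it takes a lot of computing time. Classifying with a tree that has already been built, on the other hand, is fast. So to save time it is best to reuse an already built tree every time we classify.

Solution: store the decision tree with the pickle module.

Reference code: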

def storeTree(inputTree,filename):
    import pickle
    fw = open(filename,'w')
    pickle.dump(inputTree,fw)
    fw.close()
    
def grabTree(filename):
    import pickle
    fr = open(filename)
    return pickle.load(fr)

This simply writes the decision tree to a file and reads it back when it is needed. A quick test makes it clear.
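The functions above open the file in text mode, which fits the notebook's Python 2.7 setup. On Python 3, pickle needs binary mode, so a variant along the following lines should work on both versions. This is a sketch, not the author's code: the snake_case names, the .pkl extension and the temporary directory are choices made for the example.

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    # Binary mode works with every pickle protocol on Python 2 and 3
    with open(filename, 'wb') as fw:
        pickle.dump(tree, fw)

def grab_tree(filename):
    # The with block closes the file even if loading fails
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'classifierStorage.pkl')
store_tree(my_tree, path)
restored = grab_tree(path)
print(restored == my_tree)   # the round trip gives back an equal tree
```

Only load pickle files you created yourself, since unpickling data from an untrusted source can run arbitrary code.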

trees.storeTree(myTree,'classifierStorage.txt')   #run it ,store the tree

trees.grabTree('classifierStorage.txt')

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

That is the end of the construction part. Next, how to draw the decision tree.

2. Plotting the tree with Matplotlib annotations

We saw above that the final output of the decision tree is one big dictionary, which is ugly to read. We would like something clearer with visible levels, ideally a figure, so we will use Matplotlib to draw the tree.

2.1 Matplotlib annotations

Matplotlib provides an annotation tool, annotations, which can add text notes to a plot.

Create a file treePlotter.py to hold the plotting functions.

First, draw tree nodes with text annotations. Reference code:

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',\
             xytext=centerPt, textcoords='axes fraction',\
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )

def createPlot1():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

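The first three statements define the text-box and arrow styles, stored as dictionaries. decisionNode is a box with a sawtooth edge and leafNode is a "round4" box, a rectangle with rounded corners. In both, fc="0.8" sets the fill color of the box to a light gray (it is not a size). arrow_args describes the arrow. We will see all of these in the output later. The first function, plotNode, draws an annotation with an arrow. Its arguments are the text in the box, the coordinates of the box center, the coordinates of the parent node, and the box style. All of this is done through createPlot.ax1.annotate, where createPlot.ax1 is an attribute on the createPlot function that acts as a global variable. There is no need to go into more detail, it is enough to know how to use it. The second function, createPlot1, produces the figure and is also simple. Its first line creates figure 1 with a white background. The node coordinates are given as axes fractions, so they run from 0 to 1 in both directions. The next line clears the figure, and the one after that creates the subplot: in 111 the first 1 is the number of rows, the second the number of columns, and the third the index of the plot. There is only one plot here. This works the same way as in MATLAB, and many matplotlib functions resemble their MATLAB counterparts.

Let's test it.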

reset -f   #clear all the module and data

cd 桌面/machinelearninginaction/Ch03

/home/fangyang/桌面/machinelearninginaction/Ch03

import treePlotter
import matplotlib.pyplot as plt

treePlotter.createPlot1()

2.2 Constructing the annotated tree

Drawing a complete tree takes some care. We have x and y coordinates, but how to place all the nodes is the real question. We need to know how many leaf nodes there are so that the x axis can be sized correctly, and how many levels the tree has so that the y axis can be sized correctly. Two new functions, getNumLeafs() and getTreeDepth(), return the number of leaf nodes and the number of levels of the tree.

Reference code:

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs +=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

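These two functions should look familiar: they are almost the same as the classify function we used when testing classification. The isinstance check from classify is replaced here by comparing the type name, but the recursion is still there. In getTreeDepth each level of recursion adds 1 to the depth. getNumLeafs likewise checks whether a value is a dictionary, and if it is not, counts it as one more leaf.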
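For comparison, here is a shorter sketch of the same two computations that also runs on Python 3 (where myTree.keys()[0] no longer works). The names count_leaves and tree_depth are made up for this example. The logic follows the description above, with isinstance in place of the type-name comparison.

```python
def count_leaves(tree):
    # A leaf is any value that is not a dict
    feature = next(iter(tree))
    return sum(count_leaves(child) if isinstance(child, dict) else 1
               for child in tree[feature].values())

def tree_depth(tree):
    # Depth counts the decision nodes on the longest root-to-leaf path
    feature = next(iter(tree))
    return 1 + max(tree_depth(child) if isinstance(child, dict) else 0
                   for child in tree[feature].values())

small = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
large = {'no surfacing': {0: 'no', 1: {'flippers':
         {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
print(count_leaves(small), tree_depth(small))   # 3 leaves, depth 2
print(count_leaves(large), tree_depth(large))   # 4 leaves, depth 3
```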

There is also a function retrieveTree, which returns trees stored in advance, so that we do not have to build a tree from data every time we test the code.

Reference code:

def retrieveTree(i):
    listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                  {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                  ]
    return listOfTrees[i]

There is not much to say about this one. It just keeps decision tree results inside a function for easy access, much like storing the tree earlier.

With these pieces in place we can draw the tree.

Reference code:

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):#the first key tells you which feature was split on
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = myTree.keys()[0]     #the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            plotTree(secondDict[key],cntrPt,str(key))        #recursion
        else:   #it's a leaf node, so draw the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictionary you know it's a tree, and the first element will be another dict

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), '')
    plt.show()

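The first function writes text between a parent node and a child node. It takes the midpoint of the two nodes, which is the same as adding their x and y coordinates and dividing by 2. The code writes it slightly differently, but the idea is the same. It then places the text at that midpoint, again through createPlot.ax1, using its text method with a few layout arguments.

The second function is the key one. It calls the functions described earlier and uses the width of the tree to work out where to place a decision node: the node goes in the middle of all the leaves below it, not just in the middle of its direct children. The depth is used to divide the coordinate system evenly: the maximum coordinate (1.0) divided by the depth gives the height of each level. plotTree is also recursive. Each call draws one level, and the drawing is complete once no branch is a dictionary any more. Whenever a leaf is found, its coordinates are recorded and the leaf label and the text between parent and child are drawn. plotTree.xOff and plotTree.yOff keep track of the positions already drawn and of the right position for the next node.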
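To check the placement rule without opening a figure, the sketch below reproduces the same offset arithmetic and returns the coordinates instead of drawing them. It is an illustration under assumptions: layout, count_leaves and tree_depth are names invented here, branches are visited in sorted key order (plotTree uses the dictionary's own order), and a small state dictionary stands in for the plotTree.xOff and plotTree.yOff attributes.

```python
def count_leaves(tree):
    feature = next(iter(tree))
    return sum(count_leaves(c) if isinstance(c, dict) else 1
               for c in tree[feature].values())

def tree_depth(tree):
    feature = next(iter(tree))
    return 1 + max(tree_depth(c) if isinstance(c, dict) else 0
                   for c in tree[feature].values())

def layout(tree):
    # Returns a list of (text, x, y) using the same offsets as plotTree
    total_w = float(count_leaves(tree))
    total_d = float(tree_depth(tree))
    state = {'x_off': -0.5 / total_w, 'y_off': 1.0}
    placed = []

    def place(subtree):
        feature = next(iter(subtree))
        leaves = count_leaves(subtree)
        # decision node: centred above the leaves of its own subtree
        x = state['x_off'] + (1.0 + leaves) / 2.0 / total_w
        placed.append((feature, x, state['y_off']))
        state['y_off'] -= 1.0 / total_d          # go down one level
        for key in sorted(subtree[feature]):
            child = subtree[feature][key]
            if isinstance(child, dict):
                place(child)
            else:
                state['x_off'] += 1.0 / total_w  # next free leaf slot
                placed.append((child, state['x_off'], state['y_off']))
        state['y_off'] += 1.0 / total_d          # back up to the parent level

    place(tree)
    return placed

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
positions = layout(my_tree)
for text, x, y in positions:
    print(text, round(x, 3), round(y, 3))
```

For this tree the root lands at x = 0.5, midway between the outermost leaves, and 'flippers' sits midway between its own two leaves, which is the centring rule described above.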

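The third function is similar to one introduced earlier. It calls plotTree and finally shows the tree diagram. Just two points here: the global variable plotTree.totalW stores the width of the tree and plotTree.totalD stores its depth, and plotTree.xOff and plotTree.yOff are initialized in this function.

Finally, let's test it.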

cd 桌面/machinelearninginaction/Ch03

/home/fangyang/桌面/machinelearninginaction/Ch03

import treePlotter
myTree = treePlotter.retrieveTree(0)
treePlotter.createPlot(myTree)

Change a label and draw the figure again.

myTree['no surfacing'][3] = 'maybe'
treePlotter.createPlot(myTree)

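That concludes drawing decision trees with matplotlib.

3. Using a decision tree to predict contact lens type

The contact lens data set is a very well-known data set. It contains observations about many patients' eye conditions together with the type of contact lens the doctor recommended. The lens types are hard, soft, and no lenses (not suitable for contact lenses). The data comes from the UCI repository. To make it easier to display, it is stored in a text file in the source code download directory.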
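Before running it on the real file, here is a small sketch of the parsing step on its own. The two rows are illustrative stand-ins, assuming tab-separated fields in the order age, prescript, astigmatic, tearRate, lens type (the order of the labels used in the test below). They are held in a string so the example runs without lenses.txt.

```python
# Two illustrative rows in the assumed tab-separated layout
sample = "young\tmyope\tno\treduced\tno lenses\nyoung\tmyope\tno\tnormal\tsoft\n"

rows = [line.strip().split('\t') for line in sample.splitlines()]
features = [row[:-1] for row in rows]   # four feature values per row
targets = [row[-1] for row in rows]     # recommended lens type
print(rows[0])
print(targets)
```

Because the last column is the class label, rows parsed this way are already in the shape createTree expects.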

Run the test

import trees
fr = open('lenses.txt')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age' , 'prescript' , 'astigmatic','tearRate']
lensesTree = trees.createTree(lenses,lensesLabels)

lensesTree

{'tearRate': {'normal': {'astigmatic': {'no': {'age': {'pre': 'soft',
      'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}},
      'young': 'soft'}},
    'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses',
        'presbyopic': 'no lenses',
        'young': 'hard'}},
      'myope': 'hard'}}}},
  'reduced': 'no lenses'}}
Printed like this it is a mess and hard to make sense of, so let's draw the tree diagram instead.

treePlotter.createPlot(lensesTree)

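Now it is very clear. There is still one problem, though: the decision tree fits the experimental data very well, but it may have too many branches to do so. This problem is called overfitting. To reduce overfitting we can prune the decision tree and remove unnecessary leaf nodes. If a leaf node adds only a little information, it can be deleted and merged into other leaf nodes. That will be discussed in a later post!

Closing

This notebook took more than two days to write, nearly three, and it was tiring. I hope this post about decision trees helps you. If you find any mistakes, please point them out. Thanks!

If you think it is good, show some support, haha, I could use the encouragement (^__^)

Baidu cloud drive: link: https://pan.baidu.com/s/1eSeRQIQ password: 3zwm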