2018-02-03: Machine learning algorithm examples on the classic iris dataset in Python 3, for absolute beginners

Date: 2019-12-01

---
layout: post
title: 2018-02-03-Machine learning algorithm examples on the classic iris dataset in Python 3 (zero background)
key: 20180203
tags: machine-learning ML IRIS python3
modify_date: 2018-02-03
---



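# Machine learning algorithm examples on the classic iris dataset in Python 3 (zero background)
Notes:
* This article is published on gitee, github, and cnblogs (博客園).
* When reposting or quoting, please credit the original author and give the link and source.

Main text:
* The content below can be copied into a Python 3 source file, for example "iris_ml.py", and run directly.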

###########################
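# Notes:
#      I wrote this article because, while studying the post "http://python.jobbole.com/83563/",
#      I found a few slips in the original and some information that beginners would miss. I therefore added:
#      1. Brief descriptions of most of the algorithms involved;
#      2. Links and resources for further reading;
#      3. Basic usage notes for some of the libraries;
#      4. Explanations of the Python syntax that is strictly needed;
#      5. Some thoughts and judgments on model algorithms, data mining, and related fields;
#      6. Fixes for the slips in the original author's code; the whole program runs end to end and works when copied!
#      7. Miscellaneous.
#      Goal: help learners who have zero Python background but enjoy data mining get started with less frustration,
#            by studying "one piece of working, commented code"!
# Suggestion: read, or at least skim, the original author's post (above) first.
# Link: my resource collection post "http://www.cnblogs.com/taichu/p/5216659.html", for new and experienced learners alike; I keep it updated!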
###########################

###########################
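# (0) Lessons learned
# 1. Finding data, analyzing it, and building a model is an expensive end-to-end process,
#    so "squeeze dry" every possibility of one dataset and one model and study it thoroughly.
#    Mastering one case often carries over to many others: working one model over and over exercises all kinds of methods and insights!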
###########################

###########################
# (1) Inspecting the raw data (samples)
# Topics: data import; data visualization
###########################

##################
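# On Ubuntu 15.10, the following 6 commands set up the Python environment
# (Python 3 package names are shown because this script targets Python 3; exact names may vary by release)
#sudo apt-get install python3   #install Python 3 (usually preinstalled)
#sudo apt-get install python3-numpy    #install the numpy module
#sudo apt-get install python3-matplotlib
#sudo apt-get install python3-networkx
#sudo apt-get install python3-sklearn
#python3  #shows the Python version and enters the interactive shell; paste all the code below into it and try it
# Alternatively, download the Anaconda Python distribution (search for it); it is very good and bundles SciPy and many other core libraries, which spares you the installation!
# Note: I run the latest Ubuntu 15.10 virtual machine on a Windows 10 host; Python is installed by default in Ubuntu, and all the code below was tried after upgrading it and installing the libraries above!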
##################

from urllib import request
url = 'http://aima.cs.berkeley.edu/data/iris.csv'
response = request.urlopen(url)
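# Local path where the samples are stored; set it to suit your environment!
#localfn='/mnt/hgfs/sharedfolder/iris.csv' #for linux
#localfn='C:\\TEMP\\iris.csv' #for windows
localfn='iris.csv' # current working directory (works on any OS)
# 'with' closes the file even if the write fails
with open(localfn, 'w') as localf:
    localf.write(response.read().decode('utf-8'))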

# data examples
#COL1   COL2   COL3   COL4   COL5
#5.1    3.5    1.4    0.2    setosa
#...    ...    ...    ...    ...
#4.7    3.2    1.3    0.2    setosa
#7      3.2    4.7    1.4    versicolor
#...    ...    ...    ...    ...
#6.9    3.1    4.9    1.5    versicolor
#6.3    3.3    6      2.5    virginica
#...    ...    ...    ...    ...
#7.1    3      5.9    2.1    virginica

#############################
# A description of 'iris.csv' is available
# at 'http://aima.cs.berkeley.edu/data/iris.txt'
# Definition of the columns:
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
#5. class: 
#      -- Iris Setosa
#      -- Iris Versicolour
#      -- Iris Virginica
#Missing Attribute Values: None
#################################


from numpy import genfromtxt, zeros
# read the first 4 columns
data = genfromtxt(localfn,delimiter=',',usecols=(0,1,2,3)) 
# read the fifth column
target = genfromtxt(localfn,delimiter=',',usecols=(4),dtype=str)

print (data.shape)
# output: (150, 4)
print (target.shape)
# output: (150,)

# build a collection of the unique elements
print (set(target))
# output: {'setosa', 'versicolor', 'virginica'} (element order may vary)
#print (set(data)) # wrong usage of set: the rows of a 2-D ndarray are unhashable

######################
# Brief guide to plot format strings:
# 'bo'=blue+circle; 'r+'=red+plus; 'g*'=green+star
#search keyword 'matlab plot' on web for details
#http://www.360doc.com/content/15/0113/23/16740871_440559122.shtml
#http://zhidao.baidu.com/link?url=6JA9-A-UT3kmslX1Ba5uTY1718Xh-OgebUJVuOs3bdzfnt4jz4XXQdAmvb7R5JYMHyRbBU0MYr-OtXPyKxnxXsPPkm9u5qAciwxIVACR8k7
######################

#figure for 2D data
from pylab import plot, show
plot(data[target=='setosa',0],data[target=='setosa',2],'bo')
plot(data[target=='versicolor',0],data[target=='versicolor',2],'r+')
plot(data[target=='virginica',0],data[target=='virginica',2],'g*')
show()

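# Note: when run in Ubuntu's interactive Python shell, show() pauses the program until the figure window is closed.
# Anaconda's Spyder (Python 2.7) is much more convenient: the generated figure is rendered in the console automatically
# and does not interrupt the run!

# figure for all 4 dimensions of the data: one color per class; circles are sepals, plus signs are petals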
setosa_sepal_x=ssx=data[target=='setosa',0]
setosa_sepal_y=ssy=data[target=='setosa',1]
setosa_petal_x=spx=data[target=='setosa',2]
setosa_petal_y=spy=data[target=='setosa',3]

versicolor_sepal_x=vsx=data[target=='versicolor',0]
versicolor_sepal_y=vsy=data[target=='versicolor',1]
versicolor_petal_x=vpx=data[target=='versicolor',2]
versicolor_petal_y=vpy=data[target=='versicolor',3]

virginica_sepal_x=vgsx=data[target=='virginica',0]
virginica_sepal_y=vgsy=data[target=='virginica',1]
virginica_petal_x=vgpx=data[target=='virginica',2]
virginica_petal_y=vgpy=data[target=='virginica',3]

plot(ssx,ssy,'bo',spx,spy,'b+')
plot(vsx,vsy,'ro',vpx,vpy,'r+')
plot(vgsx,vgsy,'go',vgpx,vgpy,'g+')
show()


# figure for 1 dimension (sepal length): one histogram per class, plus one for all samples together
# see the following for detailed pylab usage
#http://hyry.dip.jp/tech/book/page/scipy/matplotlib_fast_plot.html
from pylab import figure, subplot, hist, xlim, show
xmin = min(data[:,0])
xmax = max(data[:,0])
figure() # optional; a figure is created by default
subplot(411) # distribution of the setosa class (1st, on the top)
hist(data[target=='setosa',0],color='b',alpha=.7)
xlim(xmin,xmax)
# subplot(rows, columns, plot number); (4,1,2) can be written as 412 when all three numbers are below 10
subplot(412) # distribution of the versicolor class (2nd)
hist(data[target=='versicolor',0],color='r',alpha=.7)
xlim(xmin,xmax)
subplot(413) # distribution of the virginica class (3rd)
hist(data[target=='virginica',0],color='g',alpha=.7)
xlim(xmin,xmax)
subplot(414) # global histogram (4th, on the bottom)
hist(data[:,0],color='y',alpha=.7)
xlim(xmin,xmax)
show()

###########################
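# (2) Classifying the samples
# The naive Bayes classifier is a commonly used one; its variants include the Gaussian, multinomial, and Bernoulli models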
###########################
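# To make the Gaussian variant concrete, here is a minimal pure-Python sketch of the idea:
# per class it stores a prior plus a mean and variance for each feature, and it predicts
# the class with the highest log posterior. The names gnb_fit and gnb_predict, the toy
# rows, and the tiny variance floor are assumptions of this sketch; it is not how
# sklearn's GaussianNB is implemented.

```python
import math

def gnb_fit(rows, labels):
    """Per class: prior, per-feature means, per-feature variances."""
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # A small floor keeps every variance strictly positive (sketch assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*members), means)]
        model[label] = (n / len(rows), means, variances)
    return model

def gnb_predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # log of the normal density, treating features as independent
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Toy rows shaped like (petal length, petal width); the values are illustrative only.
demo_rows = [[1.4, 0.2], [1.3, 0.2], [1.5, 0.3], [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]
demo_labels = ['setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor']
demo_model = gnb_fit(demo_rows, demo_labels)
print(gnb_predict(demo_model, [1.6, 0.25]))
print(gnb_predict(demo_model, [4.4, 1.3]))
```

# Summing per-feature log densities is exactly the "naive" independence assumption.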

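# create t, an all-zero 1-D array with the same length as target
t = zeros(len(target))
#type(t) #show type of t (numpy.ndarray)
#print (t) #show contents of t
# set the positions where target holds a given name to that class's number (boolean-mask assignment, very concise)
t[target == 'setosa'] = 1
t[target == 'versicolor'] = 2
t[target == 'virginica'] = 3
#print (t)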

# train on the whole data set
from sklearn.naive_bayes import GaussianNB
classifier = cf = GaussianNB()
cf.fit(data,t) # training on the iris dataset
print (cf.predict(data[0:1])) # classify one sample after training; predict expects a 2-D array, hence the slice
#output:[ 1.]
print (t[0])
#output:1.0

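# split the original data into a training set and a validation set; t is split the same way
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
train, test, t_train, t_test = train_test_split(data, t, test_size=0.4, random_state=0)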

print (train.shape)
#output:(90, 4)
print (test.shape)
#output:(60, 4)
print (t_train.shape)
#output:(90,)
print (t_test.shape)
#output:(60,)
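
# The 60/40 split above boils down to shuffling the sample indices and cutting the list.
# A minimal standard-library sketch follows; split_indices is a hypothetical helper, and
# its shuffle will not reproduce the exact rows that scikit-learn picks for random_state=0.

```python
import random

def split_indices(n_samples, test_fraction, seed=0):
    """Shuffle 0..n_samples-1 and cut the result into (train, test) index lists."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)      # seeded, so the split is repeatable
    n_test = round(n_samples * test_fraction)
    return order[n_test:], order[:n_test]

demo_train_idx, demo_test_idx = split_indices(150, 0.4)
print(len(demo_train_idx), len(demo_test_idx))
```

# Fixing the seed plays the same role as random_state: the same split on every run.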

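# train on 60% of the data, then validate on the remaining 40%: 93.3%
cf.fit(train,t_train) # train
print (cf.score(test,t_test)) # test
#output:0.93333333333333335
print (cf.score(train,t_train)) # even on the very set it was trained on, the classifier does not reach 100%!
#output:0.97777777777777775

# train on all the data and test on the same data: lower than the 97.8% just above
cf.fit(data,t)
#output:GaussianNB()
print (cf.score(data,t))
#output:0.95999999999999996


# train on 100% of the data, then validate on the 40% subset: 95%
# (the test samples were seen during training, so this figure is optimistic)
cf.fit(data,t)
#output:GaussianNB()
print (cf.score(test,t_test))
#output:0.94999999999999996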

#############################################################
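# TODO: research plan (I will write a separate post on this)
# Naive Bayes assumes that the features are statistically independent of one another. In reality they are correlated, so one can try:
# 1. Sepal length and width, and petal length and width, are clearly strongly correlated; combine them into 2 new features: sepal-size and petal-size
# 2. Merging sepal and petal lengths, and sepal and petal widths, may also capture correlation, giving 2 more new features: whole-length and whole-width
# 3. The original sepal length and width and petal length and width are the 4 initial features;
# 4. How do these 8 candidate features relate? For example: is there a flower with very small petals but rather large sepals? Does biology impose a fixed ratio?
#    Or is one kind of flower slender overall? Or short and stubby?
#    I also suspect that sepal-size and petal-size are probabilistically linked (positively, negatively, or in some other way)
#    Even a classifier that reaches 100% here will not necessarily classify future samples 100% correctly, because sample collection also carries labeling errors (manual entry errors)
# TRY: change the model or transform the data, then rerun the classification tests and cross validation, hoping to improve accuracy!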
#############################################################


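# estimate classifier performance with a confusion matrix
# (refit on the training set first, so that the test samples are unseen)
cf.fit(train,t_train)
from sklearn.metrics import confusion_matrix
print (confusion_matrix(t_test,cf.predict(test))) # sklearn expects (actual, predicted)
#output:[[16  0  0]
#output: [ 0 23  0]
#output: [ 0  4 17]]

# how to read a confusion matrix (made-up numbers for illustration)
#                predicted
#                --------------------
#                class1 class2 class3
# actual |class1 43     5      2
# actual |class2 2      45     3
# actual |class3 0      1      49
#
# Note: the correct guesses all lie on the diagonal of the table
# Reading: each of the 3 classes really has 50 samples;
#      class3 has 1 sample wrongly guessed as class2;
#      class2 has 2 samples wrongly guessed as class1 and 3 as class3
#      class1 has 5 samples wrongly guessed as class2 and 2 as class3

# full report of classifier performance
# Precision: of the samples predicted as a class, the fraction that truly belong to it
# Recall (also called the true positive rate): of the samples that truly belong to a class, the fraction that were identified
# F1-Score: the harmonic mean of precision and recall

from sklearn.metrics import classification_report
print (classification_report(t_test, classifier.predict(test), target_names=['setosa', 'versicolor', 'virginica']))
#output:            precision    recall  f1-score   support
#output:    setosa       1.00      1.00      1.00        16
#output:versicolor       0.85      1.00      0.92        23
#output: virginica       1.00      0.81      0.89        21
#output:avg / total      0.94      0.93      0.93        60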
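
# Precision, recall, and F1 can all be read off a confusion matrix. The sketch below does
# so for the illustrative 3x3 table above, taking rows as actual classes and columns as
# predicted classes. per_class_metrics is a hypothetical helper, and it assumes that
# every class is predicted at least once (no zero column sums).

```python
# Illustrative matrix from the explanation above: rows = actual, columns = predicted.
demo_cm = [[43, 5, 2],
           [2, 45, 3],
           [0, 1, 49]]

def per_class_metrics(cm):
    """Return one (precision, recall, f1) tuple per class."""
    results = []
    for k in range(len(cm)):
        true_pos = cm[k][k]
        predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
        actual_k = sum(cm[k])                     # row sum: everything that really is k
        precision = true_pos / predicted_k
        recall = true_pos / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

# Accuracy is the diagonal (correct guesses) over the total number of samples.
demo_accuracy = sum(demo_cm[k][k] for k in range(3)) / sum(map(sum, demo_cm))
demo_metrics = per_class_metrics(demo_cm)
print(round(demo_accuracy, 4))
for cls_name, (prec, rec, f1) in zip(['class1', 'class2', 'class3'], demo_metrics):
    print(cls_name, round(prec, 3), round(rec, 3), round(f1, 3))
```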

##############################################################
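# Background on the harmonic mean
# Harmonic mean:   Hn=n/(1/a1+1/a2+...+1/an)
# Geometric mean:  Gn=(a1a2...an)^(1/n)
# Arithmetic mean: An=(a1+a2+...+an)/n
# Quadratic mean:  Qn=√ [(a1^2+a2^2+...+an^2)/n]
# For positive numbers these four means satisfy Hn ≤ Gn ≤ An ≤ Qn
#
# A typical harmonic mean example:
# Q: 4 students solve problems at rates of 3, 4, 6, and 8 per hour. If each of them solves the same number of problems, what is the overall average speed (problems per hour)?
# A: the harmonic mean, i.e. 1/[(1/3+1/4+1/6+1/8)/4]=4/(1/3+1/4+1/6+1/8)=4.57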
###########################################################
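
# The four means and their ordering are easy to check numerically. The sketch below uses
# the rates 3, 4, 6, 8 from the example; four_means is a hypothetical helper, and
# math.prod needs Python 3.8 or later.

```python
import math

def four_means(values):
    """Return (harmonic, geometric, arithmetic, quadratic) means of positive numbers."""
    n = len(values)
    harmonic = n / sum(1 / v for v in values)
    geometric = math.prod(values) ** (1 / n)
    arithmetic = sum(values) / n
    quadratic = math.sqrt(sum(v * v for v in values) / n)
    return harmonic, geometric, arithmetic, quadratic

demo_h, demo_g, demo_a, demo_q = four_means([3, 4, 6, 8])
print(round(demo_h, 2), round(demo_g, 2), round(demo_a, 2), round(demo_q, 2))
```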


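# The report above also shows, under support, how many test samples back each class.
# Splitting the data, reducing the number of training samples, and evaluating the results
# all depend on the random choice of the paired training and test sets.


# To evaluate a classifier properly and compare it with other classifiers,
# we need a more robust evaluation procedure, such as cross validation.
# The idea behind it is simple: split the data into different training and test sets several times,
# and take the average over these runs as the final evaluation of the classifier.
# sklearn provides a way to run it: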

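from sklearn.model_selection import cross_val_score # sklearn.cross_validation was removed in scikit-learn 0.20
# cross validation with 6 folds
scores = cross_val_score(classifier, data, t, cv=6)
print (scores)
#output:[ 0.92592593  1.          0.91666667  0.91666667  0.95833333  1.        ]
# More folds are not necessarily better. Here cv=6, so the model is trained and scored 6 times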

# The output is an array with the accuracy obtained in each fold. The mean accuracy is easy to compute:
from numpy import mean
print (mean(scores))
#output:0.96
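
# To show what happens inside cross validation, here is a minimal k-fold sketch in plain
# Python. kfold_indices and cross_validate are hypothetical helpers, and the scorer passed
# in is a stand-in that only exercises the loop. Unlike scikit-learn, which stratifies the
# folds by class for classifiers, this plain split takes contiguous blocks.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if f < n_samples % n_folds else 0)
                  for f in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(score_fn, n_samples, n_folds):
    """Average score_fn(train_idx, test_idx) over all folds."""
    fold_scores = [score_fn(tr, te) for tr, te in kfold_indices(n_samples, n_folds)]
    return sum(fold_scores) / len(fold_scores), fold_scores

# Stand-in scorer: the fraction of samples used for training in each fold.
demo_cv_mean, demo_cv_scores = cross_validate(
    lambda tr, te: len(tr) / (len(tr) + len(te)), 150, 6)
print(demo_cv_scores, demo_cv_mean)
```

# In real use, score_fn would fit a model on the training indices and return its accuracy on the test indices.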

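# loop over increasing cv values and print the mean each time
# cv must be >= 2, otherwise an error is raised (in the author's version: 'ValueError: k-fold cross validation requires at least one train / test split by setting n_folds=2 or more, got n_folds=1.')
# cv must not exceed the size of the smallest class (50 for t; 27 for t_train; 16 for t_test); see the label counts printed further below!
# 1. print the mean cross-validation score on data for every feasible cv
for i in range(2, 51):
    scores = cross_val_score(classifier, data, t, cv=i)
    print (mean(scores)) # in the interactive shell, every for statement must be followed by an empty line (no characters at all, not even spaces) to end the input!


# 2. print the mean cross-validation score on test for every feasible cv
for i in range(2, 17): print (mean(cross_val_score(classifier, test, t_test, cv=i)))


# 3. print the mean cross-validation score on train for every feasible cv
for i in range(2, 28): print (mean(cross_val_score(classifier, train, t_train, cv=i)))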


#
#
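# count how often each value occurs in a 1-D numpy.ndarray and print the counts
label_counts={}
for item in t: label_counts[item] = label_counts.get(item, 0) + 1
    # an empty line (no spaces at all!) must follow, so that interactive Python knows the for statement is complete

print(label_counts)
#output:{1.0: 50, 2.0: 50, 3.0: 50}

# same count for the training labels
label_counts={}
for item in t_train: label_counts[item] = label_counts.get(item, 0) + 1
    # an empty line must follow, so that interactive Python knows the for statement is complete

print(label_counts)
#output:{1.0: 34, 2.0: 27, 3.0: 29}

# same count for the test labels
label_counts={}
for item in t_test: label_counts[item] = label_counts.get(item, 0) + 1
    # an empty line must follow, so that interactive Python knows the for statement is complete

print(label_counts)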
#output:{1.0: 16, 2.0: 23, 3.0: 21}
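
# The counting loops above can also be written with collections.Counter from the standard
# library. A small sketch on made-up labels (demo_label_list is illustrative, not the iris labels):

```python
from collections import Counter

demo_label_list = [1.0, 2.0, 2.0, 3.0, 1.0, 2.0]   # made-up labels for illustration
demo_counts = Counter(demo_label_list)             # maps each value to its frequency
print(dict(demo_counts))
print(min(demo_counts.values()))                   # size of the smallest class
```

# The smallest class size is the figure that bounds cv in the loops above.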

#
#
#***********************************
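# Extra: write a loop that splits the data into training and validation sets, from 1 and n-1 samples through n-1 and 1;
# TODO:  for each split, train the model (here the Gaussian naive Bayes classifier) and cross-validate it;
#        during cross validation, also try every feasible cv value;
#        collect and plot the results to see where this model classifies the known data set best;
#        X axis of the figure: train/data (fraction of the data used for training), in (0,1); Y axis: the cross-validation mean averaged over all cv values, in (0,1)
#        because each split into training and validation sets is random, every run produces a different 2-D plot
# TODO:  going further, given a sample matrix, could we systematically and automatically try every algorithm and collect the results,
#        with configurable threshold alerts? That would give a basic toolbox that sweeps all algorithms over a raw sample matrix,
#        scans it automatically, reports basic information, and leaves the deeper study to a human.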
#***********************************

###########################
# (3) Clustering
###########################
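# k-means in brief: the algorithm takes k as input and partitions n data objects into k clusters, such that objects in the same cluster are highly similar and objects in different clusters are dissimilar;
#                   similarity is measured against a "center object" (center of gravity) obtained as the mean of the objects in each cluster.
# Basic steps of k-means:
# (1) Pick k of the n data objects arbitrarily as the initial cluster centers (the goal is k clusters);
# (2) Using the mean (center object) of each cluster, compute the distance from every object to these centers, and reassign each object to the nearest center;
# (3) Recompute the mean (center object) of every cluster that changed;
# (4) Evaluate the standard criterion function; when a stopping condition is met, for example the function converges, the algorithm terminates; otherwise go back to step (2).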
############################
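
# The four steps above can be sketched in a few lines of plain Python. This is a minimal
# illustration, not sklearn's KMeans: kmeans_sketch is a hypothetical helper, it takes the
# first k points as initial centers instead of a random choice, and it stops as soon as
# the assignments no longer change.

```python
def kmeans_sketch(points, k, max_iter=100):
    """Lloyd-style iteration on a list of equal-length tuples."""
    centers = [tuple(p) for p in points[:k]]   # step 1: simplistic init, first k points
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each point to its nearest center (squared Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:       # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # step 3: recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                        # keep the old center if a cluster empties
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

demo_points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
demo_centers, demo_assignment = kmeans_sketch(demo_points, k=2)
print(demo_centers)
print(demo_assignment)
```

# The result depends on the initial centers, which is why library implementations usually try several initializations.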


from sklearn.cluster import KMeans
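kms = KMeans(n_clusters=3) # initialization: we know a priori that there are 3 species, so ask for 3 cluster centers
#kmeans = KMeans(k=3, init='random') # call from the original article; it fails here because `k` is not an accepted parameter (use n_clusters)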
kms.fit(data) # actual execution
c = kms.predict(data)

from sklearn.metrics import completeness_score, homogeneity_score
print (completeness_score(t,c))
#output:0.764986151449
print (homogeneity_score(t,c))
#output:0.751485402199

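# Note: t only needs to hold 3 distinct class values; they do not have to be 1, 2, 3
# The completeness score approaches 1 when most of the data points of a given class fall into the same cluster.
# The homogeneity score approaches 1 when every cluster contains almost exclusively data points of a single class.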
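
# Both scores can be written with conditional entropies: homogeneity = 1 - H(class|cluster)/H(class)
# and completeness = 1 - H(cluster|class)/H(cluster). The sketch below follows these usual
# definitions in plain Python; the helper names and the toy labels are assumptions of this
# sketch, and it is meant to illustrate the idea rather than replace sklearn.metrics.

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def _conditional_entropy(labels, given):
    """H(labels | given), computed from joint and marginal counts."""
    n = len(labels)
    joint = Counter(zip(labels, given))
    given_counts = Counter(given)
    return -sum(c / n * math.log(c / given_counts[g]) for (_, g), c in joint.items())

def homogeneity_completeness(classes, clusters):
    h_class, h_cluster = _entropy(classes), _entropy(clusters)
    homogeneity = 1.0 if h_class == 0 else 1 - _conditional_entropy(classes, clusters) / h_class
    completeness = 1.0 if h_cluster == 0 else 1 - _conditional_entropy(clusters, classes) / h_cluster
    return homogeneity, completeness

# Toy case: classes 2 and 3 are merged into one cluster.
demo_classes = [1, 1, 2, 2, 3, 3]
demo_clusters = [0, 0, 1, 1, 1, 1]
demo_hom, demo_com = homogeneity_completeness(demo_classes, demo_clusters)
print(round(demo_hom, 4), round(demo_com, 4))
```

# Every class sits inside a single cluster, so completeness is 1, while the mixed cluster pulls homogeneity below 1.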
figure()
subplot(211) # top figure with the real classes
plot(data[t==1,0],data[t==1,2],'bo')
plot(data[t==2,0],data[t==2,2],'ro')
plot(data[t==3,0],data[t==3,2],'go')
subplot(212) # bottom figure with classes assigned automatically
plot(data[c==1,0],data[c==1,2],'bo',alpha=.5)
plot(data[c==2,0],data[c==2,2],'go',alpha=.5)
plot(data[c==0,0],data[c==0,2],'mo',alpha=.5)
show()

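# In this figure, the cluster at the bottom left is identified perfectly by k-means,
# whereas the two clusters at the top are partly confused. Because k-means assigns each point to the nearest cluster center,
# such errors are to be expected where two classes overlap; chance in the sample can also cause misidentification

# Below, the 4 feature dimensions are drawn as 2 points per sample (sepal and petal) on one plane; after clustering into 3 groups,
# the boundaries look cleaner.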
import matplotlib.pyplot as plt
plt.figure()
plt.subplot(211) # top figure with the real classes
plt.plot(data[t==1,0],data[t==1,1],'bo',data[t==1,2],data[t==1,3],'b+')
plt.plot(data[t==2,0],data[t==2,1],'ro',data[t==2,2],data[t==2,3],'r+')
plt.plot(data[t==3,0],data[t==3,1],'go',data[t==3,2],data[t==3,3],'g+')
plt.subplot(212) # bottom figure with classes assigned automatically
plt.plot(data[c==0,0],data[c==0,1],'bo',data[c==0,2],data[c==0,3],'b+',alpha=.7)
plt.plot(data[c==1,0],data[c==1,1],'ro',data[c==1,2],data[c==1,3],'r+',alpha=.7)
plt.plot(data[c==2,0],data[c==2,1],'go',data[c==2,2],data[c==2,3],'g+',alpha=.7)
plt.show() # blocks until the window is closed, so later plots start on a fresh figure


###########################
# (4) Regression
###########################

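# Regression is a method for investigating the functional relationship between variables in order to make predictions.
# Suppose there are two variables: one regarded as the cause, the other as the effect.
# A regression model describes the relationship between the two, so that one variable can be inferred from the other;
# when the relationship is a straight line, it is called linear regression.


##############
# The LinearRegression model in the sklearn.linear_model module
# computes the sum of squared vertical differences between each data point and the fitted line
# and finds the best-fit line that minimizes this sum. It is used like the other sklearn models;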
#
##############
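
# For a single explanatory variable, the line that minimizes the sum of squared vertical
# differences has a closed form: slope = Sxy/Sxx, and the line passes through the point of
# means. A minimal sketch in plain Python; fit_line, mse_of_line, and the sample points are
# illustrative names and values, not part of sklearn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((xv - mean_x) ** 2 for xv in xs)
    sxy = sum((xv - mean_x) * (yv - mean_y) for xv, yv in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse_of_line(xs, ys, slope, intercept):
    """Mean squared vertical distance between the line and the points."""
    return sum((slope * xv + intercept - yv) ** 2 for xv, yv in zip(xs, ys)) / len(xs)

demo_xs = [0.0, 1.0, 2.0, 3.0]
demo_ys = [1.1, 2.9, 5.2, 6.8]        # roughly y = 2x + 1 with small made-up noise
demo_slope, demo_intercept = fit_line(demo_xs, demo_ys)
print(round(demo_slope, 3), round(demo_intercept, 3))
print(round(mse_of_line(demo_xs, demo_ys, demo_slope, demo_intercept), 4))
```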

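# The example below generates 40 random sample points whose overall trend rises through
# the first quadrant (the underlying curve is cubic, so a line is only an approximation); linear regression finds the fitted line, which is then evaluated
# Step1 - generate 40 random points in the first quadrant
from numpy.random import rand
x = rand(40,1) # explanatory variable
y = x*x*x+rand(40,1)/5 # dependent variable

# Step2 - linear regression
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(x,y)

# Step3 - generate x values and let the linear regression model infer y (the inferred points form a line)
from numpy import linspace
# 40 evenly spaced values between 0 and 1
randx = linspace(0,1,40)
# predict y for these 40 x values with the fitted model and plot the line over the samples
# predict expects a 2-D column, so reshape x to shape (40,1) first
plot(x,y,'o',randx,linreg.predict(randx.reshape(-1,1)),'--r')
show()

# Step4 - measure the MSE, the mean squared distance between the fitted line and the real data; 0 is best
from sklearn.metrics import mean_squared_error
print (mean_squared_error(y,linreg.predict(x)))

#########################
# linear regression on the actual sepal length and width data of this example
#########################
# get x and y (reshape is needed to turn the 1-D array (50,) into a column (50,1) before linreg.fit!)
ssx_blue=data[target=='setosa',0].reshape((50,1)) # sepal length of setosa
ssy_blue=data[target=='setosa',1].reshape((50,1)) # sepal width of setosa

# fit the linear regression model on x and y
linreg = LinearRegression()
linreg.fit(ssx_blue,ssy_blue)

# generate x values and let the linear regression model infer y (the inferred points form a line)
# empirically, the sepals of the blue species setosa measure roughly x:[4.0-6.0] y:[2.5-4.5]
randx = linspace(4.0,6.0,50)
plot(ssx_blue,ssy_blue,'o',randx,linreg.predict(randx.reshape(-1,1)),'--r')
show()

# measure the MSE, the mean squared distance between the fitted line and the real data; 0 is best
print (mean_squared_error(ssy_blue,linreg.predict(ssx_blue)))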

###########################
# (5) Correlation analysis
###########################

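# Studying the correlation between features shows whether variables are related and how strongly.
# Correlation analysis helps locate the important variables that others depend on. The most widely used measure is probably the Pearson product-moment correlation coefficient.
# It is the covariance of two variables divided by the product of their standard deviations.
# We compute the correlation coefficient for every pair of the 4 variables in the iris data set.
# Note: features can be combined and transformed, so the pairs tested need not be the raw, unprocessed features;
#       one can also combine or transform features suspected to be related and keep testing their correlation.

# The correlation is positive when the values grow together, and negative when one value falls as the other rises.
# 1 means perfect positive correlation, 0 means no correlation, -1 means perfect negative correlation.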
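
# The definition above (covariance divided by the product of the standard deviations)
# translates directly into code. A minimal plain-Python sketch; pearson is a hypothetical
# helper and the sample lists are made up. The 1/n factors cancel, so raw sums are enough.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

demo_r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # values grow together
demo_r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # one falls as the other rises
demo_r_mid = pearson([1, 2, 3, 4], [2, 1, 4, 3])    # weaker positive relation
print(demo_r_pos, demo_r_neg, round(demo_r_mid, 2))
```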

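# In the heat map below, the color at the top of the color bar (red in the original article) marks the highest positive correlation; the strongest correlation is between
# "petal width" and "petal length".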

from numpy import corrcoef
corr = corrcoef(data.T) # .T gives the transpose
print (corr)
#output:[[ 1.         -0.10936925  0.87175416  0.81795363]
#output: [-0.10936925  1.         -0.4205161  -0.35654409]
#output: [ 0.87175416 -0.4205161   1.          0.9627571 ]
#output: [ 0.81795363 -0.35654409  0.9627571   1.        ]]

from pylab import pcolor, colorbar, xticks, yticks
from numpy import arange
pcolor(corr) # draw the correlation matrix; 4 attributes, hence 4x4
colorbar() # add the color scale
# label the X and Y axes: each attribute occupies a cell of width 1, so the cell centers are at 0.5, 1.5, 2.5, 3.5
xticks(arange(0.5,4.5),['sepal length',  'sepal width', 'petal length', 'petal width'],rotation=-20)
yticks(arange(0.5,4.5),['sepal length',  'sepal width', 'petal length', 'petal width'],rotation=-45)
show()


###########################
# (6) Component analysis (dimensionality reduction)
# One of the algorithms involved is PCA
###########################


from sklearn.decomposition import PCA
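# Reducing to fewer features (principal components) is not only for visualization, but it helps there:
# 3-D can be viewed yet is not intuitive, a 2-D plane is the most intuitive, and 4-D or higher cannot be seen at all.
# So the 4 original features in data are reduced to 2 dimensions for inspection.
# Note: in effect PCA combines the features algorithmically, as linear combinations that keep the most variance, in the hope that the species separate.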
pca = PCA(n_components=2)

pcad = pca.fit_transform(data)

plot(pcad[target=='setosa',0],pcad[target=='setosa',1],'bo')
plot(pcad[target=='versicolor',0],pcad[target=='versicolor',1],'ro')
plot(pcad[target=='virginica',0],pcad[target=='virginica',1],'go')
show()

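# inspect the share of variance explained by each principal component (PC)
print (pca.explained_variance_ratio_)
#output: [ 0.92461621  0.05301557]
pc1, pc2 = pca.explained_variance_ratio_ # keep the variance share of each of the 2 PCs

print (1-sum(pca.explained_variance_ratio_))
#output:0.0223682249752
print (1.0-pc1-pc2) # equivalent to the output above

# inverse transform to restore the data
data_inv = pca.inverse_transform(pcad)
# compare the restored data with the original
# (the signed differences cancel out, so this sum is close to 0 whatever is lost; the 2.2% above is the actual information loss)
print (abs(sum(sum(data - data_inv))))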
#output:6.66133814775e-15

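# loop: number of PCs from 1 to 4 (the original data is 4-D as well)
# and look at the share of information PCA retains; 4 PCs certainly give 100%, and 3 is already very high;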
for i in range(1,5):
    pca = PCA(n_components=i)
    pca.fit(data)
    print (sum(pca.explained_variance_ratio_) * 100,'%')

#output:92.4616207174 %
#output:97.7631775025 %
#output:99.481691455 %
#output:100.0 %
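
# The explained variance ratios are the eigenvalues of the covariance matrix divided by
# their sum. For 2-D data the eigenvalues have a closed form, so the idea fits in a short
# plain-Python sketch. explained_variance_ratio_2d and the sample points are assumptions
# of this sketch; sklearn's PCA handles any number of dimensions by other numerical means.

```python
import math

def explained_variance_ratio_2d(points):
    """Eigenvalues of the 2x2 sample covariance matrix, as fractions of their sum."""
    n = len(points)
    mean_u = sum(p[0] for p in points) / n
    mean_v = sum(p[1] for p in points) / n
    suu = sum((p[0] - mean_u) ** 2 for p in points) / (n - 1)
    svv = sum((p[1] - mean_v) ** 2 for p in points) / (n - 1)
    suv = sum((p[0] - mean_u) * (p[1] - mean_v) for p in points) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[suu, suv], [suv, svv]].
    half_trace = (suu + svv) / 2
    radius = math.sqrt(((suu - svv) / 2) ** 2 + suv ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    return lam1 / (lam1 + lam2), lam2 / (lam1 + lam2)

# Points exactly on a line: one component carries all the variance.
demo_ratio_line = explained_variance_ratio_2d([(0, 0), (1, 2), (2, 4), (3, 6)])
# Points scattered around a line: the second component keeps a small share.
demo_ratio_noisy = explained_variance_ratio_2d(
    [(2.0, 1.0), (3.0, 3.5), (4.0, 3.0), (5.0, 5.5), (6.0, 5.0)])
print(demo_ratio_line)
print(demo_ratio_noisy)
```

# Dropping the second component loses exactly its share of the variance, which is what 1-sum(explained_variance_ratio_) measures above.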




print ("END")
#END
