Learning Machine Learning? You Need Data Analysis First: Data Visualization (matplotlib)

Date: 2019-11-10

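Preface

The previous two posts introduced the basics of two major Python modules, pandas and numpy. Processing data is not enough on its own, though; we also need to know how to analyze it. Charts are arguably the easiest way in: they make the relationships between variables and the trends in the data easy to see. This post therefore introduces matplotlib, Python's visualization library.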

Figure and Subplot

Every matplotlib plot lives inside a Figure object. Create a new Figure with plt.figure:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig = plt.figure()

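plt.figure accepts several options, such as figsize (which sets the figure size). You cannot draw on an empty Figure; you must first create one or more subplots with add_subplot:

ax1 = fig.add_subplot(2,2,1) # 2x2 grid; this selects the first of its 4 subplots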
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)

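The result is a figure with three empty subplots laid out on a 2x2 grid.

Now draw three plots:

ax3.plot(np.random.randn(50).cumsum(),'k--') # 'k--' is a style string: black dashed line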
_ = ax1.hist(np.random.randn(100),bins=20,color='k',alpha=0.3)
ax2.scatter(np.arange(30),np.arange(30)+3*np.random.randn(30))

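These are only three chart types; the matplotlib documentation lists many more. Because creating a Figure and subplots in a particular layout is such a common task, there is a more convenient method, plt.subplots, which creates a new Figure and returns a NumPy array containing the created subplot objects: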

fig, axes = plt.subplots(2,3)
axes

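This is very handy because the axes array can be indexed like any other array, for example axes[0, 1]. You can also use sharex and sharey to make the subplots share the same X or Y axis. That is useful when comparing data on the same scale; otherwise matplotlib scales the limits of each subplot independently.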
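As a minimal sketch of the indexing and axis sharing described above (the random data, the seed, and the figure size are assumptions made for illustration), a shared 2x2 grid might be built like this:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# 2x2 grid whose panels share both the X and the Y axis
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(6, 6))

for i in range(2):
    for j in range(2):
        # axes is a 2-D NumPy array of subplot objects, indexed like any array
        axes[i, j].hist(rng.standard_normal(500), bins=50, color="k", alpha=0.5)

# Remove the padding between panels so the shared axes line up
fig.subplots_adjust(wspace=0, hspace=0)

print(axes.shape)
```

Because the axes are shared, changing the limits on one panel changes them on all four, which keeps the histograms directly comparable.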

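Chart Elements

This section shows how to add the individual elements of a chart.

Title, Axis Labels, Ticks, and Tick Labels

The simplest way to change the X-axis ticks is with set_xticks and set_xticklabels. The former tells matplotlib where to place the ticks within the data range; by default these positions are also used as the tick labels. With set_xticklabels you can use any other values as the labels:

# Plot a random walk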
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(np.random.randn(1000).cumsum())

ticks = ax.set_xticks([0,250,500,750,1000])
labels = ax.set_xticklabels(['one','two','three','four','five'])
ax.set_title('My first matplotlib plot')
ax.set_xlabel('stages')

Legend

ax.plot(np.random.randn(1000).cumsum(),label = 'one')
ax.plot(np.random.randn(1000).cumsum(),'k--',label='two')
ax.plot(np.random.randn(1000).cumsum(),'k.',label='three')
ax.legend(loc='best') # loc tells matplotlib where to place the legend

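Annotations

Besides the standard chart types, you may want to draw custom annotations (text, arrows, or other shapes).
Annotations can be added with the text, arrow, and annotate functions. text draws text at the given coordinates (x, y) on the chart and accepts optional custom styling: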

ax.text(x,y,'Hello World!', family='monospace', fontsize=10, verticalalignment="top", horizontalalignment="right")
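The call above leaves x and y undefined. A runnable sketch, assuming a simple parabola as the plotted data and a hypothetical anchor point at (8, 64), could look like this:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(10), np.arange(10) ** 2)

# Hypothetical anchor point, given in data coordinates
x, y = 8, 64
label = ax.text(x, y, "Hello World!",
                family="monospace", fontsize=10,
                verticalalignment="top", horizontalalignment="right")

fig.canvas.draw()  # force a render so layout problems surface immediately
```

With verticalalignment="top" and horizontalalignment="right", the point (x, y) is the top-right corner of the text, so the text sits below and to the left of it.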

x1 = np.random.normal(30, 3, 100)
x2 = np.random.normal(20, 2, 100)

plt.plot(x1, label = 'plot')
plt.plot(x2, label = '2nd plot')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=2,
           ncol=2, mode='expand', borderaxespad=0.)
plt.annotate('Important value', (55,20),
             xycoords='data',
             xytext=(5,38),
             arrowprops = dict(arrowstyle = '->'))

plt.show()

annotate(text='str', xy=(x,y), xytext=(l1,l2), ..)

text is the annotation text (this parameter was named s before matplotlib 3.3)
xy is the coordinate of the point being annotated
xytext is the coordinate at which the annotation text is placed
xycoords is the coordinate system of xy; it accepts the following values:

figure points: points from the lower left of the figure

figure pixels: pixels from the lower left of the figure

figure fraction: fraction of the figure from the lower left

axes points: points from the lower left corner of the axes

axes pixels: pixels from the lower left corner of the axes

axes fraction: fraction of the axes from the lower left

data: use the coordinate system of the object being annotated (default)

polar: (theta, r) if not native 'data' coordinates

textcoords sets the coordinate system in which xytext (the position of the annotation text) is given:

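Value	Coordinate system
'figure points'	points from the lower left corner of the figure
'figure pixels'	pixels from the lower left corner of the figure
'figure fraction'	(0, 0) is the lower left of the figure, (1, 1) the upper right
'axes points'	points from the lower left corner of the axes
'axes pixels'	pixels from the lower left corner of the axes
'axes fraction'	(0, 0) is the lower left of the axes, (1, 1) the upper right
'data'	the data coordinate system of the axes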

arrowprops: arrow properties, given as a dict

width: the width of the arrow in points
headwidth: the width of the base of the arrow head in points
headlength: the length of the arrow head in points
shrink: fraction of the total length to shrink from both ends
facecolor: the arrow color

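bbox draws a box around the text. Commonly used parameters:

boxstyle: shape of the box
facecolor (short form fc): background color
edgecolor (short form ec): border line color
linewidth (short form lw): border line width

bbox=dict(boxstyle='round,pad=0.5', fc='yellow', ec='k',lw=1 ,alpha=0.5)  # fc is facecolor, ec is edgecolor, lw is linewidth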
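To see xycoords, textcoords, arrowprops, and bbox working together, here is a small sketch. The sine curve, the annotated peak, and the text position are assumptions chosen for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 2 * np.pi, 200)
ax.plot(x, np.sin(x))

peak = (np.pi / 2, 1.0)  # point being annotated, in data coordinates

note = ax.annotate(
    "peak",                        # annotation text, passed positionally
    xy=peak, xycoords="data",      # the arrow points at this data coordinate
    xytext=(0.7, 0.9), textcoords="axes fraction",  # text placed relative to the axes
    arrowprops=dict(facecolor="black", width=1, headwidth=6,
                    headlength=8, shrink=0.05),
    bbox=dict(boxstyle="round,pad=0.5", fc="yellow", ec="k", lw=1, alpha=0.5),
)

fig.canvas.draw()
```

Passing the text positionally sidesteps the s/text keyword rename between matplotlib versions. Placing the text in 'axes fraction' keeps it in the same corner of the plot even if the data range changes.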

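Plotting Functions in pandas

Most day-to-day data handling happens in pandas, so it is convenient to plot with its built-in plot method. The following sections cover several common chart types.

Line Plots

Both Series and DataFrame have a plot method that can produce many kinds of charts. By default it draws a line plot: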

s = pd.Series(np.random.randn(10).cumsum(),index=np.arange(0,100,10))
s.plot()

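The Series index is passed to matplotlib and used for the X axis. Pass use_index=False to disable this. The X-axis ticks and limits can be adjusted with the xticks and xlim options, and the Y axis with yticks and ylim.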
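The effect of these options is easiest to see side by side. The following sketch (the seed, tick positions, and limits are arbitrary assumptions) draws the same Series twice, once with custom ticks and limits and once with use_index=False:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(10).cumsum(), index=np.arange(0, 100, 10))

fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))

# Left: the index (0, 10, ..., 90) is used for X, with explicit ticks and limits
s.plot(ax=ax_a, xticks=[0, 30, 60, 90], xlim=(0, 90), title="index on X")

# Right: the index is ignored and the positions 0..9 are used instead
s.plot(ax=ax_b, use_index=False, title="use_index=False")
```

On the right-hand panel the X values run from 0 to 9 rather than 0 to 90, because the points are plotted against their position in the Series.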

The DataFrame plot method draws one line per column in a single subplot and creates a legend automatically:

df = pd.DataFrame(np.random.randn(10,4).cumsum(0),columns=['A','B','C','D'],index=np.arange(0,100,10))
df.plot()

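Series.plot parameters:

label: label for the legend
ax: the matplotlib subplot object to draw on
style: style string passed to matplotlib (e.g. 'ko--')
alpha: fill opacity of the plot (0 to 1)
kind: chart type, such as 'line', 'bar', 'barh', or 'kde'
logy: use a logarithmic scale on the Y axis
use_index: use the object's index for the tick labels
rot: rotation of the tick labels (0 to 360)
xticks: values to use for the X-axis ticks
yticks: values to use for the Y-axis ticks
xlim: X-axis limits
ylim: Y-axis limits
grid: show axis grid lines (by default this follows matplotlib's axes.grid setting, which is off in the default style)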

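Parameters specific to DataFrame.plot:

subplots: draw each DataFrame column in its own subplot
sharex: if subplots=True, share the same X axis, including ticks and limits
sharey: if subplots=True, share the same Y axis
figsize: tuple giving the figure size
title: string giving the figure title
legend: add a subplot legend
sort_columns: draw the columns in alphabetical order; by default the existing column order is used (deprecated in pandas 1.5 and removed in 2.0)

Note: time series handling is not covered in this post and will be added in a later one.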
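As an illustration of the DataFrame-specific options, this sketch (the data, figure size, and title are assumptions) draws each column in its own subplot with a shared X axis:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)).cumsum(0),
                  columns=list("ABCD"),
                  index=np.arange(0, 100, 10))

# subplots=True returns an array of axes, one per column
axes = df.plot(subplots=True, sharex=True, figsize=(6, 8),
               title="One subplot per column", legend=True)

print(len(axes))
```

With subplots=True, plot returns an array of axes rather than a single one, so each panel can be customized afterwards, for example with axes[0].set_ylabel(...).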

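Bar Plots

Adding kind='bar' (vertical bars) or kind='barh' (horizontal bars) to the line-plot code produces a bar plot. The Series or DataFrame index is then used for the X (bar) or Y (barh) ticks: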

fig, axes = plt.subplots(2,1)
data = pd.Series(np.random.rand(16),index=list('abcdefghijklmnop'))
data.plot(kind='bar',ax=axes[0],color='k',alpha=0.8,figsize=(8,10))
data.plot(kind='barh',ax=axes[1],color='k',alpha=0.8,figsize=(8,10))

df = pd.DataFrame(np.random.rand(6,4),index=['one','two','three','four','five','six'],columns=pd.Index(['A','B','C','D'],name='Genus'))
df

df.plot(kind='bar')

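df.plot(kind='barh',stacked=True) # stacked=True produces a stacked bar plot

Note: combined with value_counts, a bar plot can show how often each value occurs in a Series, for example s.value_counts().plot(kind='bar'). The next line instead normalizes each row of df to sum to 1 before drawing stacked bars: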

df.div(df.sum(1).astype(float),axis=0).plot(kind='barh',stacked=True)
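The value_counts idea from the note can be sketched as follows. The letter data is made up for illustration, and sort_index is used only to make the bar order predictable:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Hypothetical categorical data
s = pd.Series(list("aabbbcaabb"))

# Count the occurrences of each value, ordered by label
counts = s.value_counts().sort_index()

# One bar per distinct value; the bar height is the frequency
ax = counts.plot(kind="bar", color="k", alpha=0.7)

print(counts.to_dict())
```

Without sort_index, value_counts orders the bars from most to least frequent, which is often what you want for a quick look at the data.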

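Histograms and Density Plots

A histogram is a kind of bar plot that shows value frequencies in discretized form. The data points are split into discrete, evenly spaced bins, and what is plotted is the number of data points in each bin.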

length = pd.DataFrame({'length': [10, 20,15,10,1,12,12,12,13,13,13,14,14,14,41,41,41,41,41,4,4,4,4]})
length.plot.hist()
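To make the binning concrete, the counts behind such a histogram can be reproduced with np.histogram. This sketch assumes 10 evenly spaced bins, which is the default for plot.hist:

```python
import numpy as np

values = [10, 20, 15, 10, 1, 12, 12, 12, 13, 13, 13, 14, 14, 14,
          41, 41, 41, 41, 41, 4, 4, 4, 4]

# Split the range [min, max] into 10 equal-width bins and count the points in each
counts, edges = np.histogram(values, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}): {n}")
```

Each printed row corresponds to one bar: the interval is the bin and the number is the bar height. The last bin also includes its right edge, which is why the maximum value is counted.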

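A related chart type is the density plot, which is formed by computing an estimate of a continuous probability distribution that might have generated the observed data. The usual procedure is to approximate this distribution as a mixture of kernels, that is, simpler distributions such as the normal (Gaussian) distribution. Density plots are therefore also known as KDE (kernel density estimate) plots. Passing kind='kde' to plot produces a density plot using the standard mixture-of-normals estimate (this requires SciPy).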

length.plot(kind='kde')
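The "mixture of kernels" idea can be written out by hand. The sketch below is not how pandas computes it (pandas delegates to SciPy and chooses the bandwidth automatically); the helper name gaussian_kde_manual and the fixed bandwidth of 0.4 are assumptions made for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 200)
grid = np.linspace(-6, 6, 601)

density = gaussian_kde_manual(samples, grid, bandwidth=0.4)  # hypothetical bandwidth

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])

fig, ax = plt.subplots()
ax.plot(grid, density)
```

A smaller bandwidth gives a bumpier curve that follows individual samples; a larger one gives a smoother curve that may hide structure.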

df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000), 'c': np.random.randn(1000) - 1}, index=range(1,1001), columns=['a', 'b', 'c'])
# bins=20 sets the number of bins, i.e. the resolution. Take the values 5.6, 5.7 and 6.5: with few (wide) bins
# all three may land in one bin covering 5-7; with more (narrower) bins, 5.6 and 5.7 land in 5-6 and 6.5 in 6-7.
df4.plot.hist(stacked=True, bins=20, alpha=0.5)

df4.diff().hist(color='k', alpha=0.5, bins=50) # DataFrame.hist draws each column in its own subplot

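These two chart types are often drawn together: the histogram is drawn in normalized form (so that it shows the binned density), and a kernel density estimate is then plotted on top of it.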

comp1 = np.random.normal(0,1,size=200)
comp2 = np.random.normal(10,2,size=200)
values = pd.Series(np.concatenate([comp1,comp2]))
values.hist(bins=100,alpha=0.3,color='k',density=True) # density=True replaces normed=True, which was removed in matplotlib 3.1
values.plot(kind='kde')

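Scatter Plots

A scatter plot is an effective way to examine the relationship between two one-dimensional data series. matplotlib's scatter method is the main way to draw one; pandas exposes it as DataFrame.plot.scatter:

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b') # scatter plot with column a on the X axis and column b on the Y axis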

To draw several groups of points on the same axes, distinguish them with different colors and labels:

ax = df.plot.scatter(x='a', y='b', color='Blue', label='Group 1')
df.plot.scatter(x='c', y='d', color='Green', label='Group 2', ax=ax)

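In exploratory data analysis (EDA), it is useful to look at all the pairwise scatter plots among a group of variables at once; this is known as a scatter plot matrix.

from sklearn.datasets import load_iris # use the iris dataset that ships with sklearn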
iris_dataset = load_iris()
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(iris_dataset['data'],iris_dataset['target'],random_state=0)
iris_dataframe=pd.DataFrame(X_train,columns=iris_dataset.feature_names)
grr = pd.plotting.scatter_matrix(iris_dataframe,marker='o',c = y_train,hist_kwds={'bins':20},figsize=(12,10))

Pie Charts

A pie chart shows each value's share of the whole as a percentage.

series = pd.Series(4 * np.random.rand(4), index=['a', 'b', 'c', 'd'], name='series')
series.plot.pie(figsize=(6, 6))
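To print the percentages on the wedges, the autopct option of matplotlib's pie can be passed through plot.pie. In this sketch the values are fixed (an assumption, so that the shares are easy to check by hand) rather than random:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

series = pd.Series([1, 1, 2, 4], index=["a", "b", "c", "d"], name="series")

# autopct formats each wedge's share of the total as a percentage label
ax = series.plot.pie(autopct="%.1f%%", figsize=(6, 6))

# The same shares computed directly
shares = series / series.sum() * 100
print(shares.to_dict())
```

The wedge sizes depend only on the ratios between the values, so multiplying the whole Series by a constant leaves the chart unchanged.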

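For a DataFrame, each column can be drawn as its own pie chart, but you must pass subplots=True so that each pie is drawn in its own subplot of the same figure:

df = pd.DataFrame(3 * np.random.rand(4, 2), index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
df.plot.pie(subplots=True, figsize=(8, 4))

Box Plots (Omitted)

I have had little hands-on experience with box plots, and the topic is both broad and important, so I will cover it in a separate post.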

How Each Plot Type Handles Missing Values

Missing values are dropped, left out, or filled depending on the plot type

Plot Type	NaN Handling
Line	Leave gaps at NaNs
Line (stacked)	Fill 0’s
Bar	Fill 0’s
Scatter	Drop NaNs
Histogram	Drop NaNs (column-wise)
Box	Drop NaNs (column-wise)
Area	Fill 0’s
KDE	Drop NaNs (column-wise)
Hexbin	Drop NaNs
Pie	Fill 0’s
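Two rows of the table can be checked directly. This sketch assumes the behavior listed above (bar plots fill NaN with 0, line plots keep the point and leave a gap) and uses a made-up five-element Series:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

s.plot(ax=ax_line, title="line: gap at NaN")            # the line is broken at position 2
s.plot(kind="bar", ax=ax_bar, title="bar: NaN drawn as 0")

# Bar heights as drawn; the missing value shows up as a zero-height bar
heights = [p.get_height() for p in ax_bar.patches]
print(heights)

# The line keeps all five positions; the missing one is simply not drawn
n_points = len(ax_line.lines[0].get_xdata())
```

If a zero-height bar would be misleading for your data, drop or fill the missing values explicitly (for example with dropna or fillna) before plotting.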