Learning Matplotlib: Drawing Box Plots (boxplot) with matplotlib

Date: 2019-11-17

Tags: matplotlib, learning, line chart, boxplot


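A box plot uses the quartiles of a data set to show how the data are distributed: for example, where the center of the data lies, how spread out the data are, and whether there are outliers.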

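Sort the data from smallest to largest and split it into four equal parts. The first quartile (Q1), second quartile (Q2), and third quartile (Q3) are the values at the 25%, 50%, and 75% positions of the data.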

I-------------I o I-------------I o I-------------I o I-------------I
                Q1                Q2                Q3
         (lower quartile)      (median)      (upper quartile)

Interquartile range (IQR) = upper quartile (Q3) - lower quartile (Q1)
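The quartile and IQR definitions above can be sketched in a few lines of NumPy. The sample values are made up for illustration, and `np.percentile` is used with its default linear interpolation, which is only one of several conventions for computing quartiles.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th, 50th and 75th percentiles (NumPy's default linear interpolation)
q1, q2, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range: upper quartile minus lower quartile
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 3.0 5.0 7.0 4.0
```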

A box plot has two parts: the box and the whiskers. The box represents the data from the first quartile to the third quartile, and the whiskers represent the range of the data.

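From top to bottom, the horizontal lines of a box plot represent: the upper limit of the data (usually Q3 + 1.5*IQR), the third quartile (Q3), the second quartile (the median), the first quartile (Q1), and the lower limit of the data (usually Q1 - 1.5*IQR). Sometimes there are also dots beyond the upper and lower limits; these represent outliers.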

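(Note: if the upper and lower limits lie beyond the maximum and minimum of the data, the whiskers end at the data's maximum and minimum instead.)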
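As a rough illustration of the 1.5*IQR rule and the note above, the following sketch computes the limits by hand, finds where the whiskers would end, and picks out the outliers. The sample values are hypothetical, and the quartiles again rely on NumPy's default interpolation, so other tools may give slightly different limits.

```python
import numpy as np

# Hypothetical sample; 30 is included as a deliberate outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Limits from the 1.5*IQR rule
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that still lie within the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the limits are the outliers drawn as dots
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(whisker_low, whisker_high, outliers)  # 1 9 [30]
```

Here the lower limit falls below the smallest value, so the lower whisker stops at the minimum of the data, which is the case the note describes.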

The figure below shows the meaning of each part of a box plot. (Source: https://datavizcatalogue.com/methods/box_plot.html)

Below we practice plotting with data from Jake Vanderplas's book "Python Data Science Handbook".

Data URL: https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv

This data file was already used in "Learning Matplotlib: Drawing Line Charts (line chart) with matplotlib", so here we use the cleaned data directly:

import pandas as pd
from matplotlib import pyplot as plt

birth = pd.read_csv(r"https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv")
fig, ax = plt.subplots()

# Keep the first 15067 rows and convert the day column to integers
birth = birth.iloc[:15067].copy()
birth["day"] = birth["day"].astype(int)

# Build a date column; invalid dates become NaT and those rows are dropped
birth["date"] = pd.to_datetime({"year": birth["year"], "month": birth["month"], "day": birth["day"]}, errors='coerce')
birth = birth[birth["date"].notnull()]

These are the first 5 rows of the cleaned data:

       year  month  day gender  births       date
0      1969      1    1      F    4046 1969-01-01
1      1969      1    1      M    4440 1969-01-01
2      1969      1    2      F    4454 1969-01-02
3      1969      1    2      M    4548 1969-01-02
4      1969      1    3      F    4548 1969-01-03

The data show the number of boys and girls born each day in the United States from 1969 to 1988.

Let's draw a box plot to compare the distributions of daily male and female births in 1986, 1987, and 1988.

Box plot: ax.boxplot(x)
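Before turning to the births data, here is a minimal sketch of the `ax.boxplot(x)` call on its own. The data are synthetic (random normal values), and the `Agg` backend is selected only so the sketch can run without a display; in an interactive session you would drop that line and call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import numpy as np
from matplotlib import pyplot as plt

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=200)

fig, ax = plt.subplots()

# boxplot returns a dict of the artists that make up the plot
result = ax.boxplot(x)
print(sorted(result.keys()))
```

The returned dictionary groups the drawn artists under keys such as `boxes`, `medians`, `whiskers`, `caps`, and `fliers`, which is handy if you want to restyle individual parts afterwards.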

The complete code is as follows:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

birth = pd.read_csv(r"https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv")
fig, ax = plt.subplots()

birth = birth.iloc[:15067].copy()
birth["day"] = birth["day"].astype(int)

birth["date"] = pd.to_datetime({"year": birth["year"], "month": birth["month"], "day": birth["day"]}, errors='coerce')
birth = birth[birth["date"].notnull()]

# Extract the male and female birth counts for 1986-1988 and convert them to numpy arrays
birth1986_female = np.array(birth.births[(birth["year"] == 1986) & (birth["gender"] == "F")])
birth1986_male = np.array(birth.births[(birth["year"] == 1986) & (birth["gender"] == "M")])
birth1987_female = np.array(birth.births[(birth["year"] == 1987) & (birth["gender"] == "F")])
birth1987_male = np.array(birth.births[(birth["year"] == 1987) & (birth["gender"] == "M")])
birth1988_female = np.array(birth.births[(birth["year"] == 1988) & (birth["gender"] == "F")])
birth1988_male = np.array(birth.births[(birth["year"] == 1988) & (birth["gender"] == "M")])

# Several box plots are drawn at once, so the arrays are collected in a list
data = [birth1986_female, birth1986_male, birth1987_female, birth1987_male, birth1988_female, birth1988_male]
positions = [0, 0.6, 1.5, 2.1, 3, 3.6]
ax.boxplot(data, positions=positions)  # the positions parameter sets where each box is drawn
ax.set_xticks(positions)  # make the tick locations explicit before labeling them
ax.set_xticklabels(["1986\nfemale", "1986\nmale", "1987\nfemale", "1987\nmale", "1988\nfemale", "1988\nmale"])  # set the x-axis tick labels

plt.show()
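The six near-identical extraction lines could also be written as a loop. The sketch below is one possible variant, not the original author's approach; it uses a tiny made-up DataFrame with the same column names (`year`, `gender`, `births`) so that it runs without downloading the file.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned births table (same column names, invented values)
birth = pd.DataFrame({
    "year":   [1986, 1986, 1986, 1986, 1987, 1987, 1988, 1988],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "births": [4500, 4700, 4600, 4800, 4550, 4750, 4650, 4850],
})

data, labels = [], []
for year in (1986, 1987, 1988):
    for gender, name in (("F", "female"), ("M", "male")):
        # Select the birth counts for this year and gender
        mask = (birth["year"] == year) & (birth["gender"] == gender)
        data.append(birth.loc[mask, "births"].to_numpy())
        labels.append(f"{year}\n{name}")

print(labels)
```

With the real table, `data` and `labels` built this way could be passed to `ax.boxplot(data, positions=...)` and `ax.set_xticklabels(labels)` in place of the hand-written lists.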