Quantiles (Quartiles) and the quantile Function in pandas

Date: 2020-08-20

Function prototype

DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')

Parameters

- q : float or array-like, default 0.5 (the 50% quantile, i.e. the median or second quartile). 0 <= q <= 1, the quantile(s) to compute.
- axis : {0, 1, 'index', 'columns'}, default 0. 0 or 'index' for row-wise (one result per column), 1 or 'columns' for column-wise (one result per row).
- interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}. The interpolation method to use when the desired quantile lies between two data points i and j:
  - linear: i + (j - i) * fraction, where fraction is the fractional part of the computed position pos (the worked example further down shows how fraction is obtained).
  - lower: i.
  - higher: j.
  - nearest: i or j, whichever is nearest.
  - midpoint: (i + j) / 2.
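
As a rough illustration of the five interpolation options, the sketch below evaluates the same quantile of a small Series under each one. The sample values and variable names are chosen here purely for illustration; with these four points the 0.4 quantile falls between the data points 10 and 100, so the options give visibly different answers.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch).
s = pd.Series([1, 10, 100, 100])

# Evaluate the 0.4 quantile under each interpolation option.
methods = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {m: float(s.quantile(0.4, interpolation=m)) for m in methods}

for name, value in results.items():
    print(f"{name:>8}: {value}")
```

Here lower and higher return the two neighbouring data points themselves, while linear and midpoint return values that do not occur in the data.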

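Quartiles in statistics

In principle, p can take any value between 0 and 1. The quartiles are the best-known special case of the p-quantile.

To obtain the quartiles, sort the values in ascending order and divide them into four equal parts; the values at the three cut points are the quartiles.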

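The first quartile (Q1), also called the lower quartile, is the value at the 25% position after all values in the sample are sorted in ascending order.
The second quartile (Q2), also called the median, is the value at the 50% position after all values in the sample are sorted in ascending order.
The third quartile (Q3), also called the upper quartile, is the value at the 75% position after all values in the sample are sorted in ascending order.

The difference between the third and first quartiles is called the interquartile range (IQR).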
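
The three quartiles and the IQR can be sketched with pandas as follows. The sample values are an assumption made for this example, and the quartiles are computed with pandas' default linear interpolation, so they need not coincide with actual data points.

```python
import pandas as pd

# Illustrative sample (assumed here for demonstration).
s = pd.Series([1, 2, 3, 4])

# The three quartiles are the 0.25, 0.5 and 0.75 quantiles.
q1, q2, q3 = s.quantile([0.25, 0.5, 0.75])

# Interquartile range: distance between the third and first quartiles.
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```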

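Calculation method and examples

To keep things general, the calculation below is written for the p-quantile. Setting p = 0.25, 0.5 or 0.75 gives the quartiles.

First, determine the position of the p-quantile among the n sorted values (positions are counted from 1). There are two common methods:

Method 1: pos = (n + 1) * p
Method 2: pos = 1 + (n - 1) * p

pandas uses Method 2.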
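
To see how the two position formulas differ, the short sketch below evaluates both for n = 4. The helper names position_method1 and position_method2 are hypothetical, introduced only for this illustration.

```python
def position_method1(n, p):
    """Method 1 (hypothetical helper): pos = (n + 1) * p."""
    return (n + 1) * p


def position_method2(n, p):
    """Method 2 (hypothetical helper): pos = 1 + (n - 1) * p."""
    return 1 + (n - 1) * p


n = 4
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    # Print the 1-based position each method assigns to the p-quantile.
    print(p, position_method1(n, p), position_method2(n, p))
```

Method 2 always yields a position between 1 and n, whereas Method 1 can fall outside that range for p close to 0 or 1, so it needs extra handling at the ends of the data.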

Given the test data:

       a    b
    0  1    1
    1  2   10
    2  3  100
    3  4  100

Compute the 0.1 quantile of each column:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=['a', 'b'])
    print(df.quantile(.1))

The result is:

    a    1.3
    b    3.7
    Name: 0.1, dtype: float64

Linear interpolation is used by default.

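Column a:
pos = 1 + (4 - 1) * 0.1 = 1.3
fraction = 0.3
Position 1.3 lies between the 1st and 2nd sorted values, so i = 1 and j = 2:
ret = 1 + (2 - 1) * 0.3 = 1.3

Column b:
pos = 1.3
fraction = 0.3, with i = 1 and j = 10:
ret = 1 + (10 - 1) * 0.3 = 3.7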

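What if pos were 2.5 in column b (which is the case for p = 0.5)? It would fall between positions 2 and 3, so i = 10, j = 100 and fraction = 0.5, giving ret = 10 + (100 - 10) * 0.5 = 55.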

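As another illustration of "when the desired quantile lies between two data points i and j", take y = [1, 10, 100, 100] with zero-based indices x = [0, 1, 2, 3], which correspond to the quantile levels [0, 0.333, 0.667, 1]. For p = 0.4, which lies between 0.333 and 0.667, i and j are 10 and 100. The position is pos = 1 + (4 - 1) * 0.4 = 2.2, and its fractional part gives fraction = 0.2, so the value is 10 + (100 - 10) * 0.2 = 28. To verify:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=['a', 'b'])
    print(df.quantile([0.1, 0.2, 0.4, 0.5, 0.75]))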
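
The hand calculation can also be checked programmatically. The sketch below repeats the pos and fraction arithmetic for column b at each quantile level and compares it with what pandas returns; the loop and its variable names are illustrative and are not how pandas is implemented internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=["a", "b"])
levels = [0.1, 0.2, 0.4, 0.5, 0.75]

# Hand calculation for column b, following pos = 1 + (n - 1) * p.
b = sorted(df["b"].tolist())
n = len(b)
by_hand = []
for p in levels:
    pos = 1 + (n - 1) * p
    k = int(pos)                       # 1-based position of i (pos >= 1, so int() floors)
    fraction = pos - k                 # fractional part of pos
    i, j = b[k - 1], b[min(k, n - 1)]  # clamp j so p = 1 stays in range
    by_hand.append(i + (j - i) * fraction)

from_pandas = df["b"].quantile(levels).tolist()

print(by_hand)
print(from_pandas)
print(np.allclose(by_hand, from_pandas))
```

The comparison uses np.allclose rather than exact equality because the two computations can differ by floating-point rounding.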
