A detailed look at the standard deviation function stddev() in Hive

1. The concept of standard deviation

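Standard deviation (Std Dev) is a term from statistics. It measures how dispersed a data distribution is, that is, how far the values deviate from the arithmetic mean. The smaller the standard deviation, the less the values deviate from the mean, and vice versa. How large a standard deviation is can be judged by its ratio to the mean.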

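For example, groups A and B each have 6 students who take the same language test. Group A scores 95, 85, 75, 65, 55, 45; group B scores 73, 72, 71, 69, 68, 67. Both groups have a mean of 70, but group A's standard deviation is 17.078 points and group B's is 2.160 points (both in the population form, dividing by the number of students), which shows that the spread among students in group A is much larger than in group B.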
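The figures in this example can be checked with a short Python sketch. It assumes Python's standard `statistics` module, which the original post does not use; `pstdev` divides by n, and that is the form that reproduces 17.078 and 2.160.

```python
import statistics

group_a = [95, 85, 75, 65, 55, 45]
group_b = [73, 72, 71, 69, 68, 67]

# pstdev divides the summed squared deviations by n (population form).
sd_a = statistics.pstdev(group_a)
sd_b = statistics.pstdev(group_b)

print(statistics.mean(group_a), round(sd_a, 3))
print(statistics.mean(group_b), round(sd_b, 3))
```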

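Standard deviation comes in two forms: population standard deviation and sample standard deviation.

Population standard deviation: describes the spread of the entire population, so the sum of squared deviations is averaged, i.e. divided by the population size N.
 
Sample standard deviation, also called experimental standard deviation: estimates the population's spread from a sample drawn from it. To bring the estimate closer to the population value, the computed value has to be enlarged slightly, i.e. the sum of squared deviations is divided by n - 1 instead of n.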
 
 
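2. Formulas for standard deviation:
 
Sample standard deviation

$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$

where $\bar{X}$ is the mean of the sample X1, X2, ..., Xn.

Population standard deviation

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - \mu\right)^2}$$

where $\mu$ is the mean of the population X.
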
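Example: given the numbers 200, 50, 100, 200, find their sample standard deviation.
 
Mean = (200+50+100+200)/4 = 550/4 = 137.5
 
S^2 = [(200-137.5)^2+(50-137.5)^2+(100-137.5)^2+(200-137.5)^2]/(4-1) = 16875/3 = 5625
Sample standard deviation S = Sqrt(S^2) = 75.
Note: the standard deviation taught in section 21.2, "Dispersion of data", of the Grade 8 (volume 2) textbook from Shanghai Scientific & Technical Publishers is the population standard deviation.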
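The same worked example can be followed step by step in code. This is a minimal Python sketch using only the standard `math` module; the variable names are illustrative. It also computes the population form for comparison, which shows how much the n - 1 divisor enlarges the result.

```python
import math

values = [200, 50, 100, 200]
n = len(values)

mean = sum(values) / n                               # 137.5
squared_dev = sum((x - mean) ** 2 for x in values)   # 16875.0

sample_sd = math.sqrt(squared_dev / (n - 1))         # divide by n - 1
pop_sd = math.sqrt(squared_dev / n)                  # divide by n

print(sample_sd, pop_sd)
```

For these four numbers the sample form gives 75, while the population form gives roughly 64.95.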
 
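3. Standard deviation functions in Hive: stddev_pop(), stddev_samp(), stddev()
stddev_pop() returns the population standard deviation; stddev_samp() returns the sample standard deviation.
 
(1) Computing standard deviation with the Hive engine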
select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num
    union all select 'A' as col, '2' as num
    union all select 'A' as col, '3' as num
    union all select 'B' as col, '1' as num
    union all select 'B' as col, '2' as num
) as a
group by col;

Query result:

(2) Computing standard deviation with the Spark engine
select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num
    union all select 'A' as col, '2' as num
    union all select 'A' as col, '3' as num
    union all select 'B' as col, '1' as num
    union all select 'B' as col, '2' as num
) as a
group by col;

Query result:

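As the results show, stddev() in Hive computes the population standard deviation by default, whereas stddev() in Spark computes the sample standard deviation by default.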
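One way to tell which variant an engine's stddev() returned is to compare its column against reference values for both forms. The sketch below computes them for the same five rows in Python, using the standard `statistics` module (not part of the original post). It is a cross-check of the arithmetic, not engine output.

```python
import statistics

rows = [("A", "1"), ("A", "2"), ("A", "3"), ("B", "1"), ("B", "2")]

# Group the values by col; the strings are cast to float explicitly here,
# whereas in the query the engine converts them.
groups = {}
for col, num in rows:
    groups.setdefault(col, []).append(float(num))

# For each group: (population form, sample form)
summary = {
    col: (statistics.pstdev(nums), statistics.stdev(nums))
    for col, nums in groups.items()
}

for col, (pop_sd, samp_sd) in summary.items():
    print(col, pop_sd, samp_sd)
```

For group A the population form is about 0.8165 and the sample form is 1.0. For group B they are 0.5 and about 0.7071. A stddev() column that matches the first value of each pair is using the population form, and one that matches the second is using the sample form.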

 
4. stddev() can also be used as a window function
select col,
       stddev(num) over (partition by col) as stddev_col
from (
    select 'A' as col, '1' as num
    union all select 'A' as col, '2' as num
    union all select 'A' as col, '3' as num
    union all select 'B' as col, '1' as num
    union all select 'B' as col, '2' as num
) as a;

Query result:
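Unlike the group by form, the window form keeps every input row and repeats the partition's standard deviation on each of them. The Python sketch below emulates that behavior. `stddev_over_partition` is a hypothetical helper written for illustration, and its `sd_func` parameter defaults to the population form as an assumption; pass `statistics.stdev` for the sample form.

```python
import statistics
from collections import defaultdict

def stddev_over_partition(rows, sd_func=statistics.pstdev):
    """Return (col, stddev of that col's partition) for every input row."""
    partitions = defaultdict(list)
    for col, num in rows:
        partitions[col].append(float(num))
    per_partition = {col: sd_func(nums) for col, nums in partitions.items()}
    # One output row per input row, as a window function would produce.
    return [(col, per_partition[col]) for col, _ in rows]

rows = [("A", "1"), ("A", "2"), ("A", "3"), ("B", "1"), ("B", "2")]
windowed = stddev_over_partition(rows)
for row in windowed:
    print(row)
```

The output has five rows, not two: the three A rows share one value and the two B rows share another.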

 

5. Standard deviation results in Hive and Spark when each group contains only one input row
(1) Hive
select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num
    union all select 'B' as col, '2' as num
) as a
group by col;

Query result:

(2) Spark

select col,
       stddev_pop(num),
       stddev_samp(num),
       stddev(num) as stddev_col
from (
    select 'A' as col, '1' as num
    union all select 'B' as col, '2' as num
) as a
group by col;

Query result:
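With one row per group, the population form is well defined (the only value equals the mean, so the result is 0), but the sample form divides by n - 1 = 0, so each engine has to pick a convention for it. The sketch below shows only the arithmetic, using Python's standard `statistics` module. It makes no claim about what Hive or Spark return, which depends on the engine and its version.

```python
import statistics

single = [1.0]

# Population form: the single value equals the mean, so the deviation is 0.
pop_sd = statistics.pstdev(single)

# Sample form: the n - 1 divisor is 0, so Python refuses to compute it.
try:
    samp_sd = statistics.stdev(single)
except statistics.StatisticsError:
    samp_sd = None

print(pop_sd, samp_sd)
```

Because the single-row sample case is undefined mathematically, it is worth checking explicitly when the same query has to run on both engines.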
