1.標準誤差概念函數
標準誤差(Std Dev,Standard Deviation) -統計學名詞。一種度量數據分佈的分散程度之標準,用以衡量數據值偏離算術平均值的程度。標準誤差越小,這些值偏離平均值就越少,反之亦然。標準誤差的大小可經過標準誤差與平均值的倍率關係來衡量。spa
例如,A、B兩組各有6位學生參加同一次語文測驗,A組的分數爲9五、8五、7五、6五、5五、45,B組的分數爲7三、7二、7一、6九、6八、67。這兩組的平均數都是70,但A組的標準差應該是17.078分,B組的標準差應該是2.160分,說明A組學生之間的差距要比B組學生之間的差距大得多。3d
標準誤差又分爲整體標準誤差與樣本標準誤差code
select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'A' as col, '2' as num union all select 'A' as col, '3' as num union all select 'B' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col ;
查詢結果:orm
select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'A' as col, '2' as num union all select 'A' as col, '3' as num union all select 'B' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col
查詢結果blog
由上可看出,hive中stddev()函數默認計算整體標準誤差,spark 中stddev()函數默認計算樣本標準誤差it
select col, stddev(num) over(partition by col) as stddev_col from ( select 'A' as col, '1' as num union all select 'A' as col, '2' as num union all select 'A' as col, '3' as num union all select 'B' as col, '1' as num union all select 'B' as col, '2' as num ) as a
查詢結果:spark
select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col ;
查詢結果:io
(2)sparkform
select col, stddev_pop(num),stddev_samp(num),stddev(num) as stddev_col from ( select 'A' as col, '1' as num union all select 'B' as col, '2' as num ) as a group by col ;
查詢結果: