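Big data, an IT industry term, refers to data sets that cannot be captured, managed, and processed with conventional software tools within a given time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight discovery, and process optimization.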
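In big data scenarios, one important report metric is the UV (Unique Visitor) count, i.e., the number of distinct users within a given time period. For example, to see how users are distributed across apps over one week, you can write the following HiveQL in Hive: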
select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app
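To make the semantics of this query concrete, here is a small sketch that runs the same kind of statement against an in-memory SQLite database. SQLite only stands in for Hive here, and the table contents are made-up sample rows; the point is that the per-app count needs `group by app`, and that repeated visits by the same uid are counted once.

```python
import sqlite3

# In-memory database used as a stand-in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table log_table (app text, uid text, week_cal text)")

# Hypothetical sample rows: (app, uid, week_cal).
conn.executemany(
    "insert into log_table values (?, ?, ?)",
    [
        ("mail", "u1", "2016-03-27"),
        ("mail", "u1", "2016-03-27"),  # repeat visit, counted once
        ("mail", "u2", "2016-03-27"),
        ("maps", "u3", "2016-03-27"),
        ("maps", "u4", "2016-03-20"),  # other week, filtered out
    ],
)

# One row per app, each with its number of distinct uids for that week.
uv_rows = conn.execute(
    "select app, count(distinct uid) as uv "
    "from log_table "
    "where week_cal = '2016-03-27' "
    "group by app "
    "order by app"
).fetchall()
```

Here `uv_rows` holds one `(app, uv)` tuple per app for the selected week.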
Similarly, in Pig:
-- all users
define DISTINCT_COUNT(A, a) returns dist {
    B = foreach $A generate $a;
    unique_B = distinct B;
    C = group unique_B all;
    $dist = foreach C generate SIZE(unique_B);
}
A = load '/path/to/data' using PigStorage() as (app, uid);
B = DISTINCT_COUNT(A, uid);

-- users per app
A = load '/path/to/data' using PigStorage() as (app, uid);
B = distinct A;
C = group B by app;
D = foreach C generate group as app, COUNT($1) as uv;
-- alternative to the line above, suitable for small cardinality scenarios
D = foreach C generate group as app, SIZE($1) as uv;
DataFu provides Pig with a cardinality-estimation UDF, datafu.pig.stats.HyperLogLogPlusPlus. It uses the HyperLogLog++ algorithm, which gives a faster (approximate) distinct count:
define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
A = load '/path/to/data' using PigStorage() as (app, uid);
B = group A by app;
C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
In Spark, after loading the data, the distinct count is computed through a series of RDD transformations (map, distinct, reduceByKey):
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map { line => (line._1, 1) }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .mapValues { _ => 1 }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map(_._1)
  .countByValue()
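All three variants follow the same two steps: deduplicate the (app, uid) pairs, then count the remaining pairs per app. The plain-Python sketch below mirrors that logic on a local list, with no Spark involved; the function name `uv_per_app` and the sample rows are made up for illustration.

```python
from collections import Counter

def uv_per_app(rows):
    """Exact distinct-user count per app for an iterable of (app, uid) pairs."""
    # Step 1: like map + distinct, keep each (app, uid) pair once.
    unique_pairs = {(app, uid) for app, uid in rows}
    # Step 2: like reduceByKey(_ + _) or countByValue, count pairs per app.
    return dict(Counter(app for app, _ in unique_pairs))

# Hypothetical sample data.
sample_rows = [("mail", "u1"), ("mail", "u2"), ("mail", "u1"), ("maps", "u3")]
uv = uv_per_app(sample_rows)
```

The set of unique pairs is what makes the exact approach expensive: its size grows with the number of distinct users, which is the cost the approximate methods avoid.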
Spark also provides an API for approximate distinct counting:
rdd.map { row => (row.app, row.uid) }
  .countApproxDistinctByKey(0.001)
The implementation is based on the HyperLogLog algorithm:
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
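To give a feel for how HyperLogLog estimates cardinality, here is a minimal pure-Python sketch of the basic algorithm. It is not Spark's or streamlib's implementation and omits the HyperLogLog++ refinements; the class name, the SHA-1 based 64-bit hash, and the helper `precision_bits_for` are all choices made for illustration. The helper uses the textbook standard error of about 1.04 / sqrt(m) for m registers, which may differ from how Spark sizes its counters.

```python
import hashlib
import math

def precision_bits_for(relative_sd):
    """Smallest p such that 1.04 / sqrt(2**p) <= relative_sd (textbook bound)."""
    return math.ceil(math.log2((1.04 / relative_sd) ** 2))

class HyperLogLog:
    """Basic HyperLogLog with m = 2**p registers (illustrative sketch only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The first p bits select a register.
        idx = h >> (64 - self.p)
        # The remaining bits give the rank: position of the leftmost 1-bit.
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)
        return raw

# Usage: 100,000 distinct ids, each added twice.
hll = HyperLogLog(p=12)
for i in range(100000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates never change a register
approx = hll.estimate()
```

With p = 12 the sketch keeps 4,096 small registers regardless of how many ids are added, and the expected relative error is roughly 1.04 / 64, or about 1.6%. Under the same textbook bound, a target of 0.001 needs p = 21, i.e. about two million registers per key, so very tight accuracy targets are not free.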
Alternatively, convert the RDD that carries a schema into a DataFrame, call registerTempTable, and then run a SQL statement:
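import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// brings toDF() into scope for RDDs of case classes
import sqlContext.implicits._

val df = rdd.toDF()
df.registerTempTable("app_table")
val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")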