Distinct Count in Big Data (1): Introduction

Date: 2020-07-09

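Big data is an IT industry term for data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It refers to massive, fast-growing, and diverse information assets that require new processing models to deliver stronger decision-making power, insight discovery, and process optimization.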

Hive

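In big data scenarios, one of the most important report metrics is UV (Unique Visitor): the number of distinct users within a given time period. For example, to see how users were distributed across apps during one week, the HiveQL query in Hive is: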

select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app
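
To make the semantics concrete, here is a minimal Python sketch of what a per-app `count(distinct uid)` computes. The function name `uv_per_app` and the sample rows are invented for illustration. Note that it keeps every uid in memory, which is exactly the cost that grows painful at scale.

```python
from collections import defaultdict

def uv_per_app(rows):
    """Exact distinct count of uid per app; rows is an iterable of (app, uid)."""
    seen = defaultdict(set)          # app -> set of uids seen so far
    for app, uid in rows:
        seen[app].add(uid)           # duplicates collapse inside the set
    return {app: len(uids) for app, uids in seen.items()}

# Hypothetical sample log rows
rows = [("mail", "u1"), ("mail", "u2"), ("mail", "u1"), ("maps", "u3")]
print(uv_per_app(rows))  # {'mail': 2, 'maps': 1}
```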

Pig

Similarly, in Pig:

-- all users
define DISTINCT_COUNT(A, a) returns dist {
    B = foreach $A generate $a;
    unique_B = distinct B;
    C = group unique_B all;
    $dist = foreach C generate SIZE(unique_B);
};
A = load '/path/to/data' using PigStorage() as (app, uid);
B = DISTINCT_COUNT(A, uid);

-- users per app
A = load '/path/to/data' using PigStorage() as (app, uid);
B = distinct A;
C = group B by app;
D = foreach C generate group as app, COUNT($1) as uv;
-- alternative using SIZE, suitable for small-cardinality scenarios
D = foreach C generate group as app, SIZE($1) as uv;

DataFu provides Pig with a cardinality-estimation UDF, datafu.pig.stats.HyperLogLogPlusPlus, which uses the HyperLogLog++ algorithm to compute an approximate distinct count more quickly:

define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
A = load '/path/to/data' using PigStorage() as (app, uid);
B = group A by app;
C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
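
For intuition about why such an estimator is cheap, the following is a minimal sketch of plain HyperLogLog in Python. It is not DataFu's implementation and omits the HyperLogLog++ refinements. The class name, the SHA-1 based hash, and the default `p=12` are assumptions chosen for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Plain HyperLogLog with m = 2**p registers (no HLL++ refinements)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (illustrative choice)
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
        est = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * m and zeros:               # small-range correction
            est = m * math.log(m / zeros)          # fall back to linear counting
        return int(round(est))

hll = HyperLogLog()
for i in range(20000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates leave the registers unchanged
print(hll.count())         # an estimate near 20000
```

The sketch stores only 4096 small integers regardless of how many uids it sees, trading a small relative error for bounded memory.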

Spark

In Spark, after loading the data, a distinct count can be computed through a chain of RDD transformations (map, distinct, reduceByKey):

rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map { line => (line._1, 1) }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .mapValues{ _ => 1 }
  .reduceByKey(_ + _)

// or 
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map(_._1)
  .countByValue()

Spark also provides an API for approximate distinct counts:

rdd.map { row => (row.app, row.uid) }
    .countApproxDistinctByKey(0.001)

The implementation is based on the HyperLogLog algorithm:

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
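
To get a feel for what an accuracy argument such as 0.001 costs, recall that in the original HyperLogLog analysis the relative standard error with m registers is roughly 1.04/√m. The sketch below inverts that rule of thumb. The helper name `registers_for` is hypothetical, and Spark's own sizing constant may differ slightly, so treat the numbers as an order-of-magnitude guide only.

```python
import math

def registers_for(relative_sd):
    """Smallest power-of-two register count m with 1.04 / sqrt(m) <= relative_sd."""
    m_needed = (1.04 / relative_sd) ** 2
    p = math.ceil(math.log2(m_needed))     # number of index bits
    return p, 1 << p

for sd in (0.05, 0.01, 0.001):
    p, m = registers_for(sd)
    print(f"relative_sd={sd}: p={p}, m={m}")
```

Under this rule of thumb, 0.001 calls for about two million registers per key, so a looser value such as 0.01 or 0.05 is often a more practical starting point when there are many keys.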

Alternatively, convert an RDD that has a schema into a DataFrame, register it with registerTempTable, and run a SQL query:

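import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // enables rdd.toDF() (rdd must hold case classes or tuples)
val df = rdd.toDF()
df.registerTempTable("app_table")

val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")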