Spark Operators Series: aggregateByKey Explained

Date: 2019-11-06


1. Basic introduction

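In rdd.aggregateByKey(3, seqFunc, combFunc), the first argument is the initial value.

Here 3 is the initial value given to each group (each key) after grouping. It is applied once per key within each partition.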

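seqFunc is the combine logic: it aggregates the values of each key inside a single map task (partition).

The aggregated result of each map task is the combine output.

combFunc is the reduce-side logic: it merges the per-task results for each key into the final value.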

P.S. aggregateByKey groups by key by default.
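The two-stage flow described above can be sketched in plain Python without a Spark cluster. This is only a rough model under stated assumptions, not Spark's actual implementation: the helper name aggregate_by_key and the explicit list-of-partitions input are hypothetical and exist purely for illustration.

```python
def aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Pure-Python model of aggregateByKey (hypothetical helper, not a Spark API).

    `partitions` is a list of lists of (key, value) pairs.
    """
    # Map side: inside each partition, fold every key's values,
    # starting from the zero value the first time a key is seen.
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_func(acc.get(k, zero), v)
        per_partition.append(acc)

    # Reduce side: merge the per-partition results that share a key.
    merged = {}
    for acc in per_partition:
        for k, u in acc.items():
            merged[k] = comb_func(merged[k], u) if k in merged else u
    return merged


partitions = [[("a", 1), ("a", 5)], [("a", 2), ("b", 9)]]
result = aggregate_by_key(partitions, 3, max, lambda a, b: a + b)
print(result)  # {'a': 8, 'b': 9}
```

Key "a" yields max(3, 1, 5) = 5 in the first partition and max(3, 2) = 3 in the second, and the reduce side adds them to get 8. Key "b" appears in only one partition, so the reduce-side function is never called for it.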

2. Code

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("AggregateByKey")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([(1, 1), (1, 2), (2, 1), (2, 3), (2, 4), (1, 7)], 2)

# Print the contents of each partition to see how the data is distributed.
def f(index, items):
    items = list(items)  # materialize so that printing does not exhaust the iterator
    print("partitionId:%d" % index)
    for val in items:
        print(val)
    return items

rdd.mapPartitionsWithIndex(f, False).count()


def seqFunc(a, b):
    print("seqFunc:%s,%s" % (a, b))
    return max(a, b)  # take the maximum

def combFunc(a, b):
    print("combFunc:%s,%s" % (a, b))
    return a + b  # add them up

# aggregateByKey groups by key internally
aggregateRDD = rdd.aggregateByKey(3, seqFunc, combFunc)
rest = aggregateRDD.collectAsMap()
for k, v in rest.items():
    print(k, v)

sc.stop()
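Because the initial value is applied once per key in every partition, the final result depends on how the data is partitioned. The sketch below traces the same six pairs in plain Python, assuming they are split in order into two partitions of three elements each (the partition printout in the code above is what would confirm the real layout). The simulate helper is hypothetical and hard-codes max as seqFunc and addition as combFunc.

```python
from functools import reduce

data = [(1, 1), (1, 2), (2, 1), (2, 3), (2, 4), (1, 7)]


def simulate(partitions, zero=3):
    # seqFunc stage: per partition and key, take the max of the values,
    # seeded with the zero value.
    partials = []
    for part in partitions:
        keys = sorted({k for k, _ in part})
        partials.append(
            {k: reduce(max, [v for kk, v in part if kk == k], zero) for k in keys}
        )
    # combFunc stage: add up the partial results that share a key.
    totals = {}
    for partial in partials:
        for k, u in partial.items():
            totals[k] = totals[k] + u if k in totals else u
    return totals


two_parts = simulate([data[:3], data[3:]])  # assumed even split into 2 partitions
one_part = simulate([data])                 # everything in a single partition
print(two_parts)  # {1: 10, 2: 7}
print(one_part)   # {1: 7, 2: 4}
```

With two partitions, key 1 gives max(3, 1, 2) = 3 and max(3, 7) = 7, which add up to 10, while key 2 gives max(3, 1) = 3 and max(3, 3, 4) = 4, which add up to 7. With a single partition there is nothing to merge, so the results are simply 7 and 4.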