[Spark] About the combineByKey function

Date: 2019-12-07



combineByKey:

Generic function to combine the elements for each key using a custom set of aggregation functions.

Overview

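The combineByKey method aggregates values by key. Most of the other key-based aggregation functions are implemented on top of it, so it is well worth understanding.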

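Suppose that before aggregation the pair RDD holds key-value pairs whose keys have type K and whose values have type V. After aggregation the key type is unchanged, while the value type becomes C.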

The combineByKey function is defined as:

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=portable_hash)

The first three parameters are the ones that matter most:

createCombiner
mergeValue
mergeCombiners
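
To make the roles of these three functions concrete, here is a minimal single-process sketch of the semantics. It is not Spark's implementation: the helper `combine_by_key`, the explicit list of partitions, and the sample data are all hypothetical, chosen only for illustration. It assumes createCombiner is called the first time a key is seen within a partition, mergeValue folds later values for that key in the same partition, and mergeCombiners joins the per-partition results.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)  # key type, unchanged by the aggregation
V = TypeVar("V")                  # value type before aggregation
C = TypeVar("C")                  # combined value type after aggregation


def combine_by_key(
    partitions: Iterable[Iterable[Tuple[K, V]]],
    create_combiner: Callable[[V], C],
    merge_value: Callable[[C, V], C],
    merge_combiners: Callable[[C, C], C],
) -> Dict[K, C]:
    """Hypothetical in-memory model of combineByKey (not Spark's code)."""
    per_partition: List[Dict[K, C]] = []
    for part in partitions:
        local: Dict[K, C] = {}
        for key, value in part:
            if key not in local:
                # First time this key is seen in this partition.
                local[key] = create_combiner(value)
            else:
                # The key already has a combiner in this partition.
                local[key] = merge_value(local[key], value)
        per_partition.append(local)

    merged: Dict[K, C] = {}
    for local in per_partition:
        for key, combiner in local.items():
            if key not in merged:
                merged[key] = combiner
            else:
                # The same key was combined in more than one partition.
                merged[key] = merge_combiners(merged[key], combiner)
    return merged


# Two hypothetical partitions; "coffee" appears in both of them.
parts = [[("coffee", 1), ("panda", 3)], [("coffee", 2)]]
result = combine_by_key(
    parts,
    lambda v: (v, 1),                         # V -> C: start a (sum, count) pair
    lambda c, v: (c[0] + v, c[1] + 1),        # C, V -> C: fold in one more value
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # C, C -> C: join two partial pairs
)
print(result)
```

Here V is an int and C is a (sum, count) tuple, which shows why the value type is allowed to change while the key type stays the same.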

[Figure: schematic diagram of combineByKey]

An example

Let's look at an example first. If it doesn't make sense yet, read on and then come back to it.

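>>> test = sc.parallelize([('coffee', 1), ('coffee', 2), ('panda', 3)])
>>> xx = test.combineByKey(
...     lambda x: (x, 1),                         # createCombiner: first value -> (sum, count)
...     lambda x, y: (x[0] + y, x[1] + 1),        # mergeValue: fold one more value into (sum, count)
...     lambda x, y: (x[0] + y[0], x[1] + y[1]),  # mergeCombiners: add two (sum, count) pairs
... )
>>> xx.collect()
[('coffee', (3, 2)), ('panda', (3, 1))]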

Here the three arguments are supplied as three lambda expressions, which are, in order: