Hive UDAF Development in Detail

Date: 2019-11-10


Notes

This article is a loose translation of "Hadoop Hive UDAF Tutorial - Extending Hive with Aggregation Functions". The examples in the original are easy to follow, and I have woven my own understanding of Hive UDAFs into the text.

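A UDAF is a user-defined aggregation function in Hive; Hive's built-in UDAFs include sum() and count(). A UDAF can be implemented in two ways, simple or generic. The simple UDAF relies on Java reflection, which costs performance, and it cannot use some features, so it has been deprecated. This post focuses on Hive's generic custom aggregation function, GenericUDAF. UDAF development mainly involves the following two abstract classes:

org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator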

Source code

All the code and data used in this post can be found at the following link: hive examples

Preparing the sample data

First, create a table named people that holds the sample data. It has a single column, name, and each value contains one or more names. The table's data is stored in the file people.txt.

~$ cat ./people.txt
John Smith
John and Ann White
Ted Green
Dorothy

Upload the file to the HDFS directory /user/matthew/people:

hadoop fs -mkdir people
hadoop fs -put ./people.txt people

Next, create a Hive external table by running the following in the Hive shell:

CREATE EXTERNAL TABLE people (name string)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t'
ESCAPED BY ''
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/matthew/people';

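Example

This section walks through an example aggregation function (UDAF) that counts the characters in the name column of the people table.

The code below computes the total number of characters in the specified column, spaces included.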

Code

package com.matthewrathbone.example;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

@Description(name = "letters", value = "_FUNC_(expr) - Returns the total number of characters of all strings in the column")
public class TotalNumOfLettersGenericUDAF extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
            throws SemanticException {
        // The function takes exactly one argument.
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1,
                    "Exactly one argument is expected.");
        }

        // The argument must be a primitive type ...
        ObjectInspector oi = TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(parameters[0]);
        if (oi.getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                    "Argument must be PRIMITIVE, but "
                    + oi.getCategory().name()
                    + " was passed.");
        }

        // ... and, more specifically, a string.
        PrimitiveObjectInspector inputOI = (PrimitiveObjectInspector) oi;
        if (inputOI.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentTypeException(0,
                    "Argument must be String, but "
                    + inputOI.getPrimitiveCategory().name()
                    + " was passed.");
        }

        return new TotalNumOfLettersEvaluator();
    }

    public static class TotalNumOfLettersEvaluator extends GenericUDAFEvaluator {

        PrimitiveObjectInspector inputOI;
        ObjectInspector outputOI;
        PrimitiveObjectInspector integerOI;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            assert (parameters.length == 1);
            super.init(m, parameters);

            // Map stage: reads the SQL column, so the input is a String primitive.
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                inputOI = (PrimitiveObjectInspector) parameters[0];
            } else {
                // Remaining stages: the input is an Integer partial result.
                integerOI = (PrimitiveObjectInspector) parameters[0];
            }

            // Every stage outputs an Integer.
            outputOI = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class,
                    ObjectInspectorOptions.JAVA);
            return outputOI;
        }

        /**
         * Holds the running character total.
         */
        static class LetterSumAgg implements AggregationBuffer {
            int sum = 0;

            void add(int num) {
                sum += num;
            }
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new LetterSumAgg();
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            // Clear the buffer that was passed in so that it can be reused.
            ((LetterSumAgg) agg).sum = 0;
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters)
                throws HiveException {
            assert (parameters.length == 1);
            if (parameters[0] != null) {
                LetterSumAgg myagg = (LetterSumAgg) agg;
                Object p1 = inputOI.getPrimitiveJavaObject(parameters[0]);
                myagg.add(String.valueOf(p1).length());
            }
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            // Return this buffer's own sum; keeping a running total on the
            // evaluator would mix up values from different buffers.
            LetterSumAgg myagg = (LetterSumAgg) agg;
            return myagg.sum;
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            if (partial != null) {
                LetterSumAgg myagg = (LetterSumAgg) agg;
                Integer partialSum = (Integer) integerOI.getPrimitiveJavaObject(partial);
                myagg.add(partialSum);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            LetterSumAgg myagg = (LetterSumAgg) agg;
            return myagg.sum;
        }
    }
}

Code notes

For background on combiners, Philippe Adjiman's walkthrough is a good resource.

The AggregationBuffer lets us store intermediate results. By defining our own buffer we can handle data in any format; in the example code the running character total is kept in the AggregationBuffer.

/**
 * Holds the running character total.
 */
static class LetterSumAgg implements AggregationBuffer {
    int sum = 0;

    void add(int num) {
        sum += num;
    }
}
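Hive can hand an existing buffer back to the evaluator through reset() instead of asking for a new one, so reset() has to clear the buffer it receives. The following is a minimal plain-Java sketch of that reuse pattern. SumBuffer and totalsPerGroup are hypothetical stand-ins for illustration and do not use the Hive API.

```java
// Hypothetical plain-Java stand-in for LetterSumAgg, used to show buffer reuse.
class BufferResetSketch {

    static final class SumBuffer {
        int sum = 0;

        void add(int num) {
            sum += num;
        }

        // Clears the state in place, which is what reset() must do.
        void reset() {
            sum = 0;
        }
    }

    // Reuses a single buffer across groups, clearing it before each group.
    static int[] totalsPerGroup(String[][] groups) {
        SumBuffer buffer = new SumBuffer();
        int[] totals = new int[groups.length];
        for (int i = 0; i < groups.length; i++) {
            buffer.reset();
            for (String value : groups[i]) {
                buffer.add(value.length());
            }
            totals[i] = buffer.sum;
        }
        return totals;
    }

    public static void main(String[] args) {
        String[][] groups = { {"John Smith", "Dorothy"}, {"Ted Green"} };
        int[] totals = totalsPerGroup(groups);
        System.out.println(totals[0] + " " + totals[1]); // prints "17 9"
    }
}
```

Without the reset() call the second group would report 26 instead of 9, because the first group's total would leak into it.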

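A UDAF receives different inputs at different MapReduce stages. iterate() reads one row of our table (more precisely, one input record of the table) and produces an aggregation result in some other data format.

Partial aggregation merges these aggregation results into a new aggregation result of the same format. The final reducer then takes these aggregation results and outputs the final result, whose format may differ from that of the data it received.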
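The stage sequence described above can be modelled in plain Java. This is only a sketch: UdafLifecycleSketch, mapSide and reduceSide are hypothetical names, nothing here calls the Hive API, and the split of the sample rows across two mappers is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the UDAF call sequence; every name here is a
// hypothetical stand-in and nothing in this class touches the Hive API.
class UdafLifecycleSketch {

    // Stand-in for the aggregation buffer: one running sum.
    static final class Buffer {
        int sum = 0;
    }

    // Map side: iterate() over each row, then terminatePartial().
    static int mapSide(List<String> rows) {
        Buffer buffer = new Buffer();
        for (String row : rows) {
            if (row != null) {
                buffer.sum += row.length();   // the work iterate() does
            }
        }
        return buffer.sum;                    // the value terminatePartial() returns
    }

    // Reduce side: merge() each partial result, then terminate().
    static int reduceSide(List<Integer> partials) {
        Buffer buffer = new Buffer();
        for (Integer partial : partials) {
            if (partial != null) {
                buffer.sum += partial;        // the work merge() does
            }
        }
        return buffer.sum;                    // the value terminate() returns
    }

    public static void main(String[] args) {
        // Assumption: the four sample rows are split across two mappers.
        int partialA = mapSide(Arrays.asList("John Smith", "John and Ann White"));
        int partialB = mapSide(Arrays.asList("Ted Green", "Dorothy"));
        System.out.println(reduceSide(Arrays.asList(partialA, partialB))); // prints 44
    }
}
```

The map side turns strings into integers, while the reduce side only ever sees integers, which is why the evaluator needs a different ObjectInspector for each kind of input.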

In init() we specify that the input is a string and that the final output is an integer. The partial aggregation result, which is held in the aggregation buffer, is also an integer. Both terminate() and terminatePartial() return an integer.

// In init(), pick the input ObjectInspector according to the mode
if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
    inputOI = (PrimitiveObjectInspector) parameters[0];
} else {
    integerOI = (PrimitiveObjectInspector) parameters[0];
}
// Output data format, the same for every mode
outputOI = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class,
        ObjectInspectorOptions.JAVA);
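To make the mode check in init() easier to follow, here is a small sketch that lists the four evaluator modes with the methods Hive calls in each and the kind of input the letters example receives. The enum below is a hypothetical mirror written without Hive on the classpath; its fields and the inputType helper are illustrative names, not part of Hive.

```java
// Hypothetical mirror of the four evaluator modes, written without Hive on the
// classpath. The enum fields and helper are illustrative, not Hive's API.
class ModeSketch {

    enum Mode {
        PARTIAL1(true, "iterate + terminatePartial"),   // raw rows to partial result
        PARTIAL2(false, "merge + terminatePartial"),    // partial to partial
        FINAL(false, "merge + terminate"),              // partial to final result
        COMPLETE(true, "iterate + terminate");          // raw rows straight to final

        final boolean readsRawRows;   // true when parameters[0] describes the table column
        final String methodsCalled;

        Mode(boolean readsRawRows, String methodsCalled) {
            this.readsRawRows = readsRawRows;
            this.methodsCalled = methodsCalled;
        }
    }

    // For the letters example: raw rows are strings, partial results are ints.
    static String inputType(Mode mode) {
        return mode.readsRawRows ? "string" : "int";
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": input=" + inputType(mode)
                    + ", calls " + mode.methodsCalled);
        }
    }
}
```

This is why init() stores parameters[0] in inputOI for PARTIAL1 and COMPLETE, and in integerOI for the other two modes.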

iterate() reads the column's string value from each row, computes its length, and adds it to the buffer.

public void iterate(AggregationBuffer agg, Object[] parameters)
        throws HiveException {
    ...
        Object p1 = inputOI.getPrimitiveJavaObject(parameters[0]);
        myagg.add(String.valueOf(p1).length());
    }
}

merge() adds a partial aggregation total to the AggregationBuffer.

public void merge(AggregationBuffer agg, Object partial)
        throws HiveException {
    if (partial != null) {
        LetterSumAgg myagg = (LetterSumAgg) agg;
        Integer partialSum = (Integer) integerOI.getPrimitiveJavaObject(partial);
        myagg.add(partialSum);
    }
}

terminate() returns the contents of the AggregationBuffer; this is where the final result is produced.

public Object terminate(AggregationBuffer agg) throws HiveException {
    LetterSumAgg myagg = (LetterSumAgg) agg;
    return myagg.sum;
}

Using the custom function

ADD JAR ./hive-extension-examples-master/target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
CREATE TEMPORARY FUNCTION letters as 'com.matthewrathbone.example.TotalNumOfLettersGenericUDAF';
SELECT letters(name) FROM people;
OK
44
Time taken: 20.688 seconds
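As a cross-check of the result above, the same total can be computed in plain Java without Hive. LetterCountCheck and totalLength are hypothetical names for this sketch, which sums the lengths of the four rows of people.txt with spaces included.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java cross-check of the Hive result; no Hive classes are involved.
class LetterCountCheck {

    // Sums the lengths of all strings, counting spaces, and skips nulls
    // in the same way iterate() skips null column values.
    static int totalLength(List<String> names) {
        int total = 0;
        for (String name : names) {
            if (name != null) {
                total += name.length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The four rows of people.txt.
        List<String> people = Arrays.asList(
                "John Smith", "John and Ann White", "Ted Green", "Dorothy");
        System.out.println(totalLength(people)); // prints 44
    }
}
```

The row lengths are 10, 18, 9 and 7, which add up to the 44 returned by the query.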