[Hive_12] Hive User-Defined Functions

Date: 2020-06-12

0. Overview

　　UDF 　　// user-defined function
　　　　　　// one row in, one row out; similar to format_number(age,'000')

　　UDTF 　　// user-defined table-generating function
　　　　　　 // one row in, multiple rows out; similar to explode(array)

　　UDAF 　　// user-defined aggregate function
　　　　　　 // multiple rows in, one row out; similar to sum(xxx)
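　　The three shapes can be illustrated with a plain-Java analogy. This is only a sketch: the class and method names (FunctionShapes, pad3, explode, sum) are made up for illustration and none of them is part of the Hive API.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only; nothing here uses or mirrors the Hive API.
class FunctionShapes {
    // UDF shape: one value in, one value out (zero-pad to three digits).
    static String pad3(int age) {
        return String.format("%03d", age);
    }

    // UDTF shape: one value in, many values out (like explode on an array).
    static List<String> explode(String[] items) {
        return Arrays.asList(items);
    }

    // UDAF shape: many values in, one value out (like sum).
    static long sum(List<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(pad3(7));
        System.out.println(explode(new String[] {"a", "b"}));
        System.out.println(sum(Arrays.asList(1, 2, 3)));
    }
}
```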

　　In this article, Hive uses a UDF to parse the temptags data.

1. UDF

　　1.1 Code example

　　(code listing not included in the source)
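　　As a rough idea of what such a class can look like, here is a hypothetical sketch; it is not the author's MyUDF. With Hive's classic UDF API, the class would extend org.apache.hadoop.hive.ql.exec.UDF and Hive would find the methods named evaluate by reflection. The base class is left out here so the sketch compiles without the Hive jars, and the behavior (trim and upper-case, add two integers) is invented for illustration.

```java
// Hypothetical sketch. In a real build this class would extend
// org.apache.hadoop.hive.ql.exec.UDF (from hive-exec); the base class
// is omitted so the example compiles on its own.
class MyUDFSketch {
    // Hive looks up methods named "evaluate" and picks the overload
    // whose parameter types match the call in the query.
    public String evaluate(String input) {
        if (input == null) {
            return null; // SQL NULL in, SQL NULL out
        }
        return input.trim().toUpperCase();
    }

    // A second overload, so the same function name accepts two integers.
    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }
}
```

　　Returning null for null input keeps the function safe on rows with missing values.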

　　1.2 Using a user-defined function

　　1. Package the Hive UDF and copy the jar to /soft/hive/lib
　　2. Restart Hive
　　3. Register the function

-- Permanent function
create function myudf as 'com.share.udf.MyUDF';
-- Temporary function
create temporary function myudf as 'com.share.udf.MyUDF';

　　1.3 Demo

　　Hive uses a UDF to parse the temptags data.

　　0. Prepare the data

　　1. Create the table

create table temptags(id int,json string) row format delimited fields terminated by '\t';

　　2. Load the data

load data local inpath '/home/centos/files/temptags.txt' into table temptags;

　　3. Write the code

　　(code listing not included in the source)

　　4. Package the jar

　　5. Add fastjson-1.2.47.jar and myhive-1.0-SNAPSHOT.jar to /soft/hive/lib

　　6. Restart Hive

　　7. Register a temporary function

create temporary function parsejson as 'com.share.udf.ParseJson';

　　8. Test

select id ,parsejson(json) as tags from temptags;

-- Explode the tags so that each row holds one id and one tag
select id, tag from temptags lateral view explode(parsejson(json)) xx as tag;

-- Count how often each tag occurs for each merchant
select id, tag, count(*) as count
from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag;

-- Rank the tags within each merchant by count
select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count
      from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
      group by id, tag) b;

-- Concatenate each tag with its count and keep the top 10 tags per merchant
select id, concat(tag, '_', count)
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
      from (select id, tag, count(*) as count
            from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
            group by id, tag) b) c
where rank <= 10;

-- Aggregate the strings into one row per merchant
-- concat_ws(',', List<>) joins the elements of a list with ','
-- collect_set(name) gathers all values of a column into an array, with deduplication
-- collect_list(name) gathers all values of a column into an array, without deduplication
select id, concat_ws(',', collect_set(concat(tag, '_', count))) as tags
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
      from (select id, tag, count(*) as count
            from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
            group by id, tag) b) c
where rank <= 10
group by id;
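　　The nested queries above chain four steps: count per (id, tag), rank within each id, keep the top N, and join the results into one string. The same logic can be sketched in plain Java, assuming the rows have already been exploded into {id, tag} pairs. The class name TopTagsSketch and its method are hypothetical. Two differences from the SQL are assumptions of this sketch: ties are broken by tag name, and the joined string keeps the ranking order, which collect_set does not guarantee.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the top-N tag query.
class TopTagsSketch {
    // rows: each element is {id, tag}, i.e. the output of the lateral view step.
    static Map<String, String> topTags(List<String[]> rows, int limit) {
        // Equivalent of: select id, tag, count(*) ... group by id, tag
        Map<String, Map<String, Long>> counts = new TreeMap<>();
        for (String[] row : rows) {
            counts.computeIfAbsent(row[0], k -> new TreeMap<>())
                  .merge(row[1], 1L, Long::sum);
        }

        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> entry : counts.entrySet()) {
            String joined = entry.getValue().entrySet().stream()
                // order by count desc; the stable sort keeps tag order on ties
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                // where rank <= limit
                .limit(limit)
                // concat(tag, '_', count)
                .map(t -> t.getKey() + "_" + t.getValue())
                // concat_ws(',', ...)
                .collect(Collectors.joining(","));
            result.put(entry.getKey(), joined);
        }
        return result;
    }
}
```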

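　　1.4 Virtual columns: lateral view

　　123456 tasty_10,clean_9

　　explode turns each array element into its own row, and lateral view joins those rows back to the source row:

　　id　　 tags
　　1 　　[tasty, clean]　　 =>　　 1 tasty
　　　　　　　　　　　　　　　　　　1 clean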

select name, workplace from employee lateral view explode(work_place) xx as workplace;
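　　What lateral view explode does to a single row can be sketched in plain Java: every array element is paired with the columns of the row it came from. The class and method names below are hypothetical. As in Hive without the OUTER keyword, a row whose array is empty produces no output rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "lateral view explode" on one input row.
class LateralViewSketch {
    // Produces one tab-separated output row per (id, tag) pair.
    static List<String> explodeRows(int id, List<String> tags) {
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            out.add(id + "\t" + tag);
        }
        return out; // empty when tags is empty: the source row disappears
    }
}
```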

　　1.5 Class-not-found exception

　　How to resolve a ClassNotFoundException caused by a missing jar

　　Problem

　　Caused by: java.lang.ClassNotFoundException: com.share.udf.ParseJson

　　Solution

　　1. Copy the fastjson and myhive jars to /soft/hadoop/share/hadoop/common/lib

　　cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/
　　cp /soft/hive/lib/fastjson-1.2.47.jar /soft/hadoop/share/hadoop/common/lib/

　　2. Sync the jars to the other nodes

　　xsync.sh /soft/hadoop/share/hadoop/common/lib/fastjson-1.2.47.jar
　　xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

　　3. Restart Hadoop and Hive

　　stop-all.sh
　　start-all.sh
　　hive

2. UDTF

　　2.0 Overview

　　Hive can implement Word Count in either of the following two ways:

　　array => explode

　　string => split => explode

　　Here WordCount is implemented directly with a UDTF instead:

　　string => myudtf

　　2.1 Write the code

　　(code listing not included in the source)

　　2.2 Package the jar

　　Add myhive-1.0-SNAPSHOT.jar to /soft/hive/lib

　　2.3 Restart Hive

　　2.4 Register a temporary function

　　create temporary function myudtf as 'com.share.udtf.MyUDTF';

　　2.5 Test

select myudtf(line) from wc2;

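　　2.6 Flow analysis

　　1. initialize receives the argument descriptions, so it can check the argument types or the number of arguments

　　2. initialize returns the structure of the output table (field names + field types)

　　3. process is called for each input row and extracts the argument values

　　4. After processing, process emits each output row through forward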
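　　The four steps can be sketched without the Hive jars as follows. In Hive the class would extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF, and initialize would work with ObjectInspector types; here those are replaced by simplified stand-ins. The class name, the Consumer used as a collector, the output column name "word", and splitting on whitespace are all assumptions of this sketch, not the author's MyUDTF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified stand-in for a word-splitting UDTF; not the real Hive API.
class WordSplitUdtfSketch {
    // Stand-in for the collector that Hive hands to a UDTF.
    private final Consumer<Object[]> collector;

    WordSplitUdtfSketch(Consumer<Object[]> collector) {
        this.collector = collector;
    }

    // Steps 1 and 2: validate the arguments, then describe the output columns.
    List<String> initialize(int argCount) {
        if (argCount != 1) {
            throw new IllegalArgumentException("myudtf takes exactly one argument");
        }
        List<String> columns = new ArrayList<>();
        columns.add("word"); // assumed output column name
        return columns;
    }

    // Step 3: read the argument value of one input row.
    void process(Object[] args) {
        if (args[0] == null) {
            return; // NULL line: emit nothing
        }
        for (String word : args[0].toString().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                forward(new Object[] {word}); // Step 4: one output row per word
            }
        }
    }

    void forward(Object[] row) {
        collector.accept(row);
    }
}
```

　　Because process may call forward any number of times per input row, one line of text can become many rows, which is what lets the query in 2.5 feed a group-by word count.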
