UDF // user-defined function
// One row in, one row out; similar to format_number(age,'000')
UDTF // user-defined table-generating function
// One row in, multiple rows out; similar to explode(array)
UDAF // user-defined aggregate function
// Multiple rows in, one row out; similar to sum(xxx)
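To make the three shapes concrete, here is a minimal plain-Java sketch. It does not use the Hive API; the class name FunctionShapes and its method names are made up for illustration, and each method only imitates the input/output shape of the built-in function named in its comment.

```java
import java.util.Arrays;
import java.util.List;

public class FunctionShapes {
    // UDF shape: one value in, one value out (compare format_number(age,'000')).
    public static String padToThreeDigits(int age) {
        return String.format("%03d", age);
    }

    // UDTF shape: one value in, several rows out (compare explode(array)).
    public static List<String> explode(String[] array) {
        return Arrays.asList(array);
    }

    // UDAF shape: several rows in, one value out (compare sum(xxx)).
    public static long sum(List<Integer> rows) {
        long total = 0;
        for (int value : rows) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(padToThreeDigits(7));
        System.out.println(explode(new String[] {"a", "b"}));
        System.out.println(sum(Arrays.asList(1, 2, 3)));
    }
}
```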
Parsing temptags in Hive with a UDF
1. Package the custom Hive function and copy the jar to /soft/hive/lib
2. Restart Hive
3. Register the function
-- Permanent function
create function myudf as 'com.share.udf.MyUDF';
-- Temporary function
create temporary function myudf as 'com.share.udf.MyUDF';
Parsing temptags in Hive with a UDF
0. Prepare the data
1. Create the table
create table temptags(id int,json string) row format delimited fields terminated by '\t';
2. Load the data
load data local inpath '/home/centos/files/temptags.txt' into table temptags;
3. Write the code
4. Package it
5. Add fastjson-1.2.47.jar and myhive-1.0-SNAPSHOT.jar to /soft/hive/lib
6. Restart Hive
7. Register a temporary function
create temporary function parsejson as 'com.share.udf.ParseJson';
8. Test
select id, parsejson(json) as tags from temptags;
-- Explode the tags so that each (id, tag) pair gets its own row
select id, tag from temptags lateral view explode(parsejson(json)) xx as tag;
-- Count the occurrences of each tag for each merchant
select id, tag, count(*) as count
from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag;
-- Rank the tags by count within each merchant
select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b;
-- Concatenate each tag with its count and keep the top 10 tags per merchant
select id, concat(tag, '_', count)
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b) c
where rank <= 10;
-- Aggregate and concatenate into one row per merchant
-- concat_ws(',', List<>)  joins the list elements with ','
-- collect_set(name)       collects all values of the column into an array, with deduplication
-- collect_list(name)      collects all values of the column into an array, without deduplication
select id, concat_ws(',', collect_set(concat(tag, '_', count))) as tags
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b) c
where rank <= 10 group by id;
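Example row of the final result:
123456 味道好_10,環境衛生_9
How lateral view explode turns one row holding an array into one row per element:
id tags
1 [味道好,環境衛生]
=>
1 味道好
1 環境衛生
The same pattern on another table: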
select name, workplace from employee lateral view explode(work_place) xx as workplace;
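For readers who want to check the logic of the query chain outside Hive, the same steps (explode, count per merchant and tag, rank by count, keep the top N, join with commas) can be mirrored in plain Java. This is only a sketch: the class TopTags, the helper row, and the sample tags are invented for illustration, and it assumes parsejson returns the list of tags for one row. Unlike collect_set, whose element order Hive does not promise, this version keeps the ranked order.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TopTags {
    // Hypothetical helper: one input row = merchant id plus the tags parsed from its json column.
    public static Map.Entry<Integer, List<String>> row(int id, String... tags) {
        return new AbstractMap.SimpleEntry<>(id, Arrays.asList(tags));
    }

    public static Map<Integer, String> topTags(List<Map.Entry<Integer, List<String>>> rows, int limit) {
        // explode + group by id, tag + count(*)
        Map<Integer, Map<String, Long>> counts = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> r : rows) {
            Map<String, Long> perMerchant =
                counts.computeIfAbsent(r.getKey(), id -> new LinkedHashMap<>());
            for (String tag : r.getValue()) {
                perMerchant.merge(tag, 1L, Long::sum);
            }
        }
        // order by count desc, keep the first `limit` tags, then concat and join with ','
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, Map<String, Long>> merchant : counts.entrySet()) {
            String joined = merchant.getValue().entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(limit)
                .map(e -> e.getKey() + "_" + e.getValue())
                .collect(Collectors.joining(","));
            result.put(merchant.getKey(), joined);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from temptags.txt.
        List<Map.Entry<Integer, List<String>>> rows = Arrays.asList(
            row(1, "tasty", "clean"), row(1, "tasty"), row(2, "clean"));
        topTags(rows, 10).forEach((id, tags) -> System.out.println(id + "\t" + tags));
    }
}
```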
Fixing the ClassNotFoundException caused by a missing jar
Problem
Caused by: java.lang.ClassNotFoundException: com.share.udf.ParseJson
Solution
1. Copy the fastjson and myhive jars into /soft/hadoop/share/hadoop/common/lib
cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/
cp /soft/hive/lib/fastjson-1.2.47.jar /soft/hadoop/share/hadoop/common/lib/
2. Sync them to the other nodes
xsync.sh /soft/hadoop/share/hadoop/common/lib/fastjson-1.2.47.jar
xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar
3. Restart Hadoop and Hive
stop-all.sh
start-all.sh
hive
Word Count in Hive can be implemented in the following two ways
array => explode
string => split => explode
Now implement WordCount directly with a UDTF
string => myudtf
Add myhive-1.0-SNAPSHOT.jar to /soft/hive/lib
create function myudtf as 'com.share.udtf.MyUDTF';
select myudtf(line) from wc2;
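1. In initialize, check the types or the number of the arguments passed to the function
2. Return the schema of the output table (field names plus field types)
3. In process, read the argument values
4. After processing, emit the result rows through forward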
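The four steps above can be sketched in plain Java without the Hive jars. A real UDTF would extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and work with ObjectInspector types; the class below, WordSplitSketch, is a hypothetical stand-in with simplified signatures, and splitting on whitespace is an assumption about what MyUDTF does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordSplitSketch {
    // Rows emitted so far; stands in for the collector that Hive supplies.
    private final List<String> emitted = new ArrayList<>();

    // Steps 1 and 2: validate the arguments, then describe the output columns.
    public List<String> initialize(List<String> argumentTypes) {
        if (argumentTypes.size() != 1 || !"string".equals(argumentTypes.get(0))) {
            throw new IllegalArgumentException("expected exactly one string argument");
        }
        return Arrays.asList("word string"); // field name plus field type
    }

    // Step 3: read the argument value. Step 4: forward one row per word.
    public void process(Object[] args) {
        String line = (String) args[0];
        if (line == null) {
            return;
        }
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                forward(word);
            }
        }
    }

    // Simplified forward: the real method hands the row to Hive instead of a list.
    private void forward(String row) {
        emitted.add(row);
    }

    public List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        WordSplitSketch udtf = new WordSplitSketch();
        System.out.println(udtf.initialize(Arrays.asList("string")));
        udtf.process(new Object[] {"hello world hello"});
        System.out.println(udtf.emitted());
    }
}
```

Counting the emitted words is then left to an ordinary group by on the function's output.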