Implementing a Hive UDF in Python

Date: 2019-11-16

Workflow

The work has two main parts: a Python script that implements the desired functionality, and an HQL part that calls the Python script to process the data.

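Python part

Calling a Python UDF from HQL is really a redirection process: the specified columns of the table are redirected to the Python script's standard input and processed row by row. The script first splits each row on the specified delimiter, usually '\t', then performs whatever processing is needed, and finally prints the required columns separated by '\t'.

Example: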

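import sys

# Maps each shop id (first column) to the list of its second-column values.
ans = {}

for line in sys.stdin:
    # Hive sends one row per line, with columns separated by tabs.
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 2:
        continue  # skip rows that do not have both columns
    ans.setdefault(fields[0], []).append(fields[1])

for shop in ans:
    # Emit one tab-separated row per shop; the values are joined with commas.
    sys.stdout.write(shop + '\t' + ','.join(ans[shop]) + '\n')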
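
Before running a script like this through Hive, it can help to check the grouping logic locally. The following is a minimal sketch for Python 3, not part of the original post: the function names group_by_shop and format_rows and the sample rows are hypothetical, and joining the values with commas is an assumption about the desired output format. An io.StringIO object stands in for the standard input that Hive would provide.

```python
import io


def group_by_shop(lines):
    """Group the second column's values by the first column (the shop id)."""
    groups = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip rows that do not have both columns
        groups.setdefault(fields[0], []).append(fields[1])
    return groups


def format_rows(groups):
    """Build one tab-separated output row per shop (values comma-joined)."""
    return [shop + "\t" + ",".join(values) for shop, values in groups.items()]


# Hypothetical sample rows standing in for what Hive would write to stdin.
fake_stdin = io.StringIO("s1\tapple\ns2\tpear\ns1\tplum\n")
for row in format_rows(group_by_shop(fake_stdin)):
    print(row)
```

Keeping the logic in functions that accept any iterable of lines means the same code can be fed sys.stdin in the real script and a small in-memory sample during testing.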

HQL part

This part is mainly just the calling step:

-- First, add the Python file
add file pythonfile_location;
-- Then pass the columns that need processing to transform(...)
select transform(specified_columns)
using "python filename" 
as (newname) 
-- newname is the alias for the output column(s)

Note: when transform is used, the select list cannot contain any other columns.
For example:

select a,transform(b,c)
using "python udf.py"
as(d,e)
from table1
where hp_statdate='2016-05-10'

This is wrong: a cannot be selected alongside transform. If a is needed, pass it into transform as well and have the script output it unchanged.
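
The pass-through approach could look roughly like the sketch below. It assumes the query is changed to something like select transform(a,b,c) using "python udf.py" as (a,d,e), so that the script receives three tab-separated columns per row. The function name transform_rows and the two computations for d and e are purely hypothetical placeholders for whatever processing is actually needed. The only point being illustrated is that a is written back out untouched.

```python
def transform_rows(lines):
    """Yield output rows: column a unchanged, followed by derived d and e."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not have exactly a, b, c
        a, b, c = fields
        d = b.strip().lower()  # hypothetical processing of b
        e = str(len(c))        # hypothetical processing of c
        # a is passed through as-is so it is still available in the result.
        yield "\t".join([a, d, e])


# Hypothetical sample input; in the real script, pass sys.stdin instead.
sample = ["x\tAbC \thello\n", "y\tDEF\thi\n"]
for row in transform_rows(sample):
    print(row)
```

In the actual UDF script, the loop at the bottom would iterate over transform_rows(sys.stdin) after an import sys, and each printed row becomes one output row of the query.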
