15. Hive Functions in Detail with Hands-On Examples

Date: 2021-01-30

1. Hive Built-in Functions

1.1 Numeric Functions

1. Rounding function: round

Syntax: round(double a)
Returns: BIGINT
Description: Returns the whole-number value of the double a, rounded half up.

hive> select round(3.1415926) from tableName;
3
hive> select round(3.5) from tableName;
4
hive> create table round_result as select round(9542.158) as rounded_value from tableName;

2. Rounding to a given precision: round

Syntax: round(double a, int d)
Returns: DOUBLE
Description: Returns a rounded to d decimal places, as a double.

hive> select round(3.1415926,4) from tableName;
3.1416

3. Round-down function: floor

Syntax: floor(double a)
Returns: BIGINT
Description: Returns the largest integer that is less than or equal to the double a.

hive> select floor(3.1415926) from tableName;
3
hive> select floor(25) from tableName;
25

4. Round-up function: ceil

Syntax: ceil(double a)
Returns: BIGINT
Description: Returns the smallest integer that is greater than or equal to the double a.

hive> select ceil(3.1415926) from tableName;
4
hive> select ceil(46) from tableName;
46

5. Round-up function: ceiling

Syntax: ceiling(double a)
Returns: BIGINT
Description: Same behavior as ceil.

hive> select ceiling(3.1415926) from tableName;
4
hive> select ceiling(46) from tableName;
46
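
The rounding rules above can be mirrored in plain Java. This is only a sketch for intuition, not Hive code: the class and method names are hypothetical, and it assumes that round uses half-up rounding as the descriptions state.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Plain-Java sketch of the rounding rules described above (hypothetical names, not Hive code). */
final class HiveRoundingSketch {

    /** Rounds half up to d decimal places, mirroring round(a, d). */
    static double round(double a, int d) {
        return BigDecimal.valueOf(a).setScale(d, RoundingMode.HALF_UP).doubleValue();
    }

    /** Largest whole number less than or equal to a, mirroring floor(a). */
    static long floor(double a) {
        return (long) Math.floor(a);
    }

    /** Smallest whole number greater than or equal to a, mirroring ceil(a) and ceiling(a). */
    static long ceil(double a) {
        return (long) Math.ceil(a);
    }

    public static void main(String[] args) {
        System.out.println(round(3.5, 0));        // rounds up to the next whole number
        System.out.println(round(3.1415926, 4));  // keeps four decimal places
        System.out.println(floor(3.1415926));
        System.out.println(ceil(3.1415926));
    }
}
```

Going through BigDecimal keeps the half-up rule explicit. Java's own Math.round breaks ties toward positive infinity, so it would treat negative ties differently.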

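6. Random number function: rand

Syntax: rand(), rand(int seed)
Returns: double
Description: Returns a random number between 0 and 1. If a seed is given, the function produces a stable, repeatable sequence of random numbers.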

hive> select rand() from tableName;
0.5577432776034763
hive> select rand() from tableName;
0.6638336467363424
hive> select rand(100) from tableName;
0.7220096548596434
hive> select rand(100) from tableName;
0.7220096548596434

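1.2 Date Functions

1. UNIX timestamp to date: from_unixtime

Syntax: from_unixtime(bigint unixtime[, string format])
Returns: string
Description: Converts a UNIX timestamp (the number of seconds since 1970-01-01 00:00:00 UTC) to a time string in the current time zone, using the optional format.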

hive> select from_unixtime(1323308943,'yyyyMMdd') from tableName;
20111208

2. Current UNIX timestamp: unix_timestamp

Syntax: unix_timestamp()
Returns: bigint
Description: Returns the current UNIX timestamp in seconds.

hive> select unix_timestamp() from tableName;
1323309615

3. Date to UNIX timestamp: unix_timestamp

Syntax: unix_timestamp(string date)
Returns: bigint
Description: Converts a date in "yyyy-MM-dd HH:mm:ss" format to a UNIX timestamp. Returns 0 if the conversion fails.

hive> select unix_timestamp('2011-12-07 13:01:03') from tableName;
1323234063

4. Date in a given format to UNIX timestamp: unix_timestamp

Syntax: unix_timestamp(string date, string pattern)
Returns: bigint
Description: Converts a date in the given pattern to a UNIX timestamp. Returns 0 if the conversion fails.

hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;
1323234063
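
The two conversions can be mirrored with java.time. This is a sketch, not how Hive implements them. The class and method names are hypothetical, and the time zone Asia/Shanghai is an assumption: it is the zone in which the sample outputs above are consistent.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Plain-Java sketch of from_unixtime / unix_timestamp (hypothetical names, not Hive code). */
final class UnixTimeSketch {

    // Assumption: the session time zone is Asia/Shanghai (UTC+8).
    static final ZoneId ZONE = ZoneId.of("Asia/Shanghai");

    /** Formats epoch seconds in the assumed zone, mirroring from_unixtime(unixtime, format). */
    static String fromUnixtime(long epochSeconds, String pattern) {
        return DateTimeFormatter.ofPattern(pattern)
                .format(Instant.ofEpochSecond(epochSeconds).atZone(ZONE));
    }

    /** Parses a local date-time in the assumed zone, mirroring unix_timestamp(date, pattern). */
    static long unixTimestamp(String text, String pattern) {
        return LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern))
                .atZone(ZONE)
                .toEpochSecond();
    }

    public static void main(String[] args) {
        System.out.println(fromUnixtime(1323308943L, "yyyyMMdd"));
        System.out.println(unixTimestamp("2011-12-07 13:01:03", "yyyy-MM-dd HH:mm:ss"));
        System.out.println(unixTimestamp("20111207 13:01:03", "yyyyMMdd HH:mm:ss"));
    }
}
```

Unlike the Hive functions described above, this sketch throws an exception on unparseable input rather than returning 0.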

5. Date-time to date: to_date

Syntax: to_date(string timestamp)
Returns: string
Description: Returns the date part of a date-time value.

hive> select to_date('2011-12-08 10:03:01') from tableName;
2011-12-08

6. Year of a date: year

Syntax: year(string date)
Returns: int
Description: Returns the year of the date.

hive> select year('2011-12-08 10:03:01') from tableName;
2011
hive> select year('2012-12-08') from tableName;
2012

7. Month of a date: month

Syntax: month(string date)
Returns: int
Description: Returns the month of the date.

hive> select month('2011-12-08 10:03:01') from tableName;
12
hive> select month('2011-08-08') from tableName;
8

8. Day of a date: day

Syntax: day(string date)
Returns: int
Description: Returns the day of the month of the date.

hive> select day('2011-12-08 10:03:01') from tableName;
8
hive> select day('2011-12-24') from tableName;
24

9. Hour of a date: hour

Syntax: hour(string date)
Returns: int
Description: Returns the hour of the date.

hive> select hour('2011-12-08 10:03:01') from tableName;
10

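10. Minute of a date: minute

Syntax: minute(string date)
Returns: int
Description: Returns the minute of the date.

hive> select minute('2011-12-08 10:03:01') from tableName;
3

11. Second of a date: second

Syntax: second(string date)
Returns: int
Description: Returns the second of the date.

hive> select second('2011-12-08 10:03:01') from tableName;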
1

12. Week of a date: weekofyear

Syntax: weekofyear(string date)
Returns: int
Description: Returns the week number of the date within its year.

hive> select weekofyear('2011-12-08 10:03:01') from tableName;
49

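13. Date difference: datediff

Syntax: datediff(string enddate, string startdate)
Returns: int
Description: Returns the number of days from the start date to the end date (end date minus start date).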

hive> select datediff('2012-12-08','2012-05-09') from tableName;
213

14. Date addition: date_add

Syntax: date_add(string startdate, int days)
Returns: string
Description: Returns the date that is days days after startdate.

hive> select date_add('2012-12-08',10) from tableName;
2012-12-18

15. Date subtraction: date_sub

Syntax: date_sub(string startdate, int days)
Returns: string
Description: Returns the date that is days days before startdate.

hive> select date_sub('2012-12-08',10) from tableName;
2012-11-28
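
The date arithmetic in datediff, date_add and date_sub corresponds to simple calendar-day operations. Here is a sketch in plain Java with hypothetical class and method names. It assumes inputs in yyyy-MM-dd form and is not Hive code.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Plain-Java sketch of datediff / date_add / date_sub (hypothetical names, not Hive code). */
final class DateArithmeticSketch {

    /** Whole days from start to end (end minus start), mirroring datediff(enddate, startdate). */
    static long datediff(String end, String start) {
        return ChronoUnit.DAYS.between(LocalDate.parse(start), LocalDate.parse(end));
    }

    /** The date that is the given number of days after start, mirroring date_add. */
    static String dateAdd(String start, int days) {
        return LocalDate.parse(start).plusDays(days).toString();
    }

    /** The date that is the given number of days before start, mirroring date_sub. */
    static String dateSub(String start, int days) {
        return LocalDate.parse(start).minusDays(days).toString();
    }

    public static void main(String[] args) {
        System.out.println(datediff("2012-12-08", "2012-05-09"));
        System.out.println(dateAdd("2012-12-08", 10));
        System.out.println(dateSub("2012-12-08", 10));
    }
}
```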

1.3 Conditional Functions

1. If function: if

Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Returns: T
Description: Returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull.

hive> select if(1=2,100,200) from tableName;
200
hive> select if(1=1,100,200) from tableName;
100

2. First non-NULL value: COALESCE

Syntax: COALESCE(T v1, T v2, …)
Returns: T
Description: Returns the first non-NULL argument. If every argument is NULL, returns NULL.

hive> select COALESCE(null,'100','50') from tableName;
100

3. Conditional expression: CASE

Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Returns: T
Description: If a equals b, returns c; if a equals d, returns e; otherwise returns f.

hive> Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
hive> Select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim

4. Conditional expression: CASE

Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Returns: T
Description: If a is TRUE, returns b; if c is TRUE, returns d; otherwise returns e.

hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom

1.4 String Functions

1. String length: length

Syntax: length(string A)
Returns: int
Description: Returns the length of string A.

hive> select length('abcedfg') from tableName;
7

2. String reversal: reverse

Syntax: reverse(string A)
Returns: string
Description: Returns string A reversed.

hive> select reverse('abcedfg') from tableName;
gfdecba

3. String concatenation: concat

Syntax: concat(string A, string B…)
Returns: string
Description: Returns the concatenation of the input strings. Any number of input strings is supported.

hive> select concat('abc','def','gh') from tableName;
abcdefgh

4. String concatenation with a separator: concat_ws

Syntax: concat_ws(string SEP, string A, string B…)
Returns: string
Description: Returns the concatenation of the input strings, with SEP inserted between them.

hive> select concat_ws(',','abc','def','gh') from tableName;
abc,def,gh

5. Substring: substr, substring

Syntax: substr(string A, int start), substring(string A, int start)
Returns: string
Description: Returns the part of string A from position start to the end.

hive> select substr('abcde',3) from tableName;
cde
hive> select substring('abcde',3) from tableName;
cde
hive> select substr('abcde',-1) from tableName;  -- a negative start counts from the end, as in Oracle
e

6. Substring with length: substr, substring

Syntax: substr(string A, int start, int len), substring(string A, int start, int len)
Returns: string
Description: Returns the part of string A that starts at position start and is len characters long.

hive> select substr('abcde',3,2) from tableName;
cd
hive> select substring('abcde',3,2) from tableName;
cd
hive> select substring('abcde',-2,2) from tableName;
de

7. Upper case: upper, ucase

Syntax: upper(string A), ucase(string A)
Returns: string
Description: Returns string A in upper case.

hive> select upper('abSEd') from tableName;
ABSED
hive> select ucase('abSEd') from tableName;
ABSED

8. Lower case: lower, lcase

Syntax: lower(string A), lcase(string A)
Returns: string
Description: Returns string A in lower case.

hive> select lower('abSEd') from tableName;
absed
hive> select lcase('abSEd') from tableName;
absed

9. Trim spaces: trim

Syntax: trim(string A)
Returns: string
Description: Removes the spaces from both ends of the string.

hive> select trim(' abc ') from tableName;
abc

10. URL parsing: parse_url

Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
Returns: string
Description: Returns the specified part of the URL. Valid values for partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from tableName;
www.tableName.com
hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') from tableName;
v1

11. JSON parsing: get_json_object

Syntax: get_json_object(string json_string, string path)
Returns: string
Description: Parses the JSON string json_string and returns the content selected by path. Returns NULL if the input JSON string is invalid.

hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
amy

12. String repetition: repeat

Syntax: repeat(string str, int n)
Returns: string
Description: Returns str repeated n times.

hive> select repeat('abc',5) from tableName;
abcabcabcabcabc

13. String splitting: split

Syntax: split(string str, string pat)
Returns: array
Description: Splits str around matches of pat and returns the resulting array of strings.

hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]

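1.5 Aggregate Functions

1. Row count: count

Syntax: count(*), count(expr), count(DISTINCT expr[, expr...])
Returns: bigint

Description: count(*) returns the number of rows retrieved, including rows that contain NULL values; count(expr) returns the number of rows where expr is not NULL; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values.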

hive> select count(*) from tableName;
20
hive> select count(distinct t) from tableName;
10

2. Sum: sum

Syntax: sum(col), sum(DISTINCT col)
Returns: double
Description: sum(col) returns the sum of col over the result set; sum(DISTINCT col) returns the sum of the distinct values of col.

hive> select sum(t) from tableName;
100
hive> select sum(distinct t) from tableName;
70

3. Average: avg

Syntax: avg(col), avg(DISTINCT col)
Returns: double
Description: avg(col) returns the average of col over the result set; avg(DISTINCT col) returns the average of the distinct values of col.

hive> select avg(t) from tableName;
50
hive> select avg (distinct t) from tableName;
30

4. Minimum: min

Syntax: min(col)
Returns: double
Description: Returns the minimum value of col in the result set.

hive> select min(t) from tableName;
20

5. Maximum: max

Syntax: max(col)
Returns: double
Description: Returns the maximum value of col in the result set.

hive> select max(t) from tableName;
120

1.6 Complex Type Constructors

1. Building a map: map

Syntax: map(key1, value1, key2, value2, …)
Description: Builds a map from the given key/value pairs.

create table score_map(name string, score map<string,int>)
row format delimited fields terminated by '\t' 
collection items terminated by ',' map keys terminated by ':';

Create a data file with the content below, then load it:
cd /kkb/install/hivedatas/
vim score_map.txt

zhangsan    數學:80,語文:89,英語:95
lisi    語文:60,數學:80,英語:99

Load the data into the Hive table:
load data local inpath '/kkb/install/hivedatas/score_map.txt' overwrite into table score_map;

Accessing map data:
Get all the values:
select name,map_values(score) from score_map;

Get all the keys:
select name,map_keys(score) from score_map;

Get the value for a given key:
select name,score["數學"] from score_map;

Get the number of entries in the map:
select name,size(score) from score_map;

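2. Building a struct: struct

Syntax: struct(val1, val2, val3, …)
Description: Builds a struct from the given arguments. It is similar to a struct in C, and its fields are accessed with X.X notation. Suppose our data looks like this: the movie ABC has been rated by 1254 people, with a score of 7.4.

Create the struct table:
hive> create table movie_score( name string,  info struct<number:int,score:float> ) row format delimited fields terminated by "\t"  collection items terminated by ":";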

Prepare the data file:
cd /kkb/install/hivedatas/
vim struct.txt

ABC 1254:7.4  
DEF 256:4.9  
XYZ 456:5.4

Load the data:
load data local inpath '/kkb/install/hivedatas/struct.txt' overwrite into table movie_score;

Query the data in Hive:
hive> select * from movie_score;
hive> select info.number,info.score from movie_score;
OK  
1254    7.4  
256     4.9  
456     5.4

3. Building an array: array

Syntax: array(val1, val2, …)
Description: Builds an array from the given arguments.

hive> create table person(name string,work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Prepare the data file for the person table:
cd /kkb/install/hivedatas/
vim person.txt

The file content is as follows:
biansutao   beijing,shanghai,tianjin,hangzhou
linan   changchu,chengdu,wuhan

Load the data:
hive> load data local inpath '/kkb/install/hivedatas/person.txt' overwrite into table person;

Query all the data:
hive> select * from person;

Query by array index:
hive> select work_locations[0] from person;

Query the whole collection:
hive> select work_locations from person;

Query the number of elements:
hive> select size(work_locations) from person;

1.7 Size Functions for Complex Types

1. Map size: size(Map<K,V>)

Syntax: size(Map<K,V>)
Returns: int
Description: Returns the number of entries in the map.

hive> select size(t) from map_table2;
2

2. Array size: size(Array<T>)

Syntax: size(Array<T>)
Returns: int
Description: Returns the number of elements in the array.

hive> select size(t) from arr_table2;
4

3. Type conversion function

Type conversion: cast
Syntax: cast(expr as <type>)
Returns: <type>
Description: Returns expr converted to the given data type.

hive> select cast('1' as bigint) from tableName;
1

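1.8 The explode Function

1. Using explode to split Map and Array columns of a Hive table

LATERAL VIEW is used together with UDTFs such as explode (often combined with split). It turns one row into multiple rows, and the split rows can then be aggregated. LATERAL VIEW first calls the UDTF for each row of the source table; the UDTF turns that row into one or more rows; LATERAL VIEW then combines the results into a virtual table that supports a table alias.
explode itself can also be used directly to split a complex array or map column into multiple rows.

Requirement: we have data in the following format:
zhangsan    child1,child2,child3,child4 k1:v1,k2:v2
lisi    child5,child6,child7,child8 k3:v3,k4:v4

Fields are separated by \t. We want to split out every child into a single column: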
+----------+--+
| mychild  |
+----------+--+
| child1   |
| child2   |
| child3   |
| child4   |
| child5   |
| child6   |
| child7   |
| child8   |
+----------+--+

We also want to split the map keys and values, producing the following result:

+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+

Step 1: Create the Hive database

hive (default)> create database hive_explode;
hive (default)> use hive_explode;

Step 2: Create the Hive table, then use explode to split the map and array

create  table hive_explode.t3(name string,
children array<string>,
address Map<string,string>)
row format delimited fields terminated by '\t'  
collection items terminated by ','
map keys terminated by ':' 
stored as textFile;

Step 3: Load the data

On node03, run the following commands to create the data file:

cd  /kkb/install/hivedatas/

vim maparray
The file content is as follows:

zhangsan    child1,child2,child3,child4 k1:v1,k2:v2
lisi    child5,child6,child7,child8 k3:v3,k4:v4

Load the data into the Hive table:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/maparray' into table hive_explode.t3;

Step 4: Use explode to split the data in Hive

Split the data in the array:

hive (hive_explode)> SELECT explode(children) AS myChild FROM hive_explode.t3;

Split the data in the map:

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM hive_explode.t3;

2. Using explode to split a JSON string

Requirement: we have data in the following format:

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Fields are separated by |.

We want to extract every monthSales value into a single column, one value per row (row-to-column conversion):

4900
2090
6987

Step 1: Create the Hive table

hive (hive_explode)> 
create table hive_explode.explode_lateral_view (
area string, 
goods_id string,
sale_info string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS textfile;

Step 2: Prepare and load the data

Prepare the data as follows:

cd /kkb/install/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load the data into the Hive table:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/explode_json' overwrite into table hive_explode.explode_lateral_view;

Step 3: Use explode to split the array-like goods_id field

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from hive_explode.explode_lateral_view;

Step 4: Use explode to split the map-like area field

hive (hive_explode)> select explode(split(area,',')) as area from hive_explode.explode_lateral_view;

Step 5: Split the JSON field

hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as  sale_info from hive_explode.explode_lateral_view;

Next, we would like to use get_json_object to extract the values whose key is monthSales:

hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as  sale_info from hive_explode.explode_lateral_view;
This fails with: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
A UDTF such as explode cannot be nested inside another function.
Likewise, if you try to select a second column alongside it, for example: select explode(split(area,',')) as area,good_id from explode_lateral_view;
the query fails with: FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'
A SELECT that calls a UDTF directly supports only that single expression. This is where LATERAL VIEW comes in.

3. Using explode with LATERAL VIEW

Query multiple columns with LATERAL VIEW:

hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;

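Here LATERAL VIEW explode(split(goods_id,','))goods acts as a virtual table that is joined to the source table explode_lateral_view like a Cartesian product: each source row is paired with every row generated from it.

LATERAL VIEW can also be used more than once: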

hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view  LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2;

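This is likewise the Cartesian product of the three tables.

Finally, the statement below turns this single row of JSON data into a fully two-dimensional table: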

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,  get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view   LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_1;

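Summary:

LATERAL VIEW usually appears together with a UDTF. It works around the restriction that a UDTF cannot be selected alongside other columns.
Multiple LATERAL VIEW clauses produce a result similar to a Cartesian product.
The OUTER keyword makes a UDTF that produces no rows emit NULL instead, so the source row is not lost.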
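
As a rough analogy in plain Java, LATERAL VIEW explode behaves like a nested loop that pairs each source row with every element generated from it. The class and method names below are hypothetical, and the sketch covers a single source row only. It is an illustration of the idea, not Hive's execution model.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java analogy for LATERAL VIEW explode(split(...)) on one source row (hypothetical names). */
final class LateralViewSketch {

    /** One output row per goods id, each still carrying the source row's saleInfo column. */
    static List<String[]> explodeGoods(String goodsId, String saleInfo) {
        List<String[]> rows = new ArrayList<>();
        for (String id : goodsId.split(",")) {
            rows.add(new String[] {id, saleInfo});
        }
        return rows;
    }

    /** Two lateral views: every goods id is paired with every area entry. */
    static List<String[]> explodeTwice(String goodsId, String area) {
        List<String[]> rows = new ArrayList<>();
        for (String id : goodsId.split(",")) {
            for (String a : area.split(",")) {
                rows.add(new String[] {id, a});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> rows = explodeTwice("1,2,3,4,5,6,7,8,9", "a:shandong,b:beijing,c:hebei");
        System.out.println(rows.size()); // 9 goods ids x 3 areas
        for (String[] row : rows) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```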

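1.9 Column/Row Conversion Functions

1.9.1 Column to Row

1. Related functions

CONCAT(string A/col, string B/col…): Returns the concatenation of the input strings. Any number of input strings is supported.

CONCAT_WS(separator, str1, str2,...): A special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be a string, just like the remaining arguments. If the separator is NULL, the result is NULL. The function skips NULL values that follow the separator argument. The separator is inserted between the strings being joined.

COLLECT_SET(col): Accepts only primitive data types. It removes duplicates from the values of a column and aggregates them into an array-typed field.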

2. Data preparation

Table 6-6 Sample data

name	constellation	blood_type
孫悟空	白羊座	A
老王	射手座	A
宋宋	白羊座	B
豬八戒	白羊座	A
冰冰	射手座	A

3. Requirement

Group together the people who share the same constellation and blood type. The expected result is:

射手座,A            老王|冰冰
白羊座,A            孫悟空|豬八戒
白羊座,B            宋宋

4. Create a local constellation.txt file

On the node03 server, run the following commands to create the file. Note that fields are separated by \t.

cd /kkb/install/hivedatas
vim constellation.txt

孫悟空	白羊座	A
老王	射手座	A
宋宋	白羊座	B
豬八戒	白羊座	A
冰冰	射手座	A

5. Create the Hive table and load the data

Create the Hive table:

hive (hive_explode)> create table person_info(  name string,  constellation string,  blood_type string)  row format delimited fields terminated by "\t";

Load the data:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/constellation.txt' into table person_info;

6. Query the data as required

hive (hive_explode)> select t1.base, concat_ws('|', collect_set(t1.name)) name from    (select name, concat(constellation, "," , blood_type) base from person_info) t1 group by  t1.base;
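
The query builds a grouping key with concat, gathers the distinct names per key with collect_set, and joins them with concat_ws. The same three steps can be sketched in plain Java. The class and method names are hypothetical. The sketch keeps insertion order for readability, whereas the order of elements returned by collect_set should not be relied on.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch of concat + collect_set + concat_ws (hypothetical names, not Hive code). */
final class CollectSetSketch {

    /** Each row is {name, constellation, bloodType}; returns "constellation,bloodType" -> "name|name". */
    static Map<String, String> groupNames(String[][] rows) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            String base = row[1] + "," + row[2];                                    // concat(constellation, ",", blood_type)
            groups.computeIfAbsent(base, k -> new LinkedHashSet<>()).add(row[0]);  // collect_set(name)
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : groups.entrySet()) {
            result.put(entry.getKey(), String.join("|", entry.getValue()));        // concat_ws('|', ...)
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"孫悟空", "白羊座", "A"},
            {"老王", "射手座", "A"},
            {"宋宋", "白羊座", "B"},
            {"豬八戒", "白羊座", "A"},
            {"冰冰", "射手座", "A"},
        };
        groupNames(rows).forEach((base, names) -> System.out.println(base + "\t" + names));
    }
}
```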

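1.9.2 Row to Column

1. Function description

EXPLODE(col): Splits a complex array or map column of a Hive table into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: Used together with UDTFs such as explode (often combined with split). It splits one column value into multiple rows, and the split rows can then be aggregated.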

2. Data preparation

The data is as follows. Fields are separated by \t.

cd /kkb/install/hivedatas

vim movie.txt
《疑犯追蹤》  懸疑,動做,科幻,劇情
《Lie to me》 懸疑,警匪,動做,心理,劇情
《戰狼2》   戰爭,動做,災難

3. Requirement

Expand the array of categories for each movie. The expected result is:

《疑犯追蹤》  懸疑
《疑犯追蹤》  動做
《疑犯追蹤》  科幻
《疑犯追蹤》  劇情
《Lie to me》 懸疑
《Lie to me》 警匪
《Lie to me》 動做
《Lie to me》 心理
《Lie to me》 劇情
《戰狼2》   戰爭
《戰狼2》   動做
《戰狼2》   災難

4. Create the Hive table and load the data

Create the Hive table:

hive (hive_explode)> create table movie_info(
movie string, 
category array<string>
) 
row format delimited fields terminated by "\t" 
collection items terminated by ",";

Load the data:

load data local inpath "/kkb/install/hivedatas/movie.txt" into table movie_info;

5. Query the data as required

hive (hive_explode)>  
select movie, category_name 
from 
movie_info lateral view explode(category) table_tmp as category_name;

1.10 The reflect Function

The reflect function lets you call built-in Java methods from SQL.

Using max from java.lang.Math to find the larger of two columns

Create the Hive table:

hive (hive_explode)>  
create table test_udf(col1 int,col2 int)
row format delimited fields terminated by ',';

Prepare the data:

cd /kkb/install/hivedatas

vim test_udf

1,2
4,3
6,4
7,5
5,6

Load the data:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/test_udf' overwrite into table test_udf;

Use max from java.lang.Math to find the larger value of the two columns:

hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;

Calling a different Java method for each record

Create the Hive table:

hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

Prepare the data:

cd /kkb/install/hivedatas

vim test_udf2

java.lang.Math,min,1,2
java.lang.Math,max,2,3

Load the data:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/test_udf2' overwrite into table test_udf2;

Run the query:

hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;

Checking whether a value is a number

This uses a function from Apache Commons. The Commons jars are already on Hadoop's classpath, so they can be used directly.

Usage:

hive (hive_explode)> select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");

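1.11 Analytic Functions

1. What analytic functions are for

More complex calculations often call for analytic functions. They are mainly used to find the top N rows per group, to compute percentages, or to slice data into buckets.

2. Commonly used analytic functions

1. ROW_NUMBER():

Numbers the rows within a partition in order, starting from 1. For example, ordering by pv descending gives each day's pv rank within the partition. ROW_NUMBER() has many uses, such as fetching the first-ranked row in each partition or the first refer in a session.

2. RANK():

Returns the rank of a row within its partition. Ties share a rank and leave gaps in the ranks that follow.

3. DENSE_RANK():

Returns the rank of a row within its partition. Ties share a rank and leave no gaps in the ranks that follow.

4. CUME_DIST:

The number of rows with a value less than or equal to the current value, divided by the total number of rows in the partition. For example, the share of all employees whose salary is less than or equal to the current salary.

5. PERCENT_RANK:

(RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)

6. NTILE(n):

Splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the first buckets. NTILE does not support ROWS BETWEEN, for example NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW).

3. Requirement

The data below has three fields: cookieid, createtime and pv. Find the three records with the highest pv for each cookie. This is a top-N-per-group problem: take the top three values in each group.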

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

Step 1: Create the table

Create the table in Hive:

CREATE EXTERNAL TABLE cookie_pv (
cookieid string,
createtime string, 
pv INT
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' ;

Step 2: Prepare and load the data

On node03, run the following commands to create the data file, then load it into the Hive table:

cd /kkb/install/hivedatas
vim cookiepv.txt

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

Load the data into the Hive table:

load  data  local inpath '/kkb/install/hivedatas/cookiepv.txt'  overwrite into table  cookie_pv;

Step 3: Use analytic functions to find the top three pv records for each cookie

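-- A window-function alias cannot be used in the WHERE clause of the same SELECT,
-- so the ranks are computed in a subquery and filtered in the outer query.
SELECT cookieid, createtime, pv, rn1, rn2, rn3
FROM (
SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM cookie_pv 
) t
WHERE rn1 <= 3;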
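
To see how the three ranking functions differ on ties, here is a small plain-Java sketch applied to the cookie1 pv values sorted in descending order. The class and method names are hypothetical, and the sketch handles a single partition that is already sorted. It is not Hive code.

```java
import java.util.Arrays;

/** Plain-Java sketch of ROW_NUMBER / RANK / DENSE_RANK over one sorted partition (hypothetical names). */
final class RankingSketch {

    /** Sequential numbering from 1, regardless of ties. */
    static int[] rowNumber(int[] sortedDesc) {
        int[] out = new int[sortedDesc.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = i + 1;
        }
        return out;
    }

    /** Ties share a rank; the following rank skips ahead, leaving gaps. */
    static int[] rank(int[] sortedDesc) {
        int[] out = new int[sortedDesc.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (i > 0 && sortedDesc[i] == sortedDesc[i - 1]) ? out[i - 1] : i + 1;
        }
        return out;
    }

    /** Ties share a rank; the following rank is the next integer, leaving no gaps. */
    static int[] denseRank(int[] sortedDesc) {
        int[] out = new int[sortedDesc.length];
        for (int i = 0; i < out.length; i++) {
            if (i == 0) {
                out[i] = 1;
            } else {
                out[i] = (sortedDesc[i] == sortedDesc[i - 1]) ? out[i - 1] : out[i - 1] + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] cookie1 = {7, 5, 4, 4, 3, 2, 1}; // pv values of cookie1, sorted descending
        System.out.println("rank       " + Arrays.toString(rank(cookie1)));
        System.out.println("dense_rank " + Arrays.toString(denseRank(cookie1)));
        System.out.println("row_number " + Arrays.toString(rowNumber(cookie1)));
    }
}
```

Because the two rows with pv 4 tie for third place under RANK, filtering on rn1 <= 3 returns four rows for cookie1. Filtering on the ROW_NUMBER column instead would return exactly three.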

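2. Hive User-Defined Functions

2.1 Introduction to User-Defined Functions

1) Hive ships with built-in functions such as max and min, but the set is limited. You can extend it conveniently by writing your own UDFs.

2) When the built-in functions cannot meet your business needs, consider a user-defined function (UDF).

3) User-defined functions fall into three categories:

(1) UDF (User-Defined Function)

One row in, one value out

(2) UDAF (User-Defined Aggregation Function)

Aggregate function: many rows in, one value out

Similar to count/max/min

(3) UDTF (User-Defined Table-Generating Function)

One row in, many rows out

For example, lateral view explode()

4) Official documentation

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

5) Programming steps:

(1) Extend org.apache.hadoop.hive.ql.exec.UDF

(2) Implement the evaluate method. evaluate may be overloaded.

6) Notes

(1) A UDF must have a return type. It may return null, but the return type cannot be void.

(2) UDFs commonly use Hadoop types such as Text and LongWritable. Plain Java types are not recommended.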

2.2 Developing a User-Defined Function

Step 1: Create a Maven Java project and add the dependencies

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
</dependencies>
<build>
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.0</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
     <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-shade-plugin</artifactId>
         <version>2.2</version>
         <executions>
             <execution>
                 <phase>package</phase>
                 <goals>
                     <goal>shade</goal>
                 </goals>
                 <configuration>
                     <filters>
                         <filter>
                             <artifact>*:*</artifact>
                             <excludes>
                                 <exclude>META-INF/*.SF</exclude>
                                 <exclude>META-INF/*.DSA</exclude>
                                 <exclude>META-INF/*.RSA</exclude>
                             </excludes>
                         </filter>
                     </filters>
                 </configuration>
             </execution>
         </executions>
     </plugin>
</plugins>
</build>

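Step 2: Write a Java class that extends UDF and implements the evaluate method

package com.kkb.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {
     public Text evaluate(final Text s) {
         // Propagate NULL input as NULL output
         if (null == s) {
             return null;
         }
         // Return the input converted to upper case
         return new Text(s.toString().toUpperCase());
     }
 }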

Step 3: Package the project and upload it to Hive's lib directory

Build the jar with Maven's package goal, then upload it to /kkb/install/hive-1.1.0-cdh5.14.2/lib on the node03 server.

Step 4: Add the jar

Rename the jar:

cd /kkb/install/hive-1.1.0-cdh5.14.2/lib
mv original-day_hive_udf-1.0-SNAPSHOT.jar udf.jar

Add the jar in the Hive client:

0: jdbc:hive2://node03:10000> add jar /kkb/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;

Step 5: Register a temporary function backed by the UDF class

0: jdbc:hive2://node03:10000> create temporary function touppercase as 'com.kkb.udf.MyUDF';

Step 6: Use the user-defined function

0: jdbc:hive2://node03:10000> select touppercase('abc');

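How to create a permanent function in Hive

A temporary function has to be registered again every time you start the Hive client, and it stops working once you exit the client. To avoid this, you can create a permanent function instead.

Creating a permanent function

1. Select the database in which the function should be created
0: jdbc:hive2://node03:10000> use myhive;

2. Use add jar to add the jar to Hive
0: jdbc:hive2://node03:10000> add jar /kkb/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;

3. List all the jars that have been added
0: jdbc:hive2://node03:10000> list  jars;

4. Create the permanent function and bind it to the UDF class
0: jdbc:hive2://node03:10000> create  function myuppercase as 'com.kkb.udf.MyUDF';

5. List the permanent function
0: jdbc:hive2://node03:10000> show functions like 'my*';

6. Use the permanent function
0: jdbc:hive2://node03:10000> select myhive.myuppercase('helloworld');

7. Drop the permanent function
0: jdbc:hive2://node03:10000> drop function myhive.myuppercase;

8. List the functions again
0: jdbc:hive2://node03:10000> show functions like 'my*';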