Big Data Technology 08: Hive Study Notes 03, Queries and Functions

Date: 2020-05-28

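Chapter 6 Queries
  6.1 Basic Queries (select … from)
    6.1.1 Full-Table and Specific-Column Queries
    6.1.2 Column Aliases
    6.1.3 Arithmetic Operators
    6.1.4 Common Functions (Aggregate Functions)
    6.1.5 The limit Clause
  6.2 The where Clause
    6.2.1 Comparison Operators (between/in/is null)
    6.2.2 like and rlike
    6.2.3 Logical Operators (and/or/not)
  6.3 Grouping
    6.3.1 The group by Clause
    6.3.2 The having Clause
  6.4 The join Clause
    6.4.1 Equi-Joins
    6.4.2 Table Aliases
    6.4.3 Inner Join
    6.4.4 Left Outer Join
    6.4.5 Right Outer Join
    6.4.6 Full Outer Join
    6.4.7 Multi-Table Joins
    6.4.8 Cartesian Product
    6.4.9 or Is Not Supported in Join Predicates
  6.5 Sorting
    6.5.1 Global Sorting (order by)
    6.5.2 Sorting by Alias
    6.5.3 Sorting by Multiple Columns
    6.5.4 Sorting Within Each Reducer (sort by)
    6.5.5 Partitioned Sorting (distribute by)
    6.5.6 cluster by
  6.6 Bucketing and Sampling Queries
    6.6.1 Storing Data in Bucketed Tables
    6.6.2 Bucket Sampling Queries
  6.7 Other Common Query Functions (Advanced Hive)
    6.7.1 Assigning Values to NULL Fields
    6.7.2 The case … when … then … else … end Expression
    6.7.2 Row-to-Column Functions
    6.7.3 Column-to-Row Functions
    6.7.4 Window Functions
    6.7.5 rank Functions
Chapter 7 Functions (Advanced Hive)
  7.1 Built-in Functions
  7.2 User-Defined Functions
  7.3 Writing a Custom UDF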

Chapter 6 Queries

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
Query statement syntax:

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
  [LIMIT number]

6.1 Basic Queries (select … from)

6.1.1 Full-Table and Specific-Column Queries

1. Full-table query

hive (default)> select * from emp;

2. Selecting specific columns

hive (default)> select empno, ename from emp;

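Notes:
(1) SQL is case-insensitive.
(2) A SQL statement can be written on one line or across several lines.
(3) Keywords cannot be abbreviated or split across lines.
(4) Each clause is usually written on its own line.
(5) Use indentation to improve readability.

6.1.2 Column Aliases

1. An alias renames a column.
2. It makes calculated columns easier to refer to.
3. The alias follows the column name directly; the keyword as may optionally be placed between the column name and the alias.
4. Hands-on example
Query employee names and department numbers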

hive (default)> select ename as name, deptno dn from emp;

6.1.3 Arithmetic Operators

Hands-on example:
Display every employee's salary increased by 1.

hive (default)> select sal+1 from emp;

6.1.4 Common Functions (Aggregate Functions)

1. Total number of rows (count)

hive (default)> select count(*) cnt from emp;

Interview question: what is the difference between count(1), count(*), and count(column)?
Reference: https://www.cnblogs.com/chenmingjun/p/10436316.html
2. Maximum salary (max)

hive (default)> select max(sal) max_sal from emp;

3. Minimum salary (min)

hive (default)> select min(sal) min_sal from emp;

4. Sum of salaries (sum)

hive (default)> select sum(sal) sum_sal from emp;

5. Average salary (avg)

hive (default)> select avg(sal) avg_sal from emp;
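
The interview question above can be illustrated with a small Python sketch. It is not Hive code; it only mimics the usual SQL behaviour, in which count(*) and count(1) count every row while count(column) skips rows where that column is NULL. The sample rows are made up for illustration, and None stands in for NULL.

```python
# Hypothetical (empno, comm) rows; None stands in for SQL NULL.
rows = [(7369, None), (7499, 300.0), (7521, 500.0), (7566, None)]

# count(*): counts every row, whatever the column values are.
count_star = len(rows)

# count(1): evaluates the constant 1 for every row, so it also counts every row.
count_one = sum(1 for _ in rows)

# count(comm): counts only the rows whose comm is not NULL.
count_comm = sum(1 for _, comm in rows if comm is not None)

print(count_star, count_one, count_comm)
```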

6.1.5 The limit Clause

A typical query returns many rows. The LIMIT clause restricts the number of rows returned.

hive (default)> select * from emp limit 5;

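6.2 The where Clause

1. The WHERE clause filters out rows that do not satisfy the condition.
2. The WHERE clause immediately follows the FROM clause.
3. Hands-on example
Find all employees whose salary is greater than 1000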

hive (default)> select * from emp where sal>1000;

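6.2.1 Comparison Operators (between/in/is null)

1) The table below describes the predicate operators; these operators can also be used in JOIN...ON and HAVING clauses.

2) Hands-on examples
(1) Find all employees whose salary equals 5000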

hive (default)> select * from emp where sal=5000;

(2) Find employees whose salary is between 500 and 1000

hive (default)> select * from emp where sal between 500 and 1000;

(3) Find all employees whose comm is NULL

hive (default)> select * from emp where comm is null;

(4) Find employees whose salary is 1500 or 5000

hive (default)> select * from emp where sal IN (1500, 5000);

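6.2.2 like and rlike

1) Use the LIKE operator to select similar values.
2) The pattern can contain literal characters or digits plus two wildcards:
% stands for zero or more characters (any number of characters).
_ stands for exactly one character.
3) The RLIKE clause is a Hive extension of this feature; it specifies the match condition with Java regular expressions, a more powerful language.
4) Hands-on examples
(1) Find employees whose salary starts with 2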

hive (default)> select * from emp where sal LIKE '2%';

emp.empno    emp.ename   emp.job emp.mgr emp.hiredate    emp.sal emp.comm    emp.deptno
7698    BLAKE   MANAGER 7839    1981-5-1    2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9    2450.0  NULL    10

(2) Find employees whose salary has 2 as its second character

hive (default)> select * from emp where sal LIKE '_2%';

emp.empno    emp.ename   emp.job emp.mgr emp.hiredate    emp.sal emp.comm    emp.deptno
7521    WARD    SALESMAN    7698    1981-2-22   1250.0  500.0   30
7654    MARTIN  SALESMAN    7698    1981-9-28   1250.0  1400.0  30

(3) Find employees whose salary contains a 2

hive (default)> select sal from emp where sal RLIKE '[2]';

sal
1250.0
1250.0
2850.0
2450.0
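
The difference between the two operators can be sketched in Python. This is only an illustration of the matching rules described above, not Hive's implementation: LIKE must match the whole value, with % and _ as the only wildcards, while RLIKE succeeds when the regular expression matches anywhere in the value. Python's re module is used here as a stand-in for Java regular expressions, which is an assumption that holds for simple patterns such as [2]. The helper names like_to_regex, like and rlike are hypothetical.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches zero or more characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is a literal
    return "".join(parts)

def like(value: str, pattern: str) -> bool:
    # LIKE has to match the entire value, hence fullmatch.
    return re.fullmatch(like_to_regex(pattern), value, re.DOTALL) is not None

def rlike(value: str, regex: str) -> bool:
    # RLIKE is true if the regex matches any substring of the value.
    return re.search(regex, value) is not None

salaries = ["1250.0", "2850.0", "2450.0", "1600.0", "5000.0"]
starts_with_2 = [s for s in salaries if like(s, "2%")]
second_char_2 = [s for s in salaries if like(s, "_2%")]
contains_2 = [s for s in salaries if rlike(s, "[2]")]
print(starts_with_2, second_char_2, contains_2)
```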

6.2.3 Logical Operators (and/or/not)

Hands-on examples
(1) Find employees whose salary is greater than 1000 and whose department is 30

hive (default)> select * from emp where sal>1000 and deptno=30;

(2) Find employees whose salary is greater than 1000 or whose department is 30

hive (default)> select * from emp where sal>1000 or deptno=30;

(3) Find employees who are in neither department 20 nor department 30

hive (default)> select * from emp where deptno not in(30, 20);

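6.3 Grouping

6.3.1 The group by Clause

The GROUP BY clause is usually used together with aggregate functions: it groups the result by one or more columns and then applies the aggregate to each group.
Hands-on examples:
(1) Compute the average salary of each department in the emp table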

hive (default)> select avg(sal) avg_sal from emp group by deptno;
avg_sal
NULL
2916.6666666666665
1975.0
1566.6666666666667

hive (default)> select e.deptno, avg(e.sal) avg_sal from emp e group by e.deptno;
e.deptno    avg_sal
NULL    NULL
10    2916.6666666666665
20    1975.0
30    1566.6666666666667

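Note: every selected column that is not inside an aggregate function must appear in the group by clause.
(2) Compute the highest salary for each job in each department of the emp table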

hive (default)> select e.deptno, e.job, max(e.sal) max_sal from emp e group by e.deptno, e.job;

e.deptno    e.job   max_sal
NULL    MANAGER 7839    NULL
10    CLERK   1300.0
10    MANAGER 2450.0
10    PRESIDENT   5000.0
20    ANALYST 3000.0
20    CLERK   1100.0
30    CLERK   950.0
30    MANAGER 2850.0
30    SALESMAN    1600.0

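6.3.2 The having Clause

1. Differences between having and where
(1) where acts on the columns of the table and selects rows; having acts on the columns of the query result and filters it.
(2) Aggregate functions cannot be used after where, but they can be used after having.
(3) having is only used in group by statements.
2. Hands-on example
(1) Find the departments whose average salary is greater than 2000
First compute the average salary of each department in the emp table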

hive (default)> select deptno, avg(sal) avg_sal from emp 
group by deptno;

deptno    avg_sal
NULL    NULL
10    2916.6666666666665
20    1975.0
30    1566.6666666666667

Then keep only the departments whose average salary is greater than 2000

hive (default)> select deptno, avg(sal) avg_sal from emp 
group by deptno 
having avg_sal>2000;

deptno    avg_sal
10    2916.6666666666665

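6.4 The join Clause

6.4.1 Equi-Joins

Hive supports the usual SQL JOIN statements, but only equi-joins; non-equi-joins are not supported. (This restriction applies to older releases; according to the Hive language manual, complex expressions in the ON clause are supported from Hive 2.2.0 onward.)
Hands-on example
(1) Join the employee table and the department table on equal department numbers, and query the employee number, employee name, department number, and department name;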

hive (default)> select e.empno, e.ename, d.deptno, d.dname 
from emp e join dept d
on e.deptno = d.deptno;

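6.4.2 Table Aliases

1. Benefits
(1) Aliases simplify the query.
(2) Prefixing columns with the table name can improve execution efficiency.
2. Hands-on example
Join the employee table and the department table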

hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno=d.deptno;

6.4.3 Inner Join

Inner join (the intersection of tables A and B): a row is kept only when both joined tables contain data matching the join condition.

hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno=d.deptno;

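6.4.4 Left Outer Join

Left outer join (the intersection of A and B plus the rest of A): every row of the table on the left of the JOIN operator that satisfies the WHERE clause is returned.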

hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e left join dept d on e.deptno=d.deptno;

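6.4.5 Right Outer Join

Right outer join (the intersection of A and B plus the rest of B): every row of the table on the right of the JOIN operator that satisfies the WHERE clause is returned.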

hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e right join dept d on e.deptno=d.deptno;

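6.4.6 Full Outer Join

Full outer join (the intersection of A and B plus the rest of A plus the rest of B): every row of both tables that satisfies the WHERE clause is returned. Where either table has no matching value for a column, NULL is used instead.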

hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e full join dept d on e.deptno=d.deptno;
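
The four join types above differ only in which unmatched rows they keep. The following Python sketch mimics that behaviour on a few made-up rows; it is an illustration rather than Hive's execution model. The sample data (including the employee NOBODY in a non-existent department 50) and the helper name join are hypothetical, and None stands in for NULL.

```python
# Hypothetical sample rows; None stands in for SQL NULL.
emp = [(7369, "SMITH", 20), (7499, "ALLEN", 30), (7999, "NOBODY", 50)]  # (empno, ename, deptno)
dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES")]           # (deptno, dname)

def join(left, right, how="inner"):
    """Equi-join on emp.deptno = dept.deptno, returning (ename, dname) pairs."""
    out = []
    matched_right = set()
    for _, ename, e_dept in left:
        # A NULL key never equals anything, so it can never match.
        hits = [(i, d) for i, d in enumerate(right)
                if e_dept is not None and d[0] == e_dept]
        for i, (_, dname) in hits:
            out.append((ename, dname))
            matched_right.add(i)
        if not hits and how in ("left", "full"):
            out.append((ename, None))       # keep the unmatched left row
    if how in ("right", "full"):
        for i, (_, dname) in enumerate(right):
            if i not in matched_right:
                out.append((None, dname))   # keep the unmatched right row
    return out

for how in ("inner", "left", "right", "full"):
    print(how, join(emp, dept, how))
```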

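6.4.7 Multi-Table Joins

Note: joining n tables requires at least n-1 join conditions. For example, joining three tables requires at least two join conditions.
Data preparation
location.txt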

1700    Beijing
1800    London
1900    Tokyo

1. Create the location table

create table if not exists default.location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';

2. Load the data

hive (default)> load data local inpath '/opt/module/datas/location.txt' into table default.location;

3. Multi-table join query

hive (default)> select e.ename, d.dname, l.loc_name
from emp e 
join dept d
on e.deptno=d.deptno
join location l
on l.loc=d.loc;

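In most cases Hive launches one MapReduce job for each pair of tables being joined. In this example it first launches a MapReduce job to join tables e and d, and then launches a second MapReduce job to join the output of the first job with table l.
Note: why are tables d and l not joined first? Because Hive always processes joins from left to right.

6.4.8 Cartesian Product

1. A Cartesian product is produced when:
(1) the join condition is omitted;
(2) the join condition is invalid;
(3) every row of every table is joined with every other row.
2. Hands-on example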

hive (default)> select empno, dname from emp, dept;

6.4.9 or Is Not Supported in Join Predicates

hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno=d.deptno or e.ename=d.ename;   -- invalid: or is rejected in the join predicate

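6.5 Sorting

6.5.1 Global Sorting (order by)

order by: global sorting. Only one Reducer is used, regardless of how many Reducers have been configured manually.
1. Sorting with the ORDER BY clause
ASC (ascend): ascending order (the default), from smallest to largest
DESC (descend): descending order
2. The ORDER BY clause goes at the end of the SELECT statement.
3. Hands-on examples
(1) List employees by salary in ascending order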

hive (default)> select * from emp order by sal;

(2) List employees by salary in descending order

hive (default)> select * from emp order by sal desc;

6.5.2 Sorting by Alias

Sort by twice the employee's salary

hive (default)> select ename, sal*2 twosal from emp order by twosal;

6.5.3 Sorting by Multiple Columns

Sort by department and then by salary, both ascending

hive (default)> select ename, deptno, sal from emp order by deptno, sal;

6.5.4 Sorting Within Each Reducer (sort by)

sort by: sorts the rows inside each Reducer. With multiple Reducers the overall result set is not globally sorted.
1. Set the number of reducers

hive (default)> set mapreduce.job.reduces=3;

2. Check the configured number of reducers

hive (default)> set mapreduce.job.reduces;

3. View employees sorted by employee number in descending order

hive (default)> select * from emp sort by empno desc;

4. Write the query result to files (sorted by department number in descending order)

hive (default)> insert overwrite local directory '/opt/module/datas/sortby-result'
select * from emp sort by deptno desc;

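6.5.5 Partitioned Sorting (distribute by)

distribute by: similar to the partitioner in MapReduce. It decides which Reducer each row is sent to and needs to be combined with sort by.
Note: Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.
When testing distribute by, be sure to allocate multiple reducers; otherwise the effect of distribute by cannot be observed.
Hands-on example:
(1) Partition by department number, then sort by employee number in descending order.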

hive (default)> set mapreduce.job.reduces=3;
hive (default)> insert overwrite local directory '/opt/module/datas/distributeby-result' 
select * from emp distribute by deptno sort by empno desc;

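6.5.6 cluster by

When distribute by and sort by use the same column, cluster by can be used instead.
cluster by provides the behaviour of distribute by as well as that of sort by. However, the sort is ascending only; ASC or DESC cannot be specified.
1) The following two statements are equivalent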

hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;

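Note: partitioning by department number does not mean one fixed department per partition; departments 20 and 30, for example, may end up in the same partition.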
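
The interplay of distribute by and sort by can be sketched in Python. This is a simplified model under one stated assumption: an integer key is assigned to a reducer by taking the key modulo the number of reducers. The function name distribute_sort and the sample rows are hypothetical.

```python
from collections import defaultdict

def distribute_sort(rows, dist_key, sort_key, num_reducers, descending=False):
    """Mimic DISTRIBUTE BY dist_key SORT BY sort_key with num_reducers reducers."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumption: an int key hashes to itself; modulo picks the reducer.
        reducers[row[dist_key] % num_reducers].append(row)
    # Each reducer sorts only its own rows; there is no global order.
    return {r: sorted(part, key=lambda row: row[sort_key], reverse=descending)
            for r, part in sorted(reducers.items())}

emp = [
    {"empno": 7369, "deptno": 20}, {"empno": 7499, "deptno": 30},
    {"empno": 7782, "deptno": 10}, {"empno": 7788, "deptno": 20},
    {"empno": 7839, "deptno": 10}, {"empno": 7900, "deptno": 30},
]

# distribute by deptno sort by empno desc, with 3 reducers
by_dept = distribute_sort(emp, "deptno", "empno", num_reducers=3, descending=True)

# cluster by deptno is the special case where both keys are the same (ascending only)
clustered = distribute_sort(emp, "deptno", "deptno", num_reducers=3)

# With 4 reducers, departments 10 and 30 share a reducer (10 % 4 == 30 % 4 == 2).
shared = distribute_sort(emp, "deptno", "empno", num_reducers=4)
print(by_dept, clustered, shared, sep="\n")
```

The last call shows why several departments can land in the same partition: the partition is derived from the key, not assigned one per distinct value.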

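6.6 Bucketing and Sampling Queries

6.6.1 Storing Data in Bucketed Tables

Partitioning applies to the storage path of the data (directories); bucketing applies to the data files themselves (files).
Partitioning provides a convenient way to isolate data and optimize queries. However, not every data set can be partitioned sensibly, especially given the earlier concern about choosing an appropriate partition size.
Bucketing is another technique for breaking a data set into more manageable parts. It is useful when a single file is very large.
1. Create the bucketed table first, then load a data file directly
(1) Data preparation
stu_buck.txt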

1001    ss1
1002    ss2
1003    ss3
1004    ss4
1005    ss5
1006    ss6
1007    ss7
1008    ss8
1009    ss9
1010    ss10
1011    ss11
1012    ss12
1013    ss13
1014    ss14
1015    ss15
1016    ss16

(2) Create the bucketed table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

(3) View the table structure

hive (default)> desc formatted stu_buck;
Num Buckets:            4

(4) Load data into the bucketed table

hive (default)> load data local inpath '/opt/module/datas/stu_buck.txt' into table stu_buck;

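(5) Check whether the bucketed table has been split into 4 buckets, as shown in the figure below

The table has not been split into 4 buckets. Why not?

2. Create the bucketed table and load the data through a subquery
(1) First create an ordinary stu table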

create table stu(id int, name string)
row format delimited fields terminated by '\t';

(2) Load data into the ordinary stu table

load data local inpath '/opt/module/datas/stu_buck.txt' into table stu;

(3) Clear the data in the stu_buck table

hive (default)> truncate table stu_buck;
hive (default)> select * from stu_buck;

(4) Load data into the bucketed table through a subquery

hive (default)> insert into table stu_buck
select id, name from stu;

(5) There is still only one bucket, as shown in the figure below

(6) A property needs to be set

hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1; -- -1 means the number of reducers is not fixed in advance; Hive works out how many are needed when the HQL statement runs.
hive (default)> truncate table stu_buck;
hive (default)> insert into table stu_buck
select id, name from stu;

The figure below shows the successful bucketing

(7) Query the bucketed data

hive (default)> select * from stu_buck;

stu_buck.id    stu_buck.name
1016    ss16
1012    ss12
1008    ss8
1004    ss4
1009    ss9
1005    ss5
1001    ss1
1013    ss13
1010    ss10
1002    ss2
1006    ss6
1014    ss14
1003    ss3
1011    ss11
1007    ss7
1015    ss15

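The order in which the files are read is explained in the figure below:

6.6.2 Bucket Sampling Queries

For very large data sets, users sometimes need a representative sample of the results rather than all of them. Hive meets this need by sampling tables.
Query the data in table stu_buck.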

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
OK
stu_buck.id    stu_buck.name
1016    ss16
1012    ss12
1008    ss8
1004    ss4

Or

hive (default)> select * from stu_buck tablesample(bucket 1 out of 8 on id);
OK
stu_buck.id    stu_buck.name
1016    ss16
1008    ss8

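Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total number of buckets. Hive uses y to decide the sampling ratio. For example, if the table has 4 buckets, then y=2 samples (4/2=) 2 buckets of data, and y=8 samples (4/8=) 1/2 of a bucket.
x is the bucket to start from. If several buckets are needed, each subsequent bucket number is the current bucket number plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data: bucket 1 (x) and bucket 3 (x+y).
Note: x must be less than or equal to y; otherwise the following error is reported: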

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
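
The sampling rule can be checked against the outputs shown earlier with a short Python sketch. It rests on one assumption that is consistent with the stu_buck output above: an int id is placed in bucket id mod 4, and BUCKET x OUT OF y keeps the rows whose id mod y equals x-1. The function names bucket_of and tablesample are hypothetical, and the sketch ignores the order in which Hive returns rows.

```python
def bucket_of(id_value: int, num_buckets: int) -> int:
    # Assumption: an int key hashes to itself, so the bucket is id mod num_buckets.
    return id_value % num_buckets

def tablesample(ids, x: int, y: int):
    """Mimic TABLESAMPLE(BUCKET x OUT OF y ON id) for integer ids."""
    if x > y:
        raise ValueError("Numerator should not be bigger than denominator")
    return [i for i in ids if i % y == x - 1]

ids = list(range(1001, 1017))   # the 16 ids from stu_buck.txt
buckets = {b: [i for i in ids if bucket_of(i, 4) == b] for b in range(4)}

print(tablesample(ids, 1, 4))   # the whole first bucket
print(tablesample(ids, 1, 8))   # half of the first bucket
print(tablesample(ids, 1, 2))   # the first and third buckets together
```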

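6.7 Other Common Query Functions (Advanced Hive)

6.7.1 Assigning Values to NULL Fields

1. Function description
NVL: assigns a value to data that is NULL. Its format is NVL(string1, replace_with). If string1 is NULL, NVL returns replace_with; otherwise it returns string1. If both arguments are NULL, it returns NULL.
2. Data preparation: use the employee table
3. Query: if an employee's comm is NULL, use -1 instead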

hive (default)> select nvl(comm,-1) from emp;
OK
_c0
-1.0
300.0
500.0
20.0
1400.0
-1.0
-1.0
-1.0
-1.0
0.0
-1.0
-1.0
-1.0
-1.0

Or

hive (default)> select nvl(comm,ename) from emp;
OK
_c0
SMITH
300.0
500.0
20.0
1400.0
BLAKE
CLARK
SCOTT
KING
0.0
ADAMS
JAMES
FORD
MILLER

4. Query: if an employee's comm is NULL, use the manager id instead

hive (default)> select nvl(comm,mgr) from emp;
OK
_c0
7902.0
300.0
500.0
20.0
1400.0
7839.0
7839.0
7566.0
NULL
0.0
7788.0
7698.0
7566.0
7782.0

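6.7.2 The case … when … then … else … end Expression

Purpose: replace values.
1. Data preparation

2. Requirement
Count the men and women in each department. The result is as follows: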

A     2       1
B     1       2

3. Create emp_sex.txt locally and enter the data

[atguigu@hadoop102 datas]$ vim emp_sex.txt
悟空    A   男
大海    A   男
宋宋    B   男
鳳姐    A   女
婷姐    B   女
婷婷    B   女

4. Create the Hive table and load the data

create table emp_sex(
name string, 
dept_id string, 
sex string
) 
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;

5. Query the data as required

select 
  dept_id,
  sum(case sex when '男' then 1 else 0 end) male_count,
  sum(case sex when '女' then 1 else 0 end) female_count
from 
  emp_sex
group by
  dept_id;

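6.7.2 Row-to-Column Functions

1. Function descriptions
1) CONCAT(string A/col, string B/col, …): returns the concatenation of the input strings; any number of input strings is supported.
2) CONCAT_WS(separator, str1, str2,…): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be a string just like the other arguments. If the separator is NULL, the result is also NULL. The function skips any NULL values after the separator argument. The separator is inserted between the strings being joined.
3) COLLECT_SET(col): accepts only primitive data types. It de-duplicates and collects the values of a column, producing an array-typed field.

Note: CONCAT() and CONCAT_WS() are ordinary UDFs (one value out for each input row), not UDTFs; COLLECT_SET() is an aggregate function (UDAF).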

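2. Data preparation
person_info.txt

3. Requirement
Group together the people who share the same constellation and blood type. The result is as follows: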

射手座,A            大海|鳳姐
白羊座,A            孫悟空|豬八戒
白羊座,B            宋宋

Analysis:

4. Create person_info.txt locally and enter the data

[atguigu@hadoop102 datas]$ vim person_info.txt
孫悟空    白羊座 A
大海    射手座 A
宋宋    白羊座 B
豬八戒    白羊座 A
鳳姐    射手座 A

5. Create the Hive table and load the data

create table person_info(
name string, 
constellation string, 
blood_type string
) 
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/person_info.txt' into table person_info;

6. Query the data as required

select concat_ws(",", constellation, blood_type) as c_b, name from person_info;

--------------------

select 
  t1.c_b, collect_set(t1.name)
from 
  (select concat_ws(",", constellation, blood_type) as c_b, name from person_info) t1
group by
  t1.c_b;

--------------------

select 
  t1.c_b, concat_ws("|", collect_set(t1.name))
from 
  (select concat_ws(",", constellation, blood_type) as c_b, name from person_info) t1
group by
  t1.c_b;
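
The three queries above build the result in stages: concatenate the two key columns, collect the distinct names per key, then join the names with a separator. A minimal Python sketch of the same pipeline, using the rows of person_info.txt, follows. It only illustrates the logic; in particular, Hive does not promise any particular element order inside the array returned by collect_set, whereas this sketch keeps input order.

```python
from collections import defaultdict

# Rows from person_info.txt: (name, constellation, blood_type).
person_info = [
    ("孫悟空", "白羊座", "A"),
    ("大海", "射手座", "A"),
    ("宋宋", "白羊座", "B"),
    ("豬八戒", "白羊座", "A"),
    ("鳳姐", "射手座", "A"),
]

groups = defaultdict(list)
for name, constellation, blood_type in person_info:
    # concat_ws(",", constellation, blood_type)
    c_b = ",".join([constellation, blood_type])
    # collect_set(name): keep each distinct name once per group
    if name not in groups[c_b]:
        groups[c_b].append(name)

# concat_ws("|", collect_set(name))
result = {c_b: "|".join(names) for c_b, names in groups.items()}
for c_b, names in result.items():
    print(c_b, names)
```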

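6.7.3 Column-to-Row Functions

1. Function description
EXPLODE(col): splits a complex array or map in one Hive column into multiple rows.
LATERAL VIEW
Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
Explanation: lateral view is used together with UDTFs such as explode (often fed by split, which produces the array). It expands one column into multiple rows, and the expanded rows can then be aggregated.
2. Data preparation
movie_info.txt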

movie           category

《疑犯追蹤》    懸疑,動做,科幻,劇情
《Lie to me》   懸疑,警匪,動做,心理,劇情
《戰狼2》   戰爭,動做,災難

3. Requirement
Expand the array of categories for each movie. The result is as follows:

《疑犯追蹤》    懸疑
《疑犯追蹤》    動做
《疑犯追蹤》    科幻
《疑犯追蹤》    劇情
《Lie to me》   懸疑
《Lie to me》   警匪
《Lie to me》   動做
《Lie to me》   心理
《Lie to me》   劇情
《戰狼2》   戰爭
《戰狼2》   動做
《戰狼2》   災難

4. Create movie_info.txt locally and enter the data

[atguigu@hadoop102 datas]$ vim movie_info.txt
《疑犯追蹤》    懸疑,動做,科幻,劇情
《Lie to me》   懸疑,警匪,動做,心理,劇情
《戰狼2》   戰爭,動做,災難

5. Create the Hive table and load the data

create table movie_info(
movie string, 
category array<string>
) 
row format delimited fields terminated by "\t"
collection items terminated by ",";

load data local inpath "/opt/module/datas/movie_info.txt" into table movie_info;

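6. Query the data as required

select 
  movie,
  explode(category)
from
  movie_info;

The query above is invalid. If it could run, the result would be a Cartesian product.

Summary: a UDTF such as explode cannot be selected directly alongside ordinary columns of the source table; to do that, the UDTF must be combined with lateral view.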
-----------------------------------------

select
  movie,
  category_name
from 
  movie_info 
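lateral view explode(category) table_tmp as category_name; -- lateral view pairs each source row with the rows explode generates from it, giving a lateral table (table_tmp) and a lateral column (category_name).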
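
What lateral view explode does can be pictured with a short Python sketch: each movie is paired only with the elements of its own category array, which is why the result is not a Cartesian product. The sketch uses two of the lines from movie_info.txt and is an illustration of the semantics only.

```python
# Two raw lines in the layout of movie_info.txt: movie, a tab, comma-separated categories.
raw_lines = [
    "《疑犯追蹤》\t懸疑,動做,科幻,劇情",
    "《戰狼2》\t戰爭,動做,災難",
]

# Loading the table: fields are split on the tab, collection items on the comma.
movie_info = []
for line in raw_lines:
    movie, categories = line.split("\t")
    movie_info.append((movie, categories.split(",")))

# lateral view explode(category): pair each movie with every element of its own array.
exploded = [(movie, category_name)
            for movie, category in movie_info
            for category_name in category]

for row in exploded:
    print(row)
```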

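6.7.4 Window Functions

1. Function descriptions
Note: a window is evaluated for each row of data. With an empty OVER(), the window for every row is the entire result set.
OVER(): specifies the size of the data window that the analytic function works on; this window may change from row to row.
CURRENT ROW: the current row.
n PRECEDING: the n rows before the current row.
n FOLLOWING: the n rows after the current row.
UNBOUNDED: the boundary. UNBOUNDED PRECEDING means starting from the first row; UNBOUNDED FOLLOWING means ending at the last row.
LAG(col,n): the value of col in the n-th row before the current row.
LEAD(col,n): the value of col in the n-th row after the current row.
NTILE(n): distributes the rows of an ordered partition into n groups. The groups are numbered from 1, and for each row NTILE returns the number of the group the row belongs to. Note: n must be of type int.
2. Data preparation
business.txt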

name    orderdate   cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

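3. Requirements
(1) Find the customers who made purchases in April 2017 and the total number of such customers
(2) Show each customer's purchase details together with the monthly purchase total
(3) In the scenario above, accumulate cost by date
(4) Find each customer's previous purchase date
(5) Show the orders in the earliest 20% by time
4. Create business.txt locally and enter the data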

[atguigu@hadoop102 datas]$ vim business.txt

5. Create the Hive table and load the data

create table business(
name string, 
orderdate string,
cost int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/datas/business.txt" into table business;

6. Query the data as required
(1) Find the customers who made purchases in April 2017 and the total number of such customers

select name, count(*) 
from business
where substring(orderdate, 1, 7)="2017-04"
group by name;

+-------+------+--+
| name  | _c1  |
+-------+------+--+
| jack  | 1    |
| mart  | 4    |
+-------+------+--+

-----------------------------------------
select name, count(*) over()
from business
where substring(orderdate, 1, 7)="2017-04"
group by name;

+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
| mart  | 2               |
| jack  | 2               |
+-------+-----------------+--+

(2) Show each customer's purchase details together with the monthly purchase total

select sum(cost)
from business;

select *, 
sum(cost) over() -- empty over(): the window for every row is the whole result set
from business;

select month(orderdate) from business; -- output the month

select *, 
sum(cost) over(distribute by month(orderdate)) -- partition by month
from business;

select *, 
sum(cost) over(partition by month(orderdate)) -- partition by month (same as above)
from business;

(3) In the scenario above, accumulate cost by date

select * from business sort by orderdate;

select *, 
sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;

select *, 
sum(cost) over(distribute by name sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) 
from business;

select *, 
sum(cost) over(partition by name sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) 
from business;

select name, orderdate, cost, 
sum(cost) over() as sample1, -- a single partition: all rows are summed into one value
sum(cost) over(partition by name) as sample2, -- partitioned by name: several partitions, each summed into one value
sum(cost) over(order by orderdate) as sample3, -- ordered by orderdate in a single partition: running total
sum(cost) over(partition by name order by orderdate) as sample4, -- partitioned by name and ordered by orderdate: running total within each partition
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) as sample5, -- same as sample4: aggregate from the start of the partition to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and CURRENT ROW) as sample6, -- the current row plus the previous row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) as sample7, -- the current row plus one row before and one row after
sum(cost) over(partition by name order by orderdate rows between CURRENT ROW and UNBOUNDED FOLLOWING) as sample8 -- the current row and all following rows
from business;

Demo 1:

select name, orderdate, cost,
sum(cost) over() as sample1
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample1  |
+-------+-------------+-------+----------+--+
| mart  | 2017-04-13  | 94    | 661      |
| neil  | 2017-06-12  | 80    | 661      |
| mart  | 2017-04-11  | 75    | 661      |
| neil  | 2017-05-10  | 12    | 661      |
| mart  | 2017-04-09  | 68    | 661      |
| mart  | 2017-04-08  | 62    | 661      |
| jack  | 2017-01-08  | 55    | 661      |
| tony  | 2017-01-07  | 50    | 661      |
| jack  | 2017-04-06  | 42    | 661      |
| jack  | 2017-01-05  | 46    | 661      |
| tony  | 2017-01-04  | 29    | 661      |
| jack  | 2017-02-03  | 23    | 661      |
| tony  | 2017-01-02  | 15    | 661      |
| jack  | 2017-01-01  | 10    | 661      |
+-------+-------------+-------+----------+--+

Demo 2:

select name, orderdate, cost,
sum(cost) over(partition by name) as sample2
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample2  |
+-------+-------------+-------+----------+--+
| jack  | 2017-01-05  | 46    | 176      |
| jack  | 2017-01-08  | 55    | 176      |
| jack  | 2017-01-01  | 10    | 176      |
| jack  | 2017-04-06  | 42    | 176      |
| jack  | 2017-02-03  | 23    | 176      |
| mart  | 2017-04-13  | 94    | 299      |
| mart  | 2017-04-11  | 75    | 299      |
| mart  | 2017-04-09  | 68    | 299      |
| mart  | 2017-04-08  | 62    | 299      |
| neil  | 2017-05-10  | 12    | 92       |
| neil  | 2017-06-12  | 80    | 92       |
| tony  | 2017-01-04  | 29    | 94       |
| tony  | 2017-01-02  | 15    | 94       |
| tony  | 2017-01-07  | 50    | 94       |
+-------+-------------+-------+----------+--+

Demo 3:

select name, orderdate, cost,
sum(cost) over(order by orderdate) as sample3
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample3  |
+-------+-------------+-------+----------+--+
| jack  | 2017-01-01  | 10    | 10       |
| tony  | 2017-01-02  | 15    | 25       |
| tony  | 2017-01-04  | 29    | 54       |
| jack  | 2017-01-05  | 46    | 100      |
| tony  | 2017-01-07  | 50    | 150      |
| jack  | 2017-01-08  | 55    | 205      |
| jack  | 2017-02-03  | 23    | 228      |
| jack  | 2017-04-06  | 42    | 270      |
| mart  | 2017-04-08  | 62    | 332      |
| mart  | 2017-04-09  | 68    | 400      |
| mart  | 2017-04-11  | 75    | 475      |
| mart  | 2017-04-13  | 94    | 569      |
| neil  | 2017-05-10  | 12    | 581      |
| neil  | 2017-06-12  | 80    | 661      |
+-------+-------------+-------+----------+--+

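Note:
select name, orderdate, cost,
sum(cost) over(sort by orderdate) as sample3
from business;
This gives the same result as above.
Difference: as query-level clauses, sort by lets you set the number of reducers, whereas order by always uses a single reducer, whatever number has been configured.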

Demo 4:

select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate) as sample4
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample4  |
+-------+-------------+-------+----------+--+
| jack  | 2017-01-01  | 10    | 10       |
| jack  | 2017-01-05  | 46    | 56       |
| jack  | 2017-01-08  | 55    | 111      |
| jack  | 2017-02-03  | 23    | 134      |
| jack  | 2017-04-06  | 42    | 176      |
| mart  | 2017-04-08  | 62    | 62       |
| mart  | 2017-04-09  | 68    | 130      |
| mart  | 2017-04-11  | 75    | 205      |
| mart  | 2017-04-13  | 94    | 299      |
| neil  | 2017-05-10  | 12    | 12       |
| neil  | 2017-06-12  | 80    | 92       |
| tony  | 2017-01-02  | 15    | 15       |
| tony  | 2017-01-04  | 29    | 44       |
| tony  | 2017-01-07  | 50    | 94       |
+-------+-------------+-------+----------+--+

Demo 5:

select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) as sample5
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample5  |
+-------+-------------+-------+----------+--+
| jack  | 2017-01-01  | 10    | 10       |
| jack  | 2017-01-05  | 46    | 56       |
| jack  | 2017-01-08  | 55    | 111      |
| jack  | 2017-02-03  | 23    | 134      |
| jack  | 2017-04-06  | 42    | 176      |
| mart  | 2017-04-08  | 62    | 62       |
| mart  | 2017-04-09  | 68    | 130      |
| mart  | 2017-04-11  | 75    | 205      |
| mart  | 2017-04-13  | 94    | 299      |
| neil  | 2017-05-10  | 12    | 12       |
| neil  | 2017-06-12  | 80    | 92       |
| tony  | 2017-01-02  | 15    | 15       |
| tony  | 2017-01-04  | 29    | 44       |
| tony  | 2017-01-07  | 50    | 94       |
+-------+-------------+-------+----------+--+

Demo 6:

select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and CURRENT ROW) as sample6
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample6  |
+-------+-------------+-------+----------+--+
| jack  | 2017-01-01  | 10    | 10       |
| jack  | 2017-01-05  | 46    | 56       |
| jack  | 2017-01-08  | 55    | 101      |
| jack  | 2017-02-03  | 23    | 78       |
| jack  | 2017-04-06  | 42    | 65       |
| mart  | 2017-04-08  | 62    | 62       |
| mart  | 2017-04-09  | 68    | 130      |
| mart  | 2017-04-11  | 75    | 143      |
| mart  | 2017-04-13  | 94    | 169      |
| neil  | 2017-05-10  | 12    | 12       |
| neil  | 2017-06-12  | 80    | 92       |
| tony  | 2017-01-02  | 15    | 15       |
| tony  | 2017-01-04  | 29    | 44       |
| tony  | 2017-01-07  | 50    | 79       |
+-------+-------------+-------+----------+--+

Demo 7:

select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) as sample7
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample7  |
+-------+-------------+-------+----------+--+
| jack  | 2017-01-01  | 10    | 56       |
| jack  | 2017-01-05  | 46    | 111      |
| jack  | 2017-01-08  | 55    | 124      |
| jack  | 2017-02-03  | 23    | 120      |
| jack  | 2017-04-06  | 42    | 65       |
| mart  | 2017-04-08  | 62    | 130      |
| mart  | 2017-04-09  | 68    | 205      |
| mart  | 2017-04-11  | 75    | 237      |
| mart  | 2017-04-13  | 94    | 169      |
| neil  | 2017-05-10  | 12    | 92       |
| neil  | 2017-06-12  | 80    | 92       |
| tony  | 2017-01-02  | 15    | 44       |
| tony  | 2017-01-04  | 29    | 94       |
| tony  | 2017-01-07  | 50    | 79       |
+-------+-------------+-------+----------+--+

Demo 8:

select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate rows between CURRENT ROW and UNBOUNDED FOLLOWING) as sample8
from business;
+-------+-------------+-------+----------+--+
| name  |  orderdate  | cost  | sample8  |
+-------+-------------+-------+----------+--+
| jack  | 2017-01-01  | 10    | 176      |
| jack  | 2017-01-05  | 46    | 166      |
| jack  | 2017-01-08  | 55    | 120      |
| jack  | 2017-02-03  | 23    | 65       |
| jack  | 2017-04-06  | 42    | 42       |
| mart  | 2017-04-08  | 62    | 299      |
| mart  | 2017-04-09  | 68    | 237      |
| mart  | 2017-04-11  | 75    | 169      |
| mart  | 2017-04-13  | 94    | 94       |
| neil  | 2017-05-10  | 12    | 92       |
| neil  | 2017-06-12  | 80    | 80       |
| tony  | 2017-01-02  | 15    | 94       |
| tony  | 2017-01-04  | 29    | 79       |
| tony  | 2017-01-07  | 50    | 50       |
+-------+-------------+-------+----------+--+
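
The frame clauses in demos 5 to 8 can be reproduced with a small Python sketch that slides a window over one partition. The values are jack's costs ordered by orderdate, taken from the tables above. The function name windowed_sum and its argument convention (None for UNBOUNDED, 0 for CURRENT ROW) are choices made for this illustration.

```python
def windowed_sum(costs, preceding, following):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None means UNBOUNDED; 0 means CURRENT ROW.
    """
    n = len(costs)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        out.append(sum(costs[start:end]))
    return out

jack = [10, 46, 55, 23, 42]   # jack's costs ordered by orderdate

sample5 = windowed_sum(jack, None, 0)   # UNBOUNDED PRECEDING and CURRENT ROW
sample6 = windowed_sum(jack, 1, 0)      # 1 PRECEDING and CURRENT ROW
sample7 = windowed_sum(jack, 1, 1)      # 1 PRECEDING and 1 FOLLOWING
sample8 = windowed_sum(jack, 0, None)   # CURRENT ROW and UNBOUNDED FOLLOWING
print(sample5, sample6, sample7, sample8, sep="\n")
```

The four lists match jack's rows in the sample5 to sample8 columns above.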

(4) Find each customer's previous purchase date

select *, 
lag(orderdate, 1) over(distribute by name sort by orderdate)
from business;

Find each customer's previous and next purchase dates
select *, 
lag(orderdate, 1) over(distribute by name sort by orderdate) as lag1,
lead(orderdate, 1) over(distribute by name sort by orderdate) as lead1
from business;

(5) Show the orders in the earliest 20% by time

select *,
ntile(5) over(sort by orderdate) as gid
from business;

select *
from (select *,
       ntile(5) over(sort by orderdate) as gid
       from business) as t
where t.gid=1;

select *
from (select name, orderdate, cost,
       ntile(5) over(sort by orderdate) as gid
       from business) as t
where t.gid=1;
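
How ntile(5) splits the 14 orders can be sketched in Python. The sketch assumes the usual NTILE rule, in which the groups are as equal in size as possible and the earlier groups receive the extra rows; that Hive follows exactly this rule is an assumption here. The function name ntile is reused only for readability.

```python
def ntile(num_rows: int, n: int):
    """Return the NTILE(n) group number (starting at 1) for each of num_rows ordered rows."""
    base, extra = divmod(num_rows, n)
    groups = []
    for g in range(1, n + 1):
        size = base + (1 if g <= extra else 0)   # earlier groups absorb the remainder
        groups.extend([g] * size)
    return groups

# Order dates from business.txt, sorted ascending.
orderdates = sorted([
    "2017-01-01", "2017-01-02", "2017-02-03", "2017-01-04", "2017-01-05",
    "2017-04-06", "2017-01-07", "2017-01-08", "2017-04-08", "2017-04-09",
    "2017-05-10", "2017-04-11", "2017-06-12", "2017-04-13",
])

gids = ntile(len(orderdates), 5)
first_group = [d for d, g in zip(orderdates, gids) if g == 1]   # where t.gid=1
print(gids)
print(first_group)
```

With 14 rows and 5 groups, "the first 20%" is therefore the three earliest orders under this rule.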

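6.7.5 rank Functions

1. Function descriptions
RANK(): tied rows share the same rank and the ranks after a tie are skipped, so the last rank still equals the row count. (Two scores of 100 both rank first; a score of 99 ranks third.)
DENSE_RANK(): tied rows share the same rank and no rank is skipped, so the last rank is smaller than the row count. (Two scores of 100 both rank first; a score of 99 ranks second.)
ROW_NUMBER(): numbers the rows consecutively in sort order, including ties.
Note: a rank function must be followed by an over clause (it is a window function).
2. Data preparation

3. Requirement
Compute the ranking of scores within each subject.
4. Create score.txt locally and enter the data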

[atguigu@hadoop102 datas]$ vim score.txt

5. Create the Hive table and load the data

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/score.txt' into table score;

6. Query the data as required

select *,
rank() over(partition by subject order by score desc) rp,
dense_rank()over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rnp
from score;
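
The difference between the three functions shows up only when scores tie. The Python sketch below computes all three for one subject, ordered by score descending. The scores are made up for illustration, and the function name rankings is hypothetical.

```python
def rankings(scores):
    """Return (score, rank, dense_rank, row_number) tuples, ordered by score descending."""
    ordered = sorted(scores, reverse=True)
    out = []
    rank = dense = 0
    prev = None
    for row_number, s in enumerate(ordered, start=1):
        if s != prev:
            rank = row_number   # RANK jumps to the row position after a tie
            dense += 1          # DENSE_RANK moves on to the next integer
            prev = s
        out.append((s, rank, dense, row_number))
    return out

# Hypothetical scores for one subject.
for row in rankings([100, 99, 100, 80]):
    print(row)
```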

Chapter 7 Functions (Advanced Hive)

7.1 Built-in Functions

1. List the built-in functions

hive> show functions;

2. Show the usage of a built-in function

hive> desc function upper;

3. Show the detailed usage of a built-in function

hive> desc function extended upper;

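7.2 User-Defined Functions

1) Hive ships with some built-in functions, such as max and min, but their number is limited; you can extend Hive conveniently by writing your own UDFs.
2) When the built-in functions cannot meet your business needs, consider a user-defined function (UDF: user-defined function).
3) User-defined functions fall into three categories:
(1) UDF (User-Defined-Function)
One row in, one row out
(2) UDAF (User-Defined Aggregation Function)
Aggregate function: many rows in, one row out
Similar to count/max/min
(3) UDTF (User-Defined Table-Generating Functions)
One row in, many rows out
For example: lateral view explode()
4) Official documentation
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
5) Programming steps:
(1) Extend org.apache.hadoop.hive.ql.exec.UDF
(2) Implement the evaluate() method; evaluate() can be overloaded
(3) Create the function in the Hive command-line window
a) Add the jar
add jar linux_jar_path
b) Create the function
create [temporary] function [dbname.]function_name AS class_name;
(4) Drop the function in the Hive command-line window
drop [temporary] function [if exists] [dbname.]function_name;
6) Notes
(1) A UDF must have a return type. It may return null, but the return type cannot be void.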

7.3 Writing a Custom UDF

1. Create a Maven project named Hive
2. Add the dependency

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>

3. Create a class

package com.atguigu.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * @author chenmingjun
 * @date 2019-02-27 17:50
 */
public class HiveUDF extends UDF {

    public String evaluate(final String s) {

        if (s == null) {
            return null;
        }

        return s.toLowerCase();
    }
}

4. Package the class as a jar and upload it to the server as /opt/module/jars/udf.jar
5. Add the jar to Hive's class path

hive (default)> add jar /opt/module/jars/udf.jar;

6. Create a temporary function bound to the compiled Java class

hive (default)> create temporary function mylower as "com.atguigu.hive.HiveUDF";

7. The custom function can now be used in HQL

hive (default)> select ename, mylower(ename) lowername from emp;
OK
ename    lowername
SMITH    smith
ALLEN    allen
WARD    ward
JONES    jones
MARTIN    martin
BLAKE    blake
CLARK    clark
SCOTT    scott
KING    king
TURNER    turner
ADAMS    adams
JAMES    james
FORD    ford
MILLER    miller