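1. Introduction
2. Table-generating functions
2.2 Table-generating functions (lateral view explode)
3. Collection functions
3.1 Checking whether a value exists in a collection (array_contains)
3.4 Returning the keys of a Map (map_keys)
3.5 Returning the values of a Map (map_values)
5.2.4 row_number, rank, dense_rank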
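Unlike ordinary SQL functions, Hive provides some functions that many other SQL dialects do not support, such as table-generating functions, window functions, and collection functions. The sections below go through them one by one.
explode: iterates over a collection and emits each element as a new row. Note that explode produces a new table:
For example: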
select * from t_stu_subject;
+-------------------+---------------------+-----------------------------+--+
| t_stu_subject.id  | t_stu_subject.name  |   t_stu_subject.subjects    |
+-------------------+---------------------+-----------------------------+--+
| 1                 | zhangsan            | ["化學","物理","數學","語文"] |
| 2                 | lisi                | ["化學","數學","生物","生理","衛生"] |
| 3                 | wangwu              | ["化學","語文","英語","體育","生物"] |
+-------------------+---------------------+-----------------------------+--+
3 rows selected (0.176 seconds)

-- Explode the subjects column into one row per element
select explode(subjects) from t_stu_subject;
+------+--+
| col  |
+------+--+
| 化學 |
| 物理 |
| 數學 |
| 語文 |
| 化學 |
| 數學 |
| 生物 |
| 生理 |
| 衛生 |
| 化學 |
| 語文 |
| 英語 |
| 體育 |
| 生物 |
+------+--+

-- Invalid statement: explode produces a table, so selecting ordinary columns next to it fails
select id,name,explode(subjects) from t_stu_subject;
Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
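lateral view is used together with UDTFs such as explode (often fed by split, which is an ordinary function that turns a string into an array). It expands one row into multiple rows, and the expanded rows can then be aggregated. lateral view first calls the UDTF for every row of the base table; the UDTF turns that row into one or more rows; lateral view then combines the results into a virtual table that supports a table alias.
Used on its own as above, explode does not look very useful, but it becomes practical when combined with lateral view: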
select id,name,sub from t_stu_subject lateral view explode(subjects) tmp as sub;
+-----+-----------+------+--+
| id  |   name    | sub  |
+-----+-----------+------+--+
| 1   | zhangsan  | 化學 |
| 1   | zhangsan  | 物理 |
| 1   | zhangsan  | 數學 |
| 1   | zhangsan  | 語文 |
| 2   | lisi      | 化學 |
| 2   | lisi      | 數學 |
| 2   | lisi      | 生物 |
| 2   | lisi      | 生理 |
| 2   | lisi      | 衛生 |
| 3   | wangwu    | 化學 |
| 3   | wangwu    | 語文 |
| 3   | wangwu    | 英語 |
| 3   | wangwu    | 體育 |
| 3   | wangwu    | 生物 |
+-----+-----------+------+--+
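How the SQL works:
lateral view behaves like a join between two tables:
- Left table: the original table.
- Right table: the table produced by explode(<collection column>).
- The join only pairs rows that come from the same source row.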
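The join reading above can be modelled outside Hive. The following is a minimal Python sketch, not Hive's implementation: the list `rows` is a hypothetical in-memory stand-in for t_stu_subject, and the helper names `explode` and `lateral_view_explode` are chosen here only for illustration.

```python
# Hypothetical in-memory stand-in for t_stu_subject: (id, name, subjects).
rows = [
    (1, "zhangsan", ["化學", "物理", "數學", "語文"]),
    (2, "lisi", ["化學", "數學", "生物", "生理", "衛生"]),
    (3, "wangwu", ["化學", "語文", "英語", "體育", "生物"]),
]


def explode(values):
    """Yield each element of a collection as its own row."""
    for value in values:
        yield value


def lateral_view_explode(table):
    """Pair every source row only with the rows exploded from that same row."""
    for row_id, name, subjects in table:
        for sub in explode(subjects):
            yield (row_id, name, sub)


result = list(lateral_view_explode(rows))
for row in result:
    print(row)
```

The nested loop is the point: the inner rows never mix with other source rows, which is why id and name can be selected next to the exploded column.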
Example: word count
Use explode and lateral view to implement a Hive version of wordcount.
Given the following data:
a b c d e f g
a b c e f g
a b c d b
Create a table for the data:
create table t_juzi(line string) row format delimited;
Load the data:
load data local inpath '/root/words.txt' into table t_juzi;
The wordcount query:
select a.word,count(1) cnt
from (select tmp.* from t_juzi lateral view explode(split(line,' ')) tmp as word) a
group by a.word
order by cnt desc;
+---------+------+--+
| a.word  | cnt  |
+---------+------+--+
| b       | 4    |
| c       | 3    |
| a       | 3    |
| g       | 2    |
| f       | 2    |
| e       | 2    |
| d       | 2    |
+---------+------+--+
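As a cross-check of the counts, the same split, explode, group, and sort pipeline can be sketched in Python. This is only an illustration; it assumes the sample text is stored as three lines, and the variable names are hypothetical.

```python
from collections import Counter

# Assumed layout of the sample file: three lines of space-separated words.
lines = ["a b c d e f g", "a b c e f g", "a b c d b"]

# split(line, ' ') followed by explode: one word per output row.
words = [word for line in lines for word in line.split(" ")]

# group by word with count(1).
counts = Counter(words)

# order by cnt desc (the order among equal counts is not significant).
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for word, cnt in ranked:
    print(word, cnt)
```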
array_contains: syntax
array_contains(Array<T>, value) returns a boolean
Example:
-- Source data
select * from t_stu_subject;
+-------------------+---------------------+-----------------------------+--+
| t_stu_subject.id  | t_stu_subject.name  |   t_stu_subject.subjects    |
+-------------------+---------------------+-----------------------------+--+
| 1                 | zhangsan            | ["化學","物理","數學","語文"] |
| 2                 | lisi                | ["化學","數學","生物","生理","衛生"] |
| 3                 | wangwu              | ["化學","語文","英語","體育","生物"] |
+-------------------+---------------------+-----------------------------+--+
3 rows selected (0.066 seconds)

-- Using array_contains
select id, name, array_contains(subjects, '語文') from t_stu_subject;
+-----+-----------+--------+--+
| id  |   name    |  _c2   |
+-----+-----------+--------+--+
| 1   | zhangsan  | true   |
| 2   | lisi      | false  |
| 3   | wangwu    | true   |
+-----+-----------+--------+--+
3 rows selected (13.573 seconds)
sort_array: syntax
sort_array(Array<T>) returns the array sorted in ascending order
Example:
select sort_array(array(3,2,6));
+----------+--+
|   _c0    |
+----------+--+
| [2,3,6]  |
+----------+--+
1 row selected (12.599 seconds)
-- Source data
select * from t_stu_subject;
+-------------------+---------------------+-----------------------------+--+
| t_stu_subject.id  | t_stu_subject.name  |   t_stu_subject.subjects    |
+-------------------+---------------------+-----------------------------+--+
| 1                 | zhangsan            | ["化學","物理","數學","語文"] |
| 2                 | lisi                | ["化學","數學","生物","生理","衛生"] |
| 3                 | wangwu              | ["化學","語文","英語","體育","生物"] |
+-------------------+---------------------+-----------------------------+--+
3 rows selected (0.069 seconds)

-- size test: number of elements in each subjects array
select id, name, size(subjects) as sub_num from t_stu_subject;
+-----+-----------+----------+--+
| id  |   name    | sub_num  |
+-----+-----------+----------+--+
| 1   | zhangsan  | 4        |
| 2   | lisi      | 5        |
| 3   | wangwu    | 5        |
+-----+-----------+----------+--+
3 rows selected (13.578 seconds)
Syntax:
map_keys(Map<K,V>) returns an array of the map's keys
Example:
select * from t_family;
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| t_family.id  | t_family.name  |                    t_family.family_members                     | t_family.age  |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| 1            | zhangsan       | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"}  | 28            |
| 2            | lisi           | {"father":"mayun","mother":"huangyi","brother":"guanyu"}       | 22            |
| 3            | wangwu         | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"}  | 29            |
| 4            | mayun          | {"father":"mayongzhen","mother":"angelababy"}                  | 26            |
+--------------+----------------+----------------------------------------------------------------+---------------+--+

-- Find which family relations each person has
select id,name,map_keys(family_members) as relations,age from t_family;
+-----+-----------+--------------------------------+------+--+
| id  |   name    |           relations            | age  |
+-----+-----------+--------------------------------+------+--+
| 1   | zhangsan  | ["father","mother","brother"]  | 28   |
| 2   | lisi      | ["father","mother","brother"]  | 22   |
| 3   | wangwu    | ["father","mother","sister"]   | 29   |
| 4   | mayun     | ["father","mother"]            | 26   |
+-----+-----------+--------------------------------+------+--+
4 rows selected (0.129 seconds)
Syntax: map_values(Map<K,V>) returns an array of the map's values
Example:
-- Find the names of each person's relatives
select id,name,map_values(family_members) as relations,age from t_family;
+-----+-----------+-------------------------------------+------+--+
| id  |   name    |              relations              | age  |
+-----+-----------+-------------------------------------+------+--+
| 1   | zhangsan  | ["xiaoming","xiaohuang","xiaoxu"]   | 28   |
| 2   | lisi      | ["mayun","huangyi","guanyu"]        | 22   |
| 3   | wangwu    | ["wangjianlin","ruhua","jingtian"]  | 29   |
| 4   | mayun     | ["mayongzhen","angelababy"]         | 26   |
+-----+-----------+-------------------------------------+------+--+
4 rows selected (0.132 seconds)
Purpose: get_json_object parses a JSON string and returns the value addressed by a '$.key' path
Syntax:
get_json_object(json_string, '$.key')
Example:
select get_json_object('{"key1":3333, "key2": 4444}', '$.key1');
+-------+--+
|  _c0  |
+-------+--+
| 3333  |
+-------+--+
Purpose: json_tuple extracts the values of several keys from a JSON string in one call
Syntax:
json_tuple(json_string, key1, key2) as (key1, key2)
Example:
select json_tuple('{"key1":3333, "key2": 4444}', 'key1', 'key2') as (key1, key2);
+-------+-------+--+
| key1  | key2  |
+-------+-------+--+
| 3333  | 4444  |
+-------+-------+--+
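To make the difference between the two JSON functions concrete, here is a rough Python model of their behaviour for top-level keys only. It is a sketch under stated assumptions: the Python functions merely borrow the Hive names, nested paths are not handled, and unlike Hive (which returns strings) the values come back as Python objects.

```python
import json


def get_json_object(json_text, path):
    """Return the value for a top-level '$.key' path, or None if unavailable."""
    if not path.startswith("$."):
        return None  # only the simple '$.key' form is modelled here
    try:
        doc = json.loads(json_text)
    except ValueError:  # invalid JSON yields no value
        return None
    return doc.get(path[2:]) if isinstance(doc, dict) else None


def json_tuple(json_text, *keys):
    """Parse the JSON once and return one value per requested key."""
    try:
        doc = json.loads(json_text)
    except ValueError:
        doc = {}
    if not isinstance(doc, dict):
        doc = {}
    return tuple(doc.get(key) for key in keys)


raw = '{"key1":3333, "key2": 4444}'
print(get_json_object(raw, "$.key1"))
print(json_tuple(raw, "key1", "key2"))
```

The design difference the sketch highlights: get_json_object takes one path per call, while json_tuple parses the string once and returns several columns, which is why it needs the `as (key1, key2)` alias list.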
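Window functions in Hive are similar to window functions in other SQL dialects. They are used for data analysis work, typically OLAP (online analytical processing).
SQL has a class of functions called aggregate functions, such as sum(), avg(), and max(). They collapse multiple rows into one according to some rule, so the result generally has fewer rows than the input. Sometimes, however, we want to show both the detail rows and the aggregated value, and this is where window functions come in.
Before digging into the over clause, keep one thing in mind: in SQL processing, window functions are evaluated in the last step (after WHERE, GROUP BY, and HAVING), and only before the final ORDER BY.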
Data preparation:
-- Create the table
create table t_order(name string, orderdate string, cost int) row format delimited fields terminated by ',';
-- Load the data
load data local inpath '/root/hiveData/t_order.txt' into table t_order;
-- Show the data
select * from t_order;
+---------------+--------------------+---------------+--+
| t_order.name  | t_order.orderdate  | t_order.cost  |
+---------------+--------------------+---------------+--+
| jack          | 2015-01-01         | 10            |
| tony          | 2015-01-02         | 15            |
| jack          | 2015-02-03         | 23            |
| tony          | 2015-01-04         | 29            |
| jack          | 2015-01-05         | 46            |
| jack          | 2015-04-06         | 42            |
| tony          | 2015-01-07         | 50            |
| jack          | 2015-01-08         | 55            |
| mart          | 2015-04-08         | 62            |
| mart          | 2015-04-09         | 68            |
| neil          | 2015-05-10         | 12            |
| mart          | 2015-04-11         | 75            |
| neil          | 2015-06-12         | 80            |
| mart          | 2015-04-13         | 94            |
+---------------+--------------------+---------------+--+
14 rows selected (0.159 seconds)
Suppose we want to list the customers who made purchases in April 2015 together with the total count. A window function can do this.
Example:
select name,count(*) over() from t_order where substring(orderdate, 1, 7)='2015-04';
+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
| mart  | 5               |
| mart  | 5               |
| mart  | 5               |
| mart  | 5               |
| jack  | 5               |
+-------+-----------------+--+
5 rows selected (1.857 seconds)
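-- Note: normally a plain column such as name cannot be selected next to count(*) without GROUP BY; combined with over(), the aggregated value is shown alongside every detail row
So April 2015 actually has 5 purchase records: mart bought 4 times and jack bought once. In most cases we only want the deduplicated result. There are two ways to get it: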
-- distinct approach
select distinct name,count(*) over ()
from t_order
where substring(orderdate,1,7) = '2015-04';

-- group by approach
select name,count(*) over ()
from t_order
where substring(orderdate,1,7) = '2015-04'
group by name;

-- Both produce the following output, which shows that over() is evaluated afterwards
+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
| mart  | 2               |
| jack  | 2               |
+-------+-----------------+--+
2 rows selected (2.889 seconds)
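The evaluation order for the group by variant can be reproduced with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for Hive, it assumes the bundled SQLite is 3.25 or newer (the first version with window functions), `substr` replaces Hive's `substring`, and the distinct variant is left out because engines may differ on it.

```python
import sqlite3

# The t_order sample rows from the data preparation step.
orders = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t_order(name text, orderdate text, cost integer)")
conn.executemany("insert into t_order values (?, ?, ?)", orders)

# Without grouping: the window sees the 5 filtered detail rows.
detail = conn.execute(
    "select name, count(*) over () from t_order "
    "where substr(orderdate, 1, 7) = '2015-04'"
).fetchall()

# With group by: grouping runs first, so the window sees only 2 rows.
grouped = conn.execute(
    "select name, count(*) over () from t_order "
    "where substr(orderdate, 1, 7) = '2015-04' group by name"
).fetchall()

print(detail)
print(sorted(grouped))
conn.close()
```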
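The first thing that can follow over is partition by. The partition by clause, also called the query partition clause, is very similar to GROUP BY: both split the data into groups by boundary values. The function in front of over is computed within each partition, and it starts over when it crosses into the next partition.
Requirement: show each customer's purchase details together with the monthly purchase total.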
-- month(orderdate) extracts the month
select name,orderdate,cost,sum(cost) over(partition by month(orderdate))
from t_order;
+-------+-------------+-------+---------------+--+
| name  |  orderdate  | cost  | sum_window_0  |
+-------+-------------+-------+---------------+--+
| jack  | 2015-01-01  | 10    | 205           |
| jack  | 2015-01-08  | 55    | 205           |
| tony  | 2015-01-07  | 50    | 205           |
| jack  | 2015-01-05  | 46    | 205           |
| tony  | 2015-01-04  | 29    | 205           |
| tony  | 2015-01-02  | 15    | 205           |
| jack  | 2015-02-03  | 23    | 23            |
| mart  | 2015-04-13  | 94    | 341           |
| jack  | 2015-04-06  | 42    | 341           |
| mart  | 2015-04-11  | 75    | 341           |
| mart  | 2015-04-09  | 68    | 341           |
| mart  | 2015-04-08  | 62    | 341           |
| neil  | 2015-05-10  | 12    | 12            |
| neil  | 2015-06-12  | 80    | 80            |
+-------+-------------+-------+---------------+--+
14 rows selected (1.56 seconds)
The data is now summed per month.
The order by clause forces the input rows into a sorted order (as noted earlier, window functions run last in a SQL statement, so the query's result set can be thought of as their input). order by is essential for functions such as row_number(), lead(), and lag(), because their results mean nothing on unordered data. Note also that once order by is present, aggregates such as count() and min() are no longer computed over the whole partition. They are computed from the start of the partition up to the current row, so they become running values.
Example:
-- Suppose we want a running total of cost within each month. For this we add an order by clause.
select name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate)
from t_order;
+-------+-------------+-------+---------------+--+
| name  |  orderdate  | cost  | sum_window_0  |
+-------+-------------+-------+---------------+--+
| jack  | 2015-01-01  | 10    | 10            |
| tony  | 2015-01-02  | 15    | 25            |
| tony  | 2015-01-04  | 29    | 54            |
| jack  | 2015-01-05  | 46    | 100           |
| tony  | 2015-01-07  | 50    | 150           |
| jack  | 2015-01-08  | 55    | 205           |
| jack  | 2015-02-03  | 23    | 23            |
| jack  | 2015-04-06  | 42    | 42            |
| mart  | 2015-04-08  | 62    | 104           |
| mart  | 2015-04-09  | 68    | 172           |
| mart  | 2015-04-11  | 75    | 247           |
| mart  | 2015-04-13  | 94    | 341           |
| neil  | 2015-05-10  | 12    | 12            |
| neil  | 2015-06-12  | 80    | 80            |
+-------+-------------+-------+---------------+--+
14 rows selected (1.7 seconds)
-- As shown above, the rows are partitioned by month and sorted within each month
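Above we used the partition by clause to split the data into groups. For a finer-grained split we need the window clause.
Two concepts need to be understood first:
If only partition by is used and order by is not specified, the aggregate is computed over the whole partition.
If order by is used without a window clause, the default frame runs from the start of the partition to the current row (range between unbounded preceding and current row). When a single select contains several window functions, they do not affect each other; each applies its own rules.
The window clause:
- PRECEDING: rows before the current row
- FOLLOWING: rows after the current row
- CURRENT ROW: the current row
- UNBOUNDED: no bound; UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition
Example: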
-- Partition by name, order by purchase date, and accumulate cost, combining the window clause as follows
select name,orderdate,cost,
  sum(cost) over() as sample1, -- sum over all rows
  sum(cost) over(partition by name) as sample2, -- partition by name, sum within the partition
  sum(cost) over(partition by name order by orderdate) as sample3, -- partition by name, running sum within the partition
  sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4, -- same as sample3: from the start to the current row
  sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the previous row
  sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING) as sample6, -- current row plus one row before and one row after
  sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7 -- current row and all following rows
from t_order;
+-------+-------------+-------+----------+----------+----------+----------+----------+----------+----------+--+
| name  |  orderdate  | cost  | sample1  | sample2  | sample3  | sample4  | sample5  | sample6  | sample7  |
+-------+-------------+-------+----------+----------+----------+----------+----------+----------+----------+--+
| jack  | 2015-01-01  | 10    | 661      | 176      | 10       | 10       | 10       | 56       | 176      |
| jack  | 2015-01-05  | 46    | 661      | 176      | 56       | 56       | 56       | 111      | 166      |
| jack  | 2015-01-08  | 55    | 661      | 176      | 111      | 111      | 101      | 124      | 120      |
| jack  | 2015-02-03  | 23    | 661      | 176      | 134      | 134      | 78       | 120      | 65       |
| jack  | 2015-04-06  | 42    | 661      | 176      | 176      | 176      | 65       | 65       | 42       |
| mart  | 2015-04-08  | 62    | 661      | 299      | 62       | 62       | 62       | 130      | 299      |
| mart  | 2015-04-09  | 68    | 661      | 299      | 130      | 130      | 130      | 205      | 237      |
| mart  | 2015-04-11  | 75    | 661      | 299      | 205      | 205      | 143      | 237      | 169      |
| mart  | 2015-04-13  | 94    | 661      | 299      | 299      | 299      | 169      | 169      | 94       |
| neil  | 2015-05-10  | 12    | 661      | 92       | 12       | 12       | 12       | 92       | 92       |
| neil  | 2015-06-12  | 80    | 661      | 92       | 92       | 92       | 92       | 92       | 80       |
| tony  | 2015-01-02  | 15    | 661      | 94       | 15       | 15       | 15       | 44       | 94       |
| tony  | 2015-01-04  | 29    | 661      | 94       | 44       | 44       | 44       | 94       | 79       |
| tony  | 2015-01-07  | 50    | 661      | 94       | 94       | 94       | 79       | 79       | 50       |
+-------+-------------+-------+----------+----------+----------+----------+----------+----------+----------+--+
14 rows selected (4.959 seconds)
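The frame arithmetic behind sample4 to sample7 can be checked by hand for one partition. The sketch below models a `rows between` frame in plain Python; the function name `framed_sum` and its convention of passing None for UNBOUNDED are assumptions made for this illustration, not Hive API.

```python
def framed_sum(values, preceding, following):
    """Sum each row's ROWS frame; None means UNBOUNDED on that side."""
    n = len(values)
    sums = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        sums.append(sum(values[start:end]))
    return sums


# jack's costs, ordered by orderdate within the partition.
jack = [10, 46, 55, 23, 42]

sample4 = framed_sum(jack, None, 0)  # unbounded preceding .. current row
sample5 = framed_sum(jack, 1, 0)     # 1 preceding .. current row
sample6 = framed_sum(jack, 1, 1)     # 1 preceding .. 1 following
sample7 = framed_sum(jack, 0, None)  # current row .. unbounded following

print(sample4)  # [10, 56, 111, 134, 176]
print(sample5)  # [10, 56, 101, 78, 65]
print(sample6)  # [56, 111, 124, 120, 65]
print(sample7)  # [176, 166, 120, 65, 42]
```

These lists match jack's rows in the result table, which is a quick way to confirm how a frame is clipped at the partition edges.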
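Purpose: ntile(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row
Note:
ntile does not support rows between
Usage example: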
-- Suppose we want the top third of each customer's transactions by purchase amount. ntile can be used for this.
select name,orderdate,cost,
  ntile(3) over() as sample1, -- split all rows into 3 buckets
  ntile(3) over(partition by name), -- partition by name, split each partition into 3 buckets
  ntile(3) over(order by cost), -- order all rows by cost ascending, split into 3 buckets
  ntile(3) over(partition by name order by cost) -- partition by name, order by cost ascending within the partition, split into 3 buckets
from t_order;
+-------+-------------+-------+----------+------+------+-----------------+--+
| name  |  orderdate  | cost  | sample1  | _c4  | _c5  | ntile_window_3  |
+-------+-------------+-------+----------+------+------+-----------------+--+
| jack  | 2015-01-01  | 10    | 3        | 1    | 1    | 1               |
| jack  | 2015-02-03  | 23    | 3        | 1    | 1    | 1               |
| jack  | 2015-04-06  | 42    | 2        | 2    | 2    | 2               |
| jack  | 2015-01-05  | 46    | 2        | 2    | 2    | 2               |
| jack  | 2015-01-08  | 55    | 2        | 3    | 2    | 3               |
| mart  | 2015-04-08  | 62    | 2        | 1    | 2    | 1               |
| mart  | 2015-04-09  | 68    | 1        | 2    | 3    | 1               |
| mart  | 2015-04-11  | 75    | 1        | 3    | 3    | 2               |
| mart  | 2015-04-13  | 94    | 1        | 1    | 3    | 3               |
| neil  | 2015-05-10  | 12    | 1        | 2    | 1    | 1               |
| neil  | 2015-06-12  | 80    | 1        | 1    | 3    | 2               |
| tony  | 2015-01-02  | 15    | 3        | 2    | 1    | 1               |
| tony  | 2015-01-04  | 29    | 3        | 3    | 1    | 2               |
| tony  | 2015-01-07  | 50    | 2        | 1    | 2    | 3               |
+-------+-------------+-------+----------+------+------+-----------------+--+
14 rows selected (5.981 seconds)
In the output above, the rows where the last column (ntile_window_3, the fourth ntile expression) equals 1 are the first third within each customer. Because the sort is ascending, that is the lowest-cost third; ordering by cost desc instead would put the highest amounts in bucket 1.
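How ntile distributes rows when the partition size is not a multiple of n is easy to misjudge, so here is a small Python model of the bucket assignment. It is a sketch of the usual ntile rule (bucket sizes differ by at most one and the larger buckets come first); the Python function `ntile` is written for this illustration and only mirrors the Hive function's name.

```python
def ntile(n, row_count):
    """Return the bucket number (1..n) for each of row_count ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


print(ntile(3, 5))  # jack, 5 rows: [1, 1, 2, 2, 3]
print(ntile(3, 4))  # mart, 4 rows: [1, 1, 2, 3]
print(ntile(3, 3))  # tony, 3 rows: [1, 2, 3]
print(ntile(3, 2))  # neil, 2 rows: [1, 2]
```

These assignments agree with the ntile_window_3 column in the table above, where each customer's rows are ordered by cost.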
lag and lead are commonly used window functions that return values from the rows before and after the current row.
Example:
-- Using the order table: to see each customer's previous purchase date, query like this
select name,orderdate,cost,
  lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate) as time1,
  lag(orderdate,2) over(partition by name order by orderdate) as time2
from t_order;
+-------+-------------+-------+-------------+-------------+--+
| name  |  orderdate  | cost  |    time1    |    time2    |
+-------+-------------+-------+-------------+-------------+--+
| jack  | 2015-01-01  | 10    | 1900-01-01  | NULL        |
| jack  | 2015-01-05  | 46    | 2015-01-01  | NULL        |
| jack  | 2015-01-08  | 55    | 2015-01-05  | 2015-01-01  |
| jack  | 2015-02-03  | 23    | 2015-01-08  | 2015-01-05  |
| jack  | 2015-04-06  | 42    | 2015-02-03  | 2015-01-08  |
| mart  | 2015-04-08  | 62    | 1900-01-01  | NULL        |
| mart  | 2015-04-09  | 68    | 2015-04-08  | NULL        |
| mart  | 2015-04-11  | 75    | 2015-04-09  | 2015-04-08  |
| mart  | 2015-04-13  | 94    | 2015-04-11  | 2015-04-09  |
| neil  | 2015-05-10  | 12    | 1900-01-01  | NULL        |
| neil  | 2015-06-12  | 80    | 2015-05-10  | NULL        |
| tony  | 2015-01-02  | 15    | 1900-01-01  | NULL        |
| tony  | 2015-01-04  | 29    | 2015-01-02  | NULL        |
| tony  | 2015-01-07  | 50    | 2015-01-04  | 2015-01-02  |
+-------+-------------+-------+-------------+-------------+--+
14 rows selected (1.6 seconds)
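The offset and default arguments of lag, and the mirrored behaviour of lead, can be modelled over a single ordered partition. The following Python sketch assumes non-negative offsets; the function names mirror the Hive ones only for readability, and None plays the role of NULL.

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` when out of range."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def lead(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` when out of range."""
    return [
        values[i + offset] if i + offset < len(values) else default
        for i in range(len(values))
    ]


# jack's purchase dates, already ordered by orderdate.
jack_dates = ["2015-01-01", "2015-01-05", "2015-01-08", "2015-02-03", "2015-04-06"]

time1 = lag(jack_dates, 1, "1900-01-01")  # previous purchase, with a default
time2 = lag(jack_dates, 2)                # two purchases back, None for NULL
next_purchase = lead(jack_dates, 1)       # following purchase

print(time1)
print(time2)
print(next_purchase)
```

For jack, time1 and time2 reproduce the corresponding columns of the table above.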
first_value returns the first value within the partition, after sorting, up to the current row
last_value returns the last value within the partition, after sorting, up to the current row
select name,orderdate,cost,
  first_value(orderdate) over(partition by name order by orderdate) as time1,
  last_value(orderdate) over(partition by name order by orderdate) as time2
from t_order;
+-------+-------------+-------+-------------+-------------+--+
| name  |  orderdate  | cost  |    time1    |    time2    |
+-------+-------------+-------+-------------+-------------+--+
| jack  | 2015-01-01  | 10    | 2015-01-01  | 2015-01-01  |
| jack  | 2015-01-05  | 46    | 2015-01-01  | 2015-01-05  |
| jack  | 2015-01-08  | 55    | 2015-01-01  | 2015-01-08  |
| jack  | 2015-02-03  | 23    | 2015-01-01  | 2015-02-03  |
| jack  | 2015-04-06  | 42    | 2015-01-01  | 2015-04-06  |
| mart  | 2015-04-08  | 62    | 2015-04-08  | 2015-04-08  |
| mart  | 2015-04-09  | 68    | 2015-04-08  | 2015-04-09  |
| mart  | 2015-04-11  | 75    | 2015-04-08  | 2015-04-11  |
| mart  | 2015-04-13  | 94    | 2015-04-08  | 2015-04-13  |
| neil  | 2015-05-10  | 12    | 2015-05-10  | 2015-05-10  |
| neil  | 2015-06-12  | 80    | 2015-05-10  | 2015-06-12  |
| tony  | 2015-01-02  | 15    | 2015-01-02  | 2015-01-02  |
| tony  | 2015-01-04  | 29    | 2015-01-02  | 2015-01-04  |
| tony  | 2015-01-07  | 50    | 2015-01-02  | 2015-01-07  |
+-------+-------------+-------+-------------+-------------+--+
14 rows selected (1.588 seconds)
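row_number is very widely used and is the best choice for plain sequential numbering. It generates a sequence number for every returned row, in order and without repeats. Note that row_number must be given an over clause that orders by some column for the numbering to be well defined.
rank returns the rank of each row within its partition of the result set; a row's rank is one plus the number of rows ranked ahead of it. Put simply, rank ranks the returned rows, but unlike row_number it takes ties in the order by column into account: rows with equal values receive the same rank, and the next distinct value skips the tied positions, so the numbers can have gaps.
dense_rank works like rank, except that the numbers it generates are consecutive, whereas rank may leave gaps. When ties occur, dense_rank does not skip positions; the next rank follows directly. Within each partition, rank() is a skipping rank: two rows tied for first place are followed by third place. dense_rank() is a consecutive rank: two rows tied for first place are followed by second place.
An example makes this more concrete:
Suppose there is a student table, student, with each student's name, score, and course id.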
select * from student;
+-------------+---------------+----------------+-----------------+--+
| student.id  | student.name  | student.score  | student.course  |
+-------------+---------------+----------------+-----------------+--+
| 5           | elic          | 70             | 1               |
| 4           | dock          | 100            | 1               |
| 3           | clark         | 80             | 1               |
| 2           | bob           | 90             | 1               |
| 1           | alce          | 60             | 1               |
| 10          | jacky         | 80             | 2               |
| 9           | iris          | 60             | 2               |
| 8           | hill          | 70             | 2               |
| 7           | grace         | 50             | 2               |
| 6           | frank         | 70             | 2               |
+-------------+---------------+----------------+-----------------+--+
10 rows selected (0.115 seconds)
Now rank the students' scores within each course:
-- row_number(): sequential numbering
select name,course,row_number() over(partition by course order by score desc) rank from student;
+--------+---------+-------+--+
|  name  | course  | rank  |
+--------+---------+-------+--+
| dock   | 1       | 1     |
| bob    | 1       | 2     |
| clark  | 1       | 3     |
| elic   | 1       | 4     |
| alce   | 1       | 5     |
| jacky  | 2       | 1     |
| frank  | 2       | 2     |
| hill   | 2       | 3     |
| iris   | 2       | 4     |
| grace  | 2       | 5     |
+--------+---------+-------+--+

-- rank(): skipping rank; two rows tied for first are followed by third
select name,course,rank() over(partition by course order by score desc) rank from student;
+--------+---------+-------+--+
|  name  | course  | rank  |
+--------+---------+-------+--+
| dock   | 1       | 1     |
| bob    | 1       | 2     |
| clark  | 1       | 3     |
| elic   | 1       | 4     |
| alce   | 1       | 5     |
| jacky  | 2       | 1     |
| frank  | 2       | 2     |
| hill   | 2       | 2     |
| iris   | 2       | 4     |
| grace  | 2       | 5     |
+--------+---------+-------+--+

-- dense_rank(): consecutive rank; two rows tied for first are followed by second
select name,course,dense_rank() over(partition by course order by score desc) rank from student;
+--------+---------+-------+--+
|  name  | course  | rank  |
+--------+---------+-------+--+
| dock   | 1       | 1     |
| bob    | 1       | 2     |
| clark  | 1       | 3     |
| elic   | 1       | 4     |
| alce   | 1       | 5     |
| jacky  | 2       | 1     |
| frank  | 2       | 2     |
| hill   | 2       | 2     |
| iris   | 2       | 3     |
| grace  | 2       | 4     |
+--------+---------+-------+--+
10 rows selected (1.635 seconds)
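The three numbering rules can be compared side by side on one partition. This Python sketch computes all three for a list of scores that is already sorted in descending order; the helper name `rankings` is hypothetical and the code only models the tie-handling rules, not how Hive executes them.

```python
def rankings(scores_desc):
    """Return (row_number, rank, dense_rank) lists for pre-sorted scores."""
    row_numbers, ranks, dense_ranks = [], [], []
    for i, score in enumerate(scores_desc):
        row_numbers.append(i + 1)  # always increases, ties or not
        if i > 0 and score == scores_desc[i - 1]:
            # A tie repeats the previous rank in both schemes.
            ranks.append(ranks[-1])
            dense_ranks.append(dense_ranks[-1])
        else:
            ranks.append(i + 1)  # skips the positions used by ties
            dense_ranks.append(dense_ranks[-1] + 1 if dense_ranks else 1)
    return row_numbers, ranks, dense_ranks


# Course 2 scores in descending order: jacky, frank, hill, iris, grace.
course2 = [80, 70, 70, 60, 50]
row_numbers, ranks, dense_ranks = rankings(course2)

print(row_numbers)  # [1, 2, 3, 4, 5]
print(ranks)        # [1, 2, 2, 4, 5]
print(dense_ranks)  # [1, 2, 2, 3, 4]
```

The tie between frank and hill at 70 is the only place where the three outputs diverge, matching the three result tables above.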
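About partition by:
The partition by keyword is part of the analytic function syntax (in Oracle, for example) and is used to partition the result set. It differs from the GROUP BY clause used with aggregate functions in that it only ranks the original rows and can return multiple records per group (the row count is unchanged), whereas GROUP BY aggregates the original data and generally returns a single summary row per group.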
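TIPS:
In databases that treat NULL as the largest value (Oracle, for example), rank() over() with a descending sort can place rows whose sort column is NULL first, which distorts the ranking. Default null ordering differs between engines, and Hive supports explicit NULLS FIRST / NULLS LAST from version 2.1.0.
To avoid depending on the default, state the null ordering explicitly: rank() over(partition by course order by score desc nulls last)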
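Summary:
When using ranking functions, keep the following three points in mind:
1. A ranking function must have an OVER clause.
2. The OVER clause of a ranking function must contain ORDER BY.
3. Ranking starts from 1 within each partition.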