Hive currently supports the following complex data types (a usage sketch of their constructor functions follows this list):
map
(key1, value1, key2, value2, ...) Creates a map with the given key/value pairs
struct
(val1, val2, val3, ...) Creates a struct with the given field values. Struct field names will be col1, col2, ...
named_struct
(name1, val1, name2, val2, ...) Creates a struct with the given field names and values. (as of Hive 0.8.0)
array
(val1, val2, ...) Creates an array with the given elements
create_union
(tag, val1, val2, ...) Creates a union type with the value that is being pointed to by the tag parameter
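For reference, the same types can also be constructed inline with these functions. A minimal sketch, assuming a Hive version that permits SELECT without a FROM clause (otherwise select from any one-row table); the literal values are invented for illustration:

select map('k1', 1, 'k2', 2) as m,
       struct('tom', 30) as s,
       named_struct('name', 'tom', 'age', 30) as ns,
       array(1, 2, 3) as arr,
       create_union(0, 'a', 1) as u;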
Create a table with an array column:

create table person(name string, work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Data:

biansutao beijing,shanghai,tianjin,hangzhou
linan changchu,chengdu,wuhan

Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/person.txt' OVERWRITE INTO TABLE person;

Query:

hive> select * from person;
biansutao ["beijing","shanghai","tianjin","hangzhou"]
linan ["changchu","chengdu","wuhan"]
Time taken: 0.355 seconds
hive> select name from person;
linan
biansutao
Time taken: 12.397 seconds
hive> select work_locations[0] from person;
changchu
beijing
Time taken: 13.214 seconds
hive> select work_locations from person;
["changchu","chengdu","wuhan"]
["beijing","shanghai","tianjin","hangzhou"]
Time taken: 13.755 seconds
hive> select work_locations[3] from person;
NULL
hangzhou
Time taken: 12.722 seconds
hive> select work_locations[4] from person;
NULL
NULL
Time taken: 15.958 seconds
Create a table with a map column:

create table score(name string, score map<string,int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

Data to load:

biansutao '數學':80,'語文':89,'英語':95
jobs '語文':60,'數學':80,'英語':99

Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/score.txt' OVERWRITE INTO TABLE score;

Query:

hive> select * from score;
biansutao {"數學":80,"語文":89,"英語":95}
jobs {"語文":60,"數學":80,"英語":99}
Time taken: 0.665 seconds
hive> select name from score;
jobs
biansutao
Time taken: 19.778 seconds
hive> select t.score from score t;
{"語文":60,"數學":80,"英語":99}
{"數學":80,"語文":89,"英語":95}
Time taken: 19.353 seconds
hive> select t.score['語文'] from score t;
60
89
Time taken: 13.054 seconds
hive> select t.score['英語'] from score t;
99
95
Time taken: 13.769 seconds
Create a table with a struct column:

CREATE TABLE test(id int, course struct<course:string,score:int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Data:

1 english,80
2 math,89
3 chinese,95

Load:

LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' OVERWRITE INTO TABLE test;

Query:

hive> select * from test;
OK
1 {"course":"english","score":80}
2 {"course":"math","score":89}
3 {"course":"chinese","score":95}
Time taken: 0.275 seconds
hive> select course from test;
{"course":"english","score":80}
{"course":"math","score":89}
{"course":"chinese","score":95}
Time taken: 44.968 seconds
hive> select t.course.course from test t;
english
math
chinese
Time taken: 15.827 seconds
hive> select t.course.score from test t;
80
89
95
Time taken: 13.235 seconds
A more deeply nested type, a map whose values are arrays:

create table test1(id int, a MAP<STRING,ARRAY<STRING>>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
MAP KEYS TERMINATED BY ':';

Data:

1 english:80,90,70
2 math:89,78,86
3 chinese:99,100,82

Load:

LOAD DATA LOCAL INPATH '/home/hadoop/test1.txt' OVERWRITE INTO TABLE test1;
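The original stops after loading. A hedged query sketch against test1, assuming the nested delimiters parse as intended (the serde assigns delimiters by nesting level, so this is an assumption rather than verified output); the bracket syntax is the same as shown for maps and arrays above:

hive> select id, a['english'] from test1;
hive> select id, a['english'][0] from test1;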
We will skip the common functions, which are similar to those in standard SQL. What follows focuses mostly on functions specific to HQL.
Hive functions fall roughly into the following categories: Built-in, Misc., UDF, UDTF, and UDAF.
We will pick a few that standard SQL does not have but that are frequently used for statistical analysis in Hive SQL.
array_contains is a built-in function that operates on collections; example usage:
create EXTERNAL table IF NOT EXISTS userInfo (
  id int, sex string, age int, name string, email string, sd string, ed string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/hive/dw';

select * from userinfo
where sex='male'
  and (id!=1 and id!=2 and id!=3 and id!=4 and id!=5)
  and age < 30;

select * from (
  select * from userinfo
  where sex='male'
    and !array_contains(split('1,2,3,4,5',','), cast(id as string))
) tb1
where tb1.age < 30;
The test data used to populate this table can be generated automatically with the script at the following link:
http://my.oschina.net/leejun2005/blog/76631
Test data for the get_json_object example below:
first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} third
first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],"bicycle":{"price":19.952,"color":"red2"}},"email":"amy@only_for_json_udf_test.net","owner":"amy2"} third
first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],"bicycle":{"price":19.953,"color":"red3"}},"email":"amy@only_for_json_udf_test.net","owner":"amy3"} third
create external table if not exists t_json(f1 string, f2 string, f3 string)
row format delimited fields TERMINATED BY ' '
location '/test/json';

select get_json_object(t_json.f2, '$.owner') from t_json;
SELECT * from t_json where get_json_object(t_json.f2, '$.store.fruit[0].weight') = 9;
SELECT get_json_object(t_json.f2, '$.non_exist_key') FROM t_json;
Pay particular attention to the UDTF issue here; the official documentation explains:
json_tuple
A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.
For example,
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;
should be changed to
select a.timestamp, b.* from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
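Applied to the t_json table defined earlier, a hedged sketch (the key names come from the test data above; this query is not from the original post):

select f1, b.owner, b.email
from t_json lateral view json_tuple(f2, 'owner', 'email') b as owner, email;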
UDTFs (User-Defined Table-Generating Functions) address the need to map one input row to multiple output rows (one-to-many mapping).
LATERAL VIEW makes it easy to combine the row-expanding result set produced by a UDTF with the rest of the query, because using a UDTF directly in SELECT is restricted: the SELECT list may contain only that single expression. Not only are multiple UDTFs disallowed, even a single UDTF plus other columns fails; Hive reports that only a single expression is supported with UDTFs. For example:
hive> select my_test("abcef:aa") as qq, 'abcd' from sunwg01;
FAILED: Error in semantic analysis: Only a single expression in the SELECT clause is supported with UDTF’s
LATERAL VIEW can satisfy the requirement above. Its syntax is as follows:
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
hive> create table sunwg ( a array<int>, b array<string> )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ',';
OK
Time taken: 1.145 seconds
hive> load data local inpath '/home/hjl/sunwg/sunwg.txt' overwrite into table sunwg;
Copying data from file:/home/hjl/sunwg/sunwg.txt
Loading data to table sunwg
OK
Time taken: 0.162 seconds
hive> select * from sunwg;
OK
[10,11] ["tom","mary"]
[20,21] ["kate","tim"]
Time taken: 0.069 seconds
hive>
> SELECT a, name
> FROM sunwg LATERAL VIEW explode(b) r1 AS name;
OK
[10,11] tom
[10,11] mary
[20,21] kate
[20,21] tim
Time taken: 8.497 seconds
hive> SELECT id, name
> FROM sunwg LATERAL VIEW explode(a) r1 AS id
> LATERAL VIEW explode(b) r2 AS name;
OK
10 tom
10 mary
11 tom
11 mary
20 kate
20 tim
21 kate
21 tim
Time taken: 9.687 seconds
Test data:
url1 http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
url3 https://www.google.com.hk/#hl=zh-CN&newwindow=1&safe=strict&q=hive+translate+example&oq=hive+translate+example&gs_l=serp.3...10174.11861.6.12051.8.8.0.0.0.0.132.883.0j7.7.0...0.0...1c.1j4.8.serp.0B9C1T_n0Hs&bav=on.2,or.&bvm=bv.44770516,d.aGc&fp=e13e41a6b9dab3f6&biw=1241&bih=589
create external table if not exists t_url(f1 string, f2 string)
row format delimited fields TERMINATED BY ' '
location '/test/url';

SELECT f1, b.*
FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;
Result:
url1 facebook.com /path1/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL
url3 www.google.com.hk / NULL NULL
explode is one of Hive's built-in table-generating functions (Built-in Table-Generating Functions, UDTF). It mainly solves the 1-to-N problem: it splits one input row into multiple rows, for example one row per element of an array, emitted as a virtual table. Note the following caveats:
Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
No other expressions are allowed in SELECT: SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTFs can't be nested: SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported: SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
From the mechanics and syntax above we can see that:
a UDTF cannot be mixed with other non-UDTF columns in the SELECT list,
UDTFs cannot be nested,
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported.
Also, a UDTF appearing in SELECT must be given a column alias, otherwise an error is raised:
SELECT explode(myCol) AS myNewCol FROM myTable;
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;
LATERAL VIEW is the conjunction Hive provides for UDTFs; it solves the problem that a UDTF cannot be accompanied by additional SELECT columns. Suppose we want to split a column of a Hive table and turn it into a 1-to-N pattern, i.e. one row into multiple rows: Hive does not allow us to add any other SELECT expression outside the UDTF.
For example, the ids of users who logged into a certain game are stored in a single field user_ids, and we want to apply a UDTF to each row so that it outputs multiple rows:

select game_id, explode(split(user_ids, '\\[\\[\\[')) as user_id
from login_game_log where dt='2014-05-15';
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions

The semantic analysis error tells us that a UDTF cannot be combined with other SELECT expressions. So what if we do need that? This is where LATERAL VIEW comes in.
LATERAL VIEW is meant to be used together with UDTFs such as explode. It puts the rows generated by the UDTF into a virtual table, and this virtual table (1 to N) is then joined with the input row, i.e. each game_id, so that columns outside the UDTF can be selected as well (the source table and the exploded virtual table are joined row by row, a direct in-row 1-join-N). This is also why LATERAL VIEW udtf(expression) must be followed by a table alias and column alias(es).
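A hedged sketch of how the failing query above could be rewritten with LATERAL VIEW (same table, columns, and separator as in the example; this query is not shown in the original post):

select t.game_id, u.user_id
from login_game_log t
lateral view explode(split(t.user_ids, '\\[\\[\\[')) u as user_id
where t.dt = '2014-05-15';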
Lateral View Syntax
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
As the grammar shows, LATERAL VIEW is used in two places:
in front of the UDTF call;
after baseTable in the FROM clause.
For example:
pageid adid_list
front_page [1, 2, 3]
contact_page [3, 4, 5]
SELECT pageid, adid FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;
pageid adid
front_page 1
front_page 2
front_page 3
contact_page 3
contact_page 4
contact_page 5
A FROM clause can be followed by multiple LATERAL VIEWs.
A FROM clause can have multiple LATERAL VIEW clauses. Subsequent LATERAL VIEWS can reference columns from any of the tables appearing to the left of the LATERAL VIEW.
Given the data:
Array<int> col1 Array<string> col2
[1, 2] ["a", "b", "c"]
[3, 4] ["d", "e", "f"]
Transformation goal:
explode both the first column and the second column at the same time, similar to a Cartesian product.
We can write it like this:
SELECT myCol1, myCol2 FROM baseTable LATERAL VIEW explode(col1) myTable1 AS myCol1 LATERAL VIEW explode(col2) myTable2 AS myCol2;
One more case: what if the array produced by the UDTF is empty?
Hive 0.12 adds the OUTER keyword; by default, rows for which the UDTF returns an empty result are dropped from the output.
With the OUTER keyword, it behaves like a LEFT OUTER JOIN: the selected columns are still output, and the UDTF's output columns are NULL.
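A hedged sketch of the OUTER behavior, modeled on the Hive wiki example (assumes a table named src with a few rows):

SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a LIMIT 3;

Each src row would still be emitted, with NULL in column a, because the exploded array is empty.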
Summary:
LATERAL VIEW usually appears together with a UDTF, to work around the restriction that a UDTF cannot share the SELECT list with other columns.
Multiple LATERAL VIEWs can produce something like a Cartesian product.
The OUTER keyword turns the otherwise-dropped empty UDTF results into NULLs, preventing data loss.
References:
http://blog.csdn.net/wf1982/article/details/7474601
http://www.cnblogs.com/ggjucheng/archive/2013/01/08/2850797.html
http://www.oratea.net/?p=650
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-parseurltuple
https://cwiki.apache.org/confluence/display/Hive/Tutorial
http://blog.csdn.net/inte_sleeper/article/details/7196114 (Hive LATERAL VIEW: splitting a column into rows)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
http://blog.csdn.net/oopsoom/article/details/26001307 (Lateral View usage and the Hive UDTF explode)
http://bit.ly/2bDuVxS (Hive window functions for complex statistical analysis of big data)