The existing data is as follows:
1 huangbo guangzhou,xianggang,shenzhen a1:30,a2:20,a3:100 beijing,112233,13522334455,500
2 xuzheng xianggang b2:50,b3:40 tianjin,223344,13644556677,600
3 wangbaoqiang beijing,zhejinag c1:200 chongqinjg,334455,15622334455,20
Create-table statement:
use class;
create table cdt(
  id int,
  name string,
  work_location array<string>,
  piaofang map<string,bigint>,
  address struct<location:string,zipcode:int,phone:string,value:int>
)
row format delimited
fields terminated by "\t"
collection items terminated by ","
map keys terminated by ":"
lines terminated by "\n";
Load the data:
0: jdbc:hive2://hadoop3:10000> load data local inpath "/home/hadoop/cdt.txt" into table cdt;
Query statements (array field):
select * from cdt;
select name from cdt;
select work_location from cdt;
select work_location[0] from cdt;
select work_location[1] from cdt;
The create-table statement and data loading are the same as in (1).
Query statements (map field):
select piaofang from cdt;
select piaofang["a1"] from cdt;
The create-table statement and data loading are the same as in (1).
Query statements (struct field):
select address from cdt;
select address.location from cdt;
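Fields inside a struct are accessed with dot notation and can be mixed freely with ordinary columns in the same query. A minimal sketch against the cdt table above (the choice of fields is just for illustration):

select name, address.location, address.phone from cdt;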
These functions are rarely used in practice.
Reference: http://yugouai.iteye.com/blog/1849192
Like a relational database, Hive also provides views. Note, however, that Hive views differ considerably from those of relational databases:
(1) There are only logical views, no materialized views;
(2) Views can only be queried; you cannot Load/Insert/Update/Delete data through them (see the example after the view commands below);
(3) When a view is created, only a copy of its metadata is saved; the subqueries that define the view are executed only when the view is queried.
create view view_cdt as select * from cdt;
show views;
desc view_cdt;    -- show the details of a specific view
select * from view_cdt;
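As noted in point (2) above, a view is read-only: Hive rejects any attempt to load or insert data through it. A minimal sketch (both statements below are expected to fail with an error, since view_cdt is a view rather than a table):

load data local inpath '/home/hadoop/cdt.txt' into table view_cdt;
insert into table view_cdt select * from cdt;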
drop view view_cdt;
For details, see http://www.cnblogs.com/qingyunzong/p/8744593.html
show functions;
desc function substr;
desc function extended substr;
When Hive's built-in functions cannot satisfy the business requirements, user-defined functions can be used instead.
UDF (user-defined function): operates on a single data row and produces a single row as output (e.g. mathematical and string functions).
UDAF (User-Defined Aggregation Function): takes multiple input rows and produces a single output row (e.g. count, max).
UDTF (User-Defined Table-generating Function): takes one input row and produces multiple output rows (e.g. explode; see the sketch below).
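To get a feel for how a UDTF behaves before writing one, the built-in explode UDTF can be used directly. A minimal sketch, reusing the cdt table defined above (the aliases t and city are just for illustration):

-- explode alone turns the array into one row per element
select explode(work_location) as city from cdt;
-- lateral view joins the exploded rows back to the other columns
select name, city from cdt lateral view explode(work_location) t as city;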
ToLowerCase.java
import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
    // The class must be public, and the evaluate method can be overloaded
    public String evaluate(String field) {
        String result = field.toLowerCase();
        return result;
    }
}
add JAR /home/hadoop/udf.jar;
0: jdbc:hive2://hadoop3:10000> create temporary function tolowercase as 'com.study.hive.udf.ToLowerCase';
0: jdbc:hive2://hadoop3:10000> select tolowercase('HELLO');
The existing raw JSON data (rating.json) is as follows:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
The data now needs to be loaded into the Hive warehouse, with the end result being a table in which each JSON record is split into movie, rate, timeStamp and uid columns.
How can this be done? (Hint: it can be accomplished with the built-in get_json_object function or with a user-defined function.)
get_json_object(string json_string, string path)
Return type: string
Description: parses the JSON string json_string and returns the content specified by path. If the input JSON string is invalid, NULL is returned. This function can only return one data item per call.
0: jdbc:hive2://hadoop3:10000> select get_json_object('{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}','$.movie');
Create the json table and load the data into it:
0: jdbc:hive2://hadoop3:10000> create table json(data string);
No rows affected (0.983 seconds)
0: jdbc:hive2://hadoop3:10000> load data local inpath '/home/hadoop/json.txt' into table json;
No rows affected (1.046 seconds)
0: jdbc:hive2://hadoop3:10000> select
. . . . . . . . . . . . . . .> get_json_object(data,'$.movie') as movie
. . . . . . . . . . . . . . .> from json;
json_tuple takes a JSON string together with a set of keys k1, k2, … and returns a tuple of the corresponding values. It is more efficient than get_json_object because multiple keys can be extracted in a single call:
0: jdbc:hive2://hadoop3:10000> select
. . . . . . . . . . . . . . .> b.b_movie,
. . . . . . . . . . . . . . .> b.b_rate,
. . . . . . . . . . . . . . .> b.b_timeStamp,
. . . . . . . . . . . . . . .> b.b_uid
. . . . . . . . . . . . . . .> from json a
. . . . . . . . . . . . . . .> lateral view json_tuple(a.data,'movie','rate','timeStamp','uid') b as b_movie,b_rate,b_timeStamp,b_uid;
Hive's TRANSFORM keyword provides the ability to call your own scripts from SQL. It is suitable when you need functionality that Hive does not provide but do not want to write a UDF.
This is best explained with a concrete example.
JSON data: {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
Requirement: convert the timestamp value into a day-of-week number.
1. First, load the rating.json file into a raw Hive table, rate_json:
create table rate_json(line string) row format delimited;
load data local inpath '/home/hadoop/rating.json' into table rate_json;
2. Create the rate table to store the fields parsed out of the JSON:
create table rate(
  movie int,
  rate int,
  unixtime int,
  userid int
) row format delimited fields terminated by '\t';
Parse the JSON and store the result in the rate table:
insert into table rate
select
  get_json_object(line,'$.movie') as movie,
  get_json_object(line,'$.rate') as rate,
  get_json_object(line,'$.timeStamp') as unixtime,
  get_json_object(line,'$.uid') as userid
from rate_json;
3. Use transform + Python to convert unixtime into weekday.
First, write a Python script file:
######## Python code ########
## vi weekday_mapper.py
#!/bin/python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movie, rate, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # print() works under both Python 2 and Python 3
    print('\t'.join([movie, rate, str(weekday), userid]))
Save the file. Then create the final table, lastjsontable, that will store the data produced by the Python script:

create table lastjsontable(
  movie int,
  rate int,
  weekday int,
  userid int
) row format delimited fields terminated by '\t';

Add the script to Hive's classpath and run the transform query:

hive> add file /home/hadoop/weekday_mapper.py;
hive> insert into table lastjsontable
      select transform(movie,rate,unixtime,userid)
      using 'python weekday_mapper.py'
      as (movie,rate,weekday,userid)
      from rate;
Finally, run a query to check whether the data is correct:
select distinct(weekday) from lastjsontable;
Supplement: the mechanism Hive uses to read data:
1. First, a concrete implementation of InputFormat (default: org.apache.hadoop.mapred.TextInputFormat) reads the file data and returns records one by one (these can be lines, or whatever counts as a "row" in your logic).
2. Then a concrete implementation of SerDe (default: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) splits each of those records into fields.
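To check which InputFormat and SerDe a given table actually uses, the table metadata can be inspected, and the defaults can also be spelled out explicitly when creating a table. A minimal sketch (the table name t_demo is just for illustration):

-- shows, among other things, the InputFormat, OutputFormat and SerDe Library of cdt
desc formatted cdt;

-- explicitly declaring the defaults that a plain-text table would get anyway
create table t_demo(id int, name string)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';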
By default, Hive only supports single-character field delimiters. If the delimiter in the data file is multi-character, as shown below:
01||huangbo
02||xuzheng
03||wangbaoqiang
Create the table:
create table t_bi_reg(id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
  'input.regex'='(.*)\\|\\|(.*)',
  'output.format.string'='%1$s %2$s'
)
stored as textfile;
Load the data and query it:
0: jdbc:hive2://hadoop3:10000> load data local inpath '/home/hadoop/data.txt' into table t_bi_reg;
No rows affected (0.747 seconds)
0: jdbc:hive2://hadoop3:10000> select a.* from t_bi_reg a;