Hive 是基於Hadoop 構建的一套數據倉庫分析系統,它提供了豐富的SQL查詢方式來分析存儲在Hadoop 分佈式文件系統中的數據,能夠將結構node
化的數據文件映射爲一張數據庫表,並提供完整的SQL查詢功能,能夠將SQL語句轉換爲MapReduce任務進行運行,經過本身的SQL 去查詢分析需python
要的內容,這套SQL 簡稱Hive SQL,使不熟悉mapreduce 的用戶很方便的利用SQL 語言查詢,彙總,分析數據。而mapreduce開發人員能夠把正則表達式
己寫的mapper 和reducer 做爲插件來支持Hive 作更復雜的數據分析。sql
它與關係型數據庫的SQL 略有不一樣,但支持了絕大多數的語句如DDL、DML 以及常見的聚合函數、鏈接查詢、條件查詢。HIVE不適合用於聯機數據庫
online)事務處理,也不提供實時查詢功能。它最適合應用在基於大量不可變數據的批處理做業。
HIVE的特色:可伸縮(在Hadoop的集羣上動態的添加設備),可擴展,容錯,輸入格式的鬆散耦合。express
Hive 的官方文檔中對查詢語言有了很詳細的描述,請參考:http://wiki.apache.org/hadoop/Hive/LanguageManual ,本文的內容大部分翻譯自該頁面,期間加入了一些在使用過程當中須要注意到的事項。apache
1. DDL 操做緩存
建表:
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
建立簡單表:
hive> CREATE TABLE pokes (foo INT, bar STRING); session
建立外部表:
建分區表
建Bucket表
建立表並建立索引字段ds
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
複製一個空表
例子
create table user_info (user_id int, cid string, ckid string, username string)
row format delimited
fields terminated by '\t'
lines terminated by '\n';
導入數據表的數據格式是:字段之間是tab鍵分割,行之間是斷行。
及要咱們的文件內容格式:
100636 100890 c5c86f4cddc15eb7 yyyvybtvt
100612 100865 97cc70d411c18b6f gyvcycy
100078 100087 ecd6026a15ffddf5 qa000100
顯示全部表:
hive> SHOW TABLES;
按正條件(正則表達式)顯示錶,
hive> SHOW TABLES '.*s';
修改表結構
表添加一列 :
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
添加一列並增長列字段註釋
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
更改表名:
hive> ALTER TABLE events RENAME TO 3koobecaf;
刪除列:
hive> DROP TABLE pokes;
增長、刪除分區
重命名錶
修改列的名字、類型、位置、註釋:
表添加一列 :
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
添加一列並增長列字段註釋
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
增長/更新列
增長表的元數據信息
改變表文件格式與組織
建立/刪除視圖
2. DML 操做:元數據存儲
hive不支持用insert語句一條一條的進行插入操做,也不支持update操做。數據是以load的方式加載到創建好的表中。數據一旦導入就不能夠修改。
向數據表內加載文件
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
加載本地數據,同時給定分區信息
例如:加載本地數據,同時給定分區信息:
加載DFS數據 ,同時給定分區信息:
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
OVERWRITE
將查詢結果插入Hive表
3. DQL 操做:數據查詢SQL
3.1 基本的Select 操做
SELECT * FROM test SORT BY amount DESC LIMIT 5
例如
按先件查詢
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
將查詢數據輸出至目錄:
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
將查詢結果輸出至本地目錄:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
選擇全部列到本地目錄 :
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
將一個表的統計結果插入另外一個表中:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
將多表數據插入到同一表中:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
將文件流直接插入文件:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like hadoop streaming). Similarly - streaming can be used on the reduce side (please see the Hive Tutorial or examples)
3.2 基於Partition的查詢
3.3 Join
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON equality_expression ( AND equality_expression )
equality_expression:
expression = expression
ON (a.id = b.id AND a.department = b.department)
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
WHERE a.ds='2010-07-07' AND b.ds='2010-07-07‘
ON (c.key=d.key AND d.ds='2009-07-07' AND c.ds='2009-07-07')
FROM a
WHERE a.key in
(SELECT b.key
FROM B);
FROM a LEFT SEMI JOIN b on (a.key = b.key)
4. 從SQL到HiveQL應轉變的習慣
四、Hive不支持將數據插入現有的表或分區中,
僅支持覆蓋重寫整個表,示例以下:
- INSERT OVERWRITE TABLE t1
- SELECT * FROM t2;
四、hive不支持INSERT INTO, UPDATE, DELETE操做
這樣的話,就不要很複雜的鎖機制來讀寫數據。
INSERT INTO syntax is only available starting in version 0.8。INSERT INTO就是在表或分區中追加數據。
五、hive支持嵌入mapreduce程序,來處理複雜的邏輯
如:
- FROM (
- MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
- FROM docs
- CLUSTER BY word
- ) a
- REDUCE word, cnt USING 'python wc_reduce.py';
--doctext: 是輸入
--word, cnt: 是map程序的輸出
--CLUSTER BY: 將wordhash後,又做爲reduce程序的輸入
而且map程序、reduce程序能夠單獨使用,如:
- FROM (
- FROM session_table
- SELECT sessionid, tstamp, data
- DISTRIBUTE BY sessionid SORT BY tstamp
- ) a
- REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
--DISTRIBUTE BY: 用於給reduce程序分配行數據
六、hive支持將轉換後的數據直接寫入不一樣的表,還能寫入分區、hdfs和本地目錄。
這樣能免除屢次掃描輸入表的開銷。
- FROM t1
- INSERT OVERWRITE TABLE t2
- SELECT t3.c2, count(1)
- FROM t3
- WHERE t3.c1 <= 20
- GROUP BY t3.c2
- INSERT OVERWRITE DIRECTORY '/output_dir'
- SELECT t3.c2, avg(t3.c1)
- FROM t3
- WHERE t3.c1 > 20 AND t3.c1 <= 30
- GROUP BY t3.c2
- INSERT OVERWRITE LOCAL DIRECTORY '/home/dir'
- SELECT t3.c2, sum(t3.c1)
- FROM t3
- WHERE t3.c1 > 30
- GROUP BY t3.c2;
5. 實際示例
建立一個表
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '/t'
STORED AS TEXTFILE;
下載示例數據文件,並解壓縮
wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
tar xvzf ml-data.tar__0.gz
加載數據到表中:
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
統計數據總量:
SELECT COUNT(1) FROM u_data;
如今作一些複雜的數據分析:
建立一個 weekday_mapper.py: 文件,做爲數據按周進行分割
import sys
import datetime
for line in sys.stdin:
line = line.strip()
userid, movieid, rating, unixtime = line.split('/t')
生成數據的周信息
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
print '/t'.join([userid, movieid, rating, str(weekday)])
使用映射腳本
//建立表,按分割符分割行中的字段值
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '/t';
//將python文件加載到系統
add FILE weekday_mapper.py;
將數據按周進行分割
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;
處理Apache Weblog 數據
將WEB日誌先用正則表達式進行組合,再按須要的條件進行組合輸入到表中
add jar ../build/contrib/hive_contrib.jar;
CREATE TABLE apachelog (host STRING,identity STRING,user STRING,time STRING,request STRING,status STRING,size STRING,referer STRING,agent STRING)ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|//[[^//]]*//]) ([^ /"]*|/"[^/"]*/") (-|[0-9]*) (-|[0-9]*)(?: ([^ /"]*|/"[^/"]*/") ([^ /"]*|/"[^/"]*/"))?","output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")STORED AS TEXTFILE;