(1) To print column headers in query results, set this explicitly:
set hive.cli.print.header=true;
(2) Everything after "--" is treated as a comment, but the CLI does not parse comments; a file containing comments can only be executed like this:
hive -f script_name
(3) -e is followed by a quoted Hive command or query; -S silences the extra output:
hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery
(4) A query that scans every partition turns into one huge MapReduce job, so if your dataset has many partitions and directories, use strict mode: once a table is partitioned, queries must include a WHERE clause on the partition column.
hive> set hive.mapred.mode=strict;
(5) Show the current database in the prompt:
set hive.cli.print.current.db=true;
(6) Set Hive job priority:
set mapred.job.priority=VERY_HIGH | HIGH | NORMAL | LOW | VERY_LOW
(VERY_LOW=1,LOW=2500,NORMAL=5000,HIGH=7500,VERY_HIGH=10000)
set mapred.job.map.capacity=M     -- run at most M map tasks concurrently
set mapred.job.reduce.capacity=N  -- run at most N reduce tasks concurrently
(7) The number of mappers in a Hive job is determined by these parameters:
mapred.min.split.size ,mapred.max.split.size ,dfs.block.size
splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
The number of maps also depends on the number of input files: with 2 input files, even if their total size is below the block size, 2 maps are produced.
mapred.reduce.tasks sets the number of reducers.
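For example, a minimal sketch of controlling the mapper and reducer counts, assuming a 128 MB dfs.block.size and the default (very large) mapred.max.split.size; the values are illustrative:
set mapred.min.split.size=268435456;  -- 256 MB
-- splitSize = max(256 MB, min(maxSize, 128 MB)) = 256 MB,
-- so a 1 GB input file is read by ~4 mappers instead of ~8
set mapred.reduce.tasks=10;           -- run the job with 10 reducers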
(1) List all partitions of a table:
SHOW PARTITIONS ext_trackflow
Check for one specific partition:
SHOW PARTITIONS ext_trackflow PARTITION(statDate='20140529');
(2) Show the full formatted table schema:
desc formatted ext_trackflow;
DESCRIBE EXTENDED ext_trackflow;
(3) Drop a partition: the partition's metadata and data are deleted together; for an external table, only the metadata is removed:
ALTER TABLE ext_trackflow DROP PARTITION (statDate='20140529');
(4) Check whether a table is external or managed (internal):
DESCRIBE EXTENDED tablename
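The answer appears in the Detailed Table Information section of the output as the tableType field; a hedged one-liner to extract it (output formatting varies across Hive versions):
hive -e "DESCRIBE EXTENDED tablename" | grep -o 'tableType:[A-Z_]*'
# prints tableType:MANAGED_TABLE or tableType:EXTERNAL_TABLE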
(5) Copy a table schema:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
Note: if you omit the EXTERNAL keyword, the new table is external or managed according to whether employees is; with EXTERNAL (and a LOCATION), it is always external.
(6) Load data into a partition of a managed table; Hive creates the directory and copies the data into the partition:
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
(7) Add data to a partition of an external table:
ALTER TABLE log_messages ADD IF NOT EXISTS PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';
Note: Hive does not check whether the partition directory exists or holds data; if it does not, queries silently return no results.
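You can verify from within the CLI that the directory exists and contains files (reusing the path above):
hive> dfs -ls hdfs://master_server/data/log_messages/2012/01/02;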
(8) Altering a table: you can alter a table at any time, but only the table's metadata changes; the actual data is unaffected. For example, changing a partition's location does not delete the old data:
ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/01/02';
(9) Change table properties:
ALTER TABLE log_messages SET TBLPROPERTIES (
'notes' = 'The process id is no longer captured; this column is always NULL'
);
(10) Change the storage format:
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;
Note: if the table is partitioned, the PARTITION clause is required.
(11) Specify a new SerDe:
ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
'prop1' = 'value1',
'prop2' = 'value2'
);
Note: SERDEPROPERTIES passes key/value configuration to the SerDe; property names and values are both plain strings, which also makes it easy to document which SerDe a table uses and how it is configured.
Set additional properties on the SerDe currently in use:
ALTER TABLE table_using_JSON_storage
SET SERDEPROPERTIES (
'prop3' = 'value3',
'prop4' = 'value4'
);
(12) Change the physical layout (clustering/bucketing):
ALTER TABLE stocks
CLUSTERED BY (exchange, symbol)
SORTED BY (symbol)
INTO 48 BUCKETS;
(13) Miscellaneous ALTER TABLE statements: ALTER TABLE … TOUCH adds a hook for various operations:
ALTER TABLE log_messages TOUCH
PARTITION(year = 2012, month = 1, day = 1);
A typical use is firing those hooks after a partition's files have been modified outside Hive:
hive -e 'ALTER TABLE log_messages TOUCH PARTITION(year = 2012, month = 1, day = 1);'
(14) ALTER TABLE … ARCHIVE PARTITION packs a partition's files into a Hadoop archive file (HAR):
ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1); (only usable on partitioned tables)
(15) Protect a partition from being dropped or queried:
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;
Note: the counterpart of ENABLE is DISABLE; neither can be applied to a non-partitioned table.
(16) List tables matching a regular expression:
hive> SHOW TABLES '.*s';
(17) Convert between managed and external tables:
alter table tablePartition set TBLPROPERTIES ('EXTERNAL'='TRUE');  -- managed -> external
alter table tablePartition set TBLPROPERTIES ('EXTERNAL'='FALSE'); -- external -> managed
(18) Partitions and buckets:
partition (files are stored per directory; each partition maps to one directory). Example:
CREATE EXTERNAL TABLE table1 (
    column1 STRING,
    column2 STRING,
    column3 STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20090105);
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20090102);
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt=20081231);
bucket (rows are hashed on the given column; each bucket maps to one file):
CREATE TABLE VT_NEW_DATA (
    column1 STRING,
    column2 STRING,
    column3 STRING
)
CLUSTERED BY (column1) SORTED BY (column1) INTO 48 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS RCFILE;
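One payoff of bucketing is efficient sampling. A minimal sketch against the table above; hive.enforce.bucketing is assumed available (it was removed in Hive 2.x, where bucketing is always enforced):
-- when populating the table, let Hive create the 48 bucket files:
set hive.enforce.bucketing = true;
-- then sample a single bucket instead of scanning all of them:
SELECT * FROM VT_NEW_DATA TABLESAMPLE(BUCKET 1 OUT OF 48 ON column1);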
(1) Rename a column, change its position, type, and comment:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;
Syntax: CHANGE COLUMN old_name new_name column_type
COMMENT is optional; add one if you wish.
AFTER moves the column to a new position.
Only metadata is modified; the underlying data is not changed.
(2) Add new columns:
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');
(3) Drop and replace columns: use with caution!
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
ADD appends new columns after all existing columns (but before any partition columns).
REPLACE replaces every existing column in the table.
REPLACE COLUMNS removes all existing columns and adds the new set of columns.
REPLACE COLUMNS can also be used to drop columns. For example:
"ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" will remove column `c` from test_change's schema. Note that this does not delete the underlying data; it only changes the schema.
(4) REGEX column specification
A SELECT statement can use a regular expression to choose columns; the query below selects every column except ds and hr:
SELECT `(ds|hr)?+.+` FROM test
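If this fails with a parse error on the backquoted name, note that newer Hive versions (0.13+) only honor regex column names after disabling quoted identifiers:
set hive.support.quoted.identifiers=none;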
hive> set;
…
hive> set -v;
… even more output!…
'set' prints all variables in the hivevar, hiveconf, system, and env namespaces.
'set -v' additionally prints every variable defined by Hadoop.
hive> set hivevar:foo=hello;
hive> set hivevar:foo;
hivevar:foo=hello
Using the variable:
hive> create table toss1(i int, ${hivevar:foo} string);
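Variables can also be supplied when launching the CLI; a hedged equivalent of the session above (the column name mycol is illustrative):
hive --hivevar foo=mycol -e 'create table toss1(i int, ${hivevar:foo} string);'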
-- Create a database
create database ecdata WITH DBPROPERTIES ('creator' = 'June', 'date' = '2014-06-01');
-- (or use the COMMENT keyword instead)
-- Describe the database
DESCRIBE DATABASE ecdata;
DESCRIBE DATABASE EXTENDED ecdata;
-- Switch database
use ecdata;
-- Drop a table
drop table ext_trackflow;
-- Create a table
create EXTERNAL table IF NOT EXISTS ext_trackflow (
    cookieId string COMMENT '05dvOVC6Il6INhYABV6LAg==',
    cate1 string COMMENT '4',
    area1 string COMMENT '102',
    url string COMMENT 'http://cd.ooxx.com/jinshan-mingzhan-1020',
    trackTime string COMMENT '2014-05-25 23:03:36',
    trackURLMap map<string,string> COMMENT '{"area":"102","cate":"4,29,14052"}'
)
PARTITIONED BY (statDate STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/DataWarehouse/ods/TrackFlowTable';
-- Add a partition
ALTER TABLE ext_trackflow ADD PARTITION (statDate='20140525')
LOCATION '/DataWarehouse/ods/TrackFlowTable/20140525';
-- Add yesterday's partition from a daily job
yesterday=`date -d '1 days ago' +'%Y%m%d'`
hive -e "use ecdata; ALTER TABLE ext_trackflow ADD PARTITION (statDate='$yesterday') LOCATION '/DataWarehouse/ods/TrackFlowTable/$yesterday';"
(1) Count page views (PV) by page type:
select pageType, count(pageType) from ext_trackflow where statDate = '20140521' group by pageType;
Note: an ordinary SELECT scans the whole table; if the table was created with a PARTITIONED BY clause, queries can exploit partition pruning (input pruning).
In Hive's current implementation, partition pruning only kicks in when the partition predicate appears in the WHERE clause closest to the FROM clause.
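A hedged illustration of that point (newer Hive versions push predicates down more aggressively, so behavior varies):
-- pruning applies: the partition predicate sits in the WHERE closest to FROM
SELECT pageType FROM ext_trackflow WHERE statDate = '20140521';
-- pruning may not apply when the predicate only appears in an outer query
SELECT * FROM (SELECT pageType, statDate FROM ext_trackflow) t
WHERE t.statDate = '20140521';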
(2) Two ways to export query results to the local filesystem:
INSERT OVERWRITE LOCAL DIRECTORY '/home/jun06/tmp/110.112.113.115'
select area1, count(area1) from ext_trackflow where statDate = '20140521'
group by area1 having count(area1) > 1000;
hive -e "use ecdata; select area1, count(area1) from ext_trackflow where statDate = '20140521' group by area1 having count(area1) > 1000;" > a.txt
(3) Querying and using map-typed columns:
select trackURLMap, extField, unitParamMap, queryParamMap from ext_trackflow where statDate = '20140525' and size(unitParamMap)!=0 limit 10;
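Individual map keys are read with bracket indexing; for example, using the 'area' key from the sample data above:
select trackURLMap['area'], size(trackURLMap) from ext_trackflow where statDate = '20140525' limit 10;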
(4) The following query returns the 5 sales representatives with the largest sales records. With a single reducer, SORT BY yields a total order, so this behaves like ORDER BY:
SET mapred.reduce.tasks = 1;
SELECT * FROM test SORT BY amount DESC LIMIT 5;
(5) Insert data from one table into several tables and paths in a single pass:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
(6) Use streaming (TRANSFORM) to pipe rows through an external program:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
(7) Hive supports only equality joins (equi-joins), outer joins, and left semi joins; non-equality join conditions are not supported because they are hard to translate into map/reduce jobs.
The LEFT, RIGHT, and FULL OUTER keywords control how rows with no match are handled in a join.
LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery pattern.
In each map/reduce stage of a join, the reducer buffers the rows of every table in the join sequence except the last one, then streams the last table's rows through while writing results to the file system.
In practice, therefore, put the largest table last.
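If reordering the FROM clause is inconvenient, a STREAMTABLE hint can mark which table to stream instead of buffer:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val FROM a JOIN b ON (a.key = b.key);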
(8) Points to watch in join queries:
Only equality joins are supported:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b
ON (a.id = b.id AND a.department = b.department)
More than two tables can be joined, for example:
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
Note: if all tables in a multi-way join use the same join key, the join is compiled into a single map/reduce job.
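For contrast: rewritten so that all tables join on the same key b.key1, the query above compiles to a single map/reduce job:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)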
(9) LEFT, RIGHT, and FULL OUTER
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
If you want to restrict a join's output, write the filter either in the WHERE clause or in the ON clause of the join.
Partitioned tables are where this gets confusing:
SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key=d.key)
WHERE c.ds='2010-07-07' AND d.ds='2010-07-07'
If d has no row matching a row of c, the join emits NULL for all of d's columns, including ds; the WHERE clause then discards those rows, filtering out every row of c without a match in d. In other words, the WHERE clause defeats the LEFT OUTER semantics.
The fix is to move the partition filters into the ON clause:
SELECT c.val, d.val FROM c LEFT OUTER JOIN d
ON (c.key=d.key AND d.ds='2010-07-07' AND c.ds='2010-07-07')
(10) LEFT SEMI JOIN
The restriction on LEFT SEMI JOIN is that the right-hand table may only be referenced in the ON clause; you cannot filter it in the WHERE clause, the SELECT list, or anywhere else.
SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM b);
can be rewritten as:
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
(11) Habits to change when moving from SQL to HiveQL
① Hive does not support the traditional implicit-join syntax (comma-separated tables in the FROM clause)
• In SQL, an inner join of two tables can be written as:
• select * from dual a, dual b where a.key = b.key;
• In Hive it must be written as:
• select * from dual a join dual b on a.key = b.key;
rather than the traditional format:
SELECT t1.a1 as c1, t2.b1 as c2 FROM t1, t2
WHERE t1.a2 = t2.b2
② The semicolon character
• A semicolon marks the end of a statement in SQL and in HiveQL alike, but HiveQL's recognition of semicolons is less clever. For example:
• select concat(key, concat(';', key)) from dual;
• HiveQL fails at parse time with:
FAILED: Parse Error: line 0:-1 mismatched input '<EOF>' expecting ) in function specification
• The workaround is to escape the semicolon with its octal ASCII code, so the statement is written as:
•select concat(key,concat('\073',key)) from dual;