Apache Hive Storage Formats and Compression


Introduction:

Storage formats and compression formats in Apache Hive.

1. Text File

hive> create external table tab_textfile (
host string comment 'client ip address', 
local_time string comment 'client access time', 
api string comment 'request api', 
request_type string comment 'request method, http version', 
http_code int, body_bytes int, request_body map<string, string>, 
referer string, user_agent string, upstr string, response_time string, request_time string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '=';
OK
Time taken: 0.162 seconds

# Create an uncompressed external table stored as Text File for testing

hive> load data local inpath '/data/logs/api/201711/tvlog_20171101/bftvapi.20171101.log' overwrite into table tab_textfile;
Loading data to table tmpdb.tab_textfile
OK
Time taken: 1015.974 seconds

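# The source file is 9.8 GB, and loading it into this table took 1015.974 seconds (this can be optimized: skip the load command and put the file directly into the table's data directory)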

hive> select count(*) from tab_textfile;
...
Stage-Stage-1: Map: 39  Reduce: 1   Cumulative CPU: 269.51 sec   HDFS Read: 10463240195 HDFS Write: 108 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 29 seconds 510 msec
OK
27199202
Time taken: 95.68 seconds, Fetched: 1 row(s)

# 27199202 rows in total, counted in 95.68 seconds
# Tuning knobs: set [ hive.exec.reducers.bytes.per.reducer=<number>, hive.exec.reducers.max=<number>, mapreduce.job.reduces=<number> ]

2. ORC File

# Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

# ORC documentation: https://orc.apache.org/docs

hive> create external table tab_orcfile (
host string comment 'client ip address', 
local_time string comment 'client access time', 
api string comment 'request api', 
request_type string comment 'request method, http version', 
http_code int, body_bytes int, request_body map<string, string>, 
referer string, user_agent string, upstr string, response_time string, request_time string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '=' 
STORED AS ORC tblproperties ("orc.compress"="NONE");
OK
Time taken: 0.058 seconds

# Create an uncompressed external table stored as ORC File

hive> insert overwrite table tab_orcfile select * from tab_textfile;
...
Stage-Stage-1: Map: 39   Cumulative CPU: 2290.24 sec   HDFS Read: 10463288479 HDFS Write: 2474228733 SUCCESS
Total MapReduce CPU Time Spent: 38 minutes 10 seconds 240 msec
OK
Time taken: 289.954 seconds

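# Load data into tab_orcfile. Note: a text log cannot be loaded into an ORC table with load data directly, because load data places files as-is and does not convert them to ORC!

# Instead, create a temporary Text File table, upload the data manually into that table's directory, and then convert it to ORC File format with insert ... select.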

hive> select count(*) from tab_orcfile;
OK
27199202
Time taken: 2.555 seconds, Fetched: 1 row(s)

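# Hmm. The same statement took 95.68 seconds above, and now it takes only 2.555 seconds. No MapReduce stage appears in the output, so the count was presumably answered from table statistics rather than by scanning the data.
# Test it another way: query tab_orcfile first, then tab_textfile, this time with count(host).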

hive> select count(host) from tab_orcfile;
...
Stage-Stage-1: Map: 9  Reduce: 1   Cumulative CPU: 81.02 sec   HDFS Read: 96908995 HDFS Write: 108 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 21 seconds 20 msec
OK
27199202
Time taken: 33.55 seconds, Fetched: 1 row(s)

# ORC File took 33.55 seconds

hive> select count(host) from tab_textfile;
...
Stage-Stage-1: Map: 39  Reduce: 1   Cumulative CPU: 349.77 sec   HDFS Read: 10463246048 HDFS Write: 108 SUCCESS
Total MapReduce CPU Time Spent: 5 minutes 49 seconds 770 msec
OK
27199202
Time taken: 87.308 seconds, Fetched: 1 row(s)

# Text File took 87.308 seconds. The difference is obvious!
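
To put numbers on that comparison, here is a minimal sketch in Python that computes the ratios from the two count(host) job logs above. The figures are copied from those logs; the dictionary layout and the `ratio` helper are only illustrative names, not anything provided by Hive.

```python
# Figures copied from the two count(host) job logs above.
orc = {"hdfs_read_bytes": 96_908_995, "cpu_sec": 81.02, "wall_sec": 33.55}
text = {"hdfs_read_bytes": 10_463_246_048, "cpu_sec": 349.77, "wall_sec": 87.308}


def ratio(metric):
    """How many times larger the Text File figure is than the ORC figure."""
    return text[metric] / orc[metric]


read_ratio = ratio("hdfs_read_bytes")
cpu_ratio = ratio("cpu_sec")
wall_ratio = ratio("wall_sec")

print(f"HDFS bytes read: {read_ratio:.0f}x more for Text File")
print(f"CPU time:        {cpu_ratio:.1f}x more for Text File")
print(f"Wall-clock time: {wall_ratio:.1f}x more for Text File")
```

The query touches only the host column, and ORC is a columnar format, which is consistent with the ORC job reading roughly 1% of the bytes the Text File job read, while wall-clock time improves by about 2.6 times.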

3. Enabling Compression

# ORC documentation: https://orc.apache.org/docs/hive-config.html

hive> create external table tab_orcfile_zlib (
host string comment 'client ip address', 
local_time string comment 'client access time', 
api string comment 'request api', 
request_type string comment 'request method, http version', 
http_code int, body_bytes int, request_body map<string, string>, 
referer string, user_agent string, upstr string, response_time string, request_time string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '=' 
STORED AS ORC;

# The default ORC compression codec is ZLIB, so no orc.compress property is set here. LZO, SNAPPY, and others are also supported

hive> insert overwrite table tab_orcfile_zlib select * from tab_textfile;
...
Stage-Stage-1: Map: 39   Cumulative CPU: 2344.68 sec   HDFS Read: 10463292808 HDFS Write: 1077757683 SUCCESS
Total MapReduce CPU Time Spent: 39 minutes 4 seconds 680 msec
OK
Time taken: 299.204 seconds

# Data load complete

hive> select count(host) from tab_orcfile_zlib;
...
Stage-Stage-1: Map: 4  Reduce: 1   Cumulative CPU: 43.66 sec   HDFS Read: 66760966 HDFS Write: 108 SUCCESS
Total MapReduce CPU Time Spent: 43 seconds 660 msec
OK
27199202
Time taken: 31.369 seconds, Fetched: 1 row(s)

# Query speed is not affected: 31.369 seconds here versus 33.55 seconds for the uncompressed ORC table

hive> dfs -ls -h /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/
Found 39 items
-rwxrwxrwx   3 root supergroup     24.6 M 2017-11-10 16:55 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000000_0
-rwxrwxrwx   3 root supergroup     23.0 M 2017-11-10 16:56 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000001_0
-rwxrwxrwx   3 root supergroup     25.9 M 2017-11-10 16:55 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000002_0
-rwxrwxrwx   3 root supergroup     26.5 M 2017-11-10 16:55 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000003_0

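# The data is split into 39 files averaging about 25 MB each, roughly 1 GB in total (the insert job logged HDFS Write: 1077757683), versus a 9.8 GB source file. How is that for a compression ratio?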
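
The compression ratio can be worked out from the byte counts in the job logs above. The following Python sketch does that under two assumptions that the logs do not state explicitly: that the "HDFS Read" figure of the full scan approximates the size of the raw log, and that the "HDFS Write" figure of each insert approximates the size of the resulting table. The `summarize` helper is a hypothetical name used only for illustration.

```python
# Byte counts copied from the job logs above.
# Assumption: "HDFS Read" of the full scan approximates the raw log size,
# and "HDFS Write" of each insert approximates the resulting table size.
source_bytes = 10_463_240_195  # select count(*) from tab_textfile
table_bytes = {
    "tab_orcfile (orc.compress=NONE)": 2_474_228_733,
    "tab_orcfile_zlib (default ZLIB)": 1_077_757_683,
}

GIB = 1024 ** 3


def summarize(written):
    """Return (size in GiB, fraction of source, compression ratio)."""
    return written / GIB, written / source_bytes, source_bytes / written


for name, written in table_bytes.items():
    size_gib, fraction, ratio = summarize(written)
    print(f"{name}: {size_gib:.2f} GiB, {fraction:.1%} of source, {ratio:.1f}:1")
```

By these figures, the uncompressed ORC table is about 24% of the source size and the ZLIB-compressed ORC table about 10%, which is close to a 10:1 ratio.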
