Using the SequenceFile File Format with Impala Tables (Translation)

Date: 2019-11-10

Using the SequenceFile File Format with Impala Tables

Cloudera Impala supports using SequenceFile data files.

See the following sections for details about using SequenceFile data files with Impala tables:

Creating SequenceFile Tables and Loading Data

If you do not have an existing data file to use, begin by creating one in the appropriate format.

To create a SequenceFile table:

In impala-shell, issue a command similar to:

create table sequencefile_table (column_specs) stored as sequencefile;

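Because Impala can query some kinds of tables that it cannot currently write to, after creating a table in certain file formats you might need to load the data through the Hive shell. See How Impala Works with Hadoop File Formats for details. After loading data through Hive or any other mechanism outside Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, so that Impala recognizes the newly added data.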

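For example, here is how you might create SequenceFile tables in Impala (either by specifying the columns explicitly or by cloning the structure of another table), load data through Hive, and then query the data through Impala: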

$ impala-shell -i localhost
[localhost:21000] > create table seqfile_table (x int) stored as sequencefile;
[localhost:21000] > create table seqfile_clone like some_other_table stored as sequencefile;
[localhost:21000] > quit;

$ hive
hive> insert into table seqfile_table select x from some_other_table;
3 Rows loaded to seqfile_table
Time taken: 19.047 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] > select * from seqfile_table;
Returned 0 row(s) in 0.23s
[localhost:21000] > -- Make Impala recognize the data loaded through Hive;
[localhost:21000] > refresh seqfile_table;
[localhost:21000] > select * from seqfile_table;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.23s
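The refresh-then-query step from the session above could also be scripted. The following is a minimal sketch, not part of the original documentation: the helper name `refresh_then_query_commands` and the identifier check are assumptions for illustration. It only builds `impala-shell` command lines using the `-i` (impalad host) and `-q` (query) options and does not run them.

```python
import re
import shlex

# Assumption: table names are simple identifiers, optionally qualified as db.table.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def refresh_then_query_commands(table, impalad="localhost"):
    """Build impala-shell argument lists that refresh a table, then query it.

    Hypothetical helper: REFRESH must come first so that Impala sees data
    that was loaded through Hive.
    """
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    statements = [f"refresh {table}", f"select * from {table}"]
    return [["impala-shell", "-i", impalad, "-q", stmt] for stmt in statements]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for argv in refresh_then_query_commands("seqfile_table"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Each returned list can be passed to `subprocess.run` on a machine where `impala-shell` is installed. Issuing the statements separately keeps the order explicit: refresh first, query second.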

Enabling Compression for SequenceFile Tables

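You may want to enable compression on existing tables. Enabling compression improves performance in most cases and is supported for SequenceFile tables. For example, to enable Snappy compression, specify the following additional settings when loading data through the Hive shell: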

hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table new_table select * from old_table;

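If you are converting partitioned tables, you must complete additional steps. In that case, specify additional settings similar to the following: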

hive> create table new_table (your_cols) partitioned by (partition_cols) stored as new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table new_table partition(comma_separated_partition_cols) select * from old_table;

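Remember that Hive does not require you to specify the source format. Consider converting a table with two partition columns, year and month, to a Snappy-compressed SequenceFile table. Combining the components outlined previously to complete this conversion, you would specify settings similar to the following: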

hive> create table TBL_SEQ (int_col int, string_col string) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq SELECT * FROM tbl;

To complete a similar process for a partitioned table, you would specify settings similar to the following:

hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq PARTITION(year) SELECT * FROM tbl;
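With a dynamic-partition insert, Hive takes the partition values from the trailing columns of the SELECT list, in the same order as the PARTITION clause, so `SELECT *` works only when the source table already lists its columns in that order. The sketch below is a hypothetical helper (the name `dynamic_partition_insert` is an assumption, not something from the original) that builds the statement with an explicit column list so the ordering is visible.

```python
def dynamic_partition_insert(target, source, data_cols, partition_cols):
    """Build a Hive dynamic-partition INSERT OVERWRITE statement.

    Hypothetical helper: partition columns are placed last in the SELECT
    list, in the same order as they appear in the PARTITION clause.
    """
    if not partition_cols:
        raise ValueError("at least one partition column is required")
    select_list = ", ".join(list(data_cols) + list(partition_cols))
    partition_spec = ", ".join(partition_cols)
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION({partition_spec}) "
        f"SELECT {select_list} FROM {source};"
    )


if __name__ == "__main__":
    # Same tables and columns as the session above.
    print(dynamic_partition_insert(
        "tbl_seq", "tbl", ["int_col", "string_col"], ["year"]))
```

The generated statement still needs the `hive.exec.dynamic.partition` settings shown above to be in effect in the same Hive session.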

Note:

The compression codec is set with the following command:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

You can specify an alternative codec here, such as GzipCodec.
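As an illustration of swapping the codec, the following sketch generates the Hive SET statements used throughout this section for a chosen codec. The function name `compression_settings` and the short codec keys are assumptions made for this example; the class names are the standard Hadoop codec classes in `org.apache.hadoop.io.compress`.

```python
# Short names (hypothetical keys) mapped to Hadoop compression codec classes.
CODEC_CLASSES = {
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
}


def compression_settings(codec="snappy"):
    """Return the Hive SET statements from this section for the given codec."""
    try:
        codec_class = CODEC_CLASSES[codec.lower()]
    except KeyError:
        raise ValueError(f"unknown codec: {codec!r}") from None
    return [
        "SET hive.exec.compress.output=true;",
        "SET mapred.max.split.size=256000000;",
        "SET mapred.output.compression.type=BLOCK;",
        f"SET mapred.output.compression.codec={codec_class};",
    ]


if __name__ == "__main__":
    # Print the settings for gzip instead of Snappy.
    for statement in compression_settings("gzip"):
        print(statement)
```

Only the last statement changes between codecs; the other three settings are the same ones used in the Snappy examples.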
