A Study of How Hive Column Data Types Relate to Storage Size Under ORCFile


Exporting the dataset:

1. Export the data for 2015-09-09 from the tvlog_tcl table in the tvlog database

   INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tvlog_tcl_2015_09_09'
   SELECT *
   FROM tvlog.tvlog_tcl
   WHERE year = 2015 and month = 9 and day = 9;


2. The exported data looks like this:

   f03ec5b8ed9f4d209864a39ab97b8711-1安徽衛視廣西antv171.38.38.1512015-09-092015-09-09 19:24:252015-09-09 19:26:25CCTV-1綜合CCTV-1綜合40-8B-F6-A4-A1-49e4d0a284ce53e4ab810847adc33cb5a00d806370113522354201599

   f03ec5b8ed9f4d209864a39ab97b8711-1CCTV-1綜合廣西cctv1171.38.38.1512015-09-092015-09-09 19:26:252015-09-09 19:30:25安徽衛視安徽衛視40-8B-F6-A4-A1-49e4d0a284ce53e4ab810847adc33cb5a00d806370113522354201599

   ...

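   The exported data appears to have no delimiters. That is because ^A and ^B, the control characters Hive uses as its default delimiters, are not displayed, which makes the file awkward to work with.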
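
One way to make the hidden delimiters usable is to translate them into tabs after the export. The snippet below is a minimal sketch of that idea, not part of the original procedure: show_fields is a hypothetical helper name, and the demo record is made up rather than taken from tvlog_tcl.

```shell
# show_fields (hypothetical helper): read text on stdin and replace
# Hive's default field delimiter ^A (octal \001) with a tab.
show_fields() {
    tr '\001' '\t'
}

# Demo on a made-up three-field record (not real tvlog data).
printf 'id-1\001-1\001channel-x\n' | show_fields
```

In practice the function would be fed the files Hive wrote under the export directory, with the result redirected to a new file.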


3. Export the data using shell output redirection

   hive -e "SELECT * FROM tvlog.tvlog_tcl WHERE year = 2015 and month = 9 and day = 9;" >> "/tmp/tvlog_tcl_2015_09_09.txt";


4. The exported data looks like this:

   NULL    cc4b7bc02a1c47bcb5f47f23c6e4a45b        -1      貴州衛視        中國    5a7d01661b5d9c64293860531374312b        103.244.252.71  2015-09-09      2015-09-09 00:39:03    2015-09-09 00:51:03      安徽衛視        廣東衛視        5C-36-B8-40-EA-91       248dca07ce4070d56b59a56dff1fb8d3e0125654        406355278       2015    9       9

   NULL    21ea96ff7f31439b8434baf2b6953db9        -1      深圳衛視        四川    20831bb807a45638cfaf81df1122024d        222.215.124.45  2015-09-09      2015-09-09 00:01:02    2015-09-09 00:23:02              浙江衛視        40-8B-F6-6B-1B-52       40

   ...

   The data is tab-delimited.
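
Before loading the file into an 18-column table, it can be worth checking that every line really splits into the same number of tab-separated fields. The following is a rough sketch under that assumption; count_fields is a hypothetical helper name and the demo input is invented.

```shell
# count_fields (hypothetical helper): read tab-delimited text on stdin and
# print "<fields per line> <number of lines with that count>".
count_fields() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Demo on two made-up lines; the second has an empty middle field,
# which still counts as a field.
printf 'a\tb\tc\nd\t\tf\n' | count_fields
```

Run against /tmp/tvlog_tcl_2015_09_09.txt, a single output line starting with 18 would indicate a uniform layout; several output lines would point to rows that need a closer look.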


Creating the tables:

Create a table stored as TEXTFILE with every column typed as STRING

DROP TABLE test.tvlog_tcl_textfile_string;

CREATE TABLE IF NOT EXISTS test.tvlog_tcl_textfile_string (

    id STRING,

    userid STRING,

    channelid STRING,

    channelname STRING,

    region STRING,

    channelcode STRING,

    ip STRING,

    dt STRING,

    starttime STRING,

    endtime STRING,

    fromchannel STRING,

    tochannel STRING,

    mac STRING,

    deviceid STRING,

    dnum STRING,

    year STRING,

    month STRING,

    day STRING

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/tvlog_tcl_2015_09_09.txt' INTO TABLE test.tvlog_tcl_textfile_string;

Loading data to table test.tvlog_tcl_textfile_string

Table test.tvlog_tcl_textfile_string stats: [numFiles=1, totalSize=549291193]

OK

Time taken: 2.4 seconds



Create a table stored as TEXTFILE with columns of the appropriate types

DROP TABLE test.tvlog_tcl_textfile_other;

CREATE TABLE IF NOT EXISTS test.tvlog_tcl_textfile_other (

    id STRING,

    userid STRING,

    channelid STRING,

    channelname STRING,

    region STRING,

    channelcode STRING,

    ip STRING,

    dt DATE,

    starttime TIMESTAMP,

    endtime TIMESTAMP,

    fromchannel STRING,

    tochannel STRING,

    mac STRING,

    deviceid STRING,

    dnum INT,

    year INT,

    month INT,

    day INT

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/tvlog_tcl_2015_09_09.txt' INTO TABLE test.tvlog_tcl_textfile_other;

Loading data to table test.tvlog_tcl_textfile_other

Table test.tvlog_tcl_textfile_other stats: [numFiles=1, totalSize=549291193]

OK

Time taken: 2.36 seconds



Create a table stored as ORC with every column typed as STRING

DROP TABLE test.tvlog_tcl_orc_string;

CREATE TABLE IF NOT EXISTS test.tvlog_tcl_orc_string (

    id STRING,

    userid STRING,

    channelid STRING,

    channelname STRING,

    region STRING,

    channelcode STRING,

    ip STRING,

    dt STRING,

    starttime STRING,

    endtime STRING,

    fromchannel STRING,

    tochannel STRING,

    mac STRING,

    deviceid STRING,

    dnum STRING,

    year STRING,

    month STRING,

    day STRING

)

STORED AS ORC;

INSERT INTO TABLE test.tvlog_tcl_orc_string

SELECT * FROM test.tvlog_tcl_textfile_string;

Loading data to table test.tvlog_tcl_orc_string

Table test.tvlog_tcl_orc_string stats: [numFiles=3, numRows=2223869, totalSize=87336289, rawDataSize=3863401633]

MapReduce Jobs Launched: 

Stage-Stage-1: Map: 3   Cumulative CPU: 54.55 sec   HDFS Read: 549326336 HDFS Write: 87336567 SUCCESS

Total MapReduce CPU Time Spent: 54 seconds 550 msec

OK

Time taken: 36.028 seconds


Create a table stored as ORC with columns of the appropriate types

DROP TABLE test.tvlog_tcl_orc_other;

CREATE TABLE IF NOT EXISTS test.tvlog_tcl_orc_other (

    id STRING,

    userid STRING,

    channelid STRING,

    channelname STRING,

    region STRING,

    channelcode STRING,

    ip STRING,

    dt DATE,

    starttime TIMESTAMP,

    endtime TIMESTAMP,

    fromchannel STRING,

    tochannel STRING,

    mac STRING,

    deviceid STRING,

    dnum INT,

    year INT,

    month INT,

    day INT

)

STORED AS ORC;

INSERT INTO TABLE test.tvlog_tcl_orc_other

SELECT * FROM test.tvlog_tcl_textfile_other;

Loading data to table test.tvlog_tcl_orc_other

Table test.tvlog_tcl_orc_other stats: [numFiles=3, numRows=2223869, totalSize=84204372, rawDataSize=2755419207]

MapReduce Jobs Launched: 

Stage-Stage-1: Map: 3   Cumulative CPU: 53.6 sec   HDFS Read: 549326196 HDFS Write: 84204647 SUCCESS

Total MapReduce CPU Time Spent: 53 seconds 600 msec

OK

Time taken: 33.834 seconds


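In summary:

    1. When the table is stored as TEXTFILE, the declared column types have no effect on table size. LOAD DATA copies the file into the table as-is and the types are only applied when the data is read, so both TEXTFILE tables are 549291193 bytes.

    2. When the table is stored in a compressed format (ORC here), declaring appropriate column types gives a smaller table than declaring every column as STRING: 84204372 bytes versus 87336289 bytes.

    3. For 2223869 rows, the ORC to TEXTFILE storage ratio is 84204372 / 549291193 = 0.153296417406787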
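
The ratios behind these conclusions can be reproduced from the totalSize values in the load logs. This is a small sketch only: ratio is a hypothetical helper name, and rounding to four decimal places is an arbitrary choice.

```shell
# ratio (hypothetical helper): print size_a / size_b rounded to 4 decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f\n", a / b }'
}

ratio 84204372 549291193   # ORC with typed columns vs TEXTFILE
ratio 87336289 549291193   # ORC with all-STRING columns vs TEXTFILE
ratio 84204372 87336289    # ORC typed vs ORC all-STRING
```

The third ratio puts the effect of the column types in perspective: the typed ORC table is only a few percent smaller than the all-STRING one, so most of the reduction comes from the storage format rather than from the types.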
