Dataset export:
1. Export the data for 2015-09-09 from the tvlog_tcl table in the tvlog database
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tvlog_tcl_2015_09_09'
SELECT *
FROM tvlog.tvlog_tcl
WHERE year = 2015 and month = 9 and day = 9;
2. The exported data looks like this:
f03ec5b8ed9f4d209864a39ab97b8711-1安徽衛視廣西antv171.38.38.1512015-09-092015-09-09 19:24:252015-09-09 19:26:25CCTV-1綜合CCTV-1綜合40-8B-F6-A4-A1-49e4d0a284ce53e4ab810847adc33cb5a00d806370113522354201599
f03ec5b8ed9f4d209864a39ab97b8711-1CCTV-1綜合廣西cctv1171.38.38.1512015-09-092015-09-09 19:26:252015-09-09 19:30:25安徽衛視安徽衛視40-8B-F6-A4-A1-49e4d0a284ce53e4ab810847adc33cb5a00d806370113522354201599
...
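The exported data appears to have no delimiters. The fields are in fact separated by the control characters ^A and ^B, which are not displayed, so the file is awkward to use.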
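The invisible delimiters can be made visible, or swapped for tabs, with standard tools. The following is a minimal sketch, assuming the export uses Hive's default field delimiter ^A (octal \001); the sample record and the function name ctrl_a_to_tab are made up for illustration.

```shell
#!/bin/sh
# Convert the ^A (\001) field delimiter to a tab.
# ctrl_a_to_tab is a hypothetical helper name, not part of Hive.
ctrl_a_to_tab() {
    tr '\001' '\t'
}

# Made-up sample record with three ^A-separated fields.
sample=$(printf 'id1\001-1\001CCTV-1\n')

# cat -v renders control characters visibly, so \001 shows up as ^A.
visible=$(printf '%s\n' "$sample" | cat -v)

# The same record with tabs instead of ^A.
converted=$(printf '%s\n' "$sample" | ctrl_a_to_tab)

printf '%s\n%s\n' "$visible" "$converted"
```

On the real export, the same function could be applied to the files under /tmp/tvlog_tcl_2015_09_09/, provided no field value itself contains a tab.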
3. Export the data from the Linux command line by redirecting the output of hive -e to a file
hive -e "SELECT * FROM tvlog.tvlog_tcl WHERE year = 2015 and month = 9 and day = 9;" >> "/tmp/tvlog_tcl_2015_09_09.txt";
4. The exported data looks like this:
NULL cc4b7bc02a1c47bcb5f47f23c6e4a45b -1 貴州衛視 中國 5a7d01661b5d9c64293860531374312b 103.244.252.71 2015-09-09 2015-09-09 00:39:03 2015-09-09 00:51:03 安徽衛視 廣東衛視 5C-36-B8-40-EA-91 248dca07ce4070d56b59a56dff1fb8d3e0125654 406355278 2015 9 9
NULL 21ea96ff7f31439b8434baf2b6953db9 -1 深圳衛視 四川 20831bb807a45638cfaf81df1122024d 222.215.124.45 2015-09-09 2015-09-09 00:01:02 2015-09-09 00:23:02 浙江衛視 40-8B-F6-6B-1B-52 40
...
The data is tab-delimited.
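Before loading a tab-delimited file into a table, it can help to check that every line has the expected number of fields (the tables created below have 18 columns). This is a rough sketch using awk; the function name field_counts and the three-line sample file are assumptions for illustration, not the real export.

```shell
#!/bin/sh
# Print each distinct tab-separated field count and the number of
# lines that have it, sorted by field count.
# field_counts is a hypothetical helper name.
field_counts() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}

# Made-up sample: two lines with 3 fields, one line with 2 fields.
tmp=$(mktemp)
printf 'a\tb\tc\n' > "$tmp"
printf 'd\te\tf\n' >> "$tmp"
printf 'g\th\n' >> "$tmp"

result=$(field_counts "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

Run against /tmp/tvlog_tcl_2015_09_09.txt, a single output line starting with 18 would indicate that every row matches the table layout; any other count points at rows worth inspecting.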
Create the tables:
Create a table stored as textfile with every column of type STRING
DROP TABLE test.tvlog_tcl_textfile_string;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_textfile_string (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt STRING,
starttime STRING,
endtime STRING,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum STRING,
year STRING,
month STRING,
day STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/tvlog_tcl_2015_09_09.txt' INTO TABLE test.tvlog_tcl_textfile_string;
Loading data to table test.tvlog_tcl_textfile_string
Table test.tvlog_tcl_textfile_string stats: [numFiles=1, totalSize=549291193]
OK
Time taken: 2.4 seconds
Create a table stored as textfile with each column declared with its corresponding type
DROP TABLE test.tvlog_tcl_textfile_other;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_textfile_other (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt DATE,
starttime TIMESTAMP,
endtime TIMESTAMP,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum INT,
year INT,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/tvlog_tcl_2015_09_09.txt' INTO TABLE test.tvlog_tcl_textfile_other;
Loading data to table test.tvlog_tcl_textfile_other
Table test.tvlog_tcl_textfile_other stats: [numFiles=1, totalSize=549291193]
OK
Time taken: 2.36 seconds
Create a table stored as orcfile with every column of type STRING
DROP TABLE test.tvlog_tcl_orc_string;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_orc_string (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt STRING,
starttime STRING,
endtime STRING,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum STRING,
year STRING,
month STRING,
day STRING
)
STORED AS ORC;
INSERT INTO TABLE test.tvlog_tcl_orc_string
SELECT * FROM test.tvlog_tcl_textfile_string;
Loading data to table test.tvlog_tcl_orc_string
Table test.tvlog_tcl_orc_string stats: [numFiles=3, numRows=2223869, totalSize=87336289, rawDataSize=3863401633]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Cumulative CPU: 54.55 sec HDFS Read: 549326336 HDFS Write: 87336567 SUCCESS
Total MapReduce CPU Time Spent: 54 seconds 550 msec
OK
Time taken: 36.028 seconds
Create a table stored as orcfile with each column declared with its corresponding type
DROP TABLE test.tvlog_tcl_orc_other;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_orc_other (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt DATE,
starttime TIMESTAMP,
endtime TIMESTAMP,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum INT,
year INT,
month INT,
day INT
)
STORED AS ORC;
INSERT INTO TABLE test.tvlog_tcl_orc_other
SELECT * FROM test.tvlog_tcl_textfile_other;
Loading data to table test.tvlog_tcl_orc_other
Table test.tvlog_tcl_orc_other stats: [numFiles=3, numRows=2223869, totalSize=84204372, rawDataSize=2755419207]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Cumulative CPU: 53.6 sec HDFS Read: 549326196 HDFS Write: 84204647 SUCCESS
Total MapReduce CPU Time Spent: 53 seconds 600 msec
OK
Time taken: 33.834 seconds
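In summary:
1. If the table is stored as textfile, the column types have no effect on the table size (549291193 bytes in both cases).
2. If the table is stored in a compressed format (orcfile), the table with correspondingly typed columns is smaller than the all-STRING one (84204372 vs. 87336289 bytes).
3. For 2223869 rows, the orcfile-to-textfile storage ratio is 84204372 / 549291193 = 0.153296417406787, i.e. about 15.3%.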
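The ratios behind these conclusions can be recomputed from the totalSize values in the logs above. The snippet below is a small sketch using awk for the floating-point division; the function name ratio is an assumption for illustration.

```shell
#!/bin/sh
# Print the ratio of two byte counts to four decimal places.
# ratio is a hypothetical helper name.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f\n", a / b }'
}

# totalSize values copied from the Hive logs above.
ratio 84204372 549291193   # ORC, typed columns vs. textfile
ratio 87336289 549291193   # ORC, all STRING vs. textfile
ratio 84204372 87336289    # ORC, typed vs. ORC, all STRING
```

The third ratio shows that, on this dataset, typed columns save only a few percent compared with the all-STRING ORC table; most of the reduction comes from the ORC format itself.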