Testing Hive

1. Prepare the test data

 

Data generation script: gendata.sh

#!/bin/bash
file=$1
s=$2
touch "$file"
for ((i = 0; i < 10000000; i++))
do
    str=','$s
    name=${i}${str}${i}
    #echo $name
    echo "$name" >> "$file"
done

echo 'show testdata'
head "$file"
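Judging by the timestamps in the directory listing below, each run of this shell loop takes several minutes. A hedged sketch of a much faster equivalent (the `gen_fast` name and the optional row-count argument are my additions, not part of the original script): a single awk process emits the same `<i>,<suffix><i>` lines without looping in the shell.

```shell
#!/bin/bash
# gen_fast: same output format as gendata.sh ("<i>,<suffix><i>" per line),
# but generated inside one awk process instead of a 10-million-iteration
# shell loop. Function name and the row-count parameter are illustrative.
gen_fast() {
    local file=$1 suffix=$2 n=${3:-10000000}
    awk -v s="$suffix" -v n="$n" 'BEGIN {
        for (i = 0; i < n; i++) print i "," s i
    }' > "$file"
}

# Small demonstration run: 5 rows instead of 10 million.
gen_fast demo.txt name 5
cat demo.txt   # prints 0,name0 through 4,name4
```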

 

Generate the data:

First create ten small files, 10 million records each:

bash gendata.sh name.txt name
bash gendata.sh zhuzhi.txt zhuzhi
bash gendata.sh minzu.txt minzu
bash gendata.sh jg.txt jg
bash gendata.sh gj.txt gj
bash gendata.sh dz.txt dz
bash gendata.sh abcd.txt abcd
bash gendata.sh efgh.txt efgh
bash gendata.sh xyz.txt xyz
bash gendata.sh opq.txt opq
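The ten invocations above run one after another; since each is independent, they could be backgrounded and waited on to overlap the work. A sketch of the pattern, using a trivial stand-in function (`slow_gen` is mine, substituting for `bash gendata.sh <file> <suffix>` so the snippet is runnable on its own):

```shell
#!/bin/bash
# Run independent generator invocations as background jobs, then wait
# for all of them. slow_gen is a trivial stand-in for gendata.sh.
slow_gen() { printf '%s data\n' "$2" > "$1"; }

for suffix in name zhuzhi minzu jg gj dz abcd efgh xyz opq; do
    slow_gen "${suffix}.txt" "$suffix" &   # launch in the background
done
wait   # block until every background job has finished
echo "all generators finished"
```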

 

total 1.8G

-rw-r--r-- 1 root root 189M Feb  9 10:35 abcd.txt

-rw-r--r-- 1 root root 170M Feb  9 10:32 dz.txt

-rw-r--r-- 1 root root 189M Feb  9 10:38 efgh.txt

-rw-r--r-- 1 root root 170M Feb  9 10:28 gj.txt

-rw-r--r-- 1 root root 170M Feb  9 10:25 jg.txt

-rw-r--r-- 1 root root 199M Feb  9 10:22 minzu.txt

-rw-r--r-- 1 root root 189M Feb  9 10:08 name.txt

-rw-r--r-- 1 root root 180M Feb  9 10:49 opq.txt

-rw-r--r-- 1 root root 180M Feb  9 10:41 xyz.txt

-rw-r--r-- 1 root root 208M Feb  9 10:19 zhuzhi.txt

 

One big file, 100 million records:

 bash gendata.sh  name1000.txt name

 

-rw-r--r--  1 root root 2.1G Feb  9 10:50 name1000.txt

 

2. Test: 10 small files, 10 million records each (~180 MB per file), 100 million records / 1.8 GB total, analyzed with no optimization at all

 

Create the table in Hive:

 

create table hyl_test_par(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

Manually create the partition directory:

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003

 

Upload the data and change ownership:

 

 hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003/

 hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par/

 

Repair the partition metadata:

 

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+------------+--+

| partition  |

+------------+--+

+------------+--+

No rows selected (0.125 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> msck repair table hyl_test_par;

No rows affected (0.551 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+--------------------------------+--+

|           partition            |

+--------------------------------+--+

| sys_sj=20170209/sys_type=2003  |

+--------------------------------+--+

1 row selected (0.123 seconds)
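msck repair works by scanning the table's directory tree and registering every partition directory it finds. When, as here, a single known partition was added by hand, an explicit ALTER TABLE is a lighter-weight alternative (standard HiveQL, not run in the original; shown as a fragment):

```sql
-- Register the one partition directly instead of scanning the whole
-- table directory with MSCK REPAIR TABLE:
ALTER TABLE hyl_test_par ADD IF NOT EXISTS
  PARTITION (sys_sj='20170209', sys_type='2003');
```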

 

 

Test:

 

select count(*) from hyl_test_par where name <> ' ' ;

 

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (189.116 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.107 seconds)

 

 

Third run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.551 seconds)

 

Fourth run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.44 seconds)

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (113.291 seconds)

 

====================================== Unexplained divider =======================================

Re-ran it in the afternoon and it was more than 10x faster!!!

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.274 seconds)

 

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.525 seconds)

 

 

Third run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.11 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.722 seconds)

 

 

3. Test: 10 small files, 10 million records each (~180 MB per file), 100 million records / 1.8 GB total, on a table whose partition statistics were computed with analyze

 

As before, create the table for the analyze test and load the data:

 

create table hyl_test_par_ana(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003

hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_ana/

 

show partitions hyl_test_par_ana;

msck repair table hyl_test_par_ana;

show partitions hyl_test_par_ana;

 

Compute the partition statistics:

0: jdbc:hive2://cluster09.hzhz.co:10000> analyze table hyl_test_par_ana partition(sys_sj=20170209,sys_type=2003) compute statistics ;

INFO  : Session is already open

INFO  : Dag name: analyze table hyl_test_par_ana ...statistics(Stage-0)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0021)

.

.

.

INFO  : Partition default.hyl_test_par_ana{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (126.81 seconds)

 

Test:

select count(*) from hyl_test_par_ana;

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_ana;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.086 seconds)
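The 0.086-second count(*) above never launches a job: once the table is analyzed, Hive answers it straight from the metastore's numRows statistic, which is why it is three orders of magnitude faster than the filtered query that follows. This behavior is governed by a standard Hive setting (shown as a fragment; the default value may vary by version):

```sql
-- When enabled, a plain COUNT(*) over an analyzed table is served from
-- metastore statistics instead of running a Tez/MapReduce job.
-- A query with a row-level predicate (e.g. WHERE name <> ' ')
-- still has to scan the data.
set hive.compute.query.using.stats=true;
select count(*) from hyl_test_par_ana;
```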

 

Switching to the filtered query:

select count(*) from hyl_test_par_ana where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.239 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.687 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.319 seconds)

 

====================================== Unexplained divider =======================================

Re-ran it in the afternoon and it was more than 10x faster!!!

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (10.923 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.058 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.45 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.218 seconds)

 

 

4. Test: one big file, 2.0 GB, 100 million records total, with no optimization

Create a table with the same schema backed by the single big file, and upload the data:

 

 

create table hyl_test_par_big(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003

hadoop  fs -put name1000.txt /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_big/

 

[hdfs@cluster13 tmp]$ hadoop  fs -du -h /apps/hive/warehouse/hyl_test_par_big/

2.0 G  /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209

 

show partitions hyl_test_par_big;

msck repair table hyl_test_par_big;

show partitions hyl_test_par_big;

 

Test:

select count(*) from hyl_test_par_big where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.356 seconds)

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.3 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.861 seconds)

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.675 seconds)

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

====================================== Unexplained divider =======================================

Afternoon test:

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.933 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.435 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.868 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.403 seconds)

 

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

5. Test: 10 small files, ~180 MB each, 100 million records total, on a table populated via insert into

 

Load the data from the first table into a new table with insert into (during which Hive computes statistics automatically), then test:

 

create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);

 

0: jdbc:hive2://cluster09.hzhz.co:10000> create table hyl_test_par_auto as  select * from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: create table hyl_test_par_auto a...rand(123)(Stage-1)

INFO  : Tez session was closed. Reopening...

INFO  : Session re-established.

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: -/-    Reducer 2: 0/10   

INFO  : Map 1: 0/119    Reducer 2: 0/10   

.

.

.

INFO  : Moving data to directory hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto from hdfs://myBigdata/apps/hive/warehouse/.hive-staging_hive_2017-02-09_13-05-59_036_2906983779886430780-1/-ext-10001

INFO  : Table default.hyl_test_par_auto stats: [numFiles=10, numRows=100000000, totalSize=3327777800, rawDataSize=3227777800]

 

[hdfs@cluster13 tmp]$ hadoop  fs -ls -h  hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003

Found 10 items

-rwxrwxrwx   2 hive hdfs    183.9 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000000_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000001_0

-rwxrwxrwx   2 hive hdfs    183.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000002_0

-rwxrwxrwx   2 hive hdfs    184.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000003_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000004_0

-rwxrwxrwx   2 hive hdfs    183.3 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000005_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000006_0

-rwxrwxrwx   2 hive hdfs    184.0 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000007_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000008_0

-rwxrwxrwx   2 hive hdfs    183.4 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000009_0

 

 

Test:

select count(*) from hyl_test_par_auto where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (14.653 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (13.989 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (9.236 seconds)

 

 

drop table hyl_test_par_auto;

 

create table hyl_test_par_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: insert into table hyl_test_par_a...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: 0/119    Reducer 2: 0/10  

.

.

INFO  : Loading data to table default.hyl_test_par_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_13-26-47_915_3059621581501435386-1/-ext-10000

INFO  : Partition default.hyl_test_par_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (135.325 seconds)

 

 

Test:

select count(*) from hyl_test_par_auto where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.303 seconds)

 

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.56 seconds)

 

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.446 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.643 seconds)

 

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_auto ;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.098 seconds)

 

 

====================================== Unexplained divider =======================================

Afternoon test:

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.282 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.414 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.452 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.24 seconds)

 

 

6. Summary:

 

Time (s)                   10*180MB*NONE   10*180MB*analyze   1*2GB*none    10*180MB*insert into
Run 1                      11.274          10.923             11.933        11.282
Run 2                      11.525          6.058              4.435         3.414
Run 3                      11.11           6.45               5.868         6.452
Run 4                      11.722          6.218              3.403         6.24
Average                    11.40775        7.41225            6.40975       6.847
Warm average (w/o run 1)   11.45233333     6.242              4.568666667   5.368666667
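The averages in the table can be reproduced mechanically. A small awk check over the raw timings of the first column (numbers copied from the runs above; the `avg` helper is mine):

```shell
#!/bin/bash
# Recompute the "10*180MB*NONE" column's averages from its four raw timings.
avg() { awk '{ s += $1; n++ } END { printf "%.5f\n", s / n }'; }

printf '%s\n' 11.274 11.525 11.11 11.722 | avg   # 11.40775 (all four runs)
printf '%s\n' 11.525 11.11 11.722 | avg          # 11.45233 (warm runs only)
```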

 

 

7. Continuing to look for optimizations:

 

A single file, in a table populated via insert into:

 

set mapred.reduce.tasks=1;

create table hyl_test_par_big_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Tez session hasn't been created yet. Opening session

INFO  : Dag name: insert into table hyl_test_par_b...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0023)

 

INFO  : Map 1: -/-    Reducer 2: 0/1

INFO  : Map 1: 0/119    Reducer 2: 0/1

.

.

.

INFO  : Loading data to table default.hyl_test_par_big_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_big_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_14-30-05_480_3967529948260649900-1/-ext-10000

INFO  : Partition default.hyl_test_par_big_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=1, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (104.637 seconds)

 

Test:

select count(*) from hyl_test_par_big_auto where name <> ' ';

Runs:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.744 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.188 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.041 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.198 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.788 seconds)

 

 

Time (s)                   10*180MB*NONE   10*180MB*analyze   1*2GB*none    10*180MB*insert into   1*2GB*none*insert into
Run 1                      11.274          10.923             11.933        11.282                 11.744
Run 2                      11.525          6.058              4.435         3.414                  4.188
Run 3                      11.11           6.45               5.868         6.452                  4.041
Run 4                      11.722          6.218              3.403         6.24                   5.198
Average                    11.40775        7.41225            6.40975       6.847                  6.29275
Warm average (w/o run 1)   11.45233333     6.242              4.568666667   5.368666667            4.475666667

 

Optimization: column-level statistics

 

analyze table hyl_test_par_ana compute statistics for columns;

 

select count(*) from hyl_test_par_ana where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.519 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

 

1 row selected (7.688 seconds)

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.456 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.651 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.413 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.62 seconds)

 

 

 

 

Vectorized query execution processes data in batches of 1024 rows at a time instead of row by row:

 

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;

 

select count(*) from hyl_test_par where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.54 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.859 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.134 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.844 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.089 seconds)
