ETL: HiveSQL Tuning (Setting the Number of Map and Reduce Tasks)

Preface:

Recently I noticed that Hive SQL jobs were running particularly slowly. Earlier posts covered optimizing left joins and unions; below we look at how increasing or decreasing the number of reduce (and map) tasks can speed up HQL.

Reference: http://www.cnblogs.com/liqiu/p/4873238.html

Analysis:

select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10'; 

As explained in the previous post, this query needs 8 mappers and 1 reducer and takes 52 seconds to run. For the full record, see: http://www.cnblogs.com/liqiu/p/4873238.html

增長Reduce的數量:

First, how the default number of reducers is determined: each reduce task handles a fixed volume of input data, 1000^3 bytes (~1 GB) by default, controlled by hive.exec.reducers.bytes.per.reducer; and each job runs at most 999 reducers by default, controlled by hive.exec.reducers.max.

In other words, if the total reduce input (the map output) is no larger than 1 GB, only one reduce task is launched;

and since table b2c_money_trace is 2.4 GB, the job gets 3 reducers, for example:

hive> select count(1) from b2c_money_trace where operate_time = '2015-10-10' group by operate_time;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3623421, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3623421/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3623421
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 3
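
The estimate of 3 comes from dividing the 2.4 GB of input by the ~1 GB default per-reducer load and rounding up. If you would rather raise Hive's estimate than hard-code a count, you can shrink the per-reducer load instead. A minimal sketch against the same table (the 500 MB value is illustrative, not from the original run):

set hive.exec.reducers.bytes.per.reducer=500000000;
-- ~500 MB per reducer: Hive would now estimate ceil(2.4 GB / 500 MB) = 5 reducers
select count(1) from b2c_money_trace where operate_time = '2015-10-10' group by operate_time;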

Now back to the example at the start. Run:

set mapred.reduce.tasks = 8; 

The execution result:

hive> set mapred.reduce.tasks = 8;                                                                                                    
hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Cannot run job locally: Input Size (= 380265495) is larger than hive.exec.mode.local.auto.inputbytes.max (= 50000000)
Starting Job = job_1434099279301_3618454, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3618454/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3618454
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 8
2015-10-14 15:31:55,570 Stage-1 map = 0%,  reduce = 0%
2015-10-14 15:32:01,734 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.63 sec
2015-10-14 15:32:02,760 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
2015-10-14 15:32:03,786 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
2015-10-14 15:32:04,812 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:05,837 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:06,892 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:07,947 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:08,983 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:10,039 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:11,088 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:12,114 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:13,143 Stage-1 map = 75%,  reduce = 19%, Cumulative CPU 24.28 sec
2015-10-14 15:32:14,170 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
2015-10-14 15:32:15,197 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
2015-10-14 15:32:16,224 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.58 sec
2015-10-14 15:32:17,250 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.95 sec
2015-10-14 15:32:18,277 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 37.02 sec
2015-10-14 15:32:19,305 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 48.93 sec
2015-10-14 15:32:20,332 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 49.31 sec
2015-10-14 15:32:21,359 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 57.99 sec
2015-10-14 15:32:22,385 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 61.88 sec
2015-10-14 15:32:23,411 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
2015-10-14 15:32:24,435 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
MapReduce Total cumulative CPU time: 1 minutes 11 seconds 560 msec
Ended Job = job_1434099279301_3618454
MapReduce Jobs Launched: 
Job 0: Map: 8  Reduce: 8   Cumulative CPU: 71.56 sec   HDFS Read: 380267639 HDFS Write: 330 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 11 seconds 560 msec
OK
9718    210296076
9718    210299105
9718    210295344
9718    210295277
9718    210295586
9718    210295050
9718    210301363
9718    210297733
9718    210298066
9718    210295566
9718    210298219
9718    210296438
9718    210298328
9718    210298008
9718    210299712
9718    210295239
9718    210297567
9718    210295525
9718    210294949
9718    210296318
9718    210294421
9718    210295840
Time taken: 36.978 seconds, Fetched: 22 row(s)

As you can see, 8 reducers shortened the job noticeably: about 37 seconds, down from 52 with a single reducer.
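
Before moving on to the mapper experiments, it helps to undo the manual override so later jobs fall back to Hive's own estimate. A one-line sketch, relying on the convention that a negative value (which is also the default) means "let Hive decide":

set mapred.reduce.tasks=-1;
-- restores automatic estimation from the input data size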

增長Map的數量:

Table size:

The example above is not useful for tuning the mapper count, so look at this table instead:

hive> dfs -ls -h /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace;
Found 4 items
-rw-r--r--   3 ticketdev ticketdev    600.0 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/24f19a74-ca91-4fb2-9b79-1b1235f1c6f8
-rw-r--r--   3 ticketdev ticketdev    597.2 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/34ca13a3-de44-402e-9548-e6b9f92fde67
-rw-r--r--   3 ticketdev ticketdev    590.6 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/ac249f44-60eb-4bf7-9c1a-6f643873b823
-rw-r--r--   3 ticketdev ticketdev    606.5 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/f587fec9-60da-4f18-8b47-406999d95fd1

2.4 GB in total.

Block size:

hive> set dfs.block.size;
dfs.block.size=134217728

Note: 134217728 bytes is 128 MB.

Number of mappers:

The files are ~600 MB each, four in all, and each block is 128 MB, so each file yields ceil(600/128) = 5 splits, giving 5 * 4 = 20 mappers:

hive> select count(1) from b2c_money_trace;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3620170, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620170/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3620170
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 1

Note the "number of mappers: 20" in the job information above.

Now set the split size that controls how the input is divided among mappers:

set mapred.max.split.size=50000000;
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

A brief explanation:

50000000 is in bytes, roughly 50 MB;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small input files before execution; the merging itself barely comes into play here, since these files are large.

The other three parameters tell Hive to carve the input into splits of about 50 MB.
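
To confirm a setting took effect before running anything, issue set with just the property name in the Hive CLI and it prints the current value, e.g.:

hive> set mapred.max.split.size;
mapred.max.split.size=50000000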

The execution result:

hive> select count(1) from b2c_money_trace;       
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3620223, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620223/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3620223
Hadoop job information for Stage-1: number of mappers: 36; number of reducers: 1

Each file is about 600 MB, so a 50 MB split size produces roughly 600/50 = 12 raw splits per file, or 48 in total; after CombineHiveInputFormat merges the small trailing pieces, the job ends up launching 36 mappers. Note the "number of mappers: 36" above, up from 20.

Conclusion:

More map and reduce tasks are not always better: each additional task consumes resources, and the total runtime will not necessarily improve. The best approach is to tune the counts to a reasonable level for the actual data volume.
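
For convenience, here are the knobs used in this post, collected in one place with the values from the experiments above:

set hive.exec.reducers.bytes.per.reducer=1000000000; -- average load per reducer (the ~1 GB default)
set hive.exec.reducers.max=999;                      -- upper bound on reducers (the default)
set mapred.reduce.tasks=8;                           -- pin the reducer count explicitly
set mapred.max.split.size=50000000;                  -- target ~50 MB splits for mappers
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;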

