Recently I noticed our Hive SQL jobs running particularly slowly. Earlier posts already covered LEFT JOIN and UNION optimizations; here we look at raising or lowering the number of reducers (and mappers) to speed up HQL.
Reference: http://www.cnblogs.com/liqiu/p/4873238.html
select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
As shown in the previous post, this query needs 8 mappers and 1 reducer and runs in 52 seconds. For the full record see http://www.cnblogs.com/liqiu/p/4873238.html
First, how Hive picks the default number of reducers: each reduce task is sized to process hive.exec.reducers.bytes.per.reducer bytes of input (default 1000^3 = 1 GB), and the total is capped by hive.exec.reducers.max (default 999).
In other words, if the total reduce input (the map output) is no more than 1 GB, only one reduce task is launched;
and if the table b2c_money_trace is 2.4 GB, the reducer count comes out to 3. For example:
hive> select count(1) from b2c_money_trace where operate_time = '2015-10-10' group by operate_time;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3623421, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3623421/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3623421
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 3
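Hive's estimate can be sketched roughly as follows (a minimal illustration only, using the two parameters above; Hive's real planner has more special cases):

```python
import math

def estimated_reducers(input_bytes: int,
                       bytes_per_reducer: int = 1000 ** 3,  # hive.exec.reducers.bytes.per.reducer
                       max_reducers: int = 999) -> int:     # hive.exec.reducers.max
    """Rough sketch: one reducer per bytes_per_reducer of input, at least 1, capped."""
    return min(max_reducers, max(1, math.ceil(input_bytes / bytes_per_reducer)))

# ~2.4 GB of input, as with b2c_money_trace above
print(estimated_reducers(int(2.4 * 1000 ** 3)))  # 3
```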
Back to the opening example. Set:
set mapred.reduce.tasks = 8;
The result:
hive> set mapred.reduce.tasks = 8;
hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Cannot run job locally: Input Size (= 380265495) is larger than hive.exec.mode.local.auto.inputbytes.max (= 50000000)
Starting Job = job_1434099279301_3618454, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3618454/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3618454
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 8
2015-10-14 15:31:55,570 Stage-1 map = 0%,  reduce = 0%
2015-10-14 15:32:01,734 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.63 sec
2015-10-14 15:32:02,760 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
2015-10-14 15:32:03,786 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
2015-10-14 15:32:04,812 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:05,837 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:06,892 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:07,947 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:08,983 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:10,039 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:11,088 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:12,114 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:13,143 Stage-1 map = 75%,  reduce = 19%, Cumulative CPU 24.28 sec
2015-10-14 15:32:14,170 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
2015-10-14 15:32:15,197 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
2015-10-14 15:32:16,224 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.58 sec
2015-10-14 15:32:17,250 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.95 sec
2015-10-14 15:32:18,277 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 37.02 sec
2015-10-14 15:32:19,305 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 48.93 sec
2015-10-14 15:32:20,332 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 49.31 sec
2015-10-14 15:32:21,359 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 57.99 sec
2015-10-14 15:32:22,385 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 61.88 sec
2015-10-14 15:32:23,411 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
2015-10-14 15:32:24,435 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
MapReduce Total cumulative CPU time: 1 minutes 11 seconds 560 msec
Ended Job = job_1434099279301_3618454
MapReduce Jobs Launched:
Job 0: Map: 8  Reduce: 8   Cumulative CPU: 71.56 sec   HDFS Read: 380267639 HDFS Write: 330 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 11 seconds 560 msec
OK
9718	210296076
9718	210299105
9718	210295344
9718	210295277
9718	210295586
9718	210295050
9718	210301363
9718	210297733
9718	210298066
9718	210295566
9718	210298219
9718	210296438
9718	210298328
9718	210298008
9718	210299712
9718	210295239
9718	210297567
9718	210295525
9718	210294949
9718	210296318
9718	210294421
9718	210295840
Time taken: 36.978 seconds, Fetched: 22 row(s)
With 8 reducers the reduce phase is clearly faster: total time dropped from 52 seconds to about 37 seconds.
The mapper count can't be demonstrated with the example above, so look at this table instead:
hive> dfs -ls -h /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace;
Found 4 items
-rw-r--r--   3 ticketdev ticketdev  600.0 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/24f19a74-ca91-4fb2-9b79-1b1235f1c6f8
-rw-r--r--   3 ticketdev ticketdev  597.2 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/34ca13a3-de44-402e-9548-e6b9f92fde67
-rw-r--r--   3 ticketdev ticketdev  590.6 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/ac249f44-60eb-4bf7-9c1a-6f643873b823
-rw-r--r--   3 ticketdev ticketdev  606.5 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/f587fec9-60da-4f18-8b47-406999d95fd1
Roughly 2.4 GB in total.
hive> set dfs.block.size;
dfs.block.size=134217728
Note: 134217728 bytes is 128 MB.
The files are about 600 MB each, four of them, and each HDFS block is 128 MB, so ceil(600/128) × 4 = 5 × 4 = 20 mappers:
hive> select count(1) from b2c_money_trace;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3620170, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620170/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3620170
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 1
The last line of the output confirms it: the number of mappers is 20.
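That arithmetic can be sketched as follows (an illustration only; the real split computation also depends on the input format and min/max split settings):

```python
import math

BLOCK_SIZE = 134217728  # dfs.block.size = 128 MB

def estimated_mappers(file_sizes, split_size=BLOCK_SIZE):
    """One mapper per split; each file is split independently, rounding up."""
    return sum(math.ceil(size / split_size) for size in file_sizes)

# the four ~600 MB files of b2c_money_trace (sizes in MiB, converted to bytes)
files = [int(m * 1024 ** 2) for m in (600.0, 597.2, 590.6, 606.5)]
print(estimated_mappers(files))  # 20
```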
Now shrink the split size used to carve up the input for the mappers:
set mapred.max.split.size=50000000;
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
A quick explanation:
50000000 bytes is roughly 50 MB;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat merges small files before execution, though that capability is not exercised here;
the other three parameters tell Hive to carve the input into splits of roughly 50 MB.
The result:
hive> select count(1) from b2c_money_trace;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3620223, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620223/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3620223
Hadoop job information for Stage-1: number of mappers: 36; number of reducers: 1
The smaller split size raised the mapper count from 20 to 36 (see the last line above). A naive per-file count, ceil(600 MB / 50 MB) = 12 splits per file across four files, would suggest 48; CombineHiveInputFormat regroups splits by node and rack, so the job actually launched 36.
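As a rough sanity check, the naive per-file calculation (ignoring how CombineHiveInputFormat regroups splits by node and rack) gives an upper bound:

```python
import math

def naive_split_count(file_sizes, split_size=50000000):  # mapred.max.split.size
    """Upper-bound estimate: split each file independently, rounding up."""
    return sum(math.ceil(size / split_size) for size in file_sizes)

# four ~600 MB files; treat 600 MB as 6e8 bytes for the estimate
print(naive_split_count([600_000_000] * 4))  # 48 -- an upper bound;
# CombineHiveInputFormat merged some splits, so the job above ran with 36 mappers.
```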
More mappers and reducers are not always better: every extra task consumes cluster resources, and beyond a point the job does not get any faster. Tune both counts to a reasonable number for the actual workload.
Reference: http://lxw1234.com/archives/2015/04/15.htm