When Hive runs statistical queries over large data volumes, the following OOM error often appears, with this message:
Possible error: Out of memory due to hash maps used in map-side aggregation. Solution: Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;'
The failure message of the task reads:
Error: GC overhead limit exceeded
This error generally arises in one of two situations: (1) the Hive SQL is written poorly, so execution builds an excessively large hash map; (2) the Hive SQL has no room left for optimization (the only way to get the desired data is to write it that way).
For (1), rewrite the SQL statement to shrink the hash map. For (2), tune the configuration parameters.
The two situations are covered in turn below:
(1) Rewriting the SQL statement
select count(distinct v) from tbl; can be rewritten as select count(1) from (select v from tbl group by v) t;
說明:減小了hash map的key個數 .net
select collect_set(messageDate)[0], count(*) from incidents_hive group by substr(messageDate,9,2); can be rewritten as select hourNum, count(1) from (select substr(messageDate,9,2) as hourNum from incidents_hive) t group by hourNum;
說明:沒有減小hash map的key個數,可是減小了value的大小code
(2) Tuning the parameters
For the following SQL, no rewrite can shrink the hash map (the repetition rate of keywords is very low, so the in-memory Map object maintained during the map phase grows enormous):
INSERT OVERWRITE TABLE hbase_table_poi_keywords_count SELECT concat(substr(key,0,8), svccode, keywords), substr(key,0,8), svccode, keywords, count(*) where substr(key,0,8)=\"$yesterday\" AND length(keywords)>0 AND svccode is not null GROUP BY substr(key,0,8),svccode,keywords;
(The statement is embedded in a shell script, hence the escaped quotes around $yesterday; the FROM clause naming the source table is missing in the original.)
The optimization parameters related to map join and map-side aggregation are:
hive.map.aggr
hive.groupby.mapaggr.checkinterval
hive.map.aggr.hash.min.reduction
hive.map.aggr.hash.percentmemory
hive.groupby.skewindata
These parameters can be tuned by consulting their descriptions in the configuration file and the documentation. If the requirement really cannot be met by tuning them, then set hive.map.aggr=false is the final fallback: it will always satisfy the requirement, just more slowly than map join and map-side aggregation, though after actually running the job you may well find it is not slow at all.
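As a rough sketch of how these parameters might be adjusted in a Hive session (the specific values below are illustrative assumptions, not recommendations from this post; tune them against your own data volume and mapper heap size):

```sql
-- Enable map-side aggregation (the default in recent Hive versions).
set hive.map.aggr=true;

-- After this many input rows, check whether map-side aggregation is paying off.
set hive.groupby.mapaggr.checkinterval=100000;

-- If (distinct keys / rows checked) exceeds this ratio, Hive abandons
-- map-side aggregation for the query, avoiding a huge hash map.
set hive.map.aggr.hash.min.reduction=0.5;

-- Fraction of the mapper heap the aggregation hash map may occupy
-- before it is flushed; lower this if OOM persists, as the error
-- message itself suggests.
set hive.map.aggr.hash.percentmemory=0.25;

-- For skewed group-by keys, split the aggregation into two MR jobs so
-- the first job spreads the skewed keys randomly across reducers.
set hive.groupby.skewindata=true;

-- Last resort: disable map-side aggregation entirely.
-- set hive.map.aggr=false;
```

Disabling hive.map.aggr trades mapper memory for extra shuffle traffic, which is why it is listed last.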
References:
http://blog.csdn.net/macyang/article/details/9260777
http://www.myexception.cn/open-source/1487747.html
http://blog.csdn.net/lixucpf/article/details/20458617