Recently a SQL job was taking more than two hours to run, so I set out to optimize it.

First, I looked at the counter data of the jobs generated by the Hive SQL and found
that the total CPU time spent was far too high: about 100.4319973 hours.
Broken down per map task,
the top map alone consumed 2.0540889 hours.
I suggest the following tuning:
1. mapreduce.input.fileinputformat.split.maxsize is currently 256000000; lower it to increase the number of maps (this paid off immediately: I set it to 32000000, which produced 500+ maps, and the job dropped from the original 2 hours to 47 minutes).
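The effect of lowering the split size can be sketched with the simplified split rule FileInputFormat uses (the real computation also considers per-file boundaries; the input size and block size below are hypothetical numbers chosen for illustration, not taken from the job above):

```java
public class SplitMath {
    // Simplified FileInputFormat rule: splitSize = max(minSize, min(maxSize, blockSize)),
    // and an input of totalBytes yields roughly ceil(totalBytes / splitSize) map tasks.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long numSplits(long totalBytes, long splitSize) {
        return (totalBytes + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long total = 16_000_000_000L;  // hypothetical 16 GB input
        long block = 128_000_000L;     // assumed 128 MB HDFS block size
        // With maxsize = 256 MB the block size caps the split at 128 MB:
        System.out.println(numSplits(total, splitSize(1, 256_000_000L, block))); // prints 125
        // With maxsize = 32 MB each split shrinks, so the map count multiplies:
        System.out.println(numSplits(total, splitSize(1, 32_000_000L, block)));  // prints 500
    }
}
```

More maps means each one does less regex-heavy work, which is why the wall-clock time fell even though total CPU time stayed roughly the same.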
2. Optimize the UDFs getPageID, getSiteId, and getPageValue (these methods do a lot of regular-expression text matching).
2.1 For regular-expression optimization, see:
http://www.fasterj.com/articles/regex1.shtml
http://www.fasterj.com/articles/regex2.shtml
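The single biggest regex win inside a per-row UDF is hoisting Pattern.compile out of the evaluate path: compile once into a class-level constant and reuse it for every row. A minimal sketch of the pattern (the regex and method name here are hypothetical, not the actual getPageID implementation):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageIdExtractor {
    // Compiled once per JVM instead of once per row; the pattern is illustrative only.
    private static final Pattern PAGE_ID = Pattern.compile("pageid=(\\d+)");

    public static String getPageId(String url) {
        if (url == null) return null;
        Matcher m = PAGE_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(getPageId("http://example.com/view?pageid=42&x=1")); // prints 42
    }
}
```

Calling Pattern.compile inside evaluate() would recompile the regex for every input row, which is exactly the kind of hidden cost that shows up as inflated CPU time in the job counters.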
2.2 For UDF optimization, see the advice below:
1. Use class-level private members to save on object instantiation and garbage collection. 2. You also get benefits by matching the args with what you would normally expect from upstream. Hive converts text to string when needed, but if the data normally coming into the method is Text, you could try matching the argument type and see if it is any faster. Example:

Before optimization:

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import java.net.URLDecoder;

public final class urldecode extends UDF {

    public String evaluate(final String s) {
        if (s == null) { return null; }
        return getString(s);
    }

    public static String getString(String s) {
        String a;
        try {
            a = URLDecoder.decode(s);
        } catch (Exception e) {
            a = "";
        }
        return a;
    }

    public static void main(String[] args) {
        String t = "%E5%A4%AA%E5%8E%9F-%E4%B8%89%E4%BA%9A";
        System.out.println(getString(t));
    }
}
```
After optimization:
```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.net.URLDecoder;

public final class urldecode extends UDF {

    // Reused across rows to avoid per-call allocation and GC pressure.
    private Text t = new Text();

    public Text evaluate(Text s) {
        if (s == null) { return null; }
        try {
            t.set(URLDecoder.decode(s.toString(), "UTF-8"));
            return t;
        } catch (Exception e) {
            return null;
        }
    }

    //public static void main(String[] args) {
    //    String t = "%E5%A4%AA%E5%8E%9F-%E4%B8%89%E4%BA%9A";
    //    System.out.println(getString(t));
    //}
}
```
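Beyond object reuse, the rewrite also fixes a latent bug: the deprecated single-argument URLDecoder.decode uses the platform default charset, while the two-argument form pins UTF-8. The decoding itself can be checked standalone with the same test string the original main method used:

```java
import java.net.URLDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        String encoded = "%E5%A4%AA%E5%8E%9F-%E4%B8%89%E4%BA%9A";
        // Explicit charset: identical result on every JVM, regardless of file.encoding.
        System.out.println(URLDecoder.decode(encoded, "UTF-8")); // prints 太原-三亚
    }
}
```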
2.3 Implement the UDF by extending GenericUDF, which avoids the per-call reflection overhead of the plain UDF interface.
3. If you are on Hive 0.14+, you can turn on hive.cache.expr.evaluation, the UDF result cache.