Made by Xiao He, unmade by Xiao He --- Pig JOIN's 'replicated'

    One-sentence summary: when Pig joins two tables, USING 'replicated' reads the latter table into memory, which speeds up the JOIN. But if the data loaded into memory exceeds the JVM's limit, the job dies with an out-of-memory error:

java.lang.OutOfMemoryError: Java heap space
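
For context, a fragment-replicate join looks like the minimal sketch below (relation names are illustrative). The small relation must be listed last, because that is the one Pig copies into every map task's memory:

big = LOAD 'big_table' USING PigStorage('\u0001') AS (id:int, name:chararray);
small = LOAD 'small_table' USING PigStorage('\u0001') AS (id:int, gender:chararray);
-- 'replicated' ships 'small' to every mapper and joins map-side, skipping the
-- shuffle entirely; fast, but 'small' must fit in the task's JVM heap.
J = JOIN big BY id, small BY id USING 'replicated';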

The story:

    Just before the holiday I wrote a Pig script for user-session processing. It passed every test, the data checked out, and I happily went home for Chinese New Year. T T The tragedy: over the holiday, error alerts arrived every single day. Luckily the backend that depends on this data had not officially launched, or I would have been dead. Back at the office I looked into it and found the script always got stuck at one particular JOIN; after 1200 s the task would be killed.

2013-02-16 12:40:23,520 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://hd09:50030/jobdetails.jsp?jobid=job_201301221227_72618
2013-02-16 13:47:50,157 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete
2013-02-16 13:47:52,171 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201301221227_72618 has failed! Stop running all dependent jobs
2013-02-16 13:47:52,171 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-02-16 13:47:52,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Task attempt_201301221227_72618_m_000000_1 failed to report status for 1201 seconds. Killing!
2013-02-16 13:47:52,176 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-02-16 13:47:52,178 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
Then the detailed error:

Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
        at java.lang.StringBuilder.append(StringBuilder.java:220)
        at java.io.UnixFileSystem.resolve(UnixFileSystem.java:108)
        at java.io.File.<init>(File.java:329)
        at org.apache.hadoop.mapred.TaskLog.getAttemptDir(TaskLog.java:267)
        at org.apache.hadoop.mapred.TaskLog.getAttemptDir(TaskLog.java:260)
        at org.apache.hadoop.mapred.TaskLog.getIndexFile(TaskLog.java:237)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:316)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369)
        at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
Exception in thread "LeaseChecker" java.lang.OutOfMemoryError: Java heap spac
What the hell, why an out-of-memory error?! This always ran fine before. Try a LIMIT to cut the data volume (see the sketch after the Pig script below) ===> the data comes out.

Now look at the Pig statements:

A = LOAD 'A/dt=2013-02-14' USING PigStorage('\u0001') AS (id:int, name:chararray);
B = LOAD 'B/*' USING PigStorage('\u0001') AS (id:int, gender:chararray);
C = FOREACH (JOIN A BY id, B BY id USING 'replicated') GENERATE A::id, A::name, B::gender;
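
The LIMIT test mentioned above amounts to something like this sketch (the cutoff value is illustrative, not from the original post):

-- Hypothetical sanity check: replicate only a slice of B and see if the job survives.
B_small = LIMIT B 100000;
C_test = FOREACH (JOIN A BY id, B_small BY id USING 'replicated') GENERATE A::id, A::name, B_small::gender;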
USING 'replicated'? That hint makes the join read the latter table, B, into memory, which speeds up the JOIN. I think I see it now: memory, memory, out of memory. Damn!! Kill USING 'replicated' and rerun ===> the data comes out. Then I got in touch with the colleague who maintains the Hadoop cluster. Sure enough, a lot of settings had been changed over the holiday to ease the load on the cluster. Mystery solved.
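
For the record, the join that came back clean is just the same statement without the hint; Pig then falls back to its default reduce-side join:

C = FOREACH (JOIN A BY id, B BY id) GENERATE A::id, A::name, B::gender;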

Made by Xiao He, unmade by Xiao He! Use USING 'replicated' with great care. Better still, avoid it: the hidden risk is too big, because a table B that keeps growing is bound to exceed the JVM limit sooner or later.
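
If the replicated join really has to stay, one stopgap is to give the map tasks a bigger heap from inside the script. This is only a sketch, under assumptions not in the original post: an MR1-era cluster (where the property is named mapred.child.java.opts) that allows per-job overrides.

-- Hypothetical heap bump for the tasks that hold B in memory (value is illustrative).
set mapred.child.java.opts '-Xmx2048m';

Even then this only buys time; the warning above still holds if B never stops growing.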
