最近線上的的nm 有crash的問題,查看錯誤日誌:
java
2014-06-19 00:01:22,308 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Error: Shutting downjava.util. ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761) at java.util.LinkedList$ListItr.next(LinkedList.java:696) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.toString(LocalizedResource.java:120) at java.lang.String.valueOf(String.java:2826) at java.lang.StringBuilder.append(StringBuilder.java:115) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:656) 2014-06-19 00:01:22,308 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting 2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://bipcluster/tmp/hive-hdfs/hive_2014-06-19_00-05-51_049_5891972191087895437/-mr-10004/a1495555-b0dc-4356-8b68-1c881012e123, 1403107405580, FILE, null } 2014-06-19 00:03:40,685 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.util.concurrent.RejectedExecutionException at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:618) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:514) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:456) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:128) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77) at java.lang.Thread.run(Thread.java:662) 2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.
是在作resource localize時多線程的併發更新問題致使nm異常退出
這是一個bug,bug id:
https://issues.apache.org/jira/browse/YARN-573
bug描述:
node
Shared data structures in Public Localizer and Private Localizer are not Thread safe. PublicLocalizer 1) pending accessed by addResource (part of event handling) and run method (as a part of PublicLocalizer.run() ). PrivateLocalizer (LocalizerRunner?) 1) pending accessed by addResource (part of event handling) and findNextResource (i.remove()). Also update method should be fixed. It too is sharing pending list.
控制resource localize的有兩個線程
PublicLocalizer 和 LocalizerRunner,一個用來控制public文件的下載,一個用來控制private文件的下載,二者都會操做pending,fix的方法就是增長同步,這個bug已經在cdh5.2.0的yarn中fix了。
關於觸發java.util.ConcurrentModificationException的異常能夠參考:
apache
http://examples.javacodegeeks.com/java-basics/exceptions/java-util-concurrentmodificationexception-how-to-handle-concurrent-modification-exception/