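I deployed druid.io across multiple nodes and ran into a problem when using the indexing service for batch data ingestion.
After submitting the task, the following messages appeared in the overlord node's log: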
2016-03-22T19:25:17,555 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid04:8091] wrote RUNNING status for task: index_wikipedia_2016-03-22T19:25:17.191+08:00
2016-03-22T19:25:23,219 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid04:8091] wrote FAILED status for task: index_wikipedia_2016-03-22T19:25:17.191+08:00
2016-03-22T19:25:23,219 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid04:8091] completed task[index_wikipedia_2016-03-22T19:25:17.191+08:00] with status[FAILED]
2016-03-22T19:25:23,219 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_wikipedia_2016-03-22T19:25:17.191+08:00
2016-03-22T19:25:23,219 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_wikipedia_2016-03-22T19:25:17.191+08:00] on worker[druid04:8091]
2016-03-22T19:25:23,222 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_wikipedia_2016-03-22T19:25:17.191+08:00] from activeTasks
2016-03-22T19:25:23,223 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_wikipedia_2016-03-22T19:25:17.191+08:00] from TaskLock[index_wikipedia_2016-03-22T19:25:17.191+08:00]
So the task failed while running on the worker node. The next step was to look at the middleManager node's log for the cause:
2016-03-22T19:25:18,227 INFO [forking-task-runner-0] io.druid.indexing.overlord.ForkingTaskRunner - Logging task index_wikipedia_2016-03-22T19:25:17.191+08:00 output to: /tmp/persistent/task/index_wikipedia_2016-03-22T19:25:17.191+08:00/log
2016-03-22T19:25:23,783 INFO [forking-task-runner-0] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[1] for task: index_wikipedia_2016-03-22T19:25:17.191+08:00
2016-03-22T19:25:23,787 INFO [forking-task-runner-0] io.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: /tmp/druid/indexlog/index_wikipedia_2016-03-22T19:25:17.191+08:00.log
2016-03-22T19:25:23,789 INFO [forking-task-runner-0] io.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: /tmp/persistent/task/index_wikipedia_2016-03-22T19:25:17.191+08:00
2016-03-22T19:25:23,811 INFO [WorkerTaskMonitor-0] io.druid.indexing.worker.WorkerTaskMonitor - Job's finished. Completed [index_wikipedia_2016-03-22T19:25:17.191+08:00] with status [FAILED]
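Disappointingly, this only shows that the forked process exited abnormally (status 1); it does not give the specific cause.
At this point I stopped the middleManager process and edited its configuration file, config/middleManager/runtime.properties, so that task logs are saved to a local directory: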
druid.indexer.logs.type=local
druid.indexer.logs.directory=/tmp/druid/indexlog
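With that in place, I restarted the middleManager process and resubmitted the task. The error was still there, but this time the specific cause could be found in the task log under the directory just configured, /tmp/druid/indexlog: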
) Not enough direct memory. Please adjust -XX:MaxDirectMemorySize, druid.processing.buffer.sizeBytes, or druid.processing.numThreads: maxDirectMemory[239,075,328], memoryNeeded[4,294,967,296] = druid.processing.buffer.sizeBytes[1,073,741,824] * ( druid.processing.numThreads[3] + 1 )
  at io.druid.guice.DruidProcessingModule.getIntermediateResultsPool(DruidProcessingModule.java:106)
  at io.druid.guice.DruidProcessingModule.getIntermediateResultsPool(DruidProcessingModule.java:106)
  while locating io.druid.collections.StupidPool<java.nio.ByteBuffer> annotated with @io.druid.guice.annotations.Global()
    for parameter 1 at io.druid.query.groupby.GroupByQueryEngine.<init>(GroupByQueryEngine.java:75)
  at io.druid.guice.QueryRunnerFactoryModule.configure(QueryRunnerFactoryModule.java:83)
  while locating io.druid.query.groupby.GroupByQueryEngine
    for parameter 0 at io.druid.query.groupby.GroupByQueryRunnerFactory.<init>(GroupByQueryRunnerFactory.java:79)
  at io.druid.guice.QueryRunnerFactoryModule.configure(QueryRunnerFactoryModule.java:80)
  while locating io.druid.query.groupby.GroupByQueryRunnerFactory
  while locating io.druid.query.QueryRunnerFactory annotated with @com.google.inject.multibindings.Element(setName=,uniqueId=26, type=MAPBINDER)
  at io.druid.guice.DruidBinders.queryRunnerFactoryBinder(DruidBinders.java:36)
  while locating java.util.Map<java.lang.Class<? extends io.druid.query.Query>, io.druid.query.QueryRunnerFactory>
    for parameter 0 at io.druid.query.DefaultQueryRunnerFactoryConglomerate.<init>(DefaultQueryRunnerFactoryConglomerate.java:34)
  while locating io.druid.query.DefaultQueryRunnerFactoryConglomerate
  at io.druid.guice.StorageNodeModule.configure(StorageNodeModule.java:53)
  while locating io.druid.query.QueryRunnerFactoryConglomerate
    for parameter 9 at io.druid.indexing.common.TaskToolboxFactory.<init>(TaskToolboxFactory.java:83)
  at io.druid.cli.CliPeon$1.configure(CliPeon.java:138)
  while locating io.druid.indexing.common.TaskToolboxFactory
    for parameter 0 at io.druid.indexing.overlord.ThreadPoolTaskRunner.<init>(ThreadPoolTaskRunner.java:76)
  at io.druid.cli.CliPeon$1.configure(CliPeon.java:164)
  while locating io.druid.indexing.overlord.ThreadPoolTaskRunner
  while locating io.druid.query.QuerySegmentWalker
    for parameter 3 at io.druid.server.QueryResource.<init>(QueryResource.java:90)
  while locating io.druid.server.QueryResource
That is a lot of output, but the first line is the one that matters: the failure is caused by not having enough direct memory.
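To make the numbers in that first line concrete, here is a minimal sketch that redoes the arithmetic from the error message. The formula and the three values are copied from the log; the function and variable names are my own and purely illustrative.

```python
# Values copied from the "Not enough direct memory" line in the task log.
BUFFER_SIZE_BYTES = 1_073_741_824   # druid.processing.buffer.sizeBytes
NUM_THREADS = 3                     # druid.processing.numThreads
MAX_DIRECT_MEMORY = 239_075_328     # maxDirectMemory reported for the peon JVM


def memory_needed(size_bytes: int, threads: int) -> int:
    """Formula quoted in the error message: sizeBytes * (numThreads + 1)."""
    return size_bytes * (threads + 1)


needed = memory_needed(BUFFER_SIZE_BYTES, NUM_THREADS)
shortfall = needed - MAX_DIRECT_MEMORY

print(f"needed:    {needed:,} bytes")
print(f"available: {MAX_DIRECT_MEMORY:,} bytes")
print(f"shortfall: {shortfall:,} bytes")
```

The peon asks for 4 GiB of direct memory while the JVM allows roughly 228 MiB, so either the limit has to go up or the buffer size and thread count have to come down.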
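But where exactly was too little memory being allocated? I have worked in C-family languages for years and my Java knowledge is essentially zero, so I had no idea; I kept comparing the various configuration files and the Java launch options against each other.
In the end I guessed that the problem was insufficient memory for the peon process that the middleManager forks.
I went through the manual looking for the middleManager parameters that configure the forked peon.
Eventually I found these two entries in config/middleManager/runtime.properties: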
#druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
#druid.indexer.fork.property.druid.processing.numThreads=1
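The defaults are probably a 1 GB buffer and 3 threads, which matches the values in the error message and exceeds the available direct memory. After changing these two settings and resubmitting the task, the task completed successfully.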
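As a sanity check on the fix, the sketch below reads the two fork properties from a properties-style text and applies the same formula as the error message. It assumes the two lines have been uncommented with the values shown above, and that the peon's direct memory limit is still the 239,075,328 bytes reported in the log; the helper names are hypothetical, not part of Druid.

```python
FORK_PREFIX = "druid.indexer.fork.property."
MAX_DIRECT_MEMORY = 239_075_328  # maxDirectMemory from the earlier error message


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def peon_direct_memory_needed(props: dict) -> int:
    """sizeBytes * (numThreads + 1), using the values passed to the forked peon."""
    size = int(props[FORK_PREFIX + "druid.processing.buffer.sizeBytes"])
    threads = int(props[FORK_PREFIX + "druid.processing.numThreads"])
    return size * (threads + 1)


# Assumed content of runtime.properties after uncommenting the two entries.
sample = """\
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
druid.indexer.fork.property.druid.processing.numThreads=1
"""

needed = peon_direct_memory_needed(parse_properties(sample))
fits = needed <= MAX_DIRECT_MEMORY
print(f"needed {needed:,} bytes, limit {MAX_DIRECT_MEMORY:,} bytes, fits: {fits}")
```

With a 100,000,000-byte buffer and one thread, the peon needs 200,000,000 bytes, which is under the reported limit. That is consistent with the task succeeding after the change.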