1. Error: java.io.IOException: Incompatible clusterIDs — usually appears right after the NameNode has been reformatted
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
Cause: every namenode format creates a new namenode/cluster ID, while the DataNode's data directory still contains the ID from the previous format. Formatting clears the data under the NameNode but not the data under the DataNodes, so startup fails. What you have to do is clear out all of the directories under the data path before every format.
Solution: stop the cluster, delete everything under the problem node's data directory — that is, the dfs.data.dir directory configured in hdfs-site.xml — and reformat the NameNode.
An easier alternative: stop the cluster, then edit /dfs/data/current/VERSION on the DataNode and change its clusterID so it matches the NameNode's.
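A minimal sketch of both options, assuming the dfs.data.dir from the log above (/data/dfs/data) and a NameNode directory of /data/dfs/name, which is not stated in the original:
# stop the cluster first
stop-all.sh
# option A: wipe the DataNode data directory (dfs.data.dir), then reformat the NameNode
rm -rf /data/dfs/data/*
hadoop namenode -format
# option B: keep the block data and just align the clusterID with the NameNode
grep clusterID /data/dfs/name/current/VERSION    # run on the NameNode host
vim /data/dfs/data/current/VERSION               # paste that clusterID into the DataNode's file
start-all.sh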
2. Error: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container
14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: the clocks on the NameNode and DataNodes are not synchronized.
Solution: synchronize every DataNode with the NameNode. Run ntpdate time.nist.gov on each server and confirm the time sync succeeded.
It is best to also add a line to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
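To push the sync to every node in one go, something like the following works (a sketch — it assumes passwordless ssh and a nodes.txt file listing the slave hostnames, neither of which comes from the original post):
for host in $(cat nodes.txt); do
  ssh "$host" 'ntpdate time.nist.gov && hwclock -w && date'
done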
3. Error
2015-04-07 23:12:39,837 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-04-07 23:12:39,838 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 41.
at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2015-04-07 23:12:39,842 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
Cause: the NameNode metadata is corrupted and needs to be repaired.
Solution: recover the NameNode:
hadoop namenode -recover
Choose 'c' (continue) at every prompt and it is usually fine.
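Because recovery rewrites the metadata, it is worth copying the name directory aside first — a sketch, assuming dfs.name.dir is /data/dfs/name (the original does not say where it lives):
# back up the NameNode metadata, then run the recovery
cp -a /data/dfs/name /data/dfs/name.bak.$(date +%F)
hadoop namenode -recover    # answer 'c' (continue) at the prompts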
4. Error
2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180) at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
Cause: the error is clear enough — the disk is out of space. The confusing part was that checking each node showed disk usage below 40%, with plenty of space left.
It took a long time to figure out: one map task produced a large amount of output while running, and before the failure disk usage climbed steadily until it hit 100% and the error was thrown. The failed task then released its space and was reassigned to another node. Because the space had already been freed, the disks looked mostly empty afterwards, even though the error complained about insufficient space.
The lesson: monitoring while jobs are running matters.
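A crude way to catch this kind of transient spike is to log disk usage while the job runs and inspect the log after a failure — a sketch; the paths are examples, so point it at your own mapreduce/yarn local dirs:
while true; do date; df -h /tmp /data; sleep 10; done >> disk_usage.log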
5. Error
2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
Cause: the local disk, not HDFS, ran out of space (I was debugging the program in MyEclipse and the local tmp directory filled up).
Solution: clean up or add disk space.
6. Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster.
Cause: two possibilities — either hadoop.tmp.dir or the data directory is out of space.
Solution: my dfs status showed data usage under 40%, so the guess was that hadoop.tmp.dir was out of space and the job's temporary files could not be created. core-site.xml turned out not to configure hadoop.tmp.dir at all, so the default /tmp directory was being used; anything there is lost whenever the server reboots, so it has to be changed. Add:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
Then reformat: hadoop namenode -format
and restart.
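The whole fix, end to end, looks roughly like this (a sketch — reformatting wipes the HDFS metadata, so only do it on a cluster whose data you can lose or have backed up):
stop-all.sh
mkdir -p /data/tmp      # create the new hadoop.tmp.dir on every node
# add the hadoop.tmp.dir property above to core-site.xml and copy it to all nodes
hadoop namenode -format
start-all.sh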
7. Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: the write cannot proceed. My environment has 3 DataNodes and the replication factor is 3, so a write goes through a pipeline of 3 machines. The default replace-datanode-on-failure.policy is DEFAULT: when the cluster has 3 or more DataNodes, the client tries to find another DataNode to copy to. With only 3 machines available, as soon as one DataNode has a problem the write can never succeed.
Solution: edit hdfs-site.xml and add or change the following two properties:
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
<value>true</value>
</property>
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
<value>NEVER</value>
</property>
dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies the replacement policy at all when a write fails; the default of true is fine.
As for dfs.client.block.write.replace-datanode-on-failure.policy: with DEFAULT, when there are 3 or more replicas the client tries to swap in a new DataNode, while with 2 replicas it keeps writing without replacing. On a cluster with only 3 DataNodes, a single unresponsive node breaks every write, so the replacement can simply be turned off (NEVER).
8. Error: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:722)
Cause: the file operation outlived its lease — in effect, the file was deleted while the data stream was still operating on it.
Solution:
Edit hdfs-site.xml (for 2.x; on 1.x the property name is dfs.datanode.max.xcievers):
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value>
</property>
Copy the file to every DataNode and restart the DataNodes.
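Distributing the change and bouncing only the DataNodes can be scripted, roughly like this (a sketch — it assumes passwordless ssh, HADOOP_HOME exported on every node, and the 2.x layout with etc/hadoop and sbin; on 1.x the config lives under conf/ and the scripts under bin/):
for host in $(cat $HADOOP_HOME/etc/hadoop/slaves); do
  scp $HADOOP_HOME/etc/hadoop/hdfs-site.xml "$host:$HADOOP_HOME/etc/hadoop/"
  ssh "$host" "$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode; $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
done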
9. Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.
Solution:
Edit hdfs-site.xml and add the dfs.datanode.socket.write.timeout and dfs.socket.timeout properties:
<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>6000000</value>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>6000000</value>
</property>
Note: the timeout values are in milliseconds; 0 means no limit.
10. Formatting more than once leaves the NameNode and DataNode namespaceIDs inconsistent.
Error: ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /var/lib/hadoop-0.20/cache/hdfs/dfs/data: namenode
namespaceID = 240012870; datanode namespaceID = 1462711424
Solution 1: edit dfs/data/current/VERSION under $hadoop.tmp.dir and change the namespaceID so the two match.
Solution 2: a bit brutal — just wipe the hadoop.tmp.dir directory, which on my machine is /home/work/hadoop_tmp.
Analysis:
Many newcomers run into this. What exactly is hadoop.tmp.dir? Let's take a closer look:
The original post showed two screenshots here: Figure 1 was opening the config with vim hdfs-site.xml, and Figure 2 was the contents of that file, which lives under hadoop/conf.
hadoop.tmp.dir is Hadoop's storage directory — much like deciding whether your data goes on the C: drive or the D: drive, except that with Linux's filesystem it ends up under /home/work/hadoop_tmp here.
Now that hadoop.tmp.dir is clear, go one step further:
open dfs/data/current/VERSION with vim
and edit it: changing the namespaceID is all that is needed.
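To see the mismatch concretely before editing, something like this helps (a sketch; the paths follow the /home/work/hadoop_tmp example above — adjust them to your own hadoop.tmp.dir):
grep namespaceID /home/work/hadoop_tmp/dfs/name/current/VERSION   # on the NameNode
grep namespaceID /home/work/hadoop_tmp/dfs/data/current/VERSION   # on the DataNode
vim /home/work/hadoop_tmp/dfs/data/current/VERSION                # make the DataNode value match, then restart it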
11. The DataNode or JobTracker went down — start it on its own:
hadoop-daemon.sh start datanode
hadoop-daemon.sh start jobtracker
12. Dynamically adding a DataNode — bringing a node into a running cluster:
hadoop-daemon.sh --config ./conf start datanode
hadoop-daemon.sh --config ./conf start tasktracker
Related reading:
the article "hadoop集羣添加namenode的步驟及常識" (steps and basics for adding a namenode to a Hadoop cluster).
13. error: unmappable character for encoding UTF8 while running
The Java source was not UTF-8 encoded, so it could not be parsed after submission. Set the Eclipse encoding to UTF-8.
How to change it:
go to Window -> Preferences,
find the workspace settings and change the text file encoding there.
14. A job submitted from Eclipse on Windows does not go through.
Cause: the local administrator user (the local Windows account) has no permission to operate on the remote Hadoop system.
Solutions:
1. In a test environment you can disable HDFS permission checking: open conf/hdfs-site.xml, find the dfs.permissions property, and set it to false (the default is true). Done. (On 1.2.1 this is the only method that works.) See issue 1 for how to edit the file.
2. Change the Hadoop location parameters: on the Advanced Parameters tab, find hadoop.job.ugi and set it to the user name that started Hadoop.
3. Rename the Windows account to the Hadoop user name.
15. Eclipse cannot connect to the remote cluster.
1. Rule out the firewall.
2. Fix permissions.
3. Use a static IP.
4. Check that the cluster is actually running.
16. Java heap space OutOfMemory while running
Edit hadoop-env.sh and change: export HADOOP_CLIENT_OPTS="-Xmx128m $HADOOP_CLIENT_OPTS"
to: export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"
17. Eclipse runs fail with Name node is in safe mode
When the distributed filesystem starts, it begins in safe mode. While it is in safe mode, nothing in the filesystem may be modified or deleted until safe mode ends. Safe mode exists so that, at startup, the system can check the validity of the data blocks on each DataNode and copy or delete blocks as the policy requires. Safe mode can also be entered at runtime with a command. In practice, if you modify or delete files while the system is still starting you will see the same safe-mode error; usually you only need to wait a little while.
safemode parameters:
enter - enter safe mode
leave - force the NameNode out of safe mode
get - report whether safe mode is on
wait - block until safe mode ends
Solution: bin/hadoop dfsadmin -safemode leave
18. Invalid Hadoop Runtime specified; please click 'Configure Hadoop install directory' or fill in library location input field
Solution: in Eclipse go to Window -> Preferences -> Map/Reduce and select the Hadoop root directory.
19. storage directory does not exist or is not accessible.
Reformat: bin/hadoop namenode -format (take care not to mistype it).
20. HBase
INFO org.apache.hadoop.hbase.util.FSUtils: Waiting for dfs to exit safe mode...
bin/hadoop dfsadmin -safemode leave (leave safe mode)
21. On Windows 7, ssh will not start — error: ssh: connect to host localhost port 22: Connection refused
Enter the Windows login user name.
22. Possible reasons the NameNode is missing when starting Hadoop:
(1) the NameNode was not formatted
(2) the environment variables are misconfigured
(3) the IP/hostname binding failed
(4) the hostname contains a special character such as '.' (a dot), which gets misparsed
23. Address already in use
Error: org.apache.hadoop.hdfs.server.namenode.NameNode: Address already in use
Solution: find the PID holding the port: netstat -tunlp
kill -9 PID
and if all else fails: killall -9 java
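For example, to find and kill whatever is sitting on the NameNode port (9000 here is only an example port — use the one from your error):
PID=$(netstat -tunlp | grep ':9000 ' | awk '{print $NF}' | cut -d/ -f1)   # last column is PID/program
kill -9 "$PID"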
24. Safe mode
Error:
bin/hadoop fs -put ./input input
put: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/root/input. Name node is in safe mode.
hadoop dfsadmin -safemode leave
Solution:
The NameNode starts up in safe mode. If the fraction of blocks missing on the DataNodes reaches a certain ratio (1 - dfs.safemode.threshold.pct), the system stays in safe mode, i.e. read-only.
dfs.safemode.threshold.pct (default 0.999f) means that at HDFS startup, the NameNode leaves safe mode only once the number of blocks reported by the DataNodes reaches 0.999 of the block count recorded in the metadata; otherwise it stays read-only. Setting it to 1 keeps HDFS in safe mode forever.
The following line is taken from a NameNode startup log (the reported block ratio of 1.0 reached the 0.999 threshold):
The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 14 seconds.
There are two ways to leave safe mode:
(1) lower dfs.safemode.threshold.pct to a smaller value (the default is 0.999);
(2) force it with hadoop dfsadmin -safemode leave
Safe mode is controlled with dfsadmin -safemode <value>, where value is one of:
enter: enter safe mode
leave: force the NameNode out of safe mode
get: report whether safe mode is on
wait: block until safe mode ends
25. could only be replicated to 0 nodes, instead of 1
Error:
hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop
.ipc.RemoteException: java.io.IOException: ... could only be replicated to 0 nodes, instead of 1 ...
Possible symptom: jps shows every process running, but the web UI shows 0 live nodes — the DataNode process started, yet the DataNode never came up properly.
Solution 1:
(1) Firewall:
disable it permanently: chkconfig iptables off (and stop it right away with service iptables stop)
(2) namespaceIDs are inconsistent
(3) Disk space:
df -ah   # check disk space
If the disk really is full, free up or add space (in the original screenshot it was indeed full).
If none of the above helps, the following also works (but loses data, so use it with care):
A. run stop-all.sh first
B. format the NameNode, but delete the old directory beforehand —
that is, the directory that hadoop.tmp.dir in core-site.xml points to;
after deleting it, remember to recreate the configured (empty) directory, then run hadoop namenode -format.
Solution 2:
26. java.net.UnknownHostException at startup
Cause: localhost.localdomain cannot be mapped to an IP address at all.
Solution: check /etc/hosts and add the machine's hostname to it.
27. Startup error: java.io.IOException: File jobtracker.info could only be replicated to 0 nodes, instead of 1.
Solution:
First, check whether the firewall is off and whether it is interfering with communication between the nodes;
next, check whether the namespaceID values on the NameNode and the DataNodes match — a mismatch causes this problem; make them the same and restart the node;
then take HDFS out of safe mode:
hadoop dfsadmin -safemode leave
Also check that the hostname mappings in /etc/hosts are correct; do not use 127.0.0.1 or localhost.
There are two ways to turn safe mode off:
run the command above to force it off, or add the following to hdfs-site.xml to set a smaller safemode threshold.pct, which avoids repeatedly hitting the Name node is in safe mode error during Hadoop runs and having to force safe mode off:
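Presumably the property in question is the safe-mode threshold, something along these lines (0.999 is the default; the 0.95 here is only an example value):
<property>
<name>dfs.safemode.threshold.pct</name>
<value>0.95</value>
</property>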
28. IP and hostname resolution problems
can also cause the File jobtracker.info could only be replicated to 0 nodes, instead of 1 error. Again, check that the hostname mappings in /etc/hosts are correct and do not use 127.0.0.1 or localhost.
29. Error: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Symptom: the job hangs at the reduce phase and goes no further.
Cause: at the end, the job copies all of the reduce task data to one machine for the final overall reduce, which requires the IPs and hostnames to be configured correctly.
Solution: configure each node's IP and hostname correctly, never 127.0.0.1 or localhost; using internal-network IPs also speeds up communication.
30. Hive execution error: java.lang.OutOfMemoryError: GC overhead limit exceeded
Cause:
This error type was added in JDK 6. It is a protection mechanism that fires when GC spends a great deal of time freeing only a tiny amount of memory.
Solution:
Disable that check, or add a JVM startup option to limit memory use:
add the mapred.child.java.opts property in mapred-site.xml
with the value: -XX:-UseGCOverheadLimit
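In mapred-site.xml that looks roughly like this:
<property>
<name>mapred.child.java.opts</name>
<value>-XX:-UseGCOverheadLimit</value>
</property>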
31. The TaskTracker on a slave starts, but the DataNode does not.
This is usually caused by a Hadoop upgrade. Delete the directory hadoop.tmp.dir points to and reformat the NameNode — but export your data first, otherwise it is gone.
32. After formatting the NameNode (bin/hadoop namenode -format), restarting the cluster gives:
Incompatible namespaceIDS in ... :namenode namespaceID = ... ,datanode namespaceID=...
The cause is that formatting the NameNode creates a new namespaceID that no longer matches the one already stored on the DataNodes.
Solution: make the namespaceIDs consistent again — either edit dfs/data/current/VERSION on the DataNodes to match the NameNode (see issue 10), or clear the DataNode data directories before reformatting.
33. When starting the cluster with start-all.sh, the slaves' DataNodes fail to start and report:
... could only be replicated to 0 nodes, instead of 1 ...
Most likely a node identifier is duplicated (my guess at the cause); there may be other reasons too. Try the solutions in turn — they fixed it for me.
Solution: work through the same checks as issue 25 — firewall, namespaceID consistency, and disk space.
34. The program fails with Error: java.lang.NullPointerException
A null pointer exception — make sure the Java code is correct: declare and instantiate variables before using them, watch out for array index out of bounds, and so on. Check the program.
35. When running your own program and getting (all sorts of) errors, make sure of the following:
the command is written correctly, for example:
$ hadoop jar myCount.jar myCount input output
36. For ssh communication problems, see "雲技術基礎:集羣搭建SSH的做用及這些命令的含義" (cloud basics: what SSH does in cluster setup and what those commands mean).
37. Compilation problems — all kinds of missing packages: make sure you have added the jars from the hadoop directory and from hadoop/lib.
(For details see "hadoop開發方式總結及操做指導", a summary of Hadoop development approaches and usage.)
38. Starting the DataNode fails with Unrecognized option: -jvm and Could not create the Java virtual machine.
bin/hadoop under the Hadoop install directory contains a shell fragment along these lines:
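Simplified, the check in the 1.x-era bin/hadoop looks roughly like this (details vary by version):
if [[ $EUID -eq 0 ]]; then
  # running as root: the script passes -jvm server, which newer JVMs reject
  HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
else
  HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
fi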
$EUID is the effective user id; for root it is 0, which triggers the -jvm branch. So simply avoid operating Hadoop as root — this is also why the setup article advises against using the root user.
39. If the error in the terminal is:
ERROR hdfs.DFSClient: Exception closing file /user/hadoop/musicdata.txt : java.io.IOException: All datanodes 10.210.70.82:50010 are bad. Aborting...
together with this in the jobtracker log:
Error register getProtocolVersion
java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
and possibly warnings like:
WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Broken pipe
WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_3136320110992216802_1063java.io.IOException: Connection reset by peer
WARN hdfs.DFSClient: Error Recovery for block blk_3136320110992216802_1063 bad datanode[0] 10.210.70.82:50010 put: All datanodes 10.210.70.82:50010 are bad. Aborting...
Solution:
40. If running a Hadoop jar produces:
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.NullWritable, recieved org.apache.hadoop.io.LongWritable
or something like:
Status : FAILED java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
then you need to learn the basics of Hadoop data types and the map/reduce model. My reading notes cover Hadoop's built-in data types and how to define your own (mainly the Writable classes), and another post covers MapReduce types and formats — in other words chapter 4 (Hadoop I/O) and chapter 7 (MapReduce Types and Formats) of Hadoop: The Definitive Guide. If you are in a hurry, the quick fix is below, though skipping the background will hurt your later development:
make the types line up:
... extends Mapper...
public void map(k1 k, v1 v, OutputCollector output)...
...
...extends Reducer...
public void reduce(k2 k,v2 v,OutputCollector output)...
...
job.setMapOutputKeyClass(k2.class);
job.setMapOutputValueClass(v2.class);
job.setOutputKeyClass(k3.class);
job.setOutputValueClass(v3.class);
...
Note how each k* and v* has to correspond. I still recommend reading the two chapters just mentioned to understand why.
41. If a DataNode reports the following:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Cannot lock storage /data1/hadoop_data. The directory is already locked.
The message says the storage directory is locked and cannot be read. Check whether a related process is still running, either locally or on the slave machine, using these two Linux commands:
netstat -nap
ps aux | grep <the relevant PID>
If Hadoop-related processes are still running, kill them, then run start-all.sh again.
42. If the jobtracker reports:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
fix the /etc/hosts file on the DataNodes.
A quick note on the hosts format:
each line has three parts — the IP address, the hostname or domain name, and an optional alias.
The steps in detail:
1. Check the hostname first:
cat /proc/sys/kernel/hostname
You will see a HOSTNAME value; change the value after it to the IP and exit.
2. Run:
hostname ***.***.***.***
replacing the asterisks with the corresponding IP.
3. Edit the hosts file to something like:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.200.187.77 10.200.187.77 hadoop-datanode
If the node now shows up by IP address, the change worked; if it still shows the hostname, keep fixing the hosts file.
(In the original screenshot here, chenyi was the hostname.)
In a test environment, standing up your own DNS server feels like overkill to me, so the simple approach is to just use IP addresses. If you do have a DNS server, configure the mapping there instead.
If the shuffle error still appears after that, try what other people have suggested — add the following to hdfs-site.xml:
<property>
<name>dfs.http.address</name>
<value>*.*.*.*:50070</value>
</property>
Do not change the port; replace the asterisks with the IP. Hadoop moves this information over HTTP, so the port stays the same.
43. If the jobtracker reports:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code *
This is the subprocess's system return code surfaced by Java; look up what the specific code means.
I hit this while writing a PHP streaming job: the code was 2 — No such file or directory, i.e. the file or directory could not be found. It turned out I had simply forgotten to prefix the command with 'php ...'. Others online report that include/require statements can cause it too. Adjust according to your own situation and error code.