1、Problem Background
1. The cloud host runs Linux, with Hadoop deployed in pseudo-distributed mode.
Public IP: 139.198.18.xxx
Internal IP: 192.168.137.2
Hostname: hadoop001
2. The local core-site.xml is configured as follows:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9001</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>hdfs://hadoop001:9001/hadoop/tmp</value>
    </property>
</configuration>
3. The local hdfs-site.xml is configured as follows:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
4. The hosts file on the cloud host:
[hadoop@hadoop001 ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# hostname loopback address
192.168.137.2 hadoop001
The cloud host maps the internal IP to the hostname hadoop001.
5. The local hosts file:
139.198.18.XXX hadoop001
The local machine maps the public IP to the hostname hadoop001.
2、Symptoms
1. HDFS starts on the cloud host without problems: jps shows no abnormal processes, and operating on HDFS files through the shell works fine.
2. The web UI on port 50070 is also reachable from a browser.
3. Operating on the remote HDFS from the local machine through the Java API, with the URI resolving to the public IP via the hadoop001 mapping, also works:
val conf = new Configuration()
val uri = new URI("hdfs://hadoop001:9001")
val fs = FileSystem.get(uri, conf)
// Recursively list the files under /data on the remote HDFS
val listfiles = fs.listFiles(new Path("/data"), true)
while (listfiles.hasNext) {
  val nextfile = listfiles.next()
  println("get file path:" + nextfile.getPath().toString())
}

------------------------------ output ---------------------------------
get file path:hdfs://hadoop001:9001/data/infos.txt
4. But when using Spark SQL on the local machine to read the same file from HDFS and convert it to a DataFrame:
object SparkSQLApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSQLApp").master("local[2]").getOrCreate()
    val info = spark.sparkContext.textFile("/data/infos.txt")
    import spark.implicits._
    val infoDF = info.map(_.split(",")).map(x => Info(x(0).toInt, x(1), x(2).toInt)).toDF()
    infoDF.show()
    spark.stop()
  }

  case class Info(id: Int, name: String, age: Int)
}
the following errors appear:
....
....
....
19/02/23 16:07:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/23 16:07:00 INFO HadoopRDD: Input split: hdfs://hadoop001:9001/data/infos.txt:0+17
19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
.....
....
19/02/23 16:07:21 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null
No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Will get new block locations from namenode and retry...
19/02/23 16:07:21 WARN DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 272.617680460432 msec.
19/02/23 16:07:42 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
...
...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
...
...
19/02/23 16:08:12 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
...
...
19/02/23 16:08:12 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null
No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Will get new block locations from namenode and retry...
19/02/23 16:08:12 WARN DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 11918.913311370841 msec.
19/02/23 16:08:45 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
...
...
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
19/02/23 16:08:45 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:648)
...
...
19/02/23 16:08:45 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
19/02/23 16:08:45 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/02/23 16:08:45 INFO TaskSchedulerImpl: Cancelling stage 0
19/02/23 16:08:45 INFO DAGScheduler: ResultStage 0 (show at SparkSQLApp.scala:30) failed in 105.618 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
3、Analysis
1. Shell operations on the cloud host work, which rules out problems with the cluster setup or with processes that failed to start.
2. The cloud host has no local firewall enabled, which rules out a firewall that was left on.
3. The cloud provider's firewall does open the port the DataNode uses for its data-transfer service (50010 by default).
4. I set up another VM on the same LAN as my local machine, and HDFS on that VM can be operated from the local machine without any problem, which all but confirms that the issue comes from the internal/external network split.
5. According to the documentation, HDFS directory and file names are stored on the NameNode, so those operations never need to talk to a DataNode. Since creating directories and files works, communication between the local machine and the remote NameNode is fine; the problem is therefore very likely in the communication between the local machine and the remote DataNode (see the sketch below).
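To make that distinction concrete, here is a minimal, hypothetical sketch (same URI and paths as in the examples above) showing which FileSystem calls only involve the NameNode and which one must also reach a DataNode; it is an illustration of the reasoning, not code from the original setup:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object NameNodeVsDataNodeCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs://hadoop001:9001"), new Configuration())

    // Metadata-only operations: answered entirely by the NameNode, so they
    // succeed even when the DataNode address is unreachable from outside.
    fs.mkdirs(new Path("/tmp/connectivity-test"))
    fs.listStatus(new Path("/data")).foreach(s => println(s.getPath))

    // Data operation: the client asks the NameNode for block locations and then
    // connects to a DataNode directly -- this is the step that times out here.
    val in = fs.open(new Path("/data/infos.txt"))
    println(in.read())
    in.close()
    fs.close()
  }
}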
4、Hypothesis
Since the local test machine and the cloud host are not on the same LAN, and the Hadoop configuration uses internal IPs for communication between nodes, the client can still reach the NameNode, and the NameNode replies with the address of the machine holding the data so the client can contact its data-transfer service. But because the NameNode and the DataNode talk to each other over the internal network, the address returned is the DataNode's internal IP, which the client cannot reach when it actually needs to read or write block data.
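To see exactly which DataNode address the NameNode hands back, a hedged sketch using the standard getFileBlockLocations API can be run from the development machine (same URI and path as above; the object name is just illustrative). Before any fix, getNames() should print the internal address 192.168.137.2:50010:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PrintBlockLocations {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs://hadoop001:9001"), new Configuration())
    val status = fs.getFileStatus(new Path("/data/infos.txt"))
    // Block locations come from the NameNode's metadata; no DataNode is contacted here.
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    for (loc <- locations) {
      println("names: " + loc.getNames.mkString(", "))  // ip:port pairs, e.g. 192.168.137.2:50010
      println("hosts: " + loc.getHosts.mkString(", "))  // hostnames, e.g. hadoop001
    }
    fs.close()
  }
}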
Part of the error output points the same way:
19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue....
The errors show that the client cannot connect to 192.168.137.2:50010, i.e. the DataNode's internal address; from outside the cloud network, the DataNode can only be reached at 139.198.18.XXX:50010.
To let the development machine reach HDFS, we can access HDFS by hostname and have the NameNode return DataNode hostnames instead of IPs. The quick connectivity check below illustrates the difference in reachability.
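As a sanity check of that reasoning, a small sketch (purely illustrative; port 50010 and the addresses from above) can probe the DataNode transfer port from the development machine. The internal IP should time out, while the hostname mapped to the public IP in the local hosts file should connect, provided the cloud firewall keeps the port open:

import java.net.{InetSocketAddress, Socket}

object DataNodePortCheck {
  // Try to open a TCP connection to host:port within timeoutMs milliseconds.
  def probe(host: String, port: Int = 50010, timeoutMs: Int = 5000): Unit = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      println(s"$host:$port is reachable")
    } catch {
      case e: Exception => println(s"$host:$port is NOT reachable: ${e.getMessage}")
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    probe("192.168.137.2")  // internal IP: expected to time out from the dev machine
    probe("hadoop001")      // resolves to 139.198.18.XXX via the local hosts file
  }
}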
5、Solution
1. Attempt one:
Map the DataNode's public IP to its hostname in the development machine's hosts file (already done above), and add the following to the code that talks to HDFS:
val conf = new Configuration()
conf.set("dfs.client.use.datanode.hostname", "true")
Same error as before.
2. Attempt two:
val spark = SparkSession
  .builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  .config("dfs.client.use.datanode.hostname", "true")
  .getOrCreate()
Same error as before.
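A plausible explanation for why attempts one and two fail (an assumption on my part, not verified in this post): the property never reaches the Hadoop Configuration that Spark's HadoopRDD actually uses. A Configuration created by hand is not the one Spark passes to the input format, and a bare .config("dfs.client.use.datanode.hostname", "true") is stored as a Spark property rather than a Hadoop one. Two ways that should forward it, sketched below:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  // Spark copies properties prefixed with "spark.hadoop." into its Hadoop Configuration
  .config("spark.hadoop.dfs.client.use.datanode.hostname", "true")
  .getOrCreate()

// Alternatively, set it on the Hadoop Configuration that Spark exposes:
spark.sparkContext.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")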
3. Attempt three:
Add the following to hdfs-site.xml:
<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>
This runs successfully.
Further reading suggests also adding the dfs.datanode.use.datanode.hostname property to hdfs-site.xml, so that DataNode-to-DataNode communication goes through hostnames as well:
<property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
</property>
This makes changing internal IPs much simpler and makes data exchange between particular DataNodes easier. The trade-off is that if DNS resolution fails, the whole Hadoop cluster stops working, so the DNS setup has to be reliable.