HDFS on a Cloud Host Cannot Be Accessed from the External Network

I. Problem Background
1. The cloud host runs Linux, with Hadoop set up in pseudo-distributed mode:
Public IP: 139.198.18.xxx
Internal IP: 192.168.137.2
Hostname: hadoop001
2. The local core-site.xml configuration is as follows:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9001</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>hdfs://hadoop001:9001/hadoop/tmp</value>
    </property>
</configuration>

3. The local hdfs-site.xml configuration is as follows:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

4. The cloud host's hosts file:

[hadoop@hadoop001 ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# hostname loopback address
  192.168.137.2   hadoop001

The cloud host maps the internal IP to the hostname hadoop001.
5. The local hosts file:

139.198.18.XXX     hadoop001

Locally, the public IP is mapped to the hostname hadoop001.
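To confirm the mapping is picked up on the client side, here is a quick resolution check (a plain JDK call, illustrative only):

import java.net.InetAddress

// Should print 139.198.18.xxx if the local hosts entry is in effect.
println(InetAddress.getByName("hadoop001").getHostAddress)
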
II. Problem Symptoms
1. HDFS is started on the cloud host; jps shows all processes healthy, and HDFS files can be operated on through the shell without issue.
2. The admin web UI on port 50070 is also accessible from a browser.
3. Using the Java API from the local machine against the remote HDFS, with the URI resolving to the public IP via the hadoop001 hosts mapping, the code below works:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val uri = new URI("hdfs://hadoop001:9001")
val fs = FileSystem.get(uri, conf)
// Recursively list all files under /data
val listfiles = fs.listFiles(new Path("/data"), true)
while (listfiles.hasNext) {
  val nextfile = listfiles.next()
  println("get file path:" + nextfile.getPath().toString())
}
------------------------------ Output ---------------------------------
get file path:hdfs://hadoop001:9001/data/infos.txt

4. On the local machine, using Spark SQL to read the file from HDFS and convert it to a DataFrame:

import org.apache.spark.sql.SparkSession

object SparkSQLApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSQLApp").master("local[2]").getOrCreate()
    // Read the file from HDFS and map each line to an Info row
    val info = spark.sparkContext.textFile("/data/infos.txt")
    import spark.implicits._
    val infoDF = info.map(_.split(",")).map(x => Info(x(0).toInt, x(1), x(2).toInt)).toDF()
    infoDF.show()
    spark.stop()
  }

  case class Info(id: Int, name: String, age: Int)
}
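
Note that the scheme-less path "/data/infos.txt" resolves against fs.defaultFS from the client's core-site.xml, so the read is equivalent to the fully qualified form:

val info = spark.sparkContext.textFile("hdfs://hadoop001:9001/data/infos.txt")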

Running it produces the following error:

...
19/02/23 16:07:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/23 16:07:00 INFO HadoopRDD: Input split: hdfs://hadoop001:9001/data/infos.txt:0+17
19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
...
19/02/23 16:07:21 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes:  DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Will get new block locations from namenode and retry...
19/02/23 16:07:21 WARN DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 272.617680460432 msec.
19/02/23 16:07:42 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
...
...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
...
...
19/02/23 16:08:12 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
...
...
19/02/23 16:08:12 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes:  DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Will get new block locations from namenode and retry...
19/02/23 16:08:12 WARN DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 11918.913311370841 msec.
19/02/23 16:08:45 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
...
...
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes:  DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes:  DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
19/02/23 16:08:45 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:648)
...
...
19/02/23 16:08:45 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
19/02/23 16:08:45 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
19/02/23 16:08:45 INFO TaskSchedulerImpl: Cancelling stage 0
19/02/23 16:08:45 INFO DAGScheduler: ResultStage 0 (show at SparkSQLApp.scala:30) failed in 105.618 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...

III. Problem Analysis
1. Shell operations on the cloud host work, which rules out problems with the cluster setup or with processes not having started.
2. The cloud host has no firewall configured, which rules out an unclosed firewall.
3. The cloud server's firewall does have the DataNode data-transfer port open (50010 by default).
4. I set up another VM on the same LAN as my local machine, and the local machine can operate on that VM's HDFS normally, which all but confirms the problem is the internal/external network split.
5. Per the documentation, HDFS directory and file names live on the NameNode, and such operations never touch a DataNode. Since creating directories and files works, local-to-NameNode communication is fine; the problem is therefore most likely in local-to-DataNode communication, as the sketch below illustrates.
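A quick way to observe this split from the client is to reuse the FileSystem handle fs from the Java API example above (an illustrative sketch; /data/test_dir is a hypothetical path):

fs.mkdirs(new Path("/data/test_dir"))           // NameNode-only metadata op: succeeds
println(fs.exists(new Path("/data/infos.txt"))) // NameNode-only metadata op: succeeds
val in = fs.open(new Path("/data/infos.txt"))   // open only fetches block locations
println(in.read())                              // first read dials the DataNode: times out
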
IV. Hypothesis
Since the local machine and the cloud host are not on the same LAN, and the Hadoop configuration uses internal IPs for communication between machines, we can reach the NameNode, and the NameNode hands us the address of the machine holding the data so that we can contact its data-transfer service. But because the NameNode and DataNodes talk over the internal network, the address it returns when data is actually read or written is the DataNode's internal IP, which we cannot use to reach the DataNode from outside.
Let's look at part of the error output:

19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue....

The error shows the client cannot connect to 192.168.137.2:50010, the DataNode's address; from the external network, the DataNode is only reachable at 139.198.18.XXX:50010.
To let the development machine reach HDFS, we can access HDFS by hostname, so that the NameNode returns the DataNode's hostname rather than its internal IP.
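
Before touching any configuration, this can be confirmed from the development machine with a plain JDK socket probe (a hypothetical check with no Hadoop dependency; the 5-second timeout is arbitrary):

import java.net.{InetSocketAddress, Socket}

// The internal IP should time out from outside the LAN, while hadoop001
// (mapped to the public IP in the local hosts file) should connect,
// provided port 50010 is open to the outside world.
for (host <- Seq("192.168.137.2", "hadoop001")) {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, 50010), 5000)
    println(s"$host:50010 reachable")
  } catch {
    case e: Exception => println(s"$host:50010 not reachable: ${e.getMessage}")
  } finally socket.close()
}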
V. Solution
1. Attempt 1:
Map the public IP to the hostname in the development machine's hosts file (already configured above), and add the following to the code that interacts with HDFS:

val conf = new Configuration()
conf.set("dfs.client.use.datanode.hostname", "true")

Same error.
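A plausible explanation (my inference; the original post does not verify it): the Configuration created here is never handed to the Spark job, which reads through the SparkContext's own Hadoop configuration. Setting the flag on that instance should have the intended effect:

spark.sparkContext.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")
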
2. Attempt 2:

val spark = SparkSession
  .builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  .config("dfs.client.use.datanode.hostname", "true")
  .getOrCreate()

Same error again.
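In hindsight this failure is expected: a bare Hadoop key passed to SparkSession.config is stored as a Spark property and never reaches the HDFS client. Spark only copies entries carrying the spark.hadoop. prefix into the underlying Hadoop Configuration, so a variant like the following should also work (untested here):

val spark = SparkSession
  .builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  // The "spark.hadoop." prefix tells Spark to copy this entry into
  // the Hadoop Configuration used by the HDFS client.
  .config("spark.hadoop.dfs.client.use.datanode.hostname", "true")
  .getOrCreate()
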
3. Attempt 3:
Add the following to hdfs-site.xml:

<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>

This runs successfully.
Further reading recommends also adding the dfs.datanode.use.datanode.hostname property to hdfs-site.xml, so that communication between DataNodes likewise goes through hostnames:

<property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
</property>

This makes swapping internal IPs simple and convenient, and it also makes data exchange between specific DataNodes easier. The trade-off is that a DNS resolution failure will take the whole Hadoop cluster down, so name resolution must be reliable; note that this server-side property only takes effect after the DataNodes are restarted.

Summary: switch from the default IP-based access to hostname-based access.
