Setting Up a Spark Java Development Environment and Remote Debugging

Date: 2019-11-13


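An upcoming project will use Spark to run a clustering analysis on user nickname text in order to find nicknames that violate the rules. I had tinkered with Hadoop before, so I read the documentation on the Spark website and the official examples on GitHub, and then decided to run a text-clustering demo myself. This post is the result.

1. Environment

The local development environment is IDEA 2018, JDK 8, and Windows 10. The remote server runs Ubuntu 16.04.3 LTS with spark-2.3.1-bin-hadoop2.7 installed.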

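According to the Spark website, there are two ways to start Spark (these are not the same thing as Spark application execution modes):

Running the Examples and Shell

For example, ./bin/pyspark --master local[2] starts an interactive shell, and jobs can be viewed on port 4040.
Launching on a Cluster

A Spark cluster has several deployment options: Standalone, as well as YARN and Mesos (where the cluster's resources are managed by a resource manager).

For Standalone, ./sbin/start-master.sh starts the Master, and the state of the cluster can then be viewed on port 8080.

Next, ./sbin/start-slave.sh spark://panda-e550:7077 starts a slave. The slaves that have started are listed under Alive Workers.

Run jps to see the Master and Worker processes: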

~/spark-2.3.1-bin-hadoop2.7$ jps
45437 Master
50429 Worker

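The rest of this post describes how to write a Spark program on local Windows 10 and then connect to Spark on the remote Ubuntu machine to debug it.

2. A simple development environment

Create a Maven project and use one of the official Spark examples, the clustering algorithm JavaBisectingKMeansExample, to walk through a run and to show how to configure a Spark debugging environment.

2.1 Add the Maven dependencies: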

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.3.1</version>
    <!--<scope>runtime</scope>-->
</dependency>

2.2 Write the code:

package net.hapjin.spark;

import org.apache.spark.ml.clustering.BisectingKMeans;
import org.apache.spark.ml.clustering.BisectingKMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JavaBisectingKMeansExample {
    public static void main(String[] args) {
        // Alternative: omit master() and let the launcher supply it.
        // SparkSession spark = SparkSession.builder().appName("JavaBisectingKMeansExample").getOrCreate();
        // Connect to the remote standalone master.
        SparkSession spark = SparkSession.builder().appName("JavaBisectingKMeansExample").master("spark://xx.xx.129.170:7077").getOrCreate();
        // Local Windows paths tried during debugging (see section 2.3):
        // Dataset<Row> dataset = spark.read().format("libsvm").load(".\\data\\sample_kmeans_data.txt");
        // Dataset<Row> dataset = spark.read().format("libsvm").load("file:///E:/git/myown/test/spark/example/data/sample_kmeans_data.txt");
        // Load the sample data from HDFS on the remote server.
        Dataset<Row> dataset = spark.read().format("libsvm").load("hdfs://172.25.129.170:9000/user/panda/sample_kmeans_data.txt");
        // Trains a bisecting k-means model.
        BisectingKMeans bkm = new BisectingKMeans().setK(2).setSeed(1);
        BisectingKMeansModel model = bkm.fit(dataset);

        // Evaluate clustering.
        double cost = model.computeCost(dataset);
        System.out.println("Within Set Sum of Squared Errors = " + cost);
        // Shows the result.
        System.out.println("Cluster Centers: ");
        Vector[] centers = model.clusterCenters();
        for (Vector center : centers) {
            System.out.println(center);
        }
        // $example off$
        spark.stop();
    }
}

2.3 Configure the remote debugging environment

In IDEA, go to "Run" --> "Edit Configurations" --> "Template" --> "Remote" and click the "+" button.

Running it reports an error:

Could not locate executable null\bin\winutils.exe

Download the winutils.exe that matches your Hadoop version from GitHub.

Set the Windows 10 environment variable HADOOP_HOME, and add %HADOOP_HOME%\bin to the Path environment variable.
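A quick way to confirm the setting before launching Spark is to check where Hadoop will look for the executable. The sketch below is only an illustration: the class name WinutilsCheck and the helper winutilsPath are hypothetical, and it assumes the usual Hadoop lookup order (the hadoop.home.dir system property first, then HADOOP_HOME). A null home directory is what produces the "null\bin\winutils.exe" in the error above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class WinutilsCheck {
    /** Builds the location <hadoopHome>/bin/winutils.exe (hypothetical helper). */
    static Path winutilsPath(String hadoopHome) {
        return Paths.get(hadoopHome, "bin", "winutils.exe");
    }

    public static void main(String[] args) {
        // Assumed lookup order: system property first, then the environment variable.
        String home = System.getProperty("hadoop.home.dir", System.getenv("HADOOP_HOME"));
        if (home == null) {
            System.out.println("Neither hadoop.home.dir nor HADOOP_HOME is set");
            return;
        }
        Path exe = winutilsPath(home);
        System.out.println(exe + " exists = " + Files.isRegularFile(exe));
    }
}
```

Remember that IDEA only sees a changed environment variable after it has been restarted.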

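Debug again, and this time the breakpoint is hit. (If a connection-refused error appears, edit conf/spark-env.sh to set SPARK_LOCAL_IP to the machine's IP address, then edit /etc/hosts so that the hostname maps to the machine's IP address.)

When the local development environment (Windows 10) cannot connect to the remote server, first run telnet on Windows to see whether the port is reachable. If it is not, the connection will certainly fail. Also check on the remote server which IP address the port is bound to, and whether it is bound to a loopback address. In the output below, for example, the Spark master's default port is bound to 127.0.1.1, so the local development environment cannot connect to it: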

~/spark-2.3.1-bin-hadoop2.7$ netstat -anp | grep 7077
tcp6 0 0 127.0.1.1:7077 :::* LISTEN 18406/java
tcp6 0 0 127.0.0.1:29599 127.0.1.1:7077 ESTABLISHED 18605/java
tcp6 0 0 127.0.1.1:7077 127.0.0.1:29599 ESTABLISHED 18406/java
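The telnet check can also be done from Java, which is handy when telnet is not enabled on Windows. This is a minimal sketch, not part of the original setup: the class name PortProbe, the helper canConnect, and the 2000 ms timeout are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolved host: all count as unreachable here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are placeholders; pass the server address and port as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7077;
        System.out.println(host + ":" + port + " reachable = " + canConnect(host, port, 2000));
    }
}
```

Run it on the Windows machine with the server's IP and the master port (7077). A false result while the port shows LISTEN on the server usually points to a loopback binding like the one above, or to a firewall.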

Next comes reading the file. The file clearly exists at that path on Windows, yet the read fails with:

Caused by: java.io.FileNotFoundException: File file:/E:/git/myown/test/spark/example/data/sample_kmeans_data.txt does not exist

I tried several ways of writing the path, including those suggested in a Stack Overflow answer, but none of them worked.

//        Dataset<Row> dataset = spark.read().format("libsvm").load(".\\data\\sample_kmeans_data.txt");
        Dataset<Row> dataset = spark.read().format("libsvm").load("file:///E:\\git\\myown\\test\\spark\\example\\data\\sample_kmeans_data.txt");
//        Dataset<Row> dataset = spark.read().format("libsvm").load("file:///E:/git/myown/test/spark/example/data/sample_kmeans_data.txt");

The explanation given there:

SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.getDefaultUri if schema is absent. This method reads "fs.defaultFS" parameter of Hadoop conf. If you set HADOOP_CONF_DIR environment variable, the parameter is usually set as "hdfs://..."; otherwise "file://".

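My local Windows 10 machine has no Hadoop installation; I only set the HADOOP_HOME environment variable when downloading winutils.exe as described above. So I do not know whether reading a file from the local Windows file system is feasible at all in this kind of remote debugging. (A likely explanation is that with a remote master the executors run on the Ubuntu server, where a file:/E:/... path does not exist; Spark expects a local-file path to be accessible at the same location on every worker node.)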
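The quoted rule can be pictured with a few lines of plain Java. This is only a simplified illustration of the idea (use the path's own scheme if it has one, otherwise fall back to the scheme of fs.defaultFS); it is not Hadoop's actual code, and the class SchemeDemo and method effectiveScheme are hypothetical names.

```java
import java.net.URI;

class SchemeDemo {
    /** Returns the path's own URI scheme, or the default file system's scheme if it has none. */
    static String effectiveScheme(String path, String defaultFs) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : URI.create(defaultFs).getScheme();
    }

    public static void main(String[] args) {
        String defaultFs = "hdfs://localhost:9000"; // value of fs.defaultFS from core-site.xml
        // An explicit scheme wins over the default.
        System.out.println(effectiveScheme("file:///E:/git/myown/test/data.txt", defaultFs));
        // A path without a scheme is resolved against fs.defaultFS.
        System.out.println(effectiveScheme("user/panda/sample_kmeans_data.txt", defaultFs));
    }
}
```

Writing the scheme explicitly, as in the hdfs:// path used later, avoids depending on whichever default the Hadoop configuration happens to supply.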

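It looks like the only option is to install Hadoop on the remote server as well and upload the file to HDFS. So I downloaded Hadoop 2.7.7, unpacked and installed it (following the pseudo-distributed setup on the official site), and started HDFS: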

$ sbin/start-dfs.sh

Use jps to check the processes now running on the server:

panda@panda-e550:~/data/spark$ jps
2435 Master
10181 SecondaryNameNode
11543 Jps
9849 NameNode
3321 Worker
9997 DataNode

Here Master is the Spark master and Worker is the Spark worker; NameNode, DataNode, and SecondaryNameNode are the Hadoop HDFS processes. Upload the file to HDFS and list the directory:

panda@panda-e550:~/software/hadoop-2.7.7$ ./bin/hdfs dfs -ls /user/panda
Found 2 items
-rw-r--r-- 1 panda supergroup 70 2018-10-30 17:54 /user/panda/people.json
-rw-r--r-- 1 panda supergroup 120 2018-10-30 18:09 /user/panda/sample_kmeans_data.txt

Update the file path in the code (with the actual IP masked):

Dataset<Row> dataset = spark.read().format("libsvm").load("hdfs://xx.xx.129.170:9000/user/panda/sample_kmeans_data.txt");

Right-click and debug again, and another error appears:

failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

Following the link in the message, the problem appears to be this one:

If the error message says the remote service is on "127.0.0.1" or "localhost" that means the configuration file is telling the client that the service is on the local server. If your client is trying to talk to a remote system, then your configuration is broken.

Check that there isn't an entry for your hostname mapped to 127.0.0.1 or 127.0.1.1 in /etc/hosts (Ubuntu is notorious for this)

Part of the output of cat /etc/hosts:

127.0.0.1 localhost

172.25.129.170 panda-e550

Now look at the HDFS configuration file: cat etc/hadoop/core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
        </property>
</configuration>

On the remote server:

panda@panda-e550:~/software/hadoop-2.7.7$ netstat -anp | grep 9000
tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 9849/java
tcp 0 0 127.0.0.1:43162 127.0.0.1:9000 ESTABLISHED 9997/java
tcp 0 0 127.0.0.1:43360 127.0.0.1:9000 TIME_WAIT -
tcp 0 0 127.0.0.1:9000 127.0.0.1:43162 ESTABLISHED 9849/java

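The cause of the ConnectException is now clear: port 9000 of Hadoop HDFS on the server is bound to the loopback address. Specifying hdfs://172.25.129.170:9000/user/panda/sample_kmeans_data.txt in the program running in IDEA on the Windows 10 machine therefore cannot work. The Windows 10 development machine is a different host from the server where Hadoop is installed, so it cannot reach a port that listens only on the server's loopback address.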
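Whether a name such as localhost or panda-e550 resolves to a loopback address can be checked on the server with a small Java program. This is a sketch for illustration only; the class LoopbackCheck and the method resolvesToLoopback are hypothetical names, and the result depends on that machine's /etc/hosts and DNS settings.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class LoopbackCheck {
    /** Returns true if the host name or IP literal resolves to a loopback address (127.x.x.x or ::1). */
    static boolean resolvesToLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve host: " + host, e);
        }
    }

    public static void main(String[] args) {
        // Pass the host from fs.defaultFS or the machine's hostname as the argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " resolves to loopback = " + resolvesToLoopback(host));
    }
}
```

If the host named in fs.defaultFS resolves to loopback, the NameNode port will only be reachable from the server itself, which matches the netstat output above.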

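Since core-site.xml is configured with hdfs://localhost:9000, changing localhost to the IP address should work in principle. I took a different approach, though, and edited /etc/hosts: comment out the original 127.0.0.1 entry for localhost and map localhost to the machine's IP address instead, as follows: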

panda@panda-e550:~/software/hadoop-2.7.7$ cat /etc/hosts
# comment this line and add new line for hadoop hdfs
#127.0.0.1      localhost
172.25.129.170 localhost

Then restart the HDFS processes. Now:

panda@panda-e550:~/software/hadoop-2.7.7$ netstat -anp | grep 9000
tcp 0 0 172.25.129.170:9000 0.0.0.0:* LISTEN 13126/java
tcp 0 0 172.25.129.170:50522 172.25.129.170:9000 ESTABLISHED 13276/java
tcp 0 0 172.25.129.170:50616 172.25.129.170:9000 TIME_WAIT -
tcp 0 0 172.25.129.170:9000 172.25.129.170:50522 ESTABLISHED 13126/java

Running telnet from the cmd prompt on the Windows 10 machine now connects successfully, so debugging can go ahead.