Installing Zeppelin with Flink and Spark in Cluster Mode

Date: 2019-12-07

Tags: zeppelin, flink, spark, cluster installation. Category: Spark

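This tutorial is aimed at newcomers to Zeppelin. It does not assume much prior knowledge of Linux, git, or other tools. If you follow the steps here one by one, you should end up with a working Zeppelin installation.

Installing Zeppelin for Flink/Spark cluster mode

This tutorial assumes a fresh machine environment (a physical or virtual machine with a minimal installation of Ubuntu 14.04.3 Server).

Note: the virtual machine's disk should be at least 16 GB, so that the installation does not fail for lack of disk space.

Software requirements

Starting from a minimal installation, the following programs must be installed before Zeppelin, Flink, and Spark: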

git
openssh-server
OpenJDK 7
Maven 3.1+

git, openssh-server, and OpenJDK 7 can be installed with the apt package manager.

git

On the command line, type:

sudo apt-get install git

openssh-server

sudo apt-get install openssh-server

OpenJDK 7

sudo apt-get install openjdk-7-jdk openjdk-7-jre-lib

On Ubuntu 16.04, installing openjdk-7 requires adding a repository (Source) first, as follows:

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-7-jdk openjdk-7-jre-lib

Maven 3.1+

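Zeppelin requires Maven 3.x (3.1 or later, as listed above). The version in the system repositories is 2.x, so Maven must be installed manually.

First, purge any existing Maven versions: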

sudo apt-get purge maven maven2

Download the Maven 3.3.9 binary:

wget "http://www.us.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz"

Extract the archive and move it to /usr/local:

tar -zxvf apache-maven-3.3.9-bin.tar.gz
sudo mv ./apache-maven-3.3.9 /usr/local

Create a symbolic link in /usr/bin:

sudo ln -s /usr/local/apache-maven-3.3.9/bin/mvn /usr/bin/mvn
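
The manual install can be verified before moving on. The following is a minimal sketch rather than part of the original tutorial: version_ge and check_maven are hypothetical helper names, the comparison relies on GNU sort's -V option (assumed to be available, as it is on Ubuntu), and the 3.1 threshold comes from the requirements list above.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes GNU sort, whose -V option orders version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_maven REQUIRED: reports whether the mvn found on PATH is at least REQUIRED.
check_maven() {
    required="$1"
    if ! command -v mvn >/dev/null 2>&1; then
        echo "mvn not found on PATH" >&2
        return 1
    fi
    # The first line of `mvn -version` looks like "Apache Maven 3.3.9 (...)".
    installed="$(mvn -version 2>/dev/null | awk '/^Apache Maven/ { print $3; exit }')"
    if [ -n "$installed" ] && version_ge "$installed" "$required"; then
        echo "Maven $installed is new enough (need $required or later)"
    else
        echo "Maven ${installed:-unknown} is too old or could not be detected (need $required or later)" >&2
        return 1
    fi
}
```

Running check_maven 3.1 after creating the symbolic link should report the 3.3.9 release downloaded above. If it reports an older version, another mvn is probably still earlier on the PATH.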

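Installing Zeppelin

This section is a quick walk-through of building Zeppelin from source. For detailed steps, read the Zeppelin Installation Guide.

On the command line, clone the Zeppelin source code: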

git clone https://github.com/apache/zeppelin.git

Change into the Zeppelin root directory:

cd zeppelin

Package Zeppelin:

mvn clean package -DskipTests -Pspark-1.6 -Dflink.version=1.1.3 -Pscala-2.10

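-DskipTests skips the build tests.

-Pspark-1.6 tells Maven to build against Spark 1.6. Zeppelin has its own Spark interpreter, so its version must match the version of the Spark service it connects to.

-Dflink.version=1.1.3 tells Maven to build against Flink 1.1.3.

-Pscala-2.10 tells Maven to build with Scala 2.10.

Note: you can include additional build flags such as -Ppyspark or -Psparkr. See the build section of the Zeppelin GitHub repository for details.

Note: you can build against any Spark version for which Zeppelin has a build profile. The key is to build against the same version you will run.

Note on build failures: having installed Zeppelin more than 30 times, I can tell you that the build sometimes fails for no apparent reason. Even when no code has been edited, the build can still fail. Very often the cause is Maven failing to download something.

If the build fails, here are some tips: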

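- Check the logs.

- Try again (rerun the same command: mvn clean package -DskipTests -Pspark-1.6 -Dflink.version=1.1.3 -Pscala-2.10).

- If a download failed, wait a while and try again. If a server is unavailable, waiting is sometimes all you can do.

- Make sure you followed every step correctly.

- Ask the community for help: join the users mailing list, and include the build output (everything that happened in the console) in your message.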
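
Since many failures are transient download problems, the retry advice above can be automated. This is a sketch under stated assumptions, not part of Zeppelin: retry is a hypothetical helper, and the attempt count and delay are arbitrary values to adjust.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between attempts.
retry() {
    _retry_max="$1"
    _retry_delay="$2"
    shift 2
    _retry_attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$_retry_attempt" -ge "$_retry_max" ]; then
            echo "giving up after $_retry_attempt attempts" >&2
            return 1
        fi
        echo "attempt $_retry_attempt failed; retrying in ${_retry_delay}s" >&2
        _retry_attempt=$((_retry_attempt + 1))
        sleep "$_retry_delay"
    done
}
```

For example, retry 3 60 mvn clean package -DskipTests -Pspark-1.6 -Dflink.version=1.1.3 -Pscala-2.10 runs the build up to three times, one minute apart. Saving the console output to a file (for instance with tee) keeps the full log available for a mailing-list message.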

Starting the Zeppelin service

bin/zeppelin-daemon.sh start

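Use ifconfig to find the host machine's IP address. If you are unfamiliar with the command, consult its documentation.

Open a browser. On the host itself, go to http://127.0.0.1:8080. From another machine on the same network, use the server's IP address as reported by ifconfig.

See the Zeppelin tutorial for basic usage. It is also worth spending some time with the example notebooks that ship with Zeppelin, which are a quick way to get familiar with basic notebook functionality.

Flink test

Create a new notebook named "Flink Test" and copy the following code into it: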

%flink  // let Zeppelin know what interpreter to use.

val text = benv.fromElements("In the time of chimpanzees, I was a monkey",   // some lines of text to analyze
"Butane in my veins and I'm out to cut the junkie",
"With the plastic eyeballs, spray paint the vegetables",
"Dog food stalls with the beefcake pantyhose",
"Kill the headlights and put it in neutral",
"Stock car flamin' with a loser in the cruise control",
"Baby's in Reno with the Vitamin D",
"Got a couple of couches, sleep on the love seat",
"Someone came in sayin' I'm insane to complain",
"About a shotgun wedding and a stain on my shirt",
"Don't believe everything that you breathe",
"You get a parking violation and a maggot on your sleeve",
"So shave your face with some mace in the dark",
"Savin' all your food stamps and burnin' down the trailer park",
"Yo, cut it")

/*  The meat and potatoes:
        this tells Flink to iterate through the elements, in this case strings,
        transform each string to lower case and split it on non-word characters into individual words,
        then finally aggregate the occurrences of each word.

        This creates the counts variable, which is a collection of tuples of the form (word, occurrences)
*/
val counts = text.flatMap{ _.toLowerCase.split("\\W+") }.map { (_,1) }.groupBy(0).sum(1)

counts.collect().foreach(println(_))  // execute the script and print each element in the counts list

Press Shift+Enter to run the paragraph and confirm that the Zeppelin Flink interpreter works. If there are problems, adjust the settings on the Interpreter page in the menu.
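
To cross-check the notebook output outside Zeppelin, the same word count can be approximated with standard command-line tools. This is a sketch, not part of the original tutorial: word_count is a hypothetical helper, and it assumes GNU tr, sort, uniq, and awk. It lower-cases the text, splits on the same non-word characters as the "\\W+" pattern, and prints (word,count) pairs.

```shell
# word_count: reads text on stdin and prints one "(word,count)" line per word.
word_count() {
    (
        # Use the C locale so case folding and sort order are predictable.
        export LC_ALL=C
        tr '[:upper:]' '[:lower:]' |      # lower-case every line
            tr -cs '[:alnum:]_' '\n' |    # turn runs of non-word characters into newlines
            sed '/^$/d' |                 # drop empty tokens
            sort |
            uniq -c |                     # count identical adjacent words
            awk '{ printf "(%s,%d)\n", $2, $1 }'
    )
}

# Two lines taken from the notebook example.
word_count <<'EOF'
Kill the headlights and put it in neutral
Yo, cut it
EOF
```

For these two lines, "it" is counted twice and every other word once. Unlike the notebook, this version prints its results in sorted order, whereas Flink and Spark return them in no particular order.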

Spark test

Create a notebook named "Spark Test" and copy the following code into it:

%spark // let Zeppelin know what interpreter to use.

val text = sc.parallelize(List("In the time of chimpanzees, I was a monkey",  // some lines of text to analyze
"Butane in my veins and I'm out to cut the junkie",
"With the plastic eyeballs, spray paint the vegetables",
"Dog food stalls with the beefcake pantyhose",
"Kill the headlights and put it in neutral",
"Stock car flamin' with a loser in the cruise control",
"Baby's in Reno with the Vitamin D",
"Got a couple of couches, sleep on the love seat",
"Someone came in sayin' I'm insane to complain",
"About a shotgun wedding and a stain on my shirt",
"Don't believe everything that you breathe",
"You get a parking violation and a maggot on your sleeve",
"So shave your face with some mace in the dark",
"Savin' all your food stamps and burnin' down the trailer park",
"Yo, cut it"))


/*  The meat and potatoes:
        this tells Spark to iterate through the elements, in this case strings,
        transform each string to lower case and split it on non-word characters into individual words,
        then finally aggregate the occurrences of each word.

        This creates the counts variable, which is a collection of tuples of the form (word, occurrences)
*/
val counts = text.flatMap { _.toLowerCase.split("\\W+") }
                 .map { (_,1) }
                 .reduceByKey(_ + _)

counts.collect().foreach(println(_))  // execute the script and print each element in the counts list

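Press Shift+Enter to run the paragraph and confirm that the Zeppelin Spark interpreter works. If there are problems, adjust the settings on the Interpreter page in the menu.

Finally, stop the Zeppelin daemon. From the command prompt, run: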

bin/zeppelin-daemon.sh stop

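Installing the clusters

Flink cluster

Downloading the prebuilt binaries

Building from source is recommended where possible: you get the latest features, you learn about the project's recent progress and code structure, and you can tailor the build to your own environment. For the sake of this demonstration, the prebuilt binaries are downloaded directly.

Download with wget: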

wget "http://mirror.cogentco.com/pub/apache/flink/flink-1.1.3/flink-1.1.3-bin-hadoop24-scala_2.10.tgz"
tar -xzvf flink-1.1.3-bin-hadoop24-scala_2.10.tgz

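This downloads Flink 1.1.3, built for Hadoop 2.4. You do not need Hadoop installed for this binary to work, but if you do use Hadoop, change the 24 to match your version.

Start the Flink cluster: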

flink-1.1.3/bin/start-cluster.sh

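Building from source

If you want to build Flink from source, the following is a quick guide. Changing build tools and versions can introduce instability. For example, Java 8 and Maven 3.0.3 are recommended for building Flink but are not currently suitable for building Zeppelin (versions are moving quickly, so this may change). See the Flink Installation guide for more detailed instructions.

Return to the directory where you have been downloading, assumed here to be $HOME. Clone the Flink source, check out release-1.1.3-rc2, and build.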

cd $HOME
git clone https://github.com/apache/flink.git
cd flink
git checkout release-1.1.3-rc2
mvn clean install -DskipTests

Start the Flink cluster in stand-alone mode:

build-target/bin/start-cluster.sh

Make sure the cluster started successfully.

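In a browser, go to http://127.0.0.1:8081 to see the Flink web UI (8081 is Flink's default web UI port; the port may differ on your machine, as discussed at the end of this tutorial). Click 'Task Managers' in the left navigation bar and make sure at least one Task Manager is listed.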

If no task managers appear, restart the Flink cluster as follows:

If you installed the binaries:

flink-1.1.3/bin/stop-cluster.sh
flink-1.1.3/bin/start-cluster.sh

If you built from source:

build-target/bin/stop-cluster.sh
build-target/bin/start-cluster.sh

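Spark 1.6 cluster

Downloading the prebuilt package

Building from source is recommended where possible. For the sake of this demonstration, the prebuilt package is downloaded directly.

Download with wget: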

wget "http://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz"
tar -xzvf spark-1.6.3-bin-hadoop2.6.tgz
mv spark-1.6.3-bin-hadoop2.6 spark

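The commands above download Spark 1.6.3, built for Hadoop 2.6. You do not need Hadoop installed for this package to work, but if you do use Hadoop, change the 2.6 to match your version.

Building from source

Spark is a fairly large project that takes a long time to download and build, and the build can fail along the way with the same kind of unexplained failures described earlier. See the Spark Installation guide for more detailed instructions.

Return to the download directory, assumed here to be $HOME. Clone the Spark source, check out branch-1.6, and build.

Note: branch-1.6 is checked out here only because it is the version supported by a Zeppelin profile at the time of writing. Build whichever Spark version matches your setup. If you use Spark 2.0, the word count example needs to be modified to be compatible with Spark 2.0.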

cd $HOME

Clone, check out, and build Spark 1.6.x with the following commands:

git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.6
mvn clean package -DskipTests

Starting the Spark cluster

Return to the $HOME directory:

cd $HOME

Start the Spark cluster in stand-alone mode. Use the --webui-port option to choose a web UI port other than the default 8080, because 8080 is the port used by Zeppelin's web UI.

spark/sbin/start-master.sh --webui-port 8082

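Note: why --webui-port 8082? That is a digression, explained at the end of this tutorial.

Open a browser and go to http://yourip:8082 to make sure the Spark master is running.

Near the top of the page is a URL of the form spark://yourhost:7077. This is the URI of the Spark master, and it is needed in the following steps.

Start a Spark slave node using this URI: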

spark/sbin/start-slave.sh spark://yourhostname:7077
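
If you are scripting this step, the URI can be assembled rather than copied by hand. The sketch below is not part of the original tutorial: spark_master_uri is a hypothetical helper, and it assumes the master advertises the machine's hostname on port 7077, as in the example above. The URI shown on the master's web page remains the authoritative value.

```shell
# spark_master_uri [HOST] [PORT]
# Prints a Spark master URI; HOST defaults to this machine's hostname
# and PORT to 7077 (both are assumptions; check the master's web page).
spark_master_uri() {
    _uri_host="${1:-$(hostname)}"
    _uri_port="${2:-7077}"
    printf 'spark://%s:%s\n' "$_uri_host" "$_uri_port"
}

# Example with the host name used later in this tutorial.
spark_master_uri ubuntu    # prints spark://ubuntu:7077
```

The slave could then be started with spark/sbin/start-slave.sh "$(spark_master_uri)".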

Return to the home directory and start the Zeppelin daemon:

cd $HOME

zeppelin/bin/zeppelin-daemon.sh start

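Configuring the interpreters

Open a browser and go to the Zeppelin web UI at http://yourip:8080.

In the Zeppelin web UI (http://yourip:8080), click 'anonymous' in the top right to open a drop-down menu, then select Interpreters to open the interpreter configuration page.

In the Spark section, click the edit button (pencil icon) in the top right. Then edit Spark's master field: change it from local[*] to the URI noted above, which in this example is spark://ubuntu:7077.

Click Save to update the parameters, then click OK when asked whether to restart the interpreter.

Now scroll down to the Flink section. Click the edit button and change the value of host from local to localhost. Click Save.

Reopen the examples and run them again (click the play button at the top of the screen, click the play button on each paragraph, or press Shift+Enter).

You can check the Flink and Spark web UIs (at something like http://yourip:8081, http://yourip:8082, http://yourip:8083) to see that the jobs ran on the clusters.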

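A digression on service ports

Why 'something like' rather than exact web UI ports? Because the ports depend on what was already running when you started each service. By default, Flink and Spark each try to start their web UI on port 8080 and, if it is taken, look for the next available port.

Because Zeppelin starts first, it takes port 8080 by default. When Flink starts, it tries port 8080 and, finding it unavailable, moves to the next one, 8081. Spark has a web UI for both the master and the slave. At startup the master tries to bind to port 8080 (already taken by Zeppelin), then 8081 (already taken by Flink's web UI), and then settles on 8082.

If everything goes exactly as described, the web UIs end up on ports 8081 and 8082. However, if other programs are running, or if startup is controlled by some other cluster manager, the result may differ from what you expect, especially when a large number of nodes are started.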
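
The fallback described above can be modelled in a few lines. This sketch only simulates the behaviour as this section describes it; it does not inspect real sockets, and next_free_port is a hypothetical helper, not something Flink or Spark provide.

```shell
# next_free_port START [USED_PORT...]
# Prints the first port at or above START that is not in the USED_PORT list.
next_free_port() {
    candidate="$1"
    shift
    used=" $* "
    while :; do
        case "$used" in
            *" $candidate "*) candidate=$((candidate + 1)) ;;  # taken: try the next port
            *) break ;;                                        # free: stop searching
        esac
    done
    echo "$candidate"
}

# Replay the startup order from the text: Zeppelin, then Flink, then the Spark master.
zeppelin_port="$(next_free_port 8080)"
flink_port="$(next_free_port 8080 "$zeppelin_port")"
spark_port="$(next_free_port 8080 "$zeppelin_port" "$flink_port")"
echo "Zeppelin: $zeppelin_port, Flink: $flink_port, Spark master: $spark_port"
# prints: Zeppelin: 8080, Flink: 8081, Spark master: 8082
```

Changing the startup order, or adding another occupied port to the list, shows how quickly the final assignment shifts.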

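You can choose the port a web UI binds to with a startup option: add --webui-port <port> to the command line when starting Flink and Spark, where <port> is the port the web UI should use. The port can also be set in the configuration files. See the official documentation for details, which are not repeated here.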