Azure HDInsight and Spark Big Data in Practice (Part 2)

Date: 2019-12-01

HDInsight cluster on Linux

Sign in to the Azure portal (https://manage.windowsazure.com).

Click the NEW button in the lower-left corner, then click DATA SERVICES, click HDINSIGHT, and select HADOOP ON LINUX, as shown in the figure below.

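Enter the cluster name, choose the cluster size and account, and set the cluster password and storage account. The table below explains each parameter and how to configure it.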

Name	Value
Cluster Name	Name of the cluster.
Cluster Size	Number of data nodes you want to deploy. The default value is 4. But the option to use 1 or 2 data nodes is also available from the drop-down. Any number of cluster nodes can be specified by using the Custom Create option. Pricing details on the billing rates for various cluster sizes are available. Click the ? symbol just above the drop-down box and follow the link on the pop-up.
Password	The password for the HTTP account (default user name: admin) and SSH account (default user name: hdiuser). Note that these are NOT the administrator accounts for the virtual machines on which the clusters are provisioned.
Storage Account	Select the Storage account you created from the drop-down box. Once a Storage account is chosen, it cannot be changed. If the Storage account is removed, the cluster will no longer be available for use. The HDInsight cluster is co-located in the same datacenter as the Storage account.

Click CREATE HDINSIGHT CLUSTER to create a Hadoop cluster running on Azure.

The steps above quickly create a Linux cluster running Hadoop, with the default SSH user name hdiuser and the default HTTP account name admin. To use custom options, such as creating the cluster with SSH key authentication or attaching additional storage, see Provision Hadoop Linux clusters in HDInsight using custom options (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-provision-linux-clusters/).
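To keep the quick-create choices in one place, they can be written down as a small Python record. This is only a sketch for note-taking: it does not call any Azure API, the class name QuickCreateSettings and the storage account name examplestorage are hypothetical, and the allowed sizes simply mirror the drop-down values described in the table.

```python
from dataclasses import dataclass

# Hypothetical container for the quick-create form fields described above.
# It does not call any Azure API; it only mirrors the defaults from the table.
QUICK_CREATE_SIZES = (1, 2, 4)  # drop-down choices named in the table


@dataclass
class QuickCreateSettings:
    cluster_name: str
    storage_account: str
    password: str
    data_nodes: int = 4          # default cluster size
    http_user: str = "admin"     # default HTTP account
    ssh_user: str = "hdiuser"    # default SSH account

    def validate(self) -> None:
        if not self.cluster_name:
            raise ValueError("cluster name is required")
        if self.data_nodes not in QUICK_CREATE_SIZES:
            # Other sizes need the Custom Create option.
            raise ValueError("use Custom Create for other cluster sizes")


# "examplestorage" and the password are placeholders, not real values.
settings = QuickCreateSettings("Hadooponlinux", "examplestorage", "example-password")
settings.validate()
```

Any size outside the drop-down values is rejected here on purpose, as a reminder that it belongs to the Custom Create path.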

Installing Spark

In HDInsight, click the Hadoop cluster you created (named Hadooponlinux in this example) to open its dashboard, as shown in the figure below.

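Under quick glance, copy the value of Cluster Connection String. This is the address of Ambari, the management console for Hadoop on Linux. Paste the value into a browser, and a user name and password prompt appears. The user name is admin, the default HTTP user name from the quick-create step, and the password is the one you set when creating the cluster.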

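After you enter the correct user name and password, Ambari's own login prompt appears. Enter the user name admin and the password hadoop to open the Ambari management console.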

The following diagram shows the Spark installation process using Ambari.

Select the Ambari "Services" tab.

In the Ambari "Actions" drop-down menu, select "Add Service". This starts the Add Service wizard.

Select "Spark", then click "Next".

(For HDP 2.2.4, Ambari will install Spark version 1.2.1, not 1.2.0.2.2.)

Ambari displays a warning message. Confirm that the cluster is running HDP 2.2.4 or later, then click "Proceed".

	Note
	You can reconfirm component versions in Step 6 before finalizing the upgrade.

Select the node for the Spark History Server, then click "Next" to continue.

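Specify the Spark slaves, then click "Next" to continue.
On the Customize Services screen, it is recommended to keep the default values for the initial configuration, then click "Next" to continue.
Ambari displays the Review screen. Click "Deploy" to continue.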

	Important
	On the Review screen, make sure all HDP components are version 2.2.4 or later.

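Ambari displays the Install, Start and Test screen. The status bar and messages indicate progress.
When Ambari finishes the installation, click "Complete" to finish installing Spark.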
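Besides the wizard, the state of the new service can usually be read through Ambari's REST API. The sketch below only builds the URL and parses a response body, and it sends nothing over the network. The host ambari.example.com is hypothetical, the sample JSON is written by hand rather than captured from a cluster, and the path and the ServiceInfo.state field should be checked against your Ambari version.

```python
import json

# Hypothetical values: replace with your own Ambari address and cluster name.
AMBARI_BASE = "https://ambari.example.com"
CLUSTER = "Hadooponlinux"


def service_url(base: str, cluster: str, service: str) -> str:
    """Build the Ambari REST URL for one service of one cluster."""
    return f"{base.rstrip('/')}/api/v1/clusters/{cluster}/services/{service}"


def service_state(response_body: str) -> str:
    """Extract the state field from an Ambari service response body."""
    payload = json.loads(response_body)
    return payload["ServiceInfo"]["state"]


# Sample body written by hand for illustration; it is not captured output.
sample = '{"ServiceInfo": {"service_name": "SPARK", "state": "STARTED"}}'
print(service_url(AMBARI_BASE, CLUSTER, "SPARK"))
print(service_state(sample))
```

In practice the URL would be fetched with the same credentials used for the Ambari console, and a state other than STARTED would suggest going back to the Services tab to investigate.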

Run Spark

Log in to the Hadoop Linux cluster over SSH and run the following Linux command to download a document that the Spark program will use later.

wget http://en.wikipedia.org/wiki/Hortonworks

Copy the data into HDFS on the Hadoop cluster:

hadoop fs -put ~/Hortonworks /user/guest/Hortonworks

Many Spark examples are demonstrated with Scala or Java applications. This example uses PySpark to show how to use Spark from the Python language.

pyspark

The first step is to create an RDD using the Spark Context, sc, as follows:

myLines = sc.textFile('hdfs://sandbox.hortonworks.com/user/guest/Hortonworks')

Now that the RDD is instantiated, we apply a transformation to it. For this we use a Python lambda expression as a filter.

myLines_filtered = myLines.filter( lambda x: len(x) > 0 )

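Note that the Python statements above do not trigger any execution on the RDD. Only an action, such as the count() call below, triggers the actual RDD computation.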

myLines_filtered.count()

The final result of the Spark job is shown below.

341.
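The difference between a transformation such as filter() and an action such as count() can be imitated in plain Python without a cluster. The LazyLines class below is a toy written for this article, not Spark's implementation or API. It only records the predicate when filter() is called and applies it when count() is called, and the four sample lines are made up.

```python
class LazyLines:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, lines, predicates=()):
        self._lines = lines
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        # Transformation: return a new object with one more pending step.
        return LazyLines(self._lines, self._predicates + (predicate,))

    def count(self):
        # Action: only now are the recorded predicates applied.
        return sum(
            1 for line in self._lines
            if all(p(line) for p in self._predicates)
        )


calls = []  # records every line the predicate is evaluated on


def non_empty(line):
    calls.append(line)
    return len(line) > 0


# Made-up sample lines; the real file has many more.
my_lines = LazyLines(["<html>", "", "Hortonworks", ""])
my_lines_filtered = my_lines.filter(non_empty)
print(len(calls))                  # 0: filter() alone evaluated nothing
print(my_lines_filtered.count())   # 2: the two non-empty lines
print(len(calls))                  # 4: count() visited every line
```

The same shape explains why the pyspark session returns immediately after filter() and only starts a job once count() is entered.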

Data Science with Spark

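For data scientists, Spark is a highly effective data processing tool. Data scientists often use Notebook-style tools (such as iPython, http://ipython.org/notebook.html) to quickly build prototypes and share their work. Many data scientists like to use R, and fortunately SparkR, the integration of Spark with R, has become an emerging Spark capability. Apache Zeppelin (https://zeppelin.incubator.apache.org/) is an emerging tool that provides Notebook functionality on top of Spark. Here is a view of the easy-to-use Spark user interface that Apache Zeppelin provides.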

Author: 雪松

Microsoft MVP -- Windows Platform Development,

Hortonworks Certified Apache Hadoop 2.0 Developer