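Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. It is an open-source, general-purpose parallel framework in the style of Hadoop MapReduce and shares MapReduce's strengths. Unlike MapReduce, however, a job's intermediate output can be kept in memory, so it no longer has to be written to and read back from HDFS. This makes Spark a better fit for iterative algorithms such as those used in data mining and machine learning.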
Environment: Docker (17.04.0-ce), Ubuntu image (16.04.3), JDK (1.8.0_144), Hadoop (3.1.1), Spark (2.3.2)
1. Install Hadoop
See: Installing Hadoop in pseudo-distributed mode
2. Extract Spark
bigdata@lab-bd:~$ tar -xf spark-2.3.2-bin-without-hadoop.tgz
3. Rename conf/spark-env.sh.template to spark-env.sh
bigdata@lab-bd:~$ mv spark-2.3.2-bin-without-hadoop/conf/spark-env.sh.template spark-2.3.2-bin-without-hadoop/conf/spark-env.sh
4. Edit conf/spark-env.sh and add the following variables
export JAVA_HOME=/home/hadoop/jdk1.8.0_144
export SPARK_DIST_CLASSPATH=$(/home/bigdata/hadoop-3.1.1/bin/hadoop classpath)
export HADOOP_CONF_DIR=/home/bigdata/hadoop-3.1.1/etc/hadoop
export PYSPARK_PYTHON=/usr/bin/python3.5
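As a quick sanity check on the file edited in this step, the sketch below scans the text of a spark-env.sh and reports which of the four variables are not exported. It is only an illustration: the helper name missing_spark_env_vars is hypothetical, and it recognises plain `export NAME=value` lines at the start of a line, not every form of shell assignment.

```python
import re

# The four variables this guide adds to conf/spark-env.sh.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_DIST_CLASSPATH", "HADOOP_CONF_DIR", "PYSPARK_PYTHON")

# Matches lines such as "export NAME=value" (leading whitespace allowed).
# Commented-out lines ("# export NAME=value") do not match.
_EXPORT_RE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=", re.MULTILINE)


def missing_spark_env_vars(text, required=REQUIRED_VARS):
    """Return the required variables that have no `export NAME=` line in text."""
    exported = set(_EXPORT_RE.findall(text))
    return [name for name in required if name not in exported]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_env.py path/to/conf/spark-env.sh
    if len(sys.argv) == 2:
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(missing_spark_env_vars(handle.read()))
```

An empty list means all four variables are present. The check says nothing about whether the paths they point to actually exist.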
1. Start the HDFS service
bigdata@lab-bd:~$ hadoop-3.1.1/sbin/start-dfs.sh
2. Start the YARN service
bigdata@lab-bd:~$ hadoop-3.1.1/sbin/start-yarn.sh
3. Run pyspark in interactive mode
bigdata@lab-bd:~$ spark-2.3.2-bin-without-hadoop/bin/pyspark --master yarn --deploy-mode client
4. Submit a job with spark-submit
bigdata@lab-bd:~$ spark-2.3.2-bin-without-hadoop/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client \
> spark-2.3.2-bin-without-hadoop/examples/jars/spark-examples_2.11-2.3.2.jar
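For orientation, the SparkPi example estimates pi by Monte Carlo sampling: it draws random points in a square and counts how many fall inside the inscribed circle. The sketch below shows the same idea in plain Python on a single machine. The function name estimate_pi and its seed parameter are illustrative and not part of Spark. In the Spark job the sampling is spread across executors, whereas here it is a single loop.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # The point lies inside the unit circle when x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of circle) / (area of square) = pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(200000))
```

The estimate gets closer to pi as the number of samples grows, which is why the work is worth distributing.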
5. Open http://10.0.0.3:8088 (the YARN ResourceManager web UI) in a browser
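The application list shown in the web UI can also be read from a script. The sketch below assumes that the ResourceManager REST endpoint /ws/v1/cluster/apps is reachable at the same address and that its JSON response has the shape {"apps": {"app": [...]}}. The helper names are hypothetical, and the fields returned may differ between Hadoop versions.

```python
import json
import urllib.request


def summarize_apps(payload):
    """Return (id, name, state, finalStatus) tuples from a cluster-apps response."""
    # "apps" may be missing or null when no application has been submitted yet.
    apps = (payload.get("apps") or {}).get("app") or []
    return [
        (app.get("id"), app.get("name"), app.get("state"), app.get("finalStatus"))
        for app in apps
    ]


def fetch_apps(base_url, timeout=10):
    """Query the ResourceManager for its application list (assumed endpoint)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/ws/v1/cluster/apps",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs a running ResourceManager, so it is left commented out):
# for row in summarize_apps(fetch_apps("http://10.0.0.3:8088")):
#     print(row)
```

Parsing is kept separate from the network call so that it can be exercised without a cluster.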
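1. Exception: Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
Hadoop and Spark are installed separately, and Spark needs Hadoop's libraries at run time. Without the SPARK_DIST_CLASSPATH variable, Spark cannot find the Hadoop classes.
Edit conf/spark-env.sh and set the SPARK_DIST_CLASSPATH variable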
export SPARK_DIST_CLASSPATH=$(/home/bigdata/hadoop-3.1.1/bin/hadoop classpath)
2. Exception: Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment
Hadoop and Spark are installed separately, and Spark needs Hadoop's configuration at run time. Without the HADOOP_CONF_DIR variable, Spark cannot locate YARN.
Edit conf/spark-env.sh and set the HADOOP_CONF_DIR variable
export HADOOP_CONF_DIR=/home/bigdata/hadoop-3.1.1/etc/hadoop
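3. Exception: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped
Not enough physical or virtual memory is allocated, so YARN kills the process outright. The memory checks need to be disabled.
Edit etc/hadoop/yarn-site.xml in the Hadoop directory and add the following configuration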
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
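To confirm that the change made it into the file, the sketch below parses the text of a yarn-site.xml and checks that both properties are explicitly set to false. The function names are hypothetical, and the code assumes the usual Hadoop layout of property elements nested inside a single configuration root element.

```python
import xml.etree.ElementTree as ET

# The two NodeManager memory checks that this fix turns off.
MEMORY_CHECKS = (
    "yarn.nodemanager.pmem-check-enabled",
    "yarn.nodemanager.vmem-check-enabled",
)


def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def memory_checks_disabled(xml_text):
    """Return True only if both memory checks are explicitly set to false."""
    props = read_properties(xml_text)
    return all(props.get(name, "").lower() == "false" for name in MEMORY_CHECKS)
```

A property that is absent counts as not disabled here, since the sketch only reports what the file states explicitly.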
4. Error: env: 'python': No such file or directory
pyspark needs a Python interpreter, and the PYSPARK_PYTHON variable is not set. Set it in conf/spark-env.sh
export PYSPARK_PYTHON=/usr/bin/python3.5
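The sketch below shows one way to see which interpreter would be picked up and whether it exists. It assumes that, when PYSPARK_PYTHON is unset, the launcher falls back to a plain `python` command, which is consistent with the error message above. The helper name resolve_pyspark_python is hypothetical.

```python
import os
import shutil


def resolve_pyspark_python(env=None):
    """Return (command, resolved_path); resolved_path is None if nothing is found."""
    env = os.environ if env is None else env
    # Assumption: without PYSPARK_PYTHON the launcher uses plain "python".
    command = env.get("PYSPARK_PYTHON") or "python"
    # shutil.which checks the path directly when command contains a directory,
    # otherwise it searches the directories listed in PATH.
    return command, shutil.which(command, path=env.get("PATH"))


if __name__ == "__main__":
    command, resolved = resolve_pyspark_python()
    print(command, "->", resolved if resolved else "not found")
```

If the second value comes back as None, the error in this step is the likely outcome.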