This guide's environment: Mac + Parallels Desktop + CentOS 7 + JDK 7 + Hadoop 2.6 + Scala 2.10.4 + IDEA 14.0.5
——————————————————————————————————————————————————
■ Remember to save a snapshot after each installation step.
■ Environment preparation
CentOS 7 download: http://mirrors.163.com/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1511.iso
■ Installing CentOS 7 on Mac Parallels Desktop - http://www.linuxidc.com/Linux/2016-08/133827.htm
Configure the network interface (optional)
[root@localhost ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
After saving, restart the network service
/etc/init.d/network stop
/etc/init.d/network start
Install network tool packages (optional)
yum install net-tools
yum install wget
PackageKit issue: if yum reports "/var/run/yum.pid is already locked", force-remove the lock:
rm -f /var/run/yum.pid
Switch the yum repository to the Aliyun mirror
cd /etc/yum.repos.d/
mv CentOS-Base.repo Centos-Base.repo.bak
wget -O CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all
yum makecache
■ CentOS 7 GNOME graphical interface (optional)
yum groupinstall "X Window System"
yum groupinstall "GNOME Desktop"
startx --> enter the graphical interface
runlevel --> check the current run level
■ Post-install configuration for CentOS 7
http://www.cnblogs.com/pinnsvin/p/5889857.html
——————————————————————————————————————————————————
Uninstall the OpenJDK bundled with CentOS 7 x64 and install Oracle JDK 7 - http://www.cnblogs.com/CuteNet/p/3947193.html
rpm -qa | grep java
Adjust the following commands according to the output of the previous command:
rpm -e --nodeps python-javapackages-3.4.1-11.el7.noarch
rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.65-3.b17.el7.x86_64
rpm -e --nodeps java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64
rpm -e --nodeps java-1.7.0-openjdk-headless-1.7.0.91-2.6.2.3.el7.x86_64
rpm -e --nodeps tzdata-java-2015g-1.el7.noarch
rpm -e --nodeps javapackages-tools-3.4.1-11.el7.noarch
rpm -e --nodeps java-1.8.0-openjdk-1.8.0.65-3.b17.el7.x86_64
Download jdk-7u79-linux-x64.tar.gz from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
mkdir /usr/local/java
cp jdk-7u79-linux-x64.tar.gz /usr/local/java
cd /usr/local/java
tar xvf jdk-7u79-linux-x64.tar.gz
rm jdk-7u79-linux-x64.tar.gz
vim /etc/profile
After opening the file, append the following at the end:
export JAVA_HOME=/usr/local/java/jdk1.7.0_79
export JRE_HOME=/usr/local/java/jdk1.7.0_79/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
Source the profile so the changes take effect immediately
source /etc/profile
Verify the installation
java -version
——————————————————————————————————————————————————
http://dblab.xmu.edu.cn/blog/install-hadoop-in-centos/
su
useradd -m hadoop -s /bin/bash
passwd hadoop    # set a password for the hadoop user (e.g. hadoop)
visudo
hadoop ALL=(ALL) ALL
rpm -qa | grep ssh
cd ~/.ssh/
ssh-keygen -t rsa    # press Enter at every prompt
cat id_rsa.pub >> authorized_keys
chmod 600 ./authorized_keys
Download Hadoop: http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
sudo tar -zxf ~/Desktop/hadoop-2.6.0.tar.gz -C /usr/local    # adjust the path to wherever the archive was saved
cd /usr/local/
sudo mv ./hadoop-2.6.0/ ./hadoop
sudo chown -R hadoop:hadoop ./hadoop
Check that Hadoop works
cd /usr/local/hadoop
./bin/hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar
--> Basic Hadoop environment is now set up
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input    # use the config files as input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*    # view the results
rm -r ./output
gedit ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/local/java/jdk1.7.0_79
export JRE_HOME=/usr/local/java/jdk1.7.0_79/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
source ~/.bashrc
gedit ./etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
gedit ./etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
./bin/hdfs namenode -format
./sbin/start-dfs.sh
After starting, jps should show something like this:
[hadoop@localhost hadoop]$ jps
27710 NameNode
28315 SecondaryNameNode
28683 Jps
27973 DataNode
If you see the warning "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable", replace the bundled native libraries:
tar -xf hadoop-native-64-2.6.0.tar -C /usr/local/hadoop/lib/native/
cp /usr/local/hadoop/lib/native/* /usr/local/hadoop/lib/
Add the following system variables
export HADOOP_COMMON_LIB_NATIVE_DIR=/home/administrator/work/hadoop-2.6.0/lib/native
export HADOOP_OPTS="-Djava.library.path=/home/administrator/work/hadoop-2.6.0/lib"
export HADOOP_ROOT_LOGGER=DEBUG,console
The root cause is that libhadoop.so and libsnappy.so are missing from the JRE directory. Specifically, spark-shell depends on Scala, Scala depends on the JDK pointed to by JAVA_HOME, and the two files libhadoop.so and libsnappy.so should be placed under $JAVA_HOME/jre/lib/amd64.
Of the two .so files, libhadoop.so can be found under HADOOP_HOME, e.g. hadoop/lib/native. libsnappy.so has to be built yourself: download snappy-1.1.0.tar.gz, run ./configure and make, and after a successful build it sits in the .libs folder.
Once both files are in place, starting spark-shell no longer shows this problem.
Link: https://www.zhihu.com/question/23974067/answer/26267153
Issue: Java was installed as root, so the hadoop user lacks permissions on the Java directory
cd /
sudo chown -R hadoop:hadoop ./usr/local/java
Enabling/disabling Hadoop debug output
Enable:  export HADOOP_ROOT_LOGGER=DEBUG,console
Disable: export HADOOP_ROOT_LOGGER=INFO,console
./bin/hdfs dfs -mkdir -p /user/hadoop
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input
./bin/hdfs dfs -ls input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
./bin/hdfs dfs -cat output/*
rm -r ./output    # first delete the local output folder (if it exists)
./bin/hdfs dfs -get output ./output    # copy the output folder from HDFS to the local machine
cat ./output/*
When Hadoop runs a job, the output directory must not already exist, otherwise it fails with "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". To run the job again, delete the output folder first:
./bin/hdfs dfs -rm -r output    # delete the output folder
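If you later drive jobs from your own Spark/Scala code, the same cleanup can be done programmatically with the Hadoop FileSystem API. This is only a minimal sketch, assuming the Hadoop configuration files are on the classpath and a user-relative "output" path (both assumptions, not taken from this guide):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CleanOutput {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())  // picks up core-site.xml from the classpath (assumption)
    val out = new Path("output")                  // hypothetical output directory, relative to the user's HDFS home
    if (fs.exists(out)) fs.delete(out, true)      // recursive delete, like hdfs dfs -rm -r output
    fs.close()
  }
}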
Shut down Hadoop
./sbin/stop-dfs.sh
The next time you start Hadoop there is no need to format the NameNode again; just run
./sbin/start-dfs.sh and that's it.
YARN was split out of MapReduce and is responsible for resource management and job scheduling. MapReduce now runs on top of YARN, which provides better availability and scalability; a fuller introduction to YARN is beyond this guide, see the references if interested.
Starting Hadoop with ./sbin/start-dfs.sh as above only brings up HDFS; we can additionally start YARN and let it take care of resource management and job scheduling.
./sbin/start-dfs.sh
mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml
gedit ./etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
gedit ./etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
./sbin/start-yarn.sh    # start YARN
./sbin/mr-jobhistory-daemon.sh start historyserver    # start the history server so job details can be viewed in the web UI
[hadoop@localhost hadoop]$ jps
11148 JobHistoryServer
9788 NameNode
10059 DataNode
11702 Jps
10428 SecondaryNameNode
10991 NodeManager
10874 ResourceManager
http://localhost:8088/cluster
./sbin/stop-yarn.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver
——————————————————————————————————————————————————
"Spark Quick Start Guide – Spark Installation and Basic Usage" - http://dblab.xmu.edu.cn/blog/spark-quick-start-guide/
spark-1.6.0-bin-hadoop2.6.tgz
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
Extract it:
sudo tar -zxf ~/Downloads/spark-1.6.0-bin-hadoop2.6.tgz -C /usr/local/
cd /usr/local
sudo mv ./spark-1.6.0-bin-hadoop2.6/ ./spark
sudo chown -R hadoop:hadoop ./spark    # here hadoop is your username
After installation, Spark's classpath needs to be set in ./conf/spark-env.sh; run the following commands to copy the template config file:
cd /usr/local/spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh
gedit ./conf/spark-env.sh
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
Global environment variables: add the following to /etc/profile, then source it:
sudo gedit /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.7.0_79
export HADOOP_HOME=/usr/local/hadoop
export SCALA_HOME=/usr/lib/scala-2.10.4
export SPARK_HOME=/usr/local/spark
source /etc/profile
Configure Spark environment variables
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
gedit spark-env.sh
spark-env.sh settings:
export SCALA_HOME=/usr/lib/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_HOME=/usr/local/spark
export SPARK_PID_DIR=$SPARK_HOME/tmp
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8099
export SPARK_WORKER_CORES=1           # number of CPU cores each Worker uses
export SPARK_WORKER_INSTANCES=1       # number of Worker instances started on each slave
export SPARK_WORKER_MEMORY=512m       # amount of memory each Worker uses
export SPARK_WORKER_WEBUI_PORT=8081   # Worker web UI port
export SPARK_EXECUTOR_CORES=1         # number of cores each Executor uses
export SPARK_EXECUTOR_MEMORY=128m     # amount of memory each Executor uses
export SPARK_CLASSPATH=$SPARK_HOME/conf/:$SPARK_HOME/lib/*:/usr/local/hadoop/lib/native:$SPARK_CLASSPATH
With Spark's installation directory (/usr/local/spark) as the current directory:
cd /usr/local/spark
./bin/run-example SparkPi 2>&1 | grep "Pi is roughly"
The Python version of SparkPi has to be run via spark-submit:
./bin/spark-submit examples/src/main/python/pi.py 2>&1 | grep "Pi is roughly"
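For reference, SparkPi is essentially a Monte Carlo estimate of Pi. A minimal sketch of the same idea that can be pasted into spark-shell; it relies on the sc context provided by the shell, and the sample count n is an arbitrary choice, not something from this guide:

val n = 100000                       // number of random samples (arbitrary choice)
val inside = sc.parallelize(1 to n).map { _ =>
  val x = math.random * 2 - 1        // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0   // 1 if the point falls inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)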
cd /usr/local/hadoop
./sbin/start-dfs.sh
./sbin/start-yarn.sh
Run an example
cd /usr/local/spark
bin/spark-submit --master yarn ./examples/src/main/python/wordcount.py file:///usr/local/spark/LICENSE
(Snapshot: Spark example ran successfully)
./bin/spark-shell
val textFile = sc.textFile("file:///usr/local/spark/README.md")
// textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:27
textFile.count() // number of items in the RDD; for a text file this is the total number of lines
// res0: Long = 95
textFile.first() // the first item in the RDD; for a text file this is the first line
// res1: String = # Apache Spark
val linesWithSpark = textFile.filter(line => line.contains("Spark")) // keep only the lines containing "Spark"
linesWithSpark.count() // count those lines
// res4: Long = 17
textFile.filter(line => line.contains("Spark")).count() // count the lines containing "Spark" in one step
// res4: Long = 17
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
// res5: Int = 14
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
// res6: Int = 14
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) // word count
// wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:29
wordCounts.collect() // print the word-count results
// res7: Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1)...)
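To see the most frequent words instead of the raw pairs, a small optional follow-up in the same spark-shell session; the top-10 cutoff is an arbitrary choice, not part of the original guide:

wordCounts.map { case (w, c) => (c, w) }.sortByKey(ascending = false).take(10)  // top 10 (count, word) pairs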
Install netcat for the Spark Streaming example below. Method 1:
wget http://downloads.sourceforge.net/project/netcat/netcat/0.6.1/netcat-0.6.1-1.i386.rpm -O ~/netcat-0.6.1-1.i386.rpm # download
sudo rpm -iUv ~/netcat-0.6.1-1.i386.rpm # install
Method 2:
wget http://sourceforge.net/projects/netcat/files/netcat/0.7.1/netcat-0.7.1-1.i386.rpm
rpm -ihv netcat-0.7.1-1.i386.rpm
yum list glibc*
rpm -ihv netcat-0.7.1-1.i386.rpm
# call this terminal 1
nc -l -p 9999
# open another terminal (call it terminal 2), then run the following command
/usr/local/spark/bin/run-example streaming.NetworkWordCount localhost 9999 2>/dev/null
(Snapshot: Spark Streaming example completed)
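For reference, the bundled NetworkWordCount example corresponds roughly to the following Scala program. This is only a sketch against the standard Spark 1.x Streaming API; the local[2] master and the 1-second batch interval are assumptions rather than something taken from this guide:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batches (assumption)
    val lines = ssc.socketTextStream("localhost", 9999)   // text typed into the nc terminal
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                        // print each batch's word counts
    ssc.start()
    ssc.awaitTermination()
  }
}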
To reduce console noise, edit spark/conf/log4j.properties and change:
log4j.rootCategory: WARN -> ERROR
log4j.logger.org.spark-project.jetty: WARN -> ERROR
——————————————————————————————————————————————————
Install Scala 2.10.4: download scala-2.10.4.tgz from http://www.scala-lang.org/ and copy it to /usr/lib
sudo tar -zxf scala-2.10.4.tgz -C /usr/lib
Use the global approach: edit /etc/profile so the environment variables are shared by all users
sudo gedit /etc/profile
export SCALA_HOME=/usr/lib/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
source /etc/profile
scala -version
[hadoop@localhost 下載]$ scala -version
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
——————————————————————————————————————————————————
Reference: http://dongxicheng.org/framework-on-yarn/apache-spark-intellij-idea/
● "Installing and using IntelliJ IDEA on Linux" - http://www.linuxdiyf.com/linux/19143.html
Eclipse is not recommended for developing Spark programs or reading the source code; IntelliJ IDEA is recommended instead.
● Download IDEA 14.0.5:
http://confluence.jetbrains.com/display/IntelliJIDEA/Previous+IntelliJ+IDEA+Releases
http://download.jetbrains.8686c.com/idea/ideaIU-14.0.5.tar.gz
https://download.jetbrains.8686c.com/idea/ideaIU-2016.2.5-no-jdk.tar.gz (requires JDK 1.8 or later)
Unsupported Java Version: Cannot start under Java 1.7.0_79-b15: Java 1.8 or later is required.
Extract it, then go into the bin directory of the extracted folder and run:
tar -zxvf ideaIU-14.tar.gz -C /usr/intellijIDEA
export IDEA_JDK=/usr/local/java/jdk1.7.0_79
./idea.sh
key:IDEA
value:61156-YRN2M-5MNCN-NZ8D2-7B4EW-U12L4
http://www.linuxdiyf.com/linux/19143.html
Plugin download: http://plugins.jetbrains.com/files/1347/19005/scala-intellij-bin-1.4.zip
After installing the plugin, choose "Create New Project" on the start screen; a "Scala" project type now appears in the dialog, as shown below. Choose Scala -> Scala.
Click Next; on the following screen give the project any name and select the Scala and JDK you installed. Note: when choosing the Scala version, do not pick 2.11.x, or you will hit serious problems later. When done, click Finish.
Then open File -> Project Structure, click Libraries (the right-hand pane is empty at first), click "+", browse into the spark-XXX-bin-hadoopXX directory you extracted when installing Spark, and select spark-assembly-XXX-hadoopXX.jar from its lib directory, as shown below. Click Apply, then OK.
"Installing the Scala plugin for IntelliJ in detail"
http://blog.csdn.net/a2011480169/article/details/52712421
The page shows: Updated: 2016/7/13
So we go to the following site to find a matching plugin: http://plugins.jetbrains.com/plugin/?idea&id=1347
After downloading the plugin, put the .zip Scala plugin into the plugins directory of the IntelliJ installation;
then install the Scala plugin just placed in IntelliJ's plugins directory (note: install the zip file directly).
Setting up the Spark development environment
Create a Scala project in IntelliJ IDEA, then choose "File" -> "Project Structure" -> "Libraries", click "+", and import the matching spark assembly jar.
"Spark Getting-Started Series -- 3. The Spark Programming Model (Part 2) -- IDEA setup and practice"
http://www.cnblogs.com/shishanyuan/p/4721120.html
package class3

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TrySparkStreaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val txtFile = "/root/test"
    val txtData = sc.textFile(txtFile)
    txtData.cache()
    txtData.count()
    val wcData = txtData.flatMap { line => line.split(",") }.map { word => (word, 1) }.reduceByKey(_ + _)
    wcData.collect().foreach(println)
    sc.stop()
  }
}
——————————————————————————————————————————————————
[hadoop@localhost spark]$ hdfs dfs -put LICENSE /zhaohang
hdfs dfs -ls
hdfs dfs -cat /zhaohang | wc -l
cd /usr/local/spark/bin
./pyspark --master yarn
lines=sc.textFile("hdfs://localhost:9000/zhaohang",1)
16/11/17 19:36:34 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 228.8 KB, free 228.8 KB)
16/11/17 19:36:34 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.5 KB, free 248.3 KB)
16/11/17 19:36:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.211.55.8:60185 (size: 19.5 KB, free: 511.5 MB)
16/11/17 19:36:34 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
temp1 = lines.flatMap(lambda x:x.split(' '))
temp1.collect()
map = temp1.map(lambda x: (x,1))
map.collect()
rdd = sc.parallelize([1,2,3,4],2)
def f(iterator): yield sum(iterator)
rdd.mapPartitions(f).collect()  # [3, 7]
rdd = sc.parallelize(["a","b","c"])
test = rdd.flatMap(lambda x:(x,1))
test.count()
sorted(test.collect())  # [1, 1, 1, 'a', 'b', 'c']
Spark web UI: http://localhost:8088/proxy/application_1479381551764_0002/jobs/
cd /usr/local/hadoop
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
——————————————————————————————————————————————————