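Part 1. Installing Spark on Windows
1. Install Java: jdk-8u101-windows-x64
Configure the environment variables:
(1) Add a new variable named JAVA_HOME
with the value C:\Program Files\Java\jdk1.8.0_101 (no trailing semicolon)
(2) Find the system variable Path
and append C:\Program Files\Java\jdk1.8.0_101\bin to its value, separated from the existing entries by a semicolon
(3) Verify: press Win+R, enter cmd, and run the following command in the command prompt window: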
java -version
Output like the following means Java is installed and configured correctly:
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
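2. Install Scala: scala-2.11.8
Then, in the command prompt window, run the command: scala
If no error is reported, the installation succeeded and output like the following should appear: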
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101).
Type in expressions for evaluation. Or try :help.
scala>
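Note: the Scala version must match the Spark version, so choose the Scala version according to the Spark version you use. The prebuilt Spark 1.6.x packages are compiled against Scala 2.10 by default, which matters mainly when you compile your own applications; spark-shell uses the Scala libraries bundled with Spark.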
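3. Install Spark
Extract the downloaded archive spark-1.6.2-bin-cdh4 into the current directory
Move the extracted folder to D:\ (or any directory you prefer)
Open a command prompt window: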
D:
cd spark-1.6.2-bin-cdh4
(1) Start the Master by running the following in the command prompt:
bin\spark-class org.apache.spark.deploy.master.Master
(2) Start a Worker with 1 core and 512 MB of memory (replace 10.0.1.119 with the IP address of your Master)
bin\spark-class org.apache.spark.deploy.worker.Worker spark://10.0.1.119:7077 -c 1 -m 512M
(3) Start a second Worker with 1 core and 1 GB of memory
bin\spark-class org.apache.spark.deploy.worker.Worker spark://10.0.1.119:7077 -c 1 -m 1G
(4) Start spark-shell in standalone mode, connected to the Master started above
bin\spark-shell --master spark://10.0.1.119:7077
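The Master, Worker, and spark-shell commands above all share one master URL of the form spark://host:port. As a minimal sketch, the Worker command line can be assembled from its parts as shown here; the helper names master_url and worker_command are hypothetical, and 7077 is simply the port used in this walkthrough.

```python
def master_url(host, port=7077):
    """Return a standalone master URL such as spark://10.0.1.119:7077."""
    return "spark://{}:{}".format(host, port)


def worker_command(host, cores, memory, port=7077):
    """Build the argument list that starts one Worker on Windows."""
    return [
        r"bin\spark-class",
        "org.apache.spark.deploy.worker.Worker",
        master_url(host, port),
        "-c", str(cores),  # number of cores this Worker may use
        "-m", memory,      # memory this Worker may use, e.g. 512M or 1G
    ]


if __name__ == "__main__":
    # Prints the same command as step (2) above.
    print(" ".join(worker_command("10.0.1.119", 1, "512M")))
```

Changing only the cores and memory arguments reproduces the second Worker command from step (3).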
Part 2. Installing Spark on Linux
1. Install Ubuntu Linux
(1) Installation packages:
VMware-workstation-full-11.1.1-2771112.exe
ubuntu-14.04.1-server-amd64.iso
jdk-8u91-linux-x64.tar.gz
spark-1.6.2-bin-hadoop2.6.tgz
Xmanager4_setup.1410342608.exe
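(2) Create three virtual machines with the hostnames spark01, spark02, and spark03
and the IP addresses 192.168.6.128 through 192.168.6.130
(3) Install Xshell on the host machine
2. Install Java on Linux
(1) Copy the files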
spark@spark01:~$ mkdir app
spark@spark01:~$ cd app/
Then copy the installation files into this directory
(2) Extract the JDK
spark@spark01:~/app$ ll
spark@spark01:~/app$ tar -zxvf jdk-8u91-linux-x64.tar.gz
spark@spark01:~/app$ ll
(3) Edit the environment variables
spark@spark01:~/app$ sudo vim /etc/profile
[sudo] password for spark:
Append two lines at the end of the file:
export JAVA_HOME=/home/spark/app/jdk1.8.0_91
export PATH=$PATH:$JAVA_HOME/bin
(4) Apply the environment variable changes
spark@spark01:~/app$ source /etc/profile
(5) Check the installed Java version
spark@spark01:~/app$ java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
spark@spark01:~/app$
3. Install Spark on Linux
(1) Extract Spark
spark@spark01:~/app$ tar -zxvf spark-1.6.2-bin-hadoop2.6.tgz
(2) Edit the configuration file
spark@spark01:~/app$ cd spark-1.6.2-bin-hadoop2.6/
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6$ ll
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6$ cd conf/
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ll
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ cp spark-env.sh.template spark-env.sh
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ll
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ vim spark-env.sh
Append the following at the end of the configuration file:
#export SPARK_LOCAL_IP=localhost
export JAVA_HOME=/home/spark/app/jdk1.8.0_91
export SPARK_MASTER_IP=spark01
#export SPARK_MASTER_IP=localhost
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=1g
#export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/nfs/spark/recovery"
#export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
#export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export SPARK_HOME=/home/spark/app/spark-1.6.2-bin-hadoop2.6
export SPARK_JAR=/home/spark/app/spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar
export PATH=$SPARK_HOME/bin:$PATH
(3) Edit the hosts file
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6$ sudo vim /etc/hosts
Comment out:
127.0.1.1 spark01
Add:
192.168.6.128 spark01
192.168.6.129 spark02
192.168.6.130 spark03
After saving and closing the file, test that the configuration is correct:
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6$ ping spark02
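The three hosts entries follow a simple pattern: consecutive IP addresses paired with numbered hostnames. The sketch below generates them, assuming the addresses really are consecutive as in this setup; hosts_entries is a hypothetical helper, not part of any tool used here.

```python
import ipaddress


def hosts_entries(first_ip, hostnames):
    """Pair each hostname with the next consecutive IPv4 address."""
    start = ipaddress.ip_address(first_ip)
    # Adding an integer to an IPv4Address yields the address that many steps later.
    return ["{} {}".format(str(start + offset), name)
            for offset, name in enumerate(hostnames)]


if __name__ == "__main__":
    for line in hosts_entries("192.168.6.128", ["spark01", "spark02", "spark03"]):
        print(line)
```

The printed lines can be pasted into /etc/hosts on each of the three machines.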
(4) Edit the slaves configuration file (do this on all three machines)
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6$ cd conf/
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ll
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ cp slaves.template slaves
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ vim slaves
Append at the end of the file:
spark02
spark03
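With spark02 and spark03 listed in slaves and the per-Worker values set in spark-env.sh (SPARK_WORKER_INSTANCES=1, SPARK_WORKER_CORES=1, SPARK_WORKER_MEMORY=1g), the resources the cluster offers can be worked out with simple arithmetic. The following is a rough sketch; cluster_capacity is a hypothetical helper, and it assumes every host in slaves uses the same settings.

```python
def cluster_capacity(worker_hosts, instances_per_host, cores_per_worker, memory_gb_per_worker):
    """Total Workers, cores, and memory offered by a standalone cluster."""
    workers = len(worker_hosts) * instances_per_host
    return {
        "workers": workers,
        "cores": workers * cores_per_worker,          # cores are configured per Worker
        "memory_gb": workers * memory_gb_per_worker,  # memory is configured per Worker
    }


if __name__ == "__main__":
    print(cluster_capacity(["spark02", "spark03"], 1, 1, 1))
```

For this setup the result is 2 Workers, 2 cores, and 2 GB in total, so a job that requests 1 GB of executor memory and 1 core fits on a single Worker.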
(5) Configure passwordless SSH login (only needed on spark01)
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ chmod 0600 ~/.ssh/authorized_keys
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ssh-copy-id
Usage: /usr/bin/ssh-copy-id [-h|-?|-n] [-i [identity_file]] [-p port] [[-o <ssh -o options>] ...] [user@]hostname
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ssh-copy-id spark02    # follow the prompts
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ssh-copy-id spark03    # follow the prompts
Test passwordless login:
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ssh spark02
spark@spark02:~$ exit
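The manual login test can also be scripted. The sketch below relies on the OpenSSH client options BatchMode=yes (fail instead of prompting for a password) and ConnectTimeout; the function names are hypothetical, and the 5-second timeout is an arbitrary choice.

```python
import subprocess


def ssh_check_command(host, timeout_seconds=5):
    """Command that succeeds only if key-based login to host works."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never fall back to a password prompt
        "-o", "ConnectTimeout={}".format(timeout_seconds),
        host,
        "true",                 # run a no-op command on the remote side
    ]


def can_login_without_password(host):
    """Return True when the ssh check exits with status 0."""
    return subprocess.call(ssh_check_command(host)) == 0


def hosts_needing_setup(hosts, check=can_login_without_password):
    """List the hosts for which passwordless login does not work yet."""
    return [host for host in hosts if not check(host)]
```

Calling hosts_needing_setup(["spark02", "spark03"]) on spark01 should return an empty list once ssh-copy-id has been run for both machines.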
(6) Start the Spark services (only needed on spark01)
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ ../sbin/start-all.sh
(7) Submit a job
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6/conf$ cd ../
spark@spark01:~/app/spark-1.6.2-bin-hadoop2.6$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark01:7077 --executor-memory 1G --total-executor-cores 1 ./lib/spark-examples-1.6.2-hadoop2.6.0.jar 100
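The SparkPi example submitted above estimates pi by sampling random points in a square and counting how many fall inside the unit circle; the trailing argument 100 is the number of partitions the work is split into. A rough PySpark sketch of the same idea follows. It is not the bundled example itself, the function names are hypothetical, and run_on_spark assumes it is launched through spark-submit so that the master is supplied on the command line.

```python
import random


def inside(_):
    """Return 1 if a random point in [-1, 1] x [-1, 1] lies in the unit circle."""
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y < 1 else 0


def estimate_pi(hits, samples):
    """The circle covers pi/4 of the square, so pi is about 4 * hits / samples."""
    return 4.0 * hits / samples


def run_on_spark(slices=100, samples_per_slice=100000):
    """Distribute the sampling over the cluster and print the estimate."""
    from pyspark import SparkContext  # imported here so the helpers work without Spark
    sc = SparkContext(appName="PythonPiSketch")
    samples = samples_per_slice * slices
    hits = sc.parallelize(range(samples), slices).map(inside).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % estimate_pi(hits, samples))
    sc.stop()
```

To try it, save the code in a file, add a call to run_on_spark() at the bottom, and pass the file to spark-submit with the same --master option as above.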
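(8) View the cluster status in a browser (the standalone Master web UI listens on port 8080 by default, for example http://spark01:8080)
(9) Edit the Bash configuration to add Spark to PATH and set the SPARK_HOME environment variable. On Ubuntu, edit ~/.bash_profile or ~/.profile and add the following lines: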
export SPARK_HOME=/home/spark/app/spark-1.6.2-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
After running source or restarting the terminal, you can use the pyspark command to start Spark's interactive Python shell:
spark@spark01:~$ source .profile
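The PYTHONPATH line above adds two locations: Spark's python directory and the bundled py4j archive. When a script is started with a plain python interpreter instead of pyspark, the same two entries can be added to sys.path at runtime. This is a minimal sketch; pyspark_paths is a hypothetical helper, and the py4j file name is the one used in this walkthrough and differs between Spark releases.

```python
import posixpath
import sys


def pyspark_paths(spark_home, py4j_zip="py4j-0.9-src.zip"):
    """Return the two directories/archives that make `import pyspark` work."""
    python_dir = posixpath.join(spark_home, "python")
    return [python_dir, posixpath.join(python_dir, "lib", py4j_zip)]


def add_pyspark_to_path(spark_home):
    """Prepend the Spark Python locations to sys.path if they are missing."""
    for entry in reversed(pyspark_paths(spark_home)):
        if entry not in sys.path:
            sys.path.insert(0, entry)
```

For example, add_pyspark_to_path("/home/spark/app/spark-1.6.2-bin-hadoop2.6") would be called before the first import of pyspark.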
4. Install the Python development libraries on Linux
(1) Update the package sources
sudo gedit /etc/apt/sources.list
Back up the old sources.list first, then replace it with the new sources.list file.
(2) Refresh the package index
sudo apt-get update
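(3) Upgrade the installed software (optional)
sudo apt-get upgrade
Alternatively, upgrade only pip (once it is installed in the next step): pip install --upgrade pip
(4) Install pip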
sudo apt-get install python-pip
(5) Switch pip to a mirror index, which noticeably speeds up downloads
First create the file:
sudo vim /etc/pip.conf
Write the following into the file:
[global]
index-url = http://pypi.douban.com/simple/
trusted-host = pypi.douban.com
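The pip.conf file uses INI syntax, so it can also be generated rather than typed by hand. The sketch below uses Python 3's standard configparser module to produce the same three lines; pip_conf_text is a hypothetical helper, and writing the result to /etc/pip.conf still requires root privileges.

```python
import configparser
import io


def pip_conf_text(index_url, trusted_host):
    """Render a [global] section with the mirror URL and its trusted host."""
    parser = configparser.ConfigParser()
    parser["global"] = {
        "index-url": index_url,        # where pip downloads packages from
        "trusted-host": trusted_host,  # allow this host even over plain HTTP
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(pip_conf_text("http://pypi.douban.com/simple/", "pypi.douban.com"))
```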
(6) Install the required packages
sudo pip install matplotlib
sudo pip install scipy
sudo pip install scikit-learn
sudo pip install ipython
sudo apt-get install python-tk
Tips:
(7) When running the command sudo pip install numpy, you may see:
The directory '/Users/huangqizhi/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
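The message is clear: the pip cache directory is not owned by root, the user that sudo runs as. If you must use sudo pip, either pass sudo's -H flag as the message suggests, or change the owner of the directory named in your own message: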
sudo chown root /Users/huangqizhi/Library/Caches/pip
5. Common commands
(1) Start the Spark cluster; run from the Spark home directory:
./sbin/start-all.sh
(2) Stop the Spark cluster; run from the Spark home directory:
./sbin/stop-all.sh
(3) Submit a job:
spark-submit pythonapp.py
spark-submit --master yarn-cluster get_hdfs.py
spark-submit --master spark://hadoop01:7077 spark_sql_wp.py
(4) Start Spark's interactive Python shell
pyspark
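The scripts passed to spark-submit in step (3) are not shown in this walkthrough. As an illustration only, a file such as pythonapp.py could look like the word-count sketch below; the function names and the sample input are hypothetical, while parallelize, flatMap, reduceByKey, and collect are standard RDD operations.

```python
def to_pairs(line):
    """Turn one line of text into (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def add(a, b):
    """Combine two partial counts for the same word."""
    return a + b


def word_counts(sc, lines):
    """Count words across the cluster using an existing SparkContext."""
    pairs = sc.parallelize(lines).flatMap(to_pairs)
    return dict(pairs.reduceByKey(add).collect())


def main():
    from pyspark import SparkContext  # available when launched via spark-submit
    sc = SparkContext(appName="PythonAppSketch")
    print(word_counts(sc, ["spark on windows", "spark on linux"]))
    sc.stop()
```

Adding a call to main() at the bottom of the file makes it runnable with spark-submit pythonapp.py; the --master option then decides whether it runs locally or on the standalone cluster.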