Contents:
Part 1: Operating system preparation
1. Install and deploy CentOS 7.3 (1611)
2. Install CentOS 7 packages (net-tools, wget, vim, etc.)
3. Switch the CentOS 7 Yum repositories to a faster mirror
4. CentOS user configuration and sudo authorization
Part 2: Java environment preparation
1. Install and configure JDK 1.8
Part 3: Hadoop configuration, startup, and verification
1. Unpack Hadoop 2.7.3 and update global environment variables
2. Update the Hadoop configuration files
3. Start Hadoop
4. Verify Hadoop
=============================================================================================
Part 1: Operating system preparation
1. Install and deploy CentOS 7.3 (1611)
2. Install CentOS 7 packages (net-tools, wget, vim, etc.)
3. Switch the CentOS 7 Yum repositories to a faster mirror
4. CentOS user configuration and sudo authorization
1. Install and deploy CentOS 7.3 (1611)

2. Install CentOS 7 packages (net-tools, wget, vim, etc.)
sudo yum install -y net-tools
sudo yum install -y wget
sudo yum install -y vim

3. Switch the CentOS 7 Yum repositories to the Aliyun mirror so packages download faster
http://mirrors.aliyun.com/help/centos
1. Back up the existing repo file:
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
2. Download the new CentOS-Base.repo into /etc/yum.repos.d/:
CentOS 5
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-5.repo
CentOS 6
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-6.repo
CentOS 7
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
3. Then run yum makecache to rebuild the package cache.

4. Run sudo yum -y update to upgrade the system.


sudo vim /etc/hosts  # update the hosts file so the name spark02 can stand in for this machine's IP

Part 2: Java environment preparation
1. Install and configure JDK 1.8
Upload the files needed for the exercise (JDK, Hadoop, Spark) via FileZilla.

Unpack the JDK and Hadoop archives:
tar -zxvf jdk-8u121-linux-x64.tar.gz
tar -zxvf hadoop-2.7.3.tar.gz
Add environment variables to .bash_profile so the Java and Hadoop commands are easier to use:
#Add JAVA_HOME and HADOOP_HOME
export JAVA_HOME=/home/spark/jdk1.8.0_121
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/spark/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source .bash_profile  # make the configuration take effect

Part 3: Hadoop configuration, startup, and verification
1. Unpack Hadoop 2.7.3 and update global environment variables
2. Update the Hadoop configuration files
3. Start Hadoop
4. Verify Hadoop
Reference: the official Hadoop 2.7.3 documentation for the pseudo-distributed configuration:
http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html
Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration
Use the following:
etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Execution
The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.
-
Format the filesystem:
$ bin/hdfs namenode -format
-
Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
-
Browse the web interface for the NameNode; by default it is available at:
- NameNode - http://localhost:50070/
-
Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
-
Copy the input files into the distributed filesystem:
$ bin/hdfs dfs -put etc/hadoop input
-
Run some of the examples provided:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
-
Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hdfs dfs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hdfs dfs -cat output/*
-
When you’re done, stop the daemons with:
$ sbin/stop-dfs.sh
YARN on a Single Node
You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and running ResourceManager daemon and NodeManager daemon in addition.
The following instructions assume that 1. ~ 4. steps of the above instructions are already executed.
-
Configure parameters as follows:
etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
-
Start ResourceManager daemon and NodeManager daemon:
$ sbin/start-yarn.sh
-
Browse the web interface for the ResourceManager; by default it is available at:
- ResourceManager - http://localhost:8088/
-
Run a MapReduce job.
-
When you’re done, stop the daemons with:
$ sbin/stop-yarn.sh
Configure passwordless SSH first; otherwise errors will occur when the daemons run.
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
The concrete settings for each Hadoop configuration file are as follows:
1. vim etc/hadoop/hadoop-env.sh
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/home/spark/jdk1.8.0_121
2. vim etc/hadoop/core-site.xml
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://spark01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/spark/hadoopdata</value>
</property>
</configuration>
3. vim etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
4. vim etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
5.vim etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
</configuration>
1. Format HDFS:
hdfs namenode -format
2. Start HDFS:
start-dfs.sh

3. Start YARN:
start-yarn.sh

Turning off the firewall on CentOS 7.2
CentOS 7.0 and later use firewalld as the default firewall; the steps below show how to check, stop, and disable it.
firewall-cmd --state  # show the firewall state (prints "not running" when stopped, "running" when started)
[root@localhost ~]# firewall-cmd --state
not running
Check the firewall status:
Since CentOS 7, systemctl is used to manage services and programs, taking over the roles of both service and chkconfig.
[root@localhost ~]# systemctl list-unit-files | grep firewalld
firewalld.service                             disabled
or
[root@localhost ~]# systemctl status firewalld.service
firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
Turn off the firewall:
systemctl stop firewalld.service     # stop firewalld
systemctl disable firewalld.service  # prevent firewalld from starting at boot
Start a service:                  systemctl start firewalld.service
Stop a service:                   systemctl stop firewalld.service
Restart a service:                systemctl restart firewalld.service
Show a service's status:          systemctl status firewalld.service
Enable a service at boot:         systemctl enable firewalld.service
Disable a service at boot:        systemctl disable firewalld.service
Check whether a service starts at boot: systemctl is-enabled firewalld.service; echo $?
List services enabled at boot:    systemctl list-unit-files | grep enabled
CentOS 7 firewall-cmd commands:
List the ports that are already open:
firewall-cmd --list-ports
Open a port:
firewall-cmd --zone=public --add-port=80/tcp --permanent
What the options mean:
--zone              # the zone the rule applies to
--add-port=80/tcp   # the port to add, in port/protocol format
--permanent         # make the rule permanent; without it, the rule is lost after a restart
Reload or restart the firewall:
firewall-cmd --reload
systemctl stop firewalld.service
systemctl disable firewalld.service
firewall-cmd --state
1. Start the firewall (firewalld):
sudo systemctl start firewalld.service
2. Check the firewall's running status:
sudo systemctl status firewalld.service
3. Configure firewall access rules to open ports 8088 (YARN) and 50070 (HDFS):
sudo firewall-cmd --zone=public --add-port=8088/tcp --permanent
sudo firewall-cmd --zone=public --add-port=50070/tcp --permanent
4. Reload the firewall rules:
sudo firewall-cmd --reload
5. Restart the firewall:
sudo systemctl restart firewalld.service
6. Verify that the firewall rules took effect:
http://spark02:8080
http://spark02:50070
http://spark02:8088


Use HDFS to create directories, copy files, and list files (note the triple slash: hdfs:///user/... is an absolute path on the default filesystem, whereas hdfs://user/... would treat "user" as a hostname):
hdfs dfs -mkdir -p hdfs:///user/jonson/input
hdfs dfs -put etc/hadoop hdfs:///user/jonson/input
hdfs dfs -ls hdfs:///user/jonson/input
hdfs dfs -mkdir hdfs:///user/jonson/output
hdfs dfs -rmdir hdfs:///user/jonson/output
hdfs dfs -ls hdfs:///user/jonson
Try out the MapReduce framework:
[spark@Spark02 hadoop-2.7.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep hdfs:///user/jonson/input/hadoop hdfs:///user/jonson/output 'dfs[a-z.]+'
17/05/07 23:36:17 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/05/07 23:36:18 INFO input.FileInputFormat: Total input paths to process : 30
17/05/07 23:36:18 INFO mapreduce.JobSubmitter: number of splits:30
17/05/07 23:36:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494169715431_0003
17/05/07 23:36:19 INFO impl.YarnClientImpl: Submitted application application_1494169715431_0003
17/05/07 23:36:19 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1494169715431_0003/
17/05/07 23:36:19 INFO mapreduce.Job: Running job: job_1494169715431_0003
17/05/07 23:36:28 INFO mapreduce.Job: Job job_1494169715431_0003 running in uber mode : false
17/05/07 23:36:28 INFO mapreduce.Job: map 0% reduce 0%
17/05/07 23:36:58 INFO mapreduce.Job: map 20% reduce 0%
17/05/07 23:37:25 INFO mapreduce.Job: map 37% reduce 0%
17/05/07 23:37:26 INFO mapreduce.Job: map 40% reduce 0%
17/05/07 23:37:50 INFO mapreduce.Job: map 47% reduce 0%
17/05/07 23:37:51 INFO mapreduce.Job: map 57% reduce 0%
17/05/07 23:37:54 INFO mapreduce.Job: map 57% reduce 19%
17/05/07 23:38:04 INFO mapreduce.Job: map 60% reduce 19%
17/05/07 23:38:06 INFO mapreduce.Job: map 60% reduce 20%
17/05/07 23:38:12 INFO mapreduce.Job: map 73% reduce 20%
17/05/07 23:38:15 INFO mapreduce.Job: map 73% reduce 24%
17/05/07 23:38:18 INFO mapreduce.Job: map 77% reduce 24%
17/05/07 23:38:21 INFO mapreduce.Job: map 77% reduce 26%
17/05/07 23:38:33 INFO mapreduce.Job: map 83% reduce 26%
17/05/07 23:38:34 INFO mapreduce.Job: map 90% reduce 26%
17/05/07 23:38:35 INFO mapreduce.Job: map 93% reduce 26%
17/05/07 23:38:36 INFO mapreduce.Job: map 93% reduce 31%
17/05/07 23:38:43 INFO mapreduce.Job: map 100% reduce 31%
17/05/07 23:38:44 INFO mapreduce.Job: map 100% reduce 100%
17/05/07 23:38:45 INFO mapreduce.Job: Job job_1494169715431_0003 completed successfully
17/05/07 23:38:45 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=345
FILE: Number of bytes written=3690573
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=81841
HDFS: Number of bytes written=437
HDFS: Number of read operations=93
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=30
Launched reduce tasks=1
Data-local map tasks=30
Total time spent by all maps in occupied slots (ms)=653035
Total time spent by all reduces in occupied slots (ms)=77840
Total time spent by all map tasks (ms)=653035
Total time spent by all reduce tasks (ms)=77840
Total vcore-milliseconds taken by all map tasks=653035
Total vcore-milliseconds taken by all reduce tasks=77840
Total megabyte-milliseconds taken by all map tasks=668707840
Total megabyte-milliseconds taken by all reduce tasks=79708160
Map-Reduce Framework
Map input records=2103
Map output records=24
Map output bytes=590
Map output materialized bytes=519
Input split bytes=3804
Combine input records=24
Combine output records=13
Reduce input groups=11
Reduce shuffle bytes=519
Reduce input records=13
Reduce output records=11
Spilled Records=26
Shuffled Maps =30
Failed Shuffles=0
Merged Map outputs=30
GC time elapsed (ms)=8250
CPU time spent (ms)=13990
Physical memory (bytes) snapshot=6025490432
Virtual memory (bytes) snapshot=64352063488
Total committed heap usage (bytes)=4090552320
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=78037
File Output Format Counters
Bytes Written=437
17/05/07 23:38:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/05/07 23:38:46 INFO input.FileInputFormat: Total input paths to process : 1
17/05/07 23:38:46 INFO mapreduce.JobSubmitter: number of splits:1
17/05/07 23:38:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494169715431_0004
17/05/07 23:38:46 INFO impl.YarnClientImpl: Submitted application application_1494169715431_0004
17/05/07 23:38:46 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1494169715431_0004/
17/05/07 23:38:46 INFO mapreduce.Job: Running job: job_1494169715431_0004
17/05/07 23:39:00 INFO mapreduce.Job: Job job_1494169715431_0004 running in uber mode : false
17/05/07 23:39:00 INFO mapreduce.Job: map 0% reduce 0%
17/05/07 23:39:06 INFO mapreduce.Job: map 100% reduce 0%
17/05/07 23:39:13 INFO mapreduce.Job: map 100% reduce 100%
17/05/07 23:39:14 INFO mapreduce.Job: Job job_1494169715431_0004 completed successfully
17/05/07 23:39:14 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=237535
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=566
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3838
Total time spent by all reduces in occupied slots (ms)=3849
Total time spent by all map tasks (ms)=3838
Total time spent by all reduce tasks (ms)=3849
Total vcore-milliseconds taken by all map tasks=3838
Total vcore-milliseconds taken by all reduce tasks=3849
Total megabyte-milliseconds taken by all map tasks=3930112
Total megabyte-milliseconds taken by all reduce tasks=3941376
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=129
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=143
CPU time spent (ms)=980
Physical memory (bytes) snapshot=306675712
Virtual memory (bytes) snapshot=4157272064
Total committed heap usage (bytes)=165810176
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197


View the job output from the command line:
[spark@Spark02 hadoop-2.7.3]$ hadoop fs -cat hdfs:///user/jonson/output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
[spark@Spark02 hadoop-2.7.3]$ hadoop fs -cat hdfs:///user/jonson/output/part-r-00000
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file

====================================
Passwordless login: principle and method
Background: setting up Hadoop requires passwordless login. "Passwordless login" actually means logging in via key-based authentication — an ssh login using what is called "public/private key" authentication.
On Linux, ssh is the default tool for remote login; its protocol uses RSA/DSA encryption, which makes it a very safe way to manage Linux systems remotely. telnet, being insecure, has fallen out of use on Linux systems.
A simple explanation of "public/private key" authentication: first create a key pair on the client (public key: ~/.ssh/id_rsa.pub; private key: ~/.ssh/id_rsa). Then put the public key on the server (~/.ssh/authorized_keys) and keep the private key to yourself. When logging in, the ssh client uses the private key to prove its identity, and the server checks the proof against its stored copy of the public key; if they match, the login succeeds (the private key itself is never sent).
Method / steps
-
Confirm that SSH is already installed:
rpm -qa | grep openssh
rpm -qa | grep rsync
--> if the packages appear in the output, they are installed
If ssh and rsync are not installed, they can be installed with:
yum install openssh-server  # install the SSH service (the CentOS package is openssh-server, not "ssh")
yum install rsync           # rsync is a remote data-synchronization tool for quickly syncing files between hosts over a LAN/WAN
service sshd restart        # start the service
-
Generate the key pair:
ssh-keygen -t rsa -P ''  # press Enter through the prompts; the generated pair id_rsa and id_rsa.pub is stored under "/home/hadoop/.ssh" by default
-
Append id_rsa.pub to the authorized keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
-
Restrict the permissions of the authorized-keys file:
chmod 600 ~/.ssh/authorized_keys
-
Edit the SSH configuration file:
su -                      # switch to root to edit the config
vim /etc/ssh/sshd_config  # uncomment the three lines that enable public-key login (typically RSAAuthentication yes, PubkeyAuthentication yes, and AuthorizedKeysFile .ssh/authorized_keys)
-
Test the connection:
service sshd restart  # restart the ssh service
exit                  # leave the root shell and return to the normal user
ssh localhost         # test connecting as the normal user
This only configures the SSH service on a single machine; to connect to other servers remotely, keep reading.
-
Now that the key pair is generated and the client's SSH service is configured, send our key (the public key) to the server:
scp ~/.ssh/id_rsa.pub remote_user@remote_server_ip:~/   # copy the public key to the remote server's home directory
e.g.: scp ~/.ssh/id_rsa.pub hadoop@192.168.1.134:~/
Note that this copy still prompts for the server's password; once SSH is fully configured, these steps no longer require one.
-
The previous step sent the public key to the 192.168.1.134 server; now, on that machine, append the public key to the authorized keys. (Note: if SSH has never been run there, the .ssh directory must be created manually, or run ssh-keygen -t rsa to generate a key, which creates .ssh under the user's home directory automatically. Pay special attention to the permissions of the .ssh directory — remember to run chmod 700 .ssh.)
On the 134 machine, run:
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys  # append the public key to the authorized keys
rm ~/id_rsa.pub                             # delete the copied public key, to be safe
Repeat steps 4 and 5 on the 134 machine as well, then:
service sshd restart  # restart the ssh service
-
Back on the client machine, run:
ssh 192.168.1.134  # you should now connect to the server directly, without a password