Deploying a Hadoop 2.7.3 distributed cluster on a single machine with Docker 17.03.1

Notice

These articles are my own technical notes; please credit the source when reposting:
[1] https://segmentfault.com/u/yzwall
[2] blog.csdn.net/j_dark/node

0 Docker and Hadoop versions

  • PC: Ubuntu 16.04.1 LTS

  • Docker version: 17.03.1-ce, OS/Arch: linux/amd64

  • Hadoop version: hadoop-2.7.3

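1 Configuring and building the Hadoop image in Docker

1.1 Create the Docker container

Create a container based on the ubuntu image; by default Docker downloads the latest minimal official ubuntu image. The container is named container here;
sudo docker run -ti --name container ubuntu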

1.2 Modify /etc/apt/sources.list

Edit the default sources file /etc/apt/sources.list, replacing the official sources with a mirror inside China;
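
The edit can also be scripted rather than done by hand. The sketch below is a minimal example, assuming the stock sources.list points at archive.ubuntu.com and security.ubuntu.com; the helper name switch_apt_mirror and the mirror host mirrors.aliyun.com are illustrative assumptions, not part of the original notes.

```shell
#!/bin/sh
# switch_apt_mirror FILE MIRROR_HOST (hypothetical helper)
# Rewrites the official Ubuntu archive hosts in FILE to MIRROR_HOST,
# keeping a backup copy in FILE.bak.
switch_apt_mirror() {
    file=$1
    mirror=$2
    cp "$file" "$file.bak" || return 1
    # Replace both the main archive host and the security host.
    sed -i -e "s|archive\.ubuntu\.com|$mirror|g" \
           -e "s|security\.ubuntu\.com|$mirror|g" "$file"
}

# Example (inside the container, as root; the mirror host is an assumption):
#   switch_apt_mirror /etc/apt/sources.list mirrors.aliyun.com && apt-get update
```

Keeping the .bak copy makes it easy to return to the official sources if the chosen mirror turns out to be incomplete.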

1.3 Install Java 8

# To keep the image small, the Docker image leaves out many packages that ship with a regular Ubuntu install; refresh the package index with `apt-get update` before installing them
apt-get update
apt-get install software-properties-common python-software-properties # provides add-apt-repository
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer
java -version

1.4 Install hadoop-2.7.3 in Docker

1.4.1 Download the hadoop-2.7.3 release

# Create the nested directories
mkdir -p /software/apache/hadoop
cd /software/apache/hadoop
# Download and extract Hadoop
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xvzf hadoop-2.7.3.tar.gz

1.4.2 Configure environment variables

Edit ~/.bashrc and append the following settings to the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/software/apache/hadoop/hadoop-2.7.3
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Run source ~/.bashrc to make the environment variables take effect;
Note: once ~/.bashrc is configured, hadoop-env.sh needs no further configuration;
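
If the setup is repeated, appending the same exports again clutters ~/.bashrc. A minimal sketch of an idempotent append follows; the helper name append_once is hypothetical and only illustrates the idea.

```shell
#!/bin/sh
# append_once FILE LINE (hypothetical helper)
# Appends LINE to FILE only if an identical line is not already present,
# so re-running the setup does not duplicate the exports.
append_once() {
    file=$1
    line=$2
    touch "$file"
    # -x matches whole lines, -F treats LINE as a fixed string, -q stays silent.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   append_once ~/.bashrc 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle'
#   append_once ~/.bashrc 'export HADOOP_HOME=/software/apache/hadoop/hadoop-2.7.3'
```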

1.5 Configure Hadoop

Configuring Hadoop mainly means editing four files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml;

Create the namenode, datanode, and tmp directories under $HADOOP_HOME

cd $HADOOP_HOME
mkdir tmp
mkdir namenode
mkdir datanode

1.5.1 Configure core-site.xml

  • The hadoop.tmp.dir property points to the tmp directory

  • The fs.default.name property points to the master node and is set to hdfs://master:9000

<configuration>
    <property>
        <!-- hadoop temp dir  -->
        <name>hadoop.tmp.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <!-- Size of read/write buffer used in SequenceFiles. -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
        <final>true</final>
        <description>The name of the default file system.</description>
    </property>
</configuration>

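1.5.2 Configure hdfs-site.xml

  • dfs.replication is the number of replicas kept for each block; the cluster has 1 NameNode and 3 DataNodes, so the replication factor is set to 3;

  • dfs.namenode.name.dir and dfs.datanode.data.dir are set to the NameNode and DataNode directories created earlier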

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>3</value>
        <final>true</final>
        <description>Default block replication.</description>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/namenode</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/datanode</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

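1.5.3 Configure mapred-site.xml

Under $HADOOP_CONFIG_HOME, create mapred-site.xml from the template with cp

cd $HADOOP_CONFIG_HOME
cp mapred-site.xml.template mapred-site.xml

Hadoop 1.x setups point the mapred.job.tracker property in mapred-site.xml at the master node; in Hadoop 2.x that property is replaced by mapreduce.framework.name:

In Hadoop 2.x.x, users do not need to configure mapred.job.tracker, because the JobTracker no longer exists and its role is filled by the MRAppMaster component; instead, mapreduce.framework.name must specify the name of the runtime framework, here yarn

Source: 《Hadoop技術內幕:深入解析YARN架構設計與實現原理》 (Hadoop Internals: In-Depth Analysis of YARN Architecture Design and Implementation Principles)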

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>

1.5.4 Configure yarn-site.xml

<configuration>
    <property>  
        <name>yarn.nodemanager.aux-services</name>  
        <value>mapreduce_shuffle</value>  
    </property>  
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.address</name>  
        <value>master:8032</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.scheduler.address</name>  
        <value>master:8030</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.resource-tracker.address</name>  
        <value>master:8031</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.admin.address</name>  
        <value>master:8033</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.webapp.address</name>  
        <value>master:8088</value>  
    </property>  
</configuration>

1.5.5 Install vim, ifconfig, and ping

Install vim and the packages that provide the ifconfig and ping commands

apt-get update
apt-get install vim
apt-get install net-tools       # for ifconfig 
apt-get install inetutils-ping  # for ping

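1.5.6 Build the Hadoop base image

Assuming the current container is named container, save it as the base image ubuntu:hadoop; every later Hadoop cluster container is created and started from this image, so nothing has to be configured again;
sudo docker commit -m "hadoop installed" container ubuntu:hadoop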

2. Building the Hadoop distributed cluster

2.1 Create the cluster containers from the Hadoop base image

Create the master container and the slave1 to slave3 containers from the base image ubuntu:hadoop; each container's hostname matches its container name;
Create master: docker run -ti -h master --name master ubuntu:hadoop /bin/bash
Create slave1: docker run -ti -h slave1 --name slave1 ubuntu:hadoop /bin/bash
Create slave2: docker run -ti -h slave2 --name slave2 ubuntu:hadoop /bin/bash
Create slave3: docker run -ti -h slave3 --name slave3 ubuntu:hadoop /bin/bash
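
The four commands differ only in the node name, so they can be wrapped in a loop. This is a minimal sketch and not the exact procedure above: it starts the containers detached (-d) instead of opening an interactive shell in each. The function name start_cluster and the DOCKER override variable are hypothetical.

```shell
#!/bin/sh
# start_cluster NAME... (hypothetical helper)
# Starts one detached container per node name from the ubuntu:hadoop image,
# using the name as both hostname (-h) and container name (--name).
# DOCKER is a hypothetical override (default: docker) so the loop can be
# dry-run with DOCKER=echo before touching the Docker daemon.
start_cluster() {
    for node in "$@"; do
        ${DOCKER:-docker} run -dti -h "$node" --name "$node" ubuntu:hadoop /bin/bash || return 1
    done
}

# Example:
#   DOCKER=echo start_cluster master slave1 slave2 slave3   # print the commands only
#   start_cluster master slave1 slave2 slave3               # actually start them
#   docker attach master                                    # then open the shell of a node
```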

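2.2 Configure the hosts file in each container

Add the following entries to /etc/hosts in every container; each container's IP address can be checked with ifconfig: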

172.17.0.2 master
172.17.0.3 slave1
172.17.0.4 slave2
172.17.0.5 slave3

Note: after a Docker container restarts, the changes to hosts may be lost. For lack of experience, for now I can only avoid restarting the containers often; otherwise the hosts file has to be configured again by hand;

See http://dockone.io/question/400

1. /etc/hosts, /etc/resolv.conf, and /etc/hostname: these three files in a container are not part of the image. They live under /var/lib/docker/containers/<container_id> and are mounted into the container when it starts. So changes made to them inside the container are not stored in the container's top layer; they are written directly to those three physical files.
2. Why are the changes gone after a restart? Because Docker rebuilds /etc/hosts each time it starts a container. And why does it do that? Because a restarted container's IP address may change, which makes the old address in the hosts file invalid, so the file has to be regenerated; otherwise it would hold stale data.
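
Since the entries may have to be restored after a restart, a small script saves retyping them. The sketch below is one possible approach, assuming the usual /etc/hosts layout of IP address first and hostname second; the helper name set_host_entry is hypothetical. It writes the file back in place because /etc/hosts is a mounted file inside the container, and tools that replace the file by renaming (such as sed -i) can fail on it.

```shell
#!/bin/sh
# set_host_entry FILE IP NAME (hypothetical helper)
# Ensures FILE contains exactly one "IP NAME" mapping for NAME:
# any existing line whose hostname column equals NAME is dropped first.
set_host_entry() {
    hosts=$1
    ip=$2
    name=$3
    scratch=$(mktemp) || return 1
    # Keep every line whose second column is not NAME.
    awk -v n="$name" '$2 != n' "$hosts" > "$scratch"
    printf '%s %s\n' "$ip" "$name" >> "$scratch"
    # Overwrite the content instead of moving the file over the mount point.
    cat "$scratch" > "$hosts"
    rm -f "$scratch"
}

# Example (run in every container, with the addresses reported by ifconfig):
#   set_host_entry /etc/hosts 172.17.0.2 master
#   set_host_entry /etc/hosts 172.17.0.3 slave1
```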

2.3 SSH configuration for the cluster nodes

2.3.1 All nodes: install ssh

apt-get update
apt-get install ssh
apt-get install openssh-server

2.3.2 All nodes: generate a key pair

# Generate a key with an empty passphrase; the key files are placed under ~/.ssh
ssh-keygen -t rsa -P ""

2.3.3 master node: create the authorized_keys file

Append the generated public key to authorized_keys

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

2.3.4 All nodes: modify the sshd_config file

Modify sshd_config so that ssh can log in as root on the other nodes

vim /etc/ssh/sshd_config
# Change PermitRootLogin prohibit-password to PermitRootLogin yes
# Restart the ssh service
service ssh restart
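
Because this change is needed on every node, it may be easier to apply it without opening an editor. A minimal sketch follows, assuming GNU sed as shipped with Ubuntu; the helper name permit_root_login is hypothetical.

```shell
#!/bin/sh
# permit_root_login FILE (hypothetical helper)
# Sets "PermitRootLogin yes" in an sshd_config-style FILE, whether the
# existing directive is commented out or set to another value;
# appends the directive if the file has none.
permit_root_login() {
    conf=$1
    if grep -qE '^[[:space:]]*#?[[:space:]]*PermitRootLogin[[:space:]]' "$conf"; then
        sed -i -E 's/^[[:space:]]*#?[[:space:]]*PermitRootLogin[[:space:]].*/PermitRootLogin yes/' "$conf"
    else
        printf 'PermitRootLogin yes\n' >> "$conf"
    fi
}

# Example:
#   permit_root_login /etc/ssh/sshd_config && service ssh restart
```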

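2.3.5 master node: copy authorized_keys to the slave nodes with scp

Copy authorized_keys from the master node into ~/.ssh on each slave node, overwriting any file of the same name. Every node then holds the same authorized_keys, so the master node can log in to every node over ssh without a password;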

cd ~/.ssh
scp authorized_keys root@slave1:~/.ssh/
scp authorized_keys root@slave2:~/.ssh/
scp authorized_keys root@slave3:~/.ssh/

2.3.6 slave nodes: set the permissions on authorized_keys so that it takes effect

chmod 600 ~/.ssh/authorized_keys

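Notes

  • Check whether the ssh service is running: ps -e | grep ssh

  • Start the ssh service: service ssh start

  • Restart the ssh service: service ssh restart

After completing steps 2.3.1 to 2.3.6, the containers can be reached over ssh;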

2.4 master node configuration

On the master node, edit the slaves file to list the slave nodes

cd $HADOOP_CONFIG_HOME/
vim slaves

Replace its contents with:

slave1
slave2
slave3

2.5 Start the Hadoop cluster

On the master node:

  • Run hdfs namenode -format; a message like the following means the NameNode was formatted successfully:

common.Storage: Storage directory /software/apache/hadoop/hadoop-2.7.3/namenode has been successfully formatted.
  • Run start-all.sh to start the cluster:

root@master:/# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
The authenticity of host 'master (172.17.0.2)' can't be established.
ECDSA key fingerprint is SHA256:OewrSOYpvfDE6ixf6Gw9U7I9URT2zDCCtDJ6tjuZz/4.
Are you sure you want to continue connecting (yes/no)? yes
master: Warning: Permanently added 'master,172.17.0.2' (ECDSA) to the list of known hosts.
master: starting namenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-namenode-master.out
slave3: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave3.out
slave2: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave2.out
slave1: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave1.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-resourcemanager-master.out
slave3: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave3.out
slave1: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave1.out
slave2: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave2.out

Run jps on the master node and on each slave node

  • master:

root@master:/# jps
2065 Jps
1446 NameNode
1801 ResourceManager
1641 SecondaryNameNode
  • slave1:

1107 NodeManager
1220 Jps
1000 DataNode
  • slave2:

241 DataNode
475 Jps
348 NodeManager
  • slave3:

500 Jps
388 NodeManager
281 DataNode
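
Reading four jps listings by eye gets tedious, so the check can be automated. The sketch below assumes jps prints one "<pid> <name>" pair per line, as in the listings above; the helper name check_daemons is hypothetical.

```shell
#!/bin/sh
# check_daemons EXPECTED... (hypothetical helper)
# Reads jps output on stdin and prints "missing: NAME" for every expected
# daemon that does not appear. Returns 0 only when all of them are present.
check_daemons() {
    listing=$(cat)
    missing=0
    for daemon in "$@"; do
        # Compare the name column exactly, so NameNode does not match
        # SecondaryNameNode.
        if ! printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   jps | check_daemons NameNode SecondaryNameNode ResourceManager   # on master
#   jps | check_daemons DataNode NodeManager                         # on each slave
```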

3. Run wordcount

Create the input directory /hadoopinput in HDFS and store the input file LICENSE.txt in it:

root@master:/# hdfs dfs -mkdir -p /hadoopinput
root@master:/# hdfs dfs -put LICENSE.txt /hadoopinput

Go to $HADOOP_HOME/share/hadoop/mapreduce and submit the wordcount job to the cluster, saving the results to the /hadoopoutput directory in HDFS:

root@master:/# cd $HADOOP_HOME/share/hadoop/mapreduce
root@master:/software/apache/hadoop/hadoop-2.7.3/share/hadoop/mapreduce# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /hadoopinput /hadoopoutput
17/05/26 01:21:34 INFO client.RMProxy: Connecting to ResourceManager at master/172.17.0.2:8032
17/05/26 01:21:35 INFO input.FileInputFormat: Total input paths to process : 1
17/05/26 01:21:35 INFO mapreduce.JobSubmitter: number of splits:1
17/05/26 01:21:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1495722519742_0001
17/05/26 01:21:36 INFO impl.YarnClientImpl: Submitted application application_1495722519742_0001
17/05/26 01:21:36 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1495722519742_0001/
17/05/26 01:21:36 INFO mapreduce.Job: Running job: job_1495722519742_0001
17/05/26 01:21:43 INFO mapreduce.Job: Job job_1495722519742_0001 running in uber mode : false
17/05/26 01:21:43 INFO mapreduce.Job:  map 0% reduce 0%
17/05/26 01:21:48 INFO mapreduce.Job:  map 100% reduce 0%
17/05/26 01:21:54 INFO mapreduce.Job:  map 100% reduce 100%
17/05/26 01:21:55 INFO mapreduce.Job: Job job_1495722519742_0001 completed successfully
17/05/26 01:21:55 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=29366
        FILE: Number of bytes written=295977
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=84961
        HDFS: Number of bytes written=22002
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2922
        Total time spent by all reduces in occupied slots (ms)=3148
        Total time spent by all map tasks (ms)=2922
        Total time spent by all reduce tasks (ms)=3148
        Total vcore-milliseconds taken by all map tasks=2922
        Total vcore-milliseconds taken by all reduce tasks=3148
        Total megabyte-milliseconds taken by all map tasks=2992128
        Total megabyte-milliseconds taken by all reduce tasks=3223552
    Map-Reduce Framework
        Map input records=1562
        Map output records=12371
        Map output bytes=132735
        Map output materialized bytes=29366
        Input split bytes=107
        Combine input records=12371
        Combine output records=1906
        Reduce input groups=1906
        Reduce shuffle bytes=29366
        Reduce input records=1906
        Reduce output records=1906
        Spilled Records=3812
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=78
        CPU time spent (ms)=1620
        Physical memory (bytes) snapshot=451264512
        Virtual memory (bytes) snapshot=3915927552
        Total committed heap usage (bytes)=348127232
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=84854
    File Output Format Counters 
        Bytes Written=22002

The results are saved in /hadoopoutput/part-r-00000; view them:

root@master:/# hdfs dfs -ls /hadoopoutput
Found 2 items
-rw-r--r--   3 root supergroup          0 2017-05-26 01:21 /hadoopoutput/_SUCCESS
-rw-r--r--   3 root supergroup      22002 2017-05-26 01:21 /hadoopoutput/part-r-00000

root@master:/# hdfs dfs -cat /hadoopoutput/part-r-00000
""AS    2
"AS    16
"COPYRIGHTS    1
"Contribution"    2
"Contributor"    2
"Derivative    1
"Legal    1
"License"    1
"License");    1
"Licensed    1
"Licensor"    1
...
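
The raw output is sorted by word, which puts punctuation-heavy tokens first. To see the most frequent words instead, the output can be piped through sort. This is a minimal sketch, assuming the job wrote its results with the default tab separator between word and count; the helper name top_words is hypothetical.

```shell
#!/bin/sh
# top_words N (hypothetical helper)
# Reads wordcount output ("word<TAB>count" per line) on stdin and prints
# the N lines with the highest counts, largest first.
top_words() {
    tab=$(printf '\t')
    # -t sets the field separator, -k2,2nr sorts numerically on the count
    # column in descending order.
    sort -t "$tab" -k2,2nr | head -n "$1"
}

# Example:
#   hdfs dfs -cat /hadoopoutput/part-r-00000 | top_words 10
```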

With that, the Hadoop 2.7.3 cluster is up and running on a single machine with Docker 17.03.1!

References

[1] http://tashan10.com/yong-dockerda-jian-hadoopwei-fen-bu-shi-ji-qun/
[2] http://blog.csdn.net/xiaoxiangzi222/article/details/52757168
