Summer Vacation, Part 2: Building and Testing a Docker-Based Hadoop Distributed Cluster

I actually started this article back in April. At the time I was taking part in a data mining competition, and on the advice of some gurus from the School of Computer Science we decided to do deep learning with TensorFlow, running on top of our own Hadoop distributed cluster.

We thought we were pretty awesome back then: with no background at all, we were going to build our own big data platform in one month and use an AI framework to crack the problem.

The result was predictable: GG~~~~ (we only got Hadoop stood up... in the end we went back to honestly writing crawlers).

That first attempt used VMware virtual machines, which effectively meant running 17 copies of CentOS 7 on 17 machines. This time we use Docker to package the environment.

Part 1: Technical Architecture

Docker 1.12.6

CentOS 7

JDK 1.8.0_121

Hadoop 2.7.3: distributed computing framework

ZooKeeper 3.4.9: distributed application coordination service

HBase 1.2.4: distributed storage database

Spark 2.0.2: distributed big data computing engine

Python 2.7.13

TensorFlow 1.0.1: machine learning framework

Part 2: Setting Up the Environment and Building the Image

1. Pull the base image: docker pull centos

2. Start a container: docker run -it -d --name hadoop centos

3. Enter the container: docker exec -it hadoop /bin/bash

4. Install Java (these big data tools need a JDK, and some of the components are themselves written in Java). I install it under /usr.
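
A minimal sketch of getting the JDK into place (this assumes the Oracle JDK 8u121 tarball, jdk-8u121-linux-x64.tar.gz, has already been copied into the container):

mkdir -p /usr/java
tar -zxvf jdk-8u121-linux-x64.tar.gz -C /usr/java
# this produces /usr/java/jdk1.8.0_121, which the profile entries below point at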

Configure the environment variables in /etc/profile:

#config java
export JAVA_HOME=/usr/java/jdk1.8.0_121
export JRE_HOME=/usr/java/jdk1.8.0_121/jre
export CLASSPATH=$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
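
Reload the profile and check that the JDK is picked up:

source /etc/profile
java -version   # should report 1.8.0_121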

5. Install Hadoop (http://hadoop.apache.org/releases.html). I install it under /usr/local/.
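
Same pattern as the JDK (a sketch, assuming the hadoop-2.7.3 binary tarball is already in the container):

tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
mv /usr/local/hadoop-2.7.3 /usr/local/hadoop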

Configure the environment variables in /etc/profile:

#config hadoop
export HADOOP_HOME=/usr/local/hadoop/
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$PATH:$HADOOP_HOME/sbin
# where hadoop writes its log files
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

Run source /etc/profile to make the environment variables take effect.

Edit the configuration files under /usr/local/hadoop/etc/hadoop/:

(1) slaves (add the DataNode nodes)

Slave1
Slave2

(2) core-site.xml

<configuration>
      <property>
          <name>hadoop.tmp.dir</name>
          <value>file:/usr/local/hadoop/tmp</value>
          <description>Abase for other temporary directories.</description>
      </property>
      <property>
          <name>fs.defaultFS</name>
          <value>hdfs://Master:9000</value>
      </property>
</configuration>

(3) hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>Master:9001</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

(4) Create mapred-site.xml (copy it from mapred-site.xml.template)

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>Master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>Master:19888</value>
    </property>
</configuration>

(5) yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>Master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>Master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>Master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>Master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>Master:8088</value>
    </property>
</configuration>

6. Install ZooKeeper (https://zookeeper.apache.org/). I install it under /usr/local/.

Configure the environment variables in /etc/profile:

#config zookeeper
export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/conf

(1) /usr/local/zookeeper/conf/zoo.cfg

initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.
dataDir=/usr/local/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
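
The zoo.cfg above covers a single node. For an ensemble spanning Master, Slave1 and Slave2 (the hostnames used later in this post), you would normally also add server entries and give each node its own myid file under dataDir; a sketch:

cat >> /usr/local/zookeeper/conf/zoo.cfg <<'EOF'
server.1=Master:2888:3888
server.2=Slave1:2888:3888
server.3=Slave2:2888:3888
EOF
mkdir -p /usr/local/zookeeper/data
echo 1 > /usr/local/zookeeper/data/myid   # write 2 on Slave1 and 3 on Slave2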

7. Install HBase (http://hbase.apache.org/). I install it under /usr/local/.

(1) /usr/local/hbase/conf/hbase-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_121
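# false = use the external ZooKeeper installed in step 6 instead of the one bundled with HBase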
export HBASE_MANAGES_ZK=false

(2) hbase-site.xml

<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://Master:9000/hbase</value>
        </property>

        <property>
                <name>hbase.zookeeper.property.clientPort</name>
                <value>2181</value>
        </property>
        <property>
                <name>zookeeper.session.timeout</name>
                <value>120000</value>
        </property>
        <property>
                <name>hbase.zookeeper.quorum</name>
                <value>Master,Slave1,Slave2</value>
        </property>
        <property>
                <name>hbase.tmp.dir</name>
                <value>/usr/local/hbase/data</value>
        </property>
        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>
</configuration>

(3) core-site.xml

<configuration>
      <property>
          <name>hadoop.tmp.dir</name>
          <value>file:/usr/local/hadoop/tmp</value>
          <description>Abase for other temporary directories.</description>
      </property>
      <property>
          <name>fs.defaultFS</name>
          <value>hdfs://Master:9000</value>
      </property>
</configuration>

(4) hdfs-site.xml

<configuration>

    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

(5) regionservers (these are my three nodes)

Master #namenode
Slave1 #datanode01
Slave2 #datanode02

8. Install Spark (http://spark.apache.org/). I install it under /usr/local/.

Configure the environment variables:

#config spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

(1) cp ./conf/slaves.template ./conf/slaves

Add the worker nodes to slaves:

Slave1
Slave2

(2) spark-env.sh

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=10.211.1.129
export JAVA_HOME=/usr/java/jdk1.8.0_121

9. If you want to train models with TensorFlow: pip install tensorflow
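
A quick way to confirm the install worked (assuming Python 2.7 is already available in the container):

python -c "import tensorflow as tf; print(tf.__version__)"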

And with that, our NameNode (Master) node is fully configured.

10. Run exit to leave the container.

Create an image from the container: docker commit edcabfcd69ff vitoyan/hadoop (edcabfcd69ff here is the hadoop container's ID)

Publish it: docker push vitoyan/hadoop

Then go and take a look at it on Docker Hub.

Part 3: Testing

To make it fully distributed, you still need to add more nodes (more containers or hosts).

One NameNode controls multiple DataNodes.
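
One way to spin up the extra nodes is to launch more containers from the image committed above; a sketch, with names chosen to match the /etc/hosts table below:

docker run -itd --name Slave1 --hostname Slave1 vitoyan/hadoop
docker run -itd --name Slave2 --hostname Slave2 vitoyan/hadoop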

1. Install SSH and networking tools: yum install openssh-server net-tools openssh-clients -y

2. Generate a key pair: ssh-keygen -t rsa

3. Append the public key to the remote host (container): ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.211.1.129 (this lets the two containers reach each other without a password, which is a prerequisite for a Hadoop cluster)

4. On the host machine, look up a container's IP: docker exec hadoop hostname -i (then exchange public keys between all the containers in the same way)

5. Change each container's hostname to Master, Slave1, Slave2, Slave3, Slave4, Slave5, ... so the containers can be told apart

6. Add entries for every node to /etc/hosts in each container:

10.211.1.129 Master
10.211.1.130 Slave1
10.211.1.131 Slave2
10.102.25.3  Slave3
10.102.25.4  Slave4
10.102.25.5  Slave5
10.102.25.6  Slave6
10.102.25.7  Slave7
10.102.25.8  Slave8
10.102.25.9  Slave9
10.102.25.10 Slave10
10.102.25.11 Slave11
10.102.25.12 Slave12
10.102.25.13 Slave13
10.102.25.14 Slave14
10.102.25.15 Slave15
10.102.25.16 Slave16

7. The Hadoop configuration for the Slaves can simply be copied from the Master and then adjusted to the corresponding hostname, as sketched below.
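
For example, something like this pushes the Master's configuration to a slave (a sketch; paths as used above):

scp -r /usr/local/hadoop/etc/hadoop root@Slave1:/usr/local/hadoop/etc/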

8. Basic commands

(1) Start the Hadoop distributed cluster (the namenode -format step below is only needed on the very first start)

cd /usr/local/hadoop

hdfs namenode -format

sbin/start-all.sh

Check whether everything started successfully: jps
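
A couple of extra sanity checks (assuming the daemons came up cleanly):

# on the Master, jps normally lists NameNode, SecondaryNameNode and ResourceManager;
# on the Slaves it lists DataNode and NodeManager
jps
# ask the NameNode which DataNodes have registered
hdfs dfsadmin -report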

(2) Start the ZooKeeper distributed coordination service

cd /usr/local/zookeeper/bin

./zkServer.sh start

Check whether it started successfully: zkServer.sh status (start ZooKeeper on every node in the ensemble; one node should report itself as leader and the others as follower)

(3) Start the HBase distributed database

cd /usr/local/hbase/bin/

./start-hbase.sh
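
To check that HBase is up, you can pipe a command into the HBase shell (a sketch):

echo "status" | hbase shell
# the HMaster web UI is normally at http://Master:16010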

(4) Start the Spark big data computing engine cluster

cd /usr/local/spark/

sbin/start-master.sh

sbin/start-slaves.sh
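
To verify that the standalone cluster accepts work, one option is to submit the bundled SparkPi example (a sketch; assumes the default standalone master port 7077):

/usr/local/spark/bin/spark-submit \
  --master spark://Master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_*.jar 10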

Cluster management UI: http://master:8080

Cluster benchmarking: http://blog.itpub.net/8183550/viewspace-684152/

My Hadoop image: https://hub.docker.com/r/vitoyan/hadoop/

Feel free to pull it.

over!!!!!
