I started this article back in April. At the time I was competing in a data mining contest and, on the advice of some gurus from the School of Computer Science, we decided to do deep learning with TensorFlow, running it on top of our own Hadoop distributed cluster.
We were feeling pretty invincible: with no prior experience, we set out to build our own big data platform in one month and solve the problem with an AI framework.
The result was predictable: GG~~~~ (we only got Hadoop stood up... in the end we went back to dutifully writing crawlers).
Back then we built everything in VM virtual machines, which amounted to running 17 copies of CentOS 7 on 17 machines; this time we package the environment with Docker.
1. Technology Stack
Docker 1.12.6
CentOS 7
JDK 1.8.0_121
Hadoop 2.7.3: distributed computing framework
ZooKeeper 3.4.9: distributed coordination service for applications
HBase 1.2.4: distributed storage database
Spark 2.0.2: distributed big data computing engine
Python 2.7.13
TensorFlow 1.0.1: machine learning framework
2. Building the Environment and Creating the Image
1. Pull the image: docker pull centos
2. Start a container: docker run -it -d --name hadoop centos
3. Enter the container: docker exec -it hadoop /bin/bash
4. Install Java (these big data tools need a JDK, and some of the components are themselves written in Java). I install it under /usr.
Configure the environment variables in /etc/profile:
#config java
export JAVA_HOME=/usr/java/jdk1.8.0_121
export JRE_HOME=/usr/java/jdk1.8.0_121/jre
export CLASSPATH=$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
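A quick sanity check that the JDK is wired up (a minimal sketch, assuming the JDK was unpacked to /usr/java/jdk1.8.0_121):
source /etc/profile     # reload the environment variables
echo $JAVA_HOME         # should print /usr/java/jdk1.8.0_121
java -version           # should report version "1.8.0_121"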
5. Install Hadoop (http://hadoop.apache.org/releases.html). I install it under /usr/local/.
Configure the environment variables in /etc/profile:
#config hadoop
export HADOOP_HOME=/usr/local/hadoop/
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$PATH:$HADOOP_HOME/sbin
# directory for the Hadoop log files
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
Run source /etc/profile to make the environment variables take effect.
Edit the configuration files under /usr/local/hadoop/etc/hadoop/:
(1) slaves (add the DataNode nodes)
Slave1
Slave2
(2) core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Abase for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
  </property>
</configuration>
(3) hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
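Hadoop will normally create the tmp, name and data directories referenced above when you format and start it, but creating them up front avoids permission surprises; a minimal sketch:
mkdir -p /usr/local/hadoop/tmp
mkdir -p /usr/local/hadoop/dfs/name /usr/local/hadoop/dfs/data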
(4) Create mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>Master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>Master:19888</value>
  </property>
</configuration>
(5) yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>Master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>Master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>Master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>Master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>Master:8088</value>
  </property>
</configuration>
6. Install ZooKeeper (https://zookeeper.apache.org/). I install it under /usr/local/.
Configure the environment variables in /etc/profile:
#config zookeeper
export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/conf
(1) /usr/local/zookeeper/conf/zoo.cfg
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/usr/local/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
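The zoo.cfg above is essentially the standalone sample config. Since hbase.zookeeper.quorum below lists Master, Slave1 and Slave2, a replicated ensemble additionally needs server entries and a per-node myid file; a sketch of what I would add (hostnames as in the /etc/hosts mapping from section 3, and the stock zoo_sample.cfg also sets tickTime=2000):
# appended to zoo.cfg on every node
server.1=Master:2888:3888
server.2=Slave1:2888:3888
server.3=Slave2:2888:3888
# each node writes its own id into dataDir
mkdir -p /usr/local/zookeeper/data
echo 1 > /usr/local/zookeeper/data/myid    # 1 on Master, 2 on Slave1, 3 on Slave2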
7. Install HBase (http://hbase.apache.org/). I install it under /usr/local/.
(1) /usr/local/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_121
export HBASE_MANAGES_ZK=false
(2) hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://Master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>Master,Slave1,Slave2</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/usr/local/hbase/data</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
(3) core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Abase for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
  </property>
</configuration>
(4) hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
(5) regionservers (these are my three nodes)
Master #namenode
Slave1 #datanode01
Slave2 #datanode02
8. Install Spark (http://spark.apache.org/). I install it under /usr/local/.
Configure the environment variables:
#config spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
(1) cp ./conf/slaves.template ./conf/slaves
Add the worker nodes to slaves:
Slave1
Slave2
(2) spark-env.sh
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=10.211.1.129
export JAVA_HOME=/usr/java/jdk1.8.0_121
9. If you want to train models with TensorFlow: pip install tensorflow
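A one-line check that the install worked (assuming pip points at the Python 2.7 above):
python -c "import tensorflow as tf; print(tf.__version__)"    # prints the installed version, e.g. 1.0.1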
At this point our NameNode (Master) node is fully configured.
10. Exit the container with exit.
Create the image: docker commit edcabfcd69ff vitoyan/hadoop
Publish it: docker push
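In full, publishing looks roughly like this (a sketch; docker login asks for your Docker Hub credentials, and the 1.0 tag is just an example):
docker login                                   # authenticate against Docker Hub
docker tag vitoyan/hadoop vitoyan/hadoop:1.0   # optional: add a version tag
docker push vitoyan/hadoop                     # push the vitoyan/hadoop repository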
Take a look at it on Docker Hub.
3. Testing
For a fully distributed setup you still need to add more nodes (more containers or hosts),
with one NameNode controlling multiple DataNodes.
1. Install SSH and networking tools: yum install openssh-server net-tools openssh-clients -y
2. Generate a key pair: ssh-keygen -t rsa
3. Append the public key to the remote host (container): ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.211.1.129 (this lets the two containers reach each other without a password, which is a prerequisite for the Hadoop cluster)
4. On the host, look up the hadoop container's IP: docker exec hadoop hostname -i (then have the containers add each other's public keys in the same way; see the sketch below)
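Distributing the keys by hand gets tedious with many containers; a small sketch of the loop I would run from the Master container (assuming the /etc/hosts entries from step 6 are already in place, otherwise use the container IPs, and that sshd is running everywhere):
for host in Slave1 Slave2; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub root@$host    # prompts once for each root password
done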
5. Change the hostnames to Master, Slave1, Slave2, Slave3, ... so the containers can be told apart.
6. Add the following to /etc/hosts in every container:
10.211.1.129 Master
10.211.1.130 Slave1
10.211.1.131 Slave2
10.102.25.3 Slave3
10.102.25.4 Slave4
10.102.25.5 Slave5
10.102.25.6 Slave6
10.102.25.7 Slave7
10.102.25.8 Slave8
10.102.25.9 Slave9
10.102.25.10 Slave10
10.102.25.11 Slave11
10.102.25.12 Slave12
10.102.25.13 Slave13
10.102.25.14 Slave14
10.102.25.15 Slave15
10.102.25.16 Slave16
7. For the Slaves, the Hadoop configuration just needs to be copied over and adjusted to the corresponding hostname (see the sketch below).
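Once passwordless SSH works, pushing the configuration out to the Slaves is a one-liner; a sketch, assuming Hadoop lives in /usr/local/hadoop on every node:
for host in Slave1 Slave2; do
    scp -r /usr/local/hadoop/etc/hadoop/* root@$host:/usr/local/hadoop/etc/hadoop/
done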
8. Basic commands:
(1) Start the Hadoop distributed cluster
cd /usr/local/hadoop
hdfs namenode -format
sbin/start-all.sh
Check that the daemons started: jps
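Besides jps, a quick way to confirm HDFS is actually usable (a sketch; the paths are arbitrary):
hdfs dfsadmin -report              # should list the live DataNodes
hdfs dfs -mkdir -p /test
hdfs dfs -put /etc/hosts /test/    # upload any small local file
hdfs dfs -ls /test                 # the file should appear with replication 2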
(2) Start the ZooKeeper distributed coordination service
cd /usr/local/zookeeper/bin
./zkServer.sh start
Check that it started: zkServer.sh status
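You can also poke the ensemble with the bundled CLI (a sketch, run from /usr/local/zookeeper/bin):
./zkCli.sh -server Master:2181 ls /    # a fresh ensemble should list [zookeeper]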
(3) Start the HBase distributed database
cd /usr/local/hbase/bin/
./start-hbase.sh
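To verify that HBase came up on top of HDFS and ZooKeeper (a sketch, run from the same directory):
echo "status" | ./hbase shell      # should report the active master and the region servers
jps                                # Master should now also show HMaster (and HRegionServer)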
(4) Start the Spark big data computing engine cluster
cd /usr/local/spark/
sbin/start-master.sh
sbin/start-slaves.sh
Cluster management UI: http://master:8080
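To check that the standalone cluster accepts jobs, a SparkPi submission sketch (the examples jar name below is what the Spark 2.0.2 binary distribution ships; adjust it to your build):
cd /usr/local/spark
./bin/spark-submit --master spark://Master:7077 \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.0.2.jar 100    # should print "Pi is roughly 3.14..."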
Cluster benchmarking: http://blog.itpub.net/8183550/viewspace-684152/
My Hadoop image: https://hub.docker.com/r/vitoyan/hadoop/
Feel free to pull it.
over!!!!!