1. Base Docker image: ubuntu; the latest version at the time of writing is 18.04.
2. Software packages to prepare:
(1) spark-2.3.0-bin-hadoop2.7.tgz
(2) hadoop-2.7.3.tar.gz
(3) apache-hive-2.3.2-bin.tar.gz
(4) jdk-8u101-linux-x64.tar.gz
(5) mysql-5.5.45-linux2.6-x86_64.tar.gz、mysql-connector-java-5.1.37-bin.jar
(6) scala-2.11.8.tgz
(7) zeppelin-0.8.0-bin-all.tgz
1. Pull the official image:
docker pull ubuntu
2. The apt sources in the official image point to servers outside China, which makes installing additional packages later slow, so switch to a domestic mirror first.
(1) Start an ubuntu container and enter its apt configuration directory
docker run -itd --name ubuntu ubuntu
docker exec -it ubuntu /bin/bash
cd /etc/apt
(2) Back up the original sources file first:
mv sources.list sources.list.bak
(3) Replace it with a domestic mirror; the Aliyun sources are used here. Since the official ubuntu image does not ship with vi or other editors, write the entries with echo. Note that the entries must match the system release (bionic for 18.04).
echo deb http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse >> sources.list
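After writing the entries, refresh the package index to confirm the mirror resolves (a quick sanity check, assuming the container has outbound network access):
apt-get update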
3. Exit the container and commit the image
exit
docker commit <container id> ubuntu:latest
The resulting ubuntu image can then be used as the base image.
The packages prepared earlier are added to the image with ADD instructions at build time. Before building, prepare the necessary configuration files; they all come from the conf directory of the corresponding package. Spark configuration files come first.
(1)spark-env.sh
Declare the environment variables Spark needs:
SPARK_MASTER_WEBUI_PORT=8888
export SPARK_HOME=$SPARK_HOME
export HADOOP_HOME=$HADOOP_HOME
export MASTER=spark://hadoop-maste:7077
export SCALA_HOME=$SCALA_HOME
export SPARK_MASTER_HOST=hadoop-maste
export JAVA_HOME=/usr/local/jdk1.8.0_101
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
(2) spark-defaults.conf
Default Spark settings:
spark.executor.memory=2G
spark.driver.memory=2G
spark.executor.cores=2
#spark.sql.codegen.wholeStage=false
#spark.memory.offHeap.enabled=true
#spark.memory.offHeap.size=4G
#spark.memory.fraction=0.9
#spark.memory.storageFraction=0.01
#spark.kryoserializer.buffer.max=64m
#spark.shuffle.manager=sort
#spark.sql.shuffle.partitions=600
spark.speculation=true
spark.speculation.interval=5000
spark.speculation.quantile=0.9
spark.speculation.multiplier=2
spark.default.parallelism=1000
spark.driver.maxResultSize=1g
#spark.rdd.compress=false
spark.task.maxFailures=8
spark.network.timeout=300
spark.yarn.max.executor.failures=200
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=4
spark.dynamicAllocation.maxExecutors=8
spark.dynamicAllocation.executorIdleTimeout=60
#spark.serializer=org.apache.spark.serializer.JavaSerializer
#spark.sql.adaptive.enabled=true
#spark.sql.adaptive.shuffle.targetPostShuffleInputSize=100000000
#spark.sql.adaptive.minNumPostShufflePartitions=1
##for spark2.0
#spark.sql.hive.verifyPartitionPath=true
#spark.sql.warehouse.dir
spark.sql.warehouse.dir=/spark/warehouse
(3) Node declaration files: the masters file and the slaves file
Master node file: masters
hadoop-maste
Worker node file: slaves
hadoop-node1
hadoop-node2
Hadoop configuration files:
(1) hadoop-env.sh
Declare the environment variables Hadoop needs:
export JAVA_HOME=/usr/local/jdk1.8.0_101
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_IDENT_STRING=$USER
(2)hdfs-site.xml
Mainly configures the NameNode and DataNode directories: the NameNode stores metadata, while the DataNodes store the data blocks.
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop2.7/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop2.7/dfs/data</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
(3)core-site.xml
Sets the master node address, hadoop-maste. It must match the hostname given to the master container at startup.
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-maste:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>
</configuration>
(4)yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-maste</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop-maste:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop-maste:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop-maste:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop-maste:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop-maste:8088</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>22528</value>
    <description>Memory available on each node, in MB</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>4096</value>
    <description>Minimum memory a single task can request; default 1024 MB</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
    <description>Maximum memory a single task can request; default 8192 MB</description>
  </property>
</configuration>
(5)mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <!-- The actual master hostname and port -->
    <value>hadoop-maste:10020</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/stage</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/mr-history/done</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/mr-history/tmp</value>
  </property>
</configuration>
(6) Master node declaration file: master
hadoop-maste
Hive configuration files:
(1) hive-site.xml
Two settings matter most: hive.server2.transport.mode is set to binary so that HiveServer2 accepts JDBC connections, and the MySQL connection address is configured.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop-hive:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>hive.server2.transport.mode</name>
    <value>binary</value>
  </property>
  <property>
    <name>hive.server2.thrift.http.port</name>
    <value>10001</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop-mysql:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
</configuration>
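Once the cluster is running, the binary transport mode can be verified with a JDBC connection from beeline (a quick check; it assumes HiveServer2 is listening on its default binary port 10000, since hive.server2.thrift.port is not overridden above):
beeline -u jdbc:hive2://hadoop-hive:10000 -n root -e "show databases;"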
Zeppelin configuration files:
(1) zeppelin-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_101
export MASTER=spark://hadoop-maste:7077
export SPARK_HOME=$SPARK_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
(2)zeppelin-site.xml
The HTTP port defaults to 8080 and is changed to 18080 here. To make loading third-party packages easier, the mvnRepo is also switched to the Aliyun mirror. (A quick port check follows the configuration below.)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>zeppelin.server.addr</name>
<value>0.0.0.0</value>
<description>Server address</description>
</property>
<property>
<name>zeppelin.server.port</name>
<value>18080</value>
<description>Server port.</description>
</property>
<property>
<name>zeppelin.server.ssl.port</name>
<value>18443</value>
<description>Server ssl port. (used when ssl property is set to true)</description>
</property>
<property>
<name>zeppelin.server.context.path</name>
<value>/</value>
<description>Context Path of the Web Application</description>
</property>
<property>
<name>zeppelin.war.tempdir</name>
<value>webapps</value>
<description>Location of jetty temporary directory</description>
</property>
<property>
<name>zeppelin.notebook.dir</name>
<value>notebook</value>
<description>path or URI for notebook persist</description>
</property>
<property>
<name>zeppelin.notebook.homescreen</name>
<value></value>
<description>id of notebook to be displayed in homescreen. ex) 2A94M5J1Z Empty value displays default home screen</description>
</property>
<property>
<name>zeppelin.notebook.homescreen.hide</name>
<value>false</value>
<description>hide homescreen notebook from list when this value set to true</description>
</property>
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GitNotebookRepo</value>
<description>versioned notebook persistence layer implementation</description>
</property>
<property>
<name>zeppelin.notebook.one.way.sync</name>
<value>false</value>
<description>If there are multiple notebook storages, should we treat the first one as the only source of truth?</description>
</property>
<property>
<name>zeppelin.interpreter.dir</name>
<value>interpreter</value>
<description>Interpreter implementation base directory</description>
</property>
<property>
<name>zeppelin.interpreter.localRepo</name>
<value>local-repo</value>
<description>Local repository for interpreter's additional dependency loading</description>
</property>
<property>
<name>zeppelin.interpreter.dep.mvnRepo</name>
<value>http://maven.aliyun.com/nexus/content/groups/public/</value>
<description>Remote principal repository for interpreter's additional dependency loading</description>
</property>
<property>
<name>zeppelin.dep.localrepo</name>
<value>local-repo</value>
<description>Local repository for dependency loader</description>
</property>
<property>
<name>zeppelin.helium.node.installer.url</name>
<value>https://nodejs.org/dist/</value>
<description>Remote Node installer url for Helium dependency loader</description>
</property>
<property>
<name>zeppelin.helium.npm.installer.url</name>
<value>http://registry.npmjs.org/</value>
<description>Remote Npm installer url for Helium dependency loader</description>
</property>
<property>
<name>zeppelin.helium.yarnpkg.installer.url</name>
<value>https://github.com/yarnpkg/yarn/releases/download/</value>
<description>Remote Yarn package installer url for Helium dependency loader</description>
</property>
<property>
<name>zeppelin.interpreters</name>
<value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.rinterpreter.RRepl,org.apache.zeppelin.rinterpreter.KnitR,org.apache.zeppelin.spark.SparkRInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.angular.AngularInterpreter,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.file.HDFSFileInterpreter,org.apache.zeppelin.flink.FlinkInterpreter,,org.apache.zeppelin.python.PythonInterpreter,org.apache.zeppelin.python.PythonInterpreterPandasSql,org.apache.zeppelin.python.PythonCondaInterpreter,org.apache.zeppelin.python.PythonDockerInterpreter,org.apache.zeppelin.lens.LensInterpreter,org.apache.zeppelin.ignite.IgniteInterpreter,org.apache.zeppelin.ignite.IgniteSqlInterpreter,org.apache.zeppelin.cassandra.CassandraInterpreter,org.apache.zeppelin.geode.GeodeOqlInterpreter,org.apache.zeppelin.jdbc.JDBCInterpreter,org.apache.zeppelin.kylin.KylinInterpreter,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter,org.apache.zeppelin.scalding.ScaldingInterpreter,org.apache.zeppelin.alluxio.AlluxioInterpreter,org.apache.zeppelin.hbase.HbaseInterpreter,org.apache.zeppelin.livy.LivySparkInterpreter,org.apache.zeppelin.livy.LivyPySparkInterpreter,org.apache.zeppelin.livy.LivyPySpark3Interpreter,org.apache.zeppelin.livy.LivySparkRInterpreter,org.apache.zeppelin.livy.LivySparkSQLInterpreter,org.apache.zeppelin.bigquery.BigQueryInterpreter,org.apache.zeppelin.beam.BeamInterpreter,org.apache.zeppelin.pig.PigInterpreter,org.apache.zeppelin.pig.PigQueryInterpreter,org.apache.zeppelin.scio.ScioInterpreter,org.apache.zeppelin.groovy.GroovyInterpreter</value>
<description>Comma separated interpreter configurations. First interpreter become a default</description>
</property>
<property>
<name>zeppelin.interpreter.group.order</name>
<value>spark,md,angular,sh,livy,alluxio,file,psql,flink,python,ignite,lens,cassandra,geode,kylin,elasticsearch,scalding,jdbc,hbase,bigquery,beam,groovy</value>
<description></description>
</property>
<property>
<name>zeppelin.interpreter.connect.timeout</name>
<value>30000</value>
<description>Interpreter process connect timeout in msec.</description>
</property>
<property>
<name>zeppelin.interpreter.output.limit</name>
<value>102400</value>
<description>Output message from interpreter exceeding the limit will be truncated</description>
</property>
<property>
<name>zeppelin.ssl</name>
<value>false</value>
<description>Should SSL be used by the servers?</description>
</property>
<property>
<name>zeppelin.ssl.client.auth</name>
<value>false</value>
<description>Should client authentication be used for SSL connections?</description>
</property>
<property>
<name>zeppelin.ssl.keystore.path</name>
<value>keystore</value>
<description>Path to keystore relative to Zeppelin configuration directory</description>
</property>
<property>
<name>zeppelin.ssl.keystore.type</name>
<value>JKS</value>
<description>The format of the given keystore (e.g. JKS or PKCS12)</description>
</property>
<property>
<name>zeppelin.ssl.keystore.password</name>
<value>change me</value>
<description>Keystore password. Can be obfuscated by the Jetty Password tool</description>
</property>
<!--
<property>
<name>zeppelin.ssl.key.manager.password</name>
<value>change me</value>
<description>Key Manager password. Defaults to keystore password. Can be obfuscated.</description>
</property>
-->
<property>
<name>zeppelin.ssl.truststore.path</name>
<value>truststore</value>
<description>Path to truststore relative to Zeppelin configuration directory. Defaults to the keystore path</description>
</property>
<property>
<name>zeppelin.ssl.truststore.type</name>
<value>JKS</value>
<description>The format of the given truststore (e.g. JKS or PKCS12). Defaults to the same type as the keystore type</description>
</property>
<!--
<property>
<name>zeppelin.ssl.truststore.password</name>
<value>change me</value>
<description>Truststore password. Can be obfuscated by the Jetty Password tool. Defaults to the keystore password</description>
</property>
-->
<property>
<name>zeppelin.server.allowed.origins</name>
<value>*</value>
<description>Allowed sources for REST and WebSocket requests (i.e. http://onehost:8080,http://otherhost.com). If you leave * you are vulnerable to https://issues.apache.org/jira/browse/ZEPPELIN-173</description>
</property>
<property>
<name>zeppelin.anonymous.allowed</name>
<value>true</value>
<description>Anonymous user allowed by default</description>
</property>
<property>
<name>zeppelin.username.force.lowercase</name>
<value>false</value>
<description>Force convert username case to lower case, useful for Active Directory/LDAP. Default is not to change case</description>
</property>
<property>
<name>zeppelin.notebook.default.owner.username</name>
<value></value>
<description>Set owner role by default</description>
</property>
<property>
<name>zeppelin.notebook.public</name>
<value>true</value>
<description>Make notebook public by default when created, private otherwise</description>
</property>
<property>
<name>zeppelin.websocket.max.text.message.size</name>
<value>1024000</value>
<description>Size in characters of the maximum text message to be received by websocket. Defaults to 1024000</description>
</property>
<property>
<name>zeppelin.server.default.dir.allowed</name>
<value>false</value>
<description>Enable directory listings on server.</description>
</property>
</configuration>
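After the zeppelin container is started (see start_container.sh below), a minimal way to confirm the new port took effect is to request the UI, either on 18080 inside the container or on the host port it is mapped to (38080):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18080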
Bringing up the whole environment by hand is tedious, so the required operations are written as scripts that run automatically when the containers start.
The cluster configuration above relies on many environment variables. They are defined together in a profile file which, at image build time, replaces the system configuration file /etc/profile.
The profile file:
if [ "$PS1" ]; then
if [ "$BASH" ] && [ "$BASH" != "/bin/sh" ]; then
# The file bash.bashrc already sets the default PS1.
# PS1='\h:\w\$ '
if [ -f /etc/bash.bashrc ]; then
. /etc/bash.bashrc
fi
else
if [ "`id -u`" -eq 0 ]; then
PS1='# '
else
PS1='$ '
fi
fi
fi
if [ -d /etc/profile.d ]; then
for i in /etc/profile.d/*.sh; do
if [ -r $i ]; then
. $i
fi
done
unset i
fi
export JAVA_HOME=/usr/local/jdk1.8.0_101
export SCALA_HOME=/usr/local/scala-2.11.8
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export SPARK_HOME=/usr/local/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/usr/local/apache-hive-2.3.2-bin
export MYSQL_HOME=/usr/local/mysql
export PATH=$HIVE_HOME/bin:$MYSQL_HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
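Inside a running container, a quick sanity check that the profile took effect:
source /etc/profile
echo $JAVA_HOME $HADOOP_HOME $SPARK_HOME $HIVE_HOME
java -version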
The containers need to reach each other over the network; to simplify access, passwordless SSH login is used.
The ssh_config file:
Host localhost
StrictHostKeyChecking no
Host 0.0.0.0
StrictHostKeyChecking no
Host hadoop-*
StrictHostKeyChecking no
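After the containers are up (see start_container.sh below), passwordless login can be verified from the hadoop-maste container with a one-line check (the hostnames come from the --add-host entries used at container startup):
ssh hadoop-node1 hostname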
(1) Startup script: start-hadoop.sh
#!/bin/bash
echo -e "\n"
hdfs namenode -format -force
echo -e "\n"
$HADOOP_HOME/sbin/start-dfs.sh
echo -e "\n"
$HADOOP_HOME/sbin/start-yarn.sh
echo -e "\n"
$SPARK_HOME/sbin/start-all.sh
echo -e "\n"
hdfs dfs -mkdir /mr-history
hdfs dfs -mkdir /stage
echo -e "\n"
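Once start-hadoop.sh has finished on the master, the HDFS and YARN daemons can be checked with the standard reporting commands (a quick check, run inside the hadoop-maste container):
hdfs dfsadmin -report
yarn node -list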
(2) Restart script: restart-hadoop.sh
#!/bin/bash
echo -e "\n"
echo -e "\n"
$HADOOP_HOME/sbin/start-dfs.sh
echo -e "\n"
$HADOOP_HOME/sbin/start-yarn.sh
echo -e "\n"
$SPARK_HOME/sbin/start-all.sh
echo -e "\n"
hdfs dfs -mkdir /mr-history
hdfs dfs -mkdir /stage
echo -e "\n"
(1) MySQL initialization script: init_mysql.sh
#!/bin/bash
cd /usr/local/mysql/
echo ..........mysql_install_db --user=root.................
nohup ./scripts/mysql_install_db --user=root &
sleep 3
echo ..........mysqld_safe --user=root.................
nohup ./bin/mysqld_safe --user=root &
sleep 3
echo ..........mysqladmin -u root password 'root'.................
nohup ./bin/mysqladmin -u root password 'root' &
sleep 3
echo ..........mysqladmin -uroot -proot shutdown.................
nohup ./bin/mysqladmin -uroot -proot shutdown &
sleep 3
echo ..........mysqld_safe.................
nohup ./bin/mysqld_safe --user=root &
sleep 3
echo ...........................
nohup ./bin/mysql -uroot -proot -e "grant all privileges on *.* to root@'%' identified by 'root' with grant option;"
sleep 3
echo ........grant all privileges on *.* to root@'%' identified by 'root' with grant option...............
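A quick check, run inside the mysql container, that MySQL is up and the remote grant took effect (assumes the script above completed without errors):
mysql -uroot -proot -e "show databases; select host, user from mysql.user;"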
(1) Hive initialization script: init_hive.sh
#!/bin/bash
cd /usr/local/apache-hive-2.3.2-bin/bin
sleep 3
nohup ./schematool -initSchema -dbType mysql &
sleep 3
nohup ./hive --service metastore &
sleep 3
nohup ./hive --service hiveserver2 &
sleep 5
echo Hive has initialized!
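To confirm that schematool actually created the metastore schema in MySQL, the hive database can be queried from any container (a minimal check; the mysql client is on the PATH in this image):
mysql -h hadoop-mysql -uroot -proot -e "use hive; show tables;"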
(1) Dockerfile
FROM ubuntu:latest
MAINTAINER reganzm 183943842@qq.com
ENV BUILD_ON 2018-03-04
COPY config /tmp
#RUN mv /tmp/apt.conf /etc/apt/
RUN mkdir -p ~/.pip/
RUN mv /tmp/pip.conf ~/.pip/pip.conf
RUN apt-get update -qqy
RUN apt-get -qqy install netcat-traditional vim wget net-tools iputils-ping openssh-server libaio-dev apt-utils python-pip  # python-pip added here so that the pip install below has a pip available
RUN pip install pandas numpy matplotlib sklearn seaborn scipy tensorflow gensim #--proxy http://root:1qazxcde32@192.168.0.4:7890/
# Add the JDK
ADD ./software/jdk-8u101-linux-x64.tar.gz /usr/local/
# Add Hadoop
ADD ./software/hadoop-2.7.3.tar.gz /usr/local
# Add Scala
ADD ./software/scala-2.11.8.tgz /usr/local
# Add Spark
ADD ./software/spark-2.3.0-bin-hadoop2.7.tgz /usr/local
# Add Zeppelin
ADD ./software/zeppelin-0.8.0-bin-all.tgz /usr/local
# Add MySQL
ADD ./software/mysql-5.5.45-linux2.6-x86_64.tar.gz /usr/local
RUN mv /usr/local/mysql-5.5.45-linux2.6-x86_64 /usr/local/mysql
ENV MYSQL_HOME /usr/local/mysql
# Add Hive
ADD ./software/apache-hive-2.3.2-bin.tar.gz /usr/local
ENV HIVE_HOME /usr/local/apache-hive-2.3.2-bin
RUN echo "HADOOP_HOME=/usr/local/hadoop-2.7.3" | cat >> /usr/local/apache-hive-2.3.2-bin/conf/hive-env.sh
# Add mysql-connector-java-5.1.37-bin.jar to Hive's lib directory
ADD ./software/mysql-connector-java-5.1.37-bin.jar /usr/local/apache-hive-2.3.2-bin/lib
RUN cp /usr/local/apache-hive-2.3.2-bin/lib/mysql-connector-java-5.1.37-bin.jar /usr/local/spark-2.3.0-bin-hadoop2.7/jars
# Set the JAVA_HOME environment variable
ENV JAVA_HOME /usr/local/jdk1.8.0_101
# Hadoop environment variable
ENV HADOOP_HOME /usr/local/hadoop-2.7.3
# Scala environment variable
ENV SCALA_HOME /usr/local/scala-2.11.8
# Spark environment variable
ENV SPARK_HOME /usr/local/spark-2.3.0-bin-hadoop2.7
# Zeppelin environment variable
ENV ZEPPELIN_HOME /usr/local/zeppelin-0.8.0-bin-all
# Add the above to the system PATH
ENV PATH $HIVE_HOME/bin:$MYSQL_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$ZEPPELIN_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$PATH
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 600 ~/.ssh/authorized_keys
COPY config /tmp
# Move the configuration files into place
RUN mv /tmp/ssh_config ~/.ssh/config && \
mv /tmp/profile /etc/profile && \
mv /tmp/masters $SPARK_HOME/conf/masters && \
cp /tmp/slaves $SPARK_HOME/conf/ && \
mv /tmp/spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf && \
mv /tmp/spark-env.sh $SPARK_HOME/conf/spark-env.sh && \
mv /tmp/zeppelin-env.sh $ZEPPELIN_HOME/conf/zeppelin-env.sh && \
mv /tmp/zeppelin-site.xml $ZEPPELIN_HOME/conf/zeppelin-site.xml && \
cp /tmp/hive-site.xml $SPARK_HOME/conf/hive-site.xml && \
mv /tmp/hive-site.xml $HIVE_HOME/conf/hive-site.xml && \
mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
mv /tmp/master $HADOOP_HOME/etc/hadoop/master && \
mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
mkdir -p /usr/local/hadoop2.7/dfs/data && \
mkdir -p /usr/local/hadoop2.7/dfs/name && \
mv /tmp/init_mysql.sh ~/init_mysql.sh && chmod 700 ~/init_mysql.sh && \
mv /tmp/init_hive.sh ~/init_hive.sh && chmod 700 ~/init_hive.sh && \
mv /tmp/restart-hadoop.sh ~/restart-hadoop.sh && chmod 700 ~/restart-hadoop.sh && \
mv /tmp/zeppelin-daemon.sh ~/zeppelin-daemon.sh && chmod 700 ~/zeppelin-daemon.sh
# Create the directories Zeppelin needs, as configured in zeppelin-env.sh
RUN mkdir /var/log/zeppelin && mkdir /var/run/zeppelin && mkdir /var/tmp/zeppelin
RUN echo $JAVA_HOME
# Set the working directory
WORKDIR /root
# Start the sshd service
RUN /etc/init.d/ssh start
# Set start-hadoop.sh permissions to 700
RUN chmod 700 start-hadoop.sh
# Set the root password
RUN echo "root:555555" | chpasswd
CMD ["/bin/bash"]
(2) Build script: build.sh
echo build Spark-hadoop images
docker build -t spark-lin .  # the tag must match the image name (spark-lin) used in start_container.sh below
(3) To build the image, run:
./build.sh
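When the build finishes, confirm the image exists before moving on:
docker images | grep spark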
(1) Create the subnet
All containers communicate over an internal network, so a subnet named spark is created first: build_network.sh
echo create network
docker network create --subnet=172.16.0.0/16 spark
echo create success
docker network ls
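To confirm the subnet was created with the expected address range:
docker network inspect spark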
(2) Container startup script: start_container.sh
echo start hadoop-hive container...
docker run -itd --restart=always --net spark --ip 172.16.0.5 --privileged --name hive --hostname hadoop-hive --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-maste:172.16.0.2 --add-host zeppelin:172.16.0.7 spark-lin /bin/bash
echo start hadoop-mysql container ...
docker run -itd --restart=always --net spark --ip 172.16.0.6 --privileged --name mysql --hostname hadoop-mysql --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-hive:172.16.0.5 --add-host hadoop-maste:172.16.0.2 --add-host zeppelin:172.16.0.7 spark-lin /bin/bash
echo start hadoop-maste container ...
docker run -itd --restart=always --net spark --ip 172.16.0.2 --privileged -p 18032:8032 -p 28080:18080 -p 29888:19888 -p 17077:7077 -p 51070:50070 -p 18888:8888 -p 19000:9000 -p 11100:11000 -p 51030:50030 -p 18050:8050 -p 18081:8081 -p 18900:8900 -p 18088:8088 --name hadoop-maste --hostname hadoop-maste --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-hive:172.16.0.5 --add-host hadoop-mysql:172.16.0.6 --add-host zeppelin:172.16.0.7 spark-lin /bin/bash
echo "start hadoop-node1 container..."
docker run -itd --restart=always --net spark --ip 172.16.0.3 --privileged -p 18042:8042 -p 51010:50010 -p 51020:50020 --name hadoop-node1 --hostname hadoop-node1 --add-host hadoop-hive:172.16.0.5 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-maste:172.16.0.2 --add-host hadoop-node2:172.16.0.4 --add-host zeppelin:172.16.0.7 spark-lin /bin/bash
echo "start hadoop-node2 container..."
docker run -itd --restart=always --net spark --ip 172.16.0.4 --privileged -p 18043:8042 -p 51011:50011 -p 51021:50021 --name hadoop-node2 --hostname hadoop-node2 --add-host hadoop-maste:172.16.0.2 --add-host hadoop-node1:172.16.0.3 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-hive:172.16.0.5 --add-host zeppelin:172.16.0.7 spark-lin /bin/bash
echo "start Zeppeline container..."
docker run -itd --restart=always --net spark --ip 172.16.0.7 --privileged -p 38080:18080 -p 38443:18443 --name zeppelin --hostname zeppelin --add-host hadoop-maste:172.16.0.2 --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-hive:172.16.0.5 spark-lin /bin/bash
echo start sshd...
docker exec -it hadoop-maste /etc/init.d/ssh start
docker exec -it hadoop-node1 /etc/init.d/ssh start
docker exec -it hadoop-node2 /etc/init.d/ssh start
docker exec -it hive /etc/init.d/ssh start
docker exec -it mysql /etc/init.d/ssh start
docker exec -it zeppelin /etc/init.d/ssh start
echo start service...
docker exec -it mysql bash -c "sh ~/init_mysql.sh"
docker exec -it hadoop-maste bash -c "sh ~/start-hadoop.sh"
docker exec -it hive bash -c "sh ~/init_hive.sh"
docker exec -it zeppelin bash -c '$ZEPPELIN_HOME/bin/zeppelin-daemon.sh start'
echo finished
docker ps
(3) Stop and remove the containers: stop_container.sh
docker stop hadoop-maste
docker stop hadoop-node1
docker stop hadoop-node2
docker stop hive
docker stop mysql
docker stop zeppelin
echo stop containers
docker rm hadoop-maste
docker rm hadoop-node1
docker rm hadoop-node2
docker rm hive
docker rm mysql
docker rm zeppelin
echo rm containers
docker ps
Run the following scripts in order:
1. Create the subnet
./build_network.sh
2. Start the containers
./start_container.sh
3. Enter the master node:
docker exec -it hadoop-maste /bin/bash
Running jps shows the expected processes.
4. Access a cluster worker node
ssh hadoop-node2
jps there likewise shows process information similar to the master node,
which confirms the cluster is up and running.
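The web UIs can also be checked from the host through the port mappings defined in start_container.sh (50070 on the master is mapped to 51070, and 8088 to 18088):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:51070   # HDFS NameNode UI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18088   # YARN ResourceManager UI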
5. Spark test
Visit: http://localhost:38080
This opens the Zeppelin web UI. Create a new note using Spark as the default interpreter.
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)
// So you don't need create them manually

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("http://emr-sample-projects.oss-cn-hangzhou.aliyuncs.com/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
        s(1).replaceAll("\"", ""),
        s(2).replaceAll("\"", ""),
        s(3).replaceAll("\"", ""),
        s(5).replaceAll("\"", "").toInt
    )
).toDF()
bank.registerTempTable("bank")
With the bank table registered, a %sql paragraph (for example: select age, count(1) from bank where age < 30 group by age order by age) produces the visualized report shown in the chart below.
This shows Spark is running correctly.
The next chapter will test each module of this environment.
OVER!