Note: Hadoop-2.7.7, Hive-2.1.1, spark-1.6.0-bin-hadoop2.6; the operating system is Ubuntu 18 64-bit. I recently worked on a Hive on Spark task and am recording the setup here.
First, configure Hadoop. List-1.1 is core-site.xml:
List-1.1
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/software/docker/hadoop/hadoop-2.7.7/data/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
</configuration>
List-1.2 is hdfs-site.xml:
List-1.2
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/software/docker/hadoop/hadoop-2.7.7/data/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/software/docker/hadoop/hadoop-2.7.7/data/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
List-1.3 is mapred-site.xml:
List-1.3
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
List-1.4 is yarn-site.xml:
List-1.4
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
</configuration>
Add Hadoop to the environment variables by appending the following to /etc/profile:
List-1.5
#hadoop
export HADOOP_HOME=/opt/software/docker/hadoop/hadoop-2.7.7
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Start Hadoop by running start-dfs.sh and start-yarn.sh from the command line. If they report no errors and "hadoop fs -ls /" also runs cleanly afterwards, the cluster is up. If anything fails, check the log files.
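A minimal sanity-check sketch after starting the daemons (the jps process names are what a single-node setup is expected to show):

start-dfs.sh
start-yarn.sh
# jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
jps
# should return without errors (the listing may be empty on a fresh cluster)
hadoop fs -ls /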
Next, set up Hive. Add the following properties to hive-site.xml:
List-2.1
<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>
Add Hive to the environment variables as well: modify /etc/profile, append the following, then run "source /etc/profile".
List-2.2
#hive
export HIVE_HOME=/opt/software/docker/hadoop/apache-hive-2.1.1-bin
export PATH=$HIVE_HOME/bin:$PATH
Then run "schematool -initSchema -dbType mysql" from the command line. If it completes without errors you are fine; if it fails, check the Hive logs.
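Note that schematool with -dbType mysql assumes the MySQL metastore connection is already configured in hive-site.xml, which the listings above do not show. A minimal sketch, with a hypothetical local MySQL instance, database name and credentials:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>

The MySQL JDBC driver jar also has to be on Hive's classpath, typically by dropping it into $HIVE_HOME/lib.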
Run the hive command to enter the Hive CLI, where you can create tables and run other statements.
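A minimal smoke test in the Hive CLI (the table name is made up); at this stage the INSERT still runs as a MapReduce job:

CREATE TABLE test_mr (id INT, name STRING);
INSERT INTO test_mr VALUES (1, 'a'), (2, 'b');
SELECT * FROM test_mr;
DROP TABLE test_mr;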
Because Hive's default execution engine is MapReduce, which is slow, we want to switch the execution engine to Spark.
This is the most troublesome part.
One crucial point: the Hive version and the Spark version have to match (there is a compatibility table for this). Since we are using Hive 2.1.1, the Spark version we pick is 1.6.0.
You cannot simply use the binaries downloaded from the Apache Spark website, because they contain Hadoop/Hive-related code; we have to compile Spark ourselves.
Download the Spark source code from GitHub. Install Scala (I installed 2.12); my /etc/profile contains:
List-3.1.1
#scala
export SCALA_HOME=/opt/software/tool/scala2.12
export PATH=$SCALA_HOME/bin:$PATH
Then change into the Spark source directory and build it with the command in List-3.1.2; afterwards a new file named "spark-1.6.0-bin-hadoop2-without-hive.tgz" appears.
List-3.1.2
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"
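For reference, a sketch of the steps leading up to List-3.1.2, assuming the v1.6.0 tag of the apache/spark GitHub repository:

git clone https://github.com/apache/spark.git
cd spark
git checkout v1.6.0
# now run the make-distribution.sh command from List-3.1.2;
# spark-1.6.0-bin-hadoop2-without-hive.tgz appears in the source root when it finishes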
In fact, my pseudo-distributed Spark was installed from the package downloaded from the official site; I only took the spark-assembly-1.6.0-hadoop2.6.0.jar from lib in the build produced by List-3.1.2 and used it to replace the spark-assembly-1.6.0-hadoop2.6.0.jar under lib in the Spark installed from the official binaries.
In SPARK_HOME/conf, run "cp spark-defaults.conf.template spark-defaults.conf"; the contents of spark-defaults.conf are shown in List-3.2:
List-3.2
spark.master                     spark://127.0.0.1:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://127.0.0.1:9000/opt/applogs/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              512M
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
"cp spark-env.sh.template spark-env.sh",以後spark-env.sh內容以下,網上說的SPARK_DIST_CLASSPATH=%(hadoop classpath)不生效。
List-3.3
export JAVA_HOME=/opt/software/tool/jdk1.8
export HADOOP_HOME=/opt/software/docker/hadoop/hadoop-2.7.7
export SCALA_HOME=/opt/software/tool/scala2.12
export HADOOP_CONF_DIR=/opt/software/docker/hadoop/hadoop-2.7.7/etc/hadoop
export SPARK_MASTER_IP=mjduan-host
export SPARK_WORKER_MEMORY=3072M
export SPARK_DIST_CLASSPATH=$(/opt/software/docker/hadoop/hadoop-2.7.7/bin/hadoop classpath)
Next, modify Hive's hive-site.xml:
Set hive.execution.engine to spark, and put spark-assembly-1.6.0-hadoop2.6.0.jar into Hive's lib directory.
Create a /yarn directory on HDFS and upload the spark-assembly-1.6.0-hadoop2.6.0.jar produced in List-3.1.2 to it; also create the directory /opt/applogs/spark on HDFS. The shell commands are sketched below.
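The copy/upload steps above as shell commands (the paths are the ones used throughout this post; adjust to your layout):

# copy the freshly built assembly into Hive's lib
cp $SPARK_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar $HIVE_HOME/lib/
# upload the assembly to /yarn on HDFS and create the Spark event-log directory
hadoop fs -mkdir -p /yarn
hadoop fs -put $SPARK_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar /yarn/
hadoop fs -mkdir -p /opt/applogs/spark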
Then add the following (List-3.4) to hive-site.xml:
List-3.4
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>
    Expects one of [mr, tez, spark].
    Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
    remains the default engine for historical reasons, it is itself a historical engine
    and is deprecated in Hive 2 line. It may be removed without further warning.
  </description>
</property>
<property>
  <name>spark.master</name>
  <value>spark://127.0.0.1:7077</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs://127.0.0.1:9000/opt/applogs/spark</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512M</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://127.0.0.1:9000/yarn/spark-assembly-1.6.0-hadoop2.6.0.jar</value>
</property>
<property>
  <name>hive.enable.spark.execution.engine</name>
  <value>true</value>
</property>
<property>
  <name>spark.executor.extraJavaOptions</name>
  <value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"</value>
</property>
Then restart Hive. Go to SPARK_HOME/sbin and run ./start-all.sh; the logs contain the address of the Spark web UI.
Run hive on the command line to enter the Hive CLI, then execute "set hive.execution.engine;" to see which execution engine is currently in use. If creating tables and inserting data in the CLI works without errors, the setup is basically done.
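A minimal verification sketch in the Hive CLI (the table name is made up); the INSERT should now show up as a job in the Spark UI rather than as a MapReduce job:

set hive.execution.engine;
-- expected output: hive.execution.engine=spark
CREATE TABLE test_spark (id INT, name STRING);
INSERT INTO test_spark VALUES (1, 'a');
SELECT COUNT(*) FROM test_spark;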
By default, update/delete statements in Hive fail with the error shown in List-4.1; to fix this, hive-site.xml has to be modified (see the linked article for the details), and Hive restarted afterwards. In addition, a table must be created as "clustered by xxxxx..." (i.e. bucketed) before its data can be updated or deleted; otherwise you get the error shown in List-4.2. A sketch of both pieces follows List-4.2.
List-4.1
Attempt to do update or delete using transaction manager that does not support these operations
List-4.2
Attempt to do update or delete on table default.test that does not use an AcidOutputFormat or is not bucketed
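A sketch of what the referenced changes usually amount to; the property values and the table below are illustrative, not taken from the original listings. First, the transaction-related settings in hive-site.xml:

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>

Then the table has to be bucketed, stored as ORC, and marked transactional before UPDATE/DELETE will run against it:

CREATE TABLE test_acid (id INT, name STRING)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE test_acid SET name = 'b' WHERE id = 1;
DELETE FROM test_acid WHERE id = 1;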
There is a lot involved in this setup; I had to search all over the place.