Note: Hive on Spark is strict about version compatibility. The combination below has been verified to work:
a) apache-hive-2.3.2-bin.tar.gz
b) hadoop-2.7.2.tar.gz
c) jdk-8u144-linux-x64.tar.gz
d) mysql-5.7.19-1.el7.x86_64.rpm-bundle.tar
e) mysql-connector-java-5.1.43-bin.jar
f) spark-2.0.0.tgz (the Spark source package; it must be built from source)
g) Redhat Linux 7.4 64-bit
Install MySQL: first remove the preinstalled mysql-libs, then install the MySQL 5.7 community packages.
yum remove mysql-libs
rpm -ivh mysql-community-common-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-client-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-server-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-devel-5.7.19-1.el7.x86_64.rpm (optional)
systemctl start mysqld.service
Find the generated root password: cat /var/log/mysqld.log | grep password
After logging in, change it: alter user 'root'@'localhost' identified by 'Welcome_1';
Create the metastore database and grant a dedicated user access to it (in MySQL 5.7, grant ... identified by also creates the account, so it is needed for both hosts):
create database hive;
grant all on hive.* TO 'hiveowner'@'%' identified by 'Welcome_1';
grant all on hive.* TO 'hiveowner'@'localhost' identified by 'Welcome_1';
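As a quick check that the new account works (assuming the grants above succeeded, the output should include the hive database):
mysql -uhiveowner -pWelcome_1 -e 'show databases;'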
Because Hive on Spark uses the Spark on YARN mode by default, Hadoop has to be configured first. The required settings are:
| File | Parameter | Value | Notes |
|------|-----------|-------|-------|
| hadoop-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144 | |
| hdfs-site.xml | dfs.replication | 1 | Block replication factor (default is 3) |
| hdfs-site.xml | dfs.permissions | false | Whether HDFS permission checking is enabled |
| core-site.xml | fs.defaultFS | hdfs://hive77:9000 | Address of the NameNode |
| core-site.xml | hadoop.tmp.dir | /root/training/hadoop-2.7.2/tmp/ | Directory where HDFS data is stored |
| mapred-site.xml | mapreduce.framework.name | yarn | |
| yarn-site.xml | yarn.resourcemanager.hostname | hive77 | |
| yarn-site.xml | yarn.nodemanager.aux-services | mapreduce_shuffle | |
| yarn-site.xml | yarn.resourcemanager.scheduler.class | org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler | With Spark on YARN, the Fair Scheduler is needed so that all jobs in the YARN cluster get an equal share of resources to run. |
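For reference, the yarn-site.xml rows above expand to standard Hadoop <property> elements; a minimal sketch:
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hive77</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
</configuration>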
(Maven is required; the Spark source package ships with its own Maven.)
a) Run the following command to build Spark (the build takes a long time; be patient):
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
b) A successful build produces spark-2.0.0-bin-hadoop2-without-hive.tgz.
c) Install and configure Spark
1. Extract the tarball and inspect the directory structure.
2. Add the following settings to spark-env.sh:
export JAVA_HOME=/root/training/jdk1.8.0_144
export HADOOP_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop
export YARN_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop
export SPARK_MASTER_HOST=hive77
export SPARK_MASTER_PORT=7077
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_DRIVER_MEMORY=512m
export SPARK_WORKER_MEMORY=512m
3. Copy the relevant Hadoop jars into Spark's jars directory:
cp ~/training/hadoop-2.7.2/share/hadoop/common/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/common/lib/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/lib/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/lib/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/yarn/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/yarn/lib/*.jar jars/
4. Create a directory on HDFS named spark-jars and upload Spark's jars to it, so they do not have to be shipped to the cluster every time an application runs; a sketch follows.
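A minimal sketch of that step, assuming the directory sits at the HDFS root and the commands run from the Spark installation directory:
hdfs dfs -mkdir /spark-jars
hdfs dfs -put jars/*.jar /spark-jars/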
d) Start Spark with sbin/start-all.sh and verify that it comes up correctly.
a) Extract the Hive archive and copy the MySQL JDBC driver into Hive's lib directory.
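A sketch of those two steps, assuming both downloaded files sit in the current directory:
# extract Hive and drop in the MySQL JDBC driver
tar -zxvf apache-hive-2.3.2-bin.tar.gz -C ~/training/
cp mysql-connector-java-5.1.43-bin.jar ~/training/apache-hive-2.3.2-bin/lib/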
b) Set the Hive environment variables:
HIVE_HOME=/root/training/apache-hive-2.3.2-bin
export HIVE_HOME
PATH=$HIVE_HOME/bin:$PATH
export PATH
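These exports typically go into ~/.bash_profile (an assumption; any login profile works), after which the profile has to be reloaded:
source ~/.bash_profile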
c) Copy the required Spark jars into Hive's lib directory; see the sketch below.
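The original jar list is not preserved here. For Hive 2.3 with Spark 2.0, the jars usually copied are scala-library, spark-core, and spark-network-common (a hypothetical reconstruction; version suffixes depend on the build):
# hypothetical jar set -- verify against your Spark build
cd /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars
cp scala-library-*.jar spark-core_*.jar spark-network-common_*.jar /root/training/apache-hive-2.3.2-bin/lib/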
d) Create a directory on HDFS, /sparkeventlog, to store the Spark event logs:
hdfs dfs -mkdir /sparkeventlog
e) Configure hive-site.xml as follows:
| Parameter | Reference value |
|-----------|-----------------|
| javax.jdo.option.ConnectionURL | jdbc:mysql://localhost:3306/hive?useSSL=false |
| javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver |
| javax.jdo.option.ConnectionUserName | hiveowner |
| javax.jdo.option.ConnectionPassword | Welcome_1 |
| hive.execution.engine | spark |
| hive.enable.spark.execution.engine | true |
| spark.home | /root/training/spark-2.0.0-bin-hadoop2-without-hive |
| spark.master | yarn-client |
| spark.eventLog.enabled | true |
| spark.eventLog.dir | hdfs://hive77:9000/sparkeventlog |
| spark.serializer | org.apache.spark.serializer.KryoSerializer |
| spark.executor.memory | 512m |
| spark.driver.memory | 512m |
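Each row above becomes a <property> element in hive-site.xml; a short excerpt as a sketch:
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
<property>
    <name>spark.master</name>
    <value>yarn-client</value>
</property>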
f) Initialize the metastore schema in MySQL: schematool -dbType mysql -initSchema
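As a quick check that the schema was created, schematool can print the connection details and schema version:
schematool -dbType mysql -info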
g) Start the hive shell and create an employee table to hold the employee data; a reconstruction follows.
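The original DDL is not preserved; this is a hypothetical reconstruction assuming emp.csv follows the classic comma-separated EMP layout (column names and types are assumptions, apart from sal, which the query below relies on):
-- hypothetical schema; adjust to the actual emp.csv columns
create table emp1
(empno int, ename string, job string, mgr int, hiredate string, sal int, comm int, deptno int)
row format delimited fields terminated by ',';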
h) Load the emp.csv file:
load data local inpath '/root/temp/emp.csv' into table emp1;
i) Run a query that sorts employees by salary (the execution fails):
select * from emp1 order by sal;
j) Check the YARN Web Console.
The error is caused by the way YARN accounts for virtual memory. Set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml to disable the virtual-memory check:
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
k) Restart Hadoop, Spark, and Hive, then run the query again.
One final note: with Spark on YARN configured, there is no need to start the standalone Spark cluster when running Hive, because YARN manages the Spark executors at that point.