Hive on Spark

1. Versions

Note: Hive on Spark is strict about component versions. The versions below have been verified to work together:

a) apache-hive-2.3.2-bin.tar.gz

b) hadoop-2.7.2.tar.gz

c) jdk-8u144-linux-x64.tar.gz

d) mysql-5.7.19-1.el7.x86_64.rpm-bundle.tar

e) mysql-connector-java-5.1.43-bin.jar

f) spark-2.0.0.tgz (the Spark source package; it must be built from source)

g) Red Hat Linux 7.4, 64-bit

2. Install Linux and the JDK, and disable the firewall

 

3. Install and configure the MySQL database

a) Unpack the MySQL installation bundle

b) Install MySQL:

yum remove mysql-libs

rpm -ivh mysql-community-common-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-libs-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-client-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-server-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-devel-5.7.19-1.el7.x86_64.rpm  (optional)

 

c) Start MySQL:

systemctl start mysqld.service

 

d) Look up and change the root password

Find the temporary root password: cat /var/log/mysqld.log | grep password

Log in and change the password: alter user 'root'@'localhost' identified by 'Welcome_1';
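A minimal first-login sketch (the temporary password is whatever mysqld.log printed; 'Welcome_1' is simply the example password used throughout this guide):

  grep 'temporary password' /var/log/mysqld.log
  mysql -uroot -p
  mysql> alter user 'root'@'localhost' identified by 'Welcome_1';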
 

e) Create the hive database and the hiveowner user:

  • Create a new database: create database hive;
  • Create a new user:
    create user 'hiveowner'@'%' identified by 'Welcome_1';
  • Grant privileges to that user:

    grant all on hive.* TO 'hiveowner'@'%';

    grant all on hive.* TO 'hiveowner'@'localhost' identified by 'Welcome_1';

4. Install Hadoop (pseudo-distributed mode as the example)

Because Hive on Spark uses the Spark on YARN mode by default, Hadoop must be configured first.

a) Preparation:

  1. Configure the hostname (edit the /etc/hosts file)
  2. Set up passwordless SSH login (a sketch follows this list)
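A minimal sketch of the passwordless-login setup (assuming the root user and the hostname hive77 used throughout this guide):

  ssh-keygen -t rsa        # accept the defaults; leave the passphrase empty
  ssh-copy-id root@hive77  # authorize the key for logins to this host
  ssh hive77               # should log in without a password prompt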

b) Hadoop configuration files:

hadoop-env.sh

  JAVA_HOME = /root/training/jdk1.8.0_144

hdfs-site.xml

  dfs.replication = 1
    (block replication factor; the default is 3)
  dfs.permissions = false
    (whether HDFS permission checking is enabled)

core-site.xml

  fs.defaultFS = hdfs://hive77:9000
    (address of the NameNode)
  hadoop.tmp.dir = /root/training/hadoop-2.7.2/tmp/
    (directory where HDFS data is stored)

mapred-site.xml

  mapreduce.framework.name = yarn

yarn-site.xml

  yarn.resourcemanager.hostname = hive77
  yarn.nodemanager.aux-services = mapreduce_shuffle
  yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
    (Spark on YARN needs the fair scheduler so that all jobs in the YARN cluster receive an equal share of resources.)
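Each setting above goes into its file as a <property> element; for example, the fair-scheduler entry in yarn-site.xml:

<property>
   <name>yarn.resourcemanager.scheduler.class</name>
   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>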

 

c) Start Hadoop
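For a pseudo-distributed setup this is typically (with HADOOP_HOME/sbin on the PATH):

  start-dfs.sh    # starts NameNode, DataNode, SecondaryNameNode
  start-yarn.sh   # starts ResourceManager, NodeManager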

d) Check in the YARN web console that the fair scheduler is in effect
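The scheduler page of the ResourceManager web UI (http://hive77:8088/cluster/scheduler on this setup) should report FairScheduler as the active scheduler.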

5. Build Spark from source

(Maven is required; the Spark source package ships with its own copy of Maven.)

a) Run the command below to build (it takes a long time; be patient):

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

 

b) A successful build produces spark-2.0.0-bin-hadoop2-without-hive.tgz

c) Install and configure Spark

  1. Unpack the archive and check the directory layout.

  2. Add the following settings to spark-env.sh:

export JAVA_HOME=/root/training/jdk1.8.0_144

export HADOOP_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop

export YARN_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop

export SPARK_MASTER_HOST=hive77

export SPARK_MASTER_PORT=7077

export SPARK_EXECUTOR_MEMORY=512m

export SPARK_DRIVER_MEMORY=512m

export SPARK_WORKER_MEMORY=512m

  3. Copy the relevant Hadoop jars into Spark's jars directory (run from the Spark home), as follows:

cp ~/training/hadoop-2.7.2/share/hadoop/common/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/common/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/yarn/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/yarn/lib/*.jar jars/

  4. Create a directory named /spark-jars on HDFS and upload Spark's jars to it, so the jars do not have to be distributed on every application run (a configuration sketch follows these commands):

  • hdfs dfs -mkdir /spark-jars
  • hdfs dfs -put jars/*.jar /spark-jars
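So that YARN jobs actually use the uploaded jars instead of shipping local ones, point spark.yarn.jars at the HDFS directory. A minimal sketch of the entry, e.g. in conf/spark-defaults.conf (hive77:9000 is this guide's NameNode address):

  spark.yarn.jars hdfs://hive77:9000/spark-jars/*.jar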

d) Start Spark with sbin/start-all.sh and verify that it is configured correctly (a quick check follows)
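A quick sanity check, assuming the standalone master and worker configured in spark-env.sh above:

  jps    # should list Master and Worker processes
  # master web UI: http://hive77:8080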
 

6. Install and configure Hive

a) Unpack the Hive installation package and put the MySQL JDBC driver (mysql-connector-java-5.1.43-bin.jar) into Hive's lib directory.

 

b) Set the Hive environment variables:

HIVE_HOME=/root/training/apache-hive-2.3.2-bin

export HIVE_HOME

PATH=$HIVE_HOME/bin:$PATH

export PATH

c) Copy the following Spark jars into Hive's lib directory (example commands follow this list):

  1. scala-library
  2. spark-core
  3. spark-network-common
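A sketch of the copy, run from the Spark installation directory; the exact version suffixes depend on your build (the names below assume Spark 2.0.0 with Scala 2.11):

  cp jars/scala-library-2.11.8.jar $HIVE_HOME/lib/
  cp jars/spark-core_2.11-2.0.0.jar $HIVE_HOME/lib/
  cp jars/spark-network-common_2.11-2.0.0.jar $HIVE_HOME/lib/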

d) Create the directory /sparkeventlog on HDFS to store Spark event logs:

  hdfs dfs -mkdir /sparkeventlog

 

e) Configure hive-site.xml as follows:

Parameter                                 Reference value

javax.jdo.option.ConnectionURL            jdbc:mysql://localhost:3306/hive?useSSL=false
javax.jdo.option.ConnectionDriverName     com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName       hiveowner
javax.jdo.option.ConnectionPassword       Welcome_1
hive.execution.engine                     spark
hive.enable.spark.execution.engine        true
spark.home                                /root/training/spark-2.0.0-bin-hadoop2-without-hive
spark.master                              yarn-client
spark.eventLog.enabled                    true
spark.eventLog.dir                        hdfs://hive77:9000/sparkeventlog
spark.serializer                          org.apache.spark.serializer.KryoSerializer
spark.executor.memory                     512m
spark.driver.memory                       512m
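Each row becomes a <property> element in hive-site.xml; for example:

<property>
   <name>hive.execution.engine</name>
   <value>spark</value>
</property>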

 

f) Initialize the MySQL metastore: schematool -dbType mysql -initSchema

g) Start the Hive shell and create an employee table to hold the employee data (a sketch follows)
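A minimal sketch of the table, assuming emp.csv uses the classic comma-separated emp schema (empno, ename, job, mgr, hiredate, sal, comm, deptno); adjust the columns to match your file:

  create table emp1
  (empno int, ename string, job string, mgr int, hiredate string, sal int, comm int, deptno int)
  row format delimited fields terminated by ',';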
 

h) Load the emp.csv file:

  load data local inpath '/root/temp/emp.csv' into table emp1;

 

i) Run a query that sorts employees by salary (it fails at this point):

  select * from emp1 order by sal;

j) Check the YARN web console

The failure is caused by the way YARN computes virtual memory usage. Disable the virtual-memory check by setting yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml:

<property>

   <name>yarn.nodemanager.vmem-check-enabled</name>

   <value>false</value>

</property>

 

 

k) Restart Hadoop, Spark, and Hive, then run the query again

 

 

One final note: because Spark on YARN is configured here, the standalone Spark cluster does not need to be running when Hive executes queries; YARN manages the Spark jobs.
