Goal: run spark-sql in Spark on YARN mode, have it read Hive's metastore, and compare the performance of spark-sql against Hive.
Software environment:
- hadoop2.7.3
- apache-hive-2.1.1-bin
- spark-2.1.0-bin-hadoop2.7
- jdk1.8
Hadoop is a pseudo-distributed install: 1 node, 2 cores, 4 GB of memory.
Hive runs in remote metastore mode.
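Since the metastore is remote, spark-sql will later talk to a standalone metastore service rather than to the metastore database directly. As a minimal sketch, assuming the service runs on this same host on Hive's default port 9083, hive-site.xml carries something like:

```xml
<!-- hive-site.xml: point clients at the remote metastore service.
     Host and port are assumptions for this single-node setup. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://ubuntuServer01:9083</value>
</property>
```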
Spark download page:
http://spark.apache.org/downloads.html
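For example, the build used here can be pulled from the Apache archive (one plausible mirror URL, assuming the standard archive layout):

```bash
# Download the prebuilt Spark 2.1.0 package for Hadoop 2.7
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
```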
Unpack Spark and copy the configuration templates:
```bash
tar -zxvf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
cp slaves.template slaves
cp log4j.properties.template log4j.properties
cp spark-defaults.conf.template spark-defaults.conf
```
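Optionally, quiet the console logging so spark-sql output is easier to read; a one-line tweak to the freshly copied log4j.properties (WARN is a personal preference here, not a requirement):

```properties
# log4j.properties: show only warnings and errors on the console
log4j.rootCategory=WARN, console
```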
Edit Spark's configuration files.

```bash
cd $SPARK_HOME/conf
vi spark-env.sh
```

spark-env.sh:

```bash
export JAVA_HOME=/usr/local/jdk
export HADOOP_HOME=/home/fuxin.zhao/soft/hadoop-2.7.3
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
```
```bash
vi spark-defaults.conf
```

spark-defaults.conf:

```properties
spark.master                     spark://ubuntuServer01:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://ubuntuServer01:9000/tmp/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              512m
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.yarn.jars                 hdfs://ubuntuServer01:9000/tmp/spark/lib_jars/*.jar
```
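Note that Spark does not create the event-log directory for you; with spark.eventLog.enabled set to true, applications typically refuse to start if the HDFS path is missing. A hedged pre-step, using the path from the config above:

```bash
# Create the directory referenced by spark.eventLog.dir
hdfs dfs -mkdir -p /tmp/spark
```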
```bash
vi slaves
```

slaves:

```
ubuntuServer01
```
**Configure spark-sql to read Hive's metastore**

```bash
# Symlink hive-site.xml into Spark's conf directory
cd $SPARK_HOME/conf
ln -s /home/fuxin.zhao/soft/apache-hive-2.1.1-bin/conf/hive-site.xml hive-site.xml
# Copy the MySQL JDBC driver into Spark's jars directory
cp $HIVE_HOME/lib/mysql-connector-java-5.1.35-bin.jar $SPARK_HOME/jars
```
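Because Hive is in remote mode, the metastore service also has to be running before spark-sql starts; a sketch of the standard way to launch it (the log file name is arbitrary):

```bash
# Start the Hive metastore service in the background (default port 9083)
nohup hive --service metastore > metastore.log 2>&1 &
```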
Test spark-sql:
First create a few databases and tables with Hive, to check that spark-sql can access them.
I loaded 60,000 rows of data, about 9 MB, into the temp.s4_order table.
```bash
# Create databases and tables with Hive, to check that spark-sql can access them
hive -e "
create database temp;
create database test;
use temp;
CREATE EXTERNAL TABLE t_source(
  sid string,
  uid string
);
load data local inpath '/home/fuxin.zhao/t_data' into table t_source;
CREATE EXTERNAL TABLE s4_order(
  orderid                     int,
  retailercode                string,
  orderstatus                 int,
  paystatus                   int,
  payid                       string,
  paytime                     timestamp,
  payendtime                  timestamp,
  salesamount                 int,
  description                 string,
  usertoken                   string,
  username                    string,
  mobile                      string,
  createtime                  timestamp,
  refundstatus                int,
  subordercount               int,
  subordersuccesscount        int,
  subordercreatesuccesscount  int,
  businesstype                int,
  deductedamount              int,
  refundorderstatus           int,
  platform                    string,
  subplatform                 string,
  refundnumber                string,
  refundpaytime               timestamp,
  refundordertime             timestamp,
  primarysubordercount        int,
  primarysubordersuccesscount int,
  suborderprocesscount        int,
  isshoworder                 int,
  updateshowordertime         timestamp,
  devicetoken                 string,
  lastmodifytime              timestamp,
  refundreasontype            int
)
PARTITIONED BY (
  dt string
);
load data local inpath '/home/fuxin.zhao/20170214003514' OVERWRITE into table s4_order partition(dt='2017-02-13');
load data local inpath '/home/fuxin.zhao/20170215000514' OVERWRITE into table s4_order partition(dt='2017-02-14');
"
```
Now start the spark-sql client:

```bash
spark-sql --master yarn
```

Note that `--master yarn` on the command line overrides the standalone `spark.master` URL set in spark-defaults.conf above. Run the following SQL at the prompt:
```sql
show databases;
use temp;
show tables;
select * from s4_order limit 100;
select count(*), dt from s4_order group by dt;
select count(*) from s4_order;
insert overwrite table t_source select orderid, createtime from s4_order;
```
Two of the queries give a direct timing comparison:

```sql
select count(*), dt from s4_order group by dt;  -- spark-sql: 11s; hive: 30s
select count(*) from s4_order;                  -- spark-sql: 2s;  hive: 25s
```
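To reproduce numbers like these, one hedged approach is to time the same statement through each engine's non-interactive mode (both hive and spark-sql accept -e; the table follows the setup above):

```bash
# Wall-clock the same aggregation in both engines
time hive -e "select count(*) from temp.s4_order;"
time spark-sql --master yarn -e "select count(*) from temp.s4_order;"
```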
The rough takeaway is that spark-sql ran about 3 to 10 times faster than Hive here. But this test used a local single-node VM with pseudo-distributed Hadoop and very limited resources; in a production environment the ratio will shift with cluster size, data volume, and query complexity.