Using Hive with Spark (example: operating on Elasticsearch from Hive)

Date: 2019-11-06

Tags: spark, hive, usage example

Category: Spark

1. Configure hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>

If needed, you can also set the HDFS storage location of the warehouse:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

If this property is not set, Hive falls back to its default location, which is also /user/hive/warehouse.
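The properties above are fragments. In a real hive-site.xml they sit inside a single <configuration> root element. As a minimal sketch, the Python below generates such a file from a dictionary using only the standard library. The helper name build_hive_site and the METASTORE_PROPS dictionary are made up for this illustration; the property names and values are the ones listed above.

```python
import xml.etree.ElementTree as ET

# Property names and values copied from the hive-site.xml fragments above.
METASTORE_PROPS = {
    "javax.jdo.option.ConnectionURL": (
        "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true",
        "JDBC connect string for a JDBC metastore",
    ),
    "javax.jdo.option.ConnectionDriverName": (
        "com.mysql.jdbc.Driver",
        "Driver class name for a JDBC metastore",
    ),
    "javax.jdo.option.ConnectionUserName": (
        "hive",
        "username to use against metastore database",
    ),
    "javax.jdo.option.ConnectionPassword": (
        "hive",
        "password to use against metastore database",
    ),
    "hive.metastore.warehouse.dir": (
        "/user/hive/warehouse",
        "location of default database for the warehouse",
    ),
}


def build_hive_site(props):
    """Hypothetical helper: wrap {name: (value, description)} in <configuration>."""
    root = ET.Element("configuration")
    for name, (value, description) in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        ET.SubElement(prop, "description").text = description
    return root


if __name__ == "__main__":
    # Print the XML; redirect it to conf/hive-site.xml if it looks right.
    print(ET.tostring(build_hive_site(METASTORE_PROPS), encoding="unicode"))
```

The output is a single unindented line of XML, which Hive reads the same way as a hand-formatted file.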

2. Create the MySQL account for Hive

create user 'hive' identified by 'hive';
grant all privileges on *.* to 'hive' with grant option;
flush privileges;
create database hive;
-- To change an account's password later (MySQL 5.6 and earlier syntax):
update mysql.user set password = PASSWORD('newPassword') where user = 'newUser';

3. Initialize the metastore schema in MySQL

schematool -initSchema -dbType mysql

4. Start the Hive services

hive --service metastore
bin/hive --service hiveserver2   # listens on port 10000 by default

5. Start the client

$ hive

6. Map an HBase table

CREATE EXTERNAL TABLE hive_test(
  key varchar(30),
  name varchar(30),
  age int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,info:name,info:age"
)
TBLPROPERTIES("hbase.table.name" = "test");
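The entries in hbase.columns.mapping are matched to the Hive columns by position, and :key stands for the HBase row key. The sketch below only illustrates that pairing for the table above; pair_hbase_columns is a hypothetical helper written for this example, not part of Hive or HBase.

```python
def pair_hbase_columns(hive_columns, columns_mapping):
    """Hypothetical helper: pair Hive columns with mapping entries by position."""
    entries = columns_mapping.split(",")
    if len(entries) != len(hive_columns):
        raise ValueError(
            "expected %d mapping entries, got %d" % (len(hive_columns), len(entries))
        )
    return dict(zip(hive_columns, entries))


if __name__ == "__main__":
    pairs = pair_hbase_columns(["key", "name", "age"], ":key,info:name,info:age")
    for hive_column, hbase_target in pairs.items():
        print(hive_column, "->", hbase_target)
```

Here key maps to the row key, while name and age map to the name and age qualifiers in the info column family.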

Load the data from the HBase-backed table into a local table:

INSERT OVERWRITE TABLE hive_oss_user_label_action_data_local SELECT * FROM hive_oss_user_label_action_data;

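7. Read Elasticsearch from Hive

This part describes how to read data in Elasticsearch through Hive, so that we can work with it just like any other Hive table, which is very convenient for developers. The component versions used here are Hive 0.12, Hadoop 2.2.0, and Elasticsearch 2.3.4.

Let's first look at the mapping of the relevant type in Elasticsearch: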

{
  "user": {
    "properties": {
      "regtime": {
        "index": "not_analyzed",
        "type": "string"
      },
      "uid": {
        "type": "integer"
      },
      "mobile": {
        "index": "not_analyzed",
        "type": "string"
      },
      "username": {
        "index": "not_analyzed",
        "type": "string"
      }
    }
  }
}

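The Elasticsearch index is named iteblog and the type is user; user has four fields: regtime, uid, mobile, and username. Now let's move to the Hive side.

For Hive to work with data in Elasticsearch, some setup is needed on the Hive side. Fortunately, Elasticsearch officially provides a library for this: the elasticsearch-hadoop-xxx.jar package. Because our Elasticsearch version is 2.x, we need at least ES-Hadoop 2.2.x; this article uses elasticsearch-hadoop-2.3.4.jar, which can be downloaded from the Maven Central repository. There are several ways to make Hive load elasticsearch-hadoop-2.3.4.jar: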

1) Load it directly with the ADD JAR command:

hive> ADD JAR /home/iteblog/elasticsearch-hadoop-2.3.4.jar;
Added [/home/iteblog/elasticsearch-hadoop-2.3.4.jar] to class path
Added resources: [/home/iteblog/elasticsearch-hadoop-2.3.4.jar]

2) Set it when starting Hive:

$ bin/hive --auxpath=/home/iteblog/elasticsearch-hadoop-2.3.4.jar

3) Set the hive.aux.jars.path property:

$ bin/hive -hiveconf hive.aux.jars.path=/home/iteblog/elasticsearch-hadoop-2.3.4.jar

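Or write this setting directly into hive-site.xml so that it applies to every later session:

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/iteblog/elasticsearch-hadoop-2.3.4.jar</value>
  <description>A comma separated list (with no spaces) of the jar files</description>
</property>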

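Choose whichever option suits your situation. Once the Elasticsearch library is available, we can create the table in Hive. For convenience, we give the Hive columns the same names and types as the fields in Elasticsearch: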

hive (iteblog)> create EXTERNAL table `user`(
              > regtime string,
              > uid int,
              > mobile string,
              > username string
              > )
              > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
              > TBLPROPERTIES('es.resource' = 'iteblog/user', 'es.nodes'='www.iteblog.com', 'es.port'='9200', 'es.nodes.wan.only'='true');
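When several Elasticsearch-backed tables share the same connection settings, it can help to generate the statement instead of typing it. The sketch below assembles the same CREATE EXTERNAL TABLE text from a column list and a property dictionary. The function es_table_ddl and the two constants are made up for this example, and the sketch assumes that no name or value contains a single quote or backtick.

```python
# Columns and properties copied from the CREATE TABLE statement above.
USER_COLUMNS = [
    ("regtime", "string"),
    ("uid", "int"),
    ("mobile", "string"),
    ("username", "string"),
]
ES_PROPERTIES = {
    "es.resource": "iteblog/user",
    "es.nodes": "www.iteblog.com",
    "es.port": "9200",
    "es.nodes.wan.only": "true",
}


def es_table_ddl(table, columns, properties):
    """Hypothetical helper: build the HiveQL text for an ES-backed external table.

    Assumes identifiers and values need no escaping.
    """
    column_lines = ",\n".join(
        "  %s %s" % (name, hive_type) for name, hive_type in columns
    )
    property_list = ", ".join(
        "'%s' = '%s'" % (key, value) for key, value in properties.items()
    )
    return (
        "CREATE EXTERNAL TABLE `%s` (\n" % table
        + column_lines
        + "\n)\n"
        + "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'\n"
        + "TBLPROPERTIES(%s);" % property_list
    )


if __name__ == "__main__":
    print(es_table_ddl("user", USER_COLUMNS, ES_PROPERTIES))
```

The printed statement can be pasted into the Hive client or saved to a file and run with hive -f.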

At this point we can already query the Elasticsearch data from Hive:

hive (iteblog)> select * from `user` limit 10;
2016-10-24 13:08:16 1 13112121212 Tom
2016-10-24 14:08:16 2 13112121212 Join
2016-10-25 14:23:16 3 13112121212 iteblog
2016-10-25 13:08:16 4 NULL weixin
2016-10-25 19:08:16 5 13112121212 bbs
2016-10-25 13:14:04 6 NULL zhangshan
2016-10-25 13:08:16 7 13112121212 wangwu
2016-10-25 14:56:16 8 13112121212 Joan
2016-10-25 15:25:16 9 13112121212 White
2016-10-25 17:24:16 0 NULL lihhh
Time taken: 0.072 seconds, Fetched: 10 row(s)

As shown above, we have successfully queried the Elasticsearch data through Hive. If you hit the following exception while querying Elasticsearch through Hive:

Failed with exception java.io.IOException:org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

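This is most likely because the es.nodes or es.port property is misconfigured, so Hive cannot reach the Elasticsearch cluster.

In the example above, we gave the Hive columns the same names as the Elasticsearch fields for convenience. In practice the names often cannot be kept identical, and in that case the table must be created with some extra settings, otherwise errors will occur. The es.mapping.names parameter handles this: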

hive (iteblog)> create EXTERNAL table `user`(
              > register_time string,
              > user_id int,
              > mobile string,
              > username string
              > )
              > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
              > TBLPROPERTIES('es.resource' = 'iteblog/user', 'es.nodes'='www.iteblog.com', 'es.port'='9200', 'es.nodes.wan.only'='true','es.mapping.names'='register_time:regtime,user_id:uid');

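With this setting, the Hive column register_time is mapped to the Elasticsearch field regtime, and user_id is mapped to uid. Columns that are not listed, such as mobile and username, keep their own names.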
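To make the format of es.mapping.names concrete, here is a small sketch that parses the value used above and resolves each Hive column to its Elasticsearch field. Both functions are hypothetical helpers for illustration only; they mimic the name lookup and are not the connector's actual code.

```python
def parse_mapping_names(spec):
    """Hypothetical helper: parse 'hive_col:es_field,...' into a dict."""
    mapping = {}
    for pair in spec.split(","):
        hive_column, es_field = pair.split(":", 1)
        mapping[hive_column.strip()] = es_field.strip()
    return mapping


def es_field_for(hive_column, mapping):
    """Return the mapped Elasticsearch field, or the column name if unmapped."""
    return mapping.get(hive_column, hive_column)


if __name__ == "__main__":
    names = parse_mapping_names("register_time:regtime,user_id:uid")
    for column in ["register_time", "user_id", "mobile", "username"]:
        print(column, "->", es_field_for(column, names))
```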

When creating the Hive table, we can also specify es.query to restrict which documents are read:

hive (iteblog)> create EXTERNAL table `user`(
              > regtime string,
              > uid int,
              > mobile string,
              > username string
              > )
              > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
              > TBLPROPERTIES('es.resource' = 'iteblog/user', 'es.nodes'='www.iteblog.com', 'es.port'='9200', 'es.nodes.wan.only'='true','es.query' = '?q=uid:2');

With this setting, queries on the table return only the documents whose uid is 2. Let's check the result:

hive (iteblog)> select * from `user` limit 10;
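The value '?q=uid:2' is a URI query. According to the ES-Hadoop configuration documentation, es.query also accepts a query DSL document (a value that starts with { and ends with }) and otherwise treats the value as a path to an external resource. The sketch below shows those three forms and rewrites the simple URI query as a query_string query. Both function names are made up for this example, and the rewrite only covers the q parameter.

```python
import json
from urllib.parse import parse_qs


def classify_es_query(value):
    """Hypothetical helper: name the form of an es.query value."""
    text = value.strip()
    if text.startswith("?"):
        return "uri"
    if text.startswith("{") and text.endswith("}"):
        return "dsl"
    return "resource"


def uri_to_dsl(uri_query):
    """Wrap the q parameter of a URI query in a query_string query."""
    params = parse_qs(uri_query.lstrip("?"))
    return json.dumps({"query": {"query_string": {"query": params["q"][0]}}})


if __name__ == "__main__":
    print(classify_es_query("?q=uid:2"))
    print(uri_to_dsl("?q=uid:2"))
```

The DSL form is handy once the filter grows beyond what fits comfortably in a URI query string.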