Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to database tables and provides full SQL-style query capability, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce-style statistics can be expressed quickly in SQL-like statements, with no need to develop dedicated MapReduce applications, which makes it well suited to the statistical analysis typical of data warehouses.
The Hive–HBase integration works by having the two systems talk to each other through their public APIs, with the communication handled mainly by the hive-hbase-handler.jar utility class; the rough idea is shown in the figure.
After the MapR framework is installed, install and configure HBase and Hive.
The MapR framework is installed at /opt/mapr
HBase is installed at /opt/mapr/hbase/hbase-0.90.4
Hive is installed at /opt/mapr/hive/hive-0.7.1
The steps to integrate Hive with HBase are as follows:
1. Copy /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar and /opt/mapr/hbase/hbase-0.90.4/lib/zookeeper-3.3.2.jar into /opt/mapr/hive/hive-0.7.1/lib.
Note: if other versions of these two jars already exist under hive/lib (for example zookeeper-3.3.1.jar), delete them and use the versions shipped with HBase.
2. Edit hive-site.xml under hive/conf and append the following at the bottom:
<property>
<name>hive.querylog.location</name>
<value>/opt/mapr/hive/hive-0.7.1/logs</value>
</property>
<property>
<name>hive.aux.jars.path</name> <value>file:///opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,file:///opt/mapr/hive/hive-0.7.1/lib/hbase-0.90.4.jar,file:///opt/mapr/hive/hive-0.7.1/lib/zookeeper-3.3.2.jar</value>
</property>
Note: if hive-site.xml does not exist, create it yourself, or rename the hive-default.xml.template file and use that.
3. Copy hbase-0.90.4.jar into hadoop/lib on every Hadoop node, including the master.
4. Copy hbase-site.xml from hbase/conf into hadoop/conf on every Hadoop node, including the master.
Note: if steps 3 and 4 are skipped, running Hive will very likely fail with an error such as:
org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.
This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and
then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher
5. Start Hive.
Single-node startup:
bin/hive -hiveconf hbase.master=master:60000
Cluster startup:
bin/hive -hiveconf hbase.zookeeper.quorum=node1,node2,node3 (list all of the ZooKeeper nodes)
If hive.aux.jars.path is not configured in hive-site.xml, Hive can instead be started as follows:
hive --auxpath /opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,/opt/mapr/hive/hive-0.7.1/lib/hbase-0.90.4.jar,/opt/mapr/hive/hive-0.7.1/lib/zookeeper-3.3.2.jar -hiveconf hbase.master=localhost:60000
Testing also showed that after adding the following to hive-site.xml:
<property>
<name>hive.zookeeper.quorum</name>
<value>node1,node2,node3</value>
<description>The list of zookeeper servers to talk to. This is only needed for read/write locks.</description>
</property>
Hive can work with HBase without passing any extra startup parameters.
6. Test the integration after startup.
(1) Create a table that HBase can recognize:
CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz");
hbase.table.name sets the table name in HBase. With multiple columns the mapping looks like data:1,data:2; with multiple column families, data1:1,data2:1.
hbase.columns.mapping defines the mapping to HBase column families. The :key entry is fixed and maps to the row key, so the corresponding source column (foo in the pokes table below) must contain unique values.
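As a sketch of the multi-family case described above (the table and column names here are illustrative, not from the article), a mapping across two column families would look like:

```sql
-- Hypothetical DDL: the row key plus one column in each of two column
-- families (data1 and data2), per the mapping syntax described above.
CREATE TABLE hbase_table_2(key int, v1 string, v2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data1:1,data2:1")
TBLPROPERTIES ("hbase.table.name" = "xyz2");
```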
Creating a partitioned table:
CREATE TABLE hbase_table_1(key int, value string) partitioned by (day string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz");
Altering such tables is not supported;
Hive reports that non-native tables cannot be altered:
hive> ALTER TABLE hbase_table_1 ADD PARTITION (day = '2012-09-22');
FAILED: Error in metadata: Cannot use ALTER TABLE on a non-native table
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
(2) Import data with SQL.
Create a regular Hive table:
create table pokes(foo int,bar string)row format delimited fields terminated by ',';
Bulk-load data into it:
load data local inpath '/home/1.txt' overwrite into table pokes;
The contents of 1.txt are:
1,hello
2,pear
3,world
Import into hbase_table_1 with SQL:
SET hive.hbase.bulk=true;
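For the non-partitioned hbase_table_1 created in step (1), the insert itself would presumably be the plain form (the article only shows the partitioned variant):

```sql
-- Sketch: load every row of pokes into the HBase-backed table; foo becomes
-- the row key and bar the cf1:val column, per the earlier column mapping.
INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes;
```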
Import into the partitioned table:
insert overwrite table hbase_table_1 partition (day='2012-01-01') select * from pokes;
(3) Inspect the data:
hive> select * from hbase_table_1;
OK
1 hello
2 pear
3 world
(Note: partitioned tables integrated with HBase have a quirk: select * from table returns no data, while select key,value from table does.)
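Applying the workaround from the note above to the partitioned table, the explicit-column query would look like this (the day value matches the partition inserted earlier):

```sql
-- Sketch: list the columns instead of using *, optionally filtering on the
-- partition column.
SELECT key, value FROM hbase_table_1 WHERE day = '2012-01-01';
```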
(4) Log in to HBase to inspect the data:
hbase shell
hbase(main):002:0> describe 'xyz'
DESCRIPTION                                                             ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE',    true
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 0.0830 seconds
hbase(main):003:0> scan 'xyz'
ROW COLUMN+CELL
1 column=cf1:val, timestamp=1331002501432, value=hello
2 column=cf1:val, timestamp=1331002501432, value=pear
3 column=cf1:val, timestamp=1331002501432, value=world
The data just inserted from Hive is now visible in HBase.
7. For tables that already exist in HBase, use CREATE EXTERNAL TABLE in Hive.
For example, for an HBase table named test1 with column families a:, b:, and c:, the Hive DDL is:
create external table hive_test (key int,gid map<string,string>,sid map<string,string>,uid map<string,string>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" ="a:,b:,c:") TBLPROPERTIES ("hbase.table.name" = "test1");
Once the table is created in Hive, query the contents of the HBase table test1:
select * from hive_test;
OK
1 {"":"qqq"} {"":"aaa"} {"":"bbb"}
2 {"":"qqq"} {} {"":"bbb"}
To query the values inside the gid map column:
select gid[''] from hive_test;
which returns:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201203052222_0017, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201203052222_0017
Kill Command = /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201203052222_0017
2012-03-06 14:38:29,141 Stage-1 map = 0%, reduce = 0%
2012-03-06 14:38:33,171 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201203052222_0017
OK
qqq
qqq
If the HBase table test1 instead has the columns user:gid, user:sid, info:uid, and info:level, the Hive DDL is:
create external table hive_test(key int,user map<string,string>,info map<string,string>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" ="user:,info:") TBLPROPERTIES ("hbase.table.name" = "test1");
and the query against the HBase table becomes:
select user['gid'] from hive_test;
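Several qualifiers can be pulled from both map columns in a single statement; a sketch against the hive_test definition above:

```sql
-- Hypothetical query: fetch specific qualifiers from the user and info
-- column-family maps in one pass.
SELECT key, user['gid'], user['sid'], info['uid'], info['level']
FROM hive_test;
```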
Note: to speed up Hive's access to HBase, add the following to the hbase-site.xml file under HADOOP_HOME/conf:
<property>
<name>hbase.client.scanner.caching</name>
<value>10000</value>
</property>
Alternatively, run hive> set hbase.client.scanner.caching=10000; before executing the Hive statement.