HBase Study Notes (11): Importing Data into HBase with Hive

Date: 2019-11-21

There are many ways to import data into HBase. One is a MapReduce job that writes through the TableOutputFormat class, but that is not the most efficient option.

The bulk approach generates HFiles directly and writes them to the file system, which is very efficient.

The usual procedure has two steps:

(1) Generate the HFiles with ImportTsv, with the import tool, or with your own program via Hive/Pig.

(2) Use completebulkload to load the HFiles into HBase.

ImportTsv makes it easy to import tab-separated data into HBase, but a lot of data is not tab-separated. The rest of this post shows how to use Hive to import data into HBase.

1. Prepare the input files
a. Create a tables.ddl file

-- pagecounts data comes from http://dumps.wikimedia.org/other/pagecounts-raw/
-- documented http://www.mediawiki.org/wiki/Analytics/Wikistats
-- define an external table over raw pagecounts data
CREATE TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/wikistats';
-- create a view, building a custom hbase rowkey
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
SELECT concat_ws('/',
         projectcode,
         concat_ws('/',
           pagename,
           regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
       pageviews, bytes
FROM pagecounts;
-- create a table to hold the input split partitions
CREATE EXTERNAL TABLE IF NOT EXISTS hbase_splits(partition STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION '/tmp/hbase_splits_out';
-- create a location to store the resulting HFiles
CREATE TABLE hbase_hfiles(rowkey STRING, pageviews STRING, bytes STRING)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES('hfile.family.path' = '/tmp/hbase_hfiles/w');

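As a rough illustration of what the pgc view computes: the rowkey is projectcode, pagename, and the timestamp taken from the input file name, joined with '/'. The shell sketch below mimics that outside Hive. The function name make_rowkey is made up for this example, and sed's [0-9] stands in for the \d used in the Hive regex.

```shell
# Hypothetical helper (not part of the tutorial): rebuild the pgc rowkey
# from a project code, a page name, and the input file name.
make_rowkey() {
  project=$1
  page=$2
  file=$3
  # Pull the yyyymmdd-hhmmss stamp out of e.g. pagecounts-20081001-000000.gz
  stamp=$(printf '%s\n' "$file" |
    sed -nE 's/.*pagecounts-([0-9]{8}-[0-9]{6})\..*$/\1/p')
  printf '%s/%s/%s\n' "$project" "$page" "$stamp"
}

make_rowkey aa Main_Page /tmp/wikistats/pagecounts-20081001-000000.gz
# prints aa/Main_Page/20081001-000000
```

Putting the project code first keeps all pages of one project adjacent in HBase, and the trailing timestamp keeps hourly snapshots of the same page next to each other.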

b. Create the script that generates the HFile split points, for example sample.hql

-- prepare range partitioning of hfiles
ADD JAR /usr/lib/hive/lib/hive-contrib-0.11.0.1.3.0.0-104.jar;
SET mapred.reduce.tasks=1;
CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
-- input file contains ~4mm records. Sample it so as to produce 5 input splits.
INSERT OVERWRITE TABLE hbase_splits
SELECT rowkey FROM
  (SELECT rowkey, row_seq() AS seq FROM pgc
   TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rowkey) s
   ORDER BY rowkey
   LIMIT 400) x
WHERE (seq % 100) = 0
ORDER BY rowkey
LIMIT 4;
-- after this is finished, combine the splits file:
dfs -cp /tmp/hbase_splits_out/* /tmp/hbase_splits;

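The sampling query boils down to this: sort a sample of rowkeys, keep at most 400 of them, and take every 100th one, which yields 4 boundary keys and therefore 5 key ranges. A minimal sketch of that selection with standard Unix tools follows. It assumes row_seq() numbers the rows in sorted order, leaves out the TABLESAMPLE step, and uses a hypothetical function name, pick_splits.

```shell
# Hypothetical helper: read candidate rowkeys on stdin, print split points.
pick_splits() {
  LC_ALL=C sort |        # ORDER BY rowkey (byte order)
    head -n 400 |        # LIMIT 400
    awk 'NR % 100 == 0' |  # WHERE (seq % 100) = 0
    head -n 4            # LIMIT 4
}

# Demo with 400 synthetic keys fed in reverse order.
awk 'BEGIN { for (i = 400; i >= 1; i--) printf "key%04d\n", i }' | pick_splits
# prints key0100, key0200, key0300, key0400 (one per line)
```

Four boundaries are what a job with 5 reducers needs, which matches the mapred.reduce.tasks=5 setting used when the HFiles are generated.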

c. Create hfiles.hql

ADD JAR /usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-104.jar;
SET mapred.reduce.tasks=5;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path=/tmp/hbase_splits;
-- generate hfiles using the splits ranges
INSERT OVERWRITE TABLE hbase_hfiles
SELECT * FROM pgc
CLUSTER BY rowkey;

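With TotalOrderPartitioner, each of the 5 reducers receives one contiguous rowkey range bounded by the 4 split keys, so every HFile covers a disjoint, sorted slice. The sketch below shows the idea in plain shell: a key's partition is, roughly, the number of split keys that are less than or equal to it. It is an illustration only. It compares raw strings in the C locale rather than the BinarySortable-encoded keys Hive actually uses, and partition_for is a hypothetical name.

```shell
# Hypothetical helper: partition_for KEY SPLIT1 SPLIT2 ...
# Prints the zero-based partition index for KEY.
partition_for() {
  key=$1
  shift
  # No split points means a single partition.
  if [ "$#" -eq 0 ]; then
    echo 0
    return
  fi
  # Count split points <= key; appending "" forces string comparison in awk.
  printf '%s\n' "$@" |
    LC_ALL=C KEY="$key" awk '(ENVIRON["KEY"] "") >= ($0 "") { n++ } END { print n + 0 }'
}

partition_for c b d f h   # prints 1
```

With the four split keys b, d, f, h, a key below b lands in partition 0 and a key at or above h lands in partition 4, giving the five ranges the job expects.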

2. Import the data

Note: /$Path_to_Input_Files_on_Hive_Client is the data directory on the Hive client.

mkdir /$Path_to_Input_Files_on_Hive_Client/wikistats
wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
hadoop fs -mkdir /$Path_to_Input_Files_on_Hive_Client/wikistats
hadoop fs -put pagecounts-20081001-000000.gz /$Path_to_Input_Files_on_Hive_Client/wikistats/

3. Create the necessary tables

Note: $HCATALOG_USER is the HCatalog service user (hcat by default).

$HCATALOG_USER -f /$Path_to_Input_Files_on_Hive_Client/tables.ddl

After running it, you should see output like the following:

OK
Time taken: 1.886 seconds
OK
Time taken: 0.654 seconds
OK
Time taken: 0.047 seconds
OK
Time taken: 0.115 seconds

4. Verify that the tables were created correctly

Run the following statement:

$HIVE_USER -e "select * from pagecounts limit 10;"

After running it, you should see output like the following:

...
OK
aa Main_Page 4 41431
aa Special:ListUsers 1 5555
aa Special:Listusers 1 1052

Then run:

$HIVE_USER -e "select * from pgc limit 10;"

After running it, you should see output like the following:

...
OK
aa/Main_Page/20081001-000000 4 41431
aa/Special:ListUsers/20081001-000000 1 5555
aa/Special:Listusers/20081001-000000 1 1052
...

5. Generate the HFile split file

$HIVE_USER -f /$Path_to_Input_Files_on_Hive_Client/sample.hql
hadoop fs -ls /$Path_to_Input_Files_on_Hive_Client/hbase_splits

To verify the result, run the following command:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.0.1.3.0.0-104.jar \
  -libjars /usr/lib/hive/lib/hive-exec-0.11.0.1.3.0.0-104.jar \
  -input /tmp/hbase_splits -output /tmp/hbase_splits_txt \
  -inputformat SequenceFileAsTextInputFormat

After running it, you should see output like the following:

...
INFO streaming.StreamJob: Output: /tmp/hbase_splits_txt

Then run this command:

hadoop fs -cat /tmp/hbase_splits_txt/*

After running it, you should see a result similar to this:

01 61 66 2e 71 2f 4d 61 69 6e 5f 50 61 67 65 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
01 61 66 2f 31 35 35 30 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
01 61 66 2f 32 38 5f 4d 61 61 72 74 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
01 61 66 2f 42 65 65 6c 64 3a 31 30 30 5f 31 38 33 30 2e 4a 50 47 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
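
The hex dump is easier to check once decoded. In these rows the leading 01 byte and the trailing 00 byte appear to be the non-null marker and the terminator that BinarySortableSerDe writes around a string, and the bytes in between are the rowkey in ASCII. Below is a small sketch for decoding one row, assuming plain ASCII keys with no escaped bytes; decode_split is a hypothetical helper, not a Hadoop tool.

```shell
# Hypothetical helper: decode_split HEXBYTE...
# Pass the hex bytes of one row as arguments, without the "(null)" value.
decode_split() {
  out=''
  for h in "$@"; do
    case $h in
      00|01) continue ;;  # skip the marker and terminator bytes
    esac
    oct=$(printf '%03o' "0x$h")            # hex byte -> 3-digit octal
    out="$out$(printf '%b' "\\0$oct")"     # octal escape -> character
  done
  printf '%s\n' "$out"
}

decode_split 01 61 66 2f 31 35 35 30 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00
# prints af/1550/20081001-000000
```

Each decoded value should look like a rowkey from the pgc view, and the rows should appear in ascending order.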

6. Generate the HFiles

HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar hive -f /$Path_to_Input_Files_on_Hive_Client/hfiles.hql

The steps above follow the approach recommended in the HDP user manual. I also looked up the command format for the final step online:

hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable
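
Applied to this walkthrough, the directory to pass is presumably /tmp/hbase_hfiles, the parent of the w family directory named in hfile.family.path. The post never names the target HBase table, so pagecounts below is a placeholder, and the jar path is simply the one used earlier in hfiles.hql. The sketch only prints the command so it can be reviewed before it is run by hand.

```shell
# Hypothetical helper: print (do not run) the completebulkload command.
# Arguments: HBase jar, HFile output directory, target table name.
bulkload_cmd() {
  jar=$1
  dir=$2
  table=$3
  printf 'hadoop jar %s completebulkload %s %s\n' "$jar" "$dir" "$table"
}

# "pagecounts" is an assumed table name, not taken from the tutorial.
bulkload_cmd /usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar /tmp/hbase_hfiles pagecounts
```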
