Implementing HBase Secondary Indexes with Solr on CDH

Date: 2019-11-11

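1. Why use Solr for secondary indexing
2. Real-time query architecture
3. Deployment procedure
3.1 Install HBase and Solr
3.2 Enable HBase replication
3.3 Create the SolrCloud collection
3.4 Create the Lily HBase Indexer configuration
3.5 Create the Morphline configuration file
3.6 Register the Lily HBase Indexer configuration with the Lily HBase Indexer Service
3.7 Sync data
3.8 Batch-index existing data
3.9 Configure multiple indexers
4. Inserting, updating, deleting, and querying data
4.1 Insert
4.2 Update
4.3 Delete
4.4 Summary
5. Additional commands
6. FAQ
6.1 Creating an indexer fails because the indexer already exists
6.2 Creating an indexer fails
6.3 Batch indexing with the bundled indexer tool fails: morphlines.conf not found
6.4 Batch indexing with the bundled indexer tool fails: solrconfig.xml not found
6.5 Batch indexing with the bundled indexer tool fails: Java heap space
6.6 HBase Indexer exits shortly after starting
6.7 Data synced by HBase Indexer is inconsistent with Solr
6.8 After setting read-row="never" to work around 6.7, some fields are lost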

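1. Why use Solr for secondary indexing

In HBase, a table's RowKeys are sorted lexicographically, and regions are sharded at split points defined on the RowKey. This gives HBase a global, distributed index, which is the main reason for its success.

However, retrieving data by RowKey alone no longer meets many requirements, and querying has become HBase's bottleneck. People want to retrieve data as quickly as they can with SQL, but HBase was designed for storing large tables. Running such queries usually means a full-table MapReduce computation through systems such as Hive or Pig, which wastes compute resources and adds enough latency to hurt applications. HBase secondary indexing solutions emerged to address this.

Solr

Solr is a standalone enterprise search server and the open-source enterprise search platform of the Apache Lucene project.

Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and handling of rich documents (such as Word and PDF). Solr is highly scalable and provides distributed search and index replication. Solr 4 also added NoSQL support and SolrCloud, a ZooKeeper-based distributed extension. For an introduction to SolrCloud, see "SolrCloud distributed deployment". Its main characteristics include efficient, flexible caching and vertical search. Solr is a high-performance full-text search server written in Java 5 and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface.

Solr can highlight search results, improves availability through index replication, provides a powerful data schema for defining fields, types, and text analysis, and offers a web-based administration interface.

Key-Value Store Indexer

This component is critical: it is the intermediary that builds Solr indexes from HBase data.

In CDH 5.3.2, the Key-Value Store Indexer uses the Lily HBase NRT Indexer service.

Lily HBase Indexer is a flexible, scalable, fault-tolerant, transactional, near-real-time distributed service for indexing HBase column data. It is part of the Lily system developed by NGDATA and is open source. Lily HBase Indexer stores the HBase index data in SolrCloud. When HBase performs a write, update, or delete, the Indexer uses HBase replication to turn the operation into a series of events, and uses those events to keep the HBase index data in Solr consistent. The Indexer also supports user-defined extraction and transformation rules for indexing HBase column data. Solr search results contain the user-defined columnfamily:qualifier fields, so applications can access the HBase column data directly. Indexing and searching do not affect the stability of HBase or its write throughput, because they run completely separately from HBase and asynchronously. On CDH 5, Lily HBase Indexer depends on the HBase, SolrCloud, and ZooKeeper services.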

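2. Real-time query architecture

HBase --> Key-Value Store Indexer --> Solr --> real-time query and display in the web front end

1. HBase stores the large volume of data

2. Solr builds and queries the index

3. The Key-Value Store Indexer builds the index automatically (from HBase to Solr)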

3. Deployment procedure

3.1 Install HBase and Solr

An HBase instance

A Key-Value Store Indexer instance (installed under /opt/cloudera/parcels/CDH/lib/hbase-solr)

A Solr instance

3.2 Enable HBase replication

Installing the Key-Value Store Indexer turns on HBase replication by default.

Next, the HBase tables have to be modified.
For a newly created table, use:

create 'table',{NAME =>'cf', REPLICATION_SCOPE =>1}
# 1 enables replication and 0 disables it; the default is 0

For an existing table, use:

disable 'table'
alter 'table',{NAME =>'cf', REPLICATION_SCOPE =>1}
enable 'table'

Here, for testing, I create a new table named HBase_Indexer_Test:

create 'HBase_Indexer_Test',{NAME => 'cf1', REPLICATION_SCOPE => 1}
and insert two rows:

put 'HBase_Indexer_Test','001','cf1:name','xiaoming'
put 'HBase_Indexer_Test','002','cf1:name','xiaohua'

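3.3 Create the SolrCloud collection

Next, run the following on a machine where Solr is installed.
The path and user name used here are up to you.

# Generate the instance configuration directory:
solrctl instancedir --generate $HOME/hbase-indexer/bqjr

This creates the folder hbase-indexer/bqjr under the home directory. It contains a conf folder, in which we edit the schema.xml file.
We add a new field:

<field name="HBase_Indexer_Test_cf1_name" type="string" indexed="true" stored="true"/>

The name attribute deserves a closer look. It corresponds to the outputField property in the morphlines.conf file that we edit later, so it can be thought of as the HBase value to be indexed. We therefore recommend building it from the table name, the column family, and the column qualifier. The mapping is as follows:

HBase	Solr
name	HBase_Indexer_Test_cf1_name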
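
The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper names solr_field_name and schema_field are hypothetical, and the fixed indexed="true" stored="true" attributes simply mirror the field shown above.

```python
from xml.sax.saxutils import quoteattr


def solr_field_name(table, family, qualifier):
    """Join table, column family and qualifier into one Solr field name."""
    return f"{table}_{family}_{qualifier}"


def schema_field(table, family, qualifier, field_type="string"):
    """Render a schema.xml <field> element for one HBase column."""
    name = solr_field_name(table, family, qualifier)
    return (
        f"<field name={quoteattr(name)} type={quoteattr(field_type)} "
        'indexed="true" stored="true"/>'
    )


print(schema_field("HBase_Indexer_Test", "cf1", "name"))
```

Generating the field names this way keeps schema.xml and the outputField values in morphlines.conf in step when many columns are indexed.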

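Next, edit solrconfig.xml and enable hard commits (this costs some performance).

# Create the instance directory and upload the configuration files to ZooKeeper:
solrctl instancedir --create bqjr $HOME/hbase-indexer/bqjr
# Once uploaded, the other nodes can download the configuration from ZooKeeper. Next, create the collection:
solrctl collection --create bqjr

To spread the data across nodes for storage and retrieval, create multiple shards with the following command:

solrctl collection --create bqjr -s 7 -r 3 -m 21

Here -s sets the number of shards to 7, -r sets the number of replicas to 3, and -m sets the maximum number of shards per node (21 = 7*3).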
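
The arithmetic behind these three options can be checked with a short sketch. SolrCloud has to place shards * replicas cores in total, and the per-node maximum multiplied by the number of live nodes must be at least that large. The function name shard_layout and the node counts used below are assumptions for illustration; the article does not state how many Solr nodes the cluster has.

```python
import math


def shard_layout(num_shards, replication_factor, num_nodes):
    """Return (total cores to place, smallest workable per-node maximum)."""
    total_cores = num_shards * replication_factor
    min_per_node = math.ceil(total_cores / num_nodes)
    return total_cores, min_per_node


# -s 7 -r 3 on a hypothetical 3-node cluster
total, per_node = shard_layout(7, 3, 3)
print(f"cores to place: {total}, minimum -m value: {per_node}")
```

Setting -m to 21, as in the command above, is the most permissive choice: it would let all 21 cores land on a single node.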

3.4 Create the Lily HBase Indexer configuration

In the $HOME/hbase-indexer/bqjr directory defined earlier, create a file named morphline-hbase-mapper.xml with the following content:

<?xml version="1.0"?>
<indexer table="HBase_Indexer_Test" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper" read-row="never">
<param name="morphlineFile" value="morphlines.conf"/>
<param name="morphlineId" value="bqjrMap"/>
</indexer>

Where:
indexer table="HBase_Indexer_Test": the table attribute names the HBase table HBase_Indexer_Test.
param name="morphlineId" value="bqjrMap": the value corresponds to the id of a morphline in morphlines.conf.
read-row="never": see 6.7, Data synced by HBase Indexer is inconsistent with Solr.

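3.5 Create the Morphline configuration file

In Cloudera Manager (CM), open the configuration page of the Key-Value Store Indexer. It contains a Morphlines file setting, which we edit.
Each collection corresponds to one morphline-hbase-mapper.xml.

SOLR_LOCATOR : {
  # Name of solr collection
  collection : bqjr
  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}
# Note: SOLR_LOCATOR can name only one collection. What if we need several? This is covered later, in 3.9.
morphlines : [
  {
    id : bqjrMap
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf1:name"
              outputField : "HBase_Indexer_Test_cf1_name"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

Where:

id: the ID of this morphline.

importCommands: the packages from which commands are imported.

extractHBaseCells: this command reads HBase cell data and writes it into a SolrInputDocument object. It must contain zero or more mappings objects.

mappings: specifies the field mappings for HBase column qualifiers.

inputColumn: the HBase column to be written to Solr. The value consists of the column family and the column qualifier, separated by ':'. The qualifier may use the wildcard '*'. For example, data:* reads every column in the data family, and data:my* reads the columns in the data family whose qualifier starts with my.

outputField: the name of the output field for the records read by the morphline. It must match the field name in Solr's schema.xml; otherwise the data is not written correctly.

type: the data type of the HBase value being read. HBase stores all data as byte[], but Solr indexes everything as text, so something has to convert the byte[] into the actual data type. That is the job of the type parameter. The supported types are byte, int, long, string, boolean, float, double, short, and bigdecimal. You can also use a custom type by implementing the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.

source: which part of the HBase KeyValue is used as index input, either 'value' or 'qualifier'. With value, the HBase cell value is indexed; with qualifier, the HBase column qualifier is indexed.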
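
To see why the type parameter matters, here is a minimal Python sketch of the byte[] to value conversion. It is not the indexer's Java implementation. It assumes the cells were written with HBase's Bytes.toBytes(...), which stores numbers in big-endian byte order and strings as UTF-8, and the names DECODERS and decode_cell are hypothetical.

```python
import struct

# Decoders for cell values written with HBase's Bytes.toBytes(...):
# numbers are big-endian, strings are UTF-8.
DECODERS = {
    "int": lambda raw: struct.unpack(">i", raw)[0],
    "long": lambda raw: struct.unpack(">q", raw)[0],
    "double": lambda raw: struct.unpack(">d", raw)[0],
    "string": lambda raw: raw.decode("utf-8"),
}


def decode_cell(raw, type_name):
    """Turn the raw bytes of one HBase cell into a Python value."""
    return DECODERS[type_name](raw)


four_bytes = struct.pack(">i", 42)            # how an int cell is laid out
print(decode_cell(four_bytes, "int"))           # the intended number
print(repr(decode_cell(four_bytes, "string")))  # same bytes, wrong type
print(decode_cell(b"xiaoming", "string"))
```

The same four bytes yield a usable number with type int and an unreadable string with type string, which is why the mapping has to declare the type that was used when the cell was written.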

3.6 Register the Lily HBase Indexer configuration with the Lily HBase Indexer Service

Once the Lily HBase Indexer XML configuration is satisfactory, register it with the Lily HBase Indexer Service. This is done by uploading the XML configuration file to ZooKeeper for the given SolrCloud collection.

hbase-indexer add-indexer \
--name bqjrIndexer \
--indexer-conf $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=bqjr \
--zookeeper bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181

Then run hbase-indexer list-indexers to check that the indexer was added successfully.

3.7 Sync data

put 'HBase_Indexer_Test','003','cf1:name','xiaofang'
put 'HBase_Indexer_Test','004','cf1:name','xiaogang'

In the Solr query page, enter HBase_Indexer_Test_cf1_name:xiaogang in the q box to see the corresponding HBase rowkey.

We can also use *:* to query all data.
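
The same query can be issued outside the admin page through Solr's select handler. The sketch below only builds the request URL and makes no network call. The base URL http://localhost:8983/solr is an assumption (substitute the address of your own Solr node), and solr_select_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, field, value, rows=10):
    """Build a select URL that searches one field of a collection."""
    params = urlencode({"q": f"{field}:{value}", "rows": rows, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


# Hypothetical Solr address; bqjr is the collection created in 3.3.
url = solr_select_url(
    "http://localhost:8983/solr", "bqjr",
    "HBase_Indexer_Test_cf1_name", "xiaogang",
)
print(url)
```

Fetching this URL with any HTTP client returns the matching documents as JSON, and the document id is the HBase rowkey to look up.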

3.8 Batch-index existing data

Looking closely at 3.7 reveals a problem: only the rows inserted afterwards were indexed. What about the data that already existed in HBase?

The directory where the command is run must contain a morphlines.conf file. Locate it with:
find / | grep 'morphlines.conf$'

Usually we pick the most recent process directory.
Either change into the directory that contains
/opt/cm-5.7.0/run/cloudera-scm-agent/process/1386-ks_indexer-HBASE_INDEXER/morphlines.conf
or add the option
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf
and then run the following command:

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.7.0-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--zk-host bqbpm1.bqjr.cn:2181,bqbps1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--go-live

The job may report that solrconfig.xml cannot be found. This puzzled me for a long time; in the end, adding --reducers 0 made it work.

The modified command:

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--reducers 0 \
--go-live

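3.9 Configure multiple indexers

Each HBase table gets its own Solr collection, and each index corresponds to one Lily HBase Indexer configuration file (morphline-hbase-mapper.xml) and one morphline in morphlines.conf. The morphlines.conf file is managed through the Key-Value Store Indexer console in CDH, and the morphlines in it are distinguished by id.
However, CDH does not let us configure more than one morphlines.conf file, so how do we associate each indexer with its collection?
Recall that the collection is specified explicitly when an indexer is added, for example --connection-param solr.collection=bqjr.
So morphlines.conf can simply be written like this: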

SOLR_LOCATOR : {
  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}
morphlines : [
  {
    id : XDGL_ACCT_FEE_Map
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf1:ETL_IN_DT"
              outputField : "XDGL_ACCT_FEE_cf1_ETL_IN_DT"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  },
  {
    id : XDGL_ACCT_PAYMENT_LOG_Map
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf1:ETL_IN_DT"
              outputField : "XDGL_ACCT_PAYMENT_LOG_cf1_ETL_IN_DT"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
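
Because the morphlines differ only in the table name, a file like the one above can be generated rather than typed by hand. The following is a rough sketch under two assumptions: every table indexes the same family:qualifier column, and the logDebug command is left out for brevity. MORPHLINE_TEMPLATE and render_morphlines are hypothetical names, not part of any CDH tool.

```python
# One morphline per table; doubled braces are literal braces for str.format.
MORPHLINE_TEMPLATE = """  {{
    id : {table}_Map
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {{
        extractHBaseCells {{
          mappings : [
            {{
              inputColumn : "{family}:{qualifier}"
              outputField : "{table}_{family}_{qualifier}"
              type : string
              source : value
            }}
          ]
        }}
      }}
    ]
  }}"""


def render_morphlines(tables, family, qualifier):
    """Render a morphlines.conf text with one morphline per HBase table."""
    blocks = [
        MORPHLINE_TEMPLATE.format(table=t, family=family, qualifier=qualifier)
        for t in tables
    ]
    header = 'SOLR_LOCATOR : {\n  zkHost : "$ZK_HOST"\n}\n'
    return header + "morphlines : [\n" + ",\n".join(blocks) + "\n]\n"


conf = render_morphlines(
    ["XDGL_ACCT_FEE", "XDGL_ACCT_PAYMENT_LOG"], "cf1", "ETL_IN_DT"
)
print(conf)
```

Each generated id (for example XDGL_ACCT_FEE_Map) is the value to put in the morphlineId parameter of that table's morphline-hbase-mapper.xml.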

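4. Inserting, updating, deleting, and querying data

4.1 Insert

put 'HBase_Indexer_Test','005','cf1:name','bob'

A new index entry with the name bob appears in Solr.

4.2 Update

put 'HBase_Indexer_Test','005','cf1:name','Ash'

We change bob to Ash, and a few seconds later Solr is updated as well.

4.3 Delete

deleteall 'HBase_Indexer_Test','005'

We delete row 005, which we just inserted, and Solr removes the corresponding index entry as well.

4.4 Summary

For indexes synced to Solr through Lily HBase Indexer, inserts, updates, and deletes are propagated automatically, with no action required on our side. This is very convenient.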

5. Additional commands

# solrctl
solrctl instancedir --list
solrctl collection --list
# Update the collection configuration
solrctl instancedir --update User $HOME/hbase-indexer/User
solrctl collection --reload User
# Delete the instancedir
solrctl instancedir --delete User
# Delete the collection
solrctl collection --delete User
# Delete all documents in the collection
solrctl collection --deletedocs User
# Delete the User configuration directory
rm -rf $HOME/hbase-indexer/User
# hbase-indexer
# If morphline-hbase-mapper.xml has changed, update the indexer
hbase-indexer update-indexer -n userIndexer
# Delete the indexer
hbase-indexer delete-indexer -n userIndexer
# List the indexers
hbase-indexer list-indexers

6. FAQ

6.1 Creating an indexer fails because the indexer already exists

After running hbase-indexer add-indexer, it turns out that the indexer already exists.

Use hbase-indexer delete-indexer --name $IndxerName to delete the existing indexer.

6.2 Creating an indexer fails

Use hbase-indexer list-indexers to check whether the indexer was created.

The output shows that creation failed. The cause is that I had specified only one ZooKeeper node.
Incorrect example:

hbase-indexer add-indexer \
--name bqjrIndexer \
--indexer-conf $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=bqjr \
--zookeeper bqbpm2.bqjr.cn:2181

Correct example:

hbase-indexer add-indexer \
--name bqjrIndexer \
--indexer-conf $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=bqjr \
--zookeeper bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181

Run hbase-indexer list-indexers again to check. This time it succeeds.

6.3 Batch indexing with the bundled indexer tool fails: morphlines.conf not found

First, the command must specify the paths of both morphlines.conf and morphline-hbase-mapper.xml. Run:
find / | grep 'morphlines.conf$'

Usually we pick the most recent process directory, then either copy the file or pass its path as an option.
Either change into the directory that contains
/opt/cm-5.7.0/run/cloudera-scm-agent/process/1386-ks_indexer-HBASE_INDEXER/morphlines.conf
or add the option
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf
and then run the following command:

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.7.0-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1629-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm1.bqjr.cn:2181,bqbps1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--go-live

6.4 Batch indexing with the bundled indexer tool fails: solrconfig.xml not found

The job reports that solrconfig.xml cannot be found. This puzzled me for a long time; in the end, adding --reducers 0 made it work.

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--reducers 0 \
--go-live

But why did this problem occur in the first place? We had made a mistake: when running add-indexer, two of the ZooKeeper nodes were given without a port, like this:

hbase-indexer add-indexer \
--name XDGL_WITHHOLD_KFT_INFO \
--indexer-conf $HOME/hbase-indexer/XDGL_WITHHOLD_KFT_INFO/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=XDGL_WITHHOLD_KFT_INFO \
--zookeeper bqbps1.bqjr.cn,bqbpm1.bqjr.cn,bqbpm2.bqjr.cn:2181

So it is no surprise that solrconfig.xml could not be found on the other ZooKeeper nodes. After correcting this, the job ran fine again:

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.7.0-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/XDGL_ACCT_FEE/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1629-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm1.bqjr.cn:2181,bqbps1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--collection XDGL_ACCT_FEE \
--go-live
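
A mistake like the missing ports is easy to catch before add-indexer is run. The sketch below follows the diagnosis above and assumes that every host in the connection string should carry an explicit port. The function name hosts_missing_port is hypothetical, and the check is purely textual; it does not contact ZooKeeper.

```python
def hosts_missing_port(zk_connect):
    """Return the hosts in a ZooKeeper connection string that lack a port.

    The string has the form host1:port,host2:port[/chroot].
    """
    hosts, _, _chroot = zk_connect.partition("/")
    return [host for host in hosts.split(",") if ":" not in host]


bad = "bqbps1.bqjr.cn,bqbpm1.bqjr.cn,bqbpm2.bqjr.cn:2181"
good = "bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr"

print(hosts_missing_port(bad))   # the two hosts written without a port
print(hosts_missing_port(good))  # empty list: every host has a port
```

Running the check on both the --zookeeper value and the solr.zk connection parameter covers the two places where the host list appears.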