HBase協處理器同步二級索引到Solr(續)

時間 2019-12-11

標籤 hbase 處理器同步二級索引 solr 欄目 Hadoop 简体版

原文原文鏈接

1、已知的問題和不足
2、解決思路
3、代碼
3.1 讀取config文件內容
3.2 封裝SolrServer的獲取方式
3.3 編寫提交數據到Solr的代碼
3.4 攔截HBase的Put和Delete操做信息
4、使用

1、已知的問題和不足

在上一個版本中，實現了使用HBase的協處理器將HBase的二級索引同步到Solr中，可是仍舊有幾個缺陷：java

寫入Solr的Collection是寫死在代碼裏面，且是惟一的。若是咱們有一張表的數據但願將不一樣的字段同步到Solr中該如何作呢？
目前全部配置相關信息都是寫死到了代碼中的，是否能夠添加外部配置文件。
原來的方法是每次都須要編譯新的Jar文件單獨運行，可否將全部的同步使用一段通用的代碼完成？

2、解決思路

針對上面的三個主要問題，咱們一一解決shell

一般一張表會對應多個SolrCollection以及不一樣的Column。咱們可使用Map[表名->List[（Collection1，List[Columns]),(Collection2,List[Columns])...]]這樣的類型，根據表名獲取全部的Collection和Column。
經過Typesafe Config讀取外部配置文件，達到全部信息可配的目的。
全部的數據都只有Put和Delete，只要咱們攔截到具體的消息以後判斷當前的表名，而後根據問題一中的Collection和Column便可寫入對應的SolrServer。在協處理器中獲取表名的是e.getEnvironment().getRegion().getTableDesc().getTableName().getNameAsString()其中e是ObserverContext ；

3、代碼

3.1 讀取config文件內容

使用typesafe的config組件讀取morphlines.conf文件，將內容轉換爲 Map<String,List<HBaseIndexerMappin>>。具體代碼以下緩存

 
 
 
 
 
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 
 
 
 
public class ConfigManager { private static SourceConfig sourceConfig = new SourceConfig(); public static Config config; static { sourceConfig.setConfigFiles("morphlines.conf"); config = sourceConfig.getConfig(); } public static Map<String,List<HBaseIndexerMappin>> getHBaseIndexerMappin(){ Map<String,List<HBaseIndexerMappin>> mappin = new HashMap<String, List<HBaseIndexerMappin>>(); Config mappinConf = config.getConfig("Mappin"); List<String> tables = mappinConf.getStringList("HBaseTables"); for (String table :tables){ List<Config> confList = (List<Config>) mappinConf.getConfigList(table); List<HBaseIndexerMappin> maps = new LinkedList<HBaseIndexerMappin>(); for(Config tmp :confList){ HBaseIndexerMappin map = new HBaseIndexerMappin(); map.solrConnetion = tmp.getString("SolrCollection"); map.columns = tmp.getStringList("Columns"); maps.add(map); } mappin.put(table,maps); } return mappin; }}

3.2 封裝SolrServer的獲取方式

由於目前我使用的環境是Solr和HBase公用的同一套Zookeeper，所以咱們徹底能夠藉助HBase的Zookeeper信息。HBase的協處理器是運行在HBase的環境中的，天然能夠經過HBase的Configuration獲取當前的Zookeeper節點和端口，而後輕鬆的獲取到Solr的地址。併發

 
 
 
 
 
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 
 
 
 
public class SolrServerManager implements LogManager { static Configuration conf = HBaseConfiguration.create(); public static String ZKHost = conf.get("hbase.zookeeper.quorum","bqdpm1,bqdpm2,bqdps2"); public static String ZKPort = conf.get("hbase.zookeeper.property.clientPort","2181"); public static String SolrUrl = ZKHost + ":" + ZKPort + "/" + "solr"; public static int zkClientTimeout = 1800000;// 心跳 public static int zkConnectTimeout = 1800000;// 鏈接時間 public static CloudSolrServer create(String defaultCollection){ log.info("Create SolrCloudeServer .This collection is " + defaultCollection); CloudSolrServer solrServer = new CloudSolrServer(SolrUrl); solrServer.setDefaultCollection(defaultCollection); solrServer.setZkClientTimeout(zkClientTimeout); solrServer.setZkConnectTimeout(zkConnectTimeout); return solrServer; }}

3.3 編寫提交數據到Solr的代碼

理想狀態下，咱們時時刻刻都須要提交數據到Solr中，可是事實上咱們數據寫入的時間是比較分散的，可能集中再每一天的某幾個時間點。所以咱們必須保證在高併發下能達到必定數據量自動提交，在低併發的狀況下能隔一段時間寫入一次。只有兩種機制並存的狀況下才能保證數據能即時寫入。app

 
 
 
 
 
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 
 
 
 
public class SolrCommitTimer extends TimerTask implements LogManager { public Map<String,List<SolrInputDocument>> putCache = new HashMap<String, List<SolrInputDocument>>();//Collection名字->更新（插入）操做緩存 public Map<String,List<String>> deleteCache = new HashMap<String, List<String>>();//Collection名字->刪除操做緩存 Map<String,CloudSolrServer> solrServers = new HashMap<String, CloudSolrServer>();//Collection名字->SolrServers int maxCache = ConfigManager.config.getInt("MaxCommitSize"); // 任什麼時候候，保證只能有一個線程在提交索引，並清空集合 final static Semaphore semp = new Semaphore(1); //添加Collection和SolrServer public void addCollecttion(String collection,CloudSolrServer server){ this.solrServers.put(collection,server); }//往Solr添加（更新）數據 public UpdateResponse put(CloudSolrServer server,SolrInputDocument doc) throws IOException, SolrServerException { server.add(doc); return server.commit(false, false); }//往Solr添加（更新）數據 public UpdateResponse put(CloudSolrServer server,List<SolrInputDocument> docs) throws IOException, SolrServerException { server.add(docs); return server.commit(false, false); }//根據ID刪除Solr數據 public UpdateResponse delete(CloudSolrServer server,String rowkey) throws IOException, SolrServerException { server.deleteById(rowkey); return server.commit(false, false); }//根據ID刪除Solr數據 public UpdateResponse delete(CloudSolrServer server,List<String> rowkeys) throws IOException, SolrServerException { server.deleteById(rowkeys); return server.commit(false, false); }//將doc添加到緩存 public void addPutDocToCache(String collection, SolrInputDocument doc) throws IOException, SolrServerException, InterruptedException { semp.acquire(); log.debug("addPutDocToCache:" + "collection=" + collection + "data=" + doc.toString()); if(!putCache.containsKey(collection)){ List<SolrInputDocument> docs = new LinkedList<SolrInputDocument>(); docs.add(doc); putCache.put(collection,docs); }else { List<SolrInputDocument> cache = putCache.get(collection); cache.add(doc); if (cache.size() >= maxCache) { try { this.put(solrServers.get(collection), cache); } finally { putCache.get(collection).clear(); } } } semp.release();//釋放信號量 }//添加刪除操做到緩存 public void addDeleteIdCache(String collection,String rowkey) throws IOException, SolrServerException, InterruptedException { semp.acquire(); log.debug("addDeleteIdCache:" + "collection=" + collection + "rowkey=" + rowkey); if(!deleteCache.containsKey(collection)){ List<String> rowkeys = new LinkedList<String>(); rowkeys.add(rowkey); deleteCache.put(collection,rowkeys); }else{ List<String> cache = deleteCache.get(collection); cache.add(rowkey); if (cache.size() >= maxCache) { try{ this.delete(solrServers.get(collection),cache); }finally { putCache.get(collection).clear(); } } } semp.release();//釋放信號量 } @Override public void run() { try { semp.acquire(); log.debug("開始插入...."); Set<String> collections = solrServers.keySet(); for(String collection:collections){ if(putCache.containsKey(collection) && (!putCache.get(collection).isEmpty()) ){ this.put(solrServers.get(collection),putCache.get(collection)); putCache.get(collection).clear(); } if(deleteCache.containsKey(collection) && (!deleteCache.get(collection).isEmpty())){ this.delete(solrServers.get(collection),deleteCache.get(collection)); deleteCache.get(collection).clear(); } } } catch (InterruptedException e) { e.printStackTrace(); } catch (Exception e) { log.error("Commit putCache to Solr error!Because :" + e.getMessage()); }finally { semp.release();//釋放信號量 } }}

3.4 攔截HBase的Put和Delete操做信息

在每一個prePut和preDelete中攔截操做信息，記錄表名、列名、值。將這些信息根據表名和Collection名進行分類寫入緩存。ide

 
 
 
 
 
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 
 
 
 
public class HBaseIndexerToSolrObserver extends BaseRegionObserver implements LogManager{ Map<String,List<HBaseIndexerMappin>> mappins = ConfigManager.getHBaseIndexerMappin(); Timer timer = new Timer(); int maxCommitTime = ConfigManager.config.getInt("MaxCommitTime"); //最大提交時間，s SolrCommitTimer solrCommit = new SolrCommitTimer(); public HBaseIndexerToSolrObserver(){ log.info("Initialization HBaseIndexerToSolrObserver ..."); for(Map.Entry<String,List<HBaseIndexerMappin>> entry : mappins.entrySet() ){ List<HBaseIndexerMappin> solrmappin = entry.getValue(); for(HBaseIndexerMappin map:solrmappin){ String collection = map.solrConnetion;//獲取Collection名字 log.info("Create Solr Server connection .The collection is " + collection); CloudSolrServer solrserver = SolrServerManager.create(collection);//根據Collection初始化SolrServer鏈接 solrCommit.addCollecttion(collection,solrserver); } } timer.schedule(solrCommit, 10 * 1000L, maxCommitTime * 1000L); } @Override public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException { String table = e.getEnvironment().getRegion().getTableDesc().getTableName().getNameAsString();//獲取表名 String rowkey= Bytes.toString(put.getRow());//獲取主鍵 SolrInputDocument doc = new SolrInputDocument(); List<HBaseIndexerMappin> mappin = mappins.get(table); for(HBaseIndexerMappin mapp : mappin){ for(String column : mapp.columns){ String[] tmp = column.split(":"); String cf = tmp[0]; String cq = tmp[1]; if(put.has(Bytes.toBytes(cf),Bytes.toBytes(cq))){ Cell cell = put.get(Bytes.toBytes(cf),Bytes.toBytes(cq)).get(0);//獲取制定列的數據 Map<String, String > operation = new HashMap<String,String>(); operation.put("set",Bytes.toString(CellUtil.cloneValue(cell))); doc.setField(cq,operation);//使用原子更新的方式將HBase二級索引寫入Solr } } doc.addField("id",rowkey); try { solrCommit.addPutDocToCache(mapp.solrConnetion,doc);//添加doc到緩存 } catch (SolrServerException e1) { e1.printStackTrace(); } catch (InterruptedException e1) { e1.printStackTrace(); } } } @Override public void postDelete(ObserverContext<RegionCoprocessorEnvironment> e, Delete delete, WALEdit edit, Durability durability) throws IOException{ String table = e.getEnvironment().getRegion().getTableDesc().getTableName().getNameAsString(); String rowkey= Bytes.toString(delete.getRow()); List<HBaseIndexerMappin> mappin = mappins.get(table); for(HBaseIndexerMappin mapp : mappin){ try { solrCommit.addDeleteIdCache(mapp.solrConnetion,rowkey);//添加刪除操做到緩存 } catch (SolrServerException e1) { e1.printStackTrace(); } catch (InterruptedException e1) { e1.printStackTrace(); } } }}

4、使用

首先須要添加morphlines.conf文件。裏面包含了須要同步數據到Solr的HBase表名、對應的Solr Collection的名字、要同步的列、多久提交一次、最大批次容量的相關信息。具體配置以下：高併發

 
 
 
 
 
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 
 
 
 
#最大提交時間（單位：秒）MaxCommitTime = 30#最大批次容量MaxCommitSize = 10000Mappin { HBaseTables: ["HBASE_OBSERVER_TEST"] #須要同步的HBase表名 "HBASE_OBSERVER_TEST": [ { SolrCollection: "bqjr" #Solr Collection名字 Columns: [ "cf1:test_age", #須要同步的列，格式<列族:列> "cf1:test_name" ] }, ]}

該配置文件默認放在各個節點的/etc/hbase/conf/下。若是你但願將配置文件路徑修改成其餘路徑，請修改com.bqjr.bigdata.HBaseObserver.comm.config.SourceConfig類中的configHome路徑。post

而後將代碼打包，上傳到HDFS中，將協處理器添加到對應的表中。ui

 
 
 
 
 
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 
 
 
 
#先禁用這張表disable 'HBASE_OBSERVER_TEST'#爲這張表添加協處理器,設置的參數具體爲： jar文件路徑|類名|優先級（SYSTEM或者USER）alter 'HBASE_OBSERVER_TEST','coprocessor'=>'hdfs://hostname:8020/ext_lib/HBaseObserver-1.0.0.jar|com.bqjr.bigdata.HBaseObserver.server.HBaseIndexerToSolrObserver||'#啓用這張表enable 'HBASE_OBSERVER_TEST'#刪除某個協處理器，"$<bumber>"後面跟的ID號與desc裏面的ID號相同alter 'HBASE_OBSERVER_TEST',METHOD=>'table_att_unset',NAME => 'coprocessor$1'