HBase Data Backup && Disaster Recovery Solutions


Tags (space-separated): HBase


1. Distcp

When backing up by copying HDFS files with the distcp command, the table being backed up must be disabled to ensure no data is written to it during the copy, so this approach is not usable for an HBase cluster serving online traffic. Once the now-static directory has been distcp'd to another HDFS filesystem, all data can be restored by starting a new HBase cluster on the other cluster.
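A minimal sketch of the procedure, assuming a table named table_name, an HBase root directory of /hbase, and namenodes src-nn / dst-nn (all hypothetical names):

# Stop writes so the files under the table's HDFS directory are static
echo "disable 'table_name'" | hbase shell

# Copy the HBase root directory (or just one table's directory) to the other HDFS
hadoop distcp hdfs://src-nn:8020/hbase hdfs://dst-nn:8020/hbase

# Re-enable the table on the source cluster once the copy has finished
echo "enable 'table_name'" | hbase shell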

2. CopyTable

Before running the command, the table must first be created on the peer cluster. CopyTable supports time ranges, row ranges, renaming the table, renaming column families, specifying whether to copy deleted data, and so on. For example:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=dstClusterZK:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
# 1. Same cluster, different table name
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy  srcTable
# 2. Copy a table across clusters
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase srcTable

Note that cross-cluster CopyTable works in push mode, i.e. the command must be run from the source cluster.

CopyTable usage:

$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

Options:
 rs.class     hbase.regionserver.class of the peer cluster,
              specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster,
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis)
              without endtime means from starttime to forever
 endtime      end of the time range.  Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format
              hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName.
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells

Args:
 tablename    Name of the table to copy

Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable

For performance consider the following general options:
  It is recommended that you set the following to >=100. A higher value uses more memory but
  decreases the round trip time to the server and may increase performance.
    -Dhbase.client.scanner.caching=100
  The following should always be set to false, to prevent writing data twice, which may produce
  inaccurate results.
    -Dmapred.map.tasks.speculative.execution=false

Some examples

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1478448000000 --endtime=1478591994506 --peer.adr=VECS00001,VECS00002,VECS00003:2181:/hbase --families=txjl --new.name=hy_membercontacts_bk  hy_membercontacts

# Back up by time range
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1478448000000 --endtime=1478591994506 --new.name=hy_membercontacts_bk  hy_membercontacts
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1477929600000 --endtime=1478591994506 --new.name=hy_linkman_tmp hy_linkman

# Back up a whole table
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=hy_mobileblacklist_bk_before_del hy_mobileblacklist

# Bonus: scan by time range
scan 'hy_linkman', {COLUMNS => 'lxr:sguid', TIMERANGE => [1478966400000, 1479052799000]}
scan 'hy_mobileblacklist', {COLUMNS => 'mobhmd:sguid', TIMERANGE => [1468719824000, 1468809824000]}
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=hy_mobileblacklist_bk_before_del_20161228 hy_mobileblacklist

3. Export/Import (MapReduce-based)

## Export: run the export command. -D options can be used to set parameters; here the table name, column family, start/stop RowKey, and the HDFS output directory are specified

hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.mapreduce.scan.column.family=cf -D hbase.mapreduce.scan.row.start=0000001 -D hbase.mapreduce.scan.row.stop=1000000 table_name /tmp/hbase_export

### Optional -D parameters

Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]

  Note: -D properties will be applied to the conf used. 
  For example: 
   -D mapred.output.compress=true
   -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
   -D mapred.output.compression.type=BLOCK
  Additionally, the following SCAN properties can be specified
  to control/limit what is exported..
   -D hbase.mapreduce.scan.column.family=<familyName>
   -D hbase.mapreduce.include.deleted.rows=true
For performance consider the following properties:
   -Dhbase.client.scanner.caching=100
   -Dmapred.map.tasks.speculative.execution=false
   -Dmapred.reduce.tasks.speculative.execution=false
For tables with very wide rows consider setting the batch size as below:
   -Dhbase.export.scanner.batch=10
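For example, a sketch of an export with gzip-compressed output and the recommended performance settings (table name and output path are placeholders):

hbase org.apache.hadoop.hbase.mapreduce.Export -D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec -D hbase.client.scanner.caching=100 -D mapred.map.tasks.speculative.execution=false table_name /tmp/hbase_export_gz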

## Import: run the import command

The table must exist before importing: create 'table_name','cf'

### Run the import command

hbase org.apache.hadoop.hbase.mapreduce.Import table_name hdfs://flashhadoop/tmp/hbase_export/

Optional -D parameters

Usage: Import [options] <tablename> <inputdir>

By default Import will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimport.bulk.output=/path/for/output
 To apply a generic org.apache.hadoop.hbase.filter.Filter to the input, use
  -Dimport.filter.class=<name of filter class>
  -Dimport.filter.args=<comma separated list of args for filter
 NOTE: The filter will be applied BEFORE doing key renames via the HBASE_IMPORTER_RENAME_CFS property. Further, filters will only use the Filter#filterRowKey(byte[] buffer, int offset, int length) method to identify whether the current row needs to be ignored completely for processing and Filter#filterKeyValue(KeyValue) method to determine if the KeyValue should be added; Filter.ReturnCode#INCLUDE and #INCLUDE_AND_NEXT_COL will be considered as including the KeyValue.
For performance consider the following options:
  -Dmapred.map.tasks.speculative.execution=false
  -Dmapred.reduce.tasks.speculative.execution=false
  -Dimport.wal.durability=<Used while writing data to hbase. Allowed values are the supported durability values like SKIP_WAL/ASYNC_WAL/SYNC_WAL/...>
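As a sketch of the bulk-load variant mentioned above (paths are placeholders, and the LoadIncrementalHFiles class name applies to HBase 1.x), Import can first write HFiles instead of loading HBase directly, and the HFiles can then be bulk-loaded:

# Generate HFiles from the exported data instead of writing to HBase directly
hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/tmp/hbase_import_hfiles table_name hdfs://flashhadoop/tmp/hbase_export/

# Bulk-load the generated HFiles into the table
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hbase_import_hfiles table_name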

4. Snapshot

A snapshot is a point-in-time image of an HBase table.

Snapshot support must be enabled on the HBase cluster in advance:

<property>
    <name>hbase.snapshot.enabled</name>
    <value>true</value>
</property>

In the hbase shell, the snapshot, list_snapshots, restore_snapshot, clone_snapshot, and delete_snapshot commands can be used to create a snapshot, list snapshots, restore a table from a snapshot, create a new table from a snapshot, and so on.
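For example, on the source cluster (table and snapshot names are placeholders):

hbase shell
snapshot 'table_name', 'table_name_snapshot'    # create a snapshot of the table
list_snapshots                                  # verify that the snapshot exists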

After a snapshot has been created, the ExportSnapshot tool can be used to export it to another cluster for data backup or data migration. ExportSnapshot is used as follows (it must run in push mode, i.e. from the current cluster to the destination cluster):

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot table_name_snapshot -copy-to hdfs://flashhadoop_2/hbase -mappers 2

After this command runs, the table_name_snapshot folder is copied into the /hbase/.hbase-snapshot directory on flashhadoop_2's HDFS. On the flashhadoop_2 HBase cluster, list_snapshots then shows a snapshot named table_name_snapshot. With clone_snapshot this snapshot can be turned into a new table without creating the table in advance, and the new table's region count and other metadata match the snapshot exactly. Alternatively, you can first create a table identical to the original and then restore it with restore_snapshot, but this leaves one extra region, which will be invalid.
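A minimal sketch of the two options in the destination cluster's hbase shell (names are placeholders):

# Option 1: create a brand-new table from the snapshot (no need to create the table first)
clone_snapshot 'table_name_snapshot', 'table_name_new'

# Option 2: restore into an existing table with the same schema (the table must be disabled first)
disable 'table_name'
restore_snapshot 'table_name_snapshot'
enable 'table_name'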

After a snapshot has been used to copy one cluster's data to the new cluster, the application enables dual writes, and the Export tool can then be used to bring the data written between the snapshot and the start of dual writes into the new cluster, completing the migration. To make sure no data is lost, the time range passed to Export can be widened appropriately.
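A sketch of such a catch-up pass, assuming the snapshot was taken around timestamp 1478448000000 and dual writes began around 1478591994506 (placeholder values, widened slightly at both ends); Export's positional arguments are <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]:

# On the old cluster: export only the data written between the snapshot and the start of dual writes,
# writing directly into the new cluster's HDFS
hbase org.apache.hadoop.hbase.mapreduce.Export table_name hdfs://flashhadoop_2/tmp/hbase_export_gap 1 1478440000000 1478600000000

# On the new cluster: import the gap data into the already-cloned table
hbase org.apache.hadoop.hbase.mapreduce.Import table_name hdfs://flashhadoop_2/tmp/hbase_export_gap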

5. Replication

The replication mechanism can be used to run HBase clusters in master-slave mode, or even master-master mode with bidirectional synchronization. The steps are as follows:

1. If the master and slave HBase clusters share one ZooKeeper ensemble, zookeeper.znode.parent cannot be the default hbase on both sides; configure, for example, hbase-master and hbase-slave. In short, the znode names in ZooKeeper must not collide.

2. Add the following properties to hbase-site.xml on both the master and slave clusters. (For a plain master-slave setup it is actually enough to set hbase.replication to true on the slave cluster; the rest can be ignored.)

<property>
    <name>hbase.replication</name>
    <value>true</value>
</property>

<property>
    <name>replication.source.nb.capacity</name>
    <value>25000</value>
    <description>Maximum number of entries the master cluster ships to the slave cluster per batch; default 25000, adjust to the cluster size</description>
</property>

<property>
    <name>replication.source.size.capacity</name>
    <value>67108864</value>
    <description>Maximum total size of each batch of entries the master cluster ships to the slave cluster; default 64 MB</description>
</property>

<property>
    <name>replication.source.ratio</name>
    <value>1</value>
    <description>Fraction of the slave cluster's RegionServers used by the master cluster; default 0.1 (0.15 in 1.x.x), set to 1 to make full use of the slave cluster's RegionServers</description>
</property>

<property>
    <name>replication.sleep.before.failover</name>
    <value>2000</value>
    <description>How long the master cluster waits after a RegionServer goes down before failing over its replication queue; default 2 seconds. The actual sleep time is: sleepBeforeFailover + (long) (new Random().nextFloat() * sleepBeforeFailover)</description>
</property>

<property>
    <name>replication.executor.workers</name>
    <value>1</value>
    <description>Number of replication worker threads; default 1, increase if the write volume is high</description>
</property>
3. Restart the master and slave clusters (for a newly built cluster, skip the restart and just start it).
4. In the hbase shell on the master and slave clusters respectively, run:
add_peer 'ID', 'CLUSTER_KEY'

The ID must be a short integer. To compose the CLUSTER_KEY, use the following template:

hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent

This will show you the help to setup the replication stream between both clusters. If both clusters use the same Zookeeper cluster, you have to use a different zookeeper.znode.parent since they can't write in the same folder.

1. Add replication from the primary HBase to the disaster-recovery HBase:

add_peer '1',  "VECS00840,VECS00841,VECS00842,VECS00843,VECS00844:2181:/hbase"

2. Add replication from the disaster-recovery HBase back to the primary HBase:

add_peer '2',  "VECS00994,VECS00995,VECS00996,VECS00997,VECS00998:2181:/hbase"

3. Then create tables with exactly the same structure and attributes on both the primary and standby clusters. (Note: they must be exactly the same.)

Create the table on both the master and slave clusters:
hbase shell>
create 't_warehouse_track', {NAME => 'cf', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

4. In the hbase shell on the master cluster:

enable_table_replication 't_warehouse_track'

5. In the hbase shell on the disaster-recovery cluster:

disable 'your_table'
alter 'your_table', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'your_table'

The 1 in REPLICATION_SCOPE => '1' here is unrelated to the peer ID set with add_peer above; this value can only be 0 or 1 and simply indicates whether replication is enabled or disabled for the column family.