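This book is the official guide to HBase, version 0.95-SNAPSHOT. It can be found on the HBase website. More material is available in the javadoc, JIRA, and the wiki.
This book is a work in progress. Patches can be submitted through the HBase JIRA.
If this is your first foray into the wonderful world of distributed computing, you are in for some interesting times. Distributed computing is hard, and building a distributed system takes a broad range of software, hardware, and networking skills. Your cluster can fail for all sorts of reasons: bugs in HBase itself, misconfiguration (including of the operating system), and hardware faults (network cards, disks, even memory). If you have only written single-machine programs so far, you will need to start learning afresh. A good starting point is: Fallacies of Distributed Computing.
Section 1.2, "Quick Start" describes how to run a standalone HBase, which runs against the local filesystem. Section 2, "Configuration" describes how to run a distributed HBase, which runs on HDFS.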
Choose an Apache download mirror and download HBase Releases. Click the stable directory, then download the file ending in .tar.gz; for example hbase-0.95-SNAPSHOT.tar.gz
$ tar xfz hbase-0.95-SNAPSHOT.tar.gz
$ cd hbase-0.95-SNAPSHOT
You are now ready to start HBase. First, though, you may want to edit conf/hbase-site.xml to choose the directory HBase writes its data to.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
</configuration>
Replace DIRECTORY with the directory you want HBase to write files to. By default hbase.rootdir points to /tmp/hbase-${user.name}
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
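You need to confirm that Oracle Java 1.6 is installed. If typing java at the command line produces a response, Java is installed. If it is not installed, install it first and then edit conf/hbase-env.sh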
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010

hbase(main):001:0>
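Type help and then <RETURN> to see a listing of shell commands. The help is detailed; note in particular that table names, rows, and columns must be quoted.
Create a table named test with a single column family named cf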
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'table'
test
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
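The first insert is at row row1, column cf:a, with a value of value1. Columns in HBase are made up of a column family prefix and a column qualifier, separated by a colon. In this row, for example, the column qualifier is a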
hbase(main):007:0> scan 'test'
ROW        COLUMN+CELL
row1       column=cf:a, timestamp=1288380727188, value=value1
row2       column=cf:b, timestamp=1288380738440, value=value2
row3       column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
hbase(main):008:0> get 'test', 'row1'
COLUMN      CELL
cf:a        timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds
Disable and then drop the table to clean up what you have just done.
hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds
hbase(main):014:0> exit
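The steps above are suitable only for experimentation and testing. Next, see Section 2, "Configuration", which covers the different HBase run modes, the software needed to run a distributed HBase, and how to configure it.
HBase uses the same configuration system as Hadoop. To configure a deployment, edit the environment variables in conf/hbase-env.sh (this file is used mostly by the launch scripts that bring up the cluster), then add configuration to the XML files to override the HBase defaults, telling HBase which filesystem to use, the location of the ZooKeeper ensemble, and so on [1].
[1] Be careful editing XML. Make sure you close all elements. Run the file through xmllint or a similar tool to confirm the document is still well-formed after an edit.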
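The footnote's advice can also be scripted. The following is a minimal sketch, not part of HBase, that uses Python's standard XML parser as one of the "similar tools" to check well-formedness; the function name is_well_formed and the sample documents are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A small configuration in the same shape as hbase-site.xml.
GOOD = """<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
</configuration>"""

# The same document with the closing </property> tag forgotten.
BAD = GOOD.replace("</property>", "")

print(is_well_formed(GOOD), is_well_formed(BAD))
```

This only checks that elements are closed and nested correctly; it does not validate property names or values.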
This section lists required services and some required system configuration.
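ssh must be installed and sshd must be running, so that the Hadoop scripts can manage remote Hadoop and HBase daemons. You must be able to ssh to every node without a password; search for "ssh passwordless login" for instructions.
HBase uses the local hostname to obtain its IP address. Both forward and reverse DNS resolution should work.
If that is not enough, you can set hbase.regionserver.dns.interface to indicate the primary interface.
Another option is to set hbase.regionserver.dns.nameserver to choose a nameserver other than the system default.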
HBase expects the loopback IP address to be Ubuntu and some other distributions, for example, will default to and this will cause problems for you.
should look something like this: localhost ubuntu.ubuntu-domain ubuntu
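The clocks on cluster members should be in basic alignment. Some skew is tolerable, but large skew causes odd behavior. Run NTP or an equivalent on your cluster to keep time synchronized.
HBase is a database and uses a lot of file handles at the same time. The default of 1024 on most Linux systems is not enough and leads to the error described in FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs? You may also see exceptions like this:
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception increateBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
So you need to raise your maximum file handle limit. Setting it to 10k is reasonable. The rough math is as follows: each column family has at least 1 StoreFile, and possibly 5-6 if the region is under load. Multiply the average number of StoreFiles per column family by the number of column families and by the average number of regions per RegionServer. For example, assuming a schema with 3 column families, 3 StoreFiles per column family, and 100 regions per RegionServer, the JVM will open 3 * 3 * 100 = 900 file descriptors (not counting open jar files, configuration files, and so on).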
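The arithmetic above can be written as a small helper. This is only a sketch of the estimate described in the text; the function name and parameter names are illustrative, not an HBase API.

```python
def estimate_store_file_descriptors(column_families: int,
                                    store_files_per_family: int,
                                    regions_per_server: int) -> int:
    """Rough count of StoreFile descriptors one RegionServer JVM holds open.

    Jar files, configuration files, sockets and so on are not counted.
    """
    return column_families * store_files_per_family * regions_per_server


# The example from the text: 3 column families, 3 StoreFiles each, 100 regions.
print(estimate_store_file_descriptors(3, 3, 100))

# The same schema under load, with 6 StoreFiles per column family.
print(estimate_store_file_descriptors(3, 6, 100))
```

Under load the second figure already exceeds the common default limit of 1024, which is why the limit has to be raised.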
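You also need to raise the nproc limit for the hbase user; under load, a value that is too low can cause OutOfMemoryError exceptions [3] [4].
In the file /etc/security/limits.conf add a line like:
hadoop - nofile 32768
Replace hadoop with whichever user runs HBase and Hadoop. If you use separate users, you need an entry for each. Also set the nproc hard and soft limits, for example:
hadoop soft/hard nproc 32000
In /etc/pam.d/common-session add the line:
session required pam_limits.so
Otherwise the changes in /etc/security/limits.conf will not be applied.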
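If you really do want to run on Windows, you need to install Cygwin to provide a Unix-like environment. For details see the Windows installation guide, or search the mailing list for recent notes on Windows.
Choosing a Hadoop version is critical for your HBase deployment. The table below shows which Hadoop versions each HBase version supports; pick the Hadoop version that suits your HBase version. We are not tied to a particular Hadoop distribution. You can use Hadoop from Apache or look at the Hadoop vendors' offerings: http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support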
Table 2.1. Hadoop version support matrix
Hadoop version | HBase-0.92.x | HBase-0.94.x | HBase-0.96 |
Hadoop-0.20.205 | S | X | X |
Hadoop-0.22.x | S | X | X |
Hadoop-1.0.x | S | S | S |
Hadoop-1.1.x | NT | S | S |
Hadoop-0.23.x | X | S | NT |
Hadoop-2.x | X | S | S |
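S = supported and tested
X = not supported
NT = not tested enough; should run, but testing has been insufficient
Because HBase depends on Hadoop, it bundles a Hadoop jar under its lib directory. The bundled jar is for use in standalone mode only. In distributed mode, the Hadoop version on your cluster must match the one under HBase. Replace the Hadoop jar found in the HBase lib directory with the Hadoop jar from the version you are running on your cluster to avoid version mismatch issues. Make sure you replace the jar in HBase everywhere on your cluster. Hadoop version mismatch issues show up in various ways, but they often look like a hang.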
HBase 0.92 and 0.94 versions can work with Hadoop versions, 0.20.205, 0.22.x, 1.0.x, and 1.1.x. HBase-0.94 can additionally work with Hadoop-0.23.x and 2.x, but you may have to recompile the code using the specific maven profile (see top level pom.xml)
Apache HBase 0.96.0 requires Apache Hadoop 1.x at a minimum, and it can run equally well on hadoop-2.0. As of Apache HBase 0.96.x, Apache Hadoop 1.0.x at least is required. We will no longer run properly on older Hadoops such as 0.20.205 or branch-0.20-append. Do not move to Apache HBase 0.96.x if you cannot upgrade your Hadoop[6].
HBase will lose data unless it is running on an HDFS that has a durable sync implementation. DO NOT use Hadoop 0.20.2, Hadoop, and Hadoop which DO NOT have this attribute. Currently only Hadoop versions 0.20.205.x or any release in excess of this version -- this includes hadoop-1.0.0 -- have a working, durable sync[7]. Sync has to be explicitly enabled by setting dfs.support.append equal to true on both the client side -- in hbase-site.xml -- and on the serverside in hdfs-site.xml (The sync facility HBase needs is a subset of the append code path).
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
You will have to restart your cluster after making this edit. Ignore the chicken-little comment you'll find in the hdfs-default.xml in the description for the dfs.support.append configuration.
HBase runs on Hadoop 0.20.x and can take advantage of its security features, as long as you use one of the two versions that include them, 0.20S or CDH3B3, and replace the hadoop.jar accordingly.
A Hadoop HDFS DataNode has an upper bound on the number of files it will serve at any one time. The parameter is called xcievers (yes, the Hadoop authors misspelled the word). Before loading any data, make sure you have this configured in conf/hdfs-site.xml
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
Without this setting you may see strange failures. The DataNode logs will complain that xcievers has been exceeded, but at run time the symptom is a missing blocks error. For example: 10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
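HBase has two run modes: Section 2.4.1, "Standalone Mode" and Section 2.4.2, "Distributed Mode". Standalone is the default; for distributed mode you need to edit files in the conf directory.
Whatever the mode, you need to edit conf/hbase-env.sh to tell HBase which Java to use. In this file you can also set HBase environment options such as the heap size and other JVM options, the location of log files, and so on. Set JAVA_HOME to point at your Java installation.
This is the default mode, and the one described in Section 1.2, "Quick Start". In standalone mode HBase uses the local filesystem instead of HDFS, and all HBase daemons and a local ZooKeeper run in a single JVM. ZooKeeper binds to a port so that clients can talk to HBase.
Distributed mode comes in two flavors. Pseudo-distributed mode runs all daemons on a single machine, but not in a single JVM. Fully-distributed mode spreads the daemons across the nodes of the cluster [6].
Distributed modes require the Hadoop Distributed File System (HDFS). See the HDFS requirements and instructions for how to set up HDFS. Confirm that HDFS is working before you start operating HBase.
After installation, you need to confirm that your pseudo-distributed or fully-distributed configuration is correct. Both modes can use the same verification script, described in Section 2.2.3, "Running and Confirming Your Installation".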
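Once you have confirmed that HDFS is set up, edit conf/hbase-site.xml. This is the file in which you add your own configuration, which overrides Section, "HBase Default Configuration" and Section, "HDFS Client Configuration". Running HBase requires setting hbase.rootdir. The example below points it at a directory on a namenode listening on port 9000 on localhost and keeps a single replica of the data (the HDFS default is 3). Add the following to hbase-site.xml: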
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The replication count for HLog &amp; HFile storage. Should not be greater than HDFS datanode count.
    </description>
  </property>
  ...
</configuration>
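Let HBase create the hbase.rootdir directory itself.
Above we bind to localhost, which means that no machine other than the local one can connect to HBase. Set it to something else if you want remote clients to use it.
You can now skip to Section 2.2.3, "Running and Confirming Your Installation" to run and confirm your pseudo-distributed install. [7]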
hdfs-site.xml
<configuration>
  ...
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/local/user.name/hdfs-data-name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/local/user.name/hdfs-data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  ...
</configuration>
hbase-site.xml
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  ...
</configuration>
Start the initial HBase cluster...
% bin/start-hbase.sh
% bin/local-master-backup.sh start 1
... the '1' means use ports 60001 & 60011, and this backup master's log file will be at logs/hbase-${USER}-1-master-${HOSTNAME}.log.
% bin/local-master-backup.sh start 2 3
You can start up to 9 backup masters (10 in total).
Start more regionservers...
% bin/local-regionservers.sh start 1
The '1' means use ports 60201 & 60301, and the log file will be at logs/hbase-${USER}-1-regionserver-${HOSTNAME}.log.
To add 4 more regionservers on top of the one you just started...
% bin/local-regionservers.sh start 2 3 4 5
Up to 99 extra regionservers are supported (100 in total).
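The port numbers quoted above follow a simple offset pattern. The sketch below reproduces it; the base ports (60000/60010 for masters, 60200/60300 for extra regionservers) are inferred from the examples in the text rather than read from the scripts, and the function names are illustrative.

```python
# Base ports inferred from the text: offset 1 gives 60001 & 60011 for a
# backup master and 60201 & 60301 for an extra regionserver.
MASTER_BASE_PORTS = (60000, 60010)
REGIONSERVER_BASE_PORTS = (60200, 60300)


def backup_master_ports(offset: int) -> tuple:
    """Ports for the backup master started with the given offset (1-9)."""
    if not 1 <= offset <= 9:
        raise ValueError("backup master offset must be between 1 and 9")
    return tuple(port + offset for port in MASTER_BASE_PORTS)


def extra_regionserver_ports(offset: int) -> tuple:
    """Ports for the extra regionserver started with the given offset (1-99)."""
    if not 1 <= offset <= 99:
        raise ValueError("regionserver offset must be between 1 and 99")
    return tuple(port + offset for port in REGIONSERVER_BASE_PORTS)


print(backup_master_ports(1))
print(extra_regionserver_ports(1))
```

Checking the computed ports against what is already listening on the machine is a quick way to avoid bind failures before starting extra daemons.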
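To run in fully-distributed mode, configure the following. First, in hbase-site.xml, add the property hbase.cluster.distributed and set it to true. Then set hbase.rootdir to point at the HDFS NameNode. For example, if your namenode is running at namenode.example.org on port 9000 and the directory you want is /hbase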
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:9000/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  ...
</configuration>
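The file described in Section, "regionservers" lists all the hosts on which you want HRegionServers to run, one host per line (like the slaves file in Hadoop). Servers listed here are started and stopped along with the cluster.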
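If you have made HDFS client configuration on your Hadoop cluster, for example client-side settings that differ from the server side, do one of the following so that HBase sees your configuration:
Add a copy of hdfs-site.xml (or hadoop-site.xml), preferably as a symlink.
If there are only a few HDFS client settings, add them to hbase-site.xml
An example of such an HDFS client setting is dfs.replication
Start the HDFS daemons with bin/start-hdfs.sh from your Hadoop installation.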
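HBase should now be running. HBase writes its logs to the logs subdirectory; check them if HBase has trouble starting.
HBase also puts up a UI listing vital attributes. By default it is deployed on the Master at port 60010 (HBase RegionServers bind to port 60020 by default and put up an informational UI at port 60030). If the Master is running on master.example.org on the default port, point your browser at http://master.example.org:60010 to see the main UI.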
Once HBase has started, see Section 1.2.3, "Shell Exercises" for how to create a table, insert data, scan the table, and finally disable and drop it.
To stop HBase, exit the HBase shell and run:
$ ./bin/stop-hbase.sh
stopping hbase...............
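Configure the deployment information and environment variables (these are used by the launch scripts), then add configuration to the XML files to override the defaults, telling HBase which directory to use, where ZooKeeper is, and so on [10].
After editing, copy the configuration to the conf directory on every node of the cluster. HBase will not do this for you; use rsync.
The HBase site-specific configuration file is conf/hbase-site.xml. For the list of configurable properties see Section, "HBase Default Configuration", or look at the hbase-default.xml source file in the HBase source code.
Not all configuration options make it out to hbase-default.xml
This documentation is generated from the default HBase configuration file; the source is hbase-default.xml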
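The directory shared by region servers and into which HBase persists its data. The URL must be fully qualified and include the filesystem scheme. For example, to specify the HDFS directory '/hbase' where the namenode is running at namenode.example.org on port 9000, set this value to hdfs://namenode.example.org:9000/hbase. By default HBase writes to /tmp. If you do not change this, data will be lost on restart.
Default: file:///tmp/hbase-${user.name}/hbase
Default: 60000
Default: false
Default: ${hbase.tmp.dir}/local/
The port for the HBase Master web UI. Set to -1 if you do not want it to run.
Default: 60010
The bind address for the HBase Master web UI.
Default size of the HTable client write buffer, in bytes. A bigger buffer takes more memory, on both the client and the server side, since the server instantiates the passed write buffer in order to process it. The benefit is a reduction in the number of RPCs. To estimate the memory used on the server side, evaluate hbase.client.write.buffer * hbase.regionserver.handler.count
Default: 2097152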
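As a worked example of that estimate, the sketch below multiplies the two defaults listed in this section (a 2097152-byte write buffer and 10 handlers). The function name is illustrative.

```python
def server_side_write_buffer_bytes(write_buffer_bytes: int,
                                   handler_count: int) -> int:
    """Estimate server-side memory used by client write buffers:
    hbase.client.write.buffer * hbase.regionserver.handler.count."""
    return write_buffer_bytes * handler_count


# Defaults from this section: 2097152 bytes and 10 handlers.
estimate = server_side_write_buffer_bytes(2097152, 10)
print(estimate, "bytes =", estimate / (1024 * 1024), "MB")
```

With the defaults this comes to 20 MB per RegionServer; raising either setting scales the figure linearly.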
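The port the HBase RegionServer binds to.
Default: 60020
The port for the HBase RegionServer web UI. Set to -1 if you do not want the RegionServer UI to run.
Default: 60030
Default: false
The address for the HBase RegionServer web UI.
The interface used by the RegionServer. Used by the client when opening a proxy to connect to a region server.
Default: org.apache.hadoop.hbase.ipc.HRegionInterface
Default: 1000
Maximum number of retries. Used as the maximum for all retryable operations, such as fetching the root region from the root region server, getting a cell's value, and starting a row update.
Default: 10
Maximum number of retries. This is the maximum number of iterations that atomic bulk loads are attempted. 0 means never give up.
Default: 0
Default: 100
The maximum allowed size of a KeyValue instance. This sets an upper bound on a single entry saved in a storage file. Since a KeyValue cannot be split, it helps avoid a region that cannot be split any further because the data is too large. It is wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check. The default is 10MB.
Default: 10485760
The lease period for clients of an HRegion server, that is, the timeout threshold, in milliseconds. By default clients must report in within this period or they are considered dead.
Default: 60000
The number of RPC Server instances spun up on RegionServers. The same property is used by the Master for the count of master handlers.
Default: 10
Interval between messages from the RegionServer to the Master, in milliseconds.
Default: 3000
Default: 1000
Default: 2147483647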
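Period at which the commit log is rolled, regardless of how many edits it holds.
Default: 3600000
The HLog file reader implementation.
Default: org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader
The HLog file writer implementation.
Default: org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter
The number of reserved memory blocks (translator's note: like a strategic oil reserve). When an out-of-memory exception occurs, this memory is used for cleanup before the RegionServer shuts down.
Default: 4
Default: default
When DNS is used, the host name or IP address of the name server that ZooKeeper uses to determine the host name used for communication with the master.
Default: default
Default: default
When DNS is used, the host name or IP address of the name server that the RegionServer uses to determine the host name used for communication with the master.
Default: default
Default: default
When DNS is used, the host name or IP address of the name server that the Master uses to determine the host name it uses for communication.
Default: default
Interval at which the Master runs the region balancer.
Default: 300000
A rebalance is triggered if any RegionServer holds more than average + (average * slop) regions. The default slop is 20%.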
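The slop rule can be illustrated with a few lines of arithmetic. This is a simplified sketch of the threshold as stated in the text, not the balancer's actual implementation; the function names and the sample region counts are made up.

```python
def rebalance_threshold(region_counts, slop=0.2):
    """Region count above which a server is considered overloaded:
    average + (average * slop)."""
    average = sum(region_counts) / len(region_counts)
    return average + average * slop


def needs_rebalance(region_counts, slop=0.2):
    """True if any server holds more regions than the threshold."""
    threshold = rebalance_threshold(region_counts, slop)
    return any(count > threshold for count in region_counts)


# Three servers averaging 100 regions: the threshold is about 120.
print(rebalance_threshold([130, 90, 80]))
print(needs_rebalance([130, 90, 80]))   # one server holds 130 regions
print(needs_rebalance([110, 100, 90]))  # all servers within the slop
```

A larger slop tolerates more imbalance before regions are moved.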
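Maximum time an HLog can stay in the .oldlogdir directory, after which it is cleaned up by a Master thread.
Default: 600000
A comma-separated list of LogCleanerDelegate classes invoked by the LogsCleaner service. These WAL/HLog cleaners are called in order, so put the one that should run first at the front. You can implement your own LogCleanerDelegate, put it on the classpath, and add its fully qualified class name here. It is usually added in front of the default value.
Default: org.apache.hadoop.hbase.master.TimeToLiveLogCleaner
Maximum combined size of all memstores on a single region server. Beyond this value, new updates are blocked and flushes are forced.
Default: 0.4
When flushes are forced, flushing stops once the total falls below this value. The default is 35% of the heap. Setting this value equal to hbase.regionserver.global.memstore.upperLimit means the minimum possible flushing occurs when updates are blocked because of memory limits (translator's note: once a flush runs, the value drops below the lower limit and flushing stops).
Default: 0.35
Time to sleep between service work, in milliseconds. Used as the sleep interval by service threads such as the log roller.
Default: 10000
Number of attempts to write the version file before aborting. Each attempt is separated by hbase.server.thread.wakefrequency milliseconds.
Default: 3
Default: 5242880
Updates are blocked if the memstore reaches hbase.hregion.memstore.block.multiplier times hbase.hregion.flush.size bytes. This prevents a runaway memstore during spikes in update traffic. Without an upper bound, the flush files take a long time to compact or split, and in the worst case an out-of-memory exception is raised. (Translator's note: memory is faster than disk, so writers have to wait. The original appears to be in error.)
Default: 2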
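The blocking rule above is a single multiplication. The sketch below shows it with the default multiplier of 2; the 128 MB flush size is an assumed value chosen only for illustration, since the flush size default is not stated in this section.

```python
def memstore_blocking_size(flush_size_bytes: int, block_multiplier: int = 2) -> int:
    """Memstore size at which updates to a region are blocked:
    block multiplier times flush size."""
    return block_multiplier * flush_size_bytes


# Assumed flush size of 128 MB (illustrative, not taken from this section).
ASSUMED_FLUSH_SIZE = 128 * 1024 * 1024

print(memstore_blocking_size(ASSUMED_FLUSH_SIZE))
```

Under that assumption updates to a region block once its memstore reaches 256 MB.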
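Experimental: enables the MemStore-Local Allocation Buffer, a feature that prevents heap fragmentation under heavy write loads. This can reduce the frequency of GC pauses (which may be stop-the-world). (Translator's note: the implementation amounts to preallocating memory rather than allocating every value from the heap.)
Default: true
Maximum HStoreFile size. If any one of a column family's HStoreFiles grows to this value, the hosting HRegion is split in two. Default: 10G.
When an HStore has more than this many HStoreFiles (one HStoreFile is written per memstore flush), a compaction is run to rewrite them as one. The larger the value, the longer the compaction takes.
Default: 3
When an HStore has more than this many HStoreFiles (one HStoreFile is written per memstore flush), updates are blocked until a compaction completes or until hbase.hstore.blockingWaitTime has been exceeded.
Default: 7
Default: 90000
Default: 10
The time between major compactions of all HStoreFiles in a region. The default is 1 day. Set to 0 to disable this feature.
Default: 86400000
Enables StoreFileScanner parallel seeking in StoreScanner, a feature that can reduce latency under special conditions.
Default: false
Default: 10
The MapReduce HFileOutputFormat writes storefiles/hfiles. This is the minimum hfile blocksize to emit. Usually, when HBase writes hfiles, the blocksize comes from the table schema (HColumnDescriptor), but in the MapReduce output format the schema is not available. The smaller the value, the bigger your index and the less data you fetch on a random access. Lower this value if your cells are small and you want faster random access.
Default: 65536
Fraction of the maximum heap (-Xmx setting) to allocate to the block cache used by HFile/StoreFile. The default of 0.25 means allocate 25%. Set to 0 to disable, but this is not recommended.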
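To see how the heap fractions in this section add up, the sketch below applies the listed defaults (block cache 0.25, memstore upper limit 0.4, memstore lower limit 0.35) to a 4096 MB heap. The heap size is just an example value and the function name is illustrative.

```python
def heap_budget_mb(heap_mb: float,
                   block_cache_fraction: float = 0.25,
                   memstore_upper: float = 0.4,
                   memstore_lower: float = 0.35) -> dict:
    """Split a RegionServer heap according to the configured fractions."""
    return {
        "block_cache_mb": heap_mb * block_cache_fraction,
        "memstore_upper_mb": heap_mb * memstore_upper,
        "memstore_lower_mb": heap_mb * memstore_lower,
    }


budget = heap_budget_mb(4096)
for name, size in budget.items():
    print(name, round(size, 1))
```

With these defaults the block cache and the memstores together can claim about 65% of the heap, leaving the rest for everything else the RegionServer does.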
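The hashing algorithm used by hash functions. Two values are supported: murmur (MurmurHash) and jenkins (JenkinsHash). This hash is used by bloom filters.
Default: murmur
Default: false
Default: 131072
Default: 2
Default: 131072
Default: false
Whether an HFile block should be added to the block cache when the block is finished.
Default: false
Implementation of org.apache.hadoop.hbase.ipc.RpcServerEngine to be used for server RPC call marshalling.
Default: org.apache.hadoop.hbase.ipc.ProtobufRpcServerEngine
Set no delay on RPC socket connections. See http://docs.oracle.com/javase/1.5.0/docs/api/java/net/Socket.html#getTcpNoDelay()
Default: true
Full path to the Kerberos keytab file used by the HMaster server to log in. (Translator's note: HBase uses Kerberos for security.)
For example, "hbase/_HOST@EXAMPLE.COM". The Kerberos principal name that should be used to run the HMaster process. The principal name should be in the form user/hostname@DOMAIN. If "_HOST" is used as the hostname portion, it is replaced with the actual hostname of the running instance.
Full path to the Kerberos keytab file used by the HRegionServer to log in.
For example, "hbase/_HOST@EXAMPLE.COM". The Kerberos principal name that should be used to run the HRegionServer process. The principal name should be in the form user/hostname@DOMAIN. If "_HOST" is used as the hostname portion, it is replaced with the actual hostname of the running instance. An entry for this principal must exist in the file specified in hbase.regionserver.keytab.file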
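The "_HOST" substitution described above can be pictured as follows. This is an illustrative sketch of the rule as the text states it, not the code HBase or Hadoop actually runs; the function name and the sample hostname are assumptions.

```python
def resolve_principal(principal: str, hostname: str) -> str:
    """Replace the "_HOST" hostname portion of user/hostname@DOMAIN
    with the hostname of the running instance."""
    user_and_host, at, realm = principal.partition("@")
    user, slash, host = user_and_host.partition("/")
    if host != "_HOST":
        # Nothing to substitute; return the principal unchanged.
        return principal
    return user + slash + hostname + at + realm


print(resolve_principal("hbase/_HOST@EXAMPLE.COM", "master.example.org"))
```

Using "_HOST" lets every node share one configuration file while still logging in with its own principal.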
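Default: hbase-policy.xml
Comma-separated list of users or groups who are allowed full privileges, regardless of stored ACLs, across the cluster. Only used when HBase security is enabled.
The update interval for master key for authentication tokens in servers in milliseconds. Only used when HBase security is enabled.
Default: 86400000
The maximum lifetime in milliseconds after which an authentication token expires. Only used when HBase security is enabled.
Default: 604800000
ZooKeeper session timeout. HBase passes this value to the zk cluster as its suggested maximum session timeout. See http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions: "The client sends a requested timeout, the server responds with the timeout that it can give the client." In milliseconds.
Default: 180000
Default: /hbase
Path to the ZNode holding the root region location. It is written by the Master and read by clients and region servers. If a relative path is given, the parent folder is ${zookeeper.znode.parent}. By default, this means the root location is stored at /hbase/root-region-server.
Default: root-region-server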
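The relative-versus-absolute rule for the root region ZNode can be sketched with ordinary path joining. This mirrors the behavior described in the text and is not HBase's own code; the function name and the absolute example path are illustrative.

```python
import posixpath


def resolve_znode(parent: str, path: str) -> str:
    """Resolve a ZNode path: relative paths go under the parent ZNode,
    absolute paths are used as they are."""
    return posixpath.join(parent, path)


# Defaults from this section: parent /hbase, path root-region-server.
print(resolve_znode("/hbase", "root-region-server"))

# An absolute path ignores the parent (example path is made up).
print(resolve_znode("/hbase", "/custom/root-region-server"))
```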
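Root ZNode for access control lists.
Default: acl
A comma-separated list of org.apache.hadoop.hbase.coprocessor.MasterObserver coprocessors that are loaded by default on the active HMaster process. For any implemented coprocessor methods, the listed classes are called in order. After implementing your own MasterObserver, put it on HBase's classpath and add its fully qualified class name here.
Default: localhost
Default: 2888
Default: 3888
Instructs HBase to make use of ZooKeeper's multi-update functionality. This allows certain ZooKeeper operations to complete more quickly and prevents some issues with rare replication failure scenarios (see the release note of HBASE-2611 for an example). IMPORTANT: only set this to true if all ZooKeeper servers in the cluster are on version 3.4 or later and will not be downgraded. ZooKeeper versions before 3.4 do not support multi-update and do not fail gracefully if multi-update is invoked (see ZOOKEEPER-1495).
Default: false
Property from ZooKeeper's config zoo.cfg. The number of ticks that the initial synchronization phase can take.
Default: 10
Property from ZooKeeper's config zoo.cfg. The number of ticks that can pass between sending a request and getting an acknowledgment.
Default: 5
Property from ZooKeeper's config zoo.cfg. The directory where the snapshot is stored.
Default: ${hbase.tmp.dir}/zookeeper
Property from ZooKeeper's config zoo.cfg. The port at which clients connect.
Default: 2181
Property from ZooKeeper's config zoo.cfg. Limit on the number of concurrent requests that a single client, identified by IP address, may make to a single member of the ZooKeeper ensemble. Set this high to avoid problems in standalone and pseudo-distributed modes.
Default: 300
The port for the HBase REST server.
Default: 8080
Defines the mode the REST server is started in. Possible values are: false: all HTTP methods are permitted - GET/PUT/POST/DELETE. true: only the GET method is permitted.
Default: false
Set to true to skip the 'hbase.defaults.for.version' check. Setting this to true can be useful in contexts other than maven generation, for example when running in an IDE. You will want to set this boolean to true to avoid seeing the RuntimeException complaint: "hbase-default.xml file seems to be for and old version of HBase (\${hbase.version}), this version is X.X.X-SNAPSHOT"
Default: false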
Set to true to cause the hosting server (master or regionserver) to abort if a coprocessor throws a Throwable object that is not IOException or a subclass of IOException. Setting it to true might be useful in development environments where one wants to terminate the server as soon as possible to simplify coprocessor failure analysis.
Default: false
Set true to enable online schema changes. This is an experimental feature. There are known issues modifying table schemas at the same time a region split is happening so your table needs to be quiescent or else you have to be running with splits disabled.
Default: false
Set to true to enable locking the table in zookeeper for schema change operations. Table locking from master prevents concurrent schema modifications to corrupt table state.
Default: true
Does HDFS allow appends to files? This is an hdfs config. set in here so the hdfs client will do append support. You must ensure that this config. is true serverside too when running hbase (You will have to restart your cluster after setting it).
Default: true
The "core size" of the thread pool. New threads are created on every connection until this many threads are created.
Default: 16
The maximum size of the thread pool. When the pending request queue overflows, new threads are created until their number reaches this number. After that, the server starts dropping connections.
Default: 1000
The maximum number of pending Thrift connections waiting in the queue. If there are no idle threads in the pool, the server queues requests. Only when the queue overflows, new threads are added, up to hbase.thrift.maxQueuedRequests threads.
Default: 1000
The amount of off heap space to be allocated towards the experimental off heap cache. If you desire the cache to be disabled, simply set this value to 0.
Default: 0
Enable, if true, that file permissions should be assigned to the files written by the regionserver
Default: false
Default: 000
Whether to include the prefix "tbl.tablename" in per-column family metrics. If true, for each metric M, per-cf metrics will be reported for tbl.T.cf.CF.M, if false, per-cf metrics will be aggregated by column-family across tables, and reported for cf.CF.M. In both cases, the aggregated metric M across tables and cfs will be reported.
Default: true
Whether to report metrics about the time taken to perform operations on the region server. Get, Put, Delete, Increment, and Append can all have their times exposed through Hadoop metrics, per CF and per region.
Default: true
A comma-separated list of HFileCleanerDelegate classes invoked by the HFileCleaner service. These HFile cleaners are called in order, so put the cleaner that prunes the most files first. To implement your own HFileCleanerDelegate, put it on HBase's classpath and add its fully qualified class name here. Always add the default cleaners above to the list, because they will be overwritten by the configuration in hbase-site.xml.
Default: org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner
Timeout value for the Catalog Janitor from the regionserver to META.
Default: 600000
Timeout value for the Catalog Janitor from the master to META.
Default: 600000
Set to true to allow HBaseConfiguration to read the zoo.cfg file for ZooKeeper properties. Switching this to true is not recommended, since reading ZK properties from a zoo.cfg file has been deprecated.
Default: false
Default: true
Default: 100
The minimum number of threads in the REST server thread pool. The thread pool always has at least this many threads, so the REST server is ready to serve incoming requests. The default is 2.
Default: 2
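Because the HBase Master may move around, every client has to look to ZooKeeper for its current location. ZooKeeper is where these values are kept. Clients therefore need the location of the ZooKeeper ensemble before they can do anything else. Usually this location is kept in hbase-site.xml
Put the conf directory on your classpath so that hbase-site.xml can be found.
Minimally, an HBase client needs the hbase, hadoop, log4j, commons-logging, commons-lang, and ZooKeeper jars on its CLASSPATH
A basic client hbase-site.xml looks like this: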
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
    <description>The directory shared by region servers.
    </description>
  </property>
</configuration>
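The configuration used by a Java client is kept in an HBaseConfiguration instance. The factory method HBaseConfiguration.create() reads the contents of the first hbase-site.xml it finds (it also looks for hbase-default.xml; one ships inside the hbase.X.X.X.jar). You can also specify configuration directly, without using any hbase-site.xml:
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost"); // Here we are running zookeeper locally
Multiple ZooKeeper hosts can be given as a comma-separated list, just as in the hbase-site.xml file. This Configuration instance is then passed to HTable and similar instances.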
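Here is a simple example configuration for a ten-node HBase cluster. The configuration is basic, and the nodes are named example0, example1, and so on through example9. The HBase Master and the HDFS namenode run on node example0. RegionServers run on nodes example1 through example9. A 3-node ZooKeeper ensemble runs on example1, example2, and example3 on the default ports. ZooKeeper data is persisted to the directory /export/zookeeper. Below we show the main configuration files, hbase-site.xml, regionservers, and hbase-env.sh, which can be found in the conf directory.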
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/export/zookeeper</value>
    <description>Property from ZooKeeper's config zoo.cfg. The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://example0:9000/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>
This file lists the nodes that run RegionServers. In this example we run a RegionServer on every node except the first node, example0, which runs the HBase Master and the HDFS namenode.
example1
example2
example3
example4
example5
example6
example7
example8
example9
Below we use a diff to show how the hbase-env.sh file differs from the default. We set the HBase heap to 4G instead of the default 1G.
$ git diff hbase-env.sh
diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
index e70ebc6..96f8c27 100644
--- a/conf/hbase-env.sh
+++ b/conf/hbase-env.sh
@@ -31,7 +31,7 @@ export JAVA_HOME=/usr/lib//jvm/java-6-sun/
 # export HBASE_CLASSPATH=

 # The maximum amount of heap to use, in MB. Default is 1000.
-# export HBASE_HEAPSIZE=1000
+export HBASE_HEAPSIZE=4096

 # Extra Java runtime options.
 # Below are what we set by default. May only work with SUN JVM.
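Use rsync to copy the contents of the conf directory to all nodes of the cluster.
Below we list the important configurations. This section covers the required configurations and those worth a look. (Translator's note: Taobao's blog also covers this material in its detailed HBase performance tuning post.)
To change this configuration, edit hbase-site.xml, copy the changed file across the whole cluster, and restart.
We set this value high to save ourselves from fielding questions on the mailing lists from new users asking why their RegionServer died during a massive import. The usual cause is a long GC pause because their JVM is untuned. Our thinking is that someone who is not yet familiar with HBase cannot be expected to know everything, and we do not want to knock their confidence. Once they are more familiar with it, they can tune this setting themselves.
This setting defines the number of threads that handle user requests. The default of 10 is rather low, chiefly to prevent users from bringing down their region servers when they use a large write buffer with many concurrent clients. The rule of thumb is to keep this number low when the payload per request is large (in the MB range, such as big puts or scans using a large cache) and to raise it when the payload is small (gets, small puts, ICVs, deletes).
You can get a sense of whether you have too many or too few handlers by enabling RPC-level logging on an individual RegionServer, as described in Section, "Enabling RPC-level logging", and then tailing its log (queued requests consume memory).
You should consider enabling ColumnFamily compression. There are several options that reduce the size of StoreFiles and so reduce IO, lowering cost and in most cases improving performance.
See Appendix C, HBase Compression for more information.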
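Larger regions mean fewer regions on your cluster overall. In general, fewer regions make for a smoother-running cluster. (You can always split a big region manually later, so that a single hot region is spread over more nodes of the cluster.)
Fewer regions are better. Aim for between 20 and the low hundreds of regions per RegionServer, and adjust the region size to hit that number.
In 0.90.x the default region size is 256MB, and the upper bound on region size is 4Gb. In 0.92.x, because of HFile v2, much larger regions are supported (for example, 20Gb).
Region size is controlled by the hbase.hregion.max.filesize property. It can also be set per table via HTableDescriptor.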
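To see how region size drives the number of regions per RegionServer, here is a rough back-of-the-envelope sketch. The data volume and server count are hypothetical values chosen only for illustration, and the calculation assumes regions are full and evenly spread.

```python
def regions_per_server(total_data_gb: float,
                       region_size_gb: float,
                       region_servers: int) -> float:
    """Approximate regions per RegionServer for evenly spread, full regions."""
    total_regions = total_data_gb / region_size_gb
    return total_regions / region_servers


# Hypothetical cluster: 10240 GB of data on 10 RegionServers.
print(regions_per_server(10240, 4, 10))    # 4 GB regions
print(regions_per_server(10240, 20, 10))   # 20 GB regions
```

With 4 GB regions this hypothetical cluster sits above the suggested range, while 20 GB regions bring it back down to a few dozen per server.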
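Rather than letting HBase split your regions automatically, you can split them manually. [12] As the amount of data grows, splits keep happening. Manual splitting helps if you need to know how many regions you have at any given time, for example for long-term debugging or tuning. It is hard to trace region-level problems through the logs when regions keep being split and renamed. Data offlining bugs combined with an unknown number of regions leave you with little to go on. If an HLog or StoreFile was left unprocessed by HBase because of a strange bug and you only notice a day later, you can be sure that the regions are the same now as they were then, so you can restore or replay the data. You can also tune your compaction algorithm. If data grows uniformly, it is easy to cause split/compaction storms, because all the regions are about the same size. With manual splits, you can stagger scheduled compactions and splits to spread the IO load.
How do I turn off automatic splitting? Automatic splitting is governed by hbase.hregion.max.filesize in the configuration file. Setting it to Long.MAX_VALUE is not recommended, in case you forget about manual splits. A suggested setting is 100GB, which would result in major compactions taking at least an hour if that size is reached.
What is the optimal number of pre-split regions? That depends on your application. You can start low, for example with 10 pre-split regions per server, and watch the data grow over time. It is better to err on the side of too few regions; you can do a rolling split later. A more complicated answer is that the number depends on the largest storefile in your region, which grows as the data grows. You want that file to be large enough that the Store compaction selection algorithm compacts it only during a timed major compaction. Otherwise the algorithm may decide to run major compactions on many regions at once and your cluster will suffer a compaction storm. Note that such storms are caused by data growth, not by the decision to split manually.
If pre-splitting leaves you with regions that are too small, you can raise the major compaction interval by configuring HConstants.MAJOR_COMPACTION_PERIOD.
Use the org.apache.hadoop.hbase.util.RegionSplitter script to perform a network-IO-safe rolling split across the whole cluster.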
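A common administrative technique is to manage major compactions manually rather than letting HBase do it. By default HConstants.MAJOR_COMPACTION_PERIOD is one day, and major compactions may kick in when you least want them to, especially on a busy system. To turn off automatic major compactions, set the value to 0.
It is important to stress that major compactions are absolutely necessary for StoreFile cleanup. The only variable is when they occur. They can be administered through the HBase shell or via HBaseAdmin.
For more information about compactions and the compaction file selection process, see Section, "Compaction"
The LoadBalancer is a periodic operation run on the master to redistribute regions across the cluster. It is configured via hbase.balancer.period and defaults to 300000 (5 minutes).
See Section, "Load Balancing" for more information on the LoadBalancer.
If you see occasional delays of around 40ms in HBase operations, try the Nagle's setting. For example, see the user mailing list thread, Inconsistent scan performance with caching set to 1, in which enabling tcpNoDelay (translator's note: notcpdelay in the English original is a mistake) improved scan speeds. You may also want to look at the graphs at the tail of HBASE-7008 Set scanner caching to a better default (xie liang), where our Lars Hofhansl measures various data sizes with Nagle's on and off.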
[1] Be careful editing XML. Make sure you close all elements. Run your file through xmllint or similar to ensure well-formedness of your document after an edit session.
[2] The hadoop-dns-checker tool can be used to verify DNS is working correctly on the cluster. The project README file provides detailed instructions on usage.
[3] See Jack Levin's major hdfs issues note up on the user list.
[4] The requirement that a database requires upping of system limits is not peculiar to HBase. See for example the section Setting Shell Limits for the Oracle User in Short Guide to install Oracle 10 on Linux.
[5] A useful read on setting config on your hadoop cluster is Aaron Kimball's Configuration Parameters: What can you just ignore?
[6] The Cloudera blog post An update on Apache Hadoop 1.0 by Charles Zedlweski has a nice exposition on how all the Hadoop versions relate. It's worth checking out if you are having trouble making sense of the Hadoop version morass.
[7] Until recently only the branch-0.20-append branch had a working sync but no official release was ever made from this branch. You had to build it yourself. Michael Noll wrote a detailed blog, Building an Hadoop 0.20.x version for HBase 0.90.2, on how to build an Hadoop from branch-0.20-append. Recommended.
[8] Praveen Kumar has written a complementary article, Building Hadoop and HBase for HBase Maven application development.
[10] See Hadoop HDFS: Deceived by Xciever for an informative rant on xceivering.
[11] The pseudo-distributed vs fully-distributed nomenclature comes from Hadoop.
[12] See Section, "Pseudo-distributed Extras" for notes on how to start extra Masters and RegionServers when running pseudo-distributed.
[13] For the full list of ZooKeeper configurations, see ZooKeeper's zoo.cfg. HBase does not ship with a zoo.cfg, so you will need to browse the conf directory of an appropriate standalone ZooKeeper download.
[14] What follows is taken from the javadoc at the head of org.apache.hadoop.hbase.util.RegionSplitter, added to HBase post-0.90.0.
You cannot skip major versions when upgrading. To go from 0.20.x to 0.92.x, you must first upgrade from 0.20.x to 0.90.x and then from 0.90.x to 0.92.x.
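The no-skipping rule can be expressed as walking an ordered list of release lines. The sketch below uses only the three release lines named in the example above; the list and the function name are illustrative assumptions, not an exhaustive release history.

```python
# Release lines taken from the example in the text, oldest first.
RELEASE_LINES = ["0.20", "0.90", "0.92"]


def upgrade_path(current: str, target: str) -> list:
    """Return every release line that must be visited, in order,
    when upgrading from current to target without skipping any."""
    start = RELEASE_LINES.index(current)
    end = RELEASE_LINES.index(target)
    if end < start:
        raise ValueError("target release is older than the current release")
    return RELEASE_LINES[start:end + 1]


print(upgrade_path("0.20", "0.92"))
```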
See Section 2, "Configuration", paying particular attention to the information on Hadoop versions.
You will have to stop your old 0.94 cluster completely to upgrade. If you are replicating between clusters, both clusters will have to go down to upgrade. Make sure it is a clean shutdown so there are no WAL files laying around (TODO: Can 0.96 read 0.94 WAL files?). Make sure zookeeper is cleared of state. All clients must be upgraded to 0.96 too.
The API has changed in a few areas; in particular how you use coprocessors (TODO: MapReduce too?)
You will find that 0.92.0 runs a little differently to 0.90.x releases. Here are a few things to watch out for upgrading from 0.90.x to 0.92.0.
If you have no patience, here are the important things to know when upgrading.
To move to 0.92.0, all you need to do is shutdown your cluster, replace your hbase 0.90.x with hbase 0.92.0 binaries (be sure you clear out all 0.90.x instances) and restart (You cannot do a rolling restart from 0.90.x to 0.92.x -- you must restart). On startup, the .META. table content is rewritten removing the table schema from the info:regioninfo column. Also, any flushes done post first startup will write out data in the new 0.92.0 file format, HFile V2. This means you cannot go back to 0.90.x once you’ve started HBase 0.92.0 over your HBase data directory.
In 0.92.0, the hbase.hregion.memstore.mslab.enabled flag is set to true (see Section, "Long GC pauses"). In 0.90.x it was false. When it is enabled, memstores will step allocate memory in MSLAB 2MB chunks even if the memstore has zero or just a few small elements. This is fine usually but if you had lots of regions per regionserver in a 0.90.x cluster (and MSLAB was off), you may find yourself OOME'ing on upgrade because the thousands of regions * number of column families * 2MB MSLAB (at a minimum) puts your heap over the top. Set hbase.hregion.memstore.mslab.enabled to false or set the MSLAB size down from 2MB by setting hbase.hregion.memstore.mslab.chunksize to something less.
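The heap pressure described above is easy to estimate before upgrading. The sketch below computes the minimum MSLAB footprint as regions * column families * chunk size; the region and column family counts are hypothetical, and the function name is illustrative.

```python
def mslab_minimum_heap_mb(regions: int,
                          column_families: int,
                          chunk_mb: float = 2.0) -> float:
    """Minimum heap claimed by MSLAB: one chunk per memstore, and one
    memstore per column family per region."""
    return regions * column_families * chunk_mb


# Hypothetical RegionServer: 1000 regions with 3 column families each.
print(mslab_minimum_heap_mb(1000, 3))

# The same server with the chunk size lowered to 0.5 MB.
print(mslab_minimum_heap_mb(1000, 3, chunk_mb=0.5))
```

In this hypothetical case the 2MB default would claim about 6000 MB before any data is written, which is more than the 4096 MB heap used in the earlier example configuration.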
Previously, WAL logs on crash were split by the Master alone. In 0.92.0, log splitting is done by the cluster (see "HBASE-1364 [performance] Distributed splitting of regionserver commit logs"). This should cut down significantly on the amount of time it takes splitting logs and getting regions back online again.
In 0.92.0, Appendix E, HFile format version 2 indices and bloom filters take up residence in the same LRU used caching blocks that come from the filesystem. In 0.90.x, the HFile v1 indices lived outside of the LRU so they took up space even if the index was on a 'cold' file, one that wasn't being actively used. With the indices now in the LRU, you may find you have less space for block caching. Adjust your block cache accordingly. See Section 9.6.4, "Block Cache" for more detail. The block cache default size has been changed in 0.92.0 from 0.2 (20 percent of heap) to 0.25.
Run 0.92.0 on Hadoop 1.0.x (or CDH3u3 when it ships). The performance benefits are worth making the move. Otherwise, our Hadoop prescription is as it has been; you need an Hadoop that supports a working sync. See Section 2.3, "Hadoop".
If running on Hadoop 1.0.x (or CDH3u3), enable local read. See the Practical Caching presentation for ruminations on the performance benefits of 'going local' (and for how to enable local reads).
If you can, upgrade your zookeeper. If you can't, 3.4.2 clients should work against 3.3.X ensembles (HBase makes use of 3.4.2 API).
In 0.92.0, we've added an experimental online schema alter facility (see hbase.online.schema.update.enable). It's off by default. Enable it at your own risk. Online alter and splitting tables do not play well together so be sure your cluster is quiescent when using this feature (for now).
The webui has had a few additions made in 0.92.0. It now shows a list of the regions currently transitioning, recent compactions/flushes, and a process list of running processes (usually empty if all is well and requests are being handled promptly). Other additions include requests by region, a debugging servlet dump, etc.
(Translator's note: on-heap and off-heap are terms introduced by Terracotta. On-heap means Java objects are stored in memory managed by the GC, which is efficient, but the GC can manage only 2G of memory and sometimes becomes a performance bottleneck. Off-heap, also called BigMemory, is an alternative to the JVM's GC mechanism that stores data outside the GC; it is 100 times faster than DiskStore, and as of late 2012 cache sizes reach 350GB.)
A new cache was contributed to 0.92.0 to act as a solution between using the 「on-heap」 cache which is the current LRU cache the region servers have and the operating system cache which is out of our control. To enable, set 「-XX:MaxDirectMemorySize」 in hbase-env.sh to the value for maximum direct memory size and specify hbase.offheapcache.percentage in hbase-site.xml with the percentage that you want to dedicate to off-heap cache. This should only be set for servers and not for clients. Use at your own risk. See this blog post for additional information on this new experimental feature: http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/
0.92.0 adds two new features: multi-slave and multi-master replication. The way to enable this is the same as adding a new peer, so in order to have multi-master you would just run add_peer for each cluster that acts as a master to the other slave clusters. Collisions are handled at the timestamp level, which may or may not be what you want; this needs to be evaluated on a per-use-case basis. Replication is still experimental in 0.92 and is disabled by default; run it at your own risk.
On an OOME, we now have the JVM kill -9 the regionserver process so it goes down fast. Previously, a RegionServer might stick around after incurring an OOME, limping along in some wounded state. To disable this facility (we recommend you leave it in place), you'd need to edit the bin/hbase file. Look for the addition of the -XX:OnOutOfMemoryError="kill -9 %p" arguments (see [HBASE-4769] - 'Abort RegionServer Immediately on OOME').
0.92.0 stores data in a new format, Appendix E, HFile format version 2. As HBase runs, it will move all your data from HFile v1 to HFile v2 format. This auto-migration will run in the background as flushes and compactions run. HFile V2 allows HBase to run with larger regions/files. In fact, we encourage all HBasers going forward to tend toward Facebook axiom #1: run with larger, fewer regions. If you have lots of regions now -- more than 100s per host -- you should look into setting your region size up after you move to 0.92.0 (in 0.92.0, the default size is now 1G, up from 256M), and then running the online merge tool (see "HBASE-1621 merge tool should work on online cluster, but disabled table").
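HBase 0.90.x can be started on data written by HBase 0.20.x or HBase 0.89.x. There is no need to convert data files. HBase 0.89.x and 0.90.x do write out the names of region directories differently -- they name them with an md5 hash of the region name rather than a jenkins hash -- which means that once started, there is no going back to HBase 0.20.x.
Be sure to remove hbase-default.xml from your conf directory when upgrading. The 0.20.x version of this file has sub-optimal settings for 0.90.x HBase. hbase-default.xml is now bundled inside the HBase jar. If you would like to review its content, look at src/main/resources/hbase-default.xml in the source tree, or see Section 2.31.1, "HBase Default Configuration".
Finally, if upgrading from 0.20.x, check your .META. schema in the shell. In the past we recommended that users run with a 16KB MEMSTORE_FLUSHSIZE. Run hbase> scan '-ROOT-' in the shell. This displays the current .META. schema; check the MEMSTORE_FLUSHSIZE value. Is it 16KB (16384)? If so, you will need to change it (the default value is 64MB (67108864)). Run the script bin/set_meta_memstore_size.rb, which makes the necessary edit to your .META. schema. If you do not make this change, the cluster will be slow [15].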
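The HBase Shell is (J)Ruby's IRB with some HBase-specific commands added. Anything you can do in IRB, you should be able to do in the HBase Shell.
To run the HBase Shell, do as follows:
$ ./bin/hbase shell
Type help and then <RETURN> to see a listing of shell commands and options. Browse at least the paragraphs at the end of the help output for the gist of how variables and command arguments are entered. In particular, note that table names, rows, and columns must be quoted.
See Section 1.2.3, "Shell Exercises" for examples of basic shell operation.
For scripting examples, look for the files ending in *.rb that ship with HBase. To run one of these files:
$ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT
Create an .irbrc file and add your own customizations to it. A useful one is command history, so that your commands are saved across shell invocations:
$ more .irbrc
require 'irb/ext/save-history'
IRB.conf[:SAVE_HISTORY] = 100
IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history"
See the ruby documentation of .irbrc to learn about other possible configurations.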
To convert the date '08/08/16 20:56:29' from an HBase log into a timestamp, do:
hbase(main):021:0> import java.text.SimpleDateFormat
hbase(main):022:0> import java.text.ParsePosition
hbase(main):023:0> SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("08/08/16 20:56:29", ParsePosition.new(0)).getTime()
=> 1218920189000
To go the other direction:
hbase(main):021:0> import java.util.Date
hbase(main):022:0> Date.new(1218920189000).toString()
=> "Sat Aug 16 20:56:29 UTC 2008"
To produce output in exactly the HBase log format, see the SimpleDateFormat documentation.
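For readers who would rather do the same conversion from plain Java than from the shell, here is a minimal sketch using only the JDK. The class and method names are invented for this example, and it assumes the log timestamps are in UTC (as the shell output above suggests); adjust the time zone if your logs differ.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class LogDateSketch {
    private static final String LOG_PATTERN = "yy/MM/dd HH:mm:ss";

    // SimpleDateFormat is not thread-safe, so build a fresh one per call.
    private static SimpleDateFormat utcFormat() {
        SimpleDateFormat format = new SimpleDateFormat(LOG_PATTERN);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: logs are in UTC
        return format;
    }

    // Log date string -> milliseconds since the epoch.
    static long toTimestamp(String logDate) {
        try {
            return utcFormat().parse(logDate).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not in " + LOG_PATTERN + " format: " + logDate, e);
        }
    }

    // Milliseconds since the epoch -> log date string.
    static String toLogDate(long timestamp) {
        return utcFormat().format(new Date(timestamp));
    }

    public static void main(String[] args) {
        long ts = toTimestamp("08/08/16 20:56:29");
        System.out.println(ts);            // 1218920189000
        System.out.println(toLogDate(ts)); // 08/08/16 20:56:29
    }
}
```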
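You can set a debug switch in the shell to see more output -- for example, more of the stack trace when a command throws an exception:
hbase> debug <RETURN>
To enable DEBUG level logging in the shell, launch it with the -d option.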
$ ./bin/hbase shell -d
The following example is a slightly modified form of the one in the BigTable paper. There is a table called webtable. In this example, the anchor column family contains two columns (anchor:cnnsi.com, anchor:my.look.ca). A column name is made of its column family prefix and a qualifier. For example, the column contents:html is made up of the column family contents plus the qualifier html.
Table 5.1. Table webtable
Row Key | Time Stamp | ColumnFamily contents | ColumnFamily anchor |
"com.cnn.www" | t9 | | anchor:cnnsi.com = "CNN" |
"com.cnn.www" | t8 | | anchor:my.look.ca = "CNN.com" |
"com.cnn.www" | t6 | contents:html = "<html>..." | |
"com.cnn.www" | t5 | contents:html = "<html>..." | |
"com.cnn.www" | t3 | contents:html = "<html>..." | |
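Although at a conceptual level a table may be viewed as a sparse set of rows, physically it is stored on a per-column-family basis. New columns can be added to a column family at any time without being declared first.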
Table 5.2. ColumnFamily anchor
Row Key | Time Stamp | Column Family anchor |
"com.cnn.www" | t9 | anchor:cnnsi.com = "CNN" |
"com.cnn.www" | t8 | anchor:my.look.ca = "CNN.com" |
Table 5.3. ColumnFamily contents
Row Key | Time Stamp | ColumnFamily "contents:" |
"com.cnn.www" | t6 | contents:html = "<html>..." |
"com.cnn.www" | t5 | contents:html = "<html>..." |
"com.cnn.www" | t3 | contents:html = "<html>..." |
For more information about the internals of how HBase stores data, see Section 9.7, 「Regions」.
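A column family in HBase is a grouping of columns. All column members of a column family share the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family prefix from the column name. The column family prefix must be composed of printable characters; the remainder (the qualifier) can be made of arbitrary bytes. Column families must be declared when the table is created, whereas columns need not be and can be added on the fly at any time.
Physically, all members of a column family are stored together on the filesystem. Because storage tunings are specified at the column family level, it is best if all members of a column family share the same general access pattern.
The four primary data model operations are Get, Put, Scan, and Delete. Operations are applied via HTable instances.
Get returns attributes for a specified row. Gets are executed via HTable.get.
Put either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists). Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer).
Scan allows iteration over multiple rows for specified attributes.
The following is an example of a Scan on an HTable instance. Assume that a table is populated with rows with keys "row1", "row2", "row3", and then another set of rows with the keys "abc1", "abc2", and "abc3". The example shows how startRow and stopRow can be applied to a Scan instance to return the rows beginning with "row".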
HTable htable = ...      // instantiate HTable
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
scan.setStartRow(Bytes.toBytes("row"));   // start key is inclusive
// The stop key is exclusive. A stop key of "row" + (char)0 sorts before "row1",
// so that range could only match a row keyed exactly "row". Use the prefix with
// its last byte incremented instead: 'w' + 1 = 'x'.
scan.setStopRow(Bytes.toBytes("rox"));
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
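The effect of the start and stop keys can be checked without a cluster. The sketch below is a stand-in, not HBase code: it uses a TreeSet of ASCII strings in place of byte-ordered row keys, and the helper names are hypothetical. It also shows why a stop key of "row" + (char)0 is too tight for a prefix scan: "row1" sorts after it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class PrefixScanSketch {
    // Smallest key that sorts after every key starting with the prefix.
    // Assumes a non-empty prefix whose last character can be incremented.
    static String stopKeyFor(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Mimics a scan: start key inclusive, stop key exclusive.
    static List<String> scan(TreeSet<String> rows, String start, String stop) {
        return new ArrayList<>(rows.subSet(start, true, stop, false));
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>(
                Arrays.asList("abc1", "abc2", "abc3", "row1", "row2", "row3"));
        System.out.println(scan(rows, "row", stopKeyFor("row"))); // [row1, row2, row3]
        System.out.println(scan(rows, "row", "row" + (char) 0));  // []
    }
}
```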
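Delete removes a row from a table. Deletes are executed via HTable.delete.
HBase does not modify data in place, so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compactions.
See Section, "Delete" for more information on deleting versions of columns, and see Section, "Compaction" for more information on compactions.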
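A {row, column, version} tuple specifies exactly one cell in HBase.
Row and column keys are expressed as byte arrays, while the version is a long integer. Typically this long holds a time instant such as that returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: "the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC."
Gets are implemented on top of Scans. The discussion of Get below applies equally to Scans.
To return versions other than the latest, see Get.setTimeRange().
To retrieve the latest version that is less than or equal to a given value, thus giving the 'latest' state of the record at a certain point in time, use a range from 0 to the desired version and set the max versions to 1.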
// Default Get: returns the current version
Get get = new Get(Bytes.toBytes("row1"));
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value
// Versioned Get: returns up to the last 3 versions
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3);  // will return last 3 versions of row
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value
List<KeyValue> kv = r.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns all versions of this column
// Put: the current time is used as the version
Put put = new Put(Bytes.toBytes(row));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes(data));
htable.put(put);
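There are three different types of internal delete markers [19]:
Delete: for a specific version of a column.
Delete column: for all versions of a column.
Delete family: for all columns of a particular column family.
See Section, "KeyValue" for more information on the internal KeyValue format.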
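Delete markers can mask puts that happen after them. [21] Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you delete everything <= T, and afterwards do a new put with a timestamp <= T. This put, even though it happened after the delete, is masked by the delete tombstone. The put will not fail, but when you do a get you will notice the put had no effect. It will start working again after a major compaction has run. This is not a problem if you always use increasing versions for new puts to a row. But it can occur even if you do not care about time: do a delete and a put immediately after each other, and there is some chance they land in the same millisecond.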
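The masking behavior can be illustrated with a deliberately simplified model of a single cell. This is not how HBase is implemented; the class below is a hypothetical toy that keeps versions in a TreeMap and a single "delete everything <= T" tombstone, which is just enough to reproduce the sequence described above.

```java
import java.util.Map;
import java.util.TreeMap;

class TombstoneSketch {
    private final TreeMap<Long, String> versions = new TreeMap<>();
    private long tombstone = Long.MIN_VALUE; // masks every version <= this timestamp

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    void deleteUpTo(long timestamp) {
        tombstone = Math.max(tombstone, timestamp);
    }

    // Returns the newest version that is not masked by the tombstone, or null.
    String get() {
        Map.Entry<Long, String> newest = versions.lastEntry();
        return (newest != null && newest.getKey() > tombstone) ? newest.getValue() : null;
    }

    // Drops the masked versions and the tombstone itself.
    void majorCompact() {
        versions.headMap(tombstone, true).clear();
        tombstone = Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        TombstoneSketch cell = new TombstoneSketch();
        cell.put(5L, "old");
        cell.deleteUpTo(10L);           // delete everything <= 10
        cell.put(7L, "late");           // happens after the delete, but timestamp <= 10
        System.out.println(cell.get()); // null: the put is masked
        cell.majorCompact();
        cell.put(7L, "late");
        System.out.println(cell.get()); // late: the tombstone is gone
    }
}
```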
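"Suppose a cell has three versions, t1, t2, and t3, with a maximum-versions setting of 2. When you request all versions, only t2 and t3 are returned. If you then delete t2 and t3, t1 is returned. But if a major compaction ran before the delete, nothing is returned at all. [22]"
All data model operations in HBase return data in sorted order: first by row, then by column family, followed by column qualifier, and finally by timestamp (sorted in reverse, so the newest records come first).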
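The sort order just described can be written down as a comparator. The sketch below is an illustration under stated assumptions, not the HBase KeyValue comparator: the Cell class is hypothetical, and ASCII strings stand in for the byte arrays HBase actually compares.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CellOrderSketch {
    static final class Cell {
        final String row, family, qualifier;
        final long timestamp;

        Cell(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "/" + timestamp;
        }
    }

    // Row, then family, then qualifier ascending; timestamp descending (newest first).
    static final Comparator<Cell> ORDER = new Comparator<Cell>() {
        @Override
        public int compare(Cell a, Cell b) {
            int c = a.row.compareTo(b.row);
            if (c != 0) return c;
            c = a.family.compareTo(b.family);
            if (c != 0) return c;
            c = a.qualifier.compareTo(b.qualifier);
            if (c != 0) return c;
            return Long.compare(b.timestamp, a.timestamp);
        }
    };

    static List<Cell> sorted(List<Cell> cells) {
        List<Cell> copy = new ArrayList<>(cells);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("row2", "cf", "a", 5),
                new Cell("row1", "cf", "b", 1),
                new Cell("row1", "cf", "a", 3),
                new Cell("row1", "cf", "a", 9));
        // [row1/cf:a/9, row1/cf:a/3, row1/cf:b/1, row2/cf:a/5]
        System.out.println(sorted(cells));
    }
}
```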
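No column metadata is stored outside of the internal KeyValue instances for a column family. Thus, while HBase supports not only a large number of columns per row but also a different set of columns from row to row, it is your responsibility to keep track of the column names.
The only way to get the complete set of columns that exist for a column family is to process all the rows. For more information about how HBase stores data internally, see Section, "KeyValue".
Whether HBase supports joins is a common question online, and there is a simple answer: it doesn't, at least not in the way that an RDBMS supports them (e.g., with equi-joins or outer-joins in SQL). As described in this chapter, the read data model operations are Get and Scan.
That doesn't mean equivalent join functionality can't be supported in your application, but you have to do it yourself. The two primary strategies are either denormalizing the data as it is written to HBase, or querying the tables and doing the join in your application or MapReduce code (and as RDBMSs demonstrate, there are several ways to do this depending on the size of the tables, e.g., nested loops vs. hash-joins). Which is better? It depends on what you are trying to do, so there is no single answer that fits every case.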
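As a rough sketch of the second strategy, the code below performs an in-memory hash join of the kind an application could run over rows it has already read from two tables. Everything here is hypothetical (the class name, the row layout as String arrays, the sample data); it only illustrates the build-then-probe idea, with the smaller input assumed to fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashJoinSketch {
    // customers: rows of {customerId, name}; orders: rows of {orderId, customerId}.
    // Returns "orderId,name" for every order whose customer exists (inner join).
    static List<String> hashJoin(List<String[]> customers, List<String[]> orders) {
        // Build phase: hash the smaller input on the join key.
        Map<String, String> nameById = new HashMap<>();
        for (String[] customer : customers) {
            nameById.put(customer[0], customer[1]);
        }
        // Probe phase: stream the larger input and look up each join key.
        List<String> joined = new ArrayList<>();
        for (String[] order : orders) {
            String name = nameById.get(order[1]);
            if (name != null) {
                joined.add(order[0] + "," + name);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> customers = Arrays.asList(
                new String[] {"c1", "Ann"}, new String[] {"c2", "Bob"});
        List<String[]> orders = Arrays.asList(
                new String[] {"o1", "c2"}, new String[] {"o2", "c1"}, new String[] {"o3", "c9"});
        System.out.println(hashJoin(customers, orders)); // [o1,Bob, o2,Ann]
    }
}
```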
[16] Currently, only the last written is fetchable.
[17] Yes
[18] See HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.
[19] See Lars Hofhansl's blog for discussion of his attempt at adding another, Scanning in HBase: Prefix Delete Marker
[20] When HBase does a major compaction, the data marked as deleted is actually removed, together with the delete markers themselves.
[21] HBASE-2256
[22] See Garbage Collection in Bending time in HBase
A good general introduction to the strengths and weaknesses of the various NoSQL databases is Ian Varley's Master's thesis, No Relation: The Mixed Blessings of Non-Relational Databases. Recommended. Also read Section, "KeyValue" for how HBase stores data internally.
HBase schemas can be created or updated with Chapter 4, HBase Shell or with HBaseAdmin in the Java API.
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "myTable";
admin.disableTable(table);
HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);      // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily
admin.enableTable(table);
See Section 2.3.4, "Client configuration and dependencies connecting to an HBase cluster" for more information about configuring client connections.
Note: online schema changes are supported in 0.92.x, but 0.90.x requires the table to be disabled.
When changes are made to tables or column families (e.g., region size, block size), they take effect the next time there is a major compaction and the StoreFiles are re-written.
See Section 9.7.5, "Store" for more information on StoreFiles.
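HBase currently does not do well with anything above two or three column families, so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per-Region basis, so if one column family is carrying the bulk of the data and triggers a flush, the adjacent families are flushed too, even though they carry little data. Compaction is currently triggered by the total number of files under a column family, not by their size. With many column families, the flushing and compaction interaction can make for a lot of needless I/O (to be addressed by changing flushing and compaction to work on a per-column-family basis). For more information on compactions, see Section, "Compaction".
Where multiple column families exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be spread across many, many regions (and RegionServers). This makes scans of ColumnFamilyA less efficient.
Tom White's book Hadoop: The Definitive Guide has a section describing a phenomenon to watch out for: an import process grinds along with all clients waiting on one of the table's regions (and thus, a single node), then moves on to the next region, and so on. This happens with monotonically increasing or time-ordered row keys. For details, see the comic by IKai, monotonically increasing values are bad. Sequential keys send writes that would otherwise be spread out to one region at a time, putting the load on a single machine. So avoid using a timestamp or a sequence (e.g., 1, 2, 3) as the row key where possible.
In HBase, a value is stored in the system as a cell, and locating a cell requires its row, column name, and timestamp. If your row and column names are large, especially compared to the size of the cell value, you may run into some interesting scenarios. One is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). HBase storefiles (Section, "StoreFile (HFile)") keep an index to facilitate random access to values, and when the cell coordinates are large, this index can end up occupying a large share of the memory allotted to HBase. To address this, you can set a larger block size, or use smaller row and column names. Compression will also make for larger indices. See the thread a question storefileIndexSize on the user mailing list.
Most of the time, small inefficiencies don't matter much. Unfortunately, this is a case where they do. Whatever is chosen for column families, attributes, and row keys will be repeated several hundred million times in your data. See Section, "KeyValue" for more information on how HBase stores data internally and why this is important.
Keep column family names as small as possible, preferably one character (e.g., "d" for data/default).
See Section, "KeyValue" for more information on how HBase stores data internally and why this is important.
Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") for storing in HBase.
See Section, "KeyValue" for more information on how HBase stores data internally and why this is important.
A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those 8 bytes. If you store such a number as a String -- presuming a byte per character -- you need nearly 3x the bytes.
Not convinced? Below is some sample code that you can run on your own.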
// long
//
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // returns 8
String s = "" + l;
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // returns 10
// hash
//
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // returns 16
String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26 (translator's note: measured 22)
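A common problem in database processing is quickly finding the most recent version of a value. Using reverse timestamps as part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly), the technique involves appending (Long.MAX_VALUE - timestamp) to the end of the key, e.g., [key][reverse_timestamp].
The most recent value for [key] in a table can be found by performing a Scan for [key] and taking the first record. Since HBase row keys are sorted, this key sorts before any older row key for [key] and is therefore first.
This technique can be used instead of Section 6.4, "Number of Versions" where the intent is to hold onto all versions "forever" (or for a very long time) while still being able to reach any other version quickly with the same Scan technique.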
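The key construction can be sketched with the JDK alone. The class and helper names below are invented for illustration, and the sketch assumes non-negative timestamps and an unsigned, byte-by-byte comparison of row keys; it checks that the key for a newer timestamp sorts before the key for an older one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ReverseTimestampSketch {
    // Builds [key][Long.MAX_VALUE - timestamp] as a byte array (big-endian long).
    static byte[] rowKey(String key, long timestamp) {
        byte[] prefix = key.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(Long.MAX_VALUE - timestamp)
                .array();
    }

    // Unsigned lexicographic comparison, standing in for row key ordering.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] older = rowKey("key", 1000L);
        byte[] newer = rowKey("key", 2000L);
        // The newer entry sorts first, so a Scan starting at "key" meets it first.
        System.out.println(compareBytes(newer, older) < 0); // true
    }
}
```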
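The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default is 3. This is an important setting because, as described in Chapter 5, Data Model, HBase does not overwrite a value; it only appends, and uses the timestamp to tell versions apart. Excess old versions are removed during major compactions. The number of versions can be increased or decreased depending on application needs.
Setting the maximum number of versions to a very high level (e.g., hundreds or more) is not recommended unless the old values are important to you, because it will make StoreFiles very large.
Like the maximum number of row versions, the minimum number of row versions is configured per column family via HColumnDescriptor. The default is 0, which means the feature is disabled. The minimum versions parameter is used together with time-to-live, allowing configurations such as "keep the last T seconds worth of data, at most N versions, but at least M versions" (where M is the minimum number of versions, M<N). This parameter should only be set when time-to-live is enabled for the column family, and it must be less than the number of row versions.
HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input can be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to Chapter 5, Data Model, and that includes versioning. Take that into consideration in your design, as well as the block size for the column family.
If you have multiple tables, don't forget to factor the potential for Section 5.11, "Joins" into the schema design.
A column family can set a TTL in seconds, and HBase will automatically delete data once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time in HBase is specified in UTC.
See HColumnDescriptor for more information.
A column family can optionally keep deleted cells. This means deleted cells can still be retrieved with Get or Scan operations, as long as these operations specify a time range that ends before the delete took effect. This allows point-in-time queries even in the presence of deletes.
Deleted cells are still subject to TTL, and there will never be more than "maximum number of versions" deleted cells. A new "raw" scan option returns all deleted rows and the delete markers.
See HColumnDescriptor for more information.
This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.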
There is no single answer on the best way to handle this because it depends on...
... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches.
It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard and handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.
Pay attention to Chapter 11, Performance Tuning when implementing any of these approaches.
Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase
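Depending on the case, it may be appropriate to use Section 9.4, "Client Request Filters". In this case, no secondary index is created. However, don't try a full scan of a large table like this from an application (e.g., a single-threaded client).
See Section 7.2.2, "HBase MapReduce Read/Write Example" for more information.
Another strategy is to build the secondary index while writing data to the cluster (e.g., write to the data table, write to the index table). If this approach is adopted after the data table already exists, a MapReduce job will be needed to build the secondary index for the existing data (see Section 6.9.2, "Periodic-Update Secondary Index").
Where time ranges are very wide (e.g., a year-long report) and the data is voluminous, summary tables are a common approach. They can be generated with MapReduce jobs into another table.
See Section 7.2.4, "HBase MapReduce Summary to HBase Example" for more information.
Coprocessors act like RDBMS triggers. They were added in 0.92. For more information, see Section 9.6.3, "Coprocessors".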
HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is in enforcing business rules for attributes in the table (eg. make sure values are in the range 1-10). Constraints could also be used to enforce referential integrity, but this is strongly discouraged as it will dramatically decrease the write throughput of the tables where integrity checking is enabled. Extensive documentation on using Constraints can be found at: Constraint since version 0.94.
This is effectively the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time periods. For a detailed explanation, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.
But this is how the general concept works: data is ingested, for example, in this manner…
[hostname][log-event][timestamp1] [hostname][log-event][timestamp2] [hostname][log-event][timestamp3]
… with separate rowkeys for each detailed event, but is re-written like this…
… and each of the above events are converted into columns stored with a time-offset relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.
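To make the time-offset idea concrete, here is a small sketch of the arithmetic only. It is not OpenTSDB's code: the class and method names are hypothetical, the 5-minute bucket follows the example in the text, and timestamps are assumed to be in epoch seconds. The bucket start would go into the rowkey and the offset into the column qualifier.

```java
class TimeBucketSketch {
    static final long BUCKET_SECONDS = 300; // 5-minute time range per row

    // Start of the time range that the event falls into (part of the rowkey).
    static long bucketStart(long epochSeconds) {
        return epochSeconds - (epochSeconds % BUCKET_SECONDS);
    }

    // Offset of the event from the start of its time range (the column).
    static int offsetInBucket(long epochSeconds) {
        return (int) (epochSeconds % BUCKET_SECONDS);
    }

    public static void main(String[] args) {
        long event = 1218920189L;
        System.out.println(bucketStart(event));    // 1218920100
        System.out.println(offsetInBucket(event)); // 89
    }
}
```

All events from the same host and time range then share one row, and each event becomes one column within it.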
Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type.
The Customer record type would include all the things that you’d typically expect:
The Order record type would include things like:
Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose the rowkey, and specifically a composite key such as:
[customer number][order number]
… for an ORDER table. However, there are more design decisions to make: are the raw values the best choices for rowkeys?
The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?) As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:
Composite Rowkey With Hashes:
Composite Numeric/Hash Combo Rowkey:
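As an illustration of the two fixed-length options named above, the following sketch builds both kinds of key with the JDK. The layouts are assumptions made for this example: an MD5 hash (16 bytes) for each hashed part, and a customer number that has already been mapped to a long for the numeric/hash variant. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class CompositeRowkeySketch {
    static byte[] md5(String value) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // [MD5 of customer number][MD5 of order number] = 16 + 16 bytes
    static byte[] hashedKey(String customerNumber, String orderNumber) {
        return ByteBuffer.allocate(32)
                .put(md5(customerNumber))
                .put(md5(orderNumber))
                .array();
    }

    // [customer number as a long][MD5 of order number] = 8 + 16 bytes
    static byte[] numericHashKey(long customerId, String orderNumber) {
        return ByteBuffer.allocate(24)
                .putLong(customerId)
                .put(md5(orderNumber))
                .array();
    }

    public static void main(String[] args) {
        System.out.println(hashedKey("CUST-42", "SO-1001").length); // 32
        System.out.println(numericHashKey(42L, "SO-1001").length);  // 24
    }
}
```

Either way, every key has the same length, and all orders for one customer share the same leading bytes, so they stay adjacent in the table.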
A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).
Customer Record Type Rowkey:
Order Record Type Rowkey:
The advantage of this particular CUSTOMER++ approach is that it organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it's not as easy to scan for a particular record-type.
Now we need to address how to model the Order object. Assume that the class structure is as follows:
Order ShippingLocation (an Order can have multiple ShippingLocations) LineItem (a ShippingLocation can have multiple LineItems)
... there are multiple options on storing this data.
With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.
The ORDER table's rowkey was described above: Section 6.11.3, 「Case Study - Customer/Order」
The SHIPPING_LOCATION's composite rowkey would be something like this:
The LINE_ITEM table's composite rowkey would be something like this:
Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase. The con of such an approach is that to retrieve information about any Order, you will need:
... granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you're just more aware of this fact.
With this approach, there would exist a single table ORDER that would contain
The Order rowkey was described above: Section 6.11.3, 「Case Study - Customer/Order」
The ShippingLocation composite rowkey would be something like this:
The LineItem composite rowkey would be something like this:
A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.
The LineItem composite rowkey would be something like this:
... and the LineItem columns would be something like this:
The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.
With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table's rowkey was described above: Section 6.11.3, 「Case Study - Customer/Order」, and a single column called "order" would contain an object that could be deserialized that contained a container Order, ShippingLocations, and LineItems.
There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes, so that older persisted structures can still be read back out of HBase.
Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.
This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.
A common question is whether one should prefer rows or HBase's built-in versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 3 max versions). The rows approach would require storing a timestamp in some portion of the rowkey so that rows would not be overwritten with each successive update.
Preference: Rows (generally speaking).
Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.
Preference: Rows (generally speaking). To be clear, this guideline applies to extremely wide cases, not to the standard use-case where one needs to store a few dozen or a few hundred columns. But there is also a middle path between these two options, and that is "Rows as Columns."
The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. OpenTSDB is the best example of this case where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see ???.
The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.
*** QUESTION ***
We're looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is store the majority of the data in a key, so we could have something like:
<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)
The other option we had was to do this entirely using:
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
where each row would contain multiple values. So in one case reading the first thirty values would be:
scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
And in the second case it would be
get 'FixedWidthUserName\x00\x00\x00\x00'
The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution)
The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?
My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we'll always need the same page size. I've ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we'd need to update all subsequent rows).
Thanks for help / suggestions / follow-up questions.
*** ANSWER ***
If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:
"user123, firstname, Paul", "user234, lastname, Smith"
(But the usernames are fixed width, and the valueids are fixed width).
And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?
The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed.
Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that. Doing it this way is generally recommended (see here #schema.smackdown).
Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I'm guessing you jumped to the "paginated" version because you're assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you're not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse. The client has methods that allow you to get specific slices of columns.
Note that neither case fundamentally uses more disk space than the other; you're just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George's excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).
A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :)
See Section 11.6, "Schema Design" in the Performance section for more information on operational and performance schema design options, such as Bloom Filters, table-configured region sizes, compression, and block sizes.
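See the javadocs for details on HBase and MapReduce. Below is some additional help. For more information about MapReduce in general (i.e., the framework), see the Hadoop MapReduce Tutorial.
When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter makes one map task for each region of the table. Thus, if a table has 100 regions, there will be 100 map-tasks for the job, regardless of how many column families are selected in the Scan.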
For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.
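The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is emitted from the Mapper.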
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper
boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
public class MyMapper extends TableMapper<Text, LongWritable> {
  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}
The following is an example of using HBase as both a source and a sink with MapReduce. This example simply copies data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
  sourceTable,      // input table
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,      // output table
  null,             // reducer class
  job);
job.setNumReduceTasks(0);
boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
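Some explanation is needed of what TableMapReduceUtil is doing, especially for the reducer. TableOutputFormat is used as the outputFormat class, several parameters are set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), the reducer output key is set to ImmutableBytesWritable, and the reducer value is set to Writable. These could be set by the programmer on the job and conf, but TableMapReduceUtil does it for you.
The following is the example mapper, which creates a Put matching the input Result and emits it. Note: this is what the CopyTable utility does.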
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
    context.write(row, resultToPut(row, value));
  }
  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
    Put put = new Put(key.get());
    for (KeyValue kv : result.raw()) {
      put.add(kv);
    }
    return put;
  }
}
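There is no real reducer step here, so TableOutputFormat takes care of sending the Put.
This is only an example; developers can choose not to use TableOutputFormat.
The following is an example of using HBase as both a source and a sink with MapReduce, with a summarization step. This example counts the number of distinct instances of a value in one table and writes the summarized counts to another table.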
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummary");
job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer
Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
  sourceTable,        // input table
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper class
  Text.class,         // mapper output key
  IntWritable.class,  // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,            // output table
  MyTableReducer.class,   // reducer class
  job);
job.setNumReduceTasks(1);   // at least one, adjust as required
boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
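In this example mapper, a column holding a String value is chosen as the value to summarize on. That value is emitted as the key from the mapper, and an IntWritable serves as the instance counter.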
public static class MyMapper extends TableMapper<Text, IntWritable> {

    private final IntWritable ONE = new IntWritable(1);
    private Text text = new Text();

    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
        text.set(val);     // we can only emit Writables...
        context.write(text, ONE);
    }
}
In the reducer, the "ones" are counted (just like any other MR example that does this), and then a Put is emitted.
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int i = 0;
        for (IntWritable val : values) {
            i += val.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(i));
        context.write(null, put);
    }
}
This is very similar to the summary example above, except that it uses HBase as the MapReduce source but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
    sourceTable,        // input table
    scan,               // Scan instance to control CF and attribute selection
    MyMapper.class,     // mapper class
    Text.class,         // mapper output key
    IntWritable.class,  // mapper output value
    job);
job.setReducerClass(MyReducer.class);    // reducer class
job.setNumReduceTasks(1);                // at least one, adjust as required
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting Puts.
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int i = 0;
        for (IntWritable val : values) {
            i += val.get();
        }
        context.write(key, new IntWritable(i));
    }
}
It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map of keys with the amounts to be incremented for each map task, and make one update per key during the cleanup method of the mapper. However, your mileage may vary depending on the number of rows to be processed and the number of unique keys.
In the end, the summary results are in HBase.
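The buffering idea described above can be sketched without any HBase dependency. The class below is a hypothetical helper (the names `BufferedCounter`, `record`, and `flush` are assumptions for illustration, not HBase API): `record` would be called from the mapper's `map` method, and `flush` from `cleanup`, with the sink standing in for a call such as `HTable.incrementColumnValue`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper: accumulates per-key increments in memory for one map task. */
final class BufferedCounter {
    private final Map<String, Long> pending = new HashMap<>();

    /** Called once per input row (e.g. from map()); no remote call is made here. */
    void record(String key, long amount) {
        pending.merge(key, amount, Long::sum);
    }

    /**
     * Called once at the end of the task (e.g. from cleanup()).
     * The sink stands in for an atomic increment against the target table.
     * Returns the number of updates issued, which is one per distinct key.
     */
    int flush(BiConsumer<String, Long> sink) {
        int updates = pending.size();
        pending.forEach(sink);
        pending.clear();
        return updates;
    }
}
```

The trade-off is memory for round trips: the map grows with the number of distinct keys seen by the task, so this is likely to help most when keys repeat often.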
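Sometimes it is more appropriate to generate summaries to an RDBMS. In that case the summary can be written directly to the RDBMS from a custom reducer. The setup method can connect to the RDBMS (the connection information can be passed via custom parameters in the context), and the cleanup method can close the connection.
It is critical to understand that the number of reducers for the job affects the summarization implementation, and this has to be designed into the reducer, whether it runs as a single reducer or as multiple reducers. Neither is right or wrong; it depends on your use case. Recognize that the more reducers are assigned to the job, the more simultaneous connections to the RDBMS are created. This will scale, but only up to a point.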
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private Connection c = null;

    public void setup(Context context) {
        // create DB connection...
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // do summarization
        // in this example the keys are Text, but this is just an example
    }

    public void cleanup(Context context) {
        // close db connection
    }
}
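In the end, the summary results are written to the RDBMS table.
Although the framework currently allows one HBase table as the input to a MapReduce job, other HBase tables can be accessed at the same time as ordinary tables. For example, within a MapReduce job an HTable instance can be created in the setup method of the Mapper.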
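public class MyMapper extends TableMapper<Text, LongWritable> {

    private HTable myOtherTable;

    @Override
    public void setup(Context context) throws IOException {
        // reuse the job configuration to open the second table
        myOtherTable = new HTable(context.getConfiguration(), "myOtherTable");
    }

    // map() uses 'myOtherTable' for lookups; cleanup() should close it
}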
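It is generally advisable to turn off speculative execution for MapReduce jobs that run against HBase. This can be done per job through the job configuration, or for the entire cluster. Speculative execution means duplicate tasks doing the same work, which is probably not what you want.
See Section, "Speculative Execution" for more information.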
Newer releases of HBase (>= 0.92) support optional SASL authentication of clients.
This describes how to set up HBase and HBase clients for secure connection to HBase resources.
HBase must have been built using the new maven profile for secure Hadoop/HBase: -P security. Secure Hadoop dependent classes are separated under a pseudo-module in the security/ directory and are only included if built with the secure Hadoop profile.
You need to have a working Kerberos KDC.
A HBase configured for secure client access is expected to be running on top of a secured HDFS cluster. HBase must be able to authenticate to HDFS services. HBase needs Kerberos credentials to interact with the Kerberos-enabled HDFS daemons. Authenticating a service should be done using a keytab file. The procedure for creating keytabs for HBase service is the same as for creating keytabs for Hadoop. Those steps are omitted here. Copy the resulting keytab files to wherever HBase Master and RegionServer processes are deployed and make them readable only to the user account under which the HBase daemons will run.
A Kerberos principal has three parts, with the form username/fully.qualified.domain.name@YOUR-REALM.COM. We recommend using hbase as the username portion.
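To make the three-part form concrete, here is a minimal sketch that splits a principal string into its username, host, and realm. The class `KerberosPrincipal` and its `parse` method are hypothetical names used only for illustration; they are not part of HBase or of any Kerberos library, and real principals may take other forms that this sketch rejects.

```java
/** Hypothetical value class for the three parts of a service principal. */
final class KerberosPrincipal {
    final String username;
    final String host;
    final String realm;

    private KerberosPrincipal(String username, String host, String realm) {
        this.username = username;
        this.host = host;
        this.realm = realm;
    }

    /** Parses "username/host@REALM"; rejects input that lacks any of the three parts. */
    static KerberosPrincipal parse(String principal) {
        int slash = principal.indexOf('/');
        int at = principal.lastIndexOf('@');
        if (slash <= 0 || at <= slash + 1 || at == principal.length() - 1) {
            throw new IllegalArgumentException("expected username/host@REALM: " + principal);
        }
        return new KerberosPrincipal(
                principal.substring(0, slash),
                principal.substring(slash + 1, at),
                principal.substring(at + 1));
    }
}
```

Parsing `hbase/fully.qualified.domain.name@YOUR-REALM.COM` with this sketch yields the username `hbase`, the host `fully.qualified.domain.name`, and the realm `YOUR-REALM.COM`.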
The following is an example of the configuration properties for Kerberos operation that must be added to the hbase-site.xml
file on every server machine in the cluster. Required for even the most basic interactions with a secure Hadoop configuration, independent of HBase security.
<property> <name>hbase.regionserver.kerberos.principal</name> <value>hbase/_HOST@YOUR-REALM.COM</value> </property> <property> <name>hbase.regionserver.keytab.file</name> <value>/etc/hbase/conf/keytab.krb5</value> </property> <property> <name>hbase.master.kerberos.principal</name> <value>hbase/_HOST@YOUR-REALM.COM</value> </property> <property> <name>hbase.master.keytab.file</name> <value>/etc/hbase/conf/keytab.krb5</value> </property>
Each HBase client user should also be given a Kerberos principal. This principal should have a password assigned to it (as opposed to a keytab file). The client principal's maxrenewlife
should be set so that it can be renewed enough times for the HBase client process to complete. For example, if a user runs a long-running HBase client process that takes at most 3 days, we might create this user's principal within kadmin
with: addprinc -maxrenewlife 3days
Long running daemons with indefinite lifetimes that require client access to HBase can instead be configured to log in from a keytab. For each host running such daemons, create a keytab with kadmin
or kadmin.local
. The procedure for creating keytabs for HBase service is the same as for creating keytabs for Hadoop. Those steps are omitted here. Copy the resulting keytab files to where the client daemon will execute and make them readable only to the user account under which the daemon will run.
Add the following to the hbase-site.xml file on every server machine in the cluster:
<property> <name>hbase.security.authentication</name> <value>kerberos</value> </property> <property> <name>hbase.security.authorization</name> <value>true</value> </property> <property> <name>hbase.rpc.engine</name> <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value> </property> <property> <name>hbase.coprocessor.region.classes</name> <value>org.apache.hadoop.hbase.security.token.TokenProvider</value> </property>
A full shutdown and restart of HBase service is required when deploying these configuration changes.
Add the following to the hbase-site.xml file on every client:
<property> <name>hbase.security.authentication</name> <value>kerberos</value> </property> <property> <name>hbase.rpc.engine</name> <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value> </property>
The client environment must be logged in to Kerberos from KDC or keytab via the kinit
command before communication with the HBase cluster will be possible.
Be advised that if the hbase.security.authentication
and hbase.rpc.engine
properties in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.
Once HBase is configured for secure RPC it is possible to optionally configure encrypted communication. To do so, add the following to the hbase-site.xml file on every client:
<property> <name>hbase.rpc.protection</name> <value>privacy</value> </property>
This configuration property can also be set on a per connection basis. Set it in the Configuration supplied to HTable:
Configuration conf = HBaseConfiguration.create(); conf.set("hbase.rpc.protection", "privacy"); HTable table = new HTable(conf, tablename);
Expect a ~10% performance penalty for encrypted communication.
Add the following to the hbase-site.xml file for every Thrift gateway:
<property> <name>hbase.thrift.keytab.file</name> <value>/etc/hbase/conf/hbase.keytab</value> </property> <property> <name>hbase.thrift.kerberos.principal</name> <value>$USER/_HOST@HADOOP.LOCALDOMAIN</value> </property>
Substitute the appropriate credential for $USER, and replace the keytab path shown above with the location of your own keytab file.
The Thrift gateway will authenticate with HBase using the supplied credential. No authentication will be performed by the Thrift gateway itself. All client access via the Thrift gateway will use the Thrift gateway's credential and have its privilege.
Add the following to the hbase-site.xml file for every REST gateway:
<property> <name>hbase.rest.keytab.file</name> <value>$KEYTAB</value> </property> <property> <name>hbase.rest.kerberos.principal</name> <value>$USER/_HOST@HADOOP.LOCALDOMAIN</value> </property>
Substitute the appropriate credential and keytab for $USER and $KEYTAB respectively.
The REST gateway will authenticate with HBase using the supplied credential. No authentication will be performed by the REST gateway itself. All client access via the REST gateway will use the REST gateway's credential and have its privilege.
It should be possible for clients to authenticate with the HBase cluster through the REST gateway in a pass-through manner via SPNEGO HTTP authentication. This is future work.
Newer releases of HBase (>= 0.92) support optional access control list (ACL-) based protection of resources on a column family and/or table basis.
This describes how to set up Secure HBase for access control, with an example of granting and revoking user permission on table resources provided.
You must configure HBase for secure operation. Refer to the section "Secure Client Access to HBase" and complete all of the steps described there.
You must also configure ZooKeeper for secure operation. Changes to ACLs are synchronized throughout the cluster using ZooKeeper. Secure authentication to ZooKeeper must be enabled or otherwise it will be possible to subvert HBase access control via direct client access to ZooKeeper. Refer to the section on secure ZooKeeper configuration and complete all of the steps described there.
With Secure RPC and Access Control enabled, client access to HBase is authenticated and user data is private unless access has been explicitly granted. Access to data can be granted at a table or per column family basis.
However, the following items have been left out of the initial implementation for simplicity:
Row-level or per value (cell): This would require broader changes for storing the ACLs inline with rows. It is a future goal.
Push down of file ownership to HDFS: HBase is not designed for the case where files may have different permissions than the HBase system principal. Pushing file ownership down into HDFS would necessitate changes to core code. Also, while HDFS file ownership would make applying quotas easy, and possibly make bulk imports more straightforward, it is not clear that it would offer a more secure setup.
HBase managed "roles" as collections of permissions: We will not model "roles" internally in HBase to begin with. We instead allow group names to be granted permissions, which allows external modeling of roles via group membership. Groups are created and manipulated externally to HBase, via the Hadoop group mapping service.
Access control mechanisms are mature and fairly standardized in the relational database world. The HBase implementation approximates current convention, but HBase has a simpler feature set than relational databases, especially in terms of client operations. We don't distinguish between an insert (new record) and update (of existing record), for example, as both collapse down into a Put. Accordingly, the important operations condense to four permissions: READ, WRITE, CREATE, and ADMIN.
Operation To Permission Mapping

Permission   Operation
Read         Get, Exists, Scan
Write        Put, Delete, Lock/UnlockRow, IncrementColumnValue, CheckAndDelete/Put, Flush, Compact
Create       Create, Alter, Drop
Admin        Enable/Disable, Split, Major Compact, Grant, Revoke, Shutdown

Permissions can be granted in any of the following scopes, though CREATE and ADMIN permissions are effective only at table scope.
Table
Read: User can read from any column family in table
Write: User can write to any column family in table
Create: User can alter table attributes; add, alter, or drop column families; and drop the table.
Admin: User can alter table attributes; add, alter, or drop column families; and enable, disable, or drop the table. User can also trigger region (re)assignments or relocation.
Column Family
Read: User can read from the column family
Write: User can write to the column family
There is also an implicit global scope for the superuser.
The superuser is a principal, specified in the HBase site configuration file, that has equivalent access to HBase as the 'root' user would on a UNIX derived system. Normally this is the principal that the HBase processes themselves authenticate as. Although future versions of HBase Access Control may support multiple superusers, the superuser privilege will always include the principal used to run the HMaster process. Only the superuser is allowed to create tables, switch the balancer on or off, or take other actions with global consequence. Furthermore, the superuser has an implicit grant of all permissions to all resources.
Tables have a new metadata attribute: OWNER, the user principal who owns the table. By default this will be set to the user principal who creates the table, though it may be changed at table creation time or during an alter operation by setting or changing the OWNER table attribute. Only a single user principal can own a table at a given time. A table owner will have all permissions over a given table.
Enable the AccessController coprocessor in the cluster configuration and restart HBase. The restart can be a rolling one. Complete the restart of all Master and RegionServer processes before setting up ACLs.
To enable the AccessController, modify the hbase-site.xml
file on every server machine in the cluster to look like:
<property> <name>hbase.coprocessor.master.classes</name> <value>org.apache.hadoop.hbase.security.access.AccessController</value> </property> <property> <name>hbase.coprocessor.region.classes</name> <value>org.apache.hadoop.hbase.security.token.TokenProvider, org.apache.hadoop.hbase.security.access.AccessController</value> </property>
The HBase shell has been extended to provide simple commands for editing and updating user permissions. The following commands have been added for access control list management:
grant <user> <permissions> <table> [ <column family> [ <column qualifier> ] ]
<permissions> is zero or more letters from the set "RWCA": READ('R'), WRITE('W'), CREATE('C'), ADMIN('A').
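The permission letters can be modelled as a small enum plus a parser. This is a sketch only: `PermissionLetters` and its nested `Action` enum are hypothetical names introduced here and are not the classes HBase uses internally. It shows how a string such as "RW" maps to a set of permissions, and that an empty string maps to the empty set.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the "RWCA" permission letters accepted by the shell. */
final class PermissionLetters {

    enum Action {
        READ('R'), WRITE('W'), CREATE('C'), ADMIN('A');

        final char letter;

        Action(char letter) {
            this.letter = letter;
        }
    }

    /** Converts zero or more letters from "RWCA" into the corresponding set of actions. */
    static Set<Action> parse(String letters) {
        EnumSet<Action> result = EnumSet.noneOf(Action.class);
        for (char c : letters.toCharArray()) {
            Action match = null;
            for (Action a : Action.values()) {
                if (a.letter == c) {
                    match = a;
                }
            }
            if (match == null) {
                throw new IllegalArgumentException("unknown permission letter: " + c);
            }
            result.add(match);
        }
        return result;
    }
}
```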
Note: Grants and revocations of individual permissions on a resource are both accomplished using the grant command. A separate revoke command is also provided by the shell, but this is for fast revocation of all of a user's access rights to a given resource only.
revoke <user> <table> [ <column family> [ <column qualifier> ] ]
The alter command has been extended to allow ownership assignment:
alter 'tablename', {OWNER => 'username'}
User Permission
The user_permission command shows all access permissions for the current user for a given table:
user_permission <table>
Bulk loading in secure mode is a bit more involved than normal setup, since the client has to transfer the ownership of the files generated from the mapreduce job to HBase. Secure bulk loading is implemented by a coprocessor, named SecureBulkLoadEndpoint. SecureBulkLoadEndpoint uses a staging directory "hbase.bulkload.staging.dir", which defaults to /tmp/hbase-staging/. The algorithm is as follows.
Like delegation tokens the strength of the security lies in the length and randomness of the secret directory.
You have to enable the secure bulk load to work properly. You can modify the hbase-site.xml file on every server machine in the cluster and add the SecureBulkLoadEndpoint class to the list of regionserver coprocessors:
<property> <name>hbase.bulkload.staging.dir</name> <value>/tmp/hbase-staging</value> </property> <property> <name>hbase.coprocessor.region.classes</name> <value>org.apache.hadoop.hbase.security.token.TokenProvider, org.apache.hadoop.hbase.security.access.AccessController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value> </property>
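HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database is not an RDBMS, which supports SQL as its primary access language. There are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
However, HBase has many features which support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in storage and in processing capacity. An RDBMS can scale well, but only up to a point (specifically, the size of a single database server), and for the best performance it requires specialized hardware and storage devices. HBase features of note are:
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS does not do well with fewer than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop, but this should be considered a development configuration only.
HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and it does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.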
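The catalog tables -ROOT- and .META. exist as HBase tables. They are filtered out of the HBase shell's list command, but they are in fact tables just like any other.
-ROOT- keeps track of where the .META. table is. The -ROOT- table structure is as follows:
info:regioninfo (serialized HRegionInfo instance of .META.)
info:server (server:port of the RegionServer holding .META.)
info:serverstartcode (start time of the RegionServer process holding .META.)
.META. keeps a list of all regions in the system. The .META. table structure is as follows:
Key: region key of the format [table],[region start key],[region id]
info:regioninfo (serialized HRegionInfo instance for this region)
info:server (server:port of the RegionServer containing this region)
info:serverstartcode (start time of the RegionServer process containing this region)
While a table is being split, two additional columns are created, info:splitA and info:splitB, which represent the two daughter regions. The values of these columns are also serialized HRegionInfo instances. After the region has finally been split, this row is deleted.
Notes on HRegionInfo: the empty key is used to denote table start and table end. A region with an empty start key is the first region in a table. If a region has both an empty start key and an empty end key, it is the only region in the table.
In the (hopefully unlikely) event that programmatic processing of catalog metadata is required, see the Writables utility.
The META location is set in ROOT first. Then META is updated with the server and startcode values.
For information on region-RegionServer assignment, see Section 9.7.2, "Region-RegionServer Assignment".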
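The HBase client class HTable is responsible for finding the RegionServers that serve the rows of interest. It does this by first querying the .META. catalog table and then determining the location of the region. After locating the required region, the client contacts the RegionServer serving that region directly (without going through the master) and issues the read or write request. This information is cached in the client so that subsequent requests need not repeat the lookup. Should a region become stale (because the master load balancer moved it or because a RegionServer died), the client repeats the lookup to determine the new location.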
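The lookup-then-cache behaviour can be sketched with a sorted map keyed by region start key. This is an illustration only, not the HBase client implementation: `RegionLocationCache` and its methods are hypothetical names, keys are modelled as Java strings (HBase compares raw bytes), and the empty string plays the role of the empty start or end key.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical client-side cache of region locations for a single table. */
final class RegionLocationCache {

    private static final class Location {
        final String endKey;   // "" means the region extends to the end of the table
        final String server;

        Location(String endKey, String server) {
            this.endKey = endKey;
            this.server = server;
        }
    }

    // region start key -> location; "" marks the first region of the table
    private final TreeMap<String, Location> byStartKey = new TreeMap<>();

    /** Records the result of a catalog lookup. */
    void put(String startKey, String endKey, String server) {
        byStartKey.put(startKey, new Location(endKey, server));
    }

    /** Returns the cached server for the row, or null if a catalog lookup is needed. */
    String locate(String row) {
        Map.Entry<String, Location> e = byStartKey.floorEntry(row);
        if (e == null) {
            return null;
        }
        Location loc = e.getValue();
        boolean inRange = loc.endKey.isEmpty() || row.compareTo(loc.endKey) < 0;
        return inRange ? loc.server : null;
    }

    /** Drops a stale entry, e.g. after the region moved or its server died. */
    void invalidate(String startKey) {
        byStartKey.remove(startKey);
    }
}
```

A cache miss (a `null` return) is the point at which a real client would go back to the catalog table and then call `put` with the fresh location.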
See Section 9.5.2, "Runtime Impact" for more information about the impact of the Master on HBase Client communication.
For connection configuration information, see Section 3.7, "Client configuration and dependencies connecting to an HBase cluster".
Sharing one Configuration instance lets the HTable instances share the same underlying connection, so this is preferred:
Configuration conf = HBaseConfiguration.create();
HTable table1 = new HTable(conf, "myTable");
HTable table2 = new HTable(conf, "myTable");
as opposed to this:
Configuration conf1 = HBaseConfiguration.create();
HTable table1 = new HTable(conf1, "myTable");
Configuration conf2 = HBaseConfiguration.create();
HTable table2 = new HTable(conf2, "myTable");
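For more information about how connections are handled in the HBase client, see HConnectionManager.
For applications which require high-end multithreaded access (e.g., web servers or application servers that may serve many application threads in a single JVM), see HTablePool.
If Section 11.10.4, "AutoFlush" is turned off on HTable, Puts are buffered on the client and sent to the RegionServers when the write buffer fills. flushCommits() should be invoked before the HTable instance is discarded so that buffered Puts are not lost.
For finer-grained control over the batching of Puts, see the batch methods on HTable.
Information on non-Java clients and custom protocols is covered in Chapter 10, External APIs.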
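Get and Scan instances can optionally be configured with filters, which are applied on the RegionServer.
Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups of Filter functionality.
FilterList represents a list of Filters with a relationship of FilterList.Operator.MUST_PASS_ALL or FilterList.Operator.MUST_PASS_ONE between the Filters. The following example shows an 'or' between two Filters (checking for either 'my value' or 'my other value' on the same attribute).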
FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    Bytes.toBytes("my value")
    );
list.add(filter1);
SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    Bytes.toBytes("my other value")
    );
list.add(filter2);
scan.setFilter(list);
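SingleColumnValueFilter can be used to test column values for equivalence (CompareOp.EQUAL), inequality (CompareOp.NOT_EQUAL), or ranges (e.g., CompareOp.GREATER). The following is an example of testing equivalence of a column to the String value 'my value'...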
SingleColumnValueFilter filter = new SingleColumnValueFilter( cf, column, CompareOp.EQUAL, Bytes.toBytes("my value") ); scan.setFilter(filter);
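There are several Comparator classes in the Filter package that deserve special mention. These Comparators are used in concert with other Filters, such as Section, "SingleColumnValueFilter".
RegexStringComparator supports regular expressions for value comparisons.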
RegexStringComparator comp = new RegexStringComparator("my.");   // any value that starts with 'my'
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    comp
    );
scan.setFilter(filter);
See the Oracle JavaDoc for supported RegEx patterns in Java.
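When choosing a pattern it helps to know whether it must match the whole value or may match anywhere inside it. The sketch below uses only java.util.regex and assumes that the comparator applies the pattern with find() semantics (a match anywhere in the value); that assumption is not stated in this section and should be checked against the RegexStringComparator source for your version. `RegexMatchDemo` is a hypothetical helper, not HBase API.

```java
import java.util.regex.Pattern;

/** Hypothetical helper showing java.util.regex "match anywhere" semantics. */
final class RegexMatchDemo {

    /** True if the pattern matches anywhere in the value (Matcher.find()). */
    static boolean matchesAnywhere(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }
}
```

Under that assumption, the pattern `my.` would match both "my value" and "oh my value", while `^my.` would match only values that begin with "my" followed by at least one more character.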
SubstringComparator can be used to determine if a given substring exists in a value. The comparison is case-insensitive.
SubstringComparator comp = new SubstringComparator("y val");   // looking for 'my value'
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    comp
    );
scan.setFilter(filter);
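See BinaryComparator.
As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to the values covered in the previous section.
FamilyFilter can be used to filter on the ColumnFamily. It is generally a better idea to select ColumnFamilies in the Scan than to do it with a Filter.
QualifierFilter can be used to filter based on Column (aka Qualifier) name.
ColumnPrefixFilter can be used to filter based on the leading portion of Column (aka Qualifier) names.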
A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row and for each involved column family. It can be used to efficiently get a subset of the columns in very wide rows.
Note: The same column qualifier can be used in different column families. This filter returns all matching columns.
Example: Find all columns in a row and family that start with "abc"
HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[] prefix = Bytes.toBytes("abc");
Scan scan = new Scan(row, row);   // (optional) limit to one row
scan.addFamily(family);           // (optional) limit to one family
Filter f = new ColumnPrefixFilter(prefix);
scan.setFilter(f);
scan.setBatch(10);                // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
    for (KeyValue kv : r.raw()) {
        // each kv represents a column
    }
}
rs.close();
MultipleColumnPrefixFilter behaves like ColumnPrefixFilter but allows specifying multiple prefixes.
Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the first column matching the lowest prefix and also seeks past ranges of columns between prefixes. It can be used to efficiently get discontinuous sets of columns from very wide rows.
Example: Find all columns in a row and family that start with "abc" or "xyz"
HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")};
Scan scan = new Scan(row, row);   // (optional) limit to one row
scan.addFamily(family);           // (optional) limit to one family
Filter f = new MultipleColumnPrefixFilter(prefixes);
scan.setFilter(f);
scan.setBatch(10);                // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
    for (KeyValue kv : r.raw()) {
        // each kv represents a column
    }
}
rs.close();
ColumnRangeFilter allows efficient intra-row scanning.
A ColumnRangeFilter can seek ahead to the first matching column for each involved column family. It can be used to efficiently get a 'slice' of the columns of a very wide row. i.e. you have a million columns in a row but you only want to look at columns bbbb-bbdd.
Note: The same column qualifier can be used in different column families. This filter returns all matching columns.
Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd" (inclusive)
HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[] startColumn = Bytes.toBytes("bbbb");
byte[] endColumn = Bytes.toBytes("bbdd");
Scan scan = new Scan(row, row);   // (optional) limit to one row
scan.addFamily(family);           // (optional) limit to one family
Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
scan.setFilter(f);
scan.setBatch(10);                // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
    for (KeyValue kv : r.raw()) {
        // each kv represents a column
    }
}
rs.close();
Note: Introduced in HBase 0.92
It is generally a better idea to use the startRow/stopRow methods on Scan for row selection; however, RowFilter can also be used.
This is primarily used for rowcount jobs. See FirstKeyOnlyFilter.
HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the Section 9.9.1, "NameNode".
If run in a multi-Master environment, all Masters compete to run the cluster. If the active Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to take over the Master role.
A common dist-list question is what happens to an HBase cluster when the Master goes down. Because the HBase client talks directly to the RegionServers, the cluster can still function in a "steady state." Additionally, per Section 9.2, 「Catalog Tables」 ROOT and META exist as HBase tables (i.e., are not resident in the Master). However, the Master controls critical functions such as RegionServer failover and completing region splits. So while the cluster can still run for a time without the Master, the Master should be restarted as soon as possible.
The methods exposed by HMasterInterface
are primarily metadata-oriented methods:
For example, when the HBaseAdmin
method disableTable
is invoked, it is serviced by the Master server.
The Master runs several background threads:
Periodically, and when there are not any regions in transition, a load balancer will run and move regions around to balance cluster load. See Section, "Balancer" for configuring this property.
See Section 9.7.2, "Region-RegionServer Assignment" for more information on region assignment.
Periodically checks and cleans up the .META. table. See Section 9.2.2, "META" for more information on META.
HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a Section 9.9.2, "DataNode".
The methods exposed by HRegionInterface contain both data-oriented and region-maintenance methods:
For example, when the HBaseAdmin
method majorCompact
is invoked on a table, the client is actually iterating through all regions for the specified table and requesting a major compaction directly to each region.
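The RegionServer runs a variety of background threads:
Coprocessors were added in 0.92. There is a thorough Blog Overview of CoProcessors for reference. Documentation will eventually move to this reference guide, but the blog is the most current information available at this time.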
The Block Cache is an LRU cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:
For more information, see the LruBlockCache source
Block caching is enabled by default for all the user tables which means that any read operation will load the LRU cache. This might be good for a large number of use cases, but further tunings are usually required in order to achieve better performance. An important concept is the working set size, or WSS, which is: "the amount of memory needed to compute the answer to a problem". For a website, this would be the data that's needed to answer the queries over a short amount of time.
The way to calculate how much memory is available in HBase for caching is:
number of region servers * heap size * hfile.block.cache.size * 0.85
The default value for the block cache is 0.25 which represents 25% of the available heap. The last value (85%) is the default acceptable loading factor in the LRU cache after which eviction is started. The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks. Here are some examples:
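The formula above is easy to evaluate for a planned cluster. The following is a minimal sketch of that arithmetic; `BlockCacheEstimate` and `availableBytes` are hypothetical names, and the inputs (number of servers, heap size, cache fraction) are whatever your own configuration uses.

```java
/** Hypothetical helper that evaluates the block cache sizing formula from the text. */
final class BlockCacheEstimate {

    /** Acceptable loading factor of the LRU cache before eviction starts, as stated in the text. */
    static final double ACCEPTABLE_LOAD_FACTOR = 0.85;

    /**
     * number of region servers * heap size * hfile.block.cache.size * 0.85
     *
     * @param regionServers       number of RegionServers in the cluster
     * @param heapBytesPerServer  JVM heap size of each RegionServer, in bytes
     * @param blockCacheFraction  value of hfile.block.cache.size (0.25 by default per the text)
     * @return bytes available for block caching across the cluster
     */
    static double availableBytes(int regionServers, long heapBytesPerServer, double blockCacheFraction) {
        return regionServers * (double) heapBytesPerServer * blockCacheFraction * ACCEPTABLE_LOAD_FACTOR;
    }
}
```

For instance, a single RegionServer with a 1 GB heap and the default fraction of 0.25 works out to roughly 217 MB of usable block cache.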
Your data isn't the only resident of the block cache, here are others that you may have to take into account:
Currently the recommended way to measure HFile indexes and bloom filters sizes is to look at the region server web UI and check out the relevant metrics. For keys, sampling can be done by using the HFile command line tool and looking for the average key size metric.
It's generally bad to use block caching when the WSS doesn't fit in memory. This is the case when you have for example 40GB available across all your region servers' block caches but you need to process 1TB of data. One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily. Here are two use cases:
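Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the Section, "MemStore" of the affected Section 9.7.5, "Store". This ensures that HBase has durable writes. Without the WAL, data could be lost if a RegionServer fails before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer.
The WAL is stored in HDFS under /hbase/.logs/
For more general information about the concept of write-ahead logs, see the Wikipedia Write-Ahead Log article.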
When a RegionServer crashes, it will lose its ephemeral lease in ZooKeeper...TODO
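When hbase.hlog.split.skip.errors is set to true (the default), any error encountered during the split is logged and the problematic WAL is moved to a directory under the HBase rootdir.
When it is set to false, the split fails. [23]
[23] See HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but that's it. We need to do more than just fail split if this flag is set.
[24] For background, see HBASE-2643 Figure how to deal with eof splitting logs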
Table (HBase table)
    Region (Regions for the table)
        Store (Store per ColumnFamily for each Region for the table)
            MemStore (MemStore for each Store for each Region for the table)
            StoreFile (StoreFiles for each Store for each Region for the table)
                Block (Blocks within a StoreFile within a Store for each Region for the table)
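For a description of what HBase files look like when written to HDFS, see Section 12.7.2, "Browsing HDFS for HBase Objects".
HBase scales by spreading regions across many servers. Thus if you have 16GB of data in only 2 regions on a 20-machine cluster, 18 of the machines are wasted.
See Section, "Bigger Regions" for more configuration information.
Existing region assignments are looked up in META. LoadBalancerFactory is invoked to assign the region; the default implementation is DefaultLoadBalancer.
Regions can also be moved periodically, see Section, "LoadBalancer".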
Over time, Region-RegionServer locality is achieved via HDFS block replication. The HDFS client does the following by default when choosing locations to write replicas:
Thus, HBase eventually achieves locality for a region after a flush or a compaction. In a RegionServer failover situation a RegionServer may be assigned regions with non-local StoreFiles (because none of the replicas are local), however as new data is written in the region, or the table is compacted and StoreFiles are re-written, they will become "local" to the RegionServer.
For more information, see HDFS Design on Replica Placement and also Lars George's blog on HBase and HDFS locality.
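Splits run unaided on the RegionServer; the Master does not participate. The RegionServer splits a region by taking it offline, splitting it, adding the daughter regions to META, opening the daughters on the parent's hosting RegionServer, and finally reporting the split to the Master. See Section, "Managed Splitting" for how to manage splits manually (and for why you might do this).
The default split policy can be overridden with a custom RegionSplitPolicy (HBase 0.94+). Typically a custom split policy should extend HBase's default split policy: ConstantSizeRegionSplitPolicy.
The policy can be set globally through the HBaseConfiguration used, or on a per table basis: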
HTableDescriptor myHtd = ...; myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());
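The hfile file format is based on the SSTable file described in the BigTable [2006] paper and on Hadoop's tfile (the unit test suite and the compression harness were taken directly from tfile). Schubert Zhang's blog post HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs gives a thorough introduction to HBase's hfile. Matteo Bertozzi has also written a detailed description, HBase I/O: HFile.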
For more information, see the HFile source code. Also see Appendix E, HFile format version 2 for information about the HFile v2 format that was included in 0.92.
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
For example, to view the contents of the file hdfs:// run the following command:
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://
For more information of what StoreFiles look like on HDFS with respect to the directory structure, see Section 12.7.2, 「Browsing HDFS for HBase Objects」.
StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis.
Compression happens at the block level within StoreFiles. For more information on compression, see Appendix C, Compression In HBase.
For more information on blocks, see the HFileBlock source code.
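The KeyValue class is the heart of data storage in HBase. KeyValue wraps a byte array and takes an offset and a length into the passed array that mark where to start interpreting the content as a KeyValue.
The KeyValue format inside a byte array is: keylength, valuelength, key, value.
The Key is further decomposed as: rowlength, row (i.e., the rowkey), columnfamilylength, columnfamily, columnqualifier, timestamp, keytype (e.g., Put, Delete).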
KeyValue instances are not split across blocks. For example, if there is an 8 MB KeyValue, even if the block-size is 64kb this KeyValue will be read in as a coherent block. For more information, see the KeyValue source code.
To emphasize the points above, examine what happens with two Puts for two different columns for the same row:
rowkey=row1, cf:attr1=value1
rowkey=row1, cf:attr2=value2
Even though these are for the same row, a KeyValue is created for each column:
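Key portion for Put #1:
rowlength ------------> 4
row -----------------> row1
columnfamilylength ---> 2
columnfamily --------> cf
columnqualifier ------> attr1
timestamp -----------> server time of Put
keytype -------------> Put
Key portion for Put #2:
rowlength ------------> 4
row -----------------> row1
columnfamilylength ---> 2
columnfamily --------> cf
columnqualifier ------> attr2
timestamp -----------> server time of Put
keytype -------------> Put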
It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within the KeyValue instance. The longer these identifiers are, the bigger the KeyValue is.
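To make the size effect concrete, here is a minimal sketch that adds up the bytes of a single KeyValue. The field widths are assumptions of this sketch (4-byte key and value lengths, a 2-byte row length, a 1-byte family length, an 8-byte timestamp, and a 1-byte key type); the class and method names are hypothetical and are not part of the HBase API.

```java
import java.nio.charset.StandardCharsets;

final class KeyValueSizeSketch {
    // Fixed-width fields, in bytes. These widths are assumptions of this sketch.
    static final int KEY_LENGTH_FIELD = 4;
    static final int VALUE_LENGTH_FIELD = 4;
    static final int ROW_LENGTH_FIELD = 2;
    static final int FAMILY_LENGTH_FIELD = 1;
    static final int TIMESTAMP_FIELD = 8;
    static final int KEY_TYPE_FIELD = 1;

    /** Bytes taken by the key portion: every identifier is stored in full. */
    static int keySize(String row, String family, String qualifier) {
        return ROW_LENGTH_FIELD + utf8Length(row)
                + FAMILY_LENGTH_FIELD + utf8Length(family)
                + utf8Length(qualifier)
                + TIMESTAMP_FIELD + KEY_TYPE_FIELD;
    }

    /** Bytes taken by the whole KeyValue: two length fields, the key, the value. */
    static int totalSize(String row, String family, String qualifier, String value) {
        return KEY_LENGTH_FIELD + VALUE_LENGTH_FIELD
                + keySize(row, family, qualifier) + utf8Length(value);
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Same row and value; only the family and qualifier names differ in length.
        System.out.println(totalSize("row1", "cf", "attr1", "value1"));
        System.out.println(totalSize("row1", "columnfamily", "attribute_number_one", "value1"));
    }
}
```

Because the row, family, and qualifier are repeated in every cell, the extra bytes of a long name are paid once per stored value, not once per table.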
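There are two types of compactions: minor and major. Minor compactions usually pick up a few of the smaller adjacent StoreFiles and rewrite them as one. Minor compactions do not drop deletes or expired cells; only major compactions do. Sometimes a minor compaction picks up all the StoreFiles in a Store, in which case it is effectively a major compaction. For a description of how a minor compaction picks files to compact, see the ASCII diagram in the Store source code.
After a major compaction runs there is a single StoreFile per Store, which usually helps performance. Caution: a major compaction rewrites all of the Store's data, and on a heavily loaded system this can be painful. Large systems therefore usually manage major compactions themselves; see Section, 「Managed Splitting」.
Compactions do not perform region merges. See Section 14.2.2, 「Merge」 for more information on region merging.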
To understand the core algorithm for StoreFile selection, there is some ASCII-art in the Store source code that will serve as useful reference. It has been copied below:
/* normal skew:
 *
 *         older ----> newer
 *     _
 *    | |   _
 *    | |  | |   _
 *  --|-|- |-|- |-|---_-------_-------  minCompactSize
 *    | |  | |  | |  | |  _  | |
 *    | |  | |  | |  | | | | | |
 *    | |  | |  | |  | | | | | |
 */
Important knobs:
hbase.hstore.compaction.ratio - Ratio used in compaction file selection algorithm (default 1.2f).
hbase.hstore.compaction.min (.90 hbase.hstore.compactionThreshold) - (files) Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).
hbase.hstore.compaction.max - (files) Maximum number of StoreFiles to compact per minor compaction (default 10).
hbase.hstore.compaction.min.size - (bytes) Any StoreFile smaller than this setting will automatically be a candidate for compaction. Defaults to hbase.hregion.memstore.flush.size (128 mb).
hbase.hstore.compaction.max.size (.92) - (bytes) Any StoreFile larger than this setting will automatically be excluded from compaction (default Long.MAX_VALUE).
The minor compaction StoreFile selection logic is size based, and selects a file for compaction when the file <= sum(smaller_files) * hbase.hstore.compaction.ratio.
This example mirrors an example from the unit test TestCompactSelection.
hbase.hstore.compaction.ratio = 1.0f
hbase.hstore.compaction.min = 3 (files)
hbase.hstore.compaction.max = 5 (files)
hbase.hstore.compaction.min.size = 10 (bytes)
hbase.hstore.compaction.max.size = 1000 (bytes)
The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.
This example mirrors an example from the unit test TestCompactSelection.
hbase.hstore.compaction.ratio = 1.0f
hbase.hstore.compaction.min = 3 (files)
hbase.hstore.compaction.max = 5 (files)
hbase.hstore.compaction.min.size = 10 (bytes)
hbase.hstore.compaction.max.size = 1000 (bytes)
The following StoreFiles exist: 100, 25, 12, and 12 bytes apiece (oldest to newest). With the above parameters, no compaction will be started: 100 and 25 are each larger than the sum of the newer files, and the two remaining 12-byte files are fewer than the minimum of 3 files.
This example mirrors an example from the unit test TestCompactSelection.
hbase.hstore.compaction.ratio = 1.0f
hbase.hstore.compaction.min = 3 (files)
hbase.hstore.compaction.max = 5 (files)
hbase.hstore.compaction.min.size = 10 (bytes)
hbase.hstore.compaction.max.size = 1000 (bytes)
The following StoreFiles exist: 7, 6, 5, 4, 3, 2, and 1 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 7, 6, 5, 4, 3.
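The selection rule can be sketched in a few lines of Java. This is a simplified model written for illustration, not the code from the Store class: file sizes are passed oldest to newest, the "newer files" sum is limited to the files that could join the same compaction, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class MinorCompactionSelectionSketch {

    /** Sum of up to 'count' file sizes starting at index 'from'. */
    static long sumFrom(long[] sizes, int from, int count) {
        long sum = 0;
        for (int i = from; i < sizes.length && i < from + count; i++) {
            sum += sizes[i];
        }
        return sum;
    }

    /**
     * Returns the sizes of the selected files, oldest first.
     * An empty list means no compaction is started.
     */
    static List<Long> select(long[] sizes, double ratio, int minFiles, int maxFiles,
                             long minSize, long maxSize) {
        int start = 0;
        // Old files above compaction.max.size are excluded outright.
        while (start < sizes.length && sizes[start] > maxSize) {
            start++;
        }
        // Walk from the oldest file and skip it while it is larger than both the
        // automatic-include limit and ratio * (sum of the newer candidate files).
        while (start < sizes.length
                && sizes.length - start >= minFiles
                && sizes[start] > Math.max(minSize,
                        (long) (sumFrom(sizes, start + 1, maxFiles - 1) * ratio))) {
            start++;
        }
        int end = Math.min(sizes.length, start + maxFiles);
        List<Long> selected = new ArrayList<>();
        if (end - start < minFiles) {
            return selected; // not enough files left to compact
        }
        for (int i = start; i < end; i++) {
            selected.add(sizes[i]);
        }
        return selected;
    }

    public static void main(String[] args) {
        // Parameters shared by the examples: ratio 1.0, min 3, max 5, min.size 10, max.size 1000.
        System.out.println(select(new long[] {100, 50, 23, 12, 12}, 1.0, 3, 5, 10, 1000));
        System.out.println(select(new long[] {100, 25, 12, 12}, 1.0, 3, 5, 10, 1000));
        System.out.println(select(new long[] {7, 6, 5, 4, 3, 2, 1}, 1.0, 3, 5, 10, 1000));
    }
}
```

In the four-file case the sketch returns an empty selection, because only two files remain once 100 and 25 are skipped. In the seven-file case every file is under the 10-byte min.size limit, so selection starts at the oldest file and is capped at five files.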
hbase.hstore.compaction.ratio. A large ratio (e.g., 10) will produce a single giant file. Conversely, a value of .25 will produce behavior similar to the BigTable compaction algorithm - resulting in 4 StoreFiles.
hbase.hstore.compaction.min.size. Because this limit represents the "automatic include" limit for all StoreFiles smaller than this value, this value may need to be adjusted downwards in write-heavy environments where many 1 or 2 mb StoreFiles are being flushed, because every file will be targeted for compaction and the resulting files may still be under the min-size and require further compaction, etc.
[25] For description of the development process -- why static blooms rather than dynamic -- and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBase-1200.
[26] The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch though again it starts with the one-lab work.
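HBase offers several ways to load data into tables. The most straightforward are to use a MapReduce job or the normal client API; however, neither is the most efficient method.
The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then loads the generated StoreFiles directly into a running cluster. Bulk loading uses less CPU and network resources than simply using the HBase API.
The HBase bulk load process consists of two main steps.
Generating the HBase data files (StoreFiles). The data is written in HBase's internal storage format so that it can later be loaded into the cluster efficiently.
To work efficiently, HFileOutputFormat must be configured so that each output HFile fits within a single region. To achieve this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner. HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the table's current region boundaries.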
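The idea behind partitioning by region boundary can be sketched without Hadoop at all. The class below is a hypothetical illustration, not the TotalOrderPartitioner API: it maps a row key to the index of the region whose start key is the greatest one not exceeding the row key. It compares Java strings, whereas HBase compares raw bytes, and it assumes the first region starts at the empty key.

```java
import java.util.Map;
import java.util.TreeMap;

final class RegionPartitionSketch {
    // Region start key -> partition index, kept in sorted order.
    private final TreeMap<String, Integer> partitionByStartKey = new TreeMap<>();

    /** Start keys must be passed in ascending order; the first is normally "". */
    RegionPartitionSketch(String... sortedStartKeys) {
        for (int i = 0; i < sortedStartKeys.length; i++) {
            partitionByStartKey.put(sortedStartKeys[i], i);
        }
    }

    /** Index of the region whose key range contains the row key. */
    int partitionFor(String rowKey) {
        Map.Entry<String, Integer> entry = partitionByStartKey.floorEntry(rowKey);
        // Keys sorting before every start key fall back to the first partition.
        return entry == null ? 0 : entry.getValue();
    }

    public static void main(String[] args) {
        RegionPartitionSketch partitioner = new RegionPartitionSketch("", "g", "p");
        for (String key : new String[] {"apple", "g", "horse", "zebra"}) {
            System.out.println(key + " -> partition " + partitioner.partitionFor(key));
        }
    }
}
```

When every reducer receives exactly one such key range, each output file covers a single region, which is the property the bulk load preparation step relies on.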
After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.
The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example:
$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH. (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase.)
Note: If the target table does not already exist in HBase, this tool will create the table automatically.
This tool will run quickly, after which point the new data will be visible in the cluster.
For more information about the referenced utilities, see Section 14.1.9, 「ImportTsv」 and Section 14.1.10, 「CompleteBulkLoad」.
Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.
The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.
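Because HBase runs on HDFS (and each StoreFile is written as a file on HDFS), it is important to understand the HDFS architecture, especially how it stores files, handles failovers, and replicates blocks.
See the Hadoop documentation on HDFS Architecture for more information.
[23] See HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it. We need to do more than just fail split if this flag is set.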
[24] For background, see HBASE-2643 Figure how to deal with eof splitting logs
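Currently most of the documentation on this topic is in the HBase Wiki. See also the Thrift API Javadoc.
Currently most of the documentation on REST is in the HBase Wiki on REST.
Currently most of the documentation on Thrift is in the HBase Wiki on Thrift.
Note: this feature was introduced in HBase 0.92.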
This allows the user to perform server-side filtering when accessing HBase over Thrift. The user specifies a filter via a string. The string is parsed on the server to construct the filter
A simple filter expression is expressed as: 「FilterName (argument, argument, ... , argument)」
You must specify the name of the filter followed by the argument list in parentheses. Commas separate the individual arguments.
If the argument represents a string, it should be enclosed in single quotes.
If it represents a boolean, an integer or a comparison operator like <, >, != etc. it should not be enclosed in quotes
The filter name must be one word. All ASCII characters are allowed except for whitespace, single quotes and parentheses.
The filter’s arguments can contain any ASCII character. If single quotes are present in the argument, they must be escaped by a preceding single quote
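The quoting rules above can be captured in a small helper. The following is a hypothetical client-side sketch, not part of the Thrift API: quote() wraps a string argument in single quotes and escapes embedded single quotes by doubling them, and filter() joins a filter name with already rendered arguments. Non-string arguments (booleans, integers, compare operators) are passed through unquoted.

```java
final class FilterStringSketch {

    /** Renders a string argument: single-quoted, with embedded quotes doubled. */
    static String quote(String argument) {
        return "'" + argument.replace("'", "''") + "'";
    }

    /** Builds "FilterName (arg, arg, ...)" from arguments that are already rendered. */
    static String filter(String name, String... renderedArguments) {
        if (name.isEmpty() || !name.matches("[^\\s'()]+")) {
            throw new IllegalArgumentException(
                    "Filter name must be one word without whitespace, quotes or parentheses");
        }
        return name + " (" + String.join(", ", renderedArguments) + ")";
    }

    public static void main(String[] args) {
        System.out.println(filter("KeyOnlyFilter"));
        System.out.println(filter("PrefixFilter", quote("Row")));
        System.out.println(filter("PrefixFilter", quote("O'Brien")));
        System.out.println(filter("ColumnRangeFilter", quote("abc"), "true", quote("xyz"), "false"));
    }
}
```

Keeping the quoting in one place avoids hand-escaping row keys or values that happen to contain a single quote.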
Currently, two binary operators – AND/OR and two unary operators – WHILE/SKIP are supported.
Note: the operators are all in uppercase
AND – as the name suggests, if this operator is used, the key-value must pass both the filters
OR – as the name suggests, if this operator is used, the key-value must pass at least one of the filters
SKIP – For a particular row, if any of the key-values don’t pass the filter condition, the entire row is skipped
WHILE - For a particular row, it continues to emit key-values until a key-value is reached that fails the filter condition
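The difference between the two unary operators, as described above, can be modelled on a single row held as a list of values. This is only an illustration of the stated semantics with hypothetical names; it does not use the HBase filter classes and says nothing about how the server implements them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class SkipWhileSketch {

    /** SKIP: if any key-value in the row fails the condition, the whole row is dropped. */
    static <T> List<T> skip(List<T> row, Predicate<T> passes) {
        for (T keyValue : row) {
            if (!passes.test(keyValue)) {
                return Collections.emptyList();
            }
        }
        return new ArrayList<>(row);
    }

    /** WHILE: emit key-values until the first one that fails the condition. */
    static <T> List<T> whileMatch(List<T> row, Predicate<T> passes) {
        List<T> emitted = new ArrayList<>();
        for (T keyValue : row) {
            if (!passes.test(keyValue)) {
                break;
            }
            emitted.add(keyValue);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Integer> row = java.util.Arrays.asList(0, 0, 1, 0);
        System.out.println(skip(row, v -> v == 0));
        System.out.println(whileMatch(row, v -> v == 0));
    }
}
```

For the same row and condition, SKIP is all-or-nothing while WHILE returns the leading run of matching key-values.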
Compound Filters: Using these operators, a hierarchy of filters can be created. For example: 「(Filter1 AND Filter2) OR (Filter3 AND Filter4)」
Parentheses have the highest precedence. The SKIP and WHILE operators are next and have the same precedence. The AND operator has the next highest precedence, followed by the OR operator.
For example:
A filter string of the form:「Filter1 AND Filter2 OR Filter3」
will be evaluated as:「(Filter1 AND Filter2) OR Filter3」
A filter string of the form:「Filter1 AND SKIP Filter2 OR Filter3」
will be evaluated as:「(Filter1 AND (SKIP Filter2)) OR Filter3」
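These precedence rules can be checked with a small recursive-descent sketch that adds the implied parentheses back. It is a hypothetical illustration, not the server-side parser: it only handles bare filter names (no argument lists), and treats "(" and ")" purely as grouping.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class FilterPrecedenceSketch {
    /** A parsed sub-expression; compound ones get parentheses when nested. */
    private static final class Expr {
        final String text;
        final boolean compound;
        Expr(String text, boolean compound) { this.text = text; this.compound = compound; }
        String grouped() { return compound ? "(" + text + ")" : text; }
    }

    private final Deque<String> tokens;

    private FilterPrecedenceSketch(String expression) {
        String spaced = expression.replace("(", " ( ").replace(")", " ) ").trim();
        tokens = new ArrayDeque<>(Arrays.asList(spaced.split("\\s+")));
    }

    /** Returns the expression with the grouping implied by operator precedence. */
    static String parenthesize(String expression) {
        FilterPrecedenceSketch parser = new FilterPrecedenceSketch(expression);
        Expr result = parser.parseOr();
        if (!parser.tokens.isEmpty()) {
            throw new IllegalArgumentException("Unexpected token: " + parser.tokens.peek());
        }
        return result.text;
    }

    private Expr parseOr() {               // lowest precedence
        Expr left = parseAnd();
        while ("OR".equals(tokens.peek())) {
            tokens.poll();
            left = new Expr(left.grouped() + " OR " + parseAnd().grouped(), true);
        }
        return left;
    }

    private Expr parseAnd() {
        Expr left = parseUnary();
        while ("AND".equals(tokens.peek())) {
            tokens.poll();
            left = new Expr(left.grouped() + " AND " + parseUnary().grouped(), true);
        }
        return left;
    }

    private Expr parseUnary() {            // SKIP and WHILE bind tighter than AND
        String token = tokens.peek();
        if ("SKIP".equals(token) || "WHILE".equals(token)) {
            tokens.poll();
            return new Expr(token + " " + parseUnary().grouped(), true);
        }
        return parsePrimary();
    }

    private Expr parsePrimary() {          // parentheses bind tightest
        String token = tokens.poll();
        if (token == null || token.isEmpty() || ")".equals(token)
                || "AND".equals(token) || "OR".equals(token)) {
            throw new IllegalArgumentException("Expected a filter name but found: " + token);
        }
        if ("(".equals(token)) {
            Expr inner = parseOr();
            if (!")".equals(tokens.poll())) {
                throw new IllegalArgumentException("Missing closing parenthesis");
            }
            return inner;
        }
        return new Expr(token, false);
    }

    public static void main(String[] args) {
        System.out.println(parenthesize("Filter1 AND Filter2 OR Filter3"));
        System.out.println(parenthesize("Filter1 AND SKIP Filter2 OR Filter3"));
    }
}
```

Each precedence level is one method, and a lower-precedence method calls the next higher one for its operands, which is what makes AND group before OR.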
LESS (<)
NO_OP (no operation)
The client should use the symbols (<, <=, =, !=, >, >=) to express compare operators.
BinaryComparator - This lexicographically compares against the specified byte array using Bytes.compareTo(byte[], byte[])
BinaryPrefixComparator - This lexicographically compares against a specified byte array. It only compares up to the length of this byte array.
RegexStringComparator - This compares against the specified byte array using the given regular expression. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator
SubStringComparator - This tests if the given substring appears in a specified byte array. The comparison is case insensitive. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator
The general syntax of a comparator is: ComparatorType:ComparatorValue
The ComparatorType for the various comparators is as follows:
BinaryComparator - binary
BinaryPrefixComparator - binaryprefix
RegexStringComparator - regexstring
SubStringComparator - substring
The ComparatorValue can be any value.
Example1: >, 'binary:abc'
will match everything that is lexicographically greater than "abc"
Example2: =, 'binaryprefix:abc'
will match everything whose first 3 characters are lexicographically equal to "abc"
Example3: !=, 'regexstring:ab*yz'
will match everything that does not match the regular expression ab*yz (an "a", then zero or more "b" characters, then "yz")
Example4: =, 'substring:abc123'
will match everything that contains the substring "abc123"
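As a rough model of what each comparator type means for an equality (=) test, the sketch below parses a ComparatorType:ComparatorValue string and applies it to a candidate. It is a hypothetical illustration that works on Java strings instead of byte arrays, and it assumes that the regex comparator looks for a match anywhere in the candidate.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class ComparatorSpecSketch {

    /** True when 'candidate' would satisfy an EQUAL (=) comparison against 'spec'. */
    static boolean equalMatch(String spec, String candidate) {
        int colon = spec.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Expected ComparatorType:ComparatorValue");
        }
        String type = spec.substring(0, colon);
        String value = spec.substring(colon + 1);
        switch (type) {
            case "binary":        // whole value must be identical
                return candidate.equals(value);
            case "binaryprefix":  // only the first value.length() characters are compared
                return candidate.startsWith(value);
            case "regexstring":   // assumption: a match anywhere in the candidate counts
                return Pattern.compile(value).matcher(candidate).find();
            case "substring":     // case-insensitive containment
                return candidate.toLowerCase(Locale.ROOT)
                        .contains(value.toLowerCase(Locale.ROOT));
            default:
                throw new IllegalArgumentException("Unknown comparator type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(equalMatch("binary:abc", "abcdef"));
        System.out.println(equalMatch("binaryprefix:abc", "abcdef"));
        System.out.println(equalMatch("regexstring:ab*yz", "abbbyz"));
        System.out.println(equalMatch("substring:abc123", "xxABC123yy"));
    }
}
```

A NOT_EQUAL (!=) comparison is simply the negation of the result returned here.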
<?
$_SERVER['PHP_ROOT'] = realpath(dirname(__FILE__).'/..');
require_once $_SERVER['PHP_ROOT'].'/flib/__flib.php';
flib_init(FLIB_CONTEXT_SCRIPT);
require_module('storage/hbase');
$hbase = new HBase('<server_name_running_thrift_server>', <port on which thrift server is running>);
$hbase->open();
$client = $hbase->getClient();
$result = $client->scannerOpenWithFilterString('table_name', "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))");
$to_print = $client->scannerGetList($result,1);
while ($to_print) {
  print_r($to_print);
  $to_print = $client->scannerGetList($result,1);
}
$client->scannerClose($result);
?>
「PrefixFilter (‘Row’) AND PageFilter (1) AND FirstKeyOnlyFilter ()」
will return all key-value pairs that match the following conditions:
1) The row containing the key-value should have prefix 「Row」
2) The key-value must be located in the first row of the table
3) The key-value pair must be the first key-value in the row
「(RowFilter (=, ‘binary:Row 1’) AND TimeStampsFilter (74689, 89734)) OR ColumnRangeFilter (‘abc’, true, ‘xyz’, false)」
will return all key-value pairs that match both the following conditions:
1) The key-value is in a row having row key 「Row 1」
2) The key-value must have a timestamp of either 74689 or 89734.
Or it must match the following condition:
1) The key-value pair must be in a column that is lexicographically >= abc and < xyz
「SKIP ValueFilter (0)」
will skip the entire row if any of the values in the row is not 0
Description: This filter doesn’t take any arguments. It returns only the key component of each key-value.
Syntax: KeyOnlyFilter ()
Example: "KeyOnlyFilter ()"
Description: This filter doesn’t take any arguments. It returns only the first key-value from each row.
Syntax: FirstKeyOnlyFilter ()
Example: "FirstKeyOnlyFilter ()"
Description: This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix
Syntax: PrefixFilter (‘<row_prefix>’)
Example: "PrefixFilter (‘Row’)"
Description: This filter takes one argument – a column prefix. It returns only those key-values present in a column that starts with the specified column prefix. The column prefix must be of the form: 「qualifier」
Example: "ColumnPrefixFilter(‘Col’)"
Description: This filter takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes. Each of the column prefixes must be of the form: 「qualifier」
Syntax:MultipleColumnPrefixFilter(‘<column_prefix>’, ‘<column_prefix>’, …, ‘<column_prefix>’)
Example: "MultipleColumnPrefixFilter(‘Col1’, ‘Col2’)"
Description: This filter takes one argument – a limit. It returns the first limit number of columns in the table
Syntax: ColumnCountGetFilter (<limit>)
Example: "ColumnCountGetFilter (4)"
Description: This filter takes one argument – a page size. It returns page size number of rows from the table.
Syntax: PageFilter (<page_size>)
Example: "PageFilter (2)"
Description: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows
Syntax: ColumnPaginationFilter (<limit>, <offset>)
Example: "ColumnPaginationFilter (3, 5)"
Description: This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row
Syntax: InclusiveStopFilter(‘<stop_row_key>’)
Example: "InclusiveStopFilter ('Row2')"
Description: This filter takes a list of timestamps. It returns those key-values whose timestamps matches any of the specified timestamps
Syntax: TimeStampsFilter (<timestamp>, <timestamp>, ... ,<timestamp>)
Example: "TimeStampsFilter (5985489, 48895495, 58489845945)"
Description: This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row
Syntax: RowFilter (<compareOp>, ‘<row_comparator>’)
Example: "RowFilter (<=, ‘binary:xyz’)"
Family Filter
Description: This filter takes a compare operator and a comparator. It compares each family name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that family
Syntax: FamilyFilter (<compareOp>, ‘<family_comparator>’)
Example: "FamilyFilter (>=, ‘binaryprefix:FamilyB’)"
Description: This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column
Syntax: QualifierFilter (<compareOp>,‘<qualifier_comparator>’)
Example: "QualifierFilter (=,‘Column1’)"
Description: This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value
Syntax: ValueFilter (<compareOp>,‘<value_comparator>’)
Example: "ValueFilter (!=, ‘Value’)"
Description: This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.
The filter can also take an optional boolean argument – dropDependentColumn. If set to true, the column we were depending on doesn’t get returned.
The filter can also take two more additional optional arguments – a compare operator and a value comparator, which are further checks in addition to the family and qualifier. If the dependent column is found, its value should also pass the value check and then only is its timestamp taken into consideration
Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>, <compare operator>, ‘<value comparator>’)
Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>)
Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’)
Example: "DependentColumnFilter (‘conf’, ‘blacklist’, false, >=, ‘zebra’)"
Example: "DependentColumnFilter (‘conf’, 'blacklist', true)"
Example: "DependentColumnFilter (‘conf’, 'blacklist')"
Description: This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.
This filter also takes two additional optional boolean arguments – filterIfColumnMissing and setLatestVersionOnly
If the filterIfColumnMissing flag is set to true the columns of the row will not be emitted if the specified column to check is not found in the row. The default value is false.
If the setLatestVersionOnly flag is set to false, it will test previous versions (timestamps) too. The default value is true.
These flags are optional; you must set either both of them or neither.
Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>’,<filterIfColumnMissing_boolean>, <latest_version_boolean>)
Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>’)
Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’, true, false)"
Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’)"
Description: This filter takes the same arguments and behaves same as SingleColumnValueFilter – however, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.
Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>',<latest_version_boolean>, <filterIfColumnMissing_boolean>)
Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>')
Example: "SingleColumnValueExcludeFilter (<=, ‘abc’, ‘FamilyA’, ‘Column1’, false, true)"
Example: "SingleColumnValueExcludeFilter (<=, ‘abc’, ‘FamilyA’, ‘Column1’)"
Description: This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.
If you don’t want to set the minColumn or the maxColumn – you can pass in an empty argument.
Syntax: ColumnRangeFilter (‘<minColumn>’, <minColumnInclusive_bool>, ‘<maxColumn>’, <maxColumnInclusive_bool>)
Example: "ColumnRangeFilter (‘abc’, true, ‘xyz’, false)"
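Chip Turner of Facebook wrote a pure C/C++ client. Check it out.
Multiple switches are a potential pitfall in the architecture. The most common configuration of lower-priced hardware is a 1Gbps uplink to another switch. This often-overlooked pinch point can easily become a bottleneck for cluster communication, especially when MapReduce jobs read and write large amounts of data over the uplink at the same time and saturate it.
(Port trunking physically connects multiple ports on a switch and logically bundles them into a single port with greater bandwidth, forming a trunk that balances load, provides a backup link, and expands bandwidth.)
If the switches in your rack have enough switching capacity to handle all the hosts at full speed, the next issue is how to automatically navigate a cluster that spans multiple racks. The easiest way to avoid problems when spanning multiple racks is to use port trunking to create a bonded uplink to the other racks. The downside of this method is the overhead of ports that could otherwise be used. For example, creating an 8Gbps port channel from rack A to rack B uses 8 of your 24 ports for inter-rack communication, which gives a poor ROI (return on investment), while using too few ports means you do not get the most out of your cluster.
Using 10Gbe links between racks will greatly increase performance. Make sure your switches support a 10Gbe uplink or allow for an expansion card; the latter lets you keep your ports for machines rather than uplinks.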
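Are all the network interfaces functioning correctly? Are you sure? See the troubleshooting case study in Section 13.3.1, 「Case Study #1 (Performance Issue On A Single Node)」.
Start with the wiki Performance Tuning page. It covers the main factors that affect performance: RAM, compression, JVM settings, and so on. After that, see the supplementary material below.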
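Enabling RPC-level logging on a RegionServer can be helpful for deep tuning. Once enabled, the amount of log output is huge, so it is not recommended to leave it on for long; look at only a short period. To enable RPC-level logging, open the RegionServer UI, click Log Level, and set the log level for the package org.apache.hadoop.ipc. Then tail the RegionServer log and analyze it.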
In the presentation Avoiding Full GCs with MemStore-Local Allocation Buffers, Todd Lipcon describes two cases of stop-the-world garbage collection that are common in HBase, especially during loading: the CMS failure mode (translator's note: CMS is a GC algorithm) and old-generation heap fragmentation. To address the first, start CMS earlier than the default by adding -XX:CMSInitiatingOccupancyFraction and setting it lower. Start at 60 or 70 percent (the lower you set the threshold, the more GCs are triggered and the more CPU time is used). To address the second, Todd added an experimental facility that must be explicitly enabled in HBase 0.90.x (it is on by default in 0.92.x): set the corresponding option to true in your Configuration. See the presentation for details. [27] Be aware that when enabled, each MemStore instance will occupy at least an MSLAB instance of memory. If you have thousands of regions, or lots of regions each with many column families, this allocation of MSLAB may be responsible for a good portion of your heap allocation and in an extreme case cause you to OOME. In this case, disable MSLAB, lower the amount of memory it uses, or host fewer regions per server.
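For more information about GC logs, see Section 12.2.3, 「JVM Garbage Collection Logs」.
The number of regions in HBase can be adjusted as described in Section 3.6.5, 「Bigger Regions」. See also Section 12.3.1, 「Region Size」.
This parameter essentially sets how many requests a RegionServer can handle concurrently. If it is set too high, throughput may actually suffer; if it is set too low, requests block and get no response. You can enable RPC-level logging and read the logs to decide what value suits your cluster. (Request queues also consume memory.)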
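See hfile.block.cache.size. This is a memory setting for the RegionServer process.
See hbase.regionserver.global.memstore.upperLimit. This memory setting is adjusted according to the needs of the RegionServer.
See hbase.regionserver.global.memstore.lowerLimit. This memory setting is adjusted according to the needs of the RegionServer.
If the RegionServer logs show blocking, increasing this value can help.
See hbase.hregion.memstore.block.multiplier. If there is enough RAM, increasing this value can help.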
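For information on configuring ZooKeeper, see Section 2.5, 「ZooKeeper」, in particular the part about using a dedicated disk.
See Section 6.3.2, 「Try to minimize row and column sizes」. See also Section, 「However...」 for compression caveats.
The region size can be set on a per-table basis via setFileSize on HTableDescriptor, for cases where certain tables need a region size different from the configured default.
See Section 11.4.1, 「Number of Regions」 for more information.
Bloom filters can be enabled per ColumnFamily. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms for a ColumnFamily. The default is NONE, for no bloom filters. With ROW, the hash of the row key is added to the bloom on each insert. With ROWCOL, the hash of the row key + column family + column qualifier is added to the bloom on each insert.
See HColumnDescriptor and Section 9.7.6, 「Bloom Filters」 for more information.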
The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved).
See HColumnDescriptor and Section 9.7.5, 「Store」 for more information.
ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the Section 9.6.4, 「Block Cache」, but it is not a guarantee that the entire table will be in memory.
See HColumnDescriptor for more information.
Production systems should use compression in their ColumnFamily definitions. See Appendix C, Compression In HBase for more information.
Compression deflates data on disk. When it's in-memory (e.g., in the MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated. So while using ColumnFamily compression is a best practice, it's not going to completely eliminate the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names.
See Section 6.3.2, 「Try to minimize row and column sizes」 for schema design tips, and Section, 「KeyValue」 for more information on how HBase stores data internally.
Get get = new Get(rowkey);
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value
However, especially inside loops (and MapReduce jobs), repeatedly converting the column family and column names to byte arrays is expensive. It is better to use byte-array constants, like this:
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Get get = new Get(rowkey);
Result r = htable.get(get);
byte[] b = r.getValue(CF, ATTR);  // returns current version of value
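Use the bulk load tool if you can; see Section 9.8, 「Bulk Loading」. Otherwise, pay attention to the points below.
By default, HBase creates a table with a single region. For bulk imports, this means that all clients write to the same region until it is large enough to split. A useful way to speed up a bulk import is to pre-create empty regions. Be somewhat conservative, because too many regions can actually degrade performance. An example of pre-creating regions follows. (Note: this example must be adjusted to the keys of your application.)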
public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
    throws IOException {
  try {
    admin.createTable( table, splits );
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for(int i=0; i < numRegions-1;i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}
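To see what split points this kind of calculation produces, the following self-contained sketch repeats the arithmetic of getHexSplits without any HBase classes and returns the split keys as strings. The class name and the example key range are hypothetical.

```java
import java.math.BigInteger;

final class HexSplitSketch {

    /** Evenly spaced split keys between two hex keys, as 16-digit hex strings. */
    static String[] hexSplits(String startKey, String endKey, int numRegions) {
        BigInteger lowest = new BigInteger(startKey, 16);
        BigInteger highest = new BigInteger(endKey, 16);
        BigInteger increment = highest.subtract(lowest).divide(BigInteger.valueOf(numRegions));
        // N regions need N-1 boundaries between them.
        String[] splits = new String[numRegions - 1];
        for (int i = 0; i < splits.length; i++) {
            BigInteger key = lowest.add(increment.multiply(BigInteger.valueOf(i + 1)));
            splits[i] = String.format("%016x", key);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Keyspace 0x0 to 0x100 divided into 4 regions.
        for (String split : hexSplits("0", "100", 4)) {
            System.out.println(split);
        }
    }
}
```

The split keys are only useful if the real row keys are hex strings spread over the same range; keys in another format would all land in one region.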
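By default, Puts go through the Write Ahead Log (WAL), and the resulting HLog edits are written immediately.
Deferred log flush can be configured per table via HTableDescriptor; see hbase.regionserver.optionallogflushinterval.
Puts added one at a time and Puts added via htable.add( <List> Put) end up in the same write buffer. If autoFlush = false, they are not sent until the write buffer is full.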
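An option for increasing throughput on Puts is to call writeToWAL(false). Turning it off means that the RegionServer no longer writes the Put to the Write Ahead Log, only to memory. The consequence, however, is that data will be lost if a RegionServer fails. If writeToWAL(false) is used, do so with extreme caution.
In general, it is best to use the WAL for Puts; where loading throughput is a concern, use bulk loading techniques instead.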
In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. There is a utility, HTableUtil, currently on TRUNK that does this; those still on 0.90.x or earlier can either copy it or implement their own version.
When writing a lot of data to an HBase table from a MR job (e.g., with TableOutputFormat), and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.
For summary jobs where HBase is used as both a source and a sink, writes will be coming from the Reducer step (e.g., summarize values then write out result). This is a different processing problem from the above case.
If all your data is being written to one region at a time, then re-read the section on processing timeseries data.
Also, if you are pre-splitting regions and all your data is still winding up in a single region even though your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a variety of reasons that regions may appear "well split" but won't work with your data. As the HBase client communicates directly with the RegionServers, this can be obtained via HTable.getRegionLocation.
See Section 11.8.2, 「 Table Creation: Pre-Creating Regions 」, as well as Section 11.4, 「HBase Configurations」.
If HBase is used as the input source for a MapReduce job, make sure that setCaching is configured on the input Scan.
Scan settings in MapReduce jobs deserve special attention. Timeouts (e.g., UnknownScannerException) can result in Map tasks if processing a batch of records takes so long that the scanner expires before the client goes back to the RegionServer for the next set of data. This problem can occur because there is non-trivial processing occurring per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower.
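A back-of-the-envelope way to pick a caching value is to bound the time spent on one batch by the scanner timeout. The sketch below is a hypothetical helper, not an HBase API; the timeout, the per-row processing time, and the safety factor are all inputs you would have to measure or look up for your own cluster.

```java
final class ScanCachingSketch {

    /**
     * Largest caching value for which processing one batch stays within
     * safetyFactor * scannerTimeoutMs, given the average per-row processing time.
     */
    static int maxSafeCaching(long scannerTimeoutMs, long perRowProcessingMs, double safetyFactor) {
        if (perRowProcessingMs <= 0) {
            throw new IllegalArgumentException("perRowProcessingMs must be positive");
        }
        long budgetMs = (long) (scannerTimeoutMs * safetyFactor);
        long rows = budgetMs / perRowProcessingMs;
        // Never go below one row per RPC, and stay within int range.
        return (int) Math.max(1, Math.min(rows, Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // Assumed 60 s scanner timeout and a 50% safety margin.
        System.out.println(maxSafeCaching(60_000, 10, 0.5));   // fast per-row processing
        System.out.println(maxSafeCaching(60_000, 200, 0.5));  // slow per-row processing
    }
}
```

The result is an upper bound only; memory use on the client and RegionServer also grows with the caching value, so a smaller number may still be the better choice.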
Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the processing that is often performed in MapReduce jobs tends to exacerbate this issue.
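Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected. If scan.addFamily is called, all of the attributes in the specified ColumnFamily are returned to the client.
For MapReduce jobs that use HBase tables as a source, if there is a pattern where the "slow" map tasks seem to have the same Input Split (i.e., the RegionServer serving the data), see the Troubleshooting Case Study in Section 13.3.1, 「Case Study #1 (Performance Issue On A Single Node)」.
This isn't so much about improving performance as about avoiding performance problems. If you forget to close ResultScanners you can cause problems on the RegionServers. Always enclose ResultScanner processing in a try/finally block...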
Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
htable.close();
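This is controlled by a method call. For a Scan that is the input source of a MapReduce job, set this value to false.
When scanning a table where only the row keys are needed (no families, qualifiers, values, or timestamps), add a FilterList to the scanner using setFilter, with an operator that requires every filter to pass (translator's note: equivalent to an AND operator). The FilterList should include both a FirstKeyOnlyFilter and a KeyOnlyFilter. With this filter combination, even in the worst case the RegionServer reads only a single value from disk, and network traffic to the client is minimized.
See Section 11.9.2, 「 Table Creation: Pre-Creating Regions 」, as well as Section 11.4, 「HBase Configurations」.
Bloom filters were developed in HBase-1200 Add bloomfilters. [28][29]
See Section 11.6.4, 「Bloom Filters」.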
Bloom filters add an entry to the StoreFile general FileInfo data structure and then two extra entries to the StoreFile metadata section.
FileInfo has a BLOOM_FILTER_TYPE entry which is set to NONE, ROW or ROWCOL.
BLOOM_FILTER_META holds Bloom Size, Hash Function used, etc. It's small in size and is cached on StoreFile.Reader load.
BLOOM_FILTER_DATA is the actual bloomfilter data. Obtained on-demand. Stored in the LRU cache, if it is enabled (it's enabled by default).
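io.hfile.bloom.error.rate = average false positive rate. Default = 1%. Decreasing the rate by ½ (e.g., to .5%) == +1 bit per bloom entry.
io.hfile.bloom.max.fold = guaranteed minimum fold rate. Most people should leave this alone. Default = 7, or can collapse to at least 1/128th of original size. For more on what this option means, see the Development Process section of the document BloomFilters in HBase.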
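To put numbers on the two settings above, here is a small sketch using the textbook Bloom filter sizing formula, bits per entry = -ln(p) / (ln 2)^2. This is an assumption for illustration and is not taken from HBase's own sizing code. Under the textbook formula, halving the error rate costs about 1.44 extra bits per entry, the same order of magnitude as the rule of thumb quoted above. The class BloomSizing and its methods are hypothetical names, and treating each fold as a halving is inferred from the "7, or 1/128th" figure in the text.

```java
/** Textbook Bloom filter arithmetic (hypothetical helper, not HBase's internal implementation). */
final class BloomSizing {

    private BloomSizing() {}

    /** Optimal bits per entry for a target false positive rate p: -ln(p) / (ln 2)^2. */
    static double bitsPerEntry(double falsePositiveRate) {
        if (falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0) {
            throw new IllegalArgumentException("rate must be strictly between 0 and 1");
        }
        double ln2 = Math.log(2.0);
        return -Math.log(falsePositiveRate) / (ln2 * ln2);
    }

    /** Smallest size, as a fraction of the original, after halving the filter maxFold times. */
    static double minFoldedFraction(int maxFold) {
        if (maxFold < 0 || maxFold > 62) {
            throw new IllegalArgumentException("maxFold must be between 0 and 62");
        }
        return 1.0 / (double) (1L << maxFold);
    }
}
```

With these definitions, bitsPerEntry(0.01) is about 9.6 bits, bitsPerEntry(0.005) is about 11.0 bits, and minFoldedFraction(7) is 1/128.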
HBase tables are sometimes used as queues. In this case, special care must be taken to regularly perform major compactions on tables used in this manner. As is documented in Chapter 5, Data Model, marking rows as deleted creates additional StoreFiles which then need to be processed on reads. Tombstones only get cleaned up with major compactions.
Be aware that htable.delete(Delete) doesn't use the writeBuffer. It will execute a RegionServer RPC with each invocation. For a large number of deletes, consider htable.delete(List).
Because HBase runs on Section 9.9, 「HDFS」, it is important to understand how HDFS works and how it affects HBase.
The original use-case for HDFS was batch processing. As such, low-latency reads were historically not a priority. With the increased adoption of HBase this is changing, and several improvements are already in development. See the Umbrella Jira Ticket for HDFS Improvements for HBase.
Since Hadoop 1.0.0 (also 0.22.1, 0.23.1, CDH3u3 and HDP 1.0) via HDFS-2246, it is possible for the DFSClient to take a "short circuit" and read directly from disk instead of going through the DataNode when the data is local. What this means for HBase is that the RegionServers can read directly off their machine's disks instead of having to open a socket to talk to the DataNode, the former being generally much faster[30]. Also see HBase, mail # dev - read short circuit thread for more discussion around short circuit reads.
To enable "short circuit" reads, you must set two configurations. First, the hdfs-site.xml needs to be amended. Set the property dfs.block.local-path-access.user to be the only user that can use the shortcut. This has to be the user that started HBase. Then in hbase-site.xml, set dfs.client.read.shortcircuit to be true
For optimal performance when short-circuit reads are enabled, it is recommended that HDFS checksums are disabled. To maintain data integrity with HDFS checksums disabled, HBase can be configured to write its own checksums into its datablocks and verify against these. See Section 11.4.9, 「hbase.regionserver.checksum.verify」.
The DataNodes need to be restarted in order to pick up the new configuration. Be aware that if a process started under another username than the one configured here also has the shortcircuit enabled, it will get an Exception regarding an unauthorized access but the data will still be read.
A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS will always be faster in this use-case.
Performance questions are common on Amazon EC2 environments because it is a shared environment. You will not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same reason (i.e., it's a shared environment and you don't know what else is happening on the server).
If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front, because EC2 issues are practically a separate class of performance issues.
For Performance and Troubleshooting Case Studies, see Chapter 13, Case Studies.
[27] The latest JVMs do better regarding fragmentation, so make sure you are running a recent release. Read down in the message, Identifying concurrent mode failures caused by fragmentation.
[28] For description of the development process -- why static blooms rather than dynamic -- and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBase-1200.
[29] The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch though again it starts with the one-lab work.
[30] See JD's Performance Talk
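Always start with the master log (TODO: Which lines?). Normally it just prints the same lines over and over again. If not, then there's an issue. Google or search-hadoop.com should return some hits for the exceptions you're seeing.
An error rarely comes alone in HBase. Usually when something goes wrong, what follows is a large number of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log up to where it all began. For example, RegionServers print some metrics when they abort, so grepping for Dump should get you near the start of the problem.
RegionServer suicides are "normal": this is what they do when something goes wrong. For example, if ulimit and xcievers (the two most important settings, see Section 2.2.5, 「ulimit and nproc」) aren't changed, HDFS will at some point stop working properly, and from HBase's point of view HDFS appears to be gone. Think about what would happen if your MySQL database was suddenly unable to access its file system; it's the same with HBase and HDFS. Another very common reason for RegionServers committing seppuku is a prolonged garbage collection pause that lasts longer than the ZooKeeper session timeout. For more information on GC pauses, see the 3 part blog post by Todd Lipcon and Section, 「Long GC pauses」 above.
The key process logs are located as follows (<user> is the user that started the service, and <hostname> is the machine name):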
NameNode: $HADOOP_HOME/logs/hadoop-<user>-namenode-<hostname>.log
DataNode: $HADOOP_HOME/logs/hadoop-<user>-datanode-<hostname>.log
JobTracker: $HADOOP_HOME/logs/hadoop-<user>-jobtracker-<hostname>.log
TaskTracker: $HADOOP_HOME/logs/hadoop-<user>-tasktracker-<hostname>.log
HMaster: $HBASE_HOME/logs/hbase-<user>-master-<hostname>.log
RegionServer: $HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log
ZooKeeper: TODO
The NameNode log is on the NameNode server. The HBase Master is typically run on the NameNode server, as is ZooKeeper.
For smaller clusters the JobTracker is typically run on the NameNode server as well.
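The naming pattern in the list above can be captured in a small helper, for example when a script needs to locate the right file on each host. This is a minimal sketch: the class LogPaths and its methods are hypothetical, and the pattern is taken only from the list above, so an installation that overrides its log directory will differ.

```java
/** Builds daemon log paths following the naming pattern listed above (hypothetical helper). */
final class LogPaths {

    private LogPaths() {}

    /** For example, daemon = "namenode", "datanode", "jobtracker" or "tasktracker". */
    static String hadoopLog(String hadoopHome, String user, String daemon, String hostname) {
        return hadoopHome + "/logs/hadoop-" + user + "-" + daemon + "-" + hostname + ".log";
    }

    /** For example, daemon = "master" or "regionserver". */
    static String hbaseLog(String hbaseHome, String user, String daemon, String hostname) {
        return hbaseHome + "/logs/hbase-" + user + "-" + daemon + "-" + hostname + ".log";
    }
}
```

The home directories, user, and hostname are whatever applies to your installation; the helper only assembles the string.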
Enabling the RPC-level logging on a RegionServer can often give insight on timings at the server. Once enabled, the amount of log spewed is voluminous. It is not recommended that you leave this logging on for more than short bursts of time. To enable RPC-level logging, browse to the RegionServer UI and click on Log Level. Set the log level to DEBUG for the package org.apache.hadoop.ipc (that's right, for hadoop.ipc, NOT hbase.ipc). Then tail the RegionServer's log. Analyze.
To disable, set the logging level back to INFO.
HBase is memory intensive, and using the default GC you can see long pauses in all threads including the Juliet Pause aka "GC of Death". To help debug this or confirm this is happening GC logging can be turned on in the Java virtual machine.
To enable, add the following to hbase-env.sh:
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hbase/logs/gc-hbase.log"
Adjust the log directory to wherever you log. Note: The GC log does NOT roll automatically, so you'll have to keep an eye on it so it doesn't fill up the disk.
At this point you should see logs like so:
64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64898.953: [CMS-concurrent-mark-start]
64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
In this section, the first line indicates a 0.0007360 second pause for the CMS to initially mark. This pauses the entire VM, all threads for that period of time.
The third line indicates a "minor GC", which pauses the VM for 0.0101110 seconds - aka 10 milliseconds. It has reduced the "ParNew" from about 5.5m to 576k. Later on in this cycle we see:
64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs]
64901.445: [CMS-concurrent-preclean-start]
64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs]
64901.578: [CMS-concurrent-abortable-preclean-start]
64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2866830K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
64901.621: [CMS-concurrent-sweep-start]
The first line indicates that the CMS concurrent mark (finding garbage) has taken 2.4 seconds. But this is a _concurrent_ 2.4 seconds, Java has not been paused at any point in time.
There are a few more minor GCs, then there is a pause at the 2nd last line:
64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
The pause here is 0.0049380 seconds (aka 4.9 milliseconds) to 'remark' the heap.
At this point the sweep starts, and you can watch the heap size go down:
64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
... lines removed ...
64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs]
At this point, the CMS sweep took 3.332 seconds, and the heap went from about 2.8 GB to about 1.3 GB.
The key point here is to keep all these pauses low. CMS pauses are always low, but if your ParNew starts growing, you can see minor GC pauses approach 100ms, exceed 100ms and hit as high as 400ms.
This can be due to the size of the ParNew, which should be relatively small. If your ParNew is very large after running HBase for a while (in one example a ParNew was about 150MB), then you might have to constrain the size of ParNew. The larger it is, the longer the collections take, but if it's too small, objects are promoted to old gen too quickly. In the example below we constrain new gen size to 64m.
Add this to HBASE_OPTS:
export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above> <gc logging options from above>"
For more information on GC pauses, see the 3 part blog post by Todd Lipcon and Section, 「Long GC pauses」 above.
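Since the GC log does not summarize itself, a short script can pull out the ParNew pause times and count the ones that cross a threshold such as the 100ms mentioned above. The sketch below assumes the log format shown in this section (the -XX:+PrintGCDetails output of the JVM used here); other JVM versions and flags print different formats. The class ParNewPauses is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts ParNew (minor GC) pause times from a GC log in the format shown above (hypothetical helper). */
final class ParNewPauses {

    // Matches for example: [ParNew: 5567K->576K(5568K), 0.0101110 secs]
    private static final Pattern PAR_NEW =
        Pattern.compile("\\[ParNew: (\\d+)K->(\\d+)K\\((\\d+)K\\), (\\d+\\.\\d+) secs\\]");

    private ParNewPauses() {}

    /** Returns every ParNew pause found in the text, in milliseconds, in log order. */
    static List<Double> pauseMillis(String gcLog) {
        List<Double> pauses = new ArrayList<>();
        Matcher matcher = PAR_NEW.matcher(gcLog);
        while (matcher.find()) {
            pauses.add(Double.parseDouble(matcher.group(4)) * 1000.0);
        }
        return pauses;
    }

    /** Counts the pauses strictly longer than the threshold. */
    static long countOver(List<Double> pausesMillis, double thresholdMillis) {
        long count = 0;
        for (double pause : pausesMillis) {
            if (pause > thresholdMillis) {
                count++;
            }
        }
        return count;
    }
}
```

Feeding it the contents of the file named in -Xloggc and calling countOver(pauses, 100.0) gives a quick indication of whether the new generation has grown too large.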
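search-hadoop.com indexes all the mailing lists and is great for historical searches. Search here first when you have an issue, as it's more than likely someone has already had your problem.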
Ask a question on the HBase mailing lists. The 'dev' mailing list is aimed at the community of developers actually building HBase and for features currently under development, and 'user' is generally used for questions on released versions of HBase. Before going to the mailing list, make sure your question has not already been answered by searching the mailing list archives first. Use Section 12.3.1, 「search-hadoop.com」. Take some time crafting your question[31]; a quality question that includes all context and exhibits evidence the author has tried to find answers in the manual and out on lists is more likely to get a prompt response.
JIRA is also really helpful when looking for Hadoop/HBase-specific issues.
The Master starts a web interface on port 60010 by default.
The Master web UI lists created tables and their definition (e.g., ColumnFamilies, blocksize, etc.). Additionally, the available RegionServers in the cluster are listed along with selected high-level metrics (requests, number of regions, usedHeap, maxHeap). The Master web UI allows navigation to each RegionServer's web UI.
RegionServers start a web interface on port 60030 by default.
The RegionServer web UI lists online regions and their start/end keys, as well as point-in-time RegionServer metrics (requests, regions, storeFileIndexSize, compactionQueueSize, etc.).
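See Section 14.4, 「HBase Metrics」 for more information on metrics.
zkcli is a very useful tool for investigating ZooKeeper-related issues. To invoke:
./hbase zkcli -server host:port <cmd> <args>
The commands (and arguments) are:
connect host:port
get path [watch]
ls path [watch]
set path data [version]
delquota [-n|-b] path
quit
printwatches on|off
create [-s] [-e] path data acl
stat path [watch]
close
ls2 path [watch]
history
listquota path
setAcl path acl
getAcl path
sync path
redo cmdno
addauth scheme auth
delete path [version]
setquota -n|-b val path
tail is the command-line tool that lets you look at the end of a file. Add the "-f" option and it will refresh when new data is available. It's useful when you are wondering what's happening; for example, when a cluster is taking a long time to start up or shut down, you can tail the master log (and maybe the logs of a few RegionServers).
top - 14:46:59 up 39 days, 11:55, 1 user, load average: 3.75, 3.57, 3.84
Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.5%us, 1.6%sy, 0.0%ni, 91.7%id, 1.4%wa, 0.1%hi, 0.6%si, 0.0%st
Mem: 24414432k total, 24296956k used, 117476k free, 7196k buffers
Swap: 16008732k total, 14348k used, 15994384k free, 11106908k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15558 hadoop 18 -2 3292m 2.4g 3556 S 79 10.4 6523:52 java
13268 hadoop 18 -2 8967m 8.2g 4104 S 21 35.1 5170:30 java
8895 hadoop 18 -2 1581m 497m 3420 S 11 2.1 4002:32 java
…
Here we can see that the system load average over the last minute is 3.75 (the three figures are the 1-, 5- and 15-minute averages), which very roughly means that on average 3.75 threads were waiting for CPU time during that minute. In general, the "perfect" utilization equals the number of cores; below that number the machine is underutilized, and above it the machine is overloaded. This is an important concept; see this article to understand it better: http://www.linuxjournal.com/article/9001.
Apart from load, we can see that the system is using almost all its available RAM, but most of it is used for the OS cache (which is good). The swap only has a few KBs in it, which is what we want; high numbers would indicate swapping activity, which is fatal for the performance of Java systems. Another way to detect swapping is when the load average goes through the roof (although this could also be caused by things like a dying disk, among others).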
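The comparison between load average and core count described above is easy to automate. The following sketch parses the three load figures from the first line of top output and normalizes one of them by the number of cores. LoadAverage is a hypothetical class name, and the parser assumes the English "load average: a, b, c" header with a dot as the decimal separator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the load averages from the first line of top output (hypothetical helper). */
final class LoadAverage {

    private static final Pattern LOAD =
        Pattern.compile("load average: ([0-9.]+), ([0-9.]+), ([0-9.]+)");

    private LoadAverage() {}

    /** Returns the 1-, 5- and 15-minute load averages, in that order. */
    static double[] parse(String topHeader) {
        Matcher matcher = LOAD.matcher(topHeader);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no load average found in: " + topHeader);
        }
        return new double[] {
            Double.parseDouble(matcher.group(1)),
            Double.parseDouble(matcher.group(2)),
            Double.parseDouble(matcher.group(3))
        };
    }

    /** Load per core: around 1.0 means fully used, well above 1.0 means overloaded. */
    static double perCore(double load, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return load / cores;
    }
}
```

When run on the machine being inspected, Runtime.getRuntime().availableProcessors() is one way to obtain the core count to pass in.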
hadoop@sv4borg12:~$ jps
1322 TaskTracker
17789 HRegionServer
27862 Child
1158 DataNode
25115 HQuorumPeer
2950 Jps
19750 ThriftServer
18776 jmx
hadoop@sv4borg12:~$ ps aux | grep HRegionServer hadoop 17789 155 35.2 9067824 8604364 ? S<l Mar04 9855:48 /usr/java/jdk1.6.0_14/bin/java -Xmx8000m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/export1/hadoop/logs/gc-hbase.log -Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/home/hadoop/hbase/conf/jmxremote.password -Dcom.sun.management.jmxremote -Dhbase.log.dir=/export1/hadoop/logs -Dhbase.log.file=hbase-hadoop-regionserver-sv4borg12.log -Dhbase.home.dir=/home/hadoop/hbase -Dhbase.id.str=hadoop -Dhbase.root.logger=INFO,DRFA -Djava.library.path=/home/hadoop/hbase/lib/native/Linux-amd64-64 -classpath /home/hadoop/hbase/bin/../conf:[many jars]:/home/hadoop/hadoop/conf org.apache.hadoop.hbase.regionserver.HRegionServer start
"regionserver60020" prio=10 tid=0x0000000040ab4000 nid=0x45cf waiting on condition [0x00007f16b6a96000..0x00007f16b6a96a70] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00007f16cd5c2f30> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:647) at java.lang.Thread.run(Thread.java:619) The MemStore flusher thread that is currently flushing to a file: "regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000040f4e000 nid=0x45eb in Object.wait() [0x00007f16b5b86000..0x00007f16b5b87af0] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.hadoop.ipc.Client.call(Client.java:803) - locked <0x00007f16cb14b3a8> (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221) at $Proxy1.complete(Unknown Source) at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.complete(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3390) - locked <0x00007f16cb14b470> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3304) at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.close(HFile.java:650) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:853) at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:467) - locked <0x00007f16d00e6f08> (a java.lang.Object) at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:427) at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:80) at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1359) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:907) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:834) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:786) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:250) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:224) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:146)
A handler thread that's waiting for stuff to do (like put, delete, scan, etc.):
"IPC Server handler 16 on 60020" daemon prio=10 tid=0x00007f16b011d800 nid=0x4a5e waiting on condition [0x00007f16afefd000..0x00007f16afefd9f0] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00007f16cd3f8dd8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1013)
"IPC Server handler 66 on 60020" daemon prio=10 tid=0x00007f16b006e800 nid=0x4a90 runnable [0x00007f16acb77000..0x00007f16acb77cf0] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.regionserver.KeyValueHeap.<init>(KeyValueHeap.java:56) at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:79) at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2209) at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1063) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1055) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1039) at org.apache.hadoop.hbase.regionserver.HRegion.getLastIncrement(HRegion.java:2875) at org.apache.hadoop.hbase.regionserver.HRegion.incrementColumnValue(HRegion.java:2978) at org.apache.hadoop.hbase.regionserver.HRegionServer.incrementColumnValue(HRegionServer.java:2433) at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:560) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1027)還有一個線程在從HDFS獲取數據。
"IPC Client (47) connection to sv4borg9/ from hadoop" daemon prio=10 tid=0x00007f16a02d0000 nid=0x4fa3 runnable [0x00007f16b517d000..0x00007f16b517dbf0] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) - locked <0x00007f17d5b68c00> (a sun.nio.ch.Util$1) - locked <0x00007f17d5b68be8> (a java.util.Collections$UnmodifiableSet) - locked <0x00007f1877959b50> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:304) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked <0x00007f1808539178> (a java.io.BufferedInputStream) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:569) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:477)
"LeaseChecker" daemon prio=10 tid=0x00000000407ef800 nid=0x76cd waiting on condition [0x00007f6d0eae2000..0x00007f6d0eae2a70] -- java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.hadoop.ipc.Client.call(Client.java:726) - locked <0x00007f6d1cd28f80> (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy1.recoverBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2636) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2832) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:529) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530) at org.apache.hadoop.hbase.util.FSUtils.recoverFileLease(FSUtils.java:619) at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1322) at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1210) at org.apache.hadoop.hbase.master.HMaster.splitLogAfterStartup(HMaster.java:648) at org.apache.hadoop.hbase.master.HMaster.joinCluster(HMaster.java:572) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:503)
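Here is an example of a cluster suffering from hundreds of compactions running at the same time, which severely affects IO performance. (TODO: insert graph plotting compactionQueueSize)
For more information on the HBase client, see Section 9.3, 「Client」.
operations. Because data is transferred to the client in large blocks, this can cause timeouts. Reducing the setCaching value may be an option, but setting it too low hurts performance.
Since 0.20.0 the default log level for org.apache.hadoop.hbase.* is DEBUG.
On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: log4j.logger.org.apache.hadoop.hbase=DEBUG to this: log4j.logger.org.apache.hadoop.hbase=INFO, or even log4j.logger.org.apache.hadoop.hbase=WARN.
See Section 11.10.2, 「Table Creation: Pre-Creating Regions」 on the pattern for pre-creating regions, and confirm that the table isn't starting with a single region.
See Section 11.4, 「HBase Configurations」 for cluster configuration, particularly hbase.hstore.blockingStoreFiles, the region size, and MEMSTORE_FLUSHSIZE.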
A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes blocked on the MemStores, which are blocked by the flusher thread, which is blocked because there are too many files to compact, because the compactor is given too many small files to compact and has to compact the same data repeatedly. This situation can occur even with minor compactions. Compounding this situation, HBase doesn't compress data in memory. Thus, the 64MB that lives in the MemStore could become a 6MB file after compression, which results in a smaller StoreFile. The upside is that more data is packed into the same region, but performance is achieved by being able to write larger files, which is why HBase waits until the flush size is reached before writing a new StoreFile. And smaller StoreFiles become targets for compaction. Without compression the files are much bigger and don't need as much compaction, however this is at the expense of I/O.
For additional information, see this thread on Long client pauses with compression.
11/07/05 11:26:41 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused: no further information at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) 11/07/05 11:26:43 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/ 11/07/05 11:26:44 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused: no further information at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) 11/07/05 11:26:45 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/
... then either ZooKeeper is down, or it is unreachable due to a network issue.
The utility Section, 「zkcli」 may help investigate ZooKeeper issues.
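Before digging further, it can help to confirm from the client machine whether anything is listening on the ZooKeeper address at all, which separates "ZooKeeper is down or unreachable" from problems inside HBase. A minimal sketch using only the JDK follows. PortProbe is a hypothetical name, and the host and port to pass in are whatever your client configuration points at (2181 is ZooKeeper's usual client port).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether a TCP connection to host:port can be opened (hypothetical helper). */
final class PortProbe {

    private PortProbe() {}

    /** Returns true if a connection is established within the timeout, false otherwise. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, unknown host and timeout all end up here.
            return false;
        }
    }
}
```

A false result for the quorum address matches the "Connection refused" messages in the log above. A true result only shows that the port is open, not that the ZooKeeper ensemble is healthy.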
You are likely running into the issue that is described and worked through in the mail thread HBase, mail # user - Suspected memory leak and continued over in HBase, mail # dev - FeedbackRe: Suspected memory leak. A workaround is passing your client-side JVM a reasonable value for -XX:MaxDirectMemorySize. By default, the MaxDirectMemorySize is equal to your -Xmx max heapsize setting (if -Xmx is set). Try setting it to something smaller (for example, one user had success setting it to 1g when they had a client-side heap of 12g). If you set it too small, it will bring on FullGCs, so keep it a bit hefty. You want to make this setting client-side only, especially if you are running the new experimental server-side off-heap cache, since this feature depends on being able to use big direct buffers (you may have to keep separate client-side and server-side config dirs).
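To verify that the flag actually reached the client JVM, for instance when several wrapper scripts assemble the command line, the client can inspect its own startup arguments. The sketch below uses the standard java.lang.management API. DirectMemoryFlag is a hypothetical class name, and the code only reports what was passed on the command line, not the limit the JVM ends up enforcing.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Looks for -XX:MaxDirectMemorySize among JVM startup arguments (hypothetical helper). */
final class DirectMemoryFlag {

    static final String PREFIX = "-XX:MaxDirectMemorySize=";

    private DirectMemoryFlag() {}

    /** Returns the configured value (for example "1g"), or null if the flag is absent. */
    static String find(List<String> jvmArgs) {
        String value = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                // If the flag is repeated, report the last occurrence.
                value = arg.substring(PREFIX.length());
            }
        }
        return value;
    }

    /** Convenience wrapper that inspects the arguments of the running JVM. */
    static String findInThisJvm() {
        return find(ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

Logging the result of findInThisJvm() once at client startup makes it easy to see which setting a misbehaving client was started with.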
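This is a client issue fixed by HBASE-5073 in 0.90.6. There was a ZooKeeper leak in the client, and the client was hit by ZooKeeper events on each additional invocation of the admin API.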
There can be several causes that produce this symptom.
First, check that you have a valid Kerberos ticket. One is required in order to set up communication with a secure Apache HBase cluster. Examine the ticket currently in the credential cache, if any, by running the klist command line utility. If no ticket is listed, you must obtain a ticket by running the kinit command with either a keytab specified, or by interactively entering a password for the desired principal.
Then, consult the Java Security Guide troubleshooting section. The most common problem addressed there is resolved by setting javax.security.auth.useSubjectCredsOnly system property value to false.
Because of a change in the format in which MIT Kerberos writes its credentials cache, there is a bug in the Oracle JDK 6 Update 26 and earlier that causes Java to be unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher. If you have this problematic combination of components in your environment, to work around this problem, first log in with kinit and then immediately refresh the credential cache with kinit -R. The refresh will rewrite the credential cache without the problematic formatting.
Finally, depending on your Kerberos configuration, you may need to install the Java Cryptography Extension, or JCE. Ensure the JCE jars are on the classpath on both server and client systems.
You may also need to download the unlimited strength JCE policy files. Uncompress and extract the downloaded file, and install the policy jars into <java-home>/lib/security.
The following stack trace happened using ImportTsv
WARN mapred.LocalJobRunner: job_local_0001 java.lang.IllegalArgumentException: Can't read partitions file at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:111) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210) Caused by: java.io.FileNotFoundException: File _partition.lst does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:776) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419) at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:296)
... see the critical portion of the stack? It's...
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
LocalJobRunner means the job is running locally, not on the cluster.
See http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath for more information on HBase MapReduce jobs and classpaths.
For more information on the NameNode, see Section 9.9, 「HDFS」.
To determine how much space HBase is using on HDFS, run the hadoop shell commands from the NameNode:
hadoop fs -dus /hbase/
hadoop fs -dus /hbase/myTable
hadoop fs -du /hbase/myTable
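For more information on HDFS shell commands, see the HDFS FileSystem Shell documentation.
Sometimes it will be necessary to explore the HBase objects that exist on HDFS. These objects include the WALs (Write Ahead Logs), tables, regions, StoreFiles, etc. The easiest way to do this is with the NameNode web application that runs on port 50070. The NameNode web application provides links to all the DataNodes in the cluster so that they can be browsed seamlessly.
/hbase
  /<Table>  (Tables in the cluster)
    /<Region>  (Regions for the table)
      /<ColumnFamily>  (ColumnFamilies for the Region for the table)
        /<StoreFile>  (StoreFiles for the ColumnFamily for the Regions for the table)
The HDFS directory structure of the HBase WAL is..
/hbase
  /.logs
    /<RegionServer>  (RegionServers)
      /<HLog>  (WAL HLog files for the RegionServer)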
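The two layouts above map directly onto path strings, which is handy when scripting checks against HDFS. The sketch below builds and parses such paths. HBasePaths is a hypothetical helper, the /hbase root is taken from the examples in this section (a cluster may use a different root directory), and nothing here talks to HDFS.

```java
import java.util.Arrays;

/** Builds and parses HDFS paths following the layouts shown above (hypothetical helper). */
final class HBasePaths {

    private HBasePaths() {}

    /** /hbase/<Table>/<Region>/<ColumnFamily>/<StoreFile> */
    static String storeFile(String table, String region, String family, String storeFile) {
        return "/hbase/" + table + "/" + region + "/" + family + "/" + storeFile;
    }

    /** /hbase/.logs/<RegionServer>/<HLog> */
    static String wal(String regionServer, String hlog) {
        return "/hbase/.logs/" + regionServer + "/" + hlog;
    }

    /** Splits a StoreFile path into {table, region, family, storeFile}. */
    static String[] parseStoreFile(String path) {
        String[] parts = path.split("/");
        // A well-formed path splits into "", "hbase" and the four components.
        if (parts.length != 6 || !parts[0].isEmpty() || !"hbase".equals(parts[1])) {
            throw new IllegalArgumentException("not a StoreFile path: " + path);
        }
        return Arrays.copyOfRange(parts, 2, 6);
    }
}
```

The resulting strings can be passed to the hadoop fs commands shown earlier, for example to check the size of a single region directory.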
參考HDFS User Guide 獲取其餘非Shell診斷工具如fsck
參考 Section, 「Managed Compactions」 ,獲取更多管理緊縮的信息。
HBase 但願迴環 IP 地址是 參考開始章節 Section 2.2.3, 「Loopback IP」.
全部網絡接口是否正常?你肯定嗎?參考故障診斷用例研究 Section 12.14, 「Case Studies」.
RegionServer 的更多信息,參考 Section 9.6, 「RegionServer」.
主服務器相信區域服務器有IP地址127.0.0.1 - 這是 localhost 並被解析到主服務器本身的localhost.
修改區域服務器的 /etc/hosts
# Do not remove the following line, or various programs # that require network functionality will fail. fully.qualified.regionservername regionservername localhost.localdomain localhost ::1 localhost6.localdomain6 localhost6
... 改到 (將主名稱從localhost中移掉)...
# Do not remove the following line, or various programs # that require network functionality will fail. localhost.localdomain localhost ::1 localhost6.localdomain6 localhost6
11/02/20 01:32:15 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734) at java.lang.Runtime.loadLibrary0(Runtime.java:823) at java.lang.System.loadLibrary(System.java:1028)
... it means there is a problem with your compression libraries. See the Configuration section on LZO compression configuration.
Are you running an old JVM (< 1.6.0_u21?)? When you look at a thread dump, does it look like threads are BLOCKED but no one holds the lock all are blocked on? See HBASE 3622 Deadlock in HBaseServer (JVM bug?). Adding -XX:+UseMembar to the HBase HBASE_OPTS in conf/hbase-env.sh may fix it.
Also, are you using Section 9.3.4, 「RowLocks」? These are discouraged because they can lock up the RegionServers if not managed properly.
2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:883)
... see the Getting Started section on ulimit and nproc configuration.
See the Getting Started section on ulimit and nproc configuration. The default on recent Linux distributions is 1024, which is far too low for HBase.
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000 2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000 2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying
... or you see full GC compactions, then you may be experiencing full GCs.
See the Getting Started section on ulimit and nproc configuration, and check your network.
Master or RegionServers shutting down with messages like those in the logs:
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT] INFO org.apache.zookeeper.ClientCnxn: Server connection successful WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e java.io.IOException: Session Expired at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
The JVM is doing a long-running garbage collection which is pausing all threads (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out. By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere.
The default heap size of 1GB won't be able to sustain long-running imports. If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds.
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (hence having less garbage to collect per machine).
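The reason tickTime is raised together with the session timeout is that a ZooKeeper server bounds the session timeout it will grant: by default the minimum is 2 times tickTime and the maximum is 20 times tickTime. The Python sketch below only models that arithmetic; the function name is hypothetical, and it assumes the server's minSessionTimeout and maxSessionTimeout are left at their defaults.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Model how a ZooKeeper server clamps a requested session timeout.

    Assumption: when min_ms / max_ms are not given, the server defaults of
    2 * tickTime and 20 * tickTime apply.
    """
    low = 2 * tick_time_ms if min_ms is None else min_ms
    high = 20 * tick_time_ms if max_ms is None else max_ms
    return max(low, min(requested_ms, high))


# With tickTime = 6000 ms the ceiling is 20 * 6000 = 120000 ms (120 seconds),
# so requesting more than that has no further effect.
print(negotiated_session_timeout(120000, 6000))
print(negotiated_session_timeout(1200000, 6000))
```

Under these assumptions, a requested timeout is only honoured if it fits inside the window derived from tickTime, which is why a larger timeout with an unchanged tickTime may silently be reduced.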
If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider bulk loading.
See Section 12.11.2, 「ZooKeeper, The Cluster Canary」 for other general information about ZooKeeper troubleshooting.
A NotServingRegionException is "normal" when found in the RegionServer logs at DEBUG level. This exception is returned to the client, and the client then goes back to .META. to find the new location of the moved region.
However, if the NotServingRegionException is logged at ERROR level, then the client ran out of retries and something is probably wrong.
Fix your DNS. In versions of HBase before 0.92.x, a reverse DNS lookup needs to give the same answer as the forward lookup. See HBASE 3431 RegionServer is not using the name given it by the master; double entry in master listing of servers for the details.
The native versions of the compression libraries are not being used. See HBASE-1900 Put back native support when hadoop 0.21 is released. Copy the native libraries from Hadoop under the HBase lib directory, or symlink them into place, and the message should go away.
If you see this type of message it means that the region server was trying to read/send data from/to a client but it already went away. Typical causes for this are if the client was killed (you see a storm of messages like this when a MapReduce job is killed or fails) or if the client receives a SocketTimeoutException. It's harmless, but you should consider digging in a bit more if you aren't doing something to trigger them.
For more information on the Master, see Section 9.5, 「Master」.
Upon running that, the hbase migrations script says no files in root directory.
HBase expects the root directory to either not exist, or to have already been initialized by hbase running a previous time. If you create a new directory for HBase using Hadoop DFS, this error will occur. Make sure the HBase root directory does not currently exist or has been initialized by a previous run of HBase. Sure fire solution is to just use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.
A ZooKeeper server wasn't able to start, throws that error. xyz is the name of your server.
This is a name lookup problem. HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the hbase.zookeeper.quorum configuration.
Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set hbase.zookeeper.dns.interface and hbase.zookeeper.dns.nameserver in hbase-site.xml to make sure it resolves to the correct FQDN.
ZooKeeper is the cluster's "canary in the mineshaft". It'll be the first to notice issues, if any, so making sure it's happy is the short-cut to a humming cluster.
See the ZooKeeper Operating Environment Troubleshooting page. It has suggestions and tools for checking disk and networking performance; i.e. the operating environment your ZooKeeper and HBase are running in.
Additionally, the utility Section, 「zkcli」 may help investigate ZooKeeper issues.
HBase does not start when deployed as Amazon EC2 instances. Exceptions like the below appear in the Master and/or RegionServer logs:
2009-10-19 11:52:27,030 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ec2-174-129-15-236.compute-1.amazonaws.com/ 2009-10-19 11:52:27,032 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.SelectionKeyImpl@656dc861 java.net.ConnectException: Connection refused
Security group policy is blocking the ZooKeeper port on a public address. Use the internal EC2 host names when configuring the ZooKeeper quorum peer list.
Questions about HBase and Amazon EC2 come up frequently on the HBase dist-list. Search for old threads using Search Hadoop.
See Andrew's answer, posted on the user list: Remote Java client connection into EC2 instance.
HBase 0.90.x does not ship with hadoop-0.20.205.x, etc. To make it run, you need to replace the hadoop jars that HBase shipped with in its lib directory with those of the Hadoop you want to run HBase on. If even after replacing Hadoop jars you get the below exception:
sv4r6s38: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration sv4r6s38: at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37) sv4r6s38: at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34) sv4r6s38: at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:229) sv4r6s38: at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:83) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:202) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177)
you need to copy under hbase/lib the commons-configuration-X.jar you find in your Hadoop's lib directory. That should fix the above complaint.
This chapter will describe a variety of performance and troubleshooting case studies that can provide a useful blueprint on diagnosing cluster issues.
For more information on Performance and Troubleshooting, see Chapter 11, Performance Tuning and Chapter 12, Troubleshooting and Debugging HBase.
The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in HBase.
*** QUESTION ***
We're looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is to store the majority of the data in a key, so we could have something like:
<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)
The other option we had was to do this entirely using:
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
where each row would contain multiple values. So in one case reading the first thirty values would be:
scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
And in the second case it would be
get 'FixedWidthUserName\x00\x00\x00\x00'
The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution)
The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?
My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we'll always need the same page size. I've ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we'd need to update all subsequent rows).
Thanks for help / suggestions / follow-up questions.
*** ANSWER ***
If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:
"user123, firstname, Paul", "user234, lastname, Smith"
(But the usernames are fixed width, and the valueids are fixed width).
And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?
The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed.
Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that. Doing it this way is generally recommended (see here #schema.smackdown).
Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I'm guessing you jumped to the "paginated" version because you're assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you're not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse. The client has methods that allow you to get specific slices of columns.
Note that neither case fundamentally uses more disk space than the other; you're just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George's excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).
A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :)
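To make the "tall" recommendation concrete, here is a small Python simulation of fixed-width row keys and scan-based paging. It does not talk to HBase: a sorted list of byte strings stands in for the table, and the widths, the zero-byte padding, and all function names are assumptions chosen for this sketch.

```python
import bisect

USER_WIDTH = 12      # hypothetical fixed width for user names, in bytes
VALUE_ID_WIDTH = 8   # hypothetical fixed width for value ids, in bytes


def user_prefix(user):
    """Pad a user name to the fixed width so all keys of a user share a prefix."""
    raw = user.encode("ascii")
    if len(raw) > USER_WIDTH:
        raise ValueError("user name longer than the fixed width")
    return raw.ljust(USER_WIDTH, b"\x00")


def row_key(user, value_id):
    """<FixedWidthUserName><FixedWidthValueId>; big-endian ids sort numerically."""
    return user_prefix(user) + value_id.to_bytes(VALUE_ID_WIDTH, "big")


def value_id_of(key):
    return int.from_bytes(key[USER_WIDTH:], "big")


def scan(sorted_keys, start_row, limit, prefix=b""):
    """Emulate a scan: start at start_row, stop at limit or when the prefix ends."""
    start = bisect.bisect_left(sorted_keys, start_row)
    page = []
    for key in sorted_keys[start:]:
        if len(page) == limit or not key.startswith(prefix):
            break
        page.append(key)
    return page


# One row per user+value; row keys are kept in sorted byte order.
table = sorted(
    [row_key("alice", i) for i in range(100)] + [row_key("bob", i) for i in range(5)]
)
first_page = scan(table, row_key("alice", 0), 30, user_prefix("alice"))
next_page = scan(
    table, row_key("alice", value_id_of(first_page[-1]) + 1), 30, user_prefix("alice")
)
```

With this key design, inserting a value in the middle of a list is just another row, and the next page is found by starting the scan one id after the last key returned, so no per-page bookkeeping is needed.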
Following a scheduled reboot, one data node began exhibiting unusual behavior. Routine MapReduce jobs run against HBase tables which regularly completed in five or six minutes began taking 30 or 40 minutes to finish. These jobs were consistently found to be waiting on map and reduce tasks assigned to the troubled data node (e.g., the slow map tasks all had the same Input Split). The situation came to a head during a distributed copy, when the copy was severely prolonged by the lagging node.
We hypothesized that we were experiencing a familiar point of pain: a "hot spot" region in an HBase table, where uneven key-space distribution can funnel a huge number of requests to a single HBase region, bombarding the RegionServer process and causing slow response times. Examination of the HBase Master status page showed that the number of HBase requests to the troubled node was almost zero. Further, examination of the HBase logs showed that there were no region splits, compactions, or other region transitions in progress. This effectively ruled out a "hot spot" as the root cause of the observed slowness.
Our next hypothesis was that one of the MapReduce tasks was requesting data from HBase that was not local to the datanode, thus forcing HDFS to request data blocks from other servers over the network. Examination of the datanode logs showed that there were very few blocks being requested over the network, indicating that the HBase region was correctly assigned, and that the majority of the necessary data was located on the node. This ruled out the possibility of non-local data causing a slowdown.
After concluding that Hadoop and HBase were not likely to be the culprits, we moved on to troubleshooting the datanode's hardware. Java, by design, will periodically scan its entire memory space to do garbage collection. If system memory is heavily overcommitted, the Linux kernel may enter a vicious cycle, using up all of its resources swapping Java heap back and forth from disk to RAM as Java tries to run garbage collection. Further, a failing hard disk will often retry reads and/or writes many times before giving up and returning an error. This can manifest as high iowait, as running processes wait for reads and writes to complete. Finally, a disk nearing the upper edge of its performance envelope will begin to cause iowait as it informs the kernel that it cannot accept any more data, and the kernel queues incoming data into the dirty write pool in memory. However, using vmstat(1) and free(1), we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second.
Next, we checked to see whether the system was performing slowly simply due to very high computational load. top(1) showed that the system load was higher than normal, but vmstat(1) and mpstat(1) showed that the amount of processor being used for actual computation was low.
Since neither the disks nor the processors were being utilized heavily, we moved on to the performance of the network interfaces. The datanode had two gigabit ethernet adapters, bonded to form an active-standby interface. ifconfig(8) showed some unusual anomalies, namely interface errors, overruns, and framing errors. While not unheard of, these kinds of errors are exceedingly rare on modern hardware which is operating as it should:
$ /sbin/ifconfig bond0
bond0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
      inet addr:10.x.x.x Bcast:10.x.x.255 Mask:
      UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
      RX packets:2990700159 errors:12 dropped:0 overruns:1 frame:6 <--- Look Here! Errors!
      TX packets:3443518196 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:0
      RX bytes:2416328868676 (2.4 TB) TX bytes:3464991094001 (3.4 TB)
These errors immediately led us to suspect that one or more of the ethernet interfaces might have negotiated the wrong line speed. This was confirmed both by running an ICMP ping from an external host and observing round-trip times in excess of 700ms, and by running ethtool(8) on the members of the bond interface and discovering that the active interface was operating at 100Mb/s, full duplex.
$ sudo ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes: 10baseT/Half 10baseT/Full
                              100baseT/Half 100baseT/Full
                              1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes: 10baseT/Half 10baseT/Full
                               100baseT/Half 100baseT/Full
                               1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Link partner advertised link modes: Not reported
        Link partner advertised pause frame use: No
        Link partner advertised auto-negotiation: No
        Speed: 100Mb/s <--- Look Here! Should say 1000Mb/s!
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: Unknown
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000003 (3)
        Link detected: yes
In normal operation, the ICMP ping round trip time should be around 20ms, and the interface speed and duplex should read "1000Mb/s" and "Full", respectively.
After determining that the active ethernet adapter was at the incorrect speed, we used the ifenslave(8) command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance, and a 10 times improvement in network throughput.
On the next trip to the datacenter, we determined that the line speed issue was ultimately caused by a bad network cable, which was replaced.
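The two checks that exposed the problem in this case study (non-zero RX error counters and a link speed below gigabit) can be automated. The following Python sketch parses text in the formats shown above; the function names are hypothetical, and the regular expressions assume the older net-tools `ifconfig` layout and the `Speed: <n>Mb/s` line from `ethtool`, which may differ on other systems.

```python
import re

RX_RE = re.compile(
    r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)"
)
SPEED_RE = re.compile(r"Speed:\s*(\d+)Mb/s")


def rx_counters(ifconfig_text):
    """Extract the RX counters from ifconfig output, or None if not found."""
    m = RX_RE.search(ifconfig_text)
    if m is None:
        return None
    names = ("packets", "errors", "dropped", "overruns", "frame")
    return dict(zip(names, map(int, m.groups())))


def link_speed_mbps(ethtool_text):
    """Extract the negotiated link speed in Mb/s, or None if not reported."""
    m = SPEED_RE.search(ethtool_text)
    return int(m.group(1)) if m else None


def looks_degraded(ifconfig_text, ethtool_text, expected_mbps=1000):
    """True if RX errors are present or the link runs below the expected speed."""
    rx = rx_counters(ifconfig_text) or {}
    speed = link_speed_mbps(ethtool_text)
    has_errors = any(rx.get(k, 0) for k in ("errors", "overruns", "frame"))
    return has_errors or (speed is not None and speed < expected_mbps)
```

A check of this kind could be run from a monitoring job after every reboot, since the incident described here started right after a scheduled restart.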
Investigation results of a self-described "we're not sure what's wrong, but it seems slow" problem. http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html
Investigation results of general cluster performance from 2010. Although this research is on an older version of the codebase, this writeup is still very useful in terms of approach. http://hstack.org/hbase-performance-testing/
Case study of configuring xceivers, and diagnosing errors from mis-configurations. http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html
There is a Driver class, executed by the HBase jar, that can be used to invoke frequently accessed utilities. For example,
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
... will return...
An example program must be given as the first argument.
Valid program names are:
  completebulkload: Complete a bulk data load.
  copytable: Export a table from local cluster to peer cluster
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table
  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is chan
... for allowable program names.
To run hbck against your HBase cluster run
$ ./bin/hbase hbck
At the end of the command's output it prints OK or INCONSISTENCY. If your cluster reports inconsistencies, pass -details to see more detail emitted. If there are inconsistencies, run hbck a few times because the inconsistency may be transient (e.g. the cluster is starting up or a region is splitting). Passing -fix may correct the inconsistency (this is an experimental feature).
For more information, see Appendix B, hbck In Depth.
The main method on HLog offers manual split and dump facilities. Pass it WALs or the product of a split, the content of the recovered.edits directory.
You can get a textual dump of a WAL file content by doing the following:
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/
The return code will be non-zero if there are issues with the file, so you can test the wholesomeness of a file by redirecting STDOUT to /dev/null and testing the program's return code.
Similarly you can force a split of a log file directory by doing:
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/
CopyTable is a utility that can copy part or all of a table, either to the same cluster or another cluster. The usage is as follows:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename
Options:
starttime: Beginning of the time range. Without endtime means starttime to forever.
endtime: End of the time range. Without endtime means starttime to forever.
versions: Number of cell versions to copy.
new.name: New table's name.
peer.adr: Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
families: Comma-separated list of ColumnFamilies to copy.
all.cells: Also copy delete markers and uncollected deleted cells (advanced option).
Args:
tablename: Name of the table to copy.
Example of copying 'TestTable' to a cluster that uses replication for a 1 hour window:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase TestTable
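The `--starttime` and `--endtime` values in the example are epoch timestamps in milliseconds, and they differ by exactly 3,600,000 ms (one hour). As a sketch of how such a command line might be assembled, the Python below derives both values from a start instant and a window; the helper names are hypothetical, and only the options already listed above are used.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms(moment):
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def copytable_args(table, start, window, peer=None):
    """Build the argument list for a CopyTable run over [start, start + window)."""
    args = [
        "bin/hbase",
        "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--starttime={epoch_ms(start)}",
        f"--endtime={epoch_ms(start + window)}",
    ]
    if peer:
        args.append(f"--peer.adr={peer}")
    args.append(table)
    return args


# Reproduce the one-hour window used in the example above.
start = EPOCH + timedelta(milliseconds=1265875194289)
args = copytable_args(
    "TestTable", start, timedelta(hours=1), peer="server1,server2,server3:2181:/hbase"
)
print(" ".join(args))
```

The resulting list could be passed to `subprocess.run` from the HBase installation directory; integer arithmetic is used for the conversion to avoid floating-point rounding of millisecond timestamps.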
Caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
See Jonathan Hsieh's Online HBase Backups with CopyTable blog post for more on CopyTable.
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the completebulkload utility.
To load data via Puts (i.e., non-bulk loading):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
To generate StoreFiles for bulk-loading:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
These generated StoreFiles can be loaded into HBase via Section 14.1.10, 「CompleteBulkLoad」.
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.

By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: if you do not use this option, then the target table must already exist in HBase

Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
Assume that an input file exists as follows:
row1 c1 c2
row2 c1 c2
row3 c1 c2
row4 c1 c2
row5 c1 c2
row6 c1 c2
row7 c1 c2
row8 c1 c2
row9 c1 c2
row10 c1 c2
For ImportTsv to use this input file, the command line needs to look like this:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used. The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
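Because ImportTsv requires one column name for every column in the input, it can help to generate the `-Dimporttsv.columns` option and verify it against the file before submitting the job. The Python sketch below does that for the sample data; the helper names are hypothetical, and it assumes tab-separated input with the row key in the first column, as in this example.

```python
import csv
import io


def importtsv_columns(family, qualifiers, row_key_index=0):
    """Build the -Dimporttsv.columns option with HBASE_ROW_KEY at row_key_index."""
    columns = [f"{family}:{q}" for q in qualifiers]
    columns.insert(row_key_index, "HBASE_ROW_KEY")
    return "-Dimporttsv.columns=" + ",".join(columns)


def sample_tsv(n_rows):
    """Generate tab-separated rows shaped like the sample input file above."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for i in range(1, n_rows + 1):
        writer.writerow([f"row{i}", "c1", "c2"])
    return buf.getvalue()


def widths_match(tsv_text, columns_option):
    """Check that every input line has exactly as many fields as declared columns."""
    expected = len(columns_option.split("=", 1)[1].split(","))
    return all(len(line.split("\t")) == expected for line in tsv_text.splitlines())


option = importtsv_columns("d", ["c1", "c2"])
data = sample_tsv(10)
print(option)
print(widths_match(data, option))
```

A mismatch reported by `widths_match` points at lines that ImportTsv would treat as invalid, which is cheaper to find locally than after a MapReduce job has run.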
If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
The completebulkload utility moves generated StoreFiles into an HBase table. It is often used in conjunction with the output of Section 14.1.9, 「ImportTsv」.
$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
... or via the Driver...
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
For more information about bulk-loading HFiles into HBase, see Section 9.8, 「Bulk Loading」.
WALPlayer is a utility to replay WAL files into HBase.
The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables. The output can optionally be mapped to another set of tables.
WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.
Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <wal inputdir> <tables> [<tableMappings>]
For example:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
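Major compactions can be requested via the HBase shell or HBaseAdmin.majorCompact.
Note: major compactions do NOT do region merges. For more information about compactions, see Section, 「Compaction」.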
Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).
$ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>
If you feel you have too many regions and want to consolidate them, Merge is the utility you need. Merge must be run when the cluster is down. See the O'Reilly HBase Book for an example of usage.
Additionally, there is a Ruby script attached to HBASE-1621 for region merging.
$ ./bin/hbase-daemon.sh stop regionserver
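If the load balancer runs while a node is shutting down, there may be contention between the Load Balancer and the Master's recovery of the RegionServer that is going offline. To avoid this, stop the load balancer first; see Load Balancer below.
A downside of stopping a RegionServer this way is that its regions can be offline for a good while. Regions are closed in order. If there are many regions on the server, the first region to close may not come back online until the last region has closed and the Master has noticed that the RegionServer is gone, so the whole process can take a long time. In HBase 0.90.2, we added a facility for having a node gradually shed its load and then shut itself down. HBase 0.90.2 added the graceful_stop.sh script. Here is its usage:
$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop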
$ ./bin/graceful_stop.sh HOSTNAME
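where HOSTNAME is the host carrying the RegionServer you would decommission.
The HOSTNAME passed to graceful_stop.sh must match the hostname that HBase is using to identify RegionServers. Check the list of RegionServers in the Master UI to see how HBase refers to its servers. It is usually the hostname, but it can also be the FQDN. Whatever HBase is using, that is what you should pass to graceful_stop.sh.
The script then stops the RegionServer. The Master notices that the RegionServer is gone, but by then all of its regions have already been redeployed, so the RegionServer shuts down cleanly and there are no WAL logs to split.
When running the graceful_stop script, turn the Region Load Balancer off (otherwise the balancer and the decommission script will conflict over region deployment):
hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds
This turns the balancer off (the command prints the previous state). To re-enable it, run:
hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds
You can also have this script restart a RegionServer and then move its old regions back into place, which retains data locality. A primitive rolling restart might be done by running something like the following: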
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
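Tail /tmp/log.txt to follow the script's progress. The above only operates on RegionServers. Make sure the load balancer is disabled before running it. You also need to update the Master separately, beforehand. Here is a pseudo-script for a rolling restart that you can adapt:
Verify your release, and make sure its configuration has been rsynced across the cluster. If the version is 0.90.2, apply the patches from HBASE-3744 and HBASE-3756.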
$ ./bin/hbase hbck
$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
Disable the region balancer:
$ echo "balance_switch false" | ./bin/hbase shell
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
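If you are also running thrift or rest servers on the RegionServer, pass the --thrift or --rest options as well (see the usage for graceful_stop.sh).
Run hbck to ensure the cluster is consistent.
See Metrics for an introduction and for how to enable Metrics emission.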
Number of blocks that had to be evicted from the block cache due to heap size constraints.
Block cache hit caching ratio (0 to 100). The cache-hit ratio for reads configured to look in the cache (i.e., cacheBlocks=true).
Number of blocks of StoreFiles (HFiles) read from the cache.
Block cache hit ratio (0 to 100). Includes all read requests, although those with cacheBlocks=false will always read from disk and be counted as a "cache miss".
Block cache size in memory (bytes), i.e., memory in use by the BlockCache.
Number of enqueued regions in the MemStore awaiting flush.
Filesystem sync latency (ms). Latency to sync the write-ahead log records to the filesystem.
Number of operations to sync the write-ahead log records to the filesystem.
Filesystem write latency (ms). Total latency for all writers, including StoreFiles and write-ahead log.
Number of filesystem write operations, including StoreFiles and write-ahead log.
Number of Stores open on the RegionServer. A Store corresponds to a ColumnFamily. For example, if a table with one ColumnFamily has 3 regions on this RegionServer, there will be 3 Stores open for that ColumnFamily.
The following metrics have proven to be the most important ones for macro-level monitoring of each RegionServer, especially with a system like OpenTSDB. If your cluster is having performance problems, this is the group of metrics you will likely need to look at.
For more information on HBase metrics, see Section 14.4, 「HBase Metrics」.
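The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run or produced too much output. The thresholds for "too long to run" and "too much output" are configurable, as described below. The output is produced inline in the main region server logs, so that further details can be discovered from context with other logged events. Each entry is also prefixed with an identifying tag: (responseTooSlow), (responseTooLarge), (operationTooSlow)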
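Maximum number of milliseconds that a query can run without being logged. Defaults to 10000, or 10 seconds. Can be set to -1 to disable logging by time.
hbase.ipc.warn.response.size
Maximum byte size of response that a query can return without being logged. Defaults to 100 MB. Can be set to -1 to disable logging by size.
The slow query log also exposes metrics to JMX.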
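The output is tagged with the operation, e.g. (operationTooSlow), if the call was a client operation such as a Put, Get, or Delete, for which detailed fingerprint information is exposed. Otherwise it is tagged (responseTooSlow) and still produces parseable JSON output, but with less detail, covering only the duration and size of the RPC itself. TooLarge is substituted for TooSlow if the response size triggered the logging; TooLarge is also used when both size and duration triggered it.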
2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}
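Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint. The rest of the information is RPC-specific, such as processing time and client IP/port. Other client operations follow the same pattern and the same general structure, with differences that depend on the type of the individual operation. If the call is not a client operation, the fingerprint detail is absent entirely.
This particular example indicates that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as shown by the "vlen", or value length, field of each put in the multiPut.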
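Because every entry carries one of the identifying tags, the slow query log is easy to pull out of a RegionServer log with standard tools. The following is a minimal sketch; the function names are made up for illustration, and the pattern assumes the tags follow the operation/response and TooSlow/TooLarge naming described above.

```shell
# Print only the slow-query entries from a RegionServer log file.
# $1 is the path to the log file.
slow_query_lines() {
  grep -E '\((operation|response)Too(Slow|Large)\)' "$1"
}

# Print just the JSON payload of each slow-query entry by dropping
# everything before the first '{' on the line.
slow_query_json() {
  slow_query_lines "$1" | sed -E 's/^[^{]*//'
}
```

The JSON emitted by slow_query_json can then be fed to whatever JSON tooling you prefer, for example to sum the "vlen" fields of a multiPut.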
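See Cluster Replication.
There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has pros and cons.
For more information, see HBase Backup Options on the Sematext blog.
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is used for back-end analytic capacity and does not serve front-end web pages. The benefit is that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata. The obvious drawback is that the cluster is down. The steps include:
Distcp can be used either to copy the contents of the HBase directory in HDFS to another directory on the same cluster, or to a different cluster.
Note: Distcp works in this situation because the cluster is down and no files are being changed. Distcp is not recommended on a live cluster.
This approach assumes that there is a second cluster. See the HBase page on replication.
The Section 14.1.6, 「CopyTable」 utility can be used either to copy data from one table to another on the same cluster, or to copy data to a table on another cluster.
The Section 14.1.7, 「Export」 approach dumps the content of a table to HDFS on the same cluster. To restore the data, use the Section 14.1.8, 「Import」 utility.
A common question for HBase administrators is how much storage an HBase cluster will require. There are several aspects to consider, the most important of which is what data will be loaded into the cluster. Start with a solid understanding of how HBase handles data internally (KeyValue).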
HBase storage will be dominated by KeyValues. See Section, 「KeyValue」 and Section 6.3.2, 「Try to minimize row and column sizes」 for how HBase stores data internally.
It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other factor.
KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis. Blocks are aggregated into StoreFiles. See Section 9.7, 「Regions」.
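To make the point about rowkey, ColumnFamily and attribute lengths concrete, the arithmetic can be sketched as below. This is a rough estimate, not a sizing tool. It assumes the serialized KeyValue layout of a 4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp and 1-byte type, which gives 20 bytes of fixed overhead per cell; that layout is an assumption not stated in this section. It also ignores compression, block and index overhead, and HDFS replication. The function names are made up for illustration.

```shell
# Approximate serialized size, in bytes, of one KeyValue.
# Assumed fixed overhead: 4 (key length) + 4 (value length) + 2 (row length)
#                       + 1 (family length) + 8 (timestamp) + 1 (type) = 20
kv_size() {
  local row_len=$1 family_len=$2 qualifier_len=$3 value_len=$4
  echo $(( 20 + row_len + family_len + qualifier_len + value_len ))
}

# Raw bytes for a table: rows x attributes per row x bytes per KeyValue.
# This is uncompressed and unreplicated, so treat it as a starting point.
table_raw_bytes() {
  local rows=$1 attrs_per_row=$2 kv_bytes=$3
  echo $(( rows * attrs_per_row * kv_bytes ))
}
```

For example, kv_size 16 2 8 100 describes a 16-byte rowkey, a 2-byte family name, an 8-byte qualifier and a 100-byte value; note that the rowkey and family name are repeated in every KeyValue of the row, which is why short names pay off.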
Another common question for HBase administrators is determining the right number of regions per RegionServer. This affects both storage and hardware planning. See Section 11.4.1, 「Number of Regions」.
See HBASE-3678 Add Eclipse-based Apache Formatter to HBase Wiki for an Eclipse formatter to help ensure your code conforms to HBase'y coding convention. The issue includes instructions for loading the attached formatter.
In addition to the automatic formatting, make sure you follow the style guidelines explained in Section 15.10.5, 「Common Patch Feedback」
Also, no @author tags - that's a rule. Quality Javadoc comments are appreciated. And include the Apache license.
Download and install the Subversive plugin.
Set up an SVN Repository target from Section 15.1.1, 「SVN」, then check out the code.
If you cloned the project via git, download and install the Git plugin (EGit). Attach to your local git repo (via the Git Repositories window) and you'll be able to see file revision history, generate patches, etc.
The easiest way is to use the m2eclipse plugin for Eclipse. Eclipse Indigo or newer has m2eclipse built-in, or it can be found here: http://www.eclipse.org/m2e/. M2Eclipse provides Maven integration for Eclipse - it even lets you use the direct Maven commands from within Eclipse to compile and test your project.
To import the project, you merely need to go to File->Import...Maven->Existing Maven Projects and then point Eclipse at the HBase root directory; m2eclipse will automatically find all the hbase modules for you.
If you install m2eclipse and import HBase in your workspace, you will have to fix your Eclipse Build Path. Remove the target folder, and add the target/generated-jamon and target/generated-sources/java folders. You may also remove from your Build Path the exclusions on the src/main/resources and src/test/resources folders to avoid the console error message 'Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run (default) on project hbase: 'An Ant BuildException has occured: Replace: source file .../target/classes/hbase-default.xml doesn't exist'. This will also reduce the Eclipse build cycles and make your life easier when developing.
For those not inclined to use m2eclipse, you can generate the Eclipse files from the command line. First, run (you should only have to do this once):
mvn clean install -DskipTests
and then close Eclipse and execute...
mvn eclipse:eclipse
... from your local HBase project directory in your workspace to generate some new .project
and .classpath
files. Then reopen Eclipse, and import the .project file in the HBase directory to a workspace.
The M2_REPO classpath variable needs to be set up for the project. This needs to be set to your local Maven repository, which is usually ~/.m2/repository. If this variable is not configured, Eclipse reports build path errors like these:
Description Resource Path Location Type
The project cannot be built until build path errors are resolved hbase Unknown Java Problem
Unbound classpath variable: 'M2_REPO/asm/asm/3.1/asm-3.1.jar' in project 'hbase' hbase Build path Build Path Problem
Unbound classpath variable: 'M2_REPO/com/github/stephenc/high-scale-lib/high-scale-lib/1.1.1/high-scale-lib-1.1.1.jar' in project 'hbase' hbase Build path Build Path Problem
Unbound classpath variable: 'M2_REPO/com/google/guava/guava/r09/guava-r09.jar' in project 'hbase' hbase Build path Build Path Problem
Unbound classpath variable: 'M2_REPO/com/google/protobuf/protobuf-java/2.3.0/protobuf-java-2.3.0.jar' in project 'hbase' hbase Build path Build Path Problem
Unbound classpath variable:
Eclipse will currently complain about Bytes.java. It is not possible to turn these errors off.
Description Resource Path Location Type
Access restriction: The method arrayBaseOffset(Class) from the type Unsafe is not accessible due to restriction on required library /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Classes/classes.jar Bytes.java /hbase/src/main/java/org/apache/hadoop/hbase/util line 1061 Java Problem
Access restriction: The method arrayIndexScale(Class) from the type Unsafe is not accessible due to restriction on required library /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Classes/classes.jar Bytes.java /hbase/src/main/java/org/apache/hadoop/hbase/util line 1064 Java Problem
Access restriction: The method getLong(Object, long) from the type Unsafe is not accessible due to restriction on required library /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Classes/classes.jar Bytes.java /hbase/src/main/java/org/apache/hadoop/hbase/util line 1111 Java Problem
For additional information on setting up Eclipse for HBase development on Windows, see Michael Morello's blog on the topic.
This section will be of interest only to those building HBase from source.
Pass -Dsnappy
to trigger the snappy maven profile for building snappy native libs into hbase.
Do the following to build the HBase tarball. Passing -Prelease will generate javadoc and run the RAT plugin to verify licenses on source.
% MAVEN_OPTS="-Xmx2g" mvn clean site install assembly:single -Dmaven.test.skip -Prelease
Follow the instructions at Publishing Maven Artifacts. The 'trick' to making it all work is answering the questions put to you by the mvn release plugin properly and making sure it is using the actual branch. Then, VERY IMPORTANT, before doing the mvn release:perform step, check and if necessary hand edit the release.properties file that was put under ${HBASE_HOME} by the previous step, release:prepare. You need to edit it to make it point at the right locations in SVN.
Use maven 3.0.x.
At the mvn release:perform step, before starting, if you are for example releasing hbase 0.92.0, you need to make sure the pom.xml version is 0.92.0-SNAPSHOT. This needs to be checked in. Since we do the maven release after actual release, I've been doing this checkin into a particular tag rather than into the actual release tag. So, say we released hbase 0.92.0 and now we want to do the release to the maven repository, in svn, the 0.92.0 release will be tagged 0.92.0. Making the maven release, copy the 0.92.0 tag to 0.92.0mvn. Check out this tag and change the version therein and commit.
Here is how I'd answer the questions at release:prepare time:
What is the release version for "HBase"? (org.apache.hbase:hbase) 0.92.0: :
What is SCM release tag or label for "HBase"? (org.apache.hbase:hbase) hbase-0.92.0: : 0.92.0mvnrelease
What is the new development version for "HBase"? (org.apache.hbase:hbase) 0.92.1-SNAPSHOT: :
[INFO] Transforming 'HBase'...
A strange issue I ran into was the one where the upload into the apache repository was being sprayed across multiple apache machines making it so I could not release. See INFRA-4482 Why is my upload to mvn spread across multiple repositories?.
Here is my ~/.m2/settings.xml
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <servers>
    <!-- To publish a snapshot of some part of Maven -->
    <server>
      <id>apache.snapshots.https</id>
      <username>YOUR_APACHE_ID</username>
      <password>YOUR_APACHE_PASSWORD</password>
    </server>
    <!-- To publish a website using Maven -->
    <!-- To stage a release of some part of Maven -->
    <server>
      <id>apache.releases.https</id>
      <username>YOUR_APACHE_ID</username>
      <password>YOUR_APACHE_PASSWORD</password>
    </server>
  </servers>
  <profiles>
    <profile>
      <id>apache-release</id>
      <properties>
        <gpg.keyname>YOUR_KEYNAME</gpg.keyname>
        <!-- Keyname is something like this ... 00A5F21E... list your gpg keys to find it -->
        <gpg.passphrase>YOUR_KEY_PASSWORD</gpg.passphrase>
      </properties>
    </profile>
  </profiles>
</settings>
When you run release:perform, pass -Papache-release else it will not 'sign' the artifacts it uploads.
If you run into the below, it's because you need to edit the version in the pom.xml and add -SNAPSHOT to the version (and commit).
[INFO] Scanning for projects... [INFO] Searching repository for plugin with prefix: 'release'. [INFO] ------------------------------------------------------------------------ [INFO] Building HBase [INFO] task-segment: [release:prepare] (aggregator-style) [INFO] ------------------------------------------------------------------------ [INFO] [release:prepare {execution: default-cli}] [INFO] ------------------------------------------------------------------------ [ERROR] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] You don't have a SNAPSHOT project in the reactor projects list. [INFO] ------------------------------------------------------------------------ [INFO] For more information, run Maven with the -e switch [INFO] ------------------------------------------------------------------------ [INFO] Total time: 3 seconds [INFO] Finished at: Sat Mar 26 18:11:07 PDT 2011 [INFO] Final Memory: 35M/423M [INFO] -----------------------------------------------------------------------
If you see Unable to find resource 'VM_global_library.vm', ignore it. It's not an error. It is officially ugly though.
Follow the instructions at Publishing Maven Artifacts after reading the below miscellany.
You must use maven 3.0.x (Check by running mvn -version).
Let me list out the commands I used first. The sections that follow dig in more on what is going on. In this example, we are releasing the 0.92.2 jar to the apache maven repository.
# First make a copy of the tag we want to release; presumes the release has been tagged already
# We do this because we need to make some commits for the mvn release plugin to work.
svn copy -m "Publishing 0.92.2 to mvn" https://svn.apache.org/repos/asf/hbase/tags/0.92.2 https://svn.apache.org/repos/asf/hbase/tags/0.92.2mvn
svn checkout https://svn.apache.org/repos/asf/hbase/tags/0.92.2mvn
cd 0.92.2mvn/
# Edit the version making it release version with a '-SNAPSHOT' suffix (See below for more on this)
vi pom.xml
svn commit -m "Add SNAPSHOT to the version" pom.xml
~/bin/mvn/bin/mvn release:clean
~/bin/mvn/bin/mvn release:prepare
# Answer questions and then ^C to kill the build after the last question. See below for more on this.
vi release.properties
# Change the references to trunk svn to be 0.92.2mvn; the release plugin presumes trunk
# Then restart the release:prepare -- it won't ask questions
# because the properties file exists.
~/bin/mvn/bin/mvn release:prepare
# The apache-release profile comes from the apache parent pom and does signing of artifacts published
~/bin/mvn/bin/mvn release:perform -Papache-release
# When done copying up to apache staging repository,
# browse to repository.apache.org, login and finish
# the release as according to the above
# "Publishing Maven Artifacts".
Below is more detail on the commands listed above.
At the mvn release:perform step, before starting, if you are for example releasing hbase 0.92.2, you need to make sure the pom.xml version is 0.92.2-SNAPSHOT. This needs to be checked in. Since we do the maven release after actual release, I've been doing this checkin into a copy of the release tag rather than into the actual release tag itself (presumes the release has been properly tagged in svn). So, say we released hbase 0.92.2 and now we want to do the release to the maven repository, in svn, the 0.92.2 release will be tagged 0.92.2. Making the maven release, copy the 0.92.2 tag to 0.92.2mvn. Check out this tag and change the version therein and commit.
Currently, the mvn release wants to go against trunk. I haven't figured out how to tell it to do otherwise, so I do the below hack. The hack comprises answering the questions put to you by the mvn release plugin properly, then immediately control-C'ing the build after the last question is asked, as the build release step starts to run. After control-C'ing it, you'll notice a release.properties in your build dir. Review it. Make sure it is using the proper branch (it tends to use trunk rather than the 0.92.2mvn or whatever that you want it to use), and hand edit the release.properties file that was put under ${HBASE_HOME} by the release:prepare invocation. When done, restart the release:prepare.
Here is how I'd answer the questions at release:prepare time:
What is the release version for "HBase"? (org.apache.hbase:hbase) 0.92.2: :
What is SCM release tag or label for "HBase"? (org.apache.hbase:hbase) hbase-0.92.2: : 0.92.2mvn
What is the new development version for "HBase"? (org.apache.hbase:hbase) 0.92.3-SNAPSHOT: :
[INFO] Transforming 'HBase'...
When you run release:perform, pass -Papache-release else it will not 'sign' the artifacts it uploads.
A strange issue I ran into was the one where the upload into the apache repository was being sprayed across multiple apache machines making it so I could not release. See INFRA-4482 Why is my upload to mvn spread across multiple repositories?.
Here is my ~/.m2/settings.xml. This is read by the release plugin. The apache-release profile will pick up your gpg key setup from here if you've specified it in the file. The password can be maven encrypted as suggested in "Publishing Maven Artifacts", but a plain text password works too (just don't let anyone see your local settings.xml).
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <servers>
    <!-- To publish a snapshot of some part of Maven -->
    <server>
      <id>apache.snapshots.https</id>
      <username>YOUR_APACHE_ID</username>
      <password>YOUR_APACHE_PASSWORD</password>
    </server>
    <!-- To publish a website using Maven -->
    <!-- To stage a release of some part of Maven -->
    <server>
      <id>apache.releases.https</id>
      <username>YOUR_APACHE_ID</username>
      <password>YOUR_APACHE_PASSWORD</password>
    </server>
  </servers>
  <profiles>
    <profile>
      <id>apache-release</id>
      <properties>
        <gpg.keyname>YOUR_KEYNAME</gpg.keyname>
        <!-- Keyname is something like this ... 00A5F21E... list your gpg keys to find it -->
        <gpg.passphrase>YOUR_KEY_PASSWORD</gpg.passphrase>
      </properties>
    </profile>
  </profiles>
</settings>
If you run into the below, it's because you need to edit the version in the pom.xml and add -SNAPSHOT to the version (and commit).
[INFO] Scanning for projects... [INFO] Searching repository for plugin with prefix: 'release'. [INFO] ------------------------------------------------------------------------ [INFO] Building HBase [INFO] task-segment: [release:prepare] (aggregator-style) [INFO] ------------------------------------------------------------------------ [INFO] [release:prepare {execution: default-cli}] [INFO] ------------------------------------------------------------------------ [ERROR] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] You don't have a SNAPSHOT project in the reactor projects list. [INFO] ------------------------------------------------------------------------ [INFO] For more information, run Maven with the -e switch [INFO] ------------------------------------------------------------------------ [INFO] Total time: 3 seconds [INFO] Finished at: Sat Mar 26 18:11:07 PDT 2011 [INFO] Final Memory: 35M/423M [INFO] -----------------------------------------------------------------------
The manual is marked up using docbook. We then use the docbkx maven plugin to transform the markup to html. This plugin is run when you specify the site goal, as when you run mvn site, or you can call the plugin explicitly to just generate the manual by doing mvn docbkx:generate-html (TODO: It looks like you have to run mvn site first because docbkx wants to include a transformed hbase-default.xml. Fix). When you run mvn site, we do the document generation twice, once to generate the multipage manual and then again for the single page manual (the single page version is easier to search).
The Apache HBase web site (including this reference guide) is maintained as part of the main Apache HBase source tree, under /src/main/docbkx and /src/main/site. The former, docbkx, is this reference guide as a bunch of xml marked up using docbook; the latter is the hbase site (the navbars, the header, the layout, etc.) and some of the documentation, mostly legacy pages that are in the process of being merged into the docbkx tree, which is converted to html by a maven plugin during the site build.
To contribute to the reference guide, edit these files under site or docbkx and submit them as a patch (see Section 15.11, 「Submitting Patches」). Your Jira should contain a summary of the changes in each section (see HBASE-6081 for an example).
To generate the site locally while you're working on it, run:
mvn site
Then you can load up the generated HTML files in your browser (files are under /target/site).
As of INFRA-5680 Migrate apache hbase website, to publish the website, build it, and then deploy it over a checkout of https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk. Finally, check it in. For example, if trunk is checked out at /Users/stack/checkouts/trunk and the hbase website, hbase.apache.org, is checked out at /Users/stack/checkouts/hbase.apache.org/trunk, to update the site, do the following:
# Build the site and deploy it to the checked out directory
# Getting the javadoc into site is a little tricky. You have to build it before you invoke 'site'.
$ MAVEN_OPTS=" -Xmx3g" mvn clean install -DskipTests javadoc:aggregate site site:stage -DstagingDirectory=/Users/stack/checkouts/hbase.apache.org/trunk
Now check the deployed site by viewing it in a browser: browse to file:///Users/stack/checkouts/hbase.apache.org/trunk/index.html and check that all is good. If all checks out, commit it and your new build will show up immediately at http://hbase.apache.org
$ cd /Users/stack/checkouts/hbase.apache.org/trunk
$ svn status
# Do an svn add of any new content...
$ svn add ....
$ svn commit -m 'Committing latest version of website...'
Developers, at a minimum, should familiarize themselves with the unit test detail; unit tests in HBase have a character not usually seen in other projects.
As of 0.96, HBase is split into multiple modules, which creates "interesting" rules for how and where tests are written. If you are writing code for hbase-server, see Section 15.7.2, 「Unit Tests」 for how to write your tests; these tests can spin up a minicluster and will need to be categorized. For any other module, for example hbase-common, the tests must be strict unit tests and just test the class under test - no use of the HBaseTestingUtility or minicluster is allowed (or even possible given the dependency tree).
mvn test
which will just run the tests IN THAT MODULE. If there are dependencies on other modules, then you will have to run the command from the ROOT HBASE DIRECTORY. This will run the tests in the other modules, unless you specify to skip the tests in that module. For instance, to skip the tests in the hbase-server module, you would run:
mvn clean test -Dskip-server-tests
from the top level directory to run all the tests in modules other than hbase-server. Note that you can specify to skip tests in multiple modules as well as just for a single module. For example, to skip the tests in hbase-server and hbase-common, you would run:
mvn clean test -Dskip-server-tests -Dskip-common-tests
Also, keep in mind that if you are running tests in the hbase-server module you will need to apply the maven profiles discussed in Section, 「Running tests」 to get the tests to run properly.
Apache HBase unit tests are subdivided into four categories: small, medium, large, and integration, with corresponding JUnit categories: SmallTests, MediumTests, LargeTests, IntegrationTests. JUnit categories are denoted using java annotations and look like this in your unit test code.
...
@Category(SmallTests.class)
public class TestHRegionInfo {
  @Test
  public void testCreateHRegionInfoName() throws Exception {
    // ...
  }
}
The above example shows how to mark a unit test as belonging to the small category. All unit tests in HBase have a categorization.
The first three categories, small, medium, and large, are for tests run when you type $ mvn test; i.e. these three categorizations are for HBase unit tests. The integration category is not for unit tests but for integration tests. These are run when you invoke $ mvn verify. Integration tests are described in Section 15.7.5, 「Integration Tests」 and will not be discussed further in this section on HBase unit tests.
Apache HBase uses a patched maven surefire plugin and maven profiles to implement its unit test characterizations.
Read the below to figure which annotation of the set small, medium, and large to put on your new HBase unit test.
Small tests are executed in a shared JVM. We put in this category all the tests that can be executed quickly in a shared JVM. The maximum execution time for a small test is 15 seconds, and small tests should not use a (mini)cluster.
Medium tests represent tests that must be executed before proposing a patch. They are designed to run in less than 30 minutes altogether, and are quite stable in their results. They are designed to last less than 50 seconds individually. They can use a cluster, and each of them is executed in a separate JVM.
Large tests are everything else. They are typically large-scale tests, regression tests for specific bugs, timeout tests, performance tests. They are executed before a commit on the pre-integration machines. They can be run on the developer machine as well.
Integration tests are system level tests. See Section 15.7.5, 「Integration Tests」 for more info.
Below we describe how to run the Apache HBase junit categories.
mvn test
will execute all small tests in a single JVM (no fork) and then medium tests in a separate JVM for each test instance. Medium tests are NOT executed if there is an error in a small test. Large tests are NOT executed. There is one report for small tests, and one report for medium tests if they are executed.
mvn test -P runAllTests
will execute small tests in a single JVM then medium and large tests in a separate JVM for each test. Medium and large tests are NOT executed if there is an error in a small test. Large tests are NOT executed if there is an error in a small or medium test. There is one report for small tests, and one report for medium and large tests if they are executed.
To run an individual test, e.g. MyTest, do
mvn test -Dtest=MyTest
You can also pass multiple, individual tests as a comma-delimited list:
mvn test -Dtest=MyTest1,MyTest2,MyTest3
You can also pass a package, which will run all tests under the package:
mvn test -Dtest=org.apache.hadoop.hbase.client.*
When -Dtest is specified, the localTests profile will be used. It will use the official release of maven surefire, rather than our custom surefire plugin, and the old connector (the HBase build uses a patched version of the maven surefire plugin). Each JUnit test is executed in a separate JVM (a fork per test class). There is no parallelization when tests are running in this mode. You will see a new message at the end of the report: "[INFO] Tests are skipped". It's harmless. However, no error is reported when a non-existent test case is specified, so make sure that the sum of Tests run: in the Results : section of the test reports matches the number of tests you specified.
mvn test -P runSmallTests
will execute "small" tests only, using a single JVM.
mvn test -P runMediumTests
will execute "medium" tests only, launching a new JVM for each test-class.
mvn test -P runLargeTests
will execute "large" tests only, launching a new JVM for each test-class.
For convenience, you can run
mvn test -P runDevTests
to execute both small and medium tests, using a single JVM.
By default, $ mvn test -P runAllTests runs 5 tests in parallel. This can be increased on a developer's machine. Allowing 2 tests in parallel per core, and about 2GB of memory per test (at the extreme), an 8 core, 24GB box could run 16 tests in parallel by core count, but the available memory limits it to 12 (24/2). To run all tests with 12 tests in parallel, do this: mvn test -P runAllTests -Dsurefire.secondPartThreadCount=12. To increase the speed further, you can use a ramdisk. You will need 2GB of memory to run all tests. You will also need to delete the files between test runs. The typical way to configure a ramdisk on Linux is:
$ sudo mkdir /ram2G
$ sudo mount -t tmpfs -o size=2048M tmpfs /ram2G
You can then use it to run all HBase tests with the command: mvn test -P runAllTests -Dsurefire.secondPartThreadCount=12 -Dtest.build.data.basedirectory=/ram2G
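The rule of thumb above (2 tests per core, about 2GB of memory per test, whichever limit is lower) can be written down as a small helper. This is a sketch; the function name max_parallel_tests is made up for illustration, and the inputs are the core count and the memory in GB that you are willing to give to the tests.

```shell
# Suggest a value for -Dsurefire.secondPartThreadCount.
# $1 = number of cores, $2 = memory in GB available for tests.
# Uses the rule of thumb: 2 tests per core, 2GB of memory per test.
max_parallel_tests() {
  local cores=$1 mem_gb=$2
  local by_cpu=$(( cores * 2 ))
  local by_mem=$(( mem_gb / 2 ))
  if [ "$by_mem" -lt "$by_cpu" ]; then
    echo "$by_mem"
  else
    echo "$by_cpu"
  fi
}
```

For the 8 core, 24GB box used in the text, mvn test -P runAllTests -Dsurefire.secondPartThreadCount=$(max_parallel_tests 8 24) passes 12.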
It's also possible to use the script hbasetests.sh. This script runs the medium and large tests in parallel with two maven instances, and provides a single report. This script does not use the hbase version of surefire so no parallelization is being done other than the two maven instances the script sets up. It must be executed from the directory which contains the pom.xml.
For example running
will execute small and medium tests. Running
./dev-support/hbasetests.sh runAllTests
will execute all tests. Running
./dev-support/hbasetests.sh replayFailed
will rerun the failed tests a second time, in a separate jvm and without parallelisation.
A custom Maven SureFire plugin listener checks a number of resources before and after each HBase unit test runs and logs its findings at the end of the test output files which can be found in target/surefire-reports per Maven module (Tests write test reports named for the test class into this directory. Check the *-out.txt files). The resources counted are the number of threads, the number of file descriptors, etc. If the number has increased, it adds a LEAK? comment in the logs. As you can have an HBase instance running in the background, some threads can be deleted/created without any specific action in the test. However, if the test does not work as expected, or if the test should not impact these resources, it's worth checking these log lines ...hbase.ResourceChecker(157): before... and ...hbase.ResourceChecker(157): after.... For example: 2012-09-26 09:22:15,315 INFO [pool-1-thread-1] hbase.ResourceChecker(157): after: regionserver.TestColumnSeeking#testReseeking Thread=65 (was 65), OpenFileDescriptor=107 (was 107), MaxFileDescriptor=10240 (was 10240), ConnectionCount=1 (was 1)
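Since the listener marks suspicious resource growth with a LEAK? comment, one way to review a whole module at once is to search its *-out.txt reports for that marker. The helper below is a sketch: the function name find_leaks is made up for illustration, and it assumes the marker appears literally as LEAK? in the report files.

```shell
# Print every line containing the literal marker "LEAK?" from the
# surefire *-out.txt reports, prefixed with the name of the report file.
# $1 = reports directory (defaults to target/surefire-reports).
find_leaks() {
  local reports_dir="${1:-target/surefire-reports}"
  grep -F -H 'LEAK?' "$reports_dir"/*-out.txt
}
```

Run it from a module directory after mvn test; an empty result means no report contained the marker.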
Whenever possible, tests should not use Thread.sleep, but rather wait for the real event they need. This is faster and clearer for the reader. Tests should not do a Thread.sleep without testing an ending condition. This makes it clear what the test is waiting for. Moreover, the test will work whatever the machine performance is. Sleeps should be minimal so the test is as fast as possible. Waiting for a variable should be done in a 40ms sleep loop. Waiting for a socket operation should be done in a 200 ms sleep loop.
Tests using an HRegion do not have to start a cluster: a region can use the local file system. Starting and stopping a cluster costs around 10 seconds, so clusters should not be started per test method but per test class. A started cluster must be shut down using HBaseTestingUtility#shutdownMiniCluster, which cleans the directories. As much as possible, tests should use the default settings for the cluster. When they don't, they should document it. This will allow the cluster to be shared later.
HBase integration/system tests are tests that go beyond HBase unit tests. They are generally long-lasting, sizeable (the test can be asked to process 1M rows or 1B rows), and targetable (they can take configuration that will point them at the ready-made cluster they are to run against; integration tests do not include cluster start/stop code). When verifying success, integration tests rely on public APIs only; they do not attempt to examine server internals to assert success or failure. Integration tests are what you would run when you need more elaborate proofing of a release candidate beyond what unit tests can do. They are not generally run on the Apache Continuous Integration build server; however, some sites opt to run integration tests as a part of their continuous testing on an actual cluster.
Integration tests currently live under the src/test directory in the hbase-it submodule and will match the regex: **/IntegrationTest*.java. All integration tests are also annotated with @Category(IntegrationTests.class).
Integration tests can be run in two modes: using a mini cluster, or against an actual distributed cluster. Maven failsafe is used to run the tests using the mini cluster. The IntegrationTestsDriver class is used for executing the tests against a distributed cluster. Integration tests SHOULD NOT assume that they are running against a mini cluster, and SHOULD NOT use private APIs to access cluster state. To interact with the distributed or mini cluster uniformly, the IntegrationTestingUtility and HBaseCluster classes and the public client APIs can be used.
HBase 0.92 added a verify maven target. Invoking it, for example by doing mvn verify, will run all the phases up to and including the verify phase via the maven failsafe plugin, running all the above mentioned HBase unit tests as well as tests that are in the HBase integration test group. After you have completed
mvn install -DskipTests
You can run just the integration tests by invoking:
cd hbase-it mvn verify
If you just want to run the integration tests from the top-level directory, you need to run two commands. First:
mvn failsafe:integration-test
This actually runs ALL the integration tests.
This command will always output BUILD SUCCESS even if there are test failures.
At this point, you could grep the output by hand looking for failed tests. However, maven will do this for us; just use:
mvn failsafe:verify
The above command basically looks at all the test results (so don't remove the 'target' directory) for test failures and reports the results.
This is very similar to how you specify running a subset of unit tests (see above), but use the property it.test instead of test. To just run IntegrationTestClassXYZ.java, use:
mvn failsafe:integration-test -Dit.test=IntegrationTestClassXYZ
The next thing you might want to do is run groups of integration tests, say all integration tests that are named IntegrationTestClassX*.java:
mvn failsafe:integration-test -Dit.test=*ClassX*
This runs everything that is an integration test that matches *ClassX*. This means anything matching: "**/IntegrationTest*ClassX*". You can also run multiple groups of integration tests using comma-delimited lists (similar to unit tests). Using a list of matches still supports full regex matching for each of the groups. This would look something like:
mvn failsafe:integration-test -Dit.test=*ClassX*,*ClassY
If you have an already-setup HBase cluster, you can launch the integration tests by invoking the class IntegrationTestsDriver. You may have to run test-compile first. The configuration will be picked up by the bin/hbase script.
mvn test-compile
Then launch the tests with:
bin/hbase [--config config_dir] org.apache.hadoop.hbase.IntegrationTestsDriver [-test=class_regex]
This execution will launch the tests under hbase-it/src/test that have the @Category(IntegrationTests.class) annotation and a name starting with IntegrationTest. If specified, class_regex will be used to filter test classes. The regex is checked against the full class name, so part of a class name can be used. IntegrationTestsDriver uses JUnit to run the tests. Currently there is no support for running integration tests against a distributed cluster using maven (see HBASE-6201).
The tests interact with the distributed cluster by using the methods in the DistributedHBaseCluster (implementing HBaseCluster) class, which in turn uses a pluggable ClusterManager. Concrete implementations provide the actual functionality for carrying out deployment-specific and environment-dependent tasks (SSH, etc). The default ClusterManager is HBaseClusterManager, which uses SSH to remotely execute start/stop/kill/signal commands and assumes some posix commands (ps, etc). It also assumes that the user running the test has enough "power" to start/stop servers on the remote machines. By default, it picks up HBASE_SSH_OPTS, HBASE_HOME, and HBASE_CONF_DIR from the env, and uses bin/hbase-daemon.sh to carry out the actions. Currently tarball deployments, deployments which use hbase-daemons.sh, and Apache Ambari deployments are supported. /etc/init.d/ scripts are not supported for now, but support can easily be added. For other deployment options, a ClusterManager can be implemented and plugged in.
In 0.96, a tool named ChaosMonkey has been introduced. It is modeled after the same-named tool by Netflix. Some of the tests use ChaosMonkey to simulate faults in the running cluster in the way of killing random servers, disconnecting servers, etc. ChaosMonkey can also be used as a stand-alone tool to run a (misbehaving) policy while you are running other tests.
ChaosMonkey defines Actions and Policies. Actions are sequences of events. We have at least the following actions:
Policies on the other hand are responsible for executing the actions based on a strategy. The default policy is to execute a random action every minute based on predefined action weights. ChaosMonkey executes predefined named policies until it is stopped. More than one policy can be active at any time.
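The weighted random choice that the default policy is described as making can be sketched as follows. This is a minimal illustration, not ChaosMonkey's actual code: the class WeightedActionPicker, its methods, and the weights used are hypothetical, and only the action names are taken from the log output in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of weight-based action selection; not a ChaosMonkey class. */
final class WeightedActionPicker {
  private final Map<String, Integer> weights = new LinkedHashMap<>();
  private final Random random;
  private int totalWeight = 0;

  WeightedActionPicker(Random random) {
    this.random = random;
  }

  /** Registers an action name with a positive weight, replacing any earlier weight. */
  void add(String action, int weight) {
    if (weight <= 0) {
      throw new IllegalArgumentException("weight must be positive");
    }
    Integer previous = weights.put(action, weight);
    if (previous != null) {
      totalWeight -= previous;
    }
    totalWeight += weight;
  }

  /** Picks one action; each action is chosen with probability weight / totalWeight. */
  String pick() {
    if (totalWeight == 0) {
      throw new IllegalStateException("no actions registered");
    }
    int ticket = random.nextInt(totalWeight);
    for (Map.Entry<String, Integer> entry : weights.entrySet()) {
      ticket -= entry.getValue();
      if (ticket < 0) {
        return entry.getKey();
      }
    }
    throw new AssertionError("unreachable: ticket is always below totalWeight");
  }
}
```

A periodic policy would call pick() once per period (60000 ms in the log output in this section) and then run the chosen action.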
To run ChaosMonkey as a standalone tool, deploy your HBase cluster as usual. ChaosMonkey uses the configuration from the bin/hbase script, so no extra configuration needs to be done. You can invoke ChaosMonkey by running:
bin/hbase org.apache.hadoop.hbase.util.ChaosMonkey
This will output something like:
12/11/19 23:21:57 INFO util.ChaosMonkey: Using ChaosMonkey Policy: class org.apache.hadoop.hbase.util.ChaosMonkey$PeriodicRandomActionPolicy, period:60000 12/11/19 23:21:57 INFO util.ChaosMonkey: Sleeping for 26953 to add jitter 12/11/19 23:22:24 INFO util.ChaosMonkey: Performing action: Restart active master 12/11/19 23:22:24 INFO util.ChaosMonkey: Killing master:master.example.com,60000,1353367210440 12/11/19 23:22:24 INFO hbase.HBaseCluster: Aborting Master: master.example.com,60000,1353367210440 12/11/19 23:22:24 INFO hbase.ClusterManager: Executing remote command: ps aux | grep master | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL , hostname:master.example.com 12/11/19 23:22:25 INFO hbase.ClusterManager: Executed remote command, exit code:0 , output: 12/11/19 23:22:25 INFO hbase.HBaseCluster: Waiting service:master to stop: master.example.com,60000,1353367210440 12/11/19 23:22:25 INFO hbase.ClusterManager: Executing remote command: ps aux | grep master | grep -v grep | tr -s ' ' | cut -d ' ' -f2 , hostname:master.example.com 12/11/19 23:22:25 INFO hbase.ClusterManager: Executed remote command, exit code:0 , output: 12/11/19 23:22:25 INFO util.ChaosMonkey: Killed master server:master.example.com,60000,1353367210440 12/11/19 23:22:25 INFO util.ChaosMonkey: Sleeping for:5000 12/11/19 23:22:30 INFO util.ChaosMonkey: Starting master:master.example.com 12/11/19 23:22:30 INFO hbase.HBaseCluster: Starting Master on: master.example.com 12/11/19 23:22:30 INFO hbase.ClusterManager: Executing remote command: /homes/enis/code/hbase-0.94/bin/../bin/hbase-daemon.sh --config /homes/enis/code/hbase-0.94/bin/../conf start master , hostname:master.example.com 12/11/19 23:22:31 INFO hbase.ClusterManager: Executed remote command, exit code:0 , output:starting master, logging to /homes/enis/code/hbase-0.94/bin/../logs/hbase-enis-master-master.example.com.out .... 
12/11/19 23:22:33 INFO util.ChaosMonkey: Started master: master.example.com,60000,1353367210440 12/11/19 23:22:33 INFO util.ChaosMonkey: Sleeping for:51321 12/11/19 23:23:24 INFO util.ChaosMonkey: Performing action: Restart random region server 12/11/19 23:23:24 INFO util.ChaosMonkey: Killing region server:rs3.example.com,60020,1353367027826 12/11/19 23:23:24 INFO hbase.HBaseCluster: Aborting RS: rs3.example.com,60020,1353367027826 12/11/19 23:23:24 INFO hbase.ClusterManager: Executing remote command: ps aux | grep regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL , hostname:rs3.example.com 12/11/19 23:23:25 INFO hbase.ClusterManager: Executed remote command, exit code:0 , output: 12/11/19 23:23:25 INFO hbase.HBaseCluster: Waiting service:regionserver to stop: rs3.example.com,60020,1353367027826 12/11/19 23:23:25 INFO hbase.ClusterManager: Executing remote command: ps aux | grep regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 , hostname:rs3.example.com 12/11/19 23:23:25 INFO hbase.ClusterManager: Executed remote command, exit code:0 , output: 12/11/19 23:23:25 INFO util.ChaosMonkey: Killed region server:rs3.example.com,60020,1353367027826. Reported num of rs:6 12/11/19 23:23:25 INFO util.ChaosMonkey: Sleeping for:60000 12/11/19 23:24:25 INFO util.ChaosMonkey: Starting region server:rs3.example.com 12/11/19 23:24:25 INFO hbase.HBaseCluster: Starting RS on: rs3.example.com 12/11/19 23:24:25 INFO hbase.ClusterManager: Executing remote command: /homes/enis/code/hbase-0.94/bin/../bin/hbase-daemon.sh --config /homes/enis/code/hbase-0.94/bin/../conf start regionserver , hostname:rs3.example.com 12/11/19 23:24:26 INFO hbase.ClusterManager: Executed remote command, exit code:0 , output:starting regionserver, logging to /homes/enis/code/hbase-0.94/bin/../logs/hbase-enis-regionserver-rs3.example.com.out 12/11/19 23:24:27 INFO util.ChaosMonkey: Started region server:rs3.example.com,60020,1353367027826. Reported num of rs:6
As you can see from the log, ChaosMonkey started the default PeriodicRandomActionPolicy, which is configured with all the available actions, and ran the RestartActiveMaster and RestartRandomRs actions. The ChaosMonkey tool, if run from the command line, will keep on running until the process is killed.
All commands are executed from the local HBase project directory.
Note: use Maven 3 (Maven 2 may work but we suggest you use Maven 3).
See Section 15.7.3, 「Running tests」, above, under Section 15.7.2, 「Unit Tests」.
As of 0.96, Apache HBase supports building against Apache Hadoop versions: 1.0.3, 2.0.0-alpha and 3.0.0-SNAPSHOT. By default, we will build with Hadoop-1.0.3. To change the version to run with Hadoop-2.0.0-alpha, you would run:
mvn -Dhadoop.profile=2.0 ...
That is, pass 2.0 for hadoop.profile to build against Hadoop 2.0. Not all tests may pass as of this writing, so you may need to pass -DskipTests unless you are inclined to fix the failing tests.
Similarly, for 3.0, you would just replace the profile value. Note that Hadoop-3.0.0-SNAPSHOT does not currently have a deployed maven artifact; you will need to build and install your own in your local maven repository if you want to run against this profile.
In earlier versions of Apache HBase, you can build against older versions of Apache Hadoop, notably Hadoop 0.22.x and 0.23.x. If you are running, for example, HBase-0.94 and want to build against Hadoop 0.22.x, you would run with:
mvn -Dhadoop.profile=22 ...
Apache HBase gets better only when people contribute!
As Apache HBase is an Apache Software Foundation project, see Appendix H, HBase and the Apache Software Foundation for more information about how the ASF functions.
Sign up for the dev-list and the user-list. See the mailing lists page. Posing questions - and helping to answer other people's questions - is encouraged! There are varying levels of experience on both lists so patience and politeness are encouraged (and please stay on topic.)
Check for existing issues in Jira. If it's either a new feature request, enhancement, or a bug, file a ticket.
The following is a guideline on setting Jira issue priorities:
Most development is done on TRUNK. However, there are branches for minor releases (e.g., 0.90.1, 0.90.2, and 0.90.3 are on the 0.90 branch).
If you have any questions on this just send an email to the dev dist-list.
In HBase we use JUnit 4. If you need to run miniclusters of HDFS, ZooKeeper, HBase, or MapReduce for testing, be sure to check out the HBaseTestingUtility. Alex Baranau of Sematext describes how it can be used in HBase Case-Study: Using HBaseTestingUtility for Local Testing and Development (2010).
Sometimes you don't need a full running server for unit testing. For example, some methods can make do with an org.apache.hadoop.hbase.Server instance or an org.apache.hadoop.hbase.master.MasterServices interface reference rather than a full-blown org.apache.hadoop.hbase.master.HMaster. In these cases, you may be able to get away with a mocked Server instance. For example:
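The idea of handing the code under test a stand-in rather than a running server can be sketched without any test library. The sketch below uses the JDK's java.lang.reflect.Proxy in place of Mockito so that it runs with no dependencies; MiniServer is a hypothetical, trimmed-down stand-in for the real Server interface, and ShutdownReporter and Stubs are likewise invented for illustration.

```java
import java.lang.reflect.Proxy;

/** Hypothetical, trimmed-down stand-in for org.apache.hadoop.hbase.Server. */
interface MiniServer {
  String getServerName();

  boolean isStopped();
}

/** Code under test: it needs only the interface, not a running master. */
final class ShutdownReporter {
  static String report(MiniServer server) {
    return server.getServerName() + (server.isStopped() ? " is stopped" : " is running");
  }
}

/** Builds canned MiniServer instances using a JDK dynamic proxy. */
final class Stubs {
  static MiniServer stubServer(String name, boolean stopped) {
    return (MiniServer) Proxy.newProxyInstance(
        MiniServer.class.getClassLoader(),
        new Class<?>[] {MiniServer.class},
        (proxy, method, args) -> {
          switch (method.getName()) {
            case "getServerName":
              return name;
            case "isStopped":
              return stopped;
            default:
              // Any call the test did not plan for fails loudly.
              throw new UnsupportedOperationException(method.getName());
          }
        });
  }
}
```

A test would then call ShutdownReporter.report(Stubs.stubServer("rs1.example.com", false)) and assert on the returned string, with no cluster started.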
See Section, 「Code Formatting」 and Section 15.11.5, 「Common Patch Feedback」.
Also, please pay attention to the interface stability/audience classifications that you will see all over our code base. They look like this at the head of the class:
@InterfaceAudience.Public @InterfaceStability.Stable
If the InterfaceAudience is Private, we can change the class (and we do not need to include an InterfaceStability mark). If a class is marked Public but its InterfaceStability is marked Unstable, we can change it. If it's marked Public/Evolving, we're allowed to change it but should try not to. If it's Public and Stable, we can't change it without a deprecation path or a really GREAT reason.
When you add new classes, mark them with the annotations above if they are publicly accessible. If you are not clear on how to mark your additions, ask on the dev list.
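The rules above amount to a small decision table. The sketch below restates them in code; the enums and the CompatibilityRules class are hypothetical names made up for illustration and are not the real InterfaceAudience and InterfaceStability annotations.

```java
import java.util.Objects;

/** Hypothetical mirrors of the audience and stability markings described above. */
enum Audience { PUBLIC, PRIVATE }

enum Stability { STABLE, EVOLVING, UNSTABLE }

enum ChangePolicy { FREE_TO_CHANGE, CHANGE_WITH_CARE, NEEDS_DEPRECATION_PATH }

final class CompatibilityRules {
  /**
   * Maps a class's markings to what a patch may do to it.
   * Stability may be null for Private classes, which need no stability mark.
   */
  static ChangePolicy policyFor(Audience audience, Stability stability) {
    if (audience == Audience.PRIVATE) {
      return ChangePolicy.FREE_TO_CHANGE;
    }
    Objects.requireNonNull(stability, "Public classes need a stability mark");
    switch (stability) {
      case UNSTABLE:
        return ChangePolicy.FREE_TO_CHANGE;
      case EVOLVING:
        return ChangePolicy.CHANGE_WITH_CARE;
      default:
        // Public and Stable: change only with a deprecation path.
        return ChangePolicy.NEEDS_DEPRECATION_PATH;
    }
  }
}
```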
This convention comes from our parent project Hadoop.
We don't have many but what we have we list below. All are subject to challenge of course but until then, please hold to the rules of the road.
ZooKeeper state should be transient (treat it like memory). If it is deleted, hbase should be able to recover and essentially be in the same state[31].
If you are developing Apache HBase, frequently it is useful to test your changes against a more-real cluster than what you find in unit tests. In this case, HBase can be run directly from the source in local-mode. All you need to do is run:
This will spin up a full local-cluster, just as if you had packaged up HBase and installed it on your machine.
Keep in mind that you will need to have installed HBase into your local maven repository for the in-situ cluster to work properly. That is, you will need to run:
mvn clean install -DskipTests
to ensure that maven can find the correct classpath and dependencies. Generally, the above command is just a good thing to try running first, if maven is acting oddly.
If you are new to submitting patches to open source or new to submitting patches to Apache, I'd suggest you start by reading the On Contributing Patches page from the Apache Commons Project. It's a nice overview that applies equally to the Apache HBase Project.
See the aforementioned Apache Commons link for how to make patches against a checked out subversion repository. Patch files can also be easily generated from Eclipse, for example by selecting "Team -> Create Patch". Patches can also be created by git diff and svn diff.
Please submit one patch-file per Jira. For example, if multiple files are changed make sure the selected resource when generating the patch is a directory. Patch files can reflect changes in multiple files.
Make sure you review Section, 「Code Formatting」 for code style.
The patch file should have the Apache HBase Jira ticket in the name. For example, if a patch was submitted for Foo.java, then a patch file called Foo_HBASE_XXXX.patch would be acceptable where XXXX is the Apache HBase Jira number.
If you are generating the patch from a branch, then including the target branch in the filename is advised, e.g., HBASE-XXXX-0.90.patch.
Yes, please. Please try to include unit tests with every code patch (and especially new classes and large changes). Make sure unit tests pass locally before submitting the patch.
Also, see Section, 「Mockito」.
If you are creating a new unit test class, notice how other unit test classes have classification/sizing annotations at the top and a static method on the end. Be sure to include these in any new unit test files you generate. See Section 15.7, 「Tests」 for more on how the annotations work.
The patch should be attached to the associated Jira ticket "More Actions -> Attach Files". Make sure you click the ASF license inclusion, otherwise the patch can't be considered for inclusion.
Once attached to the ticket, click "Submit Patch" and the status of the ticket will change. Committers will review submitted patches for inclusion into the codebase. Please understand that not every patch may get committed, and that feedback will likely be provided on the patch. Fear not, though, because the Apache HBase community is helpful!
The following items are representative of common patch feedback. Your patch process will go faster if these are taken into account before submission.
See the Java coding standards for more information on coding conventions in Java.
Rather than do this...
if ( foo.equals( bar ) ) { // don't do this
... do this instead...
if (foo.equals(bar)) {
Also, rather than do this...
foo = barArray[ i ]; // don't do this
... do this instead...
foo = barArray[i];
Auto-generated code in Eclipse often looks like this...
public void readFields(DataInput arg0) throws IOException { // don't do this foo = arg0.readUTF(); // don't do this
... do this instead ...
public void readFields(DataInput di) throws IOException { foo = di.readUTF();
See the difference? 'arg0' is what Eclipse uses for arguments by default.
Keep lines less than 100 characters.
Bar bar = foo.veryLongMethodWithManyArguments(argument1, argument2, argument3, argument4, argument5, argument6, argument7, argument8, argument9); // don't do this
... do something like this instead ...
Bar bar = foo.veryLongMethodWithManyArguments( argument1, argument2, argument3, argument4, argument5, argument6, argument7, argument8, argument9);
This happens more than people would imagine.
Bar bar = foo.getBar(); <--- imagine there's an extra space(s) after the semicolon instead of a line break.
Make sure there's a line-break after the end of your code, and also avoid lines that have nothing but whitespace.
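The line-length and whitespace points above are easy to check mechanically before attaching a patch. The following is a minimal sketch, assuming "\n" line endings; PatchStyleCheck is a hypothetical helper written for this example, not part of the HBase build or of the precommit tooling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-submission check for the whitespace and line-length feedback above. */
final class PatchStyleCheck {
  /** Lines must be shorter than this many characters. */
  static final int MAX_LINE_LENGTH = 100;

  /** Returns one message per problem found; an empty list means the text is clean. */
  static List<String> check(String source) {
    List<String> problems = new ArrayList<>();
    if (!source.isEmpty() && !source.endsWith("\n")) {
      problems.add("no line break at end of file");
    }
    // Limit -1 keeps the empty fragment after a final line break; it never triggers a check.
    String[] lines = source.split("\n", -1);
    for (int i = 0; i < lines.length; i++) {
      String line = lines[i];
      int lineNumber = i + 1;
      if (line.length() >= MAX_LINE_LENGTH) {
        problems.add("line " + lineNumber + ": " + line.length() + " characters");
      }
      if (!line.isEmpty() && line.trim().isEmpty()) {
        problems.add("line " + lineNumber + ": only whitespace");
      } else if (line.endsWith(" ") || line.endsWith("\t")) {
        problems.add("line " + lineNumber + ": trailing whitespace");
      }
    }
    return problems;
  }
}
```

Calling PatchStyleCheck.check on the contents of each changed file and printing the returned messages is enough to catch these items locally.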
In 0.96, HBase moved to protobufs. The below section on Writables applies to 0.94.x and previous, not to 0.96 and beyond.
Every class returned by RegionServers must implement Writable. If you are creating a new class that needs to implement this interface, don't forget the default constructor.
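The reason the default constructor matters is that the receiving side first creates an empty instance and only then fills it in through readFields. The sketch below shows that round trip. To stay self-contained it declares a local Writable interface with the same two methods as Hadoop's org.apache.hadoop.io.Writable; RegionSummary and WritableRoundTrip are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

/** Local stand-in with the same two methods as Hadoop's Writable interface. */
interface Writable {
  void write(DataOutput out) throws IOException;

  void readFields(DataInput in) throws IOException;
}

/** Hypothetical value class, used only to illustrate the pattern. */
final class RegionSummary implements Writable {
  private String regionName = "";
  private long storeFileCount;

  /** Default constructor: the receiver creates an empty instance, then calls readFields. */
  RegionSummary() {}

  RegionSummary(String regionName, long storeFileCount) {
    this.regionName = regionName;
    this.storeFileCount = storeFileCount;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(regionName);
    out.writeLong(storeFileCount);
  }

  @Override
  public void readFields(DataInput di) throws IOException {
    regionName = di.readUTF();
    storeFileCount = di.readLong();
  }

  String getRegionName() {
    return regionName;
  }

  long getStoreFileCount() {
    return storeFileCount;
  }
}

final class WritableRoundTrip {
  /** Serializes the original, then rebuilds it the way a receiver would. */
  static <T extends Writable> T copy(Writable original, Class<T> type) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      // This call fails if the class has no no-argument constructor.
      T fresh = type.getDeclaredConstructor().newInstance();
      fresh.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return fresh;
    } catch (IOException | ReflectiveOperationException e) {
      throw new IllegalStateException("round trip failed", e);
    }
  }
}
```

Removing the no-argument constructor from RegionSummary makes copy fail at the newInstance call, which is the failure this feedback item is meant to prevent.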
This is also a very common feedback item. Don't forget Javadoc!
Javadoc warnings are checked during precommit. If the precommit tool gives you a '-1', please fix the javadoc issue. Your patch won't be committed if it adds such warnings.
Findbugs is used to detect common bug patterns. Like Javadoc, it is checked during the precommit build up on Apache's Jenkins, and as with Javadoc, please fix any warnings it reports. You can run findbugs locally with 'mvn findbugs:findbugs': it will generate the findbugs files locally. Sometimes, you may have to write code smarter than Findbugs. You can annotate your code to tell Findbugs you know what you're doing, by annotating your class with:
@edu.umd.cs.findbugs.annotations.SuppressWarnings( value="HE_EQUALS_USE_HASHCODE", justification="I know what I'm doing")
Note that we're using the Apache-licensed version of the annotations.
Don't just leave the @param arguments the way your IDE generated them. Don't do this...
/** * * @param bar <---- don't do this!!!! * @return <---- or this!!!! */ public Foo getFoo(Bar bar);
... either add something descriptive to the @param and @return lines, or just remove them. But the preference is to add something descriptive and useful.
If you submit a patch for one thing, don't do auto-reformatting or unrelated reformatting of code on a completely different area of code.
Likewise, don't add unrelated cleanup or refactorings outside the scope of your Jira.
Larger patches should go through ReviewBoard.
For more information on how to use ReviewBoard, see the ReviewBoard documentation.
Committers do this. See How To Commit in the Apache HBase wiki.
Committers will also resolve the Jira, typically after the patch passes a build.
If a committer commits a patch it is their responsibility to make sure it passes the test suite. It is helpful if contributors keep an eye out that their patch does not break the hbase build and/or tests but ultimately, a contributor cannot be expected to be up on the particular vagaries and interconnections that occur in a project like hbase. A committer should.
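To configure ZooKeeper, a simpler approach is to change the ZooKeeper settings in conf/hbase-site.xml. ZooKeeper options are written as properties in hbase-site.xml, with the option name prefixed by hbase.zookeeper.property. For example, ZooKeeper's clientPort setting can be changed by setting hbase.zookeeper.property.clientPort. All default values, including those for ZooKeeper, are determined by HBase; see Section, 「HBase Default Configuration」. Search for the hbase.zookeeper.property prefix to find the ZooKeeper-related settings. [33]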
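For the ZooKeeper configuration, you must at least list the ensemble servers in hbase-site.xml, using the hbase.zookeeper.quorum property. This property defaults to localhost, which is clearly not suitable for a distributed deployment (remote connections cannot be made).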
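You can run a single ZooKeeper node, but in production you should deploy 3, 5, or 7 nodes. The more nodes you deploy, the more host failures the ensemble tolerates. Deploy an odd number of nodes: ZooKeeper supports an even number, but an even-sized ensemble tolerates no more failures than the odd-sized ensemble one node smaller. Give each ZooKeeper server around 1GB of RAM and, if possible, its own dedicated disk (a dedicated disk helps ensure that ZooKeeper performs well). On a heavily loaded cluster, do not run ZooKeeper on the same machines as RegionServers, and likewise for DataNodes and TaskTrackers.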
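The preference for odd ensemble sizes follows from majority arithmetic: a ZooKeeper ensemble stays available only while more than half of its servers are up. The short sketch below works this out; QuorumMath is a hypothetical helper written for this example, not a ZooKeeper or HBase class.

```java
/** Hypothetical helper that works out majority sizes for an ensemble. */
final class QuorumMath {
  /** Smallest number of servers that forms a strict majority of the ensemble. */
  static int majority(int ensembleSize) {
    if (ensembleSize <= 0) {
      throw new IllegalArgumentException("ensemble size must be positive");
    }
    return ensembleSize / 2 + 1;
  }

  /** Number of servers that can fail while a majority still remains. */
  static int toleratedFailures(int ensembleSize) {
    return ensembleSize - majority(ensembleSize);
  }
}
```

With these definitions, ensembles of 3 and 4 servers both tolerate one failure, and ensembles of 5 and 6 both tolerate two, so the extra even-numbered server buys no additional fault tolerance.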
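For example, to have HBase manage a ZooKeeper quorum on nodes rs{1,2,3,4,5}.example.com, listening on port 2222 (the default is 2181), make sure HBASE_MANAGES_ZK in conf/hbase-env.sh is set to true, then edit conf/hbase-site.xml and set hbase.zookeeper.property.clientPort and hbase.zookeeper.quorum. You should also set the hbase.zookeeper.property.dataDir property to change the directory where ZooKeeper stores its data; the default is /tmp.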
<configuration> ... <property> <name>hbase.zookeeper.property.clientPort</name> <value>2222</value> <description>Property from ZooKeeper's config zoo.cfg. The port at which the clients will connect. </description> </property> <property> <name>hbase.zookeeper.quorum</name> <value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value> <description>Comma separated list of servers in the ZooKeeper Quorum. For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". By default this is set to localhost for local and pseudo-distributed modes of operation. For a fully-distributed setup, this should be set to a full list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh this is the list of servers which we will start/stop ZooKeeper on. </description> </property> <property> <name>hbase.zookeeper.property.dataDir</name> <value>/usr/local/zookeeper</value> <description>Property from ZooKeeper's config zoo.cfg. The directory where the snapshot is stored. </description> </property> ... </configuration>
Be sure to set up the data dir cleaner described under Zookeeper Maintenance; otherwise you could have 'interesting' problems a couple of months in. For example, ZooKeeper could start dropping sessions if it has to run through a directory of hundreds of thousands of logs, which it is wont to do around leader re-election time, a process that is rare but does run on occasion, whether because a machine is dropped or happens to hiccup.
To point HBase at an existing ZooKeeper cluster that is not managed by HBase, set HBASE_MANAGES_ZK in conf/hbase-env.sh to false:
... # Tell HBase whether it should manage it's own instance of Zookeeper or not. export HBASE_MANAGES_ZK=false
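Next, specify the ZooKeeper hosts and port. You can set them in hbase-site.xml, or add a zoo.cfg configuration file to HBase's CLASSPATH. HBase gives zoo.cfg priority when loading the configuration.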
${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
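You can use this command to start ZooKeeper without starting HBase. Make sure HBASE_MANAGES_ZK is set to false if you want ZooKeeper to stay up when HBase restarts.
For questions about running a standalone ZooKeeper, see the ZooKeeper Getting Started guide.
Newer releases of HBase (>= 0.92) support connecting to a ZooKeeper Quorum that uses SASL authentication (available in ZooKeeper 3.4.0 and later).
This section describes how to set up HBase to mutually authenticate with a ZooKeeper Quorum. ZooKeeper/HBase mutual authentication (HBASE-2418) is a required part of a complete secure HBase configuration (HBASE-3025). To keep the explanation simple, this section ignores the additional configuration that is required (secure HDFS and Coprocessor configuration). We recommend starting with an HBase-managed Zookeeper configuration (as opposed to a standalone Zookeeper quorum) for ease of learning.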
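You need a working Kerberos KDC setup. For each $HOST that will run a ZooKeeper server, you should have a principal zookeeper/$HOST. For each such host, add a service key for zookeeper/$HOST (using the ktadd command of the kadmin or kadmin.local tool), copy the key file to $HOST, and make it readable only to the user that will run zookeeper on $HOST. Note the location of this file, which we will use below as $PATH_TO_ZOOKEEPER_KEYTAB.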
Similarly, for each $HOST that will run an HBase server (master or regionserver), you should have a principal: hbase/$HOST. For each host, add a keytab file called hbase.keytab containing a service key for hbase/$HOST, copy this file to $HOST, and make it readable only to the user that will run an HBase service on $HOST. Note the location of this file, which we will use below as $PATH_TO_HBASE_KEYTAB.
Each user who will be an HBase client should also be given a Kerberos principal. This principal should usually have a password assigned to it (as opposed to, as with the HBase servers, a keytab file) which only this user knows. The client's principal's maxrenewlife should be set so that it can be renewed enough so that the user can complete their HBase client processes. For example, if a user runs a long-running HBase client process that takes at most 3 days, we might create this user's principal within kadmin with: addprinc -maxrenewlife 3days. The Zookeeper client and server libraries manage their own ticket refreshment by running threads that wake up periodically to do the refreshment.
On each host that will run an HBase client (e.g. hbase shell), add the following file to the HBase home directory's conf directory:
Client { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=false useTicketCache=true; };
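We'll refer to this JAAS configuration file as $CLIENT_CONF below.
On each node that will run a zookeeper, a master, or a regionserver, create a JAAS configuration file in the conf directory of the node's HBASE_HOME directory that looks like the following: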
Server { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" storeKey=true useTicketCache=false principal="zookeeper/$HOST"; }; Client { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true useTicketCache=false keyTab="$PATH_TO_HBASE_KEYTAB" principal="hbase/$HOST"; };
The Server section is used by the Zookeeper quorum servers, and the Client section is used by the HBase master and regionservers. The path to this file should be substituted for the text $HBASE_SERVER_CONF in the hbase-env.sh listing below.
The path to the client JAAS configuration file created earlier should be substituted for the text $CLIENT_CONF in the hbase-env.sh listing below.
Modify your hbase-env.sh to include the following:
export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF" export HBASE_MANAGES_ZK=true export HBASE_ZOOKEEPER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
where $HBASE_SERVER_CONF and $CLIENT_CONF are the full paths to the JAAS configuration files created above.
Modify your hbase-site.xml on each node that will run zookeeper, master or regionserver to contain:
<configuration> <property> <name>hbase.zookeeper.quorum</name> <value>$ZK_NODES</value> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> </property> <property> <name>hbase.zookeeper.property.authProvider.1</name> <value>org.apache.zookeeper.server.auth.SASLAuthenticationProvider</value> </property> <property> <name>hbase.zookeeper.property.kerberos.removeHostFromPrincipal</name> <value>true</value> </property> <property> <name>hbase.zookeeper.property.kerberos.removeRealmFromPrincipal</name> <value>true</value> </property> </configuration>
where $ZK_NODES is the comma-separated list of hostnames of the Zookeeper Quorum hosts.
Start your hbase cluster by running one or more of the following set of commands on the appropriate hosts:
bin/hbase zookeeper start bin/hbase master start bin/hbase regionserver start
Add a JAAS configuration file that looks like:
Client { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true useTicketCache=false keyTab="$PATH_TO_HBASE_KEYTAB" principal="hbase/$HOST"; };
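where $PATH_TO_HBASE_KEYTAB is the keytab created above for the HBase services that run on this host, and $HOST is the hostname of that node. Put this file in the HBase home's configuration directory. We'll refer to this file's full pathname as $HBASE_SERVER_CONF below.
Modify your hbase-env.sh to include the following: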
export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF" export HBASE_MANAGES_ZK=false export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF" export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
Modify your hbase-site.xml on each node to contain:
<configuration> <property> <name>hbase.zookeeper.quorum</name> <value>$ZK_NODES</value> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> </property> </configuration>
where $ZK_NODES is the comma-separated list of hostnames of the Zookeeper Quorum hosts.
Add a zoo.cfg for each Zookeeper Quorum host containing:
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider kerberos.removeHostFromPrincipal=true kerberos.removeRealmFromPrincipal=true
Also on each of these hosts, create a JAAS configuration file containing:
Server { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" storeKey=true useTicketCache=false principal="zookeeper/$HOST"; };
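where $HOST is the hostname of each Quorum host. We will refer to the full pathname of this file as $ZK_SERVER_CONF below.
Start your Zookeepers on each Zookeeper Quorum host with: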
SERVER_JVMFLAGS="-Djava.security.auth.login.config=$ZK_SERVER_CONF" bin/zkServer start
Start your HBase cluster by running one or more of the following commands on the appropriate nodes:
bin/hbase master start bin/hbase regionserver start
If the configuration above is successful, you should see something similar to the following in your Zookeeper server logs:
11/12/05 22:43:39 INFO zookeeper.Login: successfully logged in. 11/12/05 22:43:39 INFO server.NIOServerCnxnFactory: binding to port 11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh thread started. 11/12/05 22:43:39 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:39 UTC 2011 11/12/05 22:43:39 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:39 UTC 2011 11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:36:42 UTC 2011 .. 11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Successfully authenticated client: authenticationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN; authorizationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN. 11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Setting authorizedID: hbase 11/12/05 22:43:59 INFO server.ZooKeeperServer: adding SASL authorization for authorizationID: hbase
On the Zookeeper client side (HBase master or regionserver), you should see something similar to the following:
11/12/05 22:43:59 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-166-175-249.us-west-1.compute.internal:2181 sessionTimeout=180000 watcher=master:60000 11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Opening socket connection to server / 11/12/05 22:43:59 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 14851@ip-10-166-175-249 11/12/05 22:43:59 INFO zookeeper.Login: successfully logged in. 11/12/05 22:43:59 INFO client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism. 11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh thread started. 11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-166-175-249.us-west-1.compute.internal/, initiating session 11/12/05 22:43:59 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:59 UTC 2011 11/12/05 22:43:59 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:59 UTC 2011 11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:30:37 UTC 2011 11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-166-175-249.us-west-1.compute.internal/, sessionid = 0x134106594320000, negotiated timeout = 180000
git clone git://git.apache.org/hbase.git cd hbase mvn -Psecurity,localTests clean test -Dtest=TestZooKeeperACL
Then configure HBase as described above. Manually edit target/cached_classpath.txt (see below).
bin/hbase zookeeper & bin/hbase master & bin/hbase regionserver &
You must override the standard hadoop-core jar file in target/cached_classpath.txt with the version containing the HADOOP-7070 fix. You can use the following script to do this:
echo `find ~/.m2 -name "*hadoop-core*7070*SNAPSHOT.jar"` ':' `cat target/cached_classpath.txt` | sed 's/ //g' > target/tmp.txt mv target/tmp.txt target/cached_classpath.txt
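[33] For the full list of ZooKeeper configuration options, see ZooKeeper's zoo.cfg. HBase does not ship with a zoo.cfg, so you will need to browse the conf directory in an appropriate ZooKeeper download.
Feature branches are easy to make. You do not have to be a committer to make one. Just ask on the developer mailing list for the name of your branch to be added to JIRA, and a committer will add it for you. Thereafter you can file issues against your feature branch in the Apache HBase (TM) JIRA. You keep your code elsewhere (it should be public so that it can be observed) and you can update the dev mailing list on progress. When the feature is ready for commit, three +1s from committers will get your feature merged.[34]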
The below policy is something we put in place 09/2012. It is a suggested policy rather than a hard requirement. We want to try it first to see if it works before we cast it in stone.
Apache HBase is made of components. Components have one or more owners (see Section 17.2.1, 「Component Owner」). See the 'Description' field on the components JIRA page for who the current owners are by component.
Patches that fit within the scope of a single Apache HBase component require, at least, a +1 by one of the component's owners before commit. If owners are absent -- busy or otherwise -- two +1s by non-owners will suffice.
Patches that span components need at least two +1s before they can be committed, preferably +1s by owners of components touched by the x-component patch (TODO: This needs tightening up but I think fine for first pass).
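Any -1 on a patch by anyone vetoes a patch; it cannot be committed until the justification for the -1 is addressed.
Component owners are listed in the description field on the Apache HBase JIRA components page. The owners are listed in the description field rather than in the "Component Lead" field because the latter only allows one individual to be listed, whereas we encourage components to have multiple owners.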
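The voting rules above can be restated as a single predicate. This is only a sketch of the suggested policy: CommitPolicy and its parameters are hypothetical, and the check does not model whether owners are actually absent, which the policy leaves to judgment.

```java
/** Hypothetical restatement of the suggested review policy; not an HBase tool. */
final class CommitPolicy {
  /**
   * @param spansComponents whether the patch touches more than one component
   * @param ownerPlusOnes number of +1 votes from owners of the touched components
   * @param nonOwnerPlusOnes number of +1 votes from reviewers who are not owners
   * @param minusOnes number of -1 votes whose justification is still unaddressed
   */
  static boolean mayCommit(
      boolean spansComponents, int ownerPlusOnes, int nonOwnerPlusOnes, int minusOnes) {
    if (minusOnes > 0) {
      return false; // any outstanding -1 is a veto
    }
    if (spansComponents) {
      return ownerPlusOnes + nonOwnerPlusOnes >= 2;
    }
    // Single component: one owner +1, or two non-owner +1s (meant for when owners are absent).
    return ownerPlusOnes >= 1 || nonOwnerPlusOnes >= 2;
  }
}
```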
Owners are volunteers who are (usually, but not necessarily) expert in their component domain and may have an agenda on how they think their Apache HBase component should evolve.
Duties include:
Owners will try and review patches that land within their component's scope.
If applicable, if an owner has an agenda, they will publish their goals or the design toward which they are driving their component.
If you would like to volunteer as a component owner, just write to the dev list and we'll sign you up. Owners do not need to be committers.
A.1. General
When should I use HBase?
See Section 9.1, 「Overview」 in the Architecture chapter.
Are there other HBase FAQs?
See the FAQ that is up on the wiki, HBase Wiki FAQ.
Does HBase support SQL?
Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. See Chapter 5, Data Model for examples on the HBase client.
Where can I find examples of NoSQL/HBase?
See the link to the BigTable paper in Appendix F, Other Information About HBase, as well as the other papers.
What is the history of HBase?
A.2. Architecture
How does HBase handle Region-RegionServer assignment and locality?
A.3. Configuration
How can I get started with my first cluster?
Where can I learn about the rest of the configuration options?
A.4. Schema Design / Data Access
How should I design my schema in HBase?
See Chapter 5, Data Model and Chapter 6, HBase and Schema Design.
How can I store (fill in the blank) in HBase?
How can I handle secondary indexes in HBase?
See Section 6.9, 「Secondary Indexes and Alternate Query Paths」.
Can I change a table's rowkeys?
This is a very common question. You can't. See Section 6.3.5, 「Immutability of Rowkeys」.
What APIs does HBase support?
See Chapter 5, Data Model, Section 9.3, 「Client」 and Section 10.1, 「Non-Java Languages Talking to the JVM」.
A.5. MapReduce
How can I use MapReduce with HBase?
A.6. Performance and Troubleshooting
How can I improve HBase cluster performance?
How can I troubleshoot my HBase cluster?
A.7. Amazon EC2
I am running HBase on Amazon EC2 and...
EC2 issues are a special case. See the Troubleshooting Section 12.12, 「Amazon EC2」 and the Performance Section 11.11, 「Amazon EC2」.
A.8. Operations
How do I manage my HBase cluster?
How do I back up my HBase cluster?
A.9. HBase in Action
Where can I find interesting videos and presentations on HBase?
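HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and for repairing a corrupted HBase. It works in two basic modes: a read-only inconsistency identifying mode and a multi-phase read-write repair mode.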
$ ./bin/hbase hbck
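At the end of the command's output it prints OK or tells you the number of INCONSISTENCIES present. You may want to run hbck a few times because some inconsistencies can be transient (e.g. the cluster is starting up or a region is splitting). Operationally you may want to run hbck regularly and set up an alert (e.g. via nagios) if it repeatedly reports inconsistencies. A run of hbck will report a list of inconsistencies along with a brief description of the regions and tables affected. Using the -details option will report more details, including a representative listing of all the splits present in all the tables.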
$ ./bin/hbase hbck -details
If after several runs, inconsistencies continue to be reported, you may have encountered a corruption. These should be rare, but in the event they occur newer versions of HBase include the hbck tool enabled with automatic repair options.
There are two invariants that when violated create inconsistencies in HBase:
Repairs generally work in three phases: a read-only information gathering phase that identifies inconsistencies, a table integrity repair phase that restores the table integrity invariant, and finally a region consistency repair phase that restores the region consistency invariant. Starting from version 0.90.0, hbck could detect region consistency problems and report on a subset of possible table integrity problems. It also included the ability to automatically fix the most common inconsistencies, namely region assignment and deployment consistency problems. This repair could be done by using the -fix command line option. This option closes regions if they are open on the wrong server or on multiple region servers, and also assigns regions to region servers if they are not open.
Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options are introduced to aid repairing a corrupted HBase. This hbck sometimes goes by the nickname 「uberhbck」. Each particular version of uberhbck is compatible with HBase releases of the same major version (for example, the 0.90.7 uberhbck can repair a 0.90.4). However, versions <=0.90.6 and versions <=0.92.1 may require restarting the master or failing over to a backup master.
When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first. These are generally region consistency repairs: localized single region repairs that only modify in-memory data, ephemeral zookeeper data, or patch holes in the META table. Region consistency requires that the state of the region’s data in HDFS (.regioninfo files), the region’s row in the .META. table, and the region’s deployment/assignments on region servers and the master are all in accordance. Options for repairing region consistency include:
-fixAssignments (equivalent to the 0.90 -fix option) repairs unassigned, incorrectly assigned or multiply assigned regions.
-fixMeta removes meta rows when corresponding regions are not present in HDFS and adds new meta rows if the regions are present in HDFS while not in META.
To fix deployment and assignment problems you can run this command:
$ ./bin/hbase hbck -fixAssignments
To fix deployment and assignment problems as well as repairing incorrect meta rows you can run this command:
$ ./bin/hbase hbck -fixAssignments -fixMeta
There are a few classes of table integrity problems that are low risk repairs. The first two are degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are automatically handled by sidelining the data to a temporary directory (/hbck/xxxx). The third low-risk class is hdfs region holes. This can be repaired by using the -fixHdfsHoles option for fabricating new empty regions on the file system. If holes are detected you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new region consistent.
$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
Since this is a common operation, we’ve added the -repairHoles flag that is equivalent to the previous command:
$ ./bin/hbase hbck -repairHoles
If inconsistencies still remain after these steps, you most likely have table integrity problems related to orphaned or overlapping regions.
Analyze the output of an hbck -details run first, so that you attempt repairs only on the problems the checks identify. Because these repairs are riskier, there are safeguards that should be used to limit the scope of the repairs. WARNING: These options are relatively new and have only been tested on online but idle HBase instances (no reads/writes). Use at your own risk in an active production environment! The options for repairing table integrity violations include:
-fixHdfsOrphans option for 「adopting」 a region directory that is missing a region metadata file (the .regioninfo file).
-fixHdfsOverlaps ability for fixing overlapping regions.
-maxMerge <n> maximum number of overlapping regions to merge.
-sidelineBigOverlaps if more than maxMerge regions are overlapping, attempt to sideline the regions overlapping with the most other regions.
-maxOverlapsToSideline <n> if sidelining large overlapping regions, sideline at most n regions.
-repair includes all the region consistency options and only the hole repairing table integrity options.
To limit repairs to specific tables, list the table names after the options:
$ ./bin/hbase hbck -repair TableFoo TableBar
The -fixMetaOnly option can try to fix meta assignments.
$ ./bin/hbase hbck -fixMetaOnly -fixAssignments
The -fixVersionFile option fabricates a new HBase version file. This assumes that the version of hbck you are running is the appropriate version for the HBase cluster.
$ ./bin/hbase org.apache.hadoop.hbase.util.OfflineMetaRepair
NOTE: This tool is not as clever as uberhbck but can be used to bootstrap repairs that uberhbck can complete. If the tool succeeds you should be able to start hbase and run online repairs if necessary.
HBase includes a tool for testing that compression is set up properly. To run it, type ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest. This prints usage information on how to run the tool.
If the hbase.regionserver.codecs configuration is set to lzo,gz and lzo is not present or is improperly installed, the RegionServer will report a configuration error at startup.
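Unfortunately, HBase is Apache-licensed while LZO is GPL-licensed. HBase therefore cannot ship with LZO, so LZO has to be installed separately from HBase. See Using LZO Compression for how to make LZO work with HBase.
See Section C.2, 「 hbase.regionserver.codecs 」 for a feature to help protect against a failed LZO install.
GZIP generally compresses better than LZO but is slower. For some setups, better compression may be the priority. Java will use Java's built-in GZIP unless the native Hadoop libraries are available on the CLASSPATH, in which case the native compressors are preferable. (If the native libraries are not present, you will see lots of Got brand-new compressor reports in your logs. See Q: )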
If snappy is installed, HBase can make use of it (courtesy of hadoop-snappy [29]).
Build and install snappy on all nodes of your cluster.
Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster:
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy
Create a column family with snappy compression and verify it in the hbase shell:
$ hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
hbase> describe 't1'
In the output of the "describe" command, you need to ensure it lists "COMPRESSION => 'SNAPPY'"
A frequent question on the dist-list is how to change compression schemes for ColumnFamilies. This is actually quite simple, and can be done via an alter command. Because the compression scheme is encoded at the block-level in StoreFiles, the table does not need to be re-created and the data does not need to be copied somewhere else. Just make sure the old codec is still available until you are sure that all of the old StoreFiles have been compacted.
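TODO: YCSB does not do a good job of putting a decent load on a cluster.
TODO: How to set it up for HBase.
Ted Dunning redid YCSB so that it is built with maven and added a facility for verifying workloads. See Ted Dunning's YCSB.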
Note: this feature was introduced in HBase 0.92
We found it necessary to revise the HFile format after encountering high memory usage and slow startup times caused by large Bloom filters and block indexes in the region server. Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set of regions. A region is not considered opened until all of its block index data is loaded. Large Bloom filters produce a different performance problem: the first get request that requires a Bloom filter lookup will incur the latency of loading the entire Bloom filter bit array.
To speed up region server startup we break Bloom filters and block indexes into multiple blocks and write those blocks out as they fill up, which also reduces the HFile writer’s memory footprint. In the Bloom filter case, 「filling up a block」 means accumulating enough keys to efficiently utilize a fixed-size bit array, and in the block index case we accumulate an 「index block」 of the desired size. Bloom filter blocks and index blocks (we call these 「inline blocks」) become interspersed with data blocks, and as a side effect we can no longer rely on the difference between block offsets to determine data block length, as it was done in version 1.
HFile is a low-level file format by design, and it should not deal with application-specific details such as Bloom filters, which are handled at StoreFile level. Therefore, we call Bloom filter blocks in an HFile "inline" blocks. We also supply HFile with an interface to write those inline blocks.
Another format modification aimed at reducing the region server startup time is to use a contiguous 「load-on-open」 section that has to be loaded in memory at the time an HFile is being opened. Currently, as an HFile opens, there are separate seek operations to read the trailer, data/meta indexes, and file info. To read the Bloom filter, there are two more seek operations for its 「data」 and 「meta」 portions. In version 2, we seek once to read the trailer and seek again to read everything else we need to open the file from a contiguous block.
As we will be discussing the changes we are making to the HFile format, it is useful to give a short overview of the previous (HFile version 1) format. An HFile in the existing format is structured as follows: [30]
The block index in version 1 is very straightforward. For each entry, it contains:
Offset (long)
Uncompressed size (int)
Key (a serialized byte array written using Bytes.writeByteArray)
Key length as a variable-length integer (VInt)
Key bytes
The number of entries in the block index is stored in the fixed file trailer, and has to be passed in to the method that reads the block index. One of the limitations of the block index in version 1 is that it does not provide the compressed size of a block, which turns out to be necessary for decompression. Therefore, the HFile reader has to infer this compressed size from the offset difference between blocks. We fix this limitation in version 2, where we store on-disk block size instead of uncompressed size, and get uncompressed size from the block header.
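The offset-difference inference described above can be sketched as follows. This is only an illustration, not the HBase reader code: the class name, method name and sample offsets are hypothetical, and the sketch assumes that data blocks are stored back to back and that the offset of the first byte after the last data block is known.

```java
import java.util.Arrays;

// Hypothetical sketch: infer on-disk (compressed) block sizes from a
// version 1 style block index that records only block start offsets.
class V1BlockSizes {
    // offsets: start offset of each data block, in ascending order.
    // endOfData: offset of the first byte after the last data block (assumed known).
    static long[] inferCompressedSizes(long[] offsets, long endOfData) {
        long[] sizes = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            // A block ends where the next block starts; the last one ends at endOfData.
            long next = (i + 1 < offsets.length) ? offsets[i + 1] : endOfData;
            sizes[i] = next - offsets[i];
        }
        return sizes;
    }

    public static void main(String[] args) {
        long[] sizes = inferCompressedSizes(new long[] {0L, 4096L, 9000L}, 12000L);
        System.out.println(Arrays.toString(sizes)); // [4096, 4904, 3000]
    }
}
```

Once anything else is interleaved with the data blocks, the difference between two data block offsets no longer equals a block's size, which is why version 2 stores the on-disk size explicitly.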
The version of HBase introducing the above features reads both version 1 and 2 HFiles, but only writes version 2 HFiles. A version 2 HFile is structured as follows:
In version 2, every block in the data section contains the following fields:
8 bytes: Block type, a sequence of bytes equivalent to version 1's "magic records". Supported block types are:
DATA – data blocks
LEAF_INDEX – leaf-level index blocks in a multi-level block index
BLOOM_CHUNK – Bloom filter chunks
META – meta blocks (not used for Bloom filters in version 2 anymore)
INTERMEDIATE_INDEX – intermediate-level index blocks in a multi-level block index
ROOT_INDEX – root-level index blocks in a multi-level block index
FILE_INFO – the 「file info」 block, a small key-value map of metadata
BLOOM_META – a Bloom filter metadata block in the load-on-open section
TRAILER – a fixed-size file trailer. As opposed to the above, this is not an HFile v2 block but a fixed-size (for each HFile version) data structure
INDEX_V1 – this block type is only used for legacy HFile v1 blocks
Compressed size of the block's data, not including the header (int).
Can be used for skipping the current data block when scanning HFile data.
Uncompressed size of the block's data, not including the header (int)
This is equal to the compressed size if the compression algorithm is NONE.
File offset of the previous block of the same type (long)
Can be used for seeking to the previous data/index block
Compressed data (or uncompressed data if the compression algorithm is NONE).
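As a rough illustration of the field list above, the sketch below packs and unpacks the fixed part of a block header (8 + 4 + 4 + 8 = 24 bytes). It is not the HBase implementation. The class and method names are hypothetical, the 8-byte block type string is only a sample value, and big-endian integer encoding is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the version 2 block header fields listed above.
// Assumes big-endian integers (the default byte order of ByteBuffer).
class BlockHeaderSketch {
    static final int MAGIC_LENGTH = 8;
    static final int HEADER_SIZE = MAGIC_LENGTH + 4 + 4 + 8; // 24 bytes

    final String blockType;
    final int compressedSize;   // block data only, header not included
    final int uncompressedSize; // block data only, header not included
    final long prevBlockOffset; // previous block of the same type

    BlockHeaderSketch(String blockType, int compressedSize, int uncompressedSize, long prevBlockOffset) {
        this.blockType = blockType;
        this.compressedSize = compressedSize;
        this.uncompressedSize = uncompressedSize;
        this.prevBlockOffset = prevBlockOffset;
    }

    static byte[] encode(String blockType, int compressedSize, int uncompressedSize, long prevBlockOffset) {
        byte[] magic = blockType.getBytes(StandardCharsets.US_ASCII);
        if (magic.length != MAGIC_LENGTH) {
            throw new IllegalArgumentException("block type must be exactly 8 bytes");
        }
        return ByteBuffer.allocate(HEADER_SIZE)
                .put(magic)
                .putInt(compressedSize)
                .putInt(uncompressedSize)
                .putLong(prevBlockOffset)
                .array();
    }

    static BlockHeaderSketch decode(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header);
        byte[] magic = new byte[MAGIC_LENGTH];
        buf.get(magic);
        // Java evaluates arguments left to right, so the reads follow field order.
        return new BlockHeaderSketch(new String(magic, StandardCharsets.US_ASCII),
                buf.getInt(), buf.getInt(), buf.getLong());
    }

    // Skips the current block: header plus compressed data.
    long nextBlockOffset(long thisBlockOffset) {
        return thisBlockOffset + HEADER_SIZE + compressedSize;
    }

    public static void main(String[] args) {
        BlockHeaderSketch h = decode(encode("DATABLK*", 100, 250, -1L));
        System.out.println(h.blockType + " next block at " + h.nextBlockOffset(1000L));
    }
}
```

The nextBlockOffset helper shows how the compressed size can be used to skip a block while scanning, under the assumption that nothing else sits between one block's data and the next block's header.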
The above format of blocks is used in the following HFile sections:
Scanned block section. The section is named so because it contains all data blocks that need to be read when an HFile is scanned sequentially. Also contains leaf block index and Bloom chunk blocks.
Non-scanned block section. This section still contains unified-format v2 blocks but it does not have to be read when doing a sequential scan. This section contains 「meta」 blocks and intermediate-level index blocks.
We are supporting 「meta」 blocks in version 2 the same way they were supported in version 1, even though we do not store Bloom filter data in these blocks anymore.
There are three types of block indexes in HFile version 2, stored in two different formats (root and non-root):
Data index — version 2 multi-level block index, consisting of:
Version 2 root index, stored in the data block index section of the file
Optionally, version 2 intermediate levels, stored in the non-root format in the data index section of the file. Intermediate levels can only be present if leaf level blocks are present
Optionally, version 2 leaf levels, stored in the non-root format inline with data blocks
Meta index — version 2 root index format only, stored in the meta index section of the file
Bloom index — version 2 root index format only, stored in the 「load-on-open」 section as part of Bloom filter metadata.
This format applies to:
Root level of the version 2 data index
Entire meta and Bloom indexes in version 2, which are always single-level.
A version 2 root index block is a sequence of entries of the following format, similar to entries of a version 1 block index, but storing on-disk size instead of uncompressed size.
Offset (long)
This offset may point to a data block or to a deeper-level index block.
On-disk size (int)
Key (a serialized byte array stored using Bytes.writeByteArray)
Key length (VInt)
Key bytes
A single-level version 2 block index consists of just a single root index block. To read a root index block of version 2, one needs to know the number of entries. For the data index and the meta index the number of entries is stored in the trailer, and for the Bloom index it is stored in the compound Bloom filter metadata.
For a multi-level block index we also store the following fields in the root index block in the load-on-open section of the HFile, in addition to the data structure described above:
Middle leaf index block offset
Middle leaf block on-disk size (meaning the leaf index block containing the reference to the 「middle」 data block of the file)
The index of the mid-key (defined below) in the middle leaf-level block.
These additional fields are used to efficiently retrieve the mid-key of the HFile used in HFile splits, which we define as the first key of the block with a zero-based index of (n – 1) / 2, if the total number of blocks in the HFile is n. This definition is consistent with how the mid-key was determined in HFile version 1, and is reasonable in general, because blocks are likely to be the same size on average, but we don’t have any estimates on individual key/value pair sizes.
When writing a version 2 HFile, the total number of data blocks pointed to by every leaf-level index block is kept track of. When we finish writing and the total number of leaf-level blocks is determined, it is clear which leaf-level block contains the mid-key, and the fields listed above are computed. When reading the HFile and the mid-key is requested, we retrieve the middle leaf index block (potentially from the block cache) and get the mid-key value from the appropriate position inside that leaf block.
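The mid-key bookkeeping described above can be sketched like this. The names are hypothetical and the per-leaf block counts are made-up sample data; the sketch only shows the arithmetic of finding the block with zero-based index (n – 1) / 2 and the leaf index block that references it.

```java
// Hypothetical sketch of locating the "middle" data block of an HFile.
class MidKeyLocator {
    // Zero-based index of the middle data block among n data blocks.
    static long midBlockIndex(long n) {
        return (n - 1) / 2;
    }

    // blocksPerLeaf[i] is the number of data blocks referenced by leaf index block i.
    // Returns {leaf index block number, position of the middle block inside that leaf}.
    static long[] locate(long[] blocksPerLeaf) {
        long total = 0;
        for (long count : blocksPerLeaf) {
            total += count;
        }
        long remaining = midBlockIndex(total);
        for (int leaf = 0; leaf < blocksPerLeaf.length; leaf++) {
            if (remaining < blocksPerLeaf[leaf]) {
                return new long[] {leaf, remaining};
            }
            remaining -= blocksPerLeaf[leaf];
        }
        throw new IllegalArgumentException("no data blocks");
    }

    public static void main(String[] args) {
        // 11 data blocks in total: the middle block has index 5,
        // which is entry 1 of leaf index block 1.
        long[] pos = locate(new long[] {4, 4, 3});
        System.out.println("leaf " + pos[0] + ", entry " + pos[1]);
    }
}
```

Storing the leaf block's offset and on-disk size together with this in-leaf position is what lets a reader fetch the mid-key with a single leaf block read.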
This format applies to intermediate-level and leaf index blocks of a version 2 multi-level data block index. Every non-root index block is structured as follows.
numEntries: the number of entries (int).
entryOffsets: the 「secondary index」 of offsets of entries in the block, to facilitate a quick binary search on the key (numEntries + 1 int values). The last value is the total length of all entries in this index block. For example, in a non-root index block with entry sizes 60, 80, 50 the 「secondary index」 will contain the following int array: {0, 60, 140, 190}.
Entries. Each entry contains:
Offset of the block referenced by this entry in the file (long)
On-disk size of the referenced block (int)
Key. The length can be calculated from entryOffsets.
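The 「secondary index」 is simply a running sum of entry sizes. The sketch below (hypothetical names, not HBase code) rebuilds the example array from the text and shows how a key length can be derived from two adjacent offsets, assuming each entry consists of an 8-byte block offset, a 4-byte on-disk size and the key bytes.

```java
import java.util.Arrays;

// Hypothetical sketch of the entryOffsets "secondary index" in a non-root index block.
class EntryOffsets {
    // Bytes in every entry before the key: block offset (long) + on-disk size (int).
    static final int FIXED_PART = 8 + 4;

    // Builds numEntries + 1 offsets; the last value is the total length of all entries.
    static int[] build(int[] entrySizes) {
        int[] offsets = new int[entrySizes.length + 1];
        for (int i = 0; i < entrySizes.length; i++) {
            offsets[i + 1] = offsets[i] + entrySizes[i];
        }
        return offsets;
    }

    // Key length of entry i, derived from the neighbouring offsets.
    static int keyLength(int[] entryOffsets, int i) {
        return entryOffsets[i + 1] - entryOffsets[i] - FIXED_PART;
    }

    public static void main(String[] args) {
        int[] offsets = build(new int[] {60, 80, 50});
        System.out.println(Arrays.toString(offsets)); // [0, 60, 140, 190]
        System.out.println(keyLength(offsets, 1));    // 68
    }
}
```

Because every offset is available up front, a binary search can jump straight to entry i without parsing the entries before it.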
In contrast with version 1, in a version 2 HFile Bloom filter metadata is stored in the load-on-open section of the HFile for quick startup.
A compound Bloom filter.
Bloom filter version = 3 (int). There used to be a DynamicByteBloomFilter class that had the Bloom filter version number 2
The total byte size of all compound Bloom filter chunks (long)
Number of hash functions (int)
Type of hash functions (int)
The total key count inserted into the Bloom filter (long)
The maximum total number of keys in the Bloom filter (long)
The number of chunks (int)
Comparator class used for Bloom filter keys, a UTF-8 encoded string stored using Bytes.writeByteArray
Bloom block index in the version 2 root block index format
The file info block is a serialized HBaseMapWritable (essentially a map from byte arrays to byte arrays) with the following keys, among others. StoreFile-level logic adds more keys to this.
hfile.LASTKEY | The last key of the file (byte array)
hfile.AVG_KEY_LEN | The average key length in the file (int)
hfile.AVG_VALUE_LEN | The average value length in the file (int)
File info format did not change in version 2. However, we moved the file info to the final section of the file, which can be loaded as one block at the time the HFile is being opened. Also, we do not store comparator in the version 2 file info anymore. Instead, we store it in the fixed file trailer. This is because we need to know the comparator at the time of parsing the load-on-open section of the HFile.
The following table shows common and different fields between fixed file trailers in versions 1 and 2. Note that the size of the trailer is different depending on the version, so it is 「fixed」 only within one version. However, the version is always stored as the last four-byte integer in the file.
Version 1 | Version 2
File info offset (long) | File info offset (long)
Data index offset (long) | loadOnOpenOffset (long) The offset of the section that we need to load when opening the file.
Number of data index entries (int) | Number of data index entries (int)
metaIndexOffset (long) This field is not being used by the version 1 reader, so we removed it from version 2. | uncompressedDataIndexSize (long) The total uncompressed size of the whole data block index, including root-level, intermediate-level, and leaf-level blocks.
Number of meta index entries (int) | Number of meta index entries (int)
Total uncompressed bytes (long) | Total uncompressed bytes (long)
numEntries (int) | numEntries (long)
Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int) | Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int)
(not present) | The number of levels in the data block index (int)
(not present) | firstDataBlockOffset (long) The offset of the first data block. Used when scanning.
(not present) | lastDataBlockEnd (long) The offset of the first byte after the last key/value data block. We don't need to go beyond this offset when scanning.
Version: 1 (int) | Version: 2 (int)
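Since the version is the last four-byte integer of the file, a reader can pick the trailer layout before parsing anything else. The sketch below is a hypothetical illustration working on an in-memory byte array. It assumes a big-endian integer; a real reader would seek to the end of the file instead of loading it.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: read the format version from the last four bytes of a file image.
class TrailerVersionSketch {
    static int readVersion(byte[] fileBytes) {
        if (fileBytes.length < 4) {
            throw new IllegalArgumentException("file too short to contain a version");
        }
        // Assumes big-endian encoding (the default byte order of ByteBuffer).
        return ByteBuffer.wrap(fileBytes, fileBytes.length - 4, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] fakeFile = {9, 9, 9, 0, 0, 0, 2}; // arbitrary content followed by version 2
        System.out.println("HFile version " + readVersion(fakeFile));
    }
}
```

With the version known, the reader knows how large the fixed trailer is for that version and can parse the remaining trailer fields accordingly.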
Introduction to HBase
Building Real Time Services at Facebook with HBase by Jonathan Gray (Hadoop World 2011).
HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon by JD Cryans (Hadoop World 2010).
Advanced HBase Schema Design by Lars George (Hadoop World 2011).
Introduction to HBase by Todd Lipcon (Chicago Data Summit 2011).
Getting The Most From Your HBase Install by Ryan Rawson, Jonathan Gray (Hadoop World 2009).
BigTable by Google (2006).
HBase and HDFS Locality by Lars George (2010).
No Relation: The Mixed Blessings of Non-Relational Databases by Ian Varley (2009).
Cloudera's HBase Blog has a lot of links to useful HBase information.
HBase Wiki has a page with a number of presentations.
HBase: The Definitive Guide by Lars George.
Hadoop: The Definitive Guide by Tom White.
HBase is a project in the Apache Software Foundation and as such there are responsibilities to the ASF to ensure a healthy project.
See the Apache Development Process page for all sorts of information on how the ASF is structured (e.g., PMC, committers, contributors), tips on contributing and getting involved, and how open source works at the ASF.
Once a quarter, each project in the ASF portfolio submits a report to the ASF board. This is done by the HBase project lead and the committers. See ASF board reporting for more information.
HBASE-6449 added support for tracing requests through HBase, using the open source tracing library, HTrace. Setting up tracing is quite simple, however it currently requires some very minor changes to your client code (it would not be very difficult to remove this requirement).
The tracing system works by collecting information in structs called ‘Spans’. It is up to you to choose how you want to receive this information by implementing the SpanReceiver interface, which defines one method:
public void receiveSpan(Span span);
This method serves as a callback whenever a span is completed. HTrace allows you to use as many SpanReceivers as you want so you can easily send trace information to multiple destinations.
Configure what SpanReceivers you’d like to use by putting a comma separated list of the fully-qualified class name of classes implementing SpanReceiver in hbase-site.xml property: hbase.trace.spanreceiver.classes.
HBase includes a HBaseLocalFileSpanReceiver that writes all span information to local files in a JSON-based format. The HBaseLocalFileSpanReceiver looks in hbase-site.xml for a hbase.trace.spanreceiver.localfilespanreceiver.filename property with a value describing the name of the file to which nodes should write their span information.
If you do not want to use the included HBaseLocalFileSpanReceiver, you are encouraged to write your own receiver (take a look at HBaseLocalFileSpanReceiver for an example). If you think others would benefit from your receiver, file a JIRA or send a pull request to HTrace.
Currently, you must turn on tracing in your client code. To do this, you simply turn on tracing for requests you think are interesting, and turn it off when the request is done.
For example, if you wanted to trace all of your get operations, you change this:
HTable table = new HTable(...);
Get get = new Get(...);
into:
Span getSpan = Trace.startSpan("doing get", Sampler.ALWAYS);
try {
  HTable table = new HTable(...);
  Get get = new Get(...);
  ...
} finally {
  getSpan.stop();
}
If you wanted to trace half of your ‘get’ operations, you would pass in:
new ProbabilitySampler(0.5)
in lieu of Sampler.ALWAYS to Trace.startSpan(). See the HTrace README for more information on Samplers.
For what RPC is like in 0.94 and previous, see Benoît/Tsuna’s Unofficial Hadoop / HBase RPC protocol documentation. For more background on how we arrived at this spec., see HBase RPC: WIP
A wire-format we can evolve
A format that does not require our rewriting server core or radically changing its current architecture (for later).
List of problems with currently specified format and where we would like to go in a version2, etc. For example, what would we have to change if anything to move server async or to support streaming/chunking?
Diagram on how it works
A grammar that succinctly describes the wire-format. Currently we have these words and the content of the rpc protobuf idl, but a grammar for the back and forth would help with grokking rpc. Also, a little state machine on client/server interactions would help with understanding (and ensuring correct implementation).
The client will send setup information on connection establish. Thereafter, the client invokes methods against the remote server sending a protobuf Message and receiving a protobuf Message in response. Communication is synchronous. All back and forth is preceded by an int that has the total length of the request/response. Optionally, Cells(KeyValues) can be passed outside of protobufs in follow-behind Cell blocks (because we can’t protobuf megabytes of KeyValues or Cells). These CellBlocks are encoded and optionally compressed.
For more detail on the protobufs involved, see the RPC.proto file in trunk.
Client initiates connection.
On connection setup, client sends a preamble followed by a connection header.
<MAGIC 4 byte integer> <1 byte RPC Format Version> <1 byte auth type>[36]
E.g.: HBas0x000x80 is 4 bytes of MAGIC ('HBas'), plus one byte of version (0 in this case), and one byte of auth type, 0x80 (SIMPLE).
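The six preamble bytes from the example can be assembled as in the sketch below. The class and method names are hypothetical; only the byte values come from the text above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: build the connection preamble <MAGIC><version><auth type>.
class PreambleSketch {
    static byte[] build(byte rpcVersion, byte authType) {
        byte[] magic = "HBas".getBytes(StandardCharsets.US_ASCII); // 4 bytes of MAGIC
        byte[] preamble = new byte[magic.length + 2];
        System.arraycopy(magic, 0, preamble, 0, magic.length);
        preamble[4] = rpcVersion; // 1 byte RPC format version
        preamble[5] = authType;   // 1 byte auth type
        return preamble;
    }

    public static void main(String[] args) {
        byte[] p = build((byte) 0, (byte) 0x80); // version 0, auth type 0x80 (SIMPLE)
        StringBuilder hex = new StringBuilder();
        for (byte b : p) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // 48 42 61 73 00 80
    }
}
```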
Has user info, and 「protocol」, as well as the encoders and compression the client will use sending CellBlocks. CellBlock encoders and compressors are for the life of the connection. CellBlock encoders implement org.apache.hadoop.hbase.codec.Codec. CellBlocks may then also be compressed. Compressors implement org.apache.hadoop.io.compress.CompressionCodec. This protobuf is written using writeDelimited so is prefaced by a pb varint with its serialized length
After the client sends the preamble and connection header, the server does NOT respond if connection setup is successful. No response means the server is READY to accept requests and to give out responses. If the version or authentication in the preamble is not agreeable, or the server has trouble parsing the preamble, it will throw a org.apache.hadoop.hbase.ipc.FatalConnectionException explaining the error and will then disconnect. If the client in the connection header (i.e. the protobuf’d Message that comes after the connection preamble) asks for a Service the server does not support or a codec the server does not have, again we throw a FatalConnectionException with an explanation.
After a Connection has been set up, client makes requests. Server responds.
A request is made up of a protobuf RequestHeader followed by a protobuf Message parameter. The header includes the method name and, optionally, metadata on the optional CellBlock that may be following. The parameter type suits the method being invoked: i.e. if we are doing a getRegionInfo request, the protobuf Message param will be an instance of GetRegionInfoRequest. The response will be a GetRegionInfoResponse. The CellBlock is optionally used to ferry the bulk of the RPC data, i.e. Cells/KeyValues.
The request is prefaced by an int that holds the total length of what follows.
Will have call.id, trace.id, and method name, etc. including optional Metadata on the Cell block IFF one is following. Data is protobuf’d inline in this pb Message or optionally comes in the following CellBlock
If the method being invoked is getRegionInfo, if you study the Service descriptor for the client to regionserver protocol, you will find that the request sends a GetRegionInfoRequest protobuf Message param in this position.
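The length-prefixed layout of a request can be sketched as below. This is not the HBase client code: the names are hypothetical, the header, param and CellBlock are treated as opaque, already serialized byte arrays, and a big-endian four-byte length is assumed. How each protobuf part is itself delimited is outside the scope of the sketch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: <int total length><header bytes><param bytes>[<CellBlock bytes>].
class RequestFrameSketch {
    // cellBlock may be null when no CellBlock follows the param.
    static byte[] frame(byte[] header, byte[] param, byte[] cellBlock) {
        int total = header.length + param.length + (cellBlock == null ? 0 : cellBlock.length);
        ByteBuffer buf = ByteBuffer.allocate(4 + total);
        buf.putInt(total); // total length of what follows, not counting these 4 bytes
        buf.put(header);
        buf.put(param);
        if (cellBlock != null) {
            buf.put(cellBlock);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] framed = frame(new byte[3], new byte[5], null);
        System.out.println("frame is " + framed.length + " bytes, length prefix is "
                + ByteBuffer.wrap(framed).getInt());
    }
}
```

A receiver can read the four-byte prefix first and then knows exactly how many more bytes belong to the request, which fits the synchronous, non-async server described in this section.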
Same as Request, it is a protobuf ResponseHeader followed by a protobuf Message response where the Message response type suits the method invoked. Bulk of the data may come in a following CellBlock.
The response is prefaced by an int that holds the total length of what follows.
Will have call.id, etc. Will include the exception if processing failed. Optionally includes metadata on the optional CellBlock, IFF there is one following.
Return or may be nothing if exception. If the method being invoked is getRegionInfo, if you study the Service descriptor for the client to regionserver protocol, you will find that the response sends a GetRegionInfoResponse protobuf Message param in this position.
There are two distinct types. There is the request failed which is encapsulated inside the response header for the response. The connection stays open to receive new requests. The second type, the FatalConnectionException, kills the connection.
Exceptions can carry extra information. See the ExceptionResponse protobuf type. It has a flag to indicate do-no-retry as well as other miscellaneous payload to help improve client responsiveness.
In some part, the current wire-format (i.e. all requests and responses preceded by a length) has been dictated by the current non-async server architecture.
We went with pb header followed by pb param making a request and a pb header followed by pb response for now. Doing header+param rather than a single protobuf Message with both header and param content:
Is closer to what we currently have
Having a single fat pb requires extra copying putting the already pb’d param into the body of the fat request pb (and same making result)
We can decide whether to accept the request or not before we read the param; for example, the request might be low priority. As is, we read header+param in one go as server is currently implemented so this is a TODO.
The advantages are minor. If later, fat request has clear advantage, can roll out a v2 later.