HBase Data Model

Date: 2019-11-08


HBase Data Model (translation)

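In HBase, data is stored in tables that have rows and columns. The terminology overlaps with relational databases, but the analogy is not a helpful one. Instead, an HBase table can be thought of as a multi-dimensional map.

HBase Data Model Terminology

Table

An HBase table consists of multiple rows.

Row

A row in HBase consists of a row key and one or more columns with values associated with them. Rows are stored in the table sorted alphabetically by row key, which makes row key design very important. The goal is to store related rows near each other. A common row key pattern is a website domain. If your row keys are domains, you should probably store them reversed (org.apache.www, org.apache.mail, org.apache.jira). That way, all Apache domains sit near each other in the table rather than being scattered according to the first letter of the subdomain.

Column

A column in HBase consists of a column family and a column qualifier, separated by a colon (:).

Column Family

A column family physically colocates a set of columns and their values, often for performance reasons. Each column family has a set of storage properties, such as whether its values should be cached in memory, how its data is compressed, or how its row keys are encoded. Every row in a table has the same column families, though a given row might not store anything in a given column family.

Column Qualifier

A column qualifier is added to a column family to provide the index for a given piece of data. Given a column family content, one column qualifier might be content:html and another might be content:pdf. Column families are fixed at table creation, but column qualifiers are mutable and may differ greatly between rows.

Cell

A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp that represents the value's version.

Timestamp

A timestamp is written alongside each value and identifies a given version of that value. By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp when you write the data.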

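19. Conceptual View

For a very readable explanation of the HBase data model, see the blog post Understanding HBase and BigTable by Jim R. Wilson. Another good explanation is Introduction to Basic Schema Design by Amandeep Khurana.

Reading different perspectives may help you build a solid understanding of HBase schema design. The linked articles cover the same ground as this section.

The following example is a slightly modified form of the one on page 2 of the BigTable paper. A table called webtable contains two rows (com.cnn.www and com.example.www) and three column families (contents, anchor, and people). For the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html). The example contains 5 versions of the row with key com.cnn.www and one version of the row with key com.example.www. The contents:html column holds the entire HTML of a given website. Each qualifier of the anchor column family holds an external site that links to the site represented by the row, along with the text used in the anchor of its link. The people column family represents people associated with the site.

	Column Names: By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is made up of the column family contents and the qualifier html. The colon character (:) separates the column family from the column qualifier.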
Table 4. Table webtable
Row Key		Time Stamp	ColumnFamily contents	ColumnFamily anchor	ColumnFamily people
"com.cnn.www"		t9		anchor:cnnsi.com = "CNN"
"com.cnn.www"		t8		anchor:my.look.ca = "CNN.com"
"com.cnn.www"		t6	contents:html = "<html>…"
"com.cnn.www"		t5	contents:html = "<html>…"
"com.cnn.www"		t3	contents:html = "<html>…"
"com.example.www"		t5	contents:html = "<html>…"		people:author = "John Doe"

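Cells in this table that appear empty take no space in HBase; in fact, they do not exist. This is what makes HBase "sparse". A tabular view is not the only way to look at data in HBase, nor even the most accurate. The following represents the same information as a multi-dimensional map. It is only a mock-up for illustrative purposes and may not be strictly accurate.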

{
  "com.cnn.www": {
    contents: {
      t6: contents:html: "<html>..."
      t5: contents:html: "<html>..."
      t3: contents:html: "<html>..."
    }
    anchor: {
      t9: anchor:cnnsi.com = "CNN"
      t8: anchor:my.look.ca = "CNN.com"
    }
    people: {}
  }
  "com.example.www": {
    contents: {
      t5: contents:html: "<html>..."
    }
    anchor: {}
    people: {
      t5: people:author: "John Doe"
    }
  }
}

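20. Physical View

Although at a conceptual level a table may look like a sparse set of rows, it is physically stored by column family. A new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.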

Table 5. ColumnFamily anchor
Row Key	Time Stamp	Column Family anchor
"com.cnn.www"	t9	anchor:cnnsi.com = "CNN"
"com.cnn.www"	t8	anchor:my.look.ca = "CNN.com"

Table 6. ColumnFamily contents
Row Key	Time Stamp	ColumnFamily contents:
"com.cnn.www"	t6	contents:html = "<html>…"
"com.cnn.www"	t5	contents:html = "<html>…"
"com.cnn.www"	t3	contents:html = "<html>…"

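The empty cells shown in the conceptual view are not stored at all. A request for the value of contents:html at timestamp t8 therefore returns no value, and neither does a request for anchor:my.look.ca at timestamp t9. If no timestamp is supplied, however, the most recent value of the column is returned. Given multiple versions, the most recent is also the first one found, because timestamps are stored in descending order. So a request for the values of all columns in row com.cnn.www with no timestamp specified returns the value of contents:html from timestamp t6, the value of anchor:cnnsi.com from timestamp t9, and the value of anchor:my.look.ca from timestamp t8.

For more information about the internals of how Apache HBase stores data, see regions.arch.

21. Namespace

A namespace is a logical grouping of tables, analogous to a database in a relational database system. This abstraction lays the groundwork for upcoming multi-tenancy features: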

- Quota Management (HBASE-8410) - Restrict the amount of resources (i.e. regions, tables) a namespace can consume.
- Namespace Security Administration (HBASE-9206) - Provide another level of security administration for tenants.
- Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset of RegionServers thus guaranteeing a coarse level of isolation.

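21.1. Namespace Management

A namespace can be created, removed, or altered. Namespace membership is determined at table creation by specifying a fully qualified table name of the form <table namespace>:<table qualifier>: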

Example 11. Examples

#Create a namespace

create_namespace 'my_ns'

#create my_table in my_ns namespace

create 'my_ns:my_table', 'fam'

#drop namespace

drop_namespace 'my_ns'

#alter namespace

alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}

21.2. Predefined Namespaces

There are two predefined special namespaces:

- hbase - system namespace, used to contain HBase internal tables
- default - tables with no explicitly specified namespace automatically fall into this namespace

Example 12. Examples

#namespace=foo and table qualifier=bar

create 'foo:bar', 'fam'

#namespace=default and table qualifier=bar

create 'bar', 'fam'

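22. Table

Tables are declared up front at schema definition time.

23. Row

Row keys are uninterpreted bytes. Rows are sorted lexicographically, with the lowest order appearing first in a table. The empty byte array denotes both the start and the end of a table's namespace.

24. Column Family

Columns in Apache HBase are grouped into column families. All column members of a column family share the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) separates the column family from the column qualifier. The column family prefix must be composed of printable characters. The qualifier can be made of arbitrary bytes. Column families must be declared up front at schema definition time, whereas columns need not be defined at schema time and can be added on the fly while the table is up and running.

Physically, all members of a column family are stored together on the filesystem. Because tuning and storage specifications are applied at the column family level, it is advised that all members of a column family have the same general access pattern and size characteristics.

25. Cells

A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.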

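26. Data Model Operations

The four primary data model operations are Get, Put, Scan, and Delete. Operations are applied via Table instances.

26.1. Get

Get returns attributes for a specified row. Gets are executed via Table.get.

26.2. Put

Put either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists). Puts are executed via Table.put (writeBuffer) or Table.batch (non-writeBuffer).

26.3. Scans

Scans allow iteration over multiple rows for specified attributes.

The following is an example of a Scan on a Table instance. Assume that a table is populated with rows with keys "row1", "row2", and "row3", and then another set of rows with keys "abc1", "abc2", and "abc3". The following example shows how to set up a Scan instance to return the rows beginning with "row".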

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Table table = ... // instantiate a Table instance

Scan scan = new Scan();

scan.addColumn(CF, ATTR);

scan.setRowPrefixFilter(Bytes.toBytes("row"));

ResultScanner rs = table.getScanner(scan);

try {

for (Result r = rs.next(); r != null; r = rs.next()) {

// process result...

}

} finally {

rs.close(); // always close the ResultScanner!

}

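Note that generally the easiest way to specify a specific stop point for a scan is to use the InclusiveStopFilter class.

26.4. Delete

Delete removes a row from a table. Deletes are executed via Table.delete.

HBase does not modify data in place, so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compactions.

See version.delete for more information on deleting versions of columns, and see compaction for more information on compactions.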

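27. Versions

A {row, column, version} tuple exactly specifies a cell in HBase. It is possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

While row and column keys are expressed as bytes, the version is specified as a long integer. Typically this long holds a time instant such as the one returned by java.util.Date.getTime() or System.currentTimeMillis(): the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC.

The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.

There is a lot of confusion over the semantics of cell versions in HBase. In particular:

- If multiple writes to a cell have the same version, only the last one written is fetchable.
- It is OK to write cells in non-increasing version order.

Below we describe how the version dimension in HBase currently works. See HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of that article by Bruno Dumon.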

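27.1. Specifying the Number of Versions to Store

The maximum number of versions to store for a given column is part of the column schema. It is specified at table creation or via an alter command, and the default is given by HColumnDescriptor.DEFAULT_VERSIONS. Prior to HBase 0.96, the default number of versions kept was 3; in 0.96 and newer it is 1.

Example 13. Modify the Maximum Number of Versions for a Column Family

This example uses HBase Shell to keep a maximum of 5 versions of all columns in column family f1. You could also use HColumnDescriptor.

hbase> alter 't1', NAME => 'f1', VERSIONS => 5

Example 14. Modify the Minimum Number of Versions for a Column Family

You can also specify the minimum number of versions to store per column family. By default this is set to 0, which means the feature is disabled. The following example sets the minimum number of versions on all columns in column family f1 to 2, via HBase Shell. You could also use HColumnDescriptor.

hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2

Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly created columns by setting hbase.column.max.version in hbase-site.xml.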

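27.2. Versions and HBase Operations

In this section we look at the behavior of the version dimension for each of the core HBase operations.

27.2.1. Get/Scan

Gets are implemented on top of Scans. The discussion of Get below applies equally to Scans.

By default, that is, if you specify no explicit version, a get returns the cell whose version has the largest value (which may or may not be the latest one written). The default behavior can be modified in the following ways:

- to return more than one version, see Get.setMaxVersions()
- to return versions other than the latest, see Get.setTimeRange()

To retrieve the latest version that is less than or equal to a given value, which gives the 'latest' state of the record at a certain point in time, use a range from 0 to the desired version and set the max versions to 1.

27.2.2. Default Get Example

The following Get retrieves only the current version of the row.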

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Get get = new Get(Bytes.toBytes("row1"));

Result r = table.get(get);

byte[] b = r.getValue(CF, ATTR); // returns current version of value

27.2.3. Versioned Get Example

The following Get returns the last 3 versions of the row.

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Get get = new Get(Bytes.toBytes("row1"));

get.setMaxVersions(3); // will return last 3 versions of row

Result r = table.get(get);

byte[] b = r.getValue(CF, ATTR); // returns current version of value

List<KeyValue> kv = r.getColumn(CF, ATTR); // returns all versions of this column

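27.2.4. Put

Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis, but you can specify the version (the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.

To overwrite an existing value, do a put at exactly the same row, column, and version as the cell you want to overwrite.

Implicit Version Example

The following Put is implicitly versioned by HBase with the current time.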

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Put put = new Put(Bytes.toBytes(row));

put.add(CF, ATTR, Bytes.toBytes( data));

table.put(put);

Explicit Version Example

The following Put has the version timestamp explicitly set.

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Put put = new Put( Bytes.toBytes(row));

long explicitTimeInMs = 555; // just an example

put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));

table.put(put);

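Caution: the version timestamp is used internally by HBase for things like time-to-live calculations. It is usually best to avoid setting this timestamp yourself. Prefer using a separate timestamp attribute of the row, or making the timestamp part of the row key, or both.

27.2.5. Delete

There are three different types of internal delete markers. See Lars Hofhansl's blog for discussion of his attempt adding another, Scanning in HBase: Prefix Delete Marker.

- Delete: for a specific version of a column.
- Delete column: for all versions of a column.
- Delete family: for all columns of a particular ColumnFamily.

When deleting an entire row, HBase internally creates a tombstone for each ColumnFamily (that is, not for each individual column).

Deletes work by creating tombstone markers. For example, suppose we want to delete a row. For this you can specify a version, or else the currentTimeMillis is used by default. This means: delete all cells whose version is less than or equal to this version. HBase never modifies data in place, so a delete will not immediately delete (or mark as deleted) the entries in the storage file that match the delete condition. Rather, a so-called tombstone is written, which masks the deleted values. When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves. If the version you specified when deleting a row is larger than the version of any value in the row, you can consider the complete row to be deleted.

For an informative discussion on how deletes and versioning interact, see the thread Put w/timestamp → Deleteall → Put w/ timestamp fails on the user mailing list.

Also see keyvalue for more information on the internal KeyValue format.

Delete markers are purged during the next major compaction of the store, unless the KEEP_DELETED_CELLS option is set in the column family (see Keeping Deleted Cells). To keep the deletes for a configurable amount of time, you can set the delete TTL via the hbase.hstore.time.to.purge.deletes property in hbase-site.xml. If hbase.hstore.time.to.purge.deletes is not set, or is set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction. Otherwise, a delete marker with a timestamp in the future is kept until the major compaction that occurs after the time represented by the marker's timestamp plus the value of hbase.hstore.time.to.purge.deletes, in milliseconds.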

This behavior represents a fix for an unexpected change that was introduced in HBase 0.94, and was fixed in HBASE-10118. The change has been backported to HBase 0.94 and newer branches.

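27.3. Current Limitations

27.3.1. Deletes mask Puts

Deletes mask puts, even puts that happened after the delete was entered. See HBASE-2256. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you delete everything <= T. After this you do a new put with a timestamp <= T. This put, even though it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and a put immediately after each other, and there is some chance they happen within the same millisecond.

27.3.2. Major compactions change query results

Create three cell versions at t1, t2, and t3, with a maximum-versions setting of 2. When getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will no longer occur. (See Garbage Collection in Bending time in HBase.)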

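28. Sort Order

All data model operations in HBase return data in sorted order: first by row, then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted in reverse, so the newest records are returned first).

29. Column Metadata

There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily. Thus, while HBase can support not only a large number of columns per row but also a heterogeneous set of columns between rows, it is your responsibility to keep track of the column names.

The only way to get the complete set of columns that exist for a ColumnFamily is to process all the rows. For more information about how HBase stores data internally, see keyvalue.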

30. Joins

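Whether HBase supports joins is a common question, and there is a simple answer: it doesn't, at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins). As this chapter has illustrated, the read data model operations in HBase are Get and Scan.

However, that doesn't mean equivalent join functionality can't be supported in your application, but you have to do it yourself. The two primary strategies are either denormalizing the data upon writing to HBase, or keeping lookup tables and doing the join between HBase tables in your application or MapReduce code (and as RDBMSs demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs. hash-joins). So which is the best approach? It depends on what you are trying to do, and no single answer works for every use case.

31. ACID

See ACID Semantics. Lars Hofhansl has also written a note on ACID in HBase.

The original text follows.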

Data Model

In HBase, data is stored in tables, which have rows and columns. This is a terminology overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.

HBase Data Model Terminology

Table

An HBase table consists of multiple rows.

Row

A row in HBase consists of a row key and one or more columns with values associated with them. Rows are sorted alphabetically by the row key as they are stored. For this reason, the design of the row key is very important. The goal is to store data in such a way that related rows are near each other. A common row key pattern is a website domain. If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain.
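As a minimal sketch of the reversed-domain pattern described above, the following plain-Java helper reverses the dot-separated labels of a host name. The class and method names are illustrative assumptions, not part of HBase, and a TreeSet merely stands in for the sorted order of row keys.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

final class ReversedDomainKeys {
    /** Reverses the dot-separated labels: www.apache.org -> org.apache.www. */
    static String reverse(String domain) {
        List<String> labels = Arrays.asList(domain.split("\\."));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    public static void main(String[] args) {
        // A TreeSet stands in for the sorted row keys of a table (illustration only).
        TreeSet<String> rowKeys = new TreeSet<>();
        String[] domains = {"www.apache.org", "mail.apache.org", "jira.apache.org", "www.cnn.com"};
        for (String d : domains) {
            rowKeys.add(reverse(d));
        }
        // Prints [com.cnn.www, org.apache.jira, org.apache.mail, org.apache.www]
        System.out.println(rowKeys);
    }
}
```

With the reversed keys, the three Apache hosts end up adjacent in sorted order, whereas the unreversed names would be separated by www.cnn.com.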

Column

A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.

Column Family

Column families physically colocate a set of columns and their values, often for performance reasons. Each column family has a set of storage properties, such as whether its values should be cached in memory, how its data is compressed or its row keys are encoded, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family.

Column Qualifier

A column qualifier is added to a column family to provide the index for a given piece of data. Given a column family content, a column qualifier might be content:html, and another might be content:pdf. Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows.

Cell

A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value’s version.

Timestamp

A timestamp is written alongside each value, and is the identifier for a given version of a value. By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell.

19. Conceptual View

You can read a very understandable explanation of the HBase data model in the blog post Understanding HBase and BigTable by Jim R. Wilson. Another good explanation is available in the PDF Introduction to Basic Schema Design by Amandeep Khurana.

It may help to read different perspectives to get a solid understanding of HBase schema design. The linked articles cover the same ground as the information in this section.

The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people. In this example, for the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html). This example contains 5 versions of the row with the row key com.cnn.www, and one version of the row with the row key com.example.www. The contents:html column qualifier contains the entire HTML of a given website. Qualifiers of the anchor column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link. The people column family represents people associated with the site.

	Column Names: By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is made up of the column family contents and the html qualifier. The colon character (:) delimits the column family from the column qualifier.
Table 4. Table webtable
Row Key		Time Stamp	ColumnFamily contents	ColumnFamily anchor	ColumnFamily people
"com.cnn.www"		t9		anchor:cnnsi.com = "CNN"
"com.cnn.www"		t8		anchor:my.look.ca = "CNN.com"
"com.cnn.www"		t6	contents:html = "<html>…"
"com.cnn.www"		t5	contents:html = "<html>…"
"com.cnn.www"		t3	contents:html = "<html>…"
"com.example.www"		t5	contents:html = "<html>…"		people:author = "John Doe"

Cells in this table that appear to be empty do not take space, or in fact exist, in HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to look at data in HBase, or even the most accurate. The following represents the same information as a multi-dimensional map. This is only a mock-up for illustrative purposes and may not be strictly accurate.

{
  "com.cnn.www": {
    contents: {
      t6: contents:html: "<html>..."
      t5: contents:html: "<html>..."
      t3: contents:html: "<html>..."
    }
    anchor: {
      t9: anchor:cnnsi.com = "CNN"
      t8: anchor:my.look.ca = "CNN.com"
    }
    people: {}
  }
  "com.example.www": {
    contents: {
      t5: contents:html: "<html>..."
    }
    anchor: {}
    people: {
      t5: people:author: "John Doe"
    }
  }
}

20. Physical View

Although at a conceptual level tables may be viewed as a sparse set of rows, they are physically stored by column family. A new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.

Table 5. ColumnFamily anchor
Row Key	Time Stamp	Column Family anchor
"com.cnn.www"	t9	anchor:cnnsi.com = "CNN"
"com.cnn.www"	t8	anchor:my.look.ca = "CNN.com"

Table 6. ColumnFamily contents
Row Key	Time Stamp	ColumnFamily contents:
"com.cnn.www"	t6	contents:html = "<html>…"
"com.cnn.www"	t5	contents:html = "<html>…"
"com.cnn.www"	t3	contents:html = "<html>…"

The empty cells shown in the conceptual view are not stored at all. Thus a request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column would be returned. Given multiple versions, the most recent is also the first one found, since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www if no timestamp is specified would be: the value of contents:html from timestamp t6, the value of anchor:cnnsi.com from timestamp t9, the value of anchor:my.look.ca from timestamp t8.
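The lookup rules in the paragraph above can be sketched with a sorted map kept in descending timestamp order. This is a toy model in plain Java, not HBase's storage code; the class name and the use of small integers for t3, t5, and t6 are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

final class VersionedColumn {
    // Versions are kept newest first, mirroring the descending timestamp order.
    private final NavigableMap<Long, String> versions =
            new TreeMap<>(Collections.<Long>reverseOrder());

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Value stored at exactly this timestamp, or null if no such cell exists. */
    String getAt(long timestamp) {
        return versions.get(timestamp);
    }

    /** Most recent value, or null if the column holds no cells. */
    String getLatest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        VersionedColumn contentsHtml = new VersionedColumn();
        contentsHtml.put(3L, "<html>v3");
        contentsHtml.put(5L, "<html>v5");
        contentsHtml.put(6L, "<html>v6");
        System.out.println(contentsHtml.getAt(8L));   // null: nothing is stored at t8
        System.out.println(contentsHtml.getLatest()); // <html>v6: newest version found first
    }
}
```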

For more information about the internals of how Apache HBase stores data, see regions.arch.

21. Namespace

A namespace is a logical grouping of tables analogous to a database in relational database systems. This abstraction lays the groundwork for upcoming multi-tenancy related features:

- Quota Management (HBASE-8410) - Restrict the amount of resources (i.e. regions, tables) a namespace can consume.
- Namespace Security Administration (HBASE-9206) - Provide another level of security administration for tenants.
- Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset of RegionServers thus guaranteeing a coarse level of isolation.

21.1. Namespace management

A namespace can be created, removed or altered. Namespace membership is determined during table creation by specifying a fully-qualified table name of the form <table namespace>:<table qualifier>:

Example 11. Examples

#Create a namespace

create_namespace 'my_ns'

#create my_table in my_ns namespace

create 'my_ns:my_table', 'fam'

#drop namespace

drop_namespace 'my_ns'

#alter namespace

alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}

21.2. Predefined namespaces

There are two predefined special namespaces:

- hbase - system namespace, used to contain HBase internal tables
- default - tables with no explicit specified namespace will automatically fall into this namespace

Example 12. Examples

#namespace=foo and table qualifier=bar

create 'foo:bar', 'fam'

#namespace=default and table qualifier=bar

create 'bar', 'fam'

22. Table

Tables are declared up front at schema definition time.

23. Row

Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest order appearing first in a table. The empty byte array is used to denote both the start and end of a table's namespace.
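Because row keys are compared as raw bytes rather than as numbers or localized strings, the resulting order can be surprising. The following sketch shows one way to write such a comparison in plain Java, treating each byte as unsigned; the class name is an assumption, and HBase ships its own byte utilities for this purpose.

```java
import java.nio.charset.StandardCharsets;

final class RowKeyOrder {
    /** Compares two keys byte by byte, treating each byte as an unsigned value. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // When one key is a prefix of the other, the shorter key sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] row10 = "row10".getBytes(StandardCharsets.UTF_8);
        byte[] row2 = "row2".getBytes(StandardCharsets.UTF_8);
        // "row10" sorts before "row2" because '1' (0x31) is less than '2' (0x32).
        System.out.println(compare(row10, row2) < 0); // true
    }
}
```

A practical consequence is that numeric parts of a row key usually need fixed-width, zero-padded encoding if numeric order matters.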

24. Column Family

Columns in Apache HBase are grouped into column families. All column members of a column family have the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.

25. Cells

A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.

26. Data Model Operations

The four primary data model operations are Get, Put, Scan, and Delete. Operations are applied via Table instances.

26.1. Get

Get returns attributes for a specified row. Gets are executed via Table.get.

26.2. Put

Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Puts are executed via Table.put (writeBuffer) or Table.batch (non-writeBuffer).

26.3. Scans

Scans allow iteration over multiple rows for specified attributes.

The following is an example of a Scan on a Table instance. Assume that a table is populated with rows with keys "row1", "row2", "row3", and then another set of rows with the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan instance to return the rows beginning with "row".

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Table table = ... // instantiate a Table instance

Scan scan = new Scan();

scan.addColumn(CF, ATTR);

scan.setRowPrefixFilter(Bytes.toBytes("row"));

ResultScanner rs = table.getScanner(scan);

try {

for (Result r = rs.next(); r != null; r = rs.next()) {

// process result...

}

} finally {

rs.close(); // always close the ResultScanner!

}

Note that generally the easiest way to specify a specific stop point for a scan is by using the InclusiveStopFilter class.

26.4. Delete

Delete removes a row from a table. Deletes are executed via Table.delete.

HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compactions.

See version.delete for more information on deleting versions of columns, and see compaction for more information on compactions.

27. Versions

A {row, column, version} tuple exactly specifies a cell in HBase. It’s possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

While rows and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC.

The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.

There is a lot of confusion over the semantics of cell versions, in HBase. In particular:

- If multiple writes to a cell have the same version, only the last written is fetchable.
- It is OK to write cells in a non-increasing version order.

Below we describe how the version dimension in HBase currently works. See HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.

27.1. Specifying the Number of Versions to Store

The maximum number of versions to store for a given column is part of the column schema and is specified at table creation or via an alter command; the default is given by HColumnDescriptor.DEFAULT_VERSIONS. Prior to HBase 0.96, the default number of versions kept was 3, but in 0.96 and newer it has been changed to 1.

Example 13. Modify the Maximum Number of Versions for a Column Family

This example uses HBase Shell to keep a maximum of 5 versions of all columns in column family f1. You could also use HColumnDescriptor.

hbase> alter 't1', NAME => 'f1', VERSIONS => 5

Example 14. Modify the Minimum Number of Versions for a Column Family

You can also specify the minimum number of versions to store per column family. By default, this is set to 0, which means the feature is disabled. The following example sets the minimum number of versions on all columns in column family f1 to 2, via HBase Shell. You could also use HColumnDescriptor.

hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2

Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly-created columns, by setting hbase.column.max.version in hbase-site.xml. See hbase.column.max.version.

27.2. Versions and HBase Operations

In this section we look at the behavior of the version dimension for each of the core HBase operations.

27.2.1. Get/Scan

Gets are implemented on top of Scans. The below discussion of Get applies equally to Scans.

By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:

- to return more than one version, see Get.setMaxVersions()
- to return versions other than the latest, see Get.setTimeRange()

To retrieve the latest version that is less than or equal to a given value, thus giving the 'latest' state of the record at a certain point in time, just use a range from 0 to the desired version and set the max versions to 1.
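The point-in-time read described above can be modeled in plain Java as "newest version inside a time range". One detail worth checking against your client version: the Get.setTimeRange documentation describes the range as half-open, [minStamp, maxStamp), in which case the upper bound has to be the desired version plus one for that version to be included. The sketch below assumes that half-open behavior; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class PointInTimeRead {
    /**
     * Returns the newest value whose version lies in [minStamp, maxStamp),
     * or null if no version falls inside that range.
     * The map must use natural (ascending) ordering of versions.
     */
    static String latestInRange(NavigableMap<Long, String> versions, long minStamp, long maxStamp) {
        Map.Entry<Long, String> newest =
                versions.subMap(minStamp, true, maxStamp, false).lastEntry();
        return newest == null ? null : newest.getValue();
    }

    /** State of the column as of the given version, inclusive. */
    static String asOf(NavigableMap<Long, String> versions, long version) {
        // Upper bound is exclusive, so add one to include the requested version.
        return latestInRange(versions, 0L, version + 1);
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> versions = new TreeMap<>();
        versions.put(3L, "v3");
        versions.put(5L, "v5");
        versions.put(6L, "v6");
        System.out.println(asOf(versions, 5L)); // v5
        System.out.println(asOf(versions, 4L)); // v3
    }
}
```

Keeping only the last entry of the sub-range plays the role of setting the max versions to 1.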

27.2.2. Default Get Example

The following Get will only retrieve the current version of the row.

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Get get = new Get(Bytes.toBytes("row1"));

Result r = table.get(get);

byte[] b = r.getValue(CF, ATTR); // returns current version of value

27.2.3. Versioned Get Example

The following Get will return the last 3 versions of the row.

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Get get = new Get(Bytes.toBytes("row1"));

get.setMaxVersions(3); // will return last 3 versions of row

Result r = table.get(get);

byte[] b = r.getValue(CF, ATTR); // returns current version of value

List<KeyValue> kv = r.getColumn(CF, ATTR); // returns all versions of this column

27.2.4. Put

Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server’s currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.

To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you want to overwrite.

Implicit Version Example

The following Put will be implicitly versioned by HBase with the current time.

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Put put = new Put(Bytes.toBytes(row));

put.add(CF, ATTR, Bytes.toBytes( data));

table.put(put);

Explicit Version Example

The following Put has the version timestamp explicitly set.

public static final byte[] CF = "cf".getBytes();

public static final byte[] ATTR = "attr".getBytes();

...

Put put = new Put( Bytes.toBytes(row));

long explicitTimeInMs = 555; // just an example

put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));

table.put(put);

Caution: the version timestamp is used internally by HBase for things like time-to-live calculations. It’s usually best to avoid setting this timestamp yourself. Prefer using a separate timestamp attribute of the row, or have the timestamp as a part of the row key, or both.

27.2.5. Delete

There are three different types of internal delete markers. See Lars Hofhansl’s blog for discussion of his attempt adding another, Scanning in HBase: Prefix Delete Marker.

- Delete: for a specific version of a column.
- Delete column: for all versions of a column.
- Delete family: for all columns of a particular ColumnFamily

When deleting an entire row, HBase will internally create a tombstone for each ColumnFamily (i.e., not each individual column).

Deletes work by creating tombstone markers. For example, let’s suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is delete all cells where the version is less than or equal to this version. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values. When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.

For an informative discussion on how deletes and versioning interact, see the thread Put w/timestamp → Deleteall → Put w/ timestamp fails up on the user mailing list.

Also see keyvalue for more information on the internal KeyValue format.

Delete markers are purged during the next major compaction of the store, unless the KEEP_DELETED_CELLS option is set in the column family (See Keeping Deleted Cells). To keep the deletes for a configurable amount of time, you can set the delete TTL via the hbase.hstore.time.to.purge.deletes property in hbase-site.xml. If hbase.hstore.time.to.purge.deletes is not set, or set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction. Otherwise, a delete marker with a timestamp in the future is kept until the major compaction which occurs after the time represented by the marker’s timestamp plus the value of hbase.hstore.time.to.purge.deletes, in milliseconds.

This behavior represents a fix for an unexpected change that was introduced in HBase 0.94, and was fixed in HBASE-10118. The change has been backported to HBase 0.94 and newer branches.

27.3. Current Limitations

27.3.1. Deletes mask Puts

Deletes mask puts, even puts that happened after the delete was entered. See HBASE-2256. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do delete and put immediately after each other, and there is some chance they happen within the same millisecond.
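The masking behavior above is easier to follow with a toy model. The sketch below is plain Java and deliberately simplified: it tracks one column, treats every delete as "everything <= T", and its class and method names are assumptions rather than HBase internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class TombstoneModel {
    private final TreeMap<Long, String> cells = new TreeMap<>(); // version -> value
    private final List<Long> tombstones = new ArrayList<>();     // "delete everything <= T"

    void put(long version, String value) {
        cells.put(version, value);
    }

    void deleteUpTo(long version) {
        tombstones.add(version); // nothing is removed yet, only a marker is written
    }

    /** Newest value that is not masked by a tombstone, or null. */
    String get() {
        for (Long version : cells.descendingKeySet()) {
            if (!isMasked(version)) {
                return cells.get(version);
            }
        }
        return null;
    }

    private boolean isMasked(long version) {
        for (long t : tombstones) {
            if (version <= t) {
                return true;
            }
        }
        return false;
    }

    /** Drops masked cells and the tombstones themselves, as a major compaction would. */
    void majorCompact() {
        cells.keySet().removeIf(this::isMasked);
        tombstones.clear();
    }

    public static void main(String[] args) {
        TombstoneModel column = new TombstoneModel();
        column.put(10L, "old");
        column.deleteUpTo(20L);
        column.put(15L, "new");            // succeeds, but 15 <= 20
        System.out.println(column.get());  // null: the put is masked
        column.majorCompact();
        column.put(15L, "again");
        System.out.println(column.get());  // again
    }
}
```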

27.3.2. Major compactions change query results

…create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore… (See Garbage Collection in Bending time in HBase.)

28. Sort Order

All data model operations in HBase return data in sorted order. First by row, then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted in reverse, so newest records are returned first).
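The ordering just described can be written down as a comparator. This is an illustrative sketch in plain Java: the CellKey class is hypothetical, and it compares strings where HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CellKey {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;

    CellKey(String row, String family, String qualifier, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    /** Row, then family, then qualifier ascending; timestamp descending. */
    static final Comparator<CellKey> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp); // newest first
    };

    @Override
    public String toString() {
        return row + "/" + family + ":" + qualifier + "@" + timestamp;
    }

    public static void main(String[] args) {
        List<CellKey> keys = new ArrayList<>();
        keys.add(new CellKey("com.cnn.www", "contents", "html", 5L));
        keys.add(new CellKey("com.cnn.www", "contents", "html", 6L));
        keys.add(new CellKey("com.cnn.www", "anchor", "cnnsi.com", 9L));
        keys.sort(ORDER);
        // anchor sorts before contents; within contents:html, version 6 precedes version 5
        System.out.println(keys);
    }
}
```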

29. Column Metadata

There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily. Thus, while HBase can support not only a wide number of columns per row, but a heterogeneous set of columns between rows as well, it is your responsibility to keep track of the column names.

The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows. For more information about how HBase stores data internally, see keyvalue.

30. Joins

Whether HBase supports joins is a common question on the dist-list, and there is a simple answer: it doesn't, at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins in SQL). As has been illustrated in this chapter, the read data model operations in HBase are Get and Scan.

However, that doesn't mean that equivalent join functionality can't be supported in your application, but you have to do it yourself. The two primary strategies are either denormalizing the data upon writing to HBase, or to have lookup tables and do the join between HBase tables in your application or MapReduce code (and as RDBMSs demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs. hash-joins). So which is the best approach? It depends on what you are trying to do, and as such there isn't a single answer that works for every use case.
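For the second strategy, an application-side hash join might look like the sketch below. It assumes both inputs have already been read into memory (for example from two scans) as {key, value} pairs; the class name, the pair representation, and the output format are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClientSideHashJoin {
    /**
     * Equi-joins two lists of {key, value} pairs on the key.
     * The build side should be the smaller input because it is held in a hash map.
     * Each output entry has the form key|buildValue|probeValue.
     */
    static List<String> join(List<String[]> buildSide, List<String[]> probeSide) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : buildSide) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String> joined = new ArrayList<>();
        for (String[] row : probeSide) {
            List<String> matches = index.get(row[0]);
            if (matches == null) {
                continue; // inner join: unmatched probe rows are dropped
            }
            for (String match : matches) {
                joined.add(row[0] + "|" + match + "|" + row[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = new ArrayList<>();
        users.add(new String[] {"u1", "Ann"});
        users.add(new String[] {"u2", "Bob"});
        List<String[]> orders = new ArrayList<>();
        orders.add(new String[] {"u1", "order-7"});
        orders.add(new String[] {"u3", "order-9"});
        System.out.println(join(users, orders)); // [u1|Ann|order-7]
    }
}
```

A hash join suits the case where one side fits in memory; when neither side does, the nested-loop or MapReduce approaches mentioned above are the usual alternatives.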

31. ACID

See ACID Semantics. Lars Hofhansl has also written a note on ACID in HBase.