Phoenix in Three Posts, Part 2: The Phoenix Secondary Index System

Date: 2019-11-09

1 Secondary Indexes: Global Indexing

1.1 Overview

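HBase has a single index: the rowKey, sorted in lexicographic order. Queries by rowKey are fast, but a query that does not use the rowKey falls back to a filter over a full table scan, which greatly reduces retrieval performance. Phoenix provides secondary indexes for exactly this case, where the lookup condition is something other than the rowKey.

Phoenix supports two types of index: Global Indexing and Local Indexing. They suit different workloads, mainly depending on whether the workload is read-heavy or write-heavy. Both are tried out briefly below; their performance has not been measured.

The text above is taken from the official documentation:

http://phoenix.apache.org/secondary_indexing.html

This part covers Global Indexing.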

1.2 Global Indexing

Global indexing targets read heavy, low write use cases. With global indexes, all the performance penalties for indexes occur at write time. We intercept the data table updates on write (DELETE, UPSERT VALUES and UPSERT SELECT), build the index update and then send any necessary updates to all interested index tables. At read time, Phoenix will select the index table to use that will produce the fastest query time and directly scan it just like any other HBase table. By default, unless hinted, an index will not be used for a query that references a column that isn’t part of the index.

Global indexing suits read-heavy workloads with few writes. Its cost is paid at write time: every update to the data table (DELETE, UPSERT VALUES and UPSERT SELECT) triggers an update to the index tables, and because the index tables are distributed across different data nodes, the cross-node traffic is expensive. At read time Phoenix chooses the index table that gives the fastest query. By default, if the query references a column that is not in the index, the index table is not used and the query gets no speedup.

1.2.1 Configure hbase-site.xml

To use Global Indexing, add the following to hbase-site.xml on every region server node in the HBase cluster, then restart the cluster.

<property>
    <name>hbase.regionserver.wal.codec</name>
    <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>

1.2.2 Create the table

Open the Phoenix CLI and create the company table.

> create table company(id varchar primary key, name varchar, address varchar);

List the indexes on the company table:

> !indexes company

1.2.3 Create the index

Create an index named my_index on the name column of the company table.

> create index my_index on company(name);

Listing all tables now shows an additional index table, MY_INDEX. Query it:

> !tables
> select * from my_index;

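The index table has two columns. :ID is created automatically and is the primary key of the data table, that is, the HBase rowKey; 0:NAME is the column we just indexed.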
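To read the index table's columns individually rather than with select *, the names can be quoted. This is a minimal sketch that assumes the index exposes its columns under the names listed above (:ID and 0:NAME); the double quotes are there because the names contain a colon and must be matched literally.

```sql
-- Assumption: MY_INDEX exposes its columns as ":ID" and "0:NAME", as listed above.
-- Double-quoted identifiers are taken literally (case-sensitive) by Phoenix.
select "0:NAME", ":ID" from my_index;
```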

1.2.4 Insert data

Add a test row to the company table.

> upsert into company(id, name, address) values('001', 'dimensoft', 'nanjing');

1.2.5 Query data

Query the company table:

> select name,address from company where name='dimensoft';

Query the index table my_index:

> select * from my_index;

From the HBase shell, scan the index table MY_INDEX:

> scan 'MY_INDEX'

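The values of the two index columns, NAME and ID, are concatenated into the rowKey of MY_INDEX. In the scan output, \x00 is a zero byte, which Phoenix writes as the separator after a variable-length VARCHAR value. The 0 that follows it is the first character of the id 001, not part of the escape, and the byte is not a space.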

1.2.6 The partially covered query problem

Heads up:

> select name,address from company where name='dimensoft';

A query like this does not use the index table.

Global mutable index will not be used unless all of the columns referenced in the query are contained in the index.

name is an indexed column, but address is not. For the index to be used, every column the query references must be in the index, for example:

> select name from company where name='dimensoft';
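Whether a given query goes to the index can be checked with Phoenix's EXPLAIN statement, which prints the query plan instead of running the query. The sketch below explains both queries from this section; the exact plan wording varies between Phoenix versions, so the comments describe what to look for rather than literal output.

```sql
-- Partially covered: address is not in my_index.
-- The plan is expected to name the data table COMPANY with a full scan.
explain select name, address from company where name = 'dimensoft';

-- Fully covered: only name is referenced.
-- The plan is expected to name MY_INDEX instead of COMPANY.
explain select name from company where name = 'dimensoft';
```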

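To have such a query use the index table, there are three ways to work around the problem:

Force the index with a hint

Force the index from within the SQL statement by adding a hint.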

> SELECT /*+ INDEX(company my_index) */ name,address FROM company WHERE name = 'dimensoft';

This will cause each data row to be retrieved when the index is traversed to find the missing address column value. This hint should only be used if you know that the index has good selectivity (i.e. a small number of table rows have a value of ‘dimensoft’ in this example), as otherwise you’ll get better performance by the default behavior of doing a full table scan.

A hinted query like this reads twice. The first pass looks up the rows whose name is dimensoft in the index table; because address is not in the index, each matching row is then fetched from the company table to get its value. The hint is therefore only appropriate when you know that few rows match the condition; otherwise the per-row lookups cost more than the default full table scan.
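A rough way to judge whether the hint is worthwhile is to compare the number of matching rows with the size of the table. This is only a sketch of that check, not a rule from the documentation; how small the ratio needs to be depends on the data and the cluster.

```sql
-- Rows that match the predicate; each one costs a lookup in company when the hint is used.
select count(*) from company where name = 'dimensoft';

-- Total rows in the table, for comparison.
select count(*) from company;
```

If the first number is a tiny fraction of the second, the hinted query is more likely to pay off.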

Create a covered index

Specify a covered column when creating the index. First drop the my_index index:

> drop index my_index on company;

Create the covered index:

> create index my_index on company(name) include(address);

This will cause the address column value to be copied into the index and kept in synch as it changes. This will obviously increase the size of the index.

An index created this way copies the value of address into the index; the drawback is that the index table grows.

Query the index table my_index.

> select * from my_index;

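The data here is synchronized automatically, and the value of address is now stored as well.

Scanning MY_INDEX from the HBase shell shows one more line of output than without include, and it carries the value of address.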

> scan 'MY_INDEX'

The following query now uses the index:

> select name,address from company where name='dimensoft';

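Create the index as a local index

Unlike Global Indexing, with Local Indexing the index is used even when not all of the queried columns are in the index (Local Indexing handles this automatically). This is covered in detail in the next part.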

2 Local Indexing

2.1 Overview

This part covers Local Indexing.

2.2 Local Indexing

Local indexing targets write heavy, space constrained use cases. Just like with global indexes, Phoenix will automatically select whether or not to use a local index at query-time. With local indexes, index data and table data co-reside on same server preventing any network overhead during writes. Local indexes can be used even when the query isn’t fully covered (i.e. Phoenix automatically retrieves the columns not in the index through point gets against the data table). Unlike global indexes, all local indexes of a table are stored in a single, separate shared table. At read time when the local index is used, every region must be examined for the data as the exact region location of index data cannot be predetermined. Thus some overhead occurs at read-time.

Local indexing suits write-heavy workloads. As with Global indexing, Phoenix decides automatically at query time whether to use the index. With Local indexing the index data and the table data live on the same server, which avoids the extra cost of writing index entries to index tables on other servers. A local index can be used even when the query is not fully covered by it, in which case Phoenix fetches the missing columns from the data table; this differs from Global indexing. All local indexes of a table are stored in a single, separate shared table. At read time, when a local index is used, every region has to be examined because the exact region holding the index data cannot be determined in advance, so reads carry some overhead.

2.2.1 Configure hbase-site.xml

To use Local Indexing, add the following to hbase-site.xml on the HBase master node, then restart the cluster.

Local indexing also requires special configurations in the master to ensure data table and local index regions co-location.

The purpose of this setting is to keep the data table regions and the local index regions co-located.

<property>
    <name>hbase.master.loadbalancer.class</name>
    <value>org.apache.phoenix.hbase.index.balancer.IndexLoadBalancer</value>
</property>
<property>
    <name>hbase.coprocessor.master.classes</name>
    <value>org.apache.phoenix.hbase.index.master.IndexMasterObserver</value>
</property>

Heads up: with Phoenix 4.3 or later, the following must also be added to hbase-site.xml on every region server node, followed by a restart of the HBase cluster.

To support local index regions merge on data regions merge you will need to add the following parameter to hbase-site.xml in all the region servers and restart. (It’s applicable for Phoenix 4.3+ versions)

This setting makes the local index regions merge when the corresponding data regions are merged.

<property>
    <name>hbase.coprocessor.regionserver.classes</name>
    <value>org.apache.hadoop.hbase.regionserver.LocalIndexMerger</value>
</property>

2.2.2 Create the table

Open the Phoenix CLI and create the company table.

> create table company(id varchar primary key, name varchar, address varchar);

List the indexes on the company table:

> !indexes company

2.2.3 Create the index

Create a local index named my_index on the name column of the company table.

> create local index my_index on company(name);

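Listing all tables in Phoenix now shows an additional index table, MY_INDEX, which can be queried.

The index columns of company can be viewed with SQuirreL.

From the HBase shell, list all tables: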

> list

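Heads up: in HBase the index table is not named MY_INDEX but _LOCAL_IDX_COMPANY. Queries from the Phoenix CLI still use the name MY_INDEX, so Phoenix presumably maps one to the other.

2.2.4 Insert data

Add a test row to the company table.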

> upsert into company(id, name, address) values('001', 'dimensoft', 'nanjing');

2.2.5 Query data

Query the company table and the index table my_index.

> select * from company;
> select * from my_index;
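To confirm that a partially covered query goes through the local index, the plan can be inspected with EXPLAIN. This is a sketch; the plan text, including the table name it reports, differs between Phoenix versions, so no literal output is shown.

```sql
-- address is not part of my_index, yet with a local index the plan is
-- expected to show a range scan keyed on 'dimensoft' instead of a full scan.
explain select name, address from company where name = 'dimensoft';
```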

From the HBase shell, scan the index table _LOCAL_IDX_COMPANY:

> scan '_LOCAL_IDX_COMPANY'

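The values of three index columns, _INDEX_ID, NAME and ID, are concatenated into the rowKey of the index table; _INDEX_ID shows no visible value here. As before, \x00 in the scan output is a zero byte rather than a space, and a 0 that directly follows it is the first character of the id 001.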

3 Append-only Data

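3.1 Overview

This kind of index is worth describing as well. Phoenix divides its secondary indexes into global and local indexing, but each can be split further, giving four kinds in total: mutable global indexing, mutable local indexing, immutable global indexing and immutable local indexing.

Secondary indexes are mutable by default (mutable global indexing or mutable local indexing). The previous two parts gave a rough description of both. Immutable indexes target data that is written once and never changed (only written once and never updated).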

3.2 Append-only Data

For a table in which the data is only written once and never updated in-place, certain optimizations may be made to reduce the write-time overhead for incremental maintenance. This is common with time-series data such as log or event data, where once a row is written, it will never be updated. To take advantage of these optimizations, declare your table as immutable by adding the IMMUTABLE_ROWS=true property to your DDL statement:

CREATE TABLE my_table (k VARCHAR PRIMARY KEY, v VARCHAR) IMMUTABLE_ROWS=true;

All indexes on a table declared with IMMUTABLE_ROWS=true are considered immutable (note that by default, tables are considered mutable). For global immutable indexes, the index is maintained entirely on the client-side with the index table being generated as changes to the data table occur. Local immutable indexes, on the other hand, are maintained on the server-side. Note that no safeguards are in-place to enforce that a table declared as immutable doesn’t actually mutate data (as that would negate the performance gain achieved). If that was to occur, the index would no longer be in sync with the table.

Where data is written once and never updated, the main optimization is a lower write-time cost. Log data and event data are typical examples. Declaring IMMUTABLE_ROWS=true when creating the data table states explicitly that all indexes on the table are immutable (the default is mutable). Global immutable indexes are maintained by the client, while local immutable indexes are maintained on the server. Phoenix does not enforce the declaration: rows in a table declared immutable can still be updated, and doing so leaves the index table out of sync with the data table.
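For a table that already exists, the official documentation also gives an ALTER TABLE form for changing this property. The sketch below reuses the my_table example from above and switches it back to mutable indexing, which is the safer setting if rows will be updated in place. How indexes that already exist on the table are affected is something to verify on your own Phoenix version.

```sql
-- Switch an existing table from immutable to mutable indexing.
alter table my_table set IMMUTABLE_ROWS=false;
```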

3.2.1 Create the table

Declare IMMUTABLE_ROWS=true when creating the data table to state explicitly that all indexes on the table are immutable.

> create table company_immutable(id varchar primary key, name varchar, address varchar) IMMUTABLE_ROWS=true;

3.2.2 Create the index

Create an index on the name column of the company_immutable table.

> create index name_test on company_immutable(name);

3.2.3 Insert data

Insert a test row.

> upsert into company_immutable(id, name, address) values('001', 'dimensoft', 'nanjing');

3.2.4 Query data

Query the data table and the index table.

> select * from company_immutable;
> select * from name_test;

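3.2.5 Update data

Update the row whose id is 001. This is done here only as a test; a table declared immutable should not normally be updated.

> upsert into company_immutable(id, name, address) values('001', 'baidu', 'beijing');

Query the data table and the index table again.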

> select * from company_immutable;
> select * from name_test;

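The existing entry in the index table was not modified; a new entry was appended instead, so the index now holds a stale entry for the old name. This is how an immutable index behaves.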
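The mismatch can be made visible by counting rows on both sides. The sketch below uses the NO_INDEX hint so that the first count is served by the data table and not by the index. The final statement is Phoenix's ALTER INDEX ... REBUILD, shown as one possible way to bring the index back in line; try it on a test table first.

```sql
-- Count rows in the data table itself (NO_INDEX keeps the optimizer off name_test).
select /*+ NO_INDEX */ count(*) from company_immutable;

-- Count entries in the index table; after the update above this is expected to be larger.
select count(*) from name_test;

-- Rebuild the index from the data table (assumption: acceptable on this test data).
alter index if exists name_test on company_immutable rebuild;
```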

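4 Secondary Index Performance Tuning

4.1 Overview

Several parameters can be changed to tune the performance of Phoenix secondary indexes. I have not used them in practice, but they are recorded here for anyone who needs them. The content comes from the official documentation.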

http://phoenix.apache.org/secondary_indexing.html

4.2 Tuning

All the following parameters must be set in hbase-site.xml - they are true for the entire cluster and all index tables, as well as across all regions on the same server (so, for instance, a single server would not write to too many different index tables at once).

index.builder.threads.max
- Number of threads to use to build the index update from the primary table update
- Increasing this value overcomes the bottleneck of reading the current row state from the underlying HRegion. Tuning this value too high will just bottleneck at the HRegion as it will not be able to handle too many concurrent scan requests as well as general thread-swapping concerns.
- Default: 10
index.builder.threads.keepalivetime
- Amount of time in seconds after we expire threads in the builder thread pool.
- Unused threads are immediately released after this amount of time and no core threads are retained (though this last is a small concern as tables are expected to sustain a fairly constant write load), but simultaneously allows us to drop threads if we are not seeing the expected load.
- Default: 60
index.writer.threads.max
- Number of threads to use when writing to the target index tables.
- The first level of parallelization, on a per-table basis - it should roughly correspond to the number of index tables
- Default: 10
index.writer.threads.keepalivetime
- Amount of time in seconds after we expire threads in the writer thread pool.
- Unused threads are immediately released after this amount of time and no core threads are retained (though this last is a small concern as tables are expected to sustain a fairly constant write load), but simultaneously allows us to drop threads if we are not seeing the expected load.
- Default: 60
hbase.htable.threads.max
- Number of threads each index HTable can use for writes.
- Increasing this allows more concurrent index updates (for instance across batches), leading to high overall throughput.
- Default: 2,147,483,647
hbase.htable.threads.keepalivetime
- Amount of time in seconds after we expire threads in the HTable’s thread pool.
- Using the "direct handoff" approach, new threads will only be created if it is necessary and will grow unbounded. This could be bad but HTables only create as many Runnables as there are region servers; therefore, it also scales when new region servers are added.
- Default: 60
index.tablefactory.cache.size
- Number of index HTables we should keep in cache.
- Increasing this number ensures that we do not need to recreate an HTable for each attempt to write to an index table. Conversely, you could see memory pressure if this value is set too high.
- Default: 10
org.apache.phoenix.regionserver.index.priority.min
- Value to specify the bottom (inclusive) of the range in which index priority may lie.
- Default: 1000
org.apache.phoenix.regionserver.index.priority.max
- Value to specify the top (exclusive) of the range in which index priority may lie.
- Higher priorities within the index min/max range do not mean updates are processed sooner.
- Default: 1050
org.apache.phoenix.regionserver.index.handler.count
- Number of threads to use when serving index write requests for global index maintenance.
- Though the actual number of threads is dictated by the Max(number of call queues, handler count), where the number of call queues is determined by standard HBase configuration. To further tune the queues, you can adjust the standard rpc queue length parameters (currently, there are no special knobs for the index queues), specifically ipc.server.max.callqueue.length and ipc.server.callqueue.handler.factor. See the HBase Reference Guide for more details.
- Default: 30