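1. Basic index architecture
PostgreSQL's default index is a B+ tree style B-tree. A B+ tree is a balanced search tree designed for disks and other direct-access secondary storage. All records are stored in key order in the leaf nodes, which sit on the same level, and the leaf nodes are linked to each other by pointers: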
meta page
    |
root page (8 KB; assuming one entry takes 32 bytes, it can hold 256 entries; beyond that, an extra level of branch pages is needed to reach the leaf pages)
    |
branch page …
    |              |              |
branch page    branch page    branch page …
  |    |         |    |         |    |    |
leaf page leaf page leaf page leaf page leaf page leaf page leaf page …
    |
 ---------------------
 | key               |
 | (block,offset)    |   one leaf page holds many index entries
 ---------------------
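The meta page and the root page are always present. The meta page occupies a page of its own and stores the page id of the root page.
As the number of records grows, a single root page can no longer hold all the index entries, so branch pages appear, possibly in several levels.
Leaf pages store the actual keys and values (the (block,offset) location of the row).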
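The number of levels below the root is recorded in the level field of the btree page metadata. If level is 0, the root page is itself the leaf page and holds at most 256 entries (assuming each entry takes 32 bytes). If level is 1, the root page stores pointers to leaf pages, for at most 256*256 entries. If level is 2, one layer of branch pages sits between the root page and the leaf pages, and so on. The block size can also be changed (it is chosen when PostgreSQL is built); with a larger block size, each level stores more data.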
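The capacity arithmetic above can be sketched in a few lines. This is only an illustration under the text's simplifying assumption that every page holds 256 entries; `ENTRIES_PER_PAGE` and `max_entries` are hypothetical names, not PostgreSQL identifiers.

```python
# Illustrative assumption from the text: every page holds 256 entries.
ENTRIES_PER_PAGE = 256


def max_entries(level: int, fanout: int = ENTRIES_PER_PAGE) -> int:
    """Upper bound on index entries for a tree whose root is at `level`.

    level 0 means the root is the only (leaf) page, so the bound is `fanout`;
    every extra level multiplies the bound by `fanout` again.
    """
    if level < 0:
        raise ValueError("level must be >= 0")
    return fanout ** (level + 1)


if __name__ == "__main__":
    for level in range(3):
        print(f"level {level}: at most {max_entries(level)} entries")
```

Real pages hold a different number of entries depending on key width and page overhead, so these figures are upper bounds for the toy model only.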
2. A look at the concrete structure
PostgreSQL's B-Tree is a variant (a high-concurrency B-tree management algorithm); for details of the algorithm see src/backend/access/nbtree/README. We can use the pageinspect extension to look inside a B-Tree:
apple=# create extension pageinspect;
CREATE EXTENSION
apple=# \dx
                                List of installed extensions
    Name     | Version |   Schema   |                      Description
-------------+---------+------------+-------------------------------------------------------
 pageinspect | 1.7     | public     | inspect the contents of database pages at a low level
 plpgsql     | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

apple=# create table test(id int primary key, info text);
CREATE TABLE
apple=# insert into test select generate_series(1, 1000), md5(random()::text);
INSERT 0 1000
apple=# vacuum ANALYZE test;
VACUUM
apple=# \d test
                Table "public.test"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 id     | integer |           | not null |
 info   | text    |           |          |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

apple=# select * from bt_metap('test_pkey');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |    3 |     1 |        3 |         1 |           0 |                    1000
(1 row)

apple=# select * from bt_page_stats('test_pkey',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey',2);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey',3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2
(1 row)

apple=# select * from bt_page_stats('test_pkey',0);
2019-05-29 11:10:47.567 CST [49885] ERROR: block 0 is a meta page
2019-05-29 11:10:47.567 CST [49885] STATEMENT: select * from bt_page_stats('test_pkey',0);
ERROR: block 0 is a meta page
apple=# select * from bt_page_stats('test_pkey',4);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     4 | l    |        268 |          0 |            16 |      8192 |      2788 |         2 |         0 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey',5);
2019-05-29 11:11:18.181 CST [49885] ERROR: block number out of range
2019-05-29 11:11:18.181 CST [49885] STATEMENT: select * from bt_page_stats('test_pkey',5);
ERROR: block number out of range
meta page
root page # btpo_flags=2
branch page # btpo_flags=0
leaf page # btpo_flags=1
A page that is both leaf and root has btpo_flags=3.
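The values listed above behave like a bit mask: bit 0 marks a leaf and bit 1 marks the root, which is why a page that is both shows 3. The following is a minimal sketch of that reading. The constant names mirror the ones used in PostgreSQL's nbtree header as far as I know, but `describe_flags` is a hypothetical helper and only the two bits discussed here are handled.

```python
# Bit values taken from the btpo_flags list above.
BTP_LEAF = 1 << 0  # 1: page is a leaf
BTP_ROOT = 1 << 1  # 2: page is the root


def describe_flags(btpo_flags: int) -> str:
    """Translate the leaf/root bits of btpo_flags into a readable label.

    Other bits exist in a real index (for example for deleted pages);
    this sketch deliberately ignores them.
    """
    names = []
    if btpo_flags & BTP_ROOT:
        names.append("root")
    if btpo_flags & BTP_LEAF:
        names.append("leaf")
    # Neither bit set: an internal page that is not the root.
    return "+".join(names) if names else "branch"


if __name__ == "__main__":
    for flags in (0, 1, 2, 3):
        print(flags, describe_flags(flags))
```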
Judging by btpo_flags, the structure should look like this; the index occupies 5 blocks (0 to 4):
meta page (block 0)
    |
root page (block 3, btpo_flags is 2, see the last column above)
    |                    |                    |
leaf page (block 1)  leaf page (block 2)  leaf page (block 4)
    |
 ---------------------
 | key               |
 | (block,offset)    |
 ---------------------
Inspect the root page:
apple=# select * from bt_page_items('test_pkey',3);
 itemoffset |  ctid  | itemlen | nulls | vars |          data
------------+--------+---------+-------+------+-------------------------
          1 | (1,0)  |       8 | f     | f    |
          2 | (2,7)  |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          3 | (4,13) |      16 | f     | f    | dd 02 00 00 00 00 00 00
(3 rows)
Inspect a leaf page:
apple=# select * from bt_page_items('test_pkey',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (3,7)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
          5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
          6 | (0,5)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
          7 | (0,6)   |      16 | f     | f    | 06 00 00 00 00 00 00 00
          8 | (0,7)   |      16 | f     | f    | 07 00 00 00 00 00 00 00
….
Look up the row that an item points to (here item 1 of the leaf page, ctid (3,7)):
apple=# select * from test where ctid = '(3,7)';
 id  |               info
-----+----------------------------------
 367 | 06818c090f9e5f63c95764342590a598
(1 row)
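The data column printed by bt_page_items is the raw key. A small sketch of how to read it, assuming the key is a 4-byte int stored little-endian in the first four bytes and the remaining bytes are padding (an assumption that fits the outputs shown here; `decode_int4_key` is a hypothetical helper name):

```python
def decode_int4_key(data: str) -> int:
    """Decode the hex dump of an int4 index key as printed by bt_page_items.

    Assumes a little-endian machine and that only the first 4 bytes carry
    the key; the rest of the 8 printed bytes is treated as padding.
    """
    raw = bytes.fromhex(data)  # spaces between the byte pairs are accepted
    return int.from_bytes(raw[:4], byteorder="little", signed=True)


if __name__ == "__main__":
    # "6f 01 ..." is the key of the item whose ctid (3,7) pointed at id 367.
    print(decode_int4_key("6f 01 00 00 00 00 00 00"))
    print(decode_int4_key("dd 02 00 00 00 00 00 00"))
```

Decoding the first item this way gives the same id, 367, that the ctid lookup returned.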
So how are the keys in the index ordered? And why are the blocks not contiguous, with block 3 ending up as the root page?
3. Watching level change
apple=# drop table test;
DROP TABLE
apple=# create table test(id int primary key, info text);
CREATE TABLE
apple=# insert into test select t.id, md5(random()::text) from generate_series(1, 20) as t(id);
INSERT 0 20
apple=# select * from bt_page_stats('test_pkey', 1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |         20 |          0 |            16 |      8192 |      7748 |         0 |         0 |    0 |          3
(1 row)

apple=# select * from bt_metap('test_pkey');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |    1 |     0 |        1 |         0 |           0 |                      -1
(1 row)

apple=# select bt_page_items('test_pkey', 1);
                 bt_page_items
------------------------------------------------
 (1,"(0,1)",16,f,f,"01 00 00 00 00 00 00 00")
 (2,"(0,2)",16,f,f,"02 00 00 00 00 00 00 00")
 (3,"(0,3)",16,f,f,"03 00 00 00 00 00 00 00")
 (4,"(0,4)",16,f,f,"04 00 00 00 00 00 00 00")
 (5,"(0,5)",16,f,f,"05 00 00 00 00 00 00 00")
 (6,"(0,6)",16,f,f,"06 00 00 00 00 00 00 00")
 (7,"(0,7)",16,f,f,"07 00 00 00 00 00 00 00")
 (8,"(0,8)",16,f,f,"08 00 00 00 00 00 00 00")
 (9,"(0,9)",16,f,f,"09 00 00 00 00 00 00 00")
 (10,"(0,10)",16,f,f,"0a 00 00 00 00 00 00 00")
 (11,"(0,11)",16,f,f,"0b 00 00 00 00 00 00 00")
 (12,"(0,12)",16,f,f,"0c 00 00 00 00 00 00 00")
 (13,"(0,13)",16,f,f,"0d 00 00 00 00 00 00 00")
 (14,"(0,14)",16,f,f,"0e 00 00 00 00 00 00 00")
 (15,"(0,15)",16,f,f,"0f 00 00 00 00 00 00 00")
 (16,"(0,16)",16,f,f,"10 00 00 00 00 00 00 00")
 (17,"(0,17)",16,f,f,"11 00 00 00 00 00 00 00")
 (18,"(0,18)",16,f,f,"12 00 00 00 00 00 00 00")
 (19,"(0,19)",16,f,f,"13 00 00 00 00 00 00 00")
 (20,"(0,20)",16,f,f,"14 00 00 00 00 00 00 00")
(20 rows)
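With 20 rows inserted, a single root page can hold everything, so there is no need to allocate a separate leaf page: the root page is also the leaf page, and its btpo_flags is 3. Since there is no level below the root, level is 0.
Insert more data: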
apple=# insert into test select t.id, md5(random()::text) from generate_series(21, 1000) as t(id);
INSERT 0 980
apple=# select * from bt_metap('test_pkey');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |    3 |     1 |        3 |         1 |           0 |                      -1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 0);
ERROR: block 0 is a meta page
apple=# select * from bt_page_stats('test_pkey', 1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 2);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2
(1 row)

apple=# select * from bt_page_stats('test_pkey', 4);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     4 | l    |        268 |          0 |            16 |      8192 |      2788 |         2 |         0 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 5);
ERROR: block number out of range
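We can see that level changed from 0 to 1. There is a new root page, which moved from block 1 to block 3, and the index now has three leaf pages (blocks 1, 2 and 4), two of them newly allocated.
We can also see that with an int key a leaf page holds roughly 367 items, while the root page needs about 13 bytes (avg_item_size) to record a pointer to one leaf page. So how many more rows do we have to insert before branch pages appear?
As a rough estimate that ignores page headers and per-item overhead: (8192/13) * 367 - 1000 = 630 * 367 - 1000 = 230210 rows. Let's insert data and see: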
apple=# insert into test select t.id, md5(random()::text) from generate_series(1001, 230210) as t(id);
INSERT 0 229210
apple=# select * from bt_page_stats('test_pkey', 0);
ERROR: block 0 is a meta page
apple=# analyze ;
ANALYZE
apple=# select * from bt_metap('test_pkey');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |  412 |     2 |      412 |         2 |           0 |                      -1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 412);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   412 | r    |          2 |          0 |            12 |      8192 |      8116 |         0 |         0 |    2 |          2
(1 row)

apple=# select * from bt_page_stats('test_pkey', 411);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   411 | i    |        344 |          0 |            15 |      8192 |      1276 |         3 |         0 |    1 |          0
(1 row)

apple=# select * from bt_page_stats('test_pkey', 3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | i    |        286 |          0 |            15 |      8192 |      2436 |         0 |       411 |    1 |          0
(1 row)

apple=# select * from bt_page_stats('test_pkey', 8);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     8 | l    |        367 |          0 |            16 |      8192 |       808 |         7 |         9 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 2);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 4);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     4 | l    |        367 |          0 |            16 |      8192 |       808 |         2 |         5 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 632);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   632 | l    |        362 |          0 |            16 |      8192 |       908 |       631 |         0 |    0 |          1
(1 row)

apple=# select * from bt_page_stats('test_pkey', 633);
ERROR: block number out of range
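Above, level changed from 1 to 2, and the root page moved again, from block 3 to block 412, while block 3 was demoted to a branch node. There are now two branch pages with btpo_flags 0: block 3 and the newly added block 411. The last page number can be estimated as (8192/13) + 2 = 632, which agrees with the output: block 632 exists and block 633 is out of range.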
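The arithmetic behind these numbers can be replayed as a short script. The first half repeats the rough estimate from the text. The second half is a cross-check that rests on an assumption of mine: in the leaf listing shown earlier, item 1 of block 1 repeats key 367 while items 2 onward are the rows 1, 2, 3, …, so I treat one of the 367 items on every non-rightmost leaf as a boundary entry and count 366 rows per full leaf. All variable names are hypothetical.

```python
PAGE_SIZE = 8192          # bytes, from page_size
ROOT_ITEM_SIZE = 13       # bytes, avg_item_size seen on the root page
ITEMS_PER_LEAF = 367      # live_items on a full leaf page
TOTAL_ROWS = 230210       # rows in the table after the last insert

# Rough estimate from the text (integer division makes the rounding explicit).
pointers_per_page = PAGE_SIZE // ROOT_ITEM_SIZE                    # 630
estimated_rows = pointers_per_page * ITEMS_PER_LEAF - 1000         # extra rows
estimated_last_block = pointers_per_page + 2

# Cross-check under the stated assumption: 366 rows per full leaf.
rows_per_full_leaf = ITEMS_PER_LEAF - 1
leaf_pages = (TOTAL_ROWS + rows_per_full_leaf - 1) // rows_per_full_leaf
rows_on_last_leaf = TOTAL_ROWS - (leaf_pages - 1) * rows_per_full_leaf

# meta page + root page + two branch pages + leaf pages; blocks count from 0.
total_blocks = 1 + 1 + 2 + leaf_pages
last_block = total_blocks - 1

if __name__ == "__main__":
    print("estimated rows to insert:", estimated_rows)
    print("estimated last block:", estimated_last_block)
    print("leaf pages:", leaf_pages, "rows on last leaf:", rows_on_last_leaf)
    print("last block by page count:", last_block)
```

Under that assumption the page count lands on the same last block as the estimate, and the row count left for the final leaf matches the live_items reported for block 632.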
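The two branch pages are chained together: the page after block 3 is block 411 (btpo_next = 411). How can we tell that block 3 is the first branch page? Its btpo_prev is 0, so it has no left sibling.
All the leaf pages are chained as 0->1->2->4->5 … ->632->0, where 0 means there is no sibling. But how do we know which branch page points to which leaf page?
The answer is in the items stored in the root page and the branch pages:
The items of the root page hold the block number and offset of the corresponding branch pages: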
apple=# select * from bt_page_items('test_pkey', 412);
 itemoffset |   ctid   | itemlen | nulls | vars |          data
------------+----------+---------+-------+------+-------------------------
          1 | (3,0)    |       8 | f     | f    |
          2 | (411,31) |      16 | f     | f    | 77 97 01 00 00 00 00 00
(2 rows)
The items of a branch page hold the block number and offset of the leaf pages:
apple=# select * from bt_page_items('test_pkey', 3);
 itemoffset |   ctid    | itemlen | nulls | vars |          data
------------+-----------+---------+-------+------+-------------------------
          1 | (287,31)  |      16 | f     | f    | 77 97 01 00 00 00 00 00
          2 | (1,0)     |       8 | f     | f    |
          3 | (2,7)     |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          4 | (4,13)    |      16 | f     | f    | dd 02 00 00 00 00 00 00
          5 | (5,19)    |      16 | f     | f    | 4b 04 00 00 00 00 00 00
          6 | (6,25)    |      16 | f     | f    | b9 05 00 00 00 00 00 00
          7 | (7,31)    |      16 | f     | f    | 27 07 00 00 00 00 00 00
          8 | (8,37)    |      16 | f     | f    | 95 08 00 00 00 00 00 00
          9 | (9,43)    |      16 | f     | f    | 03 0a 00 00 00 00 00 00
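To make the "which leaf does a key live in" question concrete, here is a minimal sketch that routes a key through the branch page items listed above. The (key, block) pairs are transcribed by hand from that output, with the hex data decoded as little-endian ints; the item without data is treated as covering everything below the first key, and item 1 is left out because it repeats the key seen in the root page rather than naming a leaf on this page. That reading, and the helper `find_child`, are my assumptions for illustration, not PostgreSQL code.

```python
from bisect import bisect_right

# (lowest key covered, child leaf block), from items 2..9 of block 3.
# None stands for "no lower bound" (the item printed with empty data).
DOWNLINKS = [
    (None, 1),
    (367, 2),
    (733, 4),
    (1099, 5),
    (1465, 6),
    (1831, 7),
    (2197, 8),
    (2563, 9),
]


def find_child(key: int, downlinks=DOWNLINKS) -> int:
    """Return the child block whose key range contains `key`.

    Only the first items of the page were printed, so answers for keys
    beyond the last listed separator are not meaningful.
    """
    separators = [k for k, _ in downlinks[1:]]
    # Number of separators <= key gives the position of the covering entry.
    idx = bisect_right(separators, key)
    return downlinks[idx][1]


if __name__ == "__main__":
    for key in (11, 366, 367, 1000):
        print(key, "-> leaf block", find_child(key))
```

The real descent compares index tuples and follows extra rules for concurrent page splits, so this only mirrors the routing idea.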
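So each item in the root page and the branch pages points to a block, and the index keys inside those blocks are organized as a B+ tree. Here the B+ tree has height 2 and, by the 8192/13 estimate, a page can hold about 630 entries, so the current topology can be sketched roughly as:
root page (block 412)
    |                          |
branch page (block 3)      branch page (block 411)
    |         |                |         |
leaf page (block 1) …      …   leaf page (block 632)
Of course all of this data is kept in sorted order by the B+ tree. A B+ tree can be given a fixed width; is the width here not specified but derived from the actual sizes instead? For example, 8192/13 = 630, i.e. M = 630.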
4. Estimating the cost of a scan
apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id = 11;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using test_pkey on public.test (cost=0.42..8.44 rows=1 width=4) (actual time=0.247..0.248 rows=1 loops=1)
Output: id
Index Cond: (test.id = 11)
Heap Fetches: 1
Buffers: shared hit=4
Planning Time: 0.151 ms
Execution Time: 0.287 ms
(7 rows)
costs = 1 meta page + root page + branch page + leaf page = 4
apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id < 11;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using test_pkey on public.test (cost=0.42..8.61 rows=11 width=4) (actual time=0.006..0.009 rows=10 loops=1)
Output: id
Index Cond: (test.id < 11)
Heap Fetches: 10
Buffers: shared hit=4
Planning Time: 0.100 ms
Execution Time: 0.027 ms
(7 rows)
costs = 1 meta page + root page + branch page + leaf page = 4; the cost of reading a single block is almost negligible.
apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id in (1,3,1000, 222222, 111111111, 1232244,11);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using test_pkey on public.test (cost=0.42..35.06 rows=7 width=4) (actual time=0.035..0.088 rows=5 loops=1)
Output: id
Index Cond: (test.id = ANY ('{1,3,1000,222222,111111111,1232244,11}'::integer[]))
Heap Fetches: 5
Buffers: shared hit=27
Planning Time: 0.095 ms
Execution Time: 0.113 ms
(7 rows)
Only five of the seven values exist in the leaf pages, so one way to account for the 27 buffer hits is:
costs = 1 meta page + 5*(root page + branch page 1 + branch page 2 + leaf page) + 2* (root page + branch page 1 + branch page 2) = 27
The code structures behind each index item will be covered in a later post.
https://www.cnblogs.com/scu-cjx/p/9960483.html