MySQL Index Basics

Date: 2019-12-21



An index is also called a key in MySQL. Like the table of contents of a book, an index lets us locate the target data precisely.

B+ Tree Indexes

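In MySQL, indexes are implemented at the storage engine layer. Most indexes are built on B-trees, usually the B+ tree variant. The defining feature of a B+ tree is that non-leaf nodes store only index information (keys and child pointers), while all data is stored in the leaf nodes. Algorithmically, this lets each non-leaf node hold more keys, which lowers the height of the tree and therefore the number of nodes visited per lookup. In practice, it makes the lookup structure lighter and more likely to fit entirely in memory, which speeds up searches.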
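To make the link between branching factor and tree height concrete, here is a rough sketch in Python. The row count and the fanout of 1000 are illustrative assumptions rather than InnoDB figures, and `min_levels` is a hypothetical helper that models an idealized tree in which every node is full.

```python
def min_levels(num_keys: int, fanout: int) -> int:
    """Smallest number of levels needed to address num_keys entries,
    assuming every node has exactly `fanout` children (idealized tree)."""
    levels, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


rows = 10**9                           # hypothetical table size
binary_levels = min_levels(rows, 2)    # two children per node
wide_levels = min_levels(rows, 1000)   # assumed fanout of a wide index node
print(binary_levels, wide_levels)
```

Under these assumptions, the wide tree needs an order of magnitude fewer levels than the binary one, and each level saved is one fewer node to read.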

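Two definitions of the B+ tree circulate on Baidu: a node with N keys has either N or N+1 children. The definition with N children (N subtrees) can be traced back to Yan Weimin's textbook 《數據結構》 (Data Structures). The other comes from Introduction to Algorithms, quoted here: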

B-trees are balanced search trees designed to work well on magnetic disks or other direct-access secondary storage devices. B-trees are similar to red-black trees (Chapter 13), but they are better at minimizing disk I/O operations. Many database systems use B-trees, or variants of B-trees, to store information.
B-trees differ from red-black trees in that B-tree nodes may have many children, from a handful to thousands. That is, the "branching factor" of a B-tree can be quite large, although it is usually determined by characteristics of the disk unit used. B-trees are similar to red-black trees in that every n-node B-tree has height O(lg n), although the height of a B-tree can be considerably less than that of a red-black tree because its branching factor can be much larger. Therefore, B-trees can also be used to implement many dynamic-set operations in time O(lg n).

B-trees generalize binary search trees in a natural manner. Figure 18.1 shows a simple B-tree. If an internal B-tree node x contains n[x] keys, then x has n[x] + 1 children. The keys in node x are used as dividing points separating the range of keys handled by x into n[x] + 1 subranges, each handled by one child of x. When searching for a key in a B-tree, we make an (n[x] + 1)-way decision based on comparisons with the n[x] keys stored at node x. The structure of leaf nodes differs from that of internal nodes; we will examine these differences in Section 18.1.

From Section 18.1: A common variant on a B-tree, known as a B+tree, stores all the satellite information in the leaves and stores only keys and child pointers in the internal nodes, thus maximizing the branching factor of the internal nodes.

This article follows the definition in Introduction to Algorithms. [Figure: schematic of a B+ tree]

A concrete example shows how the B+ tree is used in InnoDB. Take the following table:

CREATE TABLE People (
  last_name  varchar(50) NOT NULL,
  first_name varchar(50) NOT NULL,
  dob        date NOT NULL,
  gender     enum('m', 'f') NOT NULL,
  KEY (last_name, first_name, dob)  -- three-column composite index
);

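The table defines a three-column composite index on last_name, first_name, dob, so its index data is stored as follows: [Figure: index entries sorted by last_name, first_name, dob]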

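Comparing the first two entries with the last two shows that the index sorts values by the order in which the columns are defined. As an analogy, picture overriding the compareTo method of a class with three attributes.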
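The compareTo analogy can be sketched in Python, where tuples already compare column by column from left to right. The rows are made up for illustration, and plain Python string comparison stands in for whatever collation the real column uses.

```python
from datetime import date

# Hypothetical rows: (last_name, first_name, dob)
rows = [
    ("Green", "Bill", date(1980, 5, 1)),
    ("Allen", "Kim", date(1930, 7, 12)),
    ("Green", "Amy", date(1975, 3, 9)),
    ("Allen", "Cuba", date(1960, 1, 1)),
    ("Green", "Bill", date(1972, 11, 20)),
]

# Tuples compare on last_name first, then first_name, then dob,
# which mirrors the ordering of a (last_name, first_name, dob) index.
index_order = sorted(rows)
for entry in index_order:
    print(entry)
```

The two "Green, Bill" entries end up adjacent and are ordered by dob only because the first two columns tie.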

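This storage layout means that a B+ tree index works for full key value, key range, and key prefix lookups. Specifically, it supports the following types of queries: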

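1. Match the full value. Find records where last_name=a and first_name=b and dob=c.
2. Match a leftmost prefix. Find records where last_name=a (and first_name=b); the columns used must be consecutive and start from the leftmost column of the index definition.
3. Match a column prefix. Find records where last_name like 'J%'.
4. Match a range of values. Find records where last_name>='Allen' and last_name<='Bruce'.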
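These query types all reduce to finding one contiguous slice of the sorted entries. The sketch below models that with a sorted Python list and `bisect` instead of a real B+ tree; the data, the `HIGH` sentinel, and the helper names are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index entries (last_name, first_name, dob), sorted like index leaves.
entries = sorted([
    ("Allen", "Cuba", "1960-01-01"),
    ("Allen", "Kim", "1930-07-12"),
    ("Bruce", "Ann", "1990-02-02"),
    ("Green", "Amy", "1975-03-09"),
    ("Green", "Bill", "1972-11-20"),
    ("Jones", "Zed", "1988-08-08"),
])

HIGH = "\U0010ffff"  # sentinel that sorts after any ordinary string


def prefix_match(*cols):
    """Full-value match (3 columns) or leftmost-prefix match (1 or 2 columns)."""
    lo = bisect_left(entries, cols)
    hi = bisect_right(entries, cols + (HIGH,))
    return entries[lo:hi]


def last_name_like(prefix):
    """Column-prefix match, e.g. last_name LIKE 'J%'."""
    lo = bisect_left(entries, (prefix,))
    hi = bisect_right(entries, (prefix + HIGH,))
    return entries[lo:hi]


def last_name_between(low, high):
    """Range match on the leftmost column, bounds inclusive."""
    lo = bisect_left(entries, (low,))
    hi = bisect_right(entries, (high, HIGH))
    return entries[lo:hi]


print(prefix_match("Green"))
print(last_name_like("J"))
print(last_name_between("Allen", "Bruce"))
```

Each helper does two binary searches and returns the slice between them, which is why these lookups stay cheap as the index grows.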

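In addition, a "covering index" can return the requested values without reading the row itself, and ORDER BY can also be served by the index. Rules 1 and 2 restrict which columns of a composite index can be used; rules 3 and 4 restrict how each column can use the index. Once the storage structure of a composite index is understood, these rules, and the corresponding prohibitions, are not hard to derive. Given this implementation, B+ tree indexes have the following limitations: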

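1. If the lookup does not start from the leftmost column, or from the leftmost part of a column value, the index cannot be used. Examples: where first_name='Bill', or where dob='2017-11-20', or where last_name like '%reen'.
2. Columns of a composite index cannot be skipped. For example, with where last_name='Green' and dob='2017-11-20', only the last_name part of the index can be used.
3. Columns to the right of a range condition (or of a function or expression that defeats index use) cannot use the index. For example, with where last_name='Green' and first_name like 'J%' and dob='2017-11-20', the dob condition cannot use the index.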

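In summary, the order of the columns in an index has a large effect on performance. In limitation 3, the range condition is a case of column-prefix matching and does not by itself make the index unusable; the restriction comes from how the MySQL optimizer calls the storage engine.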
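Why the leftmost column matters can be seen from the sorted order itself. In the sketch below (made-up data, a sorted list standing in for the index leaves), rows matching the leftmost column sit next to each other, while rows matching only first_name are scattered, so nothing short of a full scan finds them.

```python
# Hypothetical index entries (last_name, first_name, dob) in index order.
entries = sorted([
    ("Allen", "Bill", "1960-01-01"),
    ("Allen", "Kim", "1930-07-12"),
    ("Bruce", "Ann", "1990-02-02"),
    ("Green", "Bill", "1972-11-20"),
    ("Green", "Joe", "2017-11-20"),
    ("Jones", "Bill", "1988-08-08"),
])


def positions(column: int, value: str):
    """Positions of the entries whose given column equals value."""
    return [i for i, entry in enumerate(entries) if entry[column] == value]


def is_contiguous(pos):
    """True when the positions form one unbroken run."""
    return pos == list(range(pos[0], pos[-1] + 1))


green = positions(0, "Green")   # condition on the leftmost column
bill = positions(1, "Bill")     # condition that skips the leftmost column
print(green, is_contiguous(green))
print(bill, is_contiguous(bill))
```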

Hash Indexes

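A hash index is built on a hash table. For each row, a hash code is computed over the indexed columns, and only queries that exactly match every indexed column can use the index. The key of the hash table is the hash value and the value is a pointer to the corresponding row; hash collisions are resolved by separate chaining.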
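A minimal sketch of that structure in Python follows. The table contents are made up, `zlib.crc32` is only a stand-in for whatever hash function a storage engine actually uses, and the row id plays the role of the row pointer.

```python
import zlib
from collections import defaultdict

# Hypothetical table: row id -> (last_name, first_name)
table = {
    1: ("Allen", "Kim"),
    2: ("Green", "Bill"),
    3: ("Green", "Amy"),
}


def key_hash(cols):
    # Hash all indexed columns together; CRC32 is an assumed stand-in hash.
    return zlib.crc32("\x00".join(cols).encode("utf-8"))


# hash code -> chain of row ids (separate chaining)
buckets = defaultdict(list)
for row_id, cols in table.items():
    buckets[key_hash(cols)].append(row_id)


def lookup(last_name, first_name):
    cols = (last_name, first_name)
    # Walk the chain and compare the real values to discard collisions.
    return [rid for rid in buckets.get(key_hash(cols), []) if table[rid] == cols]


print(lookup("Green", "Bill"))
print(lookup("Green", "Zoe"))
```

Because the hash is computed over both columns at once, there is no way to look up by last_name alone in this structure.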

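An index built on a hash table is compact, small, and very fast to search, and it can be used for join operations. It also has the following limitations: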

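1. It cannot avoid reading rows. The index stores only hash values and row pointers, so every query has to access the data rows; this is where the covering index of a B+ tree has a clear advantage.
2. It cannot be used for sorting. The entries are ordered by hash value, and hash values do not preserve the relative order of the indexed values.
3. It does not support partial key matching. The hash function takes the values of all indexed columns as input.
4. It does not support range queries. For the same reason as item 2, only equality comparisons are supported: =, IN, and <=> (NULL-safe comparison).
5. Too many hash collisions hurt performance (see the load factor of a hash table). Lookups, deletes, and updates all become noticeably more expensive.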
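The sorting and range limitations can be illustrated with a deliberately weak toy hash. `toy_hash` is an assumption made for this sketch, chosen so that a collision appears in a tiny data set; it is not how any real engine hashes keys.

```python
def toy_hash(key: str) -> int:
    # Deliberately weak hash (sum of bytes mod 10), for illustration only.
    return sum(key.encode("ascii")) % 10


names = ["Allen", "Bruce", "Green", "Jones"]

# Order in which a structure keyed by hash value would hold the names.
by_hash = sorted(names, key=toy_hash)
print(by_hash)

# Two different keys share a hash value, so they end up in the same chain.
print(toy_hash("Bruce"), toy_hash("Green"))

# A range condition cannot use the hash order: every key has to be examined.
in_range = [n for n in by_hash if "Allen" <= n <= "Bruce"]
print(in_range)
```

The hash order bears no relation to alphabetical order, which is why neither ORDER BY nor a range condition can be answered from the hash values alone.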

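InnoDB supports adaptive hash indexes. When it notices that some index values are accessed very frequently, it builds a hash index in memory on top of the B+ tree index for those columns (or prefixes).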

InnoDB's approach suggests a technique for fast lookups: manually create a hash column and index it.

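For example, a table that stores URLs often has to run: select id from urlTable where url="http://www.baidu.com". You can add a url_crc column to urlTable that holds the CRC32 hash of url, and build an index on url_crc. The query then becomes: select id from urlTable where url="http://www.baidu.com" and url_crc=CRC32("http://www.baidu.com"). Note that the url= condition must not be dropped, because it is what resolves hash collisions. (PS: Baidu's classic URL-deduplication interview question from years ago is built on the same idea.)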
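The lookup logic can be sketched outside the database as follows. This is a simulation under stated assumptions: `zlib.crc32` stands in for MySQL's CRC32(), a dict stands in for the index on url_crc, and the rows other than the baidu.com one are made up.

```python
import zlib


def crc32_of(url: str) -> int:
    # Assumed stand-in for MySQL's CRC32(); returns an unsigned 32-bit integer.
    return zlib.crc32(url.encode("utf-8"))


# Hypothetical rows of urlTable: (id, url, url_crc)
urls = ["http://www.baidu.com", "http://www.mysql.com", "http://www.example.com"]
rows = [(i, u, crc32_of(u)) for i, u in enumerate(urls, start=1)]

# Simulated index on url_crc: crc value -> ids of the rows holding that value
crc_index = {}
for row_id, _url, crc in rows:
    crc_index.setdefault(crc, []).append(row_id)
by_id = {row[0]: row for row in rows}


def find_ids(url: str):
    # Step 1: narrow down by the short integer hash (the indexed column).
    candidates = crc_index.get(crc32_of(url), [])
    # Step 2: keep the url= check so that colliding URLs are filtered out.
    return [rid for rid in candidates if by_id[rid][1] == url]


print(find_ids("http://www.baidu.com"))
```

The index holds 4-byte integers instead of strings of up to 255 characters, and step 2 is the reason the original url condition stays in the WHERE clause.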

Triggers can keep the hash column up to date:

CREATE TABLE pseudohash (
  id int unsigned NOT NULL auto_increment,
  url varchar(255) NOT NULL,
  url_crc int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (id),
  KEY (url_crc)  -- the index on the hash column that the lookup relies on
);
DELIMITER //
-- Fill in url_crc whenever a row is inserted
CREATE TRIGGER pseudohash_crc_ins BEFORE INSERT ON pseudohash FOR EACH ROW BEGIN
  SET NEW.url_crc = crc32(NEW.url);
END//
-- Recompute url_crc whenever a row is updated
CREATE TRIGGER pseudohash_crc_upd BEFORE UPDATE ON pseudohash FOR EACH ROW BEGIN
  SET NEW.url_crc = crc32(NEW.url);
END//
DELIMITER ;

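A hash function for this purpose should meet two goals: short hash values and few collisions. When collisions are numerous, a second hash (double hashing) can be used; as before, the condition on the original column must not be omitted.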