Tencent Database Diagnosis Competition: Problem Review and Analysis

Date: 2019-11-25

This is just my own analysis of the competition problems, drawing on some personal experience; it is meant for learning and discussion only.

Here is the official Git repository:
https://github.com/DBbrain/Diagnosis

Preliminary round

data

order table: 2000 rows
CREATE TABLE `order` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(32) COLLATE utf8_bin NOT NULL,
  `creator` varchar(24) COLLATE utf8_bin NOT NULL,
  `price` varchar(64) COLLATE utf8_bin NOT NULL,
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `status` tinyint(1) NOT NULL,
  PRIMARY KEY (`id`)
);

order_item table: 499760 rows
CREATE TABLE `order_item` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(32) COLLATE utf8_bin NOT NULL,
  `parent` bigint(20) NOT NULL,
  `status` int(11) NOT NULL,
  `type` varchar(12) COLLATE utf8_bin NOT NULL DEFAULT '0',
  `quantity` int(11) NOT NULL DEFAULT '1',
  `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);

1. select

Analysis

SELECT * FROM `order` o INNER JOIN order_item i  ON i.parent = o.id 
        ORDER BY o.status ASC, i.update_time DESC LIMIT  0, 20;

mysql> explain SELECT * FROM `order` o INNER JOIN order_item i 
    ->     ON i.parent = o.id  ORDER BY o.status ASC, i.update_time DESC LIMIT  0, 20;
+----+-------------+-------+------------+--------+---------------+---------+---------+---------------------------------+--------+----------+---------------------------------+
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref                             | rows   | filtered | Extra                           |
+----+-------------+-------+------------+--------+---------------+---------+---------+---------------------------------+--------+----------+---------------------------------+
|  1 | SIMPLE      | i     | NULL       | ALL    | NULL          | NULL    | NULL    | NULL                            | 497839 |   100.00 | Using temporary; Using filesort |
|  1 | SIMPLE      | o     | NULL       | eq_ref | PRIMARY       | PRIMARY | 8       | sql_optimization_match.i.parent |      1 |   100.00 | NULL                            |
+----+-------------+-------+------------+--------+---------------+---------+---------+---------------------------------+--------+----------+---------------------------------+

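Two tables are inner joined. According to the execution plan, order_item is chosen as the driving table. Because the sort columns come from two tables and sort in opposite directions, a temporary table is written; and because the sort cannot use an index, a filesort is needed.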

Analyzing the join columns

order_item.parent = order.id

mysql> select count(distinct(parent)) from order_item;  -- low selectivity, ignore
+-------------------------+
| count(distinct(parent)) |
+-------------------------+
|                     300 |
+-------------------------+
mysql> select count(distinct(id)) from `order`;  -- high selectivity, but id is already the primary key
+---------------------+
| count(distinct(id)) |
+---------------------+
|                2000 |
+---------------------+
1 row in set (0.00 sec)

order_item.parent has no index;
order.id is indexed.

Analyzing the sort columns

ORDER BY o.status ASC, i.update_time DESC

mysql> select count(distinct(status)) from `order`;  -- very low selectivity
+-------------------------+
| count(distinct(status)) |
+-------------------------+
|                       2 |
+-------------------------+
mysql> select count(distinct(update_time)) from `order_item`;  -- moderate selectivity
+------------------------------+
| count(distinct(update_time)) |
+------------------------------+
|                        32768 |
+------------------------------+

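The ORDER BY columns of the two tables sort in different directions, so a filesort will probably be needed; in addition, neither table has an index on its sort column.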
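The distinct counts above are easier to compare as ratios. A minimal sketch, assuming the order and order_item tables defined earlier; the column aliases are only illustrative:

```sql
-- Selectivity = distinct values / total rows. Values near 1 index well;
-- values near 0 rarely justify a single-column index.
SELECT COUNT(DISTINCT parent)      / COUNT(*) AS parent_selectivity,
       COUNT(DISTINCT update_time) / COUNT(*) AS update_time_selectivity
FROM order_item;

SELECT COUNT(DISTINCT status) / COUNT(*) AS status_selectivity
FROM `order`;
```

With the counts reported above, parent comes out at roughly 300 / 499760 ≈ 0.0006, update_time at roughly 32768 / 499760 ≈ 0.066, and status at 2 / 2000 = 0.001.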

Optimization

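From the query plan above, MySQL does one full scan of order_item and then a filesort. Since the SQL must sort on two columns, we can look for other ways to reduce the amount of data that reaches the filesort and so cut the elapsed time.

The sort column status in the order table has only two distinct values. After removing status from the ORDER BY as an experiment, the query became noticeably faster; at that point, the update_time column in the ORDER BY is a candidate for an index, and its selectivity is adequate.

Because status has only two values, the sort on it can be replaced by a UNION ALL, which avoids the situation where ORDER BY columns from different tables with different sort directions prevent index use.

This is the suggestion given officially (probably not an automatic ML rewrite). This kind of SQL rewrite has limitations, however: if the status column were not tinyint(1), or if new status values were added later, the SQL would have to be rewritten again and again.

One could instead push for a change on the application side and then re-optimize the indexes.

Also, after the rewrite, the suggested index is a composite index (update_time, parent). As the analysis above shows, parent has low selectivity, so the composite index and a single-column index on update_time perform about the same.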

# SQL rewrite (order_1 here and below is presumably a working copy of the order table)
SELECT o.*,i.* FROM  (
    ( SELECT o.id, i.id item_id FROM  `order_1` o 
            INNER JOIN order_item i ON i.parent =o.id
            WHERE  o.status = 0
            ORDER  BY i.update_time DESC LIMIT  0, 20)
    UNION ALL
    (SELECT o.id, i.id item_id FROM  `order_1` o
            INNER JOIN order_item i ON i.parent =o.id
            WHERE  o.status = 1
            ORDER  BY i.update_time DESC LIMIT  0, 20)
    ) tmp
    INNER JOIN `order_1` o ON tmp.id = o.id
    INNER JOIN order_item i ON tmp.item_id = i.id
    ORDER  BY o.status ASC,
    i.update_time DESC
    LIMIT  0, 20

# Add the index
alter table order_item add index `item_idx_1` (`update_time`,`parent`);

# Execution plan
+----+-------------+------------+------------+--------+---------------+------------+---------+---------------------------------+------+----------+---------------------------------+
| id | select_type | table      | partitions | type   | possible_keys | key        | key_len | ref                             | rows | filtered | Extra                           |
+----+-------------+------------+------------+--------+---------------+------------+---------+---------------------------------+------+----------+---------------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL          | NULL       | NULL    | NULL                            |   40 |   100.00 | Using temporary; Using filesort |
|  1 | PRIMARY     | o          | NULL       | eq_ref | PRIMARY       | PRIMARY    | 8       | tmp.id                          |    1 |   100.00 | NULL                            |
|  1 | PRIMARY     | i          | NULL       | eq_ref | PRIMARY       | PRIMARY    | 8       | tmp.item_id                     |    1 |   100.00 | NULL                            |
|  2 | DERIVED     | i          | NULL       | index  | NULL          | item_idx_1 | 12      | NULL                            |   20 |   100.00 | Using index                     |
|  2 | DERIVED     | o          | NULL       | eq_ref | PRIMARY       | PRIMARY    | 8       | sql_optimization_match.i.parent |    1 |    10.00 | Using where                     |
|  3 | UNION       | i          | NULL       | index  | NULL          | item_idx_1 | 12      | NULL                            |   20 |   100.00 | Using index                     |
|  3 | UNION       | o          | NULL       | eq_ref | PRIMARY       | PRIMARY    | 8       | sql_optimization_match.i.parent |    1 |    10.00 | Using where                     |
+----+-------------+------------+------------+--------+---------------+------------+---------+---------------------------------+------+----------+---------------------------------+
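To check the earlier remark that a single-column index on update_time performs about the same as the composite one, one branch of the UNION ALL rewrite can be explained with each index in place. This is only a sketch: the index name item_idx_ut is made up for illustration, and the resulting plan depends on the data and the MySQL version.

```sql
-- Hypothetical single-column alternative to (update_time, parent).
ALTER TABLE order_item ADD INDEX item_idx_ut (update_time);

-- One branch of the UNION ALL rewrite; compare the key and Extra columns
-- with the plan obtained using the composite index.
EXPLAIN
SELECT o.id, i.id AS item_id
FROM `order` o
INNER JOIN order_item i ON i.parent = o.id
WHERE o.status = 0
ORDER BY i.update_time DESC
LIMIT 0, 20;

-- Remove the experimental index afterwards.
ALTER TABLE order_item DROP INDEX item_idx_ut;
```

The composite index also stores parent (and, in InnoDB, the primary key id), so the branch can be answered from the index alone, which is what Using index reports in the plan above. With update_time alone, each index entry needs an extra row lookup to fetch parent.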

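Summary

1. When a column with very low selectivity is used for sorting, range comparison and the like, the operation can be rewritten as UNION ALL;
2. For sort columns, try to use an index to avoid a filesort; if that is unavoidable, try to reduce the amount of data before the filesort.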

2. update

Analysis

update `order` set create_time = now()
    where id in (select parent from order_item where type = 2 );

# Execution plan
mysql> explain update `order_1` set create_time = now() where id in (select parent from order_item where type = 2 );
+----+--------------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type        | table      | partitions | type  | possible_keys | key     | key_len | ref  | rows   | filtered | Extra       |
+----+--------------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
|  1 | UPDATE             | order_1    | NULL       | index | NULL          | PRIMARY | 8       | NULL |   2000 |   100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | order_item | NULL       | ALL   | NULL          | NULL    | NULL    | NULL | 496836 |     1.00 | Using where |
+----+--------------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+

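The UPDATE condition is an IN subquery. In the EXPLAIN output, note that select_type is DEPENDENT SUBQUERY: the outer query runs first and, if it matches N rows, the subquery is then executed N times, which is extremely inefficient.

The usual treatment for a subquery is to convert it to a join.

Optimization

1. The simplest subquery -> join optimization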

update `order` o  inner join (select parent from `order_item` where type = 2) tmp on o.id = tmp.parent  set create_time = now();

mysql> explain update `order` o  inner join (select parent from `order_item` where type = 2) tmp on o.id = tmp.parent  set create_time = now() \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: order_item
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 497839
     filtered: 10.00
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: UPDATE
        table: o
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: sql_optimization_match.order_item.parent
         rows: 1
     filtered: 100.00
        Extra: NULL
2 rows in set (0.00 sec)

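After the conversion the statement is already much faster than before and the DEPENDENT SUBQUERY is gone, but it still takes seconds. order_item is chosen as the driving table, yet it is essentially a full table scan, which means about 490,000 rows are joined against the order table; that is still expensive.

Looking at the original slow UPDATE: only the order table is modified, and the only condition is that id falls within the result of the subquery on order_item. Duplicate parent values are meaningless for the UPDATE, so parent can be aggregated once (GROUP BY). Since the subquery has the condition order_item.type = 2, type can be included in the aggregation as well.

order_item has only a primary key index. Both the equality condition and the aggregation on order_item work best with an index, so a composite index can be created, with the equality column first.

One more point: when creating the index, check whether the indexed column's type matches the type of the value it is compared with in the SQL.

2. Optimizing the join

Add the index: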
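The type check matters here because type is a varchar column while the original statement compares it with the number 2. A sketch, assuming an index on (type, parent) such as the idx_1 created in the next step:

```sql
-- String column compared with a number: MySQL compares the values as
-- numbers, so an index on `type` cannot be used for a lookup
-- (at best the index is scanned in full).
EXPLAIN SELECT parent FROM order_item WHERE type = 2;

-- String column compared with a string: the index can be used for a lookup.
EXPLAIN SELECT parent FROM order_item WHERE type = '2';
```

The two predicates are also not strictly equivalent: type = 2 matches any string that converts to the number 2, such as '02' or '2.0', whereas type = '2' matches only that exact string.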
    alter table `order_item` add index idx_1(type,parent);

SQL optimization:
    update `order` o inner join (
        select parent from `order_item` 
                where type = '2' group by type, parent) i 
        on o.id = i.parent set create_time = now();

mysql> explain update `order` o inner join (    select  parent from `order_item` where type = '2' group by type, parent ) i on o.id = i.parent set create_time = now()\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 571
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: UPDATE
        table: o
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: i.parent
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: order_item
   partitions: NULL
         type: range
possible_keys: idx_1
          key: idx_1
      key_len: 46
          ref: NULL
         rows: 571
     filtered: 100.00
        Extra: Using where; Using index for group-by

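After the optimization, the derived table built from order_item is the driving table, and the idx_1 index created above is used; the resulting temporary table is then joined with the order table. Thanks to the GROUP BY, the result set from order_item is much smaller, so it is chosen as the driving table.

Also note that the GROUP BY uses two columns. This is so that idx_1 can be used (GROUP BY parent and GROUP BY type, parent return the same number of rows, but their execution plans differ considerably).

After the optimization the execution time is at the millisecond level.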
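To see the plan difference between the two GROUP BY forms, the derived table can be explained both ways. A sketch, assuming idx_1 (type, parent) exists; the exact plans depend on the MySQL version and the table statistics:

```sql
-- Grouping on the full index prefix (type, parent); the plan shown above
-- reports "Using index for group-by" for this form.
EXPLAIN
SELECT parent FROM order_item
WHERE type = '2'
GROUP BY type, parent;

-- Grouping on parent only; it returns the same rows because type is fixed
-- by the WHERE clause, but the optimizer may pick a different plan.
EXPLAIN
SELECT parent FROM order_item
WHERE type = '2'
GROUP BY parent;
```

Deduplicating is safe for this UPDATE because a multi-table UPDATE in MySQL changes each matching row once, even if it matches several joined rows; the grouping only reduces the join work.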

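Summary

1. Choice of driving table: the small table should always drive the large one. The driving table is scanned in full, so indexes are usually added on the driven table;
2. If DEPENDENT SUBQUERY appears in the execution plan, it will certainly hurt the SQL's efficiency (it can also potentially widen the range of locks taken). The IN + subquery form triggers it easily; the subquery can be rewritten as a join;
3. For joins, the fewer rows that take part, the more efficient the execution, so reduce the amount of data entering the join as far as possible without changing the SQL's semantics;
4. Index column order: equality conditions > GROUP BY > ORDER BY;
5. Check that the indexed column's type matches the data type used in the SQL condition.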

Final round

data

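The selectivity calculations are omitted; only the verdicts are given here.

Because some tables have many rows, selectivity is computed from the distinct values in the first 5000 rows (this also works in production and lowers the overhead of the calculation). In extreme cases some small tables may be misjudged, but adding an index to a table with very few rows is of little value anyway.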
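The sampling idea can be sketched as follows, using the orders table defined below. The derived-table alias first_rows and the column aliases are illustrative, and the exact query the author ran is not given:

```sql
-- Estimate selectivity from a 5000-row sample instead of the full table.
-- LIMIT without ORDER BY returns whichever rows the scan yields first,
-- so the sample is not random.
SELECT COUNT(DISTINCT custkey) / COUNT(*) AS custkey_selectivity,
       COUNT(DISTINCT clerk)   / COUNT(*) AS clerk_selectivity
FROM (
    SELECT custkey, clerk
    FROM orders
    LIMIT 5000
) AS first_rows;
```

Because the sample is just the first rows scanned, columns whose values are clustered by insertion order can look less selective than they really are.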

# customer: 1,200,000 rows
CREATE TABLE `customer` (
  `custkey` int(11) NOT NULL,       -- selectivity OK
  `name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,  -- selectivity OK
  `address` varchar(40) NOT NULL,
  `nationkey` int(11) NOT NULL,     -- low selectivity
  `phone` char(15) NOT NULL,        -- selectivity OK
  `acctbal` decimal(15,2) NOT NULL,
  `mktsegment` char(10) NOT NULL,   -- low selectivity
  `comment` varchar(117) NOT NULL,
  PRIMARY KEY (`custkey`),
  KEY `idx_nationkey` (`nationkey`)
);

# nation: 25 rows
CREATE TABLE `nation` (
  `nationkey` int(11) NOT NULL,     -- selectivity OK
  `name` char(25) NOT NULL,
  `regionkey` int(11) NOT NULL,
  `comment` varchar(152) DEFAULT NULL,
  PRIMARY KEY (`nationkey`),
  KEY `idx_4_0` (`name`)
);

# orders: 12,000,000 rows
CREATE TABLE `orders` (
  `orderkey` int(11) NOT NULL,
  `custkey` int(11) NOT NULL,       -- selectivity OK
  `orderstatus` varchar(1) NOT NULL,
  `totalprice` decimal(15,2) NOT NULL,      -- selectivity OK
  `orderdate` date NOT NULL,
  `orderpriority` char(15) NOT NULL,
  `clerk` char(15) NOT NULL,                -- selectivity OK
  `shippriority` int(11) NOT NULL,
  `comment` varchar(79) NOT NULL,
  PRIMARY KEY (`orderkey`)
);

# region: 5 rows
CREATE TABLE `region` (
  `regionkey` int(11) NOT NULL,
  `name` varchar(25) NOT NULL,
  `comment` varchar(152) DEFAULT NULL,
  PRIMARY KEY (`regionkey`)
);

1. select

Analysis

select c.custkey, c.phone, sum(o.totalprice) totalprice
    from nation n 
    inner join customer c on c.nationkey = n.nationkey
    inner join orders o on o.clerk = c.name
    where n.name = "CHINA" and c.mktsegment = "HOUSEHOLD" and c.phone like "28-520%"
    group by c.custkey, c.phone;

# Execution plan
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows     | filtered | Extra                                              |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+----------------------------------------------------+
|  1 | SIMPLE      | n     | NULL       | ALL  | PRIMARY       | NULL | NULL    | NULL |       25 |    10.00 | Using where; Using temporary; Using filesort       |
|  1 | SIMPLE      | c     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |  1189853 |     0.11 | Using where; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | o     | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 10963843 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+----------------------------------------------------+

Three tables: customer c, nation n, orders o

customer

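where conditions:
             c.mktsegment = "HOUSEHOLD": low selectivity, skip
        ✔  c.phone like "28-520%": good selectivity, consider adding an index
    aggregation:
             group by c.custkey: good selectivity, but already the primary key, skip
        ✔  c.phone: same as the where condition, consider adding
    join conditions:
            c.nationkey = n.nationkey: low selectivity, skip
        ✔ o.clerk = c.name: high selectivity, consider adding an index

    advice:
        add index `dx_1_0`(name);
        add index `idx_1_1` (phone);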

nation

Only 25 rows; no index considered.

    An index could be added to nation, add index `idx_1_0`(name);  but it would bring little benefit

orders

join condition:
        ✔ o.clerk = c.name: high selectivity, consider adding an index

    advice:
        add index `idx_1_0` (clerk)
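Written out as full statements, the advice for customer and orders would look roughly as follows. The index names are the ones that appear in the execution plans of this section (including dx_1_0 as written); combining both customer indexes into one ALTER is only a convenience here, not something the original specifies.

```sql
-- customer: join column (name) and range-filter column (phone)
ALTER TABLE customer
    ADD INDEX dx_1_0 (name),
    ADD INDEX idx_1_1 (phone);

-- orders: join column
ALTER TABLE orders ADD INDEX idx_1_0 (clerk);
```

On tables of this size, each ALTER has to build the index over every row, so it is worth grouping index changes per table rather than issuing them one at a time.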

Optimization

After adding the three indexes from the analysis above, the execution plan is as follows

+----+-------------+-------+------------+--------+----------------+---------+---------+-------------------+----------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type   | possible_keys  | key     | key_len | ref               | rows     | filtered | Extra                                                               |
+----+-------------+-------+------------+--------+----------------+---------+---------+-------------------+----------+----------+---------------------------------------------------------------------+
|  1 | SIMPLE      | c     | NULL       | range  | dx_1_0,idx_1_1 | idx_1_1 | 45      | NULL              |       46 |    10.00 | Using index condition; Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | n     | NULL       | eq_ref | PRIMARY        | PRIMARY | 4       | dbaas.c.nationkey |        1 |    10.00 | Using where                                                         |
|  1 | SIMPLE      | o     | NULL       | ALL    | idx_1_0        | NULL    | NULL    | NULL              | 10963843 |    10.00 | Range checked for each record (index map: 0x2)                      |
+----+-------------+-------+------------+--------+----------------+---------+---------+-------------------+----------+----------+---------------------------------------------------------------------+

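Summary

1. With an inner join we cannot tell in advance which table will be the driving table, so we add indexes on the suitable columns of both sides;
2. When the SQL has many kinds of conditions, put the equality and aggregation columns into a composite index and index the join column separately;
3. If the table has very little data, an index brings little benefit and can be skipped;
4. DBbrain suggested a composite index; compared with the separate indexes above, performance is essentially the same.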

2. select

Analysis

select * from (
    select custkey, orderdate, sum(totalprice) as totalprice
        from orders group by custkey, orderdate
    ) o
    where orderdate = "2019-08-01"

# Execution plan
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+----------+----------+---------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key         | key_len | ref   | rows     | filtered | Extra                           |
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+----------+----------+---------------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ref  | <auto_key0>   | <auto_key0> | 3       | const |       10 |   100.00 | NULL                            |
|  2 | DERIVED     | orders     | NULL       | ALL  | NULL          | NULL        | NULL    | NULL  | 10963843 |   100.00 | Using temporary; Using filesort |
+----+-------------+------------+------------+------+---------------+-------------+---------+-------+----------+----------+---------------------------------+

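Only one table is involved and the GROUP BY needs a filesort. The SQL does not look complicated, yet the plan shows a derived table (<derived2>) being materialized and then probed by the outer query.

Looking at the SQL, the select * from (subquery) wrapper is redundant nesting and can be removed. The SQL can be rewritten as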

select custkey, orderdate, sum(totalprice) as totalprice
        from orders where orderdate = "2019-08-01" group by custkey, orderdate;

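Index analysis

where condition:
        ✔ orderdate = "2019-08-01": high selectivity, consider adding an index

    aggregation:
        ✔  group by custkey, orderdate: both columns have high selectivity, consider adding an index

    advice:
        equality conditions take priority over aggregation
        add index `idx_2_0` (orderdate, custkey)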
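As full statements, the advice and the rewritten query fit together like this. It is a sketch; single quotes are used for the date literal, which behaves the same as the double quotes above under MySQL's default SQL mode.

```sql
-- Equality column first, then the remaining GROUP BY column.
ALTER TABLE orders ADD INDEX idx_2_0 (orderdate, custkey);

-- With orderdate fixed by the WHERE clause, the matching index entries
-- are already ordered by custkey, so grouping can follow index order.
EXPLAIN
SELECT custkey, orderdate, SUM(totalprice) AS totalprice
FROM orders
WHERE orderdate = '2019-08-01'
GROUP BY custkey, orderdate;
```

Reversing the column order to (custkey, orderdate) would lose this property, because the equality condition could no longer narrow the index to one contiguous range.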

Optimization

With the rewritten SQL and the composite index, the execution plan is
+----+-------------+--------+------------+------+---------------+---------+---------+-------+------+----------+-----------------------+
| id | select_type | table  | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra                 |
+----+-------------+--------+------------+------+---------------+---------+---------+-------+------+----------+-----------------------+
|  1 | SIMPLE      | orders | NULL       | ref  | idx_2_0       | idx_2_0 | 3       | const |    1 |   100.00 | Using index condition |
+----+-------------+--------+------------+------+---------------+---------+---------+-------+------+----------+-----------------------+

If two separate indexes are added instead,
    add index `idx_2_1` (custkey);
    add index `idx_2_2` (orderdate);
the execution plan is
+----+-------------+--------+------------+------+---------------+---------+---------+-------+------+----------+--------------------------------------------------------+
| id | select_type | table  | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra                                                  |
+----+-------------+--------+------------+------+---------------+---------+---------+-------+------+----------+--------------------------------------------------------+
|  1 | SIMPLE      | orders | NULL       | ref  | idx_2_2       | idx_2_2 | 3       | const |    1 |   100.00 | Using index condition; Using temporary; Using filesort |
+----+-------------+--------+------------+------+---------------+---------+---------+-------+------+----------+--------------------------------------------------------+

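A filesort is needed here, and only the orderdate index (idx_2_2) is used, so the composite index is recommended.

Conclusion

1. Even for single-table SQL, watch for filesort and the like in the execution plan. Repeatedly nested subqueries affect performance to some degree; consider rewriting the SQL;
2. When adding indexes, equality conditions take priority over aggregation, join and other conditions.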

3. select

Analysis

select c.custkey, sum(o.totalprice) totalprice from customer c
        left join orders o on o.custkey = c.custkey
        where c.phone like "33-64%" and c.name like concat("Customer#00003", "%")
        group by c.custkey

With the indexes added for the first two statements already in place, the execution plan is
+----+-------------+-------+------------+-------+------------------------+---------+---------+------+----------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys          | key     | key_len | ref  | rows     | filtered | Extra                                                               |
+----+-------------+-------+------------+-------+------------------------+---------+---------+------+----------+----------+---------------------------------------------------------------------+
|  1 | SIMPLE      | c     | NULL       | range | PRIMARY,dx_1_0,idx_1_1 | idx_1_1 | 45      | NULL |      552 |     1.63 | Using index condition; Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | o     | NULL       | ALL   | NULL                   | NULL    | NULL    | NULL | 10963843 |   100.00 | Using where; Using join buffer (Block Nested Loop)                  |
+----+-------------+-------+------------+-------+------------------------+---------+---------+------+----------+----------+---------------------------------------------------------------------+

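The customer table already uses an index; whether it needs other indexes is analyzed below;
the orders table is scanned in full, 12,000,000 rows, so an index is probably missing;

Two tables: customer c, orders o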

customer

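where conditions:
         c.phone like "33-64%": index already added for the first select
         c.name like concat("Customer#00003", "%"): index already added for the first select
aggregation:
         group by c.custkey: good selectivity, but already the primary key, skip
join condition:
         o.custkey = c.custkey: good selectivity, but already the primary key, skip

advice:
    none

orders

join condition:
        o.custkey = c.custkey: high selectivity, consider adding an index

advice:
        add index `idx_3_0` (custkey)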

Optimization

Execution plan after adding the index:

+----+-------------+-------+------------+-------+------------------------+---------+---------+-----------------+------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys          | key     | key_len | ref             | rows | filtered | Extra                                                               |
+----+-------------+-------+------------+-------+------------------------+---------+---------+-----------------+------+----------+---------------------------------------------------------------------+
|  1 | SIMPLE      | c     | NULL       | range | PRIMARY,dx_1_0,idx_1_1 | idx_1_1 | 45      | NULL            |  552 |     1.63 | Using index condition; Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | o     | NULL       | ref   | idx_3_0                | idx_3_0 | 4       | dbaas.c.custkey |   13 |   100.00 | NULL                                                                |
+----+-------------+-------+------------+-------+------------------------+---------+---------+-----------------+------+----------+---------------------------------------------------------------------+

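After adding the index, far fewer rows of the orders table are scanned and execution is much more efficient.

Summary

1. On the differences between [Using where], [Using index], [Using index condition] and [Using where; Using index]. (Why summarize this? I set up two databases with identical data and base table structures; in one the indexes were added according to my own analysis, in the other according to the official suggestions. Both ran efficiently, but the Extra column of their execution plans differed. I meant to settle it with Google, but of the three top-ranked blog posts two had identical content and the third gave a completely different explanation, so I tried it myself. The conclusion is given here, with the test procedure attached at the end):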

4. select

Analysis

select c.custkey, c.phone from nation n
        inner join customer c on c.nationkey = n.nationkey
        where n.name = "CHINA" and exists (
            select 1 from orders o where o.custkey = c.custkey and o.orderdate = "1998-08-11");

With the existing indexes above, the execution plan is

+----+--------------------+-------+------------+------+-----------------+---------+---------+-----------------------+---------+----------+----------------------------------------------------+
| id | select_type        | table | partitions | type | possible_keys   | key     | key_len | ref                   | rows    | filtered | Extra                                              |
+----+--------------------+-------+------------+------+-----------------+---------+---------+-----------------------+---------+----------+----------------------------------------------------+
|  1 | PRIMARY            | n     | NULL       | ref  | PRIMARY,idx_1_0 | idx_1_0 | 75      | const                 |       1 |   100.00 | Using index                                        |
|  1 | PRIMARY            | c     | NULL       | ALL  | NULL            | NULL    | NULL    | NULL                  | 1189853 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
|  2 | DEPENDENT SUBQUERY | o     | NULL       | ref  | idx_2_0,idx_3_0 | idx_2_0 | 7       | const,dbaas.c.custkey |       1 |   100.00 | Using index                                        |
+----+--------------------+-------+------------+------+-----------------+---------+---------+-----------------------+---------+----------+----------------------------------------------------+

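DEPENDENT SUBQUERY shows up again; it appears frequently with IN/EXISTS + subquery. Its harm was explained above; once it appears, look for a way to rewrite the SQL. Since this is EXISTS + subquery, the strategy is to rewrite it as a join.

The most straightforward rewrite: inner join everything first, then apply the WHERE conditions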

    select c.custkey, c.phone from nation n
        inner join customer c on c.nationkey = n.nationkey
        inner join orders o on o.custkey = c.custkey
      where n.name = "CHINA" and o.orderdate = "1998-08-11";

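The official SQL is more complex but does much the same thing; it additionally tries to use GROUP BY to reduce the number of rows entering the join. The official answer is given below without much further explanation [although dropping the GROUP BY would work better here]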

SELECT `t1`.`custkey`, `t1`.`phone` FROM 
        ( SELECT * FROM `dbaas`.`nation` AS `t` WHERE `t`.`name` = 'CHINA' ) AS `t0`
    INNER JOIN `dbaas`.`customer` AS `t1` 
        ON `t0`.`nationkey` = `t1`.`nationkey`
    INNER JOIN (
        SELECT `t2`.`custkey` FROM `dbaas`.`orders` AS `t2` 
            WHERE `t2`.`orderdate` = '1998-08-11' GROUP BY `t2`.`custkey` ) AS `t5` 
        ON `t1`.`custkey` = `t5`.`custkey`

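There is not much to suggest for indexes here; the condition columns all already have suitable indexes.

Optimization

The execution plan after optimization: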

+----+-------------+-------+------------+--------+-----------------+---------+---------+-----------------+------+----------+-------------+
| id | select_type | table | partitions | type   | possible_keys   | key     | key_len | ref             | rows | filtered | Extra       |
+----+-------------+-------+------------+--------+-----------------+---------+---------+-----------------+------+----------+-------------+
|  1 | SIMPLE      | n     | NULL       | ref    | PRIMARY,idx_1_0 | idx_1_0 | 75      | const           |    1 |   100.00 | Using index |
|  1 | SIMPLE      | o     | NULL       | ref    | idx_2_0,idx_3_0 | idx_2_0 | 3       | const           |   20 |   100.00 | Using index |
|  1 | SIMPLE      | c     | NULL       | eq_ref | PRIMARY         | PRIMARY | 4       | dbaas.o.custkey |    1 |    10.00 | Using where |
+----+-------------+-------+------------+--------+-----------------+---------+---------+-----------------+------+----------+-------------+

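The driving tables of the two joins are n and o, and ref is const; performance is far better than with the DEPENDENT SUBQUERY.

Summary

1. Not every complex join needs GROUP BY; it depends on the data distribution. If GROUP BY does not significantly reduce the number of rows joined, it is unnecessary;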
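One caveat on the join rewrites in this problem: EXISTS returns each customer at most once, whereas a plain inner join returns one row per matching order, so a customer with several orders on the same date would appear several times. The GROUP BY in the official answer removes those duplicates as a side effect. If the GROUP BY is dropped, a DISTINCT keeps the result identical to the original query. The following is a sketch under that assumption, not the official answer:

```sql
-- Join rewrite of the EXISTS query; DISTINCT collapses customers that
-- have more than one order on the given date.
SELECT DISTINCT c.custkey, c.phone
FROM nation n
INNER JOIN customer c ON c.nationkey = n.nationkey
INNER JOIN orders o   ON o.custkey = c.custkey
WHERE n.name = 'CHINA'
  AND o.orderdate = '1998-08-11';
```

Whether the DISTINCT costs anything noticeable depends on how many rows survive the date filter; with the estimate of about 20 rows in the plan above it should be negligible.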

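Random thoughts

MySQL's query optimizer is fairly complex logic. Expecting it to work well presupposes that the SQL is written sensibly and that suitable indexes exist.

When optimizing SQL, we usually first consider rewriting the SQL and then add the indexes the rewritten SQL needs. (In actual database development, especially on the provider side of a B2B service, we first try to fix a slow query by adding indexes without requiring any application change; only if indexes cannot solve the problem does the application need to be changed accordingly.)

First, SQL rewriting. There is a lot to consider here, because MySQL's optimizer and executor do so much. If AI could rewrite and optimize SQL automatically, the consequences are hard to imagine... (another wave of DBAs out of work). Done by hand, it comes down to experience: look at the anomalies in the execution plan and think about how to improve them, for example rewriting a subquery as a join. Rewriting ORDER BY status as UNION ALL, as in the preliminary-round problem, does help, but it is not a general method; this area is simply too flexible.
Second, by comparison, having AI give index suggestions when the SQL and table structures are known feels much more realistic. Indexing has some general rules that are widely covered online; here are a few I jotted down, for what they are worth:

1. Find all condition columns and compute their selectivity. Columns with very low selectivity do not need an index, and the same goes for tables with very little data, where an index brings little benefit;
2. Condition priority: equality > aggregation (GROUP BY/ORDER BY) > join. Within the same priority, build the composite index by selectivity;
3. With aggregation, if too many rows remain after aggregating and too many lookups back to the table are needed, MySQL may not use these indexes;
4. The driving table needs no index for the join itself, since its rows are read in any case; for an inner join, where the driving table cannot be known in advance, consider adding indexes on the suitable columns of both tables.

More rambling...

Everyone is moving workloads to the cloud now, so intelligent DB diagnosis in the cloud will inevitably become a necessity. How far AI can go is unclear; who knows what the future holds.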