The following is based on the MySQL Community Edition 8.0 source code.
MySQL JOIN syntax: https://dev.mysql.com/doc/refman/8.0/en/join.html
STRAIGHT_JOIN: similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order. STRAIGHT_JOIN has two usages: one is written in place of JOIN, as a special kind of INNER JOIN that hints the join order at that position; the other is written right after SELECT, forcing all JOINs in that query block to follow the table order given by the user. Judging from the optimizer code, the latter usage cannot coexist with semi-join (Optimize_table_order::optimize_straight_join: DBUG_ASSERT(join->select_lex->sj_nests.is_empty())).
join order hints: Join-order hints affect the order in which the optimizer joins tables, including JOIN_FIXED_ORDER, JOIN_ORDER, JOIN_PREFIX and JOIN_SUFFIX.
JOIN types: INNER JOIN, OUTER JOIN, SEMI JOIN, LEFT/RIGHT JOIN, etc.
Materialization: usually happens with subqueries (often together with semi-join). Materialization speeds up query execution by generating the subquery result as a temporary table, normally in memory.
Statistics: metadata about a table fetched from storage, such as row count, min/max/sum/avg, key range, etc., used to assist plan optimization.
table dependencies: for A LEFT JOIN B, B depends on A and on A's own dependencies. (To be confirmed: whether DEPEND JOIN semantics are also expressed through table dependencies.)
table access path: An access path may use an index scan, a table scan, a range scan or ref access; this appears as the join type in EXPLAIN.
index scan: usually refers to a secondary-index scan (in MySQL the primary-key index is clustered, i.e. stored together with the row data).
table scan: scans the table directly.
range scan: for conditions on indexed columns that can be converted into range queries, MySQL tries to turn them into a range scan to avoid useless scanning outside the range. A single-range query is similar to an index scan or table scan with the range condition pushed down; range queries also support extracting multiple ranges.
ref: the join field is indexed, but the index is not a PK or unique NOT NULL index.
eq_ref: the join field is indexed by a PK or unique NOT NULL index, meaning each record joins to at most one row of the right table.
Layout of the table counters in the JOIN object in the MySQL source (see the comment; the variable names alone are ambiguous):
```cpp
/**
  Before plan has been created, "tables" denotes the number of input tables
  in the query block and "primary_tables" is equal to "tables".

  After plan has been created (after JOIN::get_best_combination()), the
  JOIN_TAB objects are enumerated as follows:
  - "tables" gives the total number of allocated JOIN_TAB objects
  - "primary_tables" gives the number of input tables, including
    materialized temporary tables from semi-join operation.
  - "const_tables" are those tables among primary_tables that are detected
    to be constant.
  - "tmp_tables" is 0, 1 or 2 (more if windows) and counts the maximum
    possible number of intermediate tables in post-processing (ie sorting
    and duplicate removal). Later, tmp_tables will be adjusted to the
    correct number of intermediate tables, @see JOIN::make_tmp_tables_info.
  - The remaining tables (ie. tables - primary_tables - tmp_tables) are
    input tables to materialized semi-join operations.

  The tables are ordered as follows in the join_tab array:
  1. const primary table
  2. non-const primary tables
  3. intermediate sort/group tables
  4. possible holes in array
  5. semi-joined tables used with materialization strategy
*/
uint tables;          ///< Total number of tables in query block
uint primary_tables;  ///< Number of primary input tables in query block
uint const_tables;    ///< Number of primary tables deemed constant
uint tmp_tables;      ///< Number of temporary tables used by query
```
JOIN represents the join plan of a query, and it is also passed around as the plan's context (hence in implementations such as parallel query, where only the topmost parent query is actually worth optimizing, JOIN acts more like a plain context).
best_positions stores the final optimized table order.
best_read stores the final cost.
best_ref stores the input table sequence; the optimizer optimizes best_ref.
make_join_plan
Called from JOIN::optimize; computes the best join order and builds the join plan. Steps:
Here is an overview of the logic of this function:
- Initialize JOIN data structures and set up basic dependencies between tables.
- Update dependencies based on join information. For tables involved in outer joins or recursive references, propagate_dependencies() is run (a transitive-closure algorithm) to build the complete dependency relation. (Exactly what "recursive" refers to here is unconfirmed: nested joins, or the WITH RECURSIVE syntax?)
- Make key descriptions (update_ref_and_keys()). This step is fairly involved: the intent is to find the join conditions among all conditions and identify the keys (i.e. indexes) relevant to those join conditions, in preparation for deciding whether the join type will be ref/ref_or_null/index, etc. MySQL adds quite a few special cases here, e.g. special handling of "key IS NULL".
- Pull out semi-join tables based on table dependencies.
- Extract tables with zero or one row as const tables. This and the next three steps are the const-table optimization: compute the const tables first, then substitute variables with constants. Const tables are detected using the retrieved statistics.
- Read contents of const tables, substitute columns from these tables with actual data. Also keep track of empty tables vs. one-row tables.
- After const table extraction based on row count, more tables may have become functionally dependent. Extract these as const tables.
- Add new sargable predicates based on retrieved const values.
- Calculate the number of rows to be retrieved from each table (this is where statistics results are fetched).
- Calculate cost of potential semi-join materializations.
- Calculate best possible join order based on available statistics, i.e. Optimize_table_order::choose_table_order below.
- Fill in remaining information for the generated join order.
The core object is ha_statistics; its most important field is records, the table row count.
```cpp
class ha_statistics {
  ulonglong data_file_length;      /* Length of data file */
  ulonglong max_data_file_length;  /* Max length of data file */
  ulonglong index_file_length;
  ulonglong max_index_file_length;
  ulonglong delete_length;         /* Free bytes */
  ulonglong auto_increment_value;
  /*
    The number of records in the table.
    0     - means the table has exactly 0 rows
    other - if (table_flags() & HA_STATS_RECORDS_IS_EXACT) the value is the
            exact number of records in the table, else it is an estimate
  */
  ha_rows records;
  ha_rows deleted;        /* Deleted records */
  ulong mean_rec_length;  /* physical reclength */
  /* TODO: create_time should be retrieved from the new DD. Remove this. */
  time_t create_time;     /* When table was created */
  ulong check_time;
  ulong update_time;
  uint block_size;        /* index block size */
  /* number of buffer bytes that native mrr implementation needs */
  uint mrr_length_per_rec;
};
```
MyRocks updates its stats in handler::info(). info() is called to refresh statistics on the write paths other than INSERT and in some query scenarios (at more than a dozen call sites).
```cpp
/**
  General method to gather info from handler

  ::info() is used to return information to the optimizer.
  SHOW also makes use of this data.

  Another note, if your handler doesn't provide exact record count, you will
  probably want to have the following in your code:
    if (records < 2)
      records = 2;
  The reason is that the server will optimize for cases of only a single
  record. If in a table scan you don't know the number of records, it will
  probably be better to set records to two so you can return as many
  records as you need.

  Along with records, a few more variables you may wish to set are:
    records, deleted, data_file_length, index_file_length, delete_length,
    check_time
  Take a look at the public variables in handler.h for more information.
  See also my_base.h for a full description.

  @param flag Specifies what info is requested
*/
virtual int info(uint flag) = 0;

// Possible bit values for flag follow. CONST is rarely used outside
// initialization; VARIABLE is used in most cases, since the variables it
// covers are indeed updated frequently; ERRKEY is not used on the normal
// path, only to look up information when reporting errors; AUTO is
// specifically for the auto-increment value, which can be read from the
// in-memory table-level object.

/*
  Recalculate loads of constant variables. MyISAM also sets things directly
  on the table share object. Check whether this should be fixed since
  handlers should not change things directly on the table object.

  Monty comment: This should NOT be changed! It's the handler's
  responsibility to correct table->s->keys_xxxx information if keys have
  been disabled.

  The most important parameters set here are records per key on all
  indexes, block_size, and primary key ref_length.

  For each index there is an array of rec_per_key. As an example, if we
  have an index with three attributes a, b and c, we will have an array of
  3 rec_per_key.
  rec_per_key[0] is an estimate of number of records divided by the number
  of unique values of the field a.
  rec_per_key[1] is an estimate of the number of records divided by the
  number of unique combinations of the fields a and b.
  rec_per_key[2] is an estimate of the number of records divided by the
  number of unique combinations of the fields a, b and c.

  Many handlers only set the value of rec_per_key when all fields are bound
  (rec_per_key[2] in the example above).

  If the handler doesn't support statistics, it should set all of the above
  to 0.

  Update the 'constant' part of the info:
  handler::max_data_file_length, max_index_file_length, create_time,
  sortkey, ref_length, block_size, data_file_name, index_file_name,
  table->s->keys_in_use, keys_for_keyread, rec_per_key
*/
#define HA_STATUS_CONST 8
/*
  Update the 'variable' part of the info:
  handler::records, deleted, data_file_length, index_file_length,
  check_time, mean_rec_length
*/
#define HA_STATUS_VARIABLE 16
/*
  Get the index number of the unique index that reported a duplicate key.
  Updates handler::errkey and handler::dupp_ref, see handler::get_dup_key().
*/
#define HA_STATUS_ERRKEY 32
/* Update handler::auto_increment_value */
#define HA_STATUS_AUTO 64
```
The Optimize_table_order class is responsible for the actual join reorder work. Its entry point is its only public method, choose_table_order, which is called from make_join_plan. Optimize_table_order depends on three prerequisites:
choose_table_order steps:
Initialize the cost of the const tables; if all tables are const tables, short-circuit and return immediately.
If we are inside the optimization of an sjm (semi-join materialization) plan, sort once to bring the semi-join tables forward (i.e. the subquery, which is pre-computed ahead of time and can be materialized on demand).
Otherwise, for non-STRAIGHT_JOIN queries, tables without dependency constraints are sorted by row count in ascending order:
```
if (SELECT_STRAIGHT_JOIN option is set)
  reorder tables so dependent tables come after tables they depend on,
  otherwise keep tables in the order they were specified in the query
else
  Apply heuristic: pre-sort all access plans with respect to the number
  of records accessed.
```
(The sort algorithm is merge sort when there are >= 5 tables, insertion sort otherwise.)
optimize_straight_join: STRAIGHT_JOIN means the user has fixed the JOIN order, so the work here is what its comment says: "Select the best ways to access the tables in a query without reordering them." For non-STRAIGHT_JOIN queries, the heuristic greedy algorithm greedy_search performs the join reorder.
optimize_straight_join:
Only supports STRAIGHT_JOIN; DBUG_ASSERT(join->select_lex->sj_nests.is_empty()): incompatible with semi-join, and only concerned with primary tables.
For each JOIN_TAB, best_access_path computes the optimal access path. For an intuitive summary of what best_access_path considers, see the table access path notes above and the EXPLAIN documentation's description of join types.
set_prefix_join_cost computes the cost of the current table under its chosen access path and folds it into the overall cost model. The cost is computed as follows:
```cpp
m_row_evaluate_cost = 0.1  // default value

/*
  Cost of accessing the table in course of the entire complete join
  execution, i.e. cost of one access method use (e.g. 'range' or 'ref'
  scan) multiplied by estimated number of rows from tables earlier in the
  join sequence.
*/
read_cost = get_read_cost(table)

void set_prefix_join_cost(uint idx, const Cost_model_server *cm) {
  if (idx == 0) {
    prefix_rowcount = rows_fetched;
    prefix_cost = read_cost + prefix_rowcount * m_row_evaluate_cost;
  } else {
    // (this - 1) refers to the previous table in the join order
    prefix_rowcount = (this - 1)->prefix_rowcount * rows_fetched;
    prefix_cost = (this - 1)->prefix_cost + read_cost +
                  prefix_rowcount * m_row_evaluate_cost;
  }
  // filter_effect is a float in [0,1]: the fraction of rows surviving the
  // conditions evaluated in the executor. 1 means nothing is filtered,
  // 0 means all rows are filtered out and no rows are left. It determines
  // how many row combinations will be joined with the next table.
  prefix_rowcount *= filter_effect;
}
```
greedy_search:
```
bool Optimize_table_order::best_extension_by_limited_search(
    table_map remaining_tables, uint idx, uint current_search_depth);

procedure greedy_search
  input: remaining_tables
  output: partial_plan;
{
  partial_plan = <>;
  do {
    (table, a) = best_extension_by_limited_search(partial_plan,
                                                  remaining_tables,
                                                  limit_search_depth);
    partial_plan = concat(partial_plan, (table, a));
    remaining_tables = remaining_tables - table;
  } while (remaining_tables != {})
  return partial_plan;
}
```
Simply put: at each step, find the table that best extends the current join path, and append it to the plan. This approach is heavily influenced by the choice of the first table (which has no join relation yet, so it can only be chosen by post-filter cardinality, usually a small table), and picking the smallest table first is not necessarily optimal. A typical greedy optimizer would evaluate the cost once with each table as the first table and then pick the cheapest of the N resulting plans; MySQL just returns the first complete plan it finds.
best_extension_by_limited_search is a heuristic search procedure; search_depth is the maximum depth it may search. The first half of best_extension_by_limited_search is similar to optimize_straight_join:
Compute best_access_path and the resulting cost.
If the accumulated cost already exceeds best_read, prune immediately; there is no need to search further.
If prune_level=PRUNE_BY_TIME_OR_ROWS is enabled: when best_row_count and best_cost both already exceed the current rows and cost (note that in newer versions this is an AND relation), and the table is not depended on by any later table (it can be viewed as the last node on this path of the graph, hence prunable directly, though not necessarily the last node of the whole plan), set best_row_count and best_cost to the current values. The code is rather convoluted; seen together with the whole loop, it essentially picks at each step the table that is best along the two dimensions (rowcount, cost), so the so-called pruning in effect becomes something like a strengthened greedy.
eq_ref gets preferential treatment: upon encountering the first eq_ref, all eq_ref joins are recursively collected. (The original author reasons that eq_ref is a 1:1 mapping, so its cost can be considered roughly constant; the eq_ref sequence is generated ahead of time and, during later optimization, can be treated as a single block placed at any position in the order, provided the eq_refs can remain consecutive.)
If there are still remaining_tables, recurse and continue until remaining is empty.