Intel's PAUSE Instruction Change Hurts MySQL Performance: How Do You Fix It?

Date: 2020-04-17

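Thanks to its open-source nature, mature commercial operation, well-run community, and steady stream of feature improvements, MySQL has become the standard relational database for internet companies. It is fair to say that x86 servers and Linux, as infrastructure, together with MySQL form the foundation of internet data storage services, and the three reinforce one another. This article shares a case from our production work: a change in the cycle count of Intel's PAUSE instruction created a MySQL performance bottleneck, and the Meituan MySQL DBA team analyzed, located, and optimized it step by step across all three layers. We hope the approach is useful to you.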

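1. Background

In 2017, Intel released Purley, its new-generation server platform, and reorganized the Intel Xeon Scalable Processor line into four tiers: Platinum, Gold, Silver, and Bronze. Product positioning and structure became clearer as a result.

Meituan's back-end services, such as high-volume online transactions and data storage, depend on a large fleet of high-performance servers. As some of our E-series servers on the Grantley platform approached end of life, and as the product line itself moved forward, the back-end storage (MySQL) of our RDS (relational database service) began bringing Purley-platform Skylake servers online in large numbers from 2019, including the Silver 4110.

Compared with the previous-generation E5-2620 V4, the Silver 4110 supports higher memory frequency, more memory channels, a larger L2 cache, and a faster bus transfer rate. Intel's official figures show the Silver 4110 performing 10% better than the E5-2620 V4.

However, as the number of Skylake servers in production grew and more services moved onto them, the Meituan MySQL DBA team found that some MySQL instances did not perform as expected, and some regressed substantially. After sustained analysis, we identified a performance bottleneck on Skylake servers:

CPU load was relatively high.
Throughput metrics such as TPS dropped.

Next, we analyze the problem from three angles (the Intel CPU, the ut_delay function, and the PAUSE instruction) and explore possible optimizations.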

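2. Performance Analysis

2.1 CPU performance differences between Grantley and Purley

First, we benchmarked the CPUs of the two platforms (Grantley and Purley) side by side under different operating systems.

The benchmark data can be summarized as follows:

1. In the oltp_write_only (write-only) scenario, the Purley 4110 shows a clear performance drop.
2. On the same Purley 4110, oltp_write_only (write-only) performance is better on CentOS 7 than on CentOS 6.

We plotted the differences as a line chart:

In the chart above, CentOS 7 outperforms CentOS 6 on the same Purley 4110. The reasons are outside the main focus of this article, so we only touch on them briefly here.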

New MCS-based Locking Mechanism
Red Hat Enterprise Linux 7.1 introduces a new locking mechanism, MCS locks. This new locking mechanism significantly reduces spinlock overhead in large systems, which makes spinlocks generally more efficient in Red Hat Enterprise Linux 7.1.

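According to the Red Hat release notes, a new locking mechanism, MCS locks, was introduced starting with kernel 3.10.0-229. It reduces spinlock overhead and so runs more efficiently. With an ordinary spinlock on a multi-core system, all CPUs spin on the same shared variable and only one of them can acquire it at a time; to keep the data correct, the cache coherence protocol must synchronize and invalidate the state and contents of that cache line across all CPUs, which degrades performance. An MCS lock gives each CPU its own local "spinlock" variable and spins only on that, avoiding the cache line traffic and improving performance. Spinlock optimization remains fairly contentious in the community, though; qspinlock was later built on top of MCS and merged in the 4.x kernels. For implementation details, see: MCS locks and qspinlocks.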
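To make the "spin on a local variable" idea concrete, here is a minimal user-space sketch of an MCS-style queue lock written with C11 atomics. It is not the Linux kernel implementation; the names mcs_node, mcs_lock, mcs_lock_init, mcs_lock_acquire, and mcs_lock_release are hypothetical, and default sequentially consistent atomics are used for simplicity rather than speed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; a teaching sketch, not the Linux kernel code. */
struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the wait queue */
    atomic_bool locked;               /* true while this waiter must spin */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;  /* last waiter in the queue, or NULL */
};

static void mcs_lock_init(struct mcs_lock *lock)
{
    atomic_init(&lock->tail, NULL);
}

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    assert(lock != NULL && me != NULL);
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);

    /* Append ourselves to the queue; the old tail is our predecessor. */
    struct mcs_node *prev = atomic_exchange(&lock->tail, me);
    if (prev == NULL)
        return;                       /* queue was empty: lock acquired */

    atomic_store(&prev->next, me);
    /* Spin only on our own node, so the cache line stays local. */
    while (atomic_load(&me->locked))
        ;                             /* a real lock would PAUSE here */
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to reset the queue to empty. */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;
        /* A new waiter swapped the tail but has not linked itself yet. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);  /* hand the lock to the successor */
}
```

Each thread passes its own mcs_node (typically on its stack) to acquire and release. Because a waiter only reads its own locked flag, releasing the lock touches one waiter's cache line instead of invalidating a line shared by every spinning CPU.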

With a rough picture of what changed in CentOS 7, we now dig into why the Skylake Silver 4110 loses performance.

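3. CPU Performance Tracing

3.1 Locating the hot function

We located the 4110 bottleneck in the following steps:

First, use perf top to observe CPU overhead on Linux.
Then, use perf record to capture each function's share of CPU cycles.
Finally, use a flame graph to confirm the hot function.

The function with the largest share of CPU consumption is ut_delay.

Let's dig further into the call chain: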

# Children      Self  Command  Shared Object        Symbol                                                                                                                                                                            
# ........  ........  .......  ...................  ..................................................................................................................................................................................
#
    93.54%     0.00%  mysqld   libpthread-2.17.so   [.] start_thread
            |
            ---start_thread
               |          
               |--77.07%--pfs_spawn_thread
               |          |          
               |           --77.05%--handle_connection
               |                     |          
               |                      --76.97%--do_command
               |                                |          
               |                                |--74.30%--dispatch_command
               |                                |          |          
               |                                |          |--71.16%--mysqld_stmt_execute
               |                                |          |          |          
               |                                |          |           --70.74%--Prepared_statement::execute_loop
               |                                |          |                     |          
               |                                |          |                     |--69.53%--Prepared_statement::execute
               |                                |          |                     |          |          
               |                                |          |                     |          |--67.90%--mysql_execute_command
               |                                |          |                     |          |          |          
               |                                |          |                     |          |          |--23.43%--trans_commit_stmt
               |                                |          |                     |          |          |          |          
               |                                |          |                     |          |          |           --23.30%--ha_commit_trans
               |                                |          |                     |          |          |                     |          
               |                                |          |                     |          |          |                     |--18.86%--MYSQL_BIN_LOG::commit
               |                                |          |                     |          |          |                     |          |          
               |                                |          |                     |          |          |                     |           --18.18%--MYSQL_BIN_LOG::ordered_commit
               |                                |          |                     |          |          |                     |                     |          
               |                                |          |                     |          |          |                     |                     |--8.02%--MYSQL_BIN_LOG::change_stage
               |                                |          |                     |          |          |                     |                     |          |          
               |                                |          |                     |          |          |                     |                     |          |--2.35%--__lll_unlock_wake
               |                                |          |                     |          |          |                     |                     |          |          |          
               |                                |          |                     |          |          |                     |                     |          |           --2.24%--system_call_fastpath
               |                                |          |                     |          |          |                     |                     |          |                     |          
               |                                |          |                     |          |          |                     |                     |          |                      --2.24%--sys_futex
               |                                |          |                     |          |          |                     |                     |          |                                |          
               |                                |          |                     |          |          |                     |                     |          |                                 --2.23%--do_futex
               |                                |          |                     |          |          |                     |                     |          |                                           |          
               |                                |          |                     |          |          |                     |                     |          |                                            --2.14%--futex_wake
               |                                |          |                     |          |          |                     |                     |          |                                                      |          
               |                                |          |                     |          |          |                     |                     |          |                                                       --1.38%--wake_up_q
               |                                |          |                     |          |          |                     |                     |          |                                                                 |          
               |                                |          |                     |          |          |                     |                     |          |                                                                  --1.33%--try_to_wake_up
               ...

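The same call chain, shown as a flame graph:

At this point it is fairly clear that, across all call paths, most of the cost ends up in ut_delay.

3.2 How ut_delay relates to PAUSE, and the performance impact

3.2.1 The MySQL ut_delay implementation

Next, let's look at what ut_delay does in the MySQL source: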

/*************************************************************//**
Runs an idle loop on CPU. The argument gives the desired delay
in microseconds on 100 MHz Pentium + Visual C++.
@return dummy value */
ulint
ut_delay(
/*=====*/
  ulint delay)  /*!< in: delay in microseconds on 100 MHz Pentium */
{
  ulint i, j;

  UT_LOW_PRIORITY_CPU();

  j = 0;

  for (i = 0; i < delay * 50; i++) {
    j += i;
    UT_RELAX_CPU();
  }

  UT_RESUME_PRIORITY_CPU();

  return(j);
}
...

#   define UT_RELAX_CPU() asm ("pause" )
#   define UT_RELAX_CPU() __asm__ __volatile__ ("pause")

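As the source shows, MySQL's spin loop issues the PAUSE instruction to improve the performance of spin-wait loops.

3.2.2 How the PAUSE cycle count has changed

Intel's documentation also describes the change to PAUSE on the new platform architecture: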

Pause Latency in Skylake Microarchitecture
The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is better to wait while occupying the CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling an OS synchronization API function, such as WaitForSingleObject on Windows* OS or futex on Linux.

...

The latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.

The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions. There's also a small power benefit in 2-core and 4-core systems.

As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss.

...

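In the previous architecture (Grantley-platform E series), PAUSE takes about 10 cycles; in the new Skylake architecture it takes up to 140 cycles.
A program that uses a fixed number of PAUSE iterations to implement a delay, and blocks on that delay, may see unexpectedly long waits.
Because the PAUSE latency has grown, applications that are sensitive to it will lose some performance.

A simplified formula for program execution time:

ExecutionTime(T) = InstructionCount × TimePerCycle × CPI

That is: execution time = total instruction count × time per CPU clock cycle × average clock cycles per instruction.

MySQL implements its internal spin with a fixed number of PAUSE iterations. When the PAUSE cycle count goes up, the time spent spinning goes up with it, so overall execution time increases and system throughput suffers.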
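As a rough worked example of this effect, the sketch below multiplies the fixed iteration count of the MySQL 5.7 ut_delay loop (delay * 50) by the PAUSE latency. The helper names are hypothetical, the delay argument of 6 used afterwards is only an illustrative input, and the estimate ignores the loop's own add and branch overhead.

```c
#include <assert.h>

/* Loop factor hard-coded in the MySQL 5.7 ut_delay() shown earlier. */
#define UT_DELAY_LOOP_FACTOR 50UL

/* Number of PAUSE instructions executed by one ut_delay(delay) call. */
static unsigned long ut_delay_pause_count(unsigned long delay)
{
    return delay * UT_DELAY_LOOP_FACTOR;
}

/* Approximate cycles spent in PAUSE for one ut_delay(delay) call.
   pause_cycles is the per-instruction latency of the CPU, supplied
   by the caller (an assumed input, not something measured here). */
static unsigned long ut_delay_cycles(unsigned long delay,
                                     unsigned long pause_cycles)
{
    assert(pause_cycles > 0);
    return ut_delay_pause_count(delay) * pause_cycles;
}
```

With delay = 6, one call issues 300 PAUSE instructions. At the roughly 10 cycles Intel quotes for earlier microarchitectures that is about 3,000 cycles; at up to 140 cycles on Skylake it can reach about 42,000 cycles for the same call.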

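Intel's documentation makes it clear that the PAUSE cycle count differs across platforms and CPU architectures.

Below, we use a small test program to roughly verify and compare the PAUSE cycle counts on the old and new CPU architectures: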

#include <stdio.h>
#define TIMES 5

static inline unsigned long long rdtsc(void)
{
    /* RDTSC returns the 64-bit time-stamp counter split across EDX:EAX. */
    unsigned int low, high;
    asm volatile("rdtsc" : "=a" (low), "=d" (high));
    return ((unsigned long long)high << 32) | low;
}

void pause_test()
{
    int i = 0;
    for (i = 0; i < TIMES; i++) {
        asm(
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"\
                "pause\n"
                ::
                :);
    }
}

unsigned long pause_cycle()
{
    unsigned long long start, finish, elapsed;
    start = rdtsc();
    pause_test();
    finish = rdtsc();
    elapsed = finish - start;
    /* pause_test() executes TIMES * 20 = 100 PAUSE instructions. */
    printf("Approximate cycles per PAUSE: %llu\n", elapsed / 100);
    return 0;
}

int main()
{
    pause_cycle();
    return 0;
}

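The results are summarized below:

The 4110 and 5118 have a high PAUSE cycle count, over 100 in both cases; they belong to Skylake, the first-generation Purley architecture.
The 4210 and 5218 improve on the previous generation because they belong to Cascade Lake, the second-generation Purley architecture, in which the PAUSE instruction was optimized.

3.2.3 Why Intel may have lengthened PAUSE

We speculate that Intel lengthened PAUSE to reduce the likelihood of spinlock contention and to lower power consumption; the side effect is that PAUSE takes longer to execute, which lowers overall throughput.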

The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions.

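3.3 Analysis of the write bottleneck caused by PAUSE

Next, we look more closely at why the PAUSE instruction causes a MySQL write bottleneck.

First, we check InnoDB's semaphore monitoring data from MySQL's internal statistics: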

SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 153720
--Thread 139868617205504 has waited at row0row.cc line 1075 for 0.00 seconds the semaphore:
X-lock on RW-latch at 0x7f4298084250 created in file buf0buf.cc line 1425
a writer (thread id 139869284108032) has reserved it in mode  SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file not yet reserved line 0
Last time write locked in file /mnt/workspace/percona-server-5.7-redhat-binary-rocks-new/label_exp/min-centos-7-x64/test/rpmbuild/BUILD/percona-server-5.7.26-29/percona-server-5.7.26-29/storage/innobase/buf/buf0flu.cc line 1216
OS WAIT ARRAY INFO: signal count 441329
RW-shared spins 0, rounds 1498677, OS waits 111991
RW-excl spins 0, rounds 717200, OS waits 9012
RW-sx spins 47596, rounds 366136, OS waits 4100
Spin rounds per wait: 1498677.00 RW-shared, 717200.00 RW-excl, 7.69 RW-sx

Writes are blocked at the call on line 1216 of storage/innobase/buf/buf0flu.cc.

Tracing the source where the wait occurs, buf0flu.cc line 1216:

if (flush_type == BUF_FLUSH_LIST
        && is_uncompressed
        && !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) {    // check for a lock conflict before locking
        
        if (!fsp_is_system_temporary(bpage->id.space())) {
        /* avoiding deadlock possibility involves
        doublewrite buffer, should flush it, because
        it might hold the another block->lock. */
        buf_dblwr_flush_buffered_writes(
          buf_parallel_dblwr_partition(bpage,
                flush_type));
      } else {
        buf_dblwr_sync_datafiles();
      }
      rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);        // take the SX lock
    }
... 
 #define rw_lock_sx_lock_nowait(M, P)       \
  rw_lock_sx_lock_low((M), (P), __FILE__, __LINE__)
...

rw_lock_sx_lock_func(                                       // function that takes the SX lock
/*=================*/
  rw_lock_t*  lock, /*!< in: pointer to rw-lock */
  ulint   pass, /*!< in: pass value; != 0, if the lock will
        be passed to another thread to unlock */
  const char* file_name,/*!< in: file name where lock requested */
  ulint   line) /*!< in: line where requested */

{
  ulint   i = 0;
  sync_array_t* sync_arr;
  ulint   spin_count = 0;
  uint64_t  count_os_wait = 0;
  ulint   spin_wait_count = 0;

  ut_ad(rw_lock_validate(lock));
  ut_ad(!rw_lock_own(lock, RW_LOCK_S));

lock_loop:

  if (rw_lock_sx_lock_low(lock, pass, file_name, line)) {

    if (count_os_wait > 0) {
      lock->count_os_wait +=
        static_cast<uint32_t>(count_os_wait);
      rw_lock_stats.rw_sx_os_wait_count.add(count_os_wait);
    }

    rw_lock_stats.rw_sx_spin_round_count.add(spin_count);
    rw_lock_stats.rw_sx_spin_wait_count.add(spin_wait_count);

    /* Locking succeeded */
    return;

  } else {

    ++spin_wait_count;

    /* Spin waiting for the lock_word to become free */
    os_rmb;
    while (i < srv_n_spin_wait_rounds
           && lock->lock_word <= X_LOCK_HALF_DECR) {

      if (srv_spin_wait_delay) {
        ut_delay(ut_rnd_interval(
            0, srv_spin_wait_delay));                         // lock attempt failed, call ut_delay
      }

      i++;
    }                             

    spin_count += i;

    if (i >= srv_n_spin_wait_rounds) {

      os_thread_yield();

    } else {

      goto lock_loop;
    }
...
ulong srv_n_spin_wait_rounds  = 30;
ulong srv_spin_wait_delay = 6;

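As the source shows, MySQL implements lock waiting by calling ut_delay to run an empty loop.

InnoDB has three lock modes: S (shared), X (exclusive), and SX (shared-exclusive). SX conflicts with both SX and X. Taking an SX lock does not block reads; it only blocks writes. Under heavy write load this produces a large number of lock waits, and therefore a large number of PAUSE instructions.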
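The spin-then-yield shape of the InnoDB loop above can be sketched in a few lines of C11. This is a simplified illustration, not InnoDB code: the names cpu_relax, spin_try_acquire, and spin_release are hypothetical, the lock is a plain atomic_flag instead of InnoDB's lock_word, and the sketch retries with test-and-set where InnoDB only reads lock_word while spinning.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Issue one PAUSE on x86; do nothing on other architectures. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#endif
}

/* Try to take the flag, spinning for at most max_rounds rounds with
   pauses_per_round PAUSE instructions between attempts. Returns true
   on success; otherwise yields the CPU once and returns false so the
   caller can decide whether to retry or block. */
static bool spin_try_acquire(atomic_flag *flag, unsigned max_rounds,
                             unsigned pauses_per_round)
{
    assert(flag != NULL);
    if (!atomic_flag_test_and_set(flag))
        return true;                  /* flag was clear: lock acquired */

    for (unsigned round = 0; round < max_rounds; round++) {
        for (unsigned p = 0; p < pauses_per_round; p++)
            cpu_relax();              /* PAUSE latency accumulates here */
        if (!atomic_flag_test_and_set(flag))
            return true;
    }
    sched_yield();                    /* stop burning cycles for now */
    return false;
}

static void spin_release(atomic_flag *flag)
{
    atomic_flag_clear(flag);
}
```

The total time a contended caller burns before yielding is roughly max_rounds × pauses_per_round × PAUSE latency, which is why a longer PAUSE inflates the wait even though none of the loop bounds changed.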

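To sum up the analysis so far, two factors affect throughput:

The spin duration, which in MySQL 5.7 and earlier is fixed in the source as spin_wait_delay * 50 iterations.
The cycle count of the PAUSE instruction on the Intel CPU.

Next, we evaluate the room for optimization, and the results, from these two angles.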

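4. Optimizing and Exploring the PAUSE Instruction and Spin Parameters

4.1 MySQL spin parameter tuning

4.1.1 MySQL 5.7 spin parameter tuning

We can look for optimizations within the existing MySQL version and hardware.

MySQL exposes parameters that control spinning, and we tuned them according to their characteristics:

innodb_spin_wait_delay

innodb_spin_wait_delay is expressed in microseconds as measured on a 100 MHz Pentium processor. The default of 6 means spinning for at most 6 microseconds on such a processor.

innodb_sync_spin_loops

When an InnoDB thread fails to acquire a mutex, it retries up to innodb_sync_spin_loops times.

Of these, innodb_spin_wait_delay affects how long PAUSE runs, so we ran tuning tests on this parameter.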

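As before, we benchmarked the tuned parameters to compare performance:

In summary:

Adjusting innodb_spin_wait_delay has some effect on TPS and QPS: smaller values improve MySQL performance, and larger values reduce it.
The gain from tuning innodb_spin_wait_delay is limited and still falls short of what our production workloads need.

4.2 Porting the MySQL 8.0 spin feature

4.2.1 Porting spin_wait_pause_multiplier

For the throughput drop caused by PAUSE on Skylake CPUs, tuning the MySQL 5.7 spin parameter innodb_spin_wait_delay did not produce a significant improvement.

So we turned to a new feature in MySQL 8.0: for PAUSE, the 8.0 source adds the spin_wait_pause_multiplier parameter, which replaces the previously hard-coded loop count.

4.2.2 How spin_wait_pause_multiplier is implemented

In the MySQL 8.0 source, the hard-coded factor of 50 loop iterations becomes an adjustable parameter: spin_wait_pause_multiplier.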

ulint ut_delay(ulint delay) {
  ulint i, j;
  /* We don't expect overflow here, as ut::spin_wait_pause_multiplier is limited
  to 100, and values of delay are not larger than @@innodb_spin_wait_delay
  which is limited by 1 000. Anyway, in case an overflow happened, the program
  would still work (as iterations is unsigned). */
  const ulint iterations = delay * ut::spin_wait_pause_multiplier;
  UT_LOW_PRIORITY_CPU();

  j = 0;

  for (i = 0; i < iterations; i++) {
    j += i;
    UT_RELAX_CPU();
  }

  UT_RESUME_PRIORITY_CPU();

  return (j);
}
...
namespace ut {
ulong spin_wait_pause_multiplier = 50;
}

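4.2.3 Optimizing with the ported spin_wait_pause_multiplier patch

Since the MySQL 8.0 parameter spin_wait_pause_multiplier controls how long the PAUSE loop runs, lowering it reduces the overall impact of PAUSE.

After studying the relevant MySQL 8.0 code, we ported the patch to our stable production version: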

MySQL>select version();
+------------------+
| version()        |
+------------------+
| 5.7.26-29-mt-log |
+------------------+
1 row in set (0.00 sec)

MySQL>show global variables like '%spin%';  
+-----------------------------------+-------+
| Variable_name                     | Value |
+-----------------------------------+-------+
| innodb_spin_wait_delay            | 6     |
| innodb_spin_wait_pause_multiplier | 5     |
| innodb_sync_spin_loops            | 30    |
+-----------------------------------+-------+
3 rows in set (0.00 sec)

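As shown earlier, the Silver 4110's PAUSE cycle count is about 14 times that of the E5-2620 v4. On that basis, we set innodb_spin_wait_pause_multiplier to roughly 1/14 of its default, rounded up slightly to 5; that is, we changed the parameter from its default of 50 to 5.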
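The scaling arithmetic behind that choice can be written out as a small helper. This is only a sketch of the reasoning: the function name scaled_multiplier is hypothetical, and the 10 and 140 cycle figures are the approximate latencies from the Intel text quoted earlier, not values measured by this code.

```c
#include <assert.h>

/* Scale the default multiplier by the ratio of old to new PAUSE
   latency, rounding up so the spin window in cycles is never shorter
   than it was on the old CPU. */
static unsigned long scaled_multiplier(unsigned long default_multiplier,
                                       unsigned long old_pause_cycles,
                                       unsigned long new_pause_cycles)
{
    assert(new_pause_cycles > 0);
    unsigned long product = default_multiplier * old_pause_cycles;
    /* Integer ceiling of product / new_pause_cycles. */
    return (product + new_pause_cycles - 1) / new_pause_cycles;
}
```

For a default of 50, an old latency of 10 cycles, and a new latency of 140 cycles, the exact ratio is about 3.6 and the helper returns 4. The value 5 used in the text is a slightly more conservative setting above that minimum.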

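Finally, we again use a line chart to compare benchmark data after tuning with the patch:

With the spin_wait_pause_multiplier patch ported and tuned, the Silver 4110 (patch) shows a large performance improvement.
The Silver 4110 (patch) outperforms tuning innodb_spin_wait_delay alone.
In the write-only scenario with more than 64 concurrent threads, the Silver 4110 (patch) is slightly slower than the E5-2620 V4; in all other cases it is faster.
At the read/write ratio of our real production traffic, the 4110 (patch) restores throughput to its previous level.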

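4.3 PAUSE cycle count optimization

In the section above, we measured a lower PAUSE cycle count on Cascade Lake CPUs. Intel engineers confirmed that, starting with Cascade Lake, the second-generation Purley product, Intel reduced the PAUSE cycle count to 44. (Presumably Intel also noticed the performance bottleneck caused by lengthening PAUSE in the first generation.)

We ran further benchmarks on the second-generation CPUs to see how they perform:

We then used perf diff to compare the ut_delay overhead on the 4110 and the 4210:

The share for the 4210 is 8% lower than for the 4110.
Because the PAUSE cycle count is still several times that of the E5-series CPUs, PAUSE overhead still has a considerable impact on MySQL throughput on the 4210 under high load. Below 128 concurrent threads, however, performance improves substantially over the 4110, which should be enough for our production workloads (the results match the benchmark curve of the ported spin_wait_pause_multiplier patch).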

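5. Summary

To wrap up, here is a brief summary:

Intel lengthened the PAUSE instruction on CPUs of the new platform. Under heavy spinlock contention at high concurrency, this can cause a large performance loss, especially in programs that execute a fixed number of PAUSE instructions.
For CPUs based on the Skylake architecture (for example, the 4110), the performance problem caused by the longer PAUSE can be addressed as follows:

Port the MySQL 8.0 innodb_spin_wait_pause_multiplier patch to your stable production version (or upgrade to MySQL 8.0), and raise throughput by shortening the PAUSE loop.
If the OS is CentOS 6, upgrade to CentOS 7; its spinlock optimizations also improve MySQL performance to some extent.
The simplest and most direct option is to replace the CPU with a Cascade Lake part.

For Cascade Lake CPUs, Intel has already optimized the PAUSE cycle count, which addresses the regression. The optimizations above can still be applied to gain a further step in performance.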

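6. About the Author

Chunlin joined Meituan in 2017 and is mainly responsible for MySQL operations development and optimization.

Hiring

The Meituan DBA team is hiring for a range of roles, based in Beijing or Shanghai. We are committed to providing the company with stable, reliable, and efficient online storage services and to building an industry-leading database team. We run tens of thousands of MySQL instances across many architectures, serving trillions of OLTP requests per day, in a truly massive, distributed, high-concurrency environment. If you are interested, send your resume to tech@meituan.com (subject line: Meituan DBA Team).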
