Memory barrier (repost)

Date: 2019-11-11

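All examples in this article were verified on Linux (g++) on an x86-64 CPU. All Linux kernel code listed here is likewise valid on (or only on) x86-64.

The article first explains memory barriers through examples (and kernel code), then introduces a lock-free ring buffer implemented with memory barriers.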

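Introduction to memory barriers

The order in which a running program actually accesses memory is not necessarily the order written in the source code; this is memory reordering (out-of-order memory access). It exists to improve run-time performance, and it arises mainly at two stages:

At compile time, compiler optimizations reorder memory accesses (instruction reordering)
At run time, interaction between multiple CPUs makes memory accesses appear out of order

A memory barrier forces the CPU or the compiler to keep memory accesses in order: accesses issued before the barrier are guaranteed to complete before those issued after it. There are two kinds of memory barrier:

Compiler barrier
CPU memory barrier

Most of the time the reordering done by the compiler and the CPU causes no problems. In some special cases, though, the correctness of the program logic depends on the order of memory accesses, and reordering then produces logic errors. For example: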

// thread 1
while (!ok);
do(x);
// thread 2
x = 42;
ok = 1;

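In this code, ok is initialized to 0, and thread 1 waits for ok to become 1 before calling the do function. Suppose thread 2's writes are performed out of order, so that the assignment to x completes after the assignment to ok. The argument that do receives may then not be 42, contrary to what the programmer expects.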
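As a rough illustration of how the intended ordering can be expressed in portable code, the sketch below restates the two threads with C++11 atomics. It is not from the original article: `consume` stands in for the do function (do is a reserved word in C++), and `run_once`, `seen`, and the use of std::thread are assumptions made for the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names: consume() stands in for the article's do(), and
// run_once() is a helper invented for this sketch.
static int x = 0;                // plain data written by thread 2
static std::atomic<int> ok{0};   // flag that guards x
static int seen = 0;             // value thread 1 passed to consume()

static void consume(int value) { seen = value; }

int run_once()
{
    x = 0;
    ok.store(0, std::memory_order_relaxed);
    seen = 0;

    std::thread t1([] {
        // Acquire load: once ok == 1 is observed, the write to x that
        // preceded the release store below is guaranteed to be visible.
        while (ok.load(std::memory_order_acquire) == 0) {
        }
        consume(x);
    });
    std::thread t2([] {
        x = 42;
        // Release store: the write to x cannot move after this store.
        ok.store(1, std::memory_order_release);
    });
    t1.join();
    t2.join();
    return seen;
}
```

Calling run_once() returns the value thread 1 observed. With the release/acquire pair in place that value is 42 on every run, whatever the compiler or the CPU reorders elsewhere.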

Compile-time memory reordering

At compile time, the optimizer may change the order in which instructions are actually executed (with gcc, for example, -O2 or -O3 may do so):

// test.cpp
int x, y, r;
void f()
{
    x = r;
    y = 1;
}

After optimization, y = 1 may complete before x = r. First compile the source file without optimization:

g++ -S test.cpp

The relevant assembly is:

movl r(%rip), %eax
movl %eax, x(%rip)
movl $1, y(%rip)

Here x = r and y = 1 are not reordered. Now compile the same code with -O2 (or -O3), that is g++ -O2 -S test.cpp. The generated assembly is:

movl r(%rip), %eax
movl $1, y(%rip)
movl %eax, x(%rip)

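We can clearly see that after optimization movl $1, y(%rip) executes before movl %eax, x(%rip). The way to prevent compile-time reordering is a compiler barrier (also called an optimization barrier). The Linux kernel provides barrier(), which makes the compiler keep the memory accesses before it ahead of those after it. The kernel implements barrier() as follows (x86-64):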

#define barrier() __asm__ __volatile__("" ::: "memory")

Now add this compiler barrier to the code:

int x, y, r;
void f()
{
    x = r;
    __asm__ __volatile__("" ::: "memory");
    y = 1;
}

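This removes the reordering introduced by compiler optimization (if you are interested, look at the generated assembly again). In this example we can also use the volatile keyword to prevent compile-time reordering (it does nothing about the run-time reordering discussed later). The compiler does not reorder accesses to volatile variables relative to one another, although it may still move accesses to non-volatile variables across them. Changing the definitions of x and y therefore solves the problem here: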

volatile int x, y;
int r;
void f()
{
    x = r;
    y = 1;
}

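With volatile added, accesses to x and y are ordered relative to each other. The Linux kernel provides a macro, ACCESS_ONCE, that stops the compiler from reordering consecutive ACCESS_ONCE instances (newer kernels replace it with READ_ONCE() and WRITE_ONCE()). ACCESS_ONCE is implemented as: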

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

This code merely casts the variable x to volatile. That gives us a third fix:

int x, y, r;
void f()
{
    ACCESS_ONCE(x) = r;
    ACCESS_ONCE(y) = 1;
}

That largely covers compile-time memory reordering. Next comes run-time memory reordering.
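The barrier() macro shown earlier is GCC-specific inline assembly. As a rough portable sketch (not part of the original article), standard C++11 offers std::atomic_signal_fence, which GCC and Clang typically treat as a compiler-only barrier: it restricts how the compiler may move memory accesses across it but emits no CPU fence instruction. The function f and the variables x, y, r are the ones from test.cpp; treating the fence as a drop-in replacement for barrier() is an assumption of this sketch.

```cpp
#include <atomic>
#include <cassert>

int x, y, r;

// Same f() as in test.cpp, with the GCC-specific asm statement replaced
// by a standard fence. It constrains only the compiler; no CPU fence
// instruction is generated for it.
void f()
{
    x = r;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    y = 1;
}
```

Comparing the output of g++ -O2 -S with and without the fence line is a quick way to check what a given compiler version actually does.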

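Run-time memory reordering

At run time the CPU does execute instructions out of order, but on a single CPU the hardware guarantees that all memory accesses appear to happen in program order, so a memory barrier is unnecessary there (leaving compiler optimization aside). Let us look at how a CPU executes out of order. With out-of-order execution, the order in which a processor actually executes instructions is determined by the availability of input data, not by the order the programmer wrote.
Early processors were in-order processors. An in-order processor typically handles an instruction in these steps:

Instruction fetch
If the instruction's input operands are available (already in registers, for example), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable (usually because they must be fetched from memory), the processor stalls until they are available
The instruction is executed by the appropriate functional unit
The functional unit writes the result back to the register file (the set of registers in a CPU)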

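By comparison, an out-of-order processor typically handles an instruction in these steps:

Instruction fetch
The instruction is dispatched to an instruction queue
The instruction waits in the queue until its input operands are available (once they are, it may leave the queue even if earlier instructions have not executed yet)
The instruction is issued to the appropriate functional unit and executed
The result is placed in a queue (rather than being written to the register file immediately)
The result is written to the register file only after the results of all earlier instructions have been written (results are reordered so that execution appears to be in order)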

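As these steps show, out-of-order execution avoids the stall on unavailable operands (step two of in-order execution) and is therefore more efficient. On modern machines the processor is much faster than memory; in the time an in-order processor spends waiting for data, an out-of-order processor can already have processed a large number of instructions.
Thinking through how an out-of-order processor handles instructions leads to a few conclusions:

On a single CPU, instructions are fetched in order (by means of a queue)
On a single CPU, results are also written back to the register file in order (by means of a queue)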
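The difference between the two step lists can be made concrete with a toy cycle count. The model below is an assumption made purely for illustration and is far simpler than any real pipeline: every instruction waits a given number of cycles for its operands and then executes in one cycle, the out-of-order machine accepts one instruction per cycle into its queue, and it has unlimited functional units. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model (an assumption of this sketch, not a real pipeline):
// wait[i] is the number of cycles instruction i waits for its operands,
// and every instruction then executes in 1 cycle.

// In-order: an instruction cannot start waiting until the previous one
// has finished, so all the waits add up.
int in_order_cycles(const std::vector<int>& wait)
{
    int now = 0;
    for (int w : wait)
        now += w + 1;
    return now;
}

// Out-of-order: instruction i enters the queue at cycle i, executes as
// soon as its operands arrive, and its result is written back only
// after all earlier results have been written back.
int out_of_order_cycles(const std::vector<int>& wait)
{
    int retired = 0;
    for (int i = 0; i < static_cast<int>(wait.size()); ++i) {
        int finished = i + wait[i] + 1;
        retired = std::max(retired, finished);  // in-order write-back
    }
    return retired;
}
```

For the waits {100, 1, 100, 1} (two slow loads interleaved with two fast instructions) the in-order count is 206 cycles and the out-of-order count is 103, because the two long waits overlap while write-back still happens in program order.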

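So on a single CPU, as long as compiler reordering is ruled out, multi-threaded execution has no memory reordering problem. The kernel source points to the same conclusion (abridged excerpt):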

#ifdef CONFIG_SMP
#define smp_mb() mb()
#else
#define smp_mb() barrier()
#endif

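If the kernel is built for SMP, smp_mb() is mb(), which is defined as a CPU memory barrier (covered later); without SMP it is just a compiler barrier.

On a multi-CPU machine things are different. Every CPU has a cache (the cache exists mainly to bridge the gap between the CPU and the much slower memory). The first time a given CPU accesses a given piece of data, the data is obviously not in its cache (a cache miss). On a cache miss the CPU has to fetch the data from memory (which can stall it for hundreds of cycles); the data is then loaded into the cache so that later accesses are fast. When a CPU writes, it must first make sure the other CPUs have removed the data from their caches (to keep the caches coherent), and only after that removal completes can it safely modify the data. With multiple caches, a cache coherence protocol is needed to prevent inconsistent data, and this communication, together with the store buffers and invalidate queues that CPUs use to hide its latency, can make memory accesses appear out of order. This is the run-time memory reordering discussed here. The details are complex and are not covered further; if you are interested, read http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf, which analyzes the whole process in detail.

The following example demonstrates memory reordering on multiple CPUs: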

// test2.cpp
// Build without -DNDEBUG: every check below relies on assert().
#include <pthread.h>
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET (g++ defines _GNU_SOURCE)
#include <assert.h>
// -------------------
int cpu_thread1 = 0;   // CPU that runs run1()
int cpu_thread2 = 1;   // CPU that runs run2()
volatile int x, y, r1, r2;
void start()
{
    x = y = r1 = r2 = 0;
}
void end()
{
    // Both loads returning 0 means a store was reordered after a load.
    assert(!(r1 == 0 && r2 == 0));
}
void run1()
{
    x = 1;
    r1 = y;
}
void run2()
{
    y = 1;
    r2 = x;
}
// -------------------
static pthread_barrier_t barrier_start;
static pthread_barrier_t barrier_end;
static void* thread1(void*)
{
    while (1) {
        pthread_barrier_wait(&barrier_start);
        run1();
        pthread_barrier_wait(&barrier_end);
    }
    return NULL;
}
static void* thread2(void*)
{
    while (1) {
        pthread_barrier_wait(&barrier_start);
        run2();
        pthread_barrier_wait(&barrier_end);
    }
    return NULL;
}
int main()
{
    // Three participants per barrier: main plus the two worker threads.
    assert(pthread_barrier_init(&barrier_start, NULL, 3) == 0);
    assert(pthread_barrier_init(&barrier_end, NULL, 3) == 0);
    pthread_t t1;
    pthread_t t2;
    assert(pthread_create(&t1, NULL, thread1, NULL) == 0);
    assert(pthread_create(&t2, NULL, thread2, NULL) == 0);
    // Pin each worker thread to its own CPU.
    cpu_set_t cs;
    CPU_ZERO(&cs);
    CPU_SET(cpu_thread1, &cs);
    assert(pthread_setaffinity_np(t1, sizeof(cs), &cs) == 0);
    CPU_ZERO(&cs);
    CPU_SET(cpu_thread2, &cs);
    assert(pthread_setaffinity_np(t2, sizeof(cs), &cs) == 0);
    while (1) {
        start();
        pthread_barrier_wait(&barrier_start);
        pthread_barrier_wait(&barrier_end);
        end();
    }
    return 0;
}

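Two threads are created to run the test code (the code under test goes in the run functions). I use a pthread barrier (not to be confused with the memory barriers this article is about) so that the two child threads run their run functions at the same time. The program keeps trying to run the two run functions simultaneously until the result we are looking for shows up. Before each round, start is called (to initialize the data), and after each round, end is called (to check the result). The variables cpu_thread1 and cpu_thread2 control which CPU run1 and run2 execute on.
First compile the program: g++ -pthread -o test2 test2.cpp (no optimization, to keep compiler reordering out of the picture). Note that the two threads run on two different CPUs (CPU 0 and CPU 1). As long as memory accesses are not reordered, r1 and r2 cannot both be 0, so an assertion failure means reordering occurred. x86-64 allows a store to be reordered after a later load from a different address, which is exactly the pattern in run1 and run2. Run the program and the assertion fails with some probability. To make the point further, change cpu_thread2 to 0, so that both threads run on the same CPU; run it again and the assertion no longer fails.

Finally, we use a CPU memory barrier to fix the reordering (on x86-64):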

int cpu_thread1 = 0;
int cpu_thread2 = 1;
void run1()
{
    x = 1;
    __asm__ __volatile__("mfence" ::: "memory");
    r1 = y;
}
void run2()
{
    y = 1;
    __asm__ __volatile__("mfence" ::: "memory");
    r2 = x;
}
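
The mfence instruction is specific to x86. A portable way to express the same full barrier, sketched here under the assumption that C++11 atomics are available, is std::atomic_thread_fence with std::memory_order_seq_cst; on x86-64, compilers typically lower it to mfence or a locked instruction. The helper `count_both_zero` is hypothetical and is not the article's test harness: it starts fresh threads each round instead of pinning them to CPUs.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper for this sketch: runs the two-thread experiment
// `rounds` times and counts how often both loads returned 0.
int count_both_zero(int rounds)
{
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            // Full fence: the store above is ordered before the load below.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0)
            ++both_zero;
    }
    return both_zero;
}
```

With both fences in place the C++ memory model forbids r1 == 0 && r2 == 0, so the count stays at 0. Because each round creates new threads, the two bodies rarely overlap, so this version is a statement of the guarantee rather than a stress test; the pinned, barrier-synchronized program above is the better tool for provoking the reordering.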

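Getting ready to use memory barriers

Common uses of memory barriers include:

Implementing synchronization primitives
Implementing lock-free data structures
Device drivers

In real application development, a developer can write correct multi-threaded programs without knowing anything about memory barriers. That is mainly because the various synchronization mechanisms already imply memory barriers (with subtle differences from explicit ones), so not using memory barriers directly causes no problems. If you want to write something like a lock-free data structure, however, memory barriers are very useful.

Generally, on a single CPU, dependent memory accesses are ordered: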

Q = P;
D = *Q;

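Here the memory operations are ordered. On Alpha CPUs, however, dependent loads are not necessarily ordered, and a data dependency barrier is required (Alpha is uncommon, so it is not explained in detail here).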
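The dependent-load pattern matters most when another CPU publishes the pointer. The sketch below is an illustration added here, not code from the original: a writer initializes an object and then publishes its address, and a reader performs the Q = P; D = *Q sequence. It uses an acquire load rather than relying on the address dependency alone, which is the conservative choice in portable C++. `Node`, `node`, and `read_through_pointer` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Node { int value; };   // hypothetical payload type

static Node node{0};
static std::atomic<Node*> P{nullptr};

int read_through_pointer()
{
    node.value = 0;
    P.store(nullptr, std::memory_order_relaxed);
    int d = 0;

    std::thread writer([] {
        node.value = 42;                            // initialize first
        P.store(&node, std::memory_order_release);  // then publish
    });
    std::thread reader([&d] {
        Node* q;
        // Q = P; an acquire load is used instead of relying on the
        // address dependency between the two loads.
        while ((q = P.load(std::memory_order_acquire)) == nullptr) {
        }
        d = q->value;                               // D = *Q
    });
    writer.join();
    reader.join();
    return d;
}
```

read_through_pointer() returns 42: once the reader sees the published pointer, it also sees the initialized contents behind it.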

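Besides the compiler barriers mentioned above, barrier() and ACCESS_ONCE(), the Linux kernel also has CPU memory barriers:

General barrier, orders both reads and writes: mb() and smp_mb()
Write barrier, orders writes only: wmb() and smp_wmb()
Read barrier, orders reads only: rmb() and smp_rmb()

Note that every CPU memory barrier (except the data dependency barrier) implies a compiler barrier. The smp-prefixed barriers compile to a plain compiler barrier on uniprocessor configurations and to a CPU memory barrier (that is, mb(), wmb(), rmb()) only on SMP; recall the kernel code shown earlier.

One last point: some kinds of CPU memory barrier must be used in pairs, otherwise things go wrong. Specifically, a write barrier on one side must be paired with a read barrier (or a data dependency barrier) on the other side (a general barrier works too), and vice versa.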
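The pairing rule can be sketched in user space with C++11 fences. This is an analogy rather than the kernel API: a release fence on the writer side roughly plays the role of the write barrier (it is somewhat stronger, since it also orders earlier loads), and an acquire fence on the reader side roughly plays the role of the read barrier. All names below are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int data_a = 0, data_b = 0;   // hypothetical payload
static std::atomic<int> flag{0};

int paired_fences()
{
    data_a = data_b = 0;
    flag.store(0, std::memory_order_relaxed);
    int sum = 0;

    std::thread writer([] {
        data_a = 1;
        data_b = 2;
        // Writer side of the pair: the data is written before the flag.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(1, std::memory_order_relaxed);
    });
    std::thread reader([&sum] {
        while (flag.load(std::memory_order_relaxed) == 0) {
        }
        // Reader side of the pair: the flag is read before the data.
        std::atomic_thread_fence(std::memory_order_acquire);
        sum = data_a + data_b;
    });
    writer.join();
    reader.join();
    return sum;
}
```

paired_fences() returns 3. Removing either fence breaks the guarantee: the writer's fence alone does not stop the reader from loading the data before the flag, and the reader's fence alone does not stop the writer's stores from becoming visible in the wrong order.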

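Memory barrier example

Reading kernel code is a good way to learn more about using memory barriers.
kfifo, the Linux kernel's ring buffer that is lock-free when there is exactly one reader thread and one writer thread, uses memory barriers. Its source is: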

/*
* A simple kernel FIFO implementation.
*
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/kfifo.h>
#include <linux/log2.h>
/**
* kfifo_init - allocates a new FIFO using a preallocated buffer
* @buffer: the preallocated buffer to be used.
* @size: the size of the internal buffer, this have to be a power of 2.
* @gfp_mask: get_free_pages mask, passed to kmalloc()
* @lock: the lock to be used to protect the fifo buffer
*
* Do NOT pass the kfifo to kfifo_free() after use! Simply free the
* &struct kfifo with kfree().
*/
struct kfifo *kfifo_init(unsigned char *buffer, unsigned int size,
                         gfp_t gfp_mask, spinlock_t *lock)
{
    struct kfifo *fifo;
    /* size must be a power of 2 */
    BUG_ON(!is_power_of_2(size));
    fifo = kmalloc(sizeof(struct kfifo), gfp_mask);
    if (!fifo)
        return ERR_PTR(-ENOMEM);
    fifo->buffer = buffer;
    fifo->size = size;
    fifo->in = fifo->out = 0;
    fifo->lock = lock;
    return fifo;
}
EXPORT_SYMBOL(kfifo_init);
/**
* kfifo_alloc - allocates a new FIFO and its internal buffer
* @size: the size of the internal buffer to be allocated.
* @gfp_mask: get_free_pages mask, passed to kmalloc()
* @lock: the lock to be used to protect the fifo buffer
*
* The size will be rounded-up to a power of 2.
*/
struct kfifo *kfifo_alloc(unsigned int size, gfp_t gfp_mask, spinlock_t *lock)
{
    unsigned char *buffer;
    struct kfifo *ret;
    /*
     * round up to the next power of 2, since our 'let the indices
     * wrap' technique works only in this case.
     */
    if (!is_power_of_2(size)) {
        BUG_ON(size > 0x80000000);
        size = roundup_pow_of_two(size);
    }
    buffer = kmalloc(size, gfp_mask);
    if (!buffer)
        return ERR_PTR(-ENOMEM);
    ret = kfifo_init(buffer, size, gfp_mask, lock);
    if (IS_ERR(ret))
        kfree(buffer);
    return ret;
}
EXPORT_SYMBOL(kfifo_alloc);
/**
* kfifo_free - frees the FIFO
* @fifo: the fifo to be freed.
*/
void kfifo_free(struct kfifo *fifo)
{
    kfree(fifo->buffer);
    kfree(fifo);
}
EXPORT_SYMBOL(kfifo_free);
/**
* __kfifo_put - puts some data into the FIFO, no locking version
* @fifo: the fifo to be used.
* @buffer: the data to be added.
* @len: the length of the data to be added.
*
* This function copies at most @len bytes from the @buffer into
* the FIFO depending on the free space, and returns the number of
* bytes copied.
*
* Note that with only one concurrent reader and one concurrent
* writer, you don't need extra locking to use these functions.
*/
unsigned int __kfifo_put(struct kfifo *fifo,
                         const unsigned char *buffer, unsigned int len)
{
    unsigned int l;
    len = min(len, fifo->size - fifo->in + fifo->out);
    /*
     * Ensure that we sample the fifo->out index -before- we
     * start putting bytes into the kfifo.
     */
    smp_mb();
    /* first put the data starting from fifo->in to buffer end */
    l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
    memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);
    /* then put the rest (if any) at the beginning of the buffer */
    memcpy(fifo->buffer, buffer + l, len - l);
    /*
     * Ensure that we add the bytes to the kfifo -before-
     * we update the fifo->in index.
     */
    smp_wmb();
    fifo->in += len;
    return len;
}
EXPORT_SYMBOL(__kfifo_put);
/**
* __kfifo_get - gets some data from the FIFO, no locking version
* @fifo: the fifo to be used.
* @buffer: where the data must be copied.
* @len: the size of the destination buffer.
*
* This function copies at most @len bytes from the FIFO into the
* @buffer and returns the number of copied bytes.
*
* Note that with only one concurrent reader and one concurrent
* writer, you don't need extra locking to use these functions.
*/
unsigned int __kfifo_get(struct kfifo *fifo,
                         unsigned char *buffer, unsigned int len)
{
    unsigned int l;
    len = min(len, fifo->in - fifo->out);
    /*
     * Ensure that we sample the fifo->in index -before- we
     * start removing bytes from the kfifo.
     */
    smp_rmb();
    /* first get the data from fifo->out until the end of the buffer */
    l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
    memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);
    /* then get the rest (if any) from the beginning of the buffer */
    memcpy(buffer + l, fifo->buffer, len - l);
    /*
     * Ensure that we remove the bytes from the kfifo -before-
     * we update the fifo->out index.
     */
    smp_mb();
    fifo->out += len;
    return len;
}
EXPORT_SYMBOL(__kfifo_get);

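To make the source above easier to follow, here are some techniques it uses that are unrelated to the topic of this article:

The index into the ring buffer is computed with a bitwise AND, which is considerably cheaper than a modulo. This requires the buffer size to be a power of two, in other words a binary number with exactly one bit set, so that index & (size - 1) is the index into the buffer (this is not hard to see)
Two indices, in and out, are used, and both only ever increase (a neat trick). This avoids some awkward conditionals (in some implementations, in == out cannot tell an empty buffer from a full one)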
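
Both tricks can be checked with a few lines of arithmetic. The helpers below are hypothetical and assume a 32-bit unsigned int, as on x86-64 Linux.

```cpp
#include <cassert>

// size must be a power of two for the mask to equal the remainder.
unsigned int slot(unsigned int index, unsigned int size)
{
    return index & (size - 1);
}

// in and out only ever increase; unsigned subtraction stays correct
// even after in has wrapped around past the largest unsigned value.
unsigned int used(unsigned int in, unsigned int out)
{
    return in - out;
}
```

For example, slot(13, 8) is 5, the same as 13 % 8, and used(2, 0xFFFFFFFE) is 4 even though in has already wrapped. With free-running indices the buffer is empty when used(in, out) is 0 and full when it equals size, so the two states never look alike.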

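Here the indices in and out are accessed by two threads. in and out mark the boundaries of the valid data in the buffer, so there is an ordering relationship between accesses to in and out and accesses to the buffer data. Since no synchronization mechanism is used, memory barriers are needed to keep that order. Each index is modified by only one thread but read by both. __kfifo_put first uses in and out to work out how much data can be written to the buffer; the out index must be read before the data in the user buffer is actually written into the ring, so smp_mb() is used here. Correspondingly, __kfifo_get uses smp_mb() to make sure the data has been read from the ring and written to the user buffer before out is updated. For the in index, __kfifo_put uses smp_wmb() to make sure the data is written to the buffer before in is updated; only writes need to be ordered here, so a write barrier is enough. __kfifo_get uses smp_rmb() to make sure in is read (it determines how much readable data the buffer holds) before the data is read from the buffer (and written to the user buffer); only reads need to be ordered here, so a read barrier is enough.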
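
The same design can be sketched in user space with C++11 atomics. This is a hypothetical analogue written for illustration, not the kernel code: `SpscRing` and its members are invented names, and acquire/release operations on the indices stand in for the kernel barriers. It is meant for exactly one producer thread calling put and one consumer thread calling get.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstring>

// Hypothetical user-space analogue of kfifo for one producer thread and
// one consumer thread. N must be a power of two.
template <unsigned int N>
class SpscRing {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer only. Returns the number of bytes copied in.
    unsigned int put(const unsigned char* src, unsigned int len)
    {
        unsigned int in = in_.load(std::memory_order_relaxed);
        // Acquire pairs with the release store of out_ in get(): the
        // consumer has finished with the slots we are about to reuse.
        unsigned int out = out_.load(std::memory_order_acquire);
        len = std::min(len, N - (in - out));

        unsigned int off = in & (N - 1);
        unsigned int l = std::min(len, N - off);
        std::memcpy(buf_ + off, src, l);        // up to the buffer end
        std::memcpy(buf_, src + l, len - l);    // wrap to the start

        // Release: the bytes are written before in_ moves.
        in_.store(in + len, std::memory_order_release);
        return len;
    }

    // Consumer only. Returns the number of bytes copied out.
    unsigned int get(unsigned char* dst, unsigned int len)
    {
        unsigned int out = out_.load(std::memory_order_relaxed);
        // Acquire: in_ is read before the bytes it covers.
        unsigned int in = in_.load(std::memory_order_acquire);
        len = std::min(len, in - out);

        unsigned int off = out & (N - 1);
        unsigned int l = std::min(len, N - off);
        std::memcpy(dst, buf_ + off, l);
        std::memcpy(dst + l, buf_, len - l);

        // Release: the bytes are copied out before out_ moves.
        out_.store(out + len, std::memory_order_release);
        return len;
    }

private:
    unsigned char buf_[N];
    std::atomic<unsigned int> in_{0};
    std::atomic<unsigned int> out_{0};
};
```

A short single-threaded walk-through shows the wrap-around: with SpscRing<8>, putting "abcdef" stores 6 bytes, getting 4 returns "abcd", putting 8 more bytes stores only the 6 that fit, and getting 8 then returns "efghijkl", read across the end of the array. Each index is written by one side and read by the other, which is why one release store and one acquire load per index are enough in this sketch.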

References

http://en.wikipedia.org/wiki/Memory_barrier

http://en.wikipedia.org/wiki/Memory_ordering

http://en.wikipedia.org/wiki/Out-of-order_execution

https://www.kernel.org/doc/Documentation/memory-barriers.txt