Compared with a traditional uniprocessor, the SMP parallel architecture brings a very considerable performance improvement. An unavoidable problem, though, is how the processors in a parallel machine interact. One possible solution is to give each CPU its own private memory and have the processors communicate through message passing. The trouble with that design is the enormous programming burden it places on the programmer (above all the systems programmer), who must handle data partitioning and transfer. The other, widely adopted design is to let the processors share a single memory address space. Under this architecture each processor may still have its own local memory, but all memory is accessible to every processor; only the access latency differs. This article discusses this shared-memory architecture.
A shared memory space lets every processor access any memory location, and that raises problems which do not exist on a uniprocessor. Consider the following situation (the figure and example come from the kernel document Documentation/memory-barriers.txt):
In this system model, suppose the following memory accesses take place:
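(Reconstructed from Documentation/memory-barriers.txt: A and B initially hold 1 and 2, and each CPU performs one store and one load:)

```
CPU 1           CPU 2
=============== ===============
{ A == 1; B == 2 }
A = 3;          x = B;
B = 4;          y = A;
```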
Because the processors employ out-of-order execution for efficiency, and because of caching, the final values of x and y as seen by memory can be any of several combinations:
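(Following the same kernel document, the four possible outcomes are:)

```
x == 2, y == 1
x == 2, y == 3
x == 4, y == 1
x == 4, y == 3
```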
Programmers working at the operating-system level therefore need a model that lets the processors coordinate their use of shared memory correctly. This model is called the memory consistency model, or memory model for short.
One more point needs to be made explicit: a computer is a machine of layered abstractions. At the bottom is the bare machine; above it, a virtual machine that understands assembly programs; above that, higher-level virtual machines that understand higher-level languages (such as C/C++ and Java). At the high-level-language layer, the language and its runtime define that layer's memory model; the C++ and Java specifications, for instance, each define their own. This article, however, focuses on the memory model defined at the processor-architecture level.
An intuitive memory model is the sequential consistency model. Briefly, sequential consistency guarantees two things:
1. Each processor (or thread) executes in the order of the instructions in the final compiled binary (its program order).
2. For each processor (or thread), every other processor (or thread) sees exactly the order in which it actually executed.
From these two points it follows that the whole program runs in one well-ordered execution sequence. The model is entirely intuitive: a program running under it executes in just the order expected by a programmer who has never heard of memory models.
This model is inefficient, however, because it prevents the processor from using out-of-order execution to maximize parallelism. No commercial processor architecture uses it; they all adopt weaker memory consistency models instead.
For example, the familiar x86 architecture and its compatible 64-bit successor x86_64 use a model called processor consistency. In short: a processor may reorder the execution of its own `write` operations as it sees fit, but the order every other processor sees is the order in which it actually executed them. This may be hard to grasp, but it matters; it is spelled out later, and the sketch below makes the weakness relative to sequential consistency concrete.
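Here is a minimal C++11 sketch of the classic store-buffering litmus test; the variable names are invented for illustration. It exposes the one reordering x86 itself permits: a read may overtake the processor's own earlier buffered write to a different location, so both threads can read 0, an outcome sequential consistency would forbid.

```cpp
// Store-buffering litmus test: with relaxed ordering, r1 == 0 && r2 == 0
// is a possible (and, on x86, observable) outcome.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x(0), y(0);
int r1, r2;

void cpu0() {
    x.store(1, std::memory_order_relaxed);  // write may linger in the store buffer
    r1 = y.load(std::memory_order_relaxed); // read may complete first
}

void cpu1() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t0(cpu0), t1(cpu1);
    t0.join();
    t1.join();
    std::printf("r1=%d r2=%d\n", r1, r2); // "r1=0 r2=0" can appear
}
```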
There are still looser weak memory models, which give the processor great freedom to reorder instructions. With them, the programmer (especially the systems programmer) must take explicit measures to constrain the reordering and obtain the intended results. That is the subject of this article. The typical processor of this kind is the DEC Alpha, and for that reason Linux builds its generic memory model on top of it. More on this later.
Before continuing, three concepts need to be clarified.
`Program order`: on a given processor, the order of the memory-access instructions in the final compiled binary; compiler optimizations may already have reordered the statements of the source program.
`Execution order`: on a given processor, the order in which the memory-access instructions actually execute; it can differ from program order because of out-of-order execution.
`Perceived order`: the order in which a given processor observes the memory accesses actually executed by all the other processors; it can differ from execution order because of caching and of optimizations in inter-processor memory synchronization.
The memory-model descriptions above were, in effect, phrased in terms of these three orders.
The previous section sketched how, on multiprocessor architectures, the techniques used to raise parallelism and extract efficiency from the processors can yield execution results that differ from what the programmer expects. This section describes the situation in more detail: why sequential consistency is so hard to preserve.
總的來講,在系統程序員關注的操做系統層面,會重排程序指令執行順序的兩個主要的來源是處理器優化和編譯器優化。
The typical structure of a shared-memory multiprocessor looks like this:
As the figure shows, "shared memory" really means that each processor owns local memory but can also reach every non-local memory, albeit at visibly different speeds. In addition, each processor has its own local cache hierarchy, and all processors communicate over an interconnection network.
Memory operations under such an architecture clearly suffer long latencies. To mitigate them, processors take optimization measures, and those measures end up breaking program order.
Scenario 1: suppose a processor issues a read of a memory location that happens to reside in remote memory and also misses the local cache. Waiting for the value would force the processor to stall, a plain waste of capacity. In fact, modern processors all have out-of-order execution engines: an instruction is not executed immediately but parked in a waiting queue until its operands arrive, and in the meantime the processor works on other instructions. In other words, for the sake of efficiency the processor reorders instructions, and program order is violated.
Scenario 2: suppose there is a hot global variable. After the program has been running for a while, the local caches of many processors will very likely hold a copy of it. Now let processor A modify the variable; the modification is published as a message over the network telling all the other processors to update their cached copies. Because the paths differ, the message does not reach every processor at the same moment, so some processor may still observe the old value and act on it. Execution order and perceived order diverge, and this can produce startling program behavior.
編譯器的優化操做,如寄存器分配(register allocation), 循環不變裏代碼移動(loop-invariant code motion), 共同子表達式(commonsub-expression elimination), 等等,都有可能致使內存訪問操做被重排,甚至消除。所以,編譯器優化也會影響指令重排。
One further case deserves separate mention. Some devices map their control interfaces into a program's address space; access to such a device is called memory-mapped I/O (MMIO), and the device's address register, data register, and so on can then be read and written as conveniently as memory. The usual access pattern is to write the port to be accessed into the address register AR first, then fetch the data from the data register DR; in code:
```cpp
*AR = 1;
x = *DR;
```
The compiler may well invert this order, and the result is naturally incorrect program behavior. A common remedy is sketched below.
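A hedged sketch of the usual fix: qualifying the register pointers `volatile` forbids the compiler from reordering the two accesses with respect to each other. The addresses are invented for illustration, and on some architectures a hardware barrier is additionally required.

```cpp
#include <cstdint>

// Hypothetical MMIO addresses, for illustration only.
volatile std::uint32_t* const AR =
    reinterpret_cast<volatile std::uint32_t*>(0xFFFF8000u); // address register
volatile std::uint32_t* const DR =
    reinterpret_cast<volatile std::uint32_t*>(0xFFFF8004u); // data register

std::uint32_t read_port(std::uint32_t port) {
    *AR = port;   // volatile accesses are not reordered relative to each other
    return *DR;   // so this read cannot be hoisted above the address write
}
```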
To sum up, memory barriers are explicit interventions inserted where processors and devices interact, to suppress processor and compiler optimizations and so preserve the intended order of memory accesses.
The Linux kernel implements the following kinds of barriers:
Write memory barrier (`wmb()`). Definition: every write issued before the write barrier occurs earlier than every write issued after it.
Note 1: this ordering holds relative to the recipient of the operations, that is, memory. Inserting a write barrier on one processor does not guarantee that other processors observe that order; perceived order is independent of execution order.
Note 2: a write barrier does not guarantee that all writes before the barrier have completed by the time the barrier instruction finishes. In other words, a write barrier serializes the order in which the writes take place; it does not serialize the completion of their effects.
Read memory barrier (`rmb()`). Definition: every read issued before the read barrier occurs earlier than every read issued after it. It also subsumes the function of the data-dependency barrier described later.
Note 1: this ordering holds relative to the recipient of the operations, that is, memory. Inserting a read barrier on one processor does not guarantee that the order actually executed on other processors matches it; perceived order is independent of execution order.
Note 2: a read barrier does not guarantee that all reads before the barrier have completed by the time the barrier instruction finishes. In other words, a read barrier serializes the order in which the reads take place; it does not serialize the completion of their results.
Note that the reason these two barriers are presented together in one example is that a write barrier must be used paired with a read barrier.
For example:
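(The listing is reconstructed here after the barrier-pairing example in Documentation/memory-barriers.txt, using the variable names the following paragraphs refer to; a and b initially hold 0:)

```
CPU 1                   CPU 2
======================= =======================
a = 1;
<write barrier>
b = 2;                  x = b;
                        y = a;
```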
Now suppose CPU 2 observes x == 2. Is it guaranteed that the y it observes equals 1?
No! This is exactly what Note 1 above stressed. One possible reason is that CPU 2's cache holds a stale value of a: as Scenario 2 in the section on the need for barriers explained, because of the delay in propagating CPU 1's writes, the message announcing the new value of a may not yet have reached CPU 2.
The correct approach is to insert a read barrier on CPU 2. Only a paired read/write barrier guarantees correct program behavior.
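(Reconstructed in the notation of memory-barriers.txt, the corrected sequence is:)

```
CPU 1                   CPU 2
======================= =======================
a = 1;
<write barrier>
b = 2;                  x = b;
                        <read barrier>
                        y = a;
```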
General memory barrier (`mb()`). Definition: every write and read issued before the general barrier occurs earlier than every write and read issued after it.
Note 1: this ordering holds relative to the recipient of the operations, that is, memory. Inserting a general barrier on one processor does not guarantee that other processors observe that order; perceived order is independent of execution order.
Note 2: a general barrier does not guarantee that all writes and reads before the barrier have completed by the time the barrier instruction finishes. In other words, a general barrier serializes the order in which the writes and reads take place; it does not serialize the completion of their results.
Note 3: the general barrier is the strictest barrier, which also means it is the least efficient. It can substitute for a write barrier or a read barrier anywhere one appears.
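For comparison, the same read/write pairing can be expressed in user space with C++11 fences. This is a hedged sketch, not kernel code; the names mirror the a/b example above.

```cpp
#include <atomic>

int a = 0;              // ordinary payload variable
std::atomic<int> b(0);  // flag the two sides pair on

void cpu1_side() {
    a = 1;
    std::atomic_thread_fence(std::memory_order_release); // plays the write barrier
    b.store(2, std::memory_order_relaxed);
}

void cpu2_side() {
    while (b.load(std::memory_order_relaxed) != 2) {
        // spin until the flag is seen
    }
    std::atomic_thread_fence(std::memory_order_acquire); // plays the read barrier
    // at this point a is guaranteed to be observed as 1
}
```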
References:
http://preshing.com/20130823/the-synchronizes-with-relation/ — the synchronizes-with / happens-before relation (the post is reproduced below)
http://blog.jobbole.com/52164/ — double-checked locking
http://blog.csdn.net/roland_sun/article/details/47670099 — the ARM instructions strex/ldrex that implement the MESI protocol
http://blog.csdn.net/holandstone/article/details/8596871 — analysis of the atomicity of Java volatile
In an earlier post, I explained how atomic operations let you manipulate shared variables concurrently without any torn reads or torn writes. Quite often, though, a thread only modifies a shared variable when there are no concurrent readers or writers. In such cases, atomic operations are unnecessary. We just need a way to safely propagate modifications from one thread to another once they’re complete. That’s where the synchronizes-with relation comes in.
“Synchronizes-with” is a term invented by language designers to describe ways in which the memory effects of source-level operations – even non-atomic operations – are guaranteed to become visible to other threads. This is a desirable guarantee when writing lock-free code, since you can use it to avoid unwelcome surprises caused by memory reordering.
“Synchronizes-with” is a fairly modern computer science term. You’ll find it in the specifications of C++11, Java 5+ and LLVM, all of which were published within the last 10 years. Each specification defines this term, then uses it to make formal guarantees to the programmer. One thing they have in common is that whenever there’s a synchronizes-with relationship between two operations, typically on different threads, there’s a happens-before relationship between those operations as well.
Before digging deeper, I’ll let you in on a small insight: In every synchronizes-with relationship, you should be able to identify two key ingredients, which I like to call the guard variable and the payload. The payload is the set of data being propagated between threads, while the guard variable protects access to the payload. I’ll point out these ingredients as we go.
Now let’s look at a familiar example using C++11 atomics.
Suppose we have a `Message` structure which is produced by one thread and consumed by another. It has the following fields:
```cpp
struct Message
{
    clock_t     tick;
    const char* str;
    void*       param;
};
```
We’ll pass an instance of `Message` between threads by placing it in a shared global variable. This shared variable acts as the payload.
```cpp
Message g_payload;
```
Now, there’s no portable way to fill in `g_payload` using a single atomic operation. So we won’t try. Instead, we’ll define a separate atomic variable, `g_guard`, to indicate whether `g_payload` is ready. As you might guess, `g_guard` acts as our guard variable. The guard variable must be manipulated using atomic operations, since two threads will operate on it concurrently, and one of those threads performs a write.
```cpp
std::atomic<int> g_guard(0);
```
To pass `g_payload` safely between threads, we’ll use acquire and release semantics, a subject I’ve written about previously using an example very similar to this one. If you’ve already read that post, you’ll recognize the final line of the following function as a write-release operation on `g_guard`.
```cpp
void SendTestMessage(void* param)
{
    // Copy to shared memory using non-atomic stores.
    g_payload.tick  = clock();
    g_payload.str   = "TestMessage";
    g_payload.param = param;

    // Perform an atomic write-release to indicate that the message is ready.
    g_guard.store(1, std::memory_order_release);
}
```
While the first thread calls `SendTestMessage`, the second thread calls `TryReceiveMessage` intermittently, retrying until it sees a return value of `true`. You’ll recognize the first line of this function as a read-acquire operation on `g_guard`.
```cpp
bool TryReceiveMessage(Message& result)
{
    // Perform an atomic read-acquire to check whether the message is ready.
    int ready = g_guard.load(std::memory_order_acquire);

    if (ready != 0)
    {
        // Yes. Copy from shared memory using non-atomic loads.
        result.tick  = g_payload.tick;
        result.str   = g_payload.str;
        result.param = g_payload.param;

        return true;
    }

    // No.
    return false;
}
```
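For completeness, a hypothetical driver (not part of the original post) could wire the two functions together like this, assuming the declarations above are in scope:

```cpp
#include <thread>

int main()
{
    std::thread sender([] { SendTestMessage(nullptr); });

    Message msg;
    while (!TryReceiveMessage(msg)) {
        // retry until the read-acquire sees g_guard == 1
    }

    sender.join();
    return 0;
}
```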
If you’ve been following this blog for a while, you already know that this example works reliably (though it’s only capable of passing a single message). I’ve already explained how acquire and release semantics introduce memory barriers, and given a detailed example of acquire and release semantics in a working C++11 application.
The C++11 standard, on the other hand, doesn’t explain anything. That’s because a standard is meant to serve as a contract or an agreement, not as a tutorial. It simply makes the promise that this example will work, without going into any further detail. The promise is made in §29.3.2 of working draft N3337:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
It’s worth breaking this down. In our example:
- The atomic operation A is the write-release performed in `SendTestMessage`.
- The atomic object M is `g_guard`.
- The atomic operation B is the read-acquire performed in `TryReceiveMessage`.
As for the condition that the read-acquire must “take its value from any side effect” – let’s just say it’s sufficient for the read-acquire to read the value written by the write-release. If that happens, the synchronizes-with relationship is complete, and we’ve achieved the coveted happens-before relationship between threads. Some people like to call this a synchronizes-with or happens-before “edge”.
Most importantly, the standard guarantees (in §1.10.11-12) that whenever there’s a synchronizes-with edge, the happens-before relationship extends to neighboring operations, too. This includes all operations before the edge in Thread 1, and all operations after the edge in Thread 2. In the example above, it ensures that all the modifications to `g_payload` are visible by the time the other thread reads them.
Compiler vendors, if they wish to claim C++11 compliance, must adhere to this guarantee. At first, it might seem mysterious how they do it. But in fact, compilers fulfill this promise using the same old tricks which programmers technically had to use long before C++11 came along. For example, in this post, we saw how an ARMv7 compiler implements these operations using a pair of `dmb` instructions. A PowerPC compiler could implement them using `lwsync`, while an x86 compiler could simply use a compiler barrier, thanks to x86’s relatively strong hardware memory model.
Of course, acquire and release semantics are not unique to C++11. For example, in Java version 5 onward, every store to a `volatile` variable is a write-release, while every load from a `volatile` variable is a read-acquire. Therefore, any `volatile` variable in Java can act as a guard variable, and can be used to propagate a payload of any size between threads. Jeremy Manson explains this in his blog post on volatile variables in Java. He even uses a diagram very similar to the one shown above, calling it the “two cones” diagram.
In the previous example, we saw how the last line of `SendTestMessage` synchronized-with the first line of `TryReceiveMessage`. But don’t fall into the trap of thinking that synchronizes-with is a relationship between statements in your source code. It isn’t! It’s a relationship between operations which occur at runtime, based on those statements.
This distinction is important, and should really be obvious when you think about it. A single source code statement can execute any number of times in a running process. And if `TryReceiveMessage` is called too early – before Thread 1’s store to `g_guard` is visible – there will be no synchronizes-with relationship whatsoever.
It all depends on whether the read-acquire sees the value written by the write-release, or not. That’s what the C++11 standard means when it says that atomic operation B must “take its value” from atomic operation A.
Just as synchronizes-with is not the only way to achieve a happens-before relationship, a pair of write-release/read-acquire operations is not the only way to achieve synchronizes-with; nor are C++11 atomics the only way to achieve acquire and release semantics. I’ve organized a few other ways into the following chart. Keep in mind that this chart is by no means exhaustive.
The example in this post generates lock-free code (on virtually all modern compilers and processors), but C++11 and Java expose blocking operations which introduce synchronizes-with edges as well. For instance, unlocking a mutex always synchronizes-with a subsequent lock of that mutex. The language specifications are pretty clear about that one, and as programmers, we naturally expect it. You can consider the mutex itself to be the guard, and the protected variables as the payload. IBM even published an article on Java’s updated memory model in 2004 which contains a “two cones” diagram showing a pair of lock/unlock operations synchronizing-with each other, as the sketch below illustrates.
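A minimal sketch of the mutex case (names invented): the unlock at the end of `producer()` synchronizes-with the later lock in `consumer()`, so the payload written under the lock is visible once the lock is re-acquired.

```cpp
#include <mutex>

std::mutex g_mutex;   // the guard
int g_data = 0;       // the payload

void producer() {
    std::lock_guard<std::mutex> lock(g_mutex);
    g_data = 42;
}   // unlock here synchronizes-with ...

bool consumer(int& out) {
    std::lock_guard<std::mutex> lock(g_mutex);   // ... this lock
    if (g_data == 0)
        return false;
    out = g_data;
    return true;
}
```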
As I’ve shown previously, acquire and release semantics can also be implemented using standalone, explicit fence instructions. In other words, it’s possible for a release fence to synchronize-with an acquire fence, provided that the right conditions are met. In fact, explicit fence instructions are the only available option in Mintomic, my own portable API for lock-free programming. I think that acquire and release fences are woefully misunderstood on the web right now, so I’ll probably write a dedicated post about them next.
The bottom line is that the synchronizes-with relationship only exists where the language and API specifications say it exists. It’s their job to define the conditions of their own guarantees at the source code level. Therefore, when using low-level ordering constraints in C++11 atomics, you can’t just slap `std::memory_order_acquire` and `release` on some operations and hope things magically work out. You need to identify which atomic variable is the guard, what’s the payload, and in which codepaths a synchronizes-with relationship is ensured.
Interestingly, the Go programming language is a bit of a convention breaker. Go’s memory model is well specified, but the specification does not bother using the term “synchronizes-with” anywhere. It simply sticks with the term “happens-before”, which is just as good, since obviously, happens-before can fill the role anywhere that synchronizes-with would. Perhaps Go’s authors chose a reduced vocabulary because “synchronizes-with” is normally used to describe operations on different threads, and Go doesn’t expose the concept of threads.