JVM Memory Model: Reordering & Memory Barriers

Date: 2019-11-24

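My earlier post on the Java memory model only covered the visibility of a single piece of data, which is really just a small part of the model. The more important part is the memory barrier. Its coarser-grained cousin is the memory fence. A fence is blunter and more expensive, so this post starts with memory fences.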

reordering

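Any discussion of memory barriers has to start with reordering. One point to stress: reordering only applies to operations that have no dependency on each other within the current thread; operations with a dependency are never reordered.
Code passes through three stages: .java -----> .class, .class -----> assembly, assembly -----> execution of CPU instructions. Reordering can happen in any of these three stages.
The minimum guarantee Java gives for reordering is as-if-serial: within a single thread, the code always appears to run in program order. Seen from another thread, however, the order in which that code runs is much harder to pin down.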
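
The as-if-serial guarantee can be sketched in a few lines of Java. The class and method names below are hypothetical and not part of the original post. The writes to a and b do not depend on each other, so a compiler or CPU may swap them; the computation of c depends on both, so it cannot be moved ahead of either. Inside the one thread, every legal order gives the same answer.

```java
final class AsIfSerialSketch {
  // Hypothetical helper, for illustration only.
  static int compute() {
    int a = 1;      // (1) independent of (2): may be reordered with it
    int b = 2;      // (2) independent of (1)
    int c = a + b;  // (3) depends on (1) and (2): cannot run before them
    return c;
  }

  public static void main(String[] args) {
    System.out.println(compute());
  }
}
```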

To understand reordering, I first recommend this blog post: cpu-reordering-what-is-actually-being-reordered.

I originally planned to rewrite its content in Java and experiment with it. The code is as follows:

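import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Deliberately unsynchronized: data and is_ready are plain fields, so this is a data race.
public class UnderStandingReordering {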

  static int[] data = {9, 9, 9, 9, 9};
  static boolean is_ready = false;

  static void init_data() {
    for (int i = 0; i < 5; ++i) {
      data[i] = i;
    }
    is_ready = true;
  }

  static int sum_data() {
    if (!is_ready) {
      return -1;
    }
    int sum = 0;
    for (int i = 0; i < 5; ++i) {
      sum += data[i];
    }
    return sum;
  }

  public static void main(String[] args) throws Exception{
    ExecutorService executor1 = Executors.newSingleThreadExecutor();
    ExecutorService executor2 = Executors.newSingleThreadExecutor();

    executor1.submit(() -> {
      try {
        int sum = -1;
        while (sum < 0) {
          TimeUnit.MILLISECONDS.sleep(1);
          sum = sum_data();
        }
        System.out.println(sum);
      } catch (Exception ignored) {}
    });

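    TimeUnit.SECONDS.sleep(2);
    executor2.submit(UnderStandingReordering::init_data);

    // Let the submitted tasks finish, then allow the JVM to exit.
    executor1.shutdown();
    executor2.shutdown();
  }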
}

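Unfortunately, I could not reproduce these cases on my own machine, perhaps because Java's optimizations are already very good. I tried many times and never got the nondeterministic result I was hoping for.
So I can only act as a slightly embarrassed porter and carry the original over, but the principle still holds. The original code is as follows: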

int data[5] = { 9, 9, 9, 9, 9 };
bool is_ready = false;

void init_data() {
  for( int i=0; i < 5; ++i )
    data[i] = i;
  is_ready = true;
}

void sum_data() {
  if( !is_ready )
    return;
  int sum = 0;
  for( int i=0; i <5; ++i )
    sum += data[i];
  printf( "%d", sum );
}

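Threads A and B run init_data() and sum_data() respectively.
Thread B keeps calling sum_data() until it prints sum.
After thread B has been running for a while, we let thread A call init_data() once to initialize the array.
Reading the code literally, we would expect the execution order to be: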

store data[0] 0
store data[1] 1
store data[2] 2
store data[3] 3
store data[4] 4
store is_ready 1

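Naturally, is_ready is set to true only after every array element has been initialized, so the output should be 10.
However, when the CPU executes these instructions (the language here is C; in Java, the JIT compiler may additionally reorder them earlier, at compile time), it may, for efficiency, rearrange them into the following order.
This example is only a possibility: it can happen, but it is not guaranteed to. I do not know the low-level details well enough to explain why it would happen, so I cannot go deeper here; I only claim that it is possible. It seems to involve the memory bus, and I will leave this as a hole to fill in once I am able to.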

store data[3] 3
store data[4] 4
store is_ready 1
store data[0] 0
store data[1] 1
store data[2] 2

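So we can hit the following case: after is_ready becomes true, data[0], data[1], and data[2] still hold the initial value 9, and the array that gets read is 9, 9, 9, 3, 4.

Of course, this assumes the reads happen in order. Once the first fence has been discussed below, we will see that reads can be reordered too, so the two possibilities together make the program's result nondeterministic.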

memory fence

The first fence

We change init_data() to the following form:

lock_type lock;

void init_data() {
  synchronized( lock ) {
    for( int i=0; i < 5; ++i )
      data[i] = i;
  }
  is_ready = true;
}

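Acquiring and releasing a lock each insert a fence. The synchronized lock is released before we modify and store is_ready, so a memory fence is placed in the instruction stream at that point, and reordering is not allowed to move instructions across it. Conceptually, the instructions now look like this: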

store data[0] 0
store data[1] 1
store data[2] 2
store data[3] 3
store data[4] 4
fence
store is_ready 1

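So the situation described above can no longer occur. The following order is still permitted, though, because the fence only stops instructions from being moved across it; the stores before it may still be reordered among themselves.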

store data[3] 3
store data[4] 4
store data[0] 0
store data[1] 1
store data[2] 2
fence
store is_ready 1

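The second fence

We can now be sure that every element of data[] has been set to its value before is_ready is updated, and reordering cannot break that.
But as mentioned above, the load instructions can still be reordered, so the program's result is still nondeterministic.

Continuing from above: just as the instructions of init_data() can be reordered, so can those of sum_data(). Reading the code literally, we would expect this instruction order: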

load is_ready
load data[0]
load data[1]
load data[2]
load data[3]
load data[4]

In practice, though, the CPU may reorder the instructions for efficiency, like this:

load data[3]
load data[4]
load is_ready
load data[0]
load data[1]
load data[2]

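So even though init_data() uses the fence provided by synchronized to guarantee that is_ready is updated only after data[] has been assigned, the program's result is still unknown. It may still read an array like 0, 1, 2, 9, 9, which is still not the result we expect.

At this point, the load side needs a fence as well: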

void sum_data() {
  synchronized( lock ) {
    if( !is_ready )
      return;
  }
  int sum = 0;
  for( int i=0; i <5; ++i )
    sum += data[i];
  printf( "%d", sum );
}

This places a fence between the read of is_ready and the reads of data[], which effectively guarantees that the read of is_ready is not reordered with the reads of data[]:

load is_ready
fence
load data[0]
load data[1]
load data[2]
load data[3]
load data[4]

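Of course, the loads of data[0] through data[4] may still be reordered among themselves, but that no longer affects the final result.
With these two fences in place, the result is guaranteed to be correct, and thread B finally prints 10.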
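
A Java rendering of the two fences might look like the sketch below. The class name, the LOCK object, and the runOnce() helper are assumptions made for illustration. Unlike the pseudocode above, this version keeps the flag write and the data reads inside the synchronized blocks, which is the more conventional Java form and slightly stronger than the placement discussed in the text.

```java
final class TwoFencesSketch {
  private static final Object LOCK = new Object();
  private static final int[] data = {9, 9, 9, 9, 9};
  private static boolean isReady = false;

  static void initData() {
    synchronized (LOCK) {
      for (int i = 0; i < data.length; ++i) {
        data[i] = i;
      }
      isReady = true;
    } // releasing LOCK plays the role of the first fence
  }

  static int sumData() {
    synchronized (LOCK) { // acquiring LOCK plays the role of the second fence
      if (!isReady) {
        return -1;
      }
      int sum = 0;
      for (int value : data) {
        sum += value;
      }
      return sum;
    }
  }

  // Hypothetical driver: thread B polls sumData() while thread A initializes.
  static int runOnce() {
    int[] result = {-1};
    Thread reader = new Thread(() -> {
      int sum = sumData();
      while (sum < 0) {
        Thread.yield();
        sum = sumData();
      }
      result[0] = sum;
    });
    Thread writer = new Thread(TwoFencesSketch::initData);
    reader.start();
    writer.start();
    try {
      reader.join();
      writer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return result[0];
  }

  public static void main(String[] args) {
    System.out.println(runOnce());
  }
}
```

Because both sides synchronize on the same lock, the reader either sees isReady as false or sees every element of data already written.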

memory barrier in java

fence vs barrier

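Almost every processor supports at least one coarse-grained barrier instruction, usually called a fence, which guarantees that all loads and stores initiated before the fence are strictly ordered before any load or store initiated after it. On any processor this is nearly always among the most time-consuming instructions (about as expensive as atomic instructions, or even more so), which is why most processors also support finer-grained barrier instructions.
Fences and barriers are processor-level concepts, and processors differ in which instruction pairs they can reorder. Some barriers are erased by the time the code reaches a given processor, because that processor never performs that kind of reordering in the first place. The coarse fence, however, always remains: because of their write buffers, virtually all processors allow a store to be reordered with a later load.
In short, using more precise, fine-grained memory barriers helps the processor optimize instruction execution and improves performance.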
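
Since Java 9, the static fence methods on java.lang.invoke.VarHandle expose this distinction directly: fullFence() is the coarse fence, while releaseFence() and acquireFence() are finer-grained. The sketch below is an illustration under assumptions, not the original author's code. The class name is hypothetical, and it applies explicit fences to plain fields only to mirror the fence placement from the earlier sections; ordinary application code would normally use volatile or a lock instead.

```java
import java.lang.invoke.VarHandle;

final class FineGrainedFenceSketch {
  private static final int[] data = {9, 9, 9, 9, 9};
  private static boolean isReady = false;

  static void initData() {
    for (int i = 0; i < data.length; ++i) {
      data[i] = i;
    }
    // Loads and stores above are not reordered with stores below.
    VarHandle.releaseFence();
    isReady = true;
  }

  static int sumData() {
    boolean ready = isReady;
    // Loads above are not reordered with loads and stores below.
    VarHandle.acquireFence();
    if (!ready) {
      return -1;
    }
    int sum = 0;
    for (int value : data) {
      sum += value;
    }
    return sum;
  }

  public static void main(String[] args) {
    new Thread(FineGrainedFenceSketch::initData).start();
    int sum = sumData();
    while (sum < 0) {
      Thread.yield();
      sum = sumData();
    }
    System.out.println(sum);
  }
}
```

Neither fence here needs to order an earlier store against a later load, which is the expensive case that only the full fence covers.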

volatile、synchronized、CAS

Now that reordering and memory fences are clear, let us look at Java specifically.

In Java, besides synchronized, volatile can also be used to get the same kind of memory barrier effect.
Beyond their ordering effect, memory barriers also ensure that data is flushed to main memory immediately when a synchronized block exits and when a volatile variable is written. Whether the two are causally related was something I meant to describe once I had figured it out (my guess was that they are); later I came across an authoritative write-up, so I simply quote it here.
Doug Lea writes in The JSR-133 Cookbook for Compiler Writers:

Memory barrier instructions directly control only the interaction of a CPU with its cache, with its write-buffer that holds stores waiting to be flushed to memory, and/or its buffer of waiting loads or speculatively executed instructions. These effects may lead to further interaction among caches, main memory and other processors. But there is nothing in the JMM that mandates any particular form of communication across processors so long as stores eventually become globally performed; i.e., visible across all processors, and that loads retrieve them when they are visible.

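In terms of memory barriers, though, the semantics of volatile are somewhat weaker than those of synchronized: synchronized guarantees a memory barrier both when the lock is acquired and when it is released, and guarantees that data is reloaded from main memory or stored back to it.
With volatile, a volatile write is preceded by a StoreStore barrier and followed by a StoreLoad barrier, which ensures that the written value is flushed to main memory after the write. A volatile read is followed by a LoadLoad barrier and a LoadStore barrier.
CAS (compare-and-swap) is a primitive provided by the processor. In Java it is invoked through methods of the Unsafe class. In memory terms, it has the semantics of both a volatile read and a volatile write: it prevents the instruction from being reordered with the instructions before and after it, and it also flushes all data in the write buffer to memory.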
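
Applied to the running example, the volatile approach could look like the following sketch. The class name and the runOnce() driver are hypothetical. Only the flag is volatile: the volatile write comes after the array stores and the volatile read comes before the array loads, so a reader that sees the flag as true also sees the initialized array.

```java
final class VolatileFlagSketch {
  private static final int[] data = {9, 9, 9, 9, 9};
  private static volatile boolean isReady = false;

  static void initData() {
    for (int i = 0; i < data.length; ++i) {
      data[i] = i;
    }
    isReady = true; // volatile write: the stores above cannot move below it
  }

  static int sumData() {
    if (!isReady) { // volatile read: the loads below cannot move above it
      return -1;
    }
    int sum = 0;
    for (int value : data) {
      sum += value;
    }
    return sum;
  }

  // Hypothetical driver: one thread polls while another initializes.
  static int runOnce() {
    int[] result = {-1};
    Thread reader = new Thread(() -> {
      int sum = sumData();
      while (sum < 0) {
        Thread.yield();
        sum = sumData();
      }
      result[0] = sum;
    });
    Thread writer = new Thread(VolatileFlagSketch::initData);
    reader.start();
    writer.start();
    try {
      reader.join();
      writer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return result[0];
  }

  public static void main(String[] args) {
    System.out.println(runOnce());
  }
}
```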

concurrent package

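This section is excerpted from 深入理解Java內存模型 (In-Depth Understanding of the Java Memory Model) by Cheng Xiaoming. Because Java's CAS has the memory semantics of both a volatile read and a volatile write, Java threads now have the following four ways to communicate:

1. Thread A writes a volatile variable, then thread B reads that volatile variable.
2. Thread A writes a volatile variable, then thread B updates that volatile variable with CAS.
3. Thread A updates a volatile variable with CAS, then thread B updates that volatile variable with CAS.
4. Thread A updates a volatile variable with CAS, then thread B reads that volatile variable.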

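Java's CAS uses the efficient machine-level atomic instructions that modern processors provide. These instructions perform a read-modify-write on memory atomically, which is the key to synchronization on multiprocessors. (In essence, a machine that supports atomic read-modify-write instructions is the asynchronous equivalent of a sequential Turing machine, so every modern multiprocessor supports some atomic instruction that performs an atomic read-modify-write on memory.) At the same time, volatile reads/writes and CAS can implement communication between threads. Put together, these features form the foundation on which the whole concurrent package is built. A close look at the source code of the concurrent package reveals a common implementation pattern:

1. First, declare the shared variable volatile;
2. Then, use CAS atomic conditional updates to implement synchronization between threads;
3. Meanwhile, use volatile reads/writes, together with the volatile read and write memory semantics of CAS, to implement communication between threads.

AQS, the non-blocking data structures, and the atomic variable classes (the classes in the java.util.concurrent.atomic package), which are the foundational classes of the concurrent package, are all implemented with this pattern, and the higher-level classes in the concurrent package in turn depend on these foundational classes.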
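
The three-step pattern can be sketched with a small counter. This is an illustration under assumptions, not code taken from the concurrent package: the class name and the countWith() helper are hypothetical, and AtomicIntegerFieldUpdater stands in for the lower-level Unsafe calls that the JDK classes use internally.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

final class CasCounterSketch {
  // Step 1: the shared state is a volatile field.
  private volatile int value;

  private static final AtomicIntegerFieldUpdater<CasCounterSketch> VALUE =
      AtomicIntegerFieldUpdater.newUpdater(CasCounterSketch.class, "value");

  // Step 2: a CAS retry loop synchronizes competing threads without a lock.
  int increment() {
    for (;;) {
      int current = value; // volatile read
      int next = current + 1;
      if (VALUE.compareAndSet(this, current, next)) {
        return next;
      }
    }
  }

  // Step 3: a plain volatile read communicates the latest value.
  int get() {
    return value;
  }

  // Hypothetical driver: several threads increment the same counter.
  static int countWith(int threads, int perThread) {
    CasCounterSketch counter = new CasCounterSketch();
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; ++t) {
      workers[t] = new Thread(() -> {
        for (int i = 0; i < perThread; ++i) {
          counter.increment();
        }
      });
      workers[t].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return counter.get();
  }

  public static void main(String[] args) {
    System.out.println(countWith(4, 1000));
  }
}
```

No increment is lost, because a thread whose CAS fails rereads the field and retries instead of overwriting another thread's update.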

final

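First of all, a final field is immutable, so it must be initialized in the constructor at the latest; it can also be given its value directly at its declaration.
To make sure that nobody observes the value of a final field changing while the object is being created with new, a memory barrier is needed: the write to the final field and the assignment of the object's reference to a reference variable must not be reordered. Only then is the final field guaranteed to have been assigned before anyone obtains a reference to the new object.
When the final field is itself a reference, the rules have to be strengthened as follows:

A write to a final field inside a constructor, and the subsequent assignment of the reference to the constructed object to a reference variable, cannot be reordered.
The first read of a reference to an object containing a final field, and the subsequent first read of that final field, cannot be reordered.
A write inside a constructor to a member field of an object referenced by a final field, and the subsequent assignment, outside the constructor, of the reference to the constructed object to a reference variable, cannot be reordered.

To repair a flaw in the earlier memory model, the JSR-133 expert group strengthened the semantics of final. By adding write and read reordering rules for final fields, Java programmers get an initialization safety guarantee: as long as the object is correctly constructed (the reference to the object under construction does not "escape" in the constructor), any thread is guaranteed to see the value the final field was given in the constructor, without any synchronization (meaning the use of lock and volatile).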
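
The initialization safety guarantee can be illustrated with the sketch below, in which every name is hypothetical. The object is published through a plain, non-volatile static field, so a reader may or may not see the reference yet. If it does see it, the final field rules mean it also sees the fully filled array.

```java
final class FinalFieldSketch {
  // Plain static field: publication through it is deliberately unsynchronized.
  static FinalFieldSketch shared;

  private final int[] values; // final field holding a reference

  FinalFieldSketch() {
    // The array elements are written inside the constructor, and `this`
    // is not stored anywhere, so the reference does not escape.
    values = new int[] {0, 1, 2, 3, 4};
  }

  int sum() {
    int sum = 0;
    for (int value : values) {
      sum += value;
    }
    return sum;
  }

  static void publish() {
    shared = new FinalFieldSketch();
  }

  // Returns -1 if no object is visible yet, otherwise the sum of its values.
  static int readIfPublished() {
    FinalFieldSketch local = shared; // read the reference exactly once
    return local == null ? -1 : local.sum();
  }

  public static void main(String[] args) {
    new Thread(FinalFieldSketch::publish).start();
    System.out.println(readIfPublished());
  }
}
```

Depending on timing, main prints either -1 or 10, but never the sum of a partially initialized array. If values were not final, that guarantee would no longer hold for this unsynchronized publication.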

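happens-before?

Finally, did I leave something out? Yes: the mysterious-sounding happens-before. I do not want to go into it in detail, though, because using happens-before only to talk about the Java memory model sells it far too short. I am still reading the paper myself, so this is another hole left for later.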

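happens-before first appeared in Leslie Lamport's paper Time, Clocks, and the Ordering of Events in a Distributed System. The paper was published in Communications of the ACM in July 1978; it won the first PODC Influential Paper Award in 2000 and the ACM SIGOPS Hall of Fame Award in 2007. Its contribution has been described as follows: the paper contains two important ideas, each of which became a major topic that dominated research in distributed computing for a decade or more.

The first is a precise definition of the order in which events occur in a distributed system (also called the clock condition), together with a framework for defining and determining the timing of events in a distributed system. The simplest way to implement the clock condition is the "logical clocks" that Lamport proposes in this paper, a concept that has had a profound influence on the field and is the reason the paper is cited so often. It also opened up research on vector and matrix clocks, on consistent cuts (which solve the problem of how to define the state of a distributed system), on stable and nonstable predicate detection, and on the semantic foundations of epistemic logic (for example, for describing the knowledge, common knowledge, and theorems of distributed protocols). Last and most important, it pointed out very early on that distributed systems are fundamentally different from other systems, and it was the first paper to give a mathematical foundation (the "happened before" relation) for describing those differences.
The second is the state machine approach, a generalization of n-modular redundancy, whose extraordinary influence on both the theory and the practice of distributed computing has been proven. The paper also gives a distributed mutual exclusion protocol that guarantees access to the critical section is granted in the order in which requests are made. More importantly, it explains how that protocol can be used as a general method for managing replication. The approach also gave rise to the following problems:
a) Byzantine agreement: protocols that ensure all state machines receive the same input even in the presence of faults. A great deal of work stems from this problem, including fast protocols, impossibility results, failure model hierarchies, and more.
b) Byzantine clock synchronization and ordered multicast protocols. These protocols order concurrent requests and guarantee that everyone obtains the same ordering; combined with an agreement protocol, they ensure that all state machines have the same state.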

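Of course, to learn about happens-before in Java, see the excerpts in the next three sections; Cheng Xiaoming's book and Oracle's documentation both cover it.

Definition of the happens-before principle

If one operation happens-before another, the result of the first operation is visible to the second, and the first operation is ordered before the second.
A happens-before relationship between two operations does not mean they must actually execute in the order the happens-before principle specifies. If the result after reordering is the same as the result of executing according to the happens-before relationship, the reordering is not illegal.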

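Rules of the happens-before principle:

Program order rule: within a single thread, following the order of the code, an operation written earlier happens-before an operation written later;
Monitor lock rule: an unlock operation happens-before every later lock operation on the same lock;
volatile variable rule: a write to a volatile variable happens-before every later read of that variable;
Transitivity rule: if operation A happens-before operation B, and operation B happens-before operation C, then operation A happens-before operation C;
Thread start rule: the start() method of a Thread object happens-before every action of that thread;
Thread interruption rule: a call to a thread's interrupt() method happens-before the interrupted thread's code detects the interrupt;
Thread termination rule: every operation in a thread happens-before the detection of that thread's termination, which can be detected through the return of Thread.join() or the return value of Thread.isAlive();
Finalizer rule: the completion of an object's initialization happens-before the start of its finalize() method;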
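
The thread start and thread termination rules, combined with the program order and transitivity rules, are enough to pass data through plain fields. The sketch below uses hypothetical names and deliberately avoids volatile and locks.

```java
final class StartJoinSketch {
  private static int input;  // plain fields: no volatile, no locks
  private static int output;

  static int run() {
    input = 20; // program order: written before start()
    // Thread start rule: the worker is guaranteed to see input == 20.
    Thread worker = new Thread(() -> output = input + 1);
    worker.start();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    // Thread termination rule: after join() returns, the worker's write is visible.
    return output;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Without the join() call, the final read of output would race with the worker's write and could still observe 0.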

Memory Consistency Properties

Chapter 17 of the Java Language Specification defines the happens-before relation on memory operations such as reads and writes of shared variables. The results of a write by one thread are guaranteed to be visible to a read by another thread only if the write operation happens-before the read operation. The synchronized and volatile constructs, as well as the Thread.start() and Thread.join() methods, can form happens-before relationships. In particular:

Each action in a thread happens-before every action in that thread that comes later in the program's order.
An unlock (synchronized block or method exit) of a monitor happens-before every subsequent lock (synchronized block or method entry) of that same monitor. And because the happens-before relation is transitive, all actions of a thread prior to unlocking happen-before all actions subsequent to any thread locking that monitor.
A write to a volatile field happens-before every subsequent read of that same field. Writes and reads of volatile fields have similar memory consistency effects as entering and exiting monitors, but do not entail mutual exclusion locking.
A call to start on a thread happens-before any action in the started thread.
All actions in a thread happen-before any other thread successfully returns from a join on that thread.

The methods of all classes in java.util.concurrent and its subpackages extend these guarantees to higher-level synchronization. In particular:

Actions in a thread prior to placing an object into any concurrent collection happen-before actions subsequent to the access or removal of that element from the collection in another thread.
Actions in a thread prior to the submission of a Runnable to an Executor happen-before its execution begins. Similarly for Callables submitted to an ExecutorService.
Actions taken by the asynchronous computation represented by a Future happen-before actions subsequent to the retrieval of the result via Future.get() in another thread.
Actions prior to "releasing" synchronizer methods such as Lock.unlock, Semaphore.release, and CountDownLatch.countDown happen-before actions subsequent to a successful "acquiring" method such as Lock.lock, Semaphore.acquire, Condition.await, and CountDownLatch.await on the same synchronizer object in another thread.
For each pair of threads that successfully exchange objects via an Exchanger, actions prior to the exchange() in each thread happen-before those subsequent to the corresponding exchange() in another thread.
Actions prior to calling CyclicBarrier.await and Phaser.awaitAdvance (as well as its variants) happen-before actions performed by the barrier action, and actions performed by the barrier action happen-before actions subsequent to a successful return from the corresponding await in other threads.
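
Two of the java.util.concurrent guarantees listed above, the one for submitting a task to an Executor and the one for Future.get(), can be seen together in a short sketch. The class name and the box array are hypothetical; the array is a plain, unsynchronized holder.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FutureVisibilitySketch {
  static int run() {
    int[] box = new int[1]; // plain array shared with the task
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      box[0] = 5; // happens-before the task starts, because it precedes submit()
      Future<?> done = executor.submit(() -> {
        box[0] *= 2; // the task sees 5 and writes 10
      });
      done.get(); // the task's write happens-before the code after get()
      return box[0];
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```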