A Brief Look at TensorFlow Runtime

Date: 2020-12-26

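Questions about TF Runtime

What is TFRT?

TensorFlow Runtime (TFRT) provides a unified, extensible infrastructure layer. It is designed to make full use of multithreaded host CPUs and supports fully asynchronous programming (lock-free queues plus asynchronous semantics). TFRT aims to reduce the time needed to develop, validate, and deploy enterprise-grade models.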

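What is the input to TFRT?

The input is a TensorFlow GraphDef. TFRT invokes an MLIR-based graph compiler, which optimizes the graph and lowers it to BEF (Binary Executable Format), the binary format used to execute a TFRT graph.

In the native TF framework, the execution flow is: Python Layers → GraphDef (DAG) → execute OpNodes (in parallel on a ThreadPool)
With the runtime, the flow becomes: Python Layers → GraphDef (DAG) → Compile IR → Binary (BEF) → execute (BEFExecutor)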

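Basic concepts:

Host Program in MLIR is a low-level intermediate representation of the graph.
BEF is the executable format for BEFExecutor, which reads a BEF file and asynchronously executes the functions in it.
tfrt_translate converts between the two, much like an assembler.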

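What is the IR here?

It can be understood as code that expresses topological relationships, or even as a graph itself. Walking the graph in topological order makes it easy to emit the corresponding IR, which is also why BEF supports conversion between IR and graph in both directions. For example: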

%1 = hex.constant.i32 1
%2 = hex.constant.i32 2
%3 = hex.add.i32 %1, %2
hex.print.i32 %3
# In effect, this can be represented as a DAG
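
The DAG reading of the IR above can be sketched in C++. This is an illustration only, not TFRT code: `Node`, `RunDag`, and `SampleProgram` are hypothetical names, and the string op names merely stand in for the `hex.*` kernels. Nodes run in topological order (Kahn's algorithm), which is the same ordering constraint that the IR's value names express.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical node type for illustration; not a TFRT data structure.
struct Node {
  std::string op;             // "constant", "add", or "print"
  std::int32_t attr;          // literal value, used by "constant" only
  std::vector<int> operands;  // indices of producer nodes (the DAG edges)
};

// Evaluates the DAG in topological order (Kahn's algorithm) and returns
// every value that a "print" node would output, in execution order.
std::vector<std::int32_t> RunDag(const std::vector<Node>& nodes) {
  const int n = static_cast<int>(nodes.size());
  std::vector<int> pending(n);             // operands not yet computed
  std::vector<std::vector<int>> users(n);  // producer -> consumers
  for (int i = 0; i < n; ++i) {
    pending[i] = static_cast<int>(nodes[i].operands.size());
    for (int producer : nodes[i].operands) users[producer].push_back(i);
  }

  std::queue<int> ready;
  for (int i = 0; i < n; ++i) {
    if (pending[i] == 0) ready.push(i);
  }

  std::vector<std::int32_t> value(n, 0);
  std::vector<std::int32_t> printed;
  while (!ready.empty()) {
    const int i = ready.front();
    ready.pop();
    const Node& node = nodes[i];
    if (node.op == "constant") {
      value[i] = node.attr;
    } else if (node.op == "add") {
      value[i] = value[node.operands[0]] + value[node.operands[1]];
    } else if (node.op == "print") {
      printed.push_back(value[node.operands[0]]);
    }
    // A consumer becomes ready once its last operand has been computed.
    for (int user : users[i]) {
      if (--pending[user] == 0) ready.push(user);
    }
  }
  return printed;
}

// The four-line IR above, written as a DAG.
std::vector<Node> SampleProgram() {
  return {{"constant", 1, {}},
          {"constant", 2, {}},
          {"add", 0, {0, 1}},
          {"print", 0, {2}}};
}
```

Calling `RunDag(SampleProgram())` evaluates the two constants first, then the add, then the print, regardless of the order in which the nodes are listed.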

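How does it differ from XLA?

XLA does not really leave the graph-execution framework. It uses graph clustering to convert some subgraphs to HLO and run them through JIT, wraps each such subgraph in an XlaRunOp, and then executes that op together with the rest of the graph's nodes. In effect it just replaces several nodes with one larger, faster node (somewhat like fusion).

The official documentation describes BEF as the actual carrier of the kernel graph. It is still a graph: the entity that BEFExecutor ultimately executes remains a graph, just not a GraphDef in the native TF sense.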

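What is the basic execution unit in TFRT, and what is the execution flow?

TFRT kernels come in two kinds:

Synchronous kernel

Runs entirely on the calling thread and involves no computation on other threads. The AsyncValues it produces are always available.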

int32_t TFRTAddI32(Argument<int32_t> arg0, Argument<int32_t> arg1) {
  // The thread that calls TFRTAddI32 performs this addition, and produces
  // an available AsyncValue.
  return *arg0 + *arg1;
}

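Asynchronous kernel

Consists of two parts: (1) synchronous work on the calling thread and (2) asynchronous work on other threads. The AsyncValues it produces are generally still unavailable when the kernel returns (though not always).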

void TFRTAddI32Async(Argument<int32_t> arg0, Argument<int32_t> arg1,
                    Result<int32_t> output, HostContext* host) {
  // Synchronously allocate an unavailable AsyncValue for ‘output’.
  auto result = output.Allocate();

  // Asynchronously make ‘output’ available.
  host->EnqueueWork([arg0 = *arg0, arg1 = *arg1,
                     result_ref = FormRef(result)] {
    // A ConcurrentWorkQueue thread performs this addition.
    result_ref->emplace(arg0 + arg1);
  });

  // Synchronously returns unavailable ‘output’.
}

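Execution flow:

Create an AsyncKernelFrame that holds the input arguments and the output results
Pass the frame to the kernel for execution
Track all AsyncValues through registers

An eager (op-by-op) API is also provided: CoreRuntime and CoreRuntimeOp

CoreRuntime:
- Executes OpHandlers; implemented through the internal class Impl
- Can call MakeOp(op_name, op_handler) to create a CoreRuntimeOp that runs directly
CoreRuntimeOp:
- Holds a callable fn_ of type llvm::unique_function<void(const OpInvocation&)>
- Acts as a functor that invokes fn_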
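How are hardware devices integrated?

Through the DeviceRuntime: BEF only needs ops for the lowest-level driver APIs, which avoids, as far as possible, having every backend reimplement each TF op.

[Figure: the ops used map directly onto CUDA APIs]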

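Design of the Host Runtime

Where does the Host Runtime sit?

The host is the machine that performs the computation; it may or may not have hardware acceleration. A host can be a server with multiple GPUs, or a mobile device with a DSP and an IPU.

In the native TF framework, TF Core executes op by op following the data flow, and the design carries many traces of sequential, synchronous execution. The Host Runtime instead reorganizes the computation and then drives the Device Runtime (e.g. GPU, TPU) to accelerate it, so that a kernel can run asynchronously on its own thread and take full advantage of multithreaded parallelism.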

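Why do this?

Efficient eager execution of ops
- TF graph execution is already well optimized, since it all runs in C++. In eager mode, however, there is still unnecessary overhead between Python and the runtime.
A unified multithreading mechanism across the graph level and the op level
Asynchrony as a first-class citizen in the runtime
- a non-strict kernel/function may execute before all its inputs are ready.
Easier cross-kernel optimization
- TF op kernel implementations encapsulate logic such as Tensor memory allocation, which limits cross-kernel optimizations like buffer reuse. TFRT kernels decouple shape computation from tensor memory allocation.
A modular, pluggable mechanism for supporting new hardware
- The goal is to remove the old pain point of having to hack the whole codebase to bring up new hardware, and to establish a modular mechanism where the hardware team can simply be handed complete integration documentation, turning a reactive process into a proactive one.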

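How is it designed to achieve these goals?

First, some background: Core Runtime, Graph Lowering, and Eager Execution

Core Runtime

Eagerly executes a single op or a whole graph function, including GraphDef and HLO. An op graph is usually device independent.

Graph Lowering

Compiler passes convert an op graph into a kernel graph, a lower-level representation of the dataflow computation designed for fast execution. It is therefore not suited to compiler analysis, but it can be expressed in a low-level dialect (e.g. in MLIR). A kernel graph targets a specific device (it is platform bound).

Eager Execution

The Host Runtime supports eager execution, which does not necessarily involve constructing a Graph/BEF or using BEFExecutor. TF designed two paths:
- Generic path: treats an op as a graph function. This handles composite ops well and reuses all of the graph-function code.
- Fast path: uses hand-written C++ or precompiled graph snippets to select and invoke the op kernel (custom optimization? isn't the cost high?)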

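What does "kernel" mean in a kernel graph?

TFRT also has the notion of a kernel, whose inputs and outputs are all AsyncValues, putting "asynchrony as a first-class citizen" into practice. An AsyncValue is similar to a combination of future and promise from the C++ standard library. All data in the graph is replaced by AsyncValues.

Execution flow:

Create an AsyncKernelFrame that holds the input arguments and the output results
Pass the frame to the kernel for execution
Track all AsyncValues through registers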

// Kernel that adds two integers.
// AsyncKernelFrame holds the kernel’s arguments and results.
static void TFRTAdd(AsyncKernelFrame* frame) {
  // Fetch the kernel’s 0th argument.
  AsyncValue* arg1 = frame->GetArgAt(0);
  // Fetch the kernel’s 1st argument.
  AsyncValue* arg2 = frame->GetArgAt(1);

  int v1 = arg1->get<int>();
  int v2 = arg2->get<int>();

  // Set the kernel’s 0th result.
  frame->EmplaceResultAt<int>(0, v1 + v2);
}

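TODO: how memory allocation plugs into kernels

Kernels come in two kinds:

Synchronous kernel

Runs entirely on the calling thread and involves no computation on any other thread. The AsyncValues it produces are always available.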

int32_t TFRTAddI32(Argument<int32_t> arg0, Argument<int32_t> arg1) {
  // The thread that calls TFRTAddI32 performs this addition, and produces
  // an available AsyncValue.
  return *arg0 + *arg1;
}

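Asynchronous kernel

Consists of two parts: (1) synchronous work on the calling thread and (2) asynchronous work on other threads. The `AsyncValue`s it produces are generally still unavailable when the kernel returns (though not always).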

void TFRTAddI32Async(Argument<int32_t> arg0, Argument<int32_t> arg1,
                    Result<int32_t> output, HostContext* host) {
  // Synchronously allocate an unavailable AsyncValue for ‘output’.
  auto result = output.Allocate();

  // Asynchronously make ‘output’ available.
  host->EnqueueWork([arg0 = *arg0, arg1 = *arg1,
                     result_ref = FormRef(result)] {
    // A ConcurrentWorkQueue thread performs this addition.
    result_ref->emplace(arg0 + arg1);
  });

  // Synchronously returns unavailable ‘output’.
}

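Kernels have two execution modes:

Strict mode:
- When such a kernel is invoked, all of its AsyncValue arguments are already available.
Non-strict mode:
- The kernel may run as soon as at least one input argument is available. A ternary operation is an example: it only forwards one of its inputs.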
```
result = ternary(condition, true_result, false_result) // only condition needs to be available
```
- Kernels of this kind are harder to implement.
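
The ternary case can be sketched in a few lines of C++. This is a toy model under stated assumptions: `Slot` is a hypothetical stand-in for an AsyncValue holding an int, and `TernaryNonStrict` is not a TFRT kernel. The point it shows is that the kernel reads only the condition and forwards the selected input, which may itself still be unavailable.

```cpp
#include <cassert>

// Hypothetical stand-in for an AsyncValue that holds an int.
struct Slot {
  bool available;
  int value;
};

// Non-strict: requires only `condition` to be available. It never reads the
// branch values; it forwards whichever input was selected, and that input
// may become available later.
const Slot* TernaryNonStrict(const Slot& condition, const Slot& true_result,
                             const Slot& false_result) {
  assert(condition.available && "condition must be available");
  return condition.value != 0 ? &true_result : &false_result;
}
```

Because the result is only a forwarding pointer, it becomes usable at the moment the selected branch is filled in, without the kernel running again.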

What is `AsyncValue` for?

As noted above, kernel inputs and outputs are all AsyncValues, and all data in the graph is replaced by AsyncValues as well.

// A subset of interface functions in AsyncValue.
class AsyncValue {
 public:
  // Is the data available?
  bool IsAvailable() const;

  // Get the payload data as type T.
  // Assumes the data is already available, so get() never blocks.
  template <typename T> const T& get() const;

  // Store the payload data in-place.
  template <typename T, typename... Args>
  void emplace(Args&&... args);

  // Add a waiter callback that will run when the value becomes available.
  void AndThen(std::function<void()>&& waiter);
  // ...
};
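
To make the interface above concrete, here is a single-threaded miniature. It is a sketch under stated assumptions, not the TFRT implementation: `MiniAsyncValue` is a hypothetical name, it is templated on the payload type instead of being type-erased, it requires a default-constructible `T`, and it has no reference counting or thread safety.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded miniature of the interface above.
template <typename T>
class MiniAsyncValue {
 public:
  bool IsAvailable() const { return available_; }

  // Never blocks: the caller must ensure the value is already available.
  const T& get() const {
    assert(available_ && "get() called before the value was available");
    return value_;
  }

  // Stores the payload, then runs every waiter registered so far.
  template <typename... Args>
  void emplace(Args&&... args) {
    assert(!available_ && "a value becomes available only once");
    value_ = T(std::forward<Args>(args)...);
    available_ = true;
    std::vector<std::function<void()>> waiters = std::move(waiters_);
    waiters_.clear();
    for (auto& waiter : waiters) waiter();
  }

  // Runs `waiter` immediately if the value is available, otherwise defers
  // it until emplace() is called.
  void AndThen(std::function<void()> waiter) {
    if (available_) {
      waiter();
    } else {
      waiters_.push_back(std::move(waiter));
    }
  }

 private:
  bool available_ = false;
  T value_{};
  std::vector<std::function<void()>> waiters_;
};
```

A consumer registers its continuation with `AndThen` and returns at once; the continuation fires when the producer calls `emplace`, so no thread ever waits inside `get()`.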

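AsyncValue has three derived classes:

ConcreteAsyncValue<T>: represents and stores concrete data
ErrorAsyncValue: handles error propagation and cancellation. BEFExecutor checks the value returned by each kernel; if a result is of this type, it skips all downstream ops that depend on it
IndirectAsyncValue: sometimes the data type of a result is not yet known, but to stay non-blocking an IndirectAsyncValue is created first so that non-strict kernels can still run. It holds no data itself, only a pointer to another AsyncValue.

Lifetime is managed through reference counting:

A kernel first creates AsyncValues for its results (once the data type is known)
Ownership of an AsyncValue passes from the kernel to BEFExecutor
BEFExecutor passes the AsyncValue to every downstream op that uses it, incrementing the reference count
Each downstream op kernel decrements the AsyncValue's reference count when it finishes computing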
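
The reference-count lifecycle described above can be traced with `std::shared_ptr`. This is an analogy only: `MiniExecutor` and `ValueRef` are hypothetical names, and `std::shared_ptr` is used here purely because its `use_count()` makes the counts observable, not because TFRT manages AsyncValue this way.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Analogy only: a shared_ptr<int> plays the role of a counted AsyncValue.
using ValueRef = std::shared_ptr<int>;

// Hypothetical executor holding one register; it owns one reference.
struct MiniExecutor {
  ValueRef reg;  // reference handed over by the producing kernel

  // Hands one additional reference to each downstream kernel.
  std::vector<ValueRef> FanOut(std::size_t num_users) const {
    return std::vector<ValueRef>(num_users, reg);
  }
};
```

With two downstream users the count goes 1 (executor only), 3 (after fan-out), then back down by one each time a user finishes and drops its reference.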

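What exactly does the `Register` that manages an `AsyncValue` do?

A register is really a pointer to an AsyncValue, and only the pointer is manipulated, so no data is moved or copied.

An example: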

available_value = upstream()
downstream(available_value, unavailable_value)

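downstream runs only once both arguments are ready. When unavailable_value also becomes available, the executor loads the data from the registers and passes it to downstream for execution.

A register has three states:

Empty: the initial state; it points to no AsyncValue
Unavailable: used only by asynchronous kernels; synchronous kernels never produce this state
Available: the final state; the transition is irreversible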
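
The three states and their one-way transitions can be written as a small state machine. This is a hypothetical model for illustration (`MiniRegister` and `RegisterState` are not TFRT types): a synchronous kernel moves a register straight from Empty to Available, while an asynchronous kernel passes through Unavailable first.

```cpp
#include <cassert>

// Hypothetical model of the three register states.
enum class RegisterState { kEmpty, kUnavailable, kAvailable };

struct MiniRegister {
  RegisterState state = RegisterState::kEmpty;
  const int* value = nullptr;  // a register holds only a pointer

  // An asynchronous kernel publishes a placeholder before the data exists.
  void SetUnavailable() {
    assert(state == RegisterState::kEmpty);
    state = RegisterState::kUnavailable;
  }

  // Reached directly from kEmpty (synchronous kernel) or from kUnavailable
  // (asynchronous kernel). There is no way back out of kAvailable.
  void SetAvailable(const int* v) {
    assert(state != RegisterState::kAvailable && "Available is final");
    value = v;
    state = RegisterState::kAvailable;
  }
};
```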

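How does the runtime achieve asynchronous speedup?

In TFRT, the thread that executes a kernel and the thread that schedules other ready kernels may be the same thread. TFRT puts background kernel-scheduling tasks into a ConcurrentWorkQueue for asynchronous execution.

But backward ops need gradients before they can run. How are backward ops and blocking IO handled?

TFRT uses two separate thread pools:

(1) Dedicated thread pool: for long-running non-blocking tasks

A fixed number of threads, one per hardware thread, to avoid the overhead of contention for thread resources.

(2) Separate thread pool: for blocking tasks (such as IO)

Allocates more threads to handle IO tasks
To avoid deadlock, blocking tasks may run only in the blocking thread pool
Kernel implementations must not contain blocking operations directly (for example?), let alone place part of a blocking operation on the non-blocking queue.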

Graph Execution

For graph execution, the host program converts the graph into a kernel graph expressed in MLIR. Compiler passes are applied here to turn the device-independent graph into a kernel graph for a specific hardware platform.

func @sample_function() -> i32 {
  %one = tfrt.constant.i32 1       // Make AsyncValue with value 1
  %two = tfrt.constant.i32 2       // Make AsyncValue with value 2
  %three = tfrt.add.i32 %one, %two // Make AsyncValue with value 3 (1+2)

  tfrt.print.i32 %three            // Print AsyncValue %three
  tfrt.return %three : i32         // Return AsyncValue %three
}

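The runtime does not execute the IR directly; mlir_to_bef converts it to BEF first. The state of every AsyncValue is tracked and recorded through registers.

How is the control-dependency problem solved?

Native TF uses tf.control_dependencies to add a dependency between two kernels that must run in order. TFRT uses a Chain. A chain is also an AsyncValue and can be a kernel argument or a result. Because a consumer of a chain must run after its producer, ordering is enforced.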

func @control_dep1() {
  %a = dht.create_uninit_tensor.i32.2 [2 : i32, 2 : i32]
  %chain1 = dht.fill_tensor.i32 %a, 41
  %chain2 = dht.print_tensor.i32 %a, %chain1
 }
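
The same ordering trick can be mimicked in plain C++ by threading an empty token through the calls. This is a sketch under stated assumptions: `Chain`, `FillTensor`, and `PrintTensor` here are hypothetical C++ stand-ins for the `dht.*` kernels above, and a `std::vector<int>` stands in for the tensor.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical empty token: it carries no data, only ordering.
struct Chain {};

// Side-effecting kernel: returns a Chain that marks "fill is done".
Chain FillTensor(std::vector<int>& tensor, int value, std::size_t size) {
  tensor.assign(size, value);
  return Chain{};
}

// Consumes the Chain produced by FillTensor, so it cannot be called until
// that token exists, i.e. after the fill. Appends the elements to `out`.
Chain PrintTensor(const std::vector<int>& tensor, Chain /*after_fill*/,
                  std::string* out) {
  for (std::size_t i = 0; i < tensor.size(); ++i) {
    if (i != 0) *out += ' ';
    *out += std::to_string(tensor[i]);
  }
  return Chain{};
}
```

Since `PrintTensor` needs a `Chain` argument and the only way to obtain one in this sketch is from `FillTensor`, the dependency is visible in the data flow itself rather than in a separate control edge.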

How is control flow such as if handled?

TFRT allows a kernel to invoke BEFExecutor (somewhat similar to Paddle's current approach to control flow)

void TFRTIf(AsyncKernelFrame* frame) {
  const auto* true_fn = &frame->GetConstantAt<Function>(0);
  const auto* false_fn = &frame->GetConstantAt<Function>(1);

  // First arg is the condition.
  ArrayRef<AsyncValue*> args = frame->GetArguments();
  AsyncValue* condition = args[0];

  // Execute true_fn or false_fn depending on ‘condition’.
  auto* fn = condition->get<bool>() ? true_fn : false_fn;
  fn->Execute(args.drop_front(),
              frame->GetResults(),
              frame->GetHostContext());
}

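How does it relate to, and differ from, the underlying session?

Apparently there is little relation. (To be looked into further.)

What information does a BEF file contain?

BEF is the bridge between the runtime and the compiler. It also decouples the compiler from the runtime, so compiler optimizations can be applied independently. It can be saved to disk and reloaded for execution (mmap bytes). It feels much like a binary file format, since it too is organized into sections.

BEF contains some device-related information: which device (CPU/GPU/TPU) each kernel runs on, and which special kernels will be invoked.

MLIR and BEF can be converted into each other.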

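What does BEFExecutor do? Does it bring any particular performance benefit?

It is an executor rather than an interpreter, because it has no notion of a program counter.

Sources of the performance benefit:

It is lock-free
Non-blocking execution:
- It keeps going whether or not a value is available. For an unavailable value, the executor defers the dependent work via AsyncValue::AndThen
- Because every AsyncValue is tracked by a register, once a value is ready it notifies and wakes all dependent kernels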